How to avoid users entering unwanted texts in the input form in the website? [duplicate] - javascript

How do I prevent XSS (cross-site scripting) using just HTML and PHP?
I've seen numerous other posts on this topic but I have not found an article that clear and concisely states how to actually prevent XSS.

Basically you need to use the function htmlspecialchars() whenever you want to output something to the browser that came from the user input.
The correct way to use this function is something like this:
echo htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
Google Code University also has these very educational videos on Web Security:
How To Break Web Software - A look at security vulnerabilities in
web software
What Every Engineer Needs to Know About Security
and Where to Learn It

One of the most important steps is to sanitize any user input before it is processed and/or rendered back to the browser. PHP has some "filter" functions that can be used.
The form that XSS attacks usually have is to insert a link to some off-site javascript that contains malicious intent for the user. Read more about it here.
You'll also want to test your site - I can recommend the Firefox add-on [XSS Me]. Looks like Easy XSS is now the way to go.

Cross-posting this as a consolidated reference from the SO Documentation beta which is going offline.
Problem
Cross-site scripting is the unintended execution of remote code by a web client. Any web application might expose itself to XSS if it takes input from a user and outputs it directly on a web page. If input includes HTML or JavaScript, remote code can be executed when this content is rendered by the web client.
For example, if a 3rd party side contains a JavaScript file:
// http://example.com/runme.js
document.write("I'm running");
And a PHP application directly outputs a string passed into it:
<?php
echo '<div>' . $_GET['input'] . '</div>';
If an unchecked GET parameter contains <script src="http://example.com/runme.js"></script> then the output of the PHP script will be:
<div><script src="http://example.com/runme.js"></script></div>
The 3rd party JavaScript will run and the user will see "I'm running" on the web page.
Solution
As a general rule, never trust input coming from a client. Every GET parameter, POST or PUT content, and cookie value could be anything at all, and should therefore be validated. When outputting any of these values, escape them so they will not be evaluated in an unexpected way.
Keep in mind that even in the simplest applications data can be moved around and it will be hard to keep track of all sources. Therefore it is a best practice to always escape output.
PHP provides a few ways to escape output depending on the context.
Filter Functions
PHPs Filter Functions allow the input data to the php script to be sanitized or validated in many ways. They are useful when saving or outputting client input.
HTML Encoding
htmlspecialchars will convert any "HTML special characters" into their HTML encodings, meaning they will then not be processed as standard HTML. To fix our previous example using this method:
<?php
echo '<div>' . htmlspecialchars($_GET['input']) . '</div>';
// or
echo '<div>' . filter_input(INPUT_GET, 'input', FILTER_SANITIZE_SPECIAL_CHARS) . '</div>';
Would output:
<div><script src="http://example.com/runme.js"></script></div>
Everything inside the <div> tag will not be interpreted as a JavaScript tag by the browser, but instead as a simple text node. The user will safely see:
<script src="http://example.com/runme.js"></script>
URL Encoding
When outputting a dynamically generated URL, PHP provides the urlencode function to safely output valid URLs. So, for example, if a user is able to input data that becomes part of another GET parameter:
<?php
$input = urlencode($_GET['input']);
// or
$input = filter_input(INPUT_GET, 'input', FILTER_SANITIZE_URL);
echo 'Link';
Any malicious input will be converted to an encoded URL parameter.
Using specialised external libraries or OWASP AntiSamy lists
Sometimes you will want to send HTML or other kind of code inputs. You will need to maintain a list of authorised words (white list) and un-authorized (blacklist).
You can download standard lists available at the OWASP AntiSamy website. Each list is fit for a specific kind of interaction (ebay api, tinyMCE, etc...). And it is open source.
There are libraries existing to filter HTML and prevent XSS attacks for the general case and performing at least as well as AntiSamy lists with very easy use.
For example you have HTML Purifier

In order of preference:
If you are using a templating engine (e.g. Twig, Smarty, Blade), check that it offers context-sensitive escaping. I know from experience that Twig does. {{ var|e('html_attr') }}
If you want to allow HTML, use HTML Purifier. Even if you think you only accept Markdown or ReStructuredText, you still want to purify the HTML these markup languages output.
Otherwise, use htmlentities($var, ENT_QUOTES | ENT_HTML5, $charset) and make sure the rest of your document uses the same character set as $charset. In most cases, 'UTF-8' is the desired character set.
Also, make sure you escape on output, not on input.

Many frameworks help handle XSS in various ways. When rolling your own or if there's some XSS concern, we can leverage filter_input_array (available in PHP 5 >= 5.2.0, PHP 7.)
I typically will add this snippet to my SessionController, because all calls go through there before any other controller interacts with the data. In this manner, all user input gets sanitized in 1 central location. If this is done at the beginning of a project or before your database is poisoned, you shouldn't have any issues at time of output...stops garbage in, garbage out.
/* Prevent XSS input */
$_GET = filter_input_array(INPUT_GET, FILTER_SANITIZE_STRING);
$_POST = filter_input_array(INPUT_POST, FILTER_SANITIZE_STRING);
/* I prefer not to use $_REQUEST...but for those who do: */
$_REQUEST = (array)$_POST + (array)$_GET + (array)$_REQUEST;
The above will remove ALL HTML & script tags. If you need a solution that allows safe tags, based on a whitelist, check out HTML Purifier.
If your database is already poisoned or you want to deal with XSS at time of output, OWASP recommends creating a custom wrapper function for echo, and using it EVERYWHERE you output user-supplied values:
//xss mitigation functions
function xssafe($data,$encoding='UTF-8')
{
return htmlspecialchars($data,ENT_QUOTES | ENT_HTML401,$encoding);
}
function xecho($data)
{
echo xssafe($data);
}

<?php
function xss_clean($data)
{
// Fix &entity\n;
$data = str_replace(array('&','<','>'), array('&amp;','&lt;','&gt;'), $data);
$data = preg_replace('/(&#*\w+)[\x00-\x20]+;/u', '$1;', $data);
$data = preg_replace('/(&#x*[0-9A-F]+);*/iu', '$1;', $data);
$data = html_entity_decode($data, ENT_COMPAT, 'UTF-8');
// Remove any attribute starting with "on" or xmlns
$data = preg_replace('#(<[^>]+?[\x00-\x20"\'])(?:on|xmlns)[^>]*+>#iu', '$1>', $data);
// Remove javascript: and vbscript: protocols
$data = preg_replace('#([a-z]*)[\x00-\x20]*=[\x00-\x20]*([`\'"]*)[\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iu', '$1=$2nojavascript...', $data);
$data = preg_replace('#([a-z]*)[\x00-\x20]*=([\'"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iu', '$1=$2novbscript...', $data);
$data = preg_replace('#([a-z]*)[\x00-\x20]*=([\'"]*)[\x00-\x20]*-moz-binding[\x00-\x20]*:#u', '$1=$2nomozbinding...', $data);
// Only works in IE: <span style="width: expression(alert('Ping!'));"></span>
$data = preg_replace('#(<[^>]+?)style[\x00-\x20]*=[\x00-\x20]*[`\'"]*.*?expression[\x00-\x20]*\([^>]*+>#i', '$1>', $data);
$data = preg_replace('#(<[^>]+?)style[\x00-\x20]*=[\x00-\x20]*[`\'"]*.*?behaviour[\x00-\x20]*\([^>]*+>#i', '$1>', $data);
$data = preg_replace('#(<[^>]+?)style[\x00-\x20]*=[\x00-\x20]*[`\'"]*.*?s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:*[^>]*+>#iu', '$1>', $data);
// Remove namespaced elements (we do not need them)
$data = preg_replace('#</*\w+:\w[^>]*+>#i', '', $data);
do
{
// Remove really unwanted tags
$old_data = $data;
$data = preg_replace('#</*(?:applet|b(?:ase|gsound|link)|embed|frame(?:set)?|i(?:frame|layer)|l(?:ayer|ink)|meta|object|s(?:cript|tyle)|title|xml)[^>]*+>#i', '', $data);
}
while ($old_data !== $data);
// we are done...
return $data;
}

You are also able to set some XSS related HTTP response headers via header(...)
X-XSS-Protection "1; mode=block"
to be sure, the browser XSS protection mode is enabled.
Content-Security-Policy "default-src 'self'; ..."
to enable browser-side content security. See this one for Content Security Policy (CSP) details: http://content-security-policy.com/
Especially setting up CSP to block inline-scripts and external script sources is helpful against XSS.
for a general bunch of useful HTTP response headers concerning the security of you webapp, look at OWASP: https://www.owasp.org/index.php/List_of_useful_HTTP_headers

Use htmlspecialchars on PHP. On HTML try to avoid using:
element.innerHTML = “…”;
element.outerHTML = “…”;
document.write(…);
document.writeln(…);
where var is controlled by the user.
Also obviously try avoiding eval(var),
if you have to use any of them then try JS escaping them, HTML escape them and you might have to do some more but for the basics this should be enough.

The best way to protect your input it's use htmlentities function.
Example:
htmlentities($target, ENT_QUOTES, 'UTF-8');
You can get more information here.

Related

How to safely XSS encode untrusted data coming from PHP through AJAX injected into the DOM via javascript?

I felt pretty confident with XSS prevention with an older setup we had on our site ... we were using OWASP's XSS mitigation functions for stroking out user supplied data from a database (we inject values into DB directly via prepared statements, no encoding takes place till output time) and printing it via (simplified for readability):
show.php
print "<li>";
print "<a href='page?id=".xssafe($row->TRUSTED_VALUE)."'>".xssafe($row->UNTRUSTED_VALUE)."</a>";
print "</li>";
For numerous reasons, scalability, pagination, flexibility, we're switching to an AJAX oriented scheme. Instead of printing out these LI blocks directly, we AJAX them in immediately on page load (technically $(document).ready()) and let the client via javascript & jQuery handle everything. I'm concerned about this approach as I've read a ton on the subject and am still not confident in how to maintain XSS security.
Our new setup is this:
retrieve.php (I originally still had the xssafe() wrappers, but read that I should just use json_encode())
$data['TRUSTED_VALUE'] = $row->TRUSTED_VALUE; // 123
$data['UNTRUSTED_VALUE'] = $row->UNTRUSTED_VALUE; // who knows?
header('Content-Type: application/json');
print json_encode($data);
show.php
<script src="show.js"></script>
show.js
$.ajax({
url: 'retrive.php',
dataType: 'json',
data: {page: pageNum},
success: loadLI
});
function loadLI() {
data = response.data;
var li = document.createElement('li');
var anchor = document.createElement('a');
anchor.setAttribute('href', 'page?id='+encodeURIComponent(data.TRUSTED_VALUE));
anchor.appendChild(document.createTextNode(data.UNTRUSTED_VALUE));
li.appendChild(anchor);
}
Should I keep the xssafe() wrapper functions in our retrieve.php script, then json_encode, then inject those values via Javascript? Or is our new setup safe? Or is there a better way to do this? Thanks.
What you're doing appears safe.
createTextNode creates a text node on the page - JavaScript will handle the encoding internally for you.
setAttribute will set an attribute on the page - the same applies here, the parameter is taken as a strongly typed value and it shouldn't be possible to escape it using malicious code.
Should I keep the xssafe() wrapper functions in our retrieve.php
script, then json_encode
So, no.

Will this method prevent from xss [duplicate]

How do I prevent XSS (cross-site scripting) using just HTML and PHP?
I've seen numerous other posts on this topic but I have not found an article that clear and concisely states how to actually prevent XSS.
Basically you need to use the function htmlspecialchars() whenever you want to output something to the browser that came from the user input.
The correct way to use this function is something like this:
echo htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
Google Code University also has these very educational videos on Web Security:
How To Break Web Software - A look at security vulnerabilities in
web software
What Every Engineer Needs to Know About Security
and Where to Learn It
One of the most important steps is to sanitize any user input before it is processed and/or rendered back to the browser. PHP has some "filter" functions that can be used.
The form that XSS attacks usually have is to insert a link to some off-site javascript that contains malicious intent for the user. Read more about it here.
You'll also want to test your site - I can recommend the Firefox add-on [XSS Me]. Looks like Easy XSS is now the way to go.
Cross-posting this as a consolidated reference from the SO Documentation beta which is going offline.
Problem
Cross-site scripting is the unintended execution of remote code by a web client. Any web application might expose itself to XSS if it takes input from a user and outputs it directly on a web page. If input includes HTML or JavaScript, remote code can be executed when this content is rendered by the web client.
For example, if a 3rd party side contains a JavaScript file:
// http://example.com/runme.js
document.write("I'm running");
And a PHP application directly outputs a string passed into it:
<?php
echo '<div>' . $_GET['input'] . '</div>';
If an unchecked GET parameter contains <script src="http://example.com/runme.js"></script> then the output of the PHP script will be:
<div><script src="http://example.com/runme.js"></script></div>
The 3rd party JavaScript will run and the user will see "I'm running" on the web page.
Solution
As a general rule, never trust input coming from a client. Every GET parameter, POST or PUT content, and cookie value could be anything at all, and should therefore be validated. When outputting any of these values, escape them so they will not be evaluated in an unexpected way.
Keep in mind that even in the simplest applications data can be moved around and it will be hard to keep track of all sources. Therefore it is a best practice to always escape output.
PHP provides a few ways to escape output depending on the context.
Filter Functions
PHPs Filter Functions allow the input data to the php script to be sanitized or validated in many ways. They are useful when saving or outputting client input.
HTML Encoding
htmlspecialchars will convert any "HTML special characters" into their HTML encodings, meaning they will then not be processed as standard HTML. To fix our previous example using this method:
<?php
echo '<div>' . htmlspecialchars($_GET['input']) . '</div>';
// or
echo '<div>' . filter_input(INPUT_GET, 'input', FILTER_SANITIZE_SPECIAL_CHARS) . '</div>';
Would output:
<div><script src="http://example.com/runme.js"></script></div>
Everything inside the <div> tag will not be interpreted as a JavaScript tag by the browser, but instead as a simple text node. The user will safely see:
<script src="http://example.com/runme.js"></script>
URL Encoding
When outputting a dynamically generated URL, PHP provides the urlencode function to safely output valid URLs. So, for example, if a user is able to input data that becomes part of another GET parameter:
<?php
$input = urlencode($_GET['input']);
// or
$input = filter_input(INPUT_GET, 'input', FILTER_SANITIZE_URL);
echo 'Link';
Any malicious input will be converted to an encoded URL parameter.
Using specialised external libraries or OWASP AntiSamy lists
Sometimes you will want to send HTML or other kind of code inputs. You will need to maintain a list of authorised words (white list) and un-authorized (blacklist).
You can download standard lists available at the OWASP AntiSamy website. Each list is fit for a specific kind of interaction (ebay api, tinyMCE, etc...). And it is open source.
There are libraries existing to filter HTML and prevent XSS attacks for the general case and performing at least as well as AntiSamy lists with very easy use.
For example you have HTML Purifier
In order of preference:
If you are using a templating engine (e.g. Twig, Smarty, Blade), check that it offers context-sensitive escaping. I know from experience that Twig does. {{ var|e('html_attr') }}
If you want to allow HTML, use HTML Purifier. Even if you think you only accept Markdown or ReStructuredText, you still want to purify the HTML these markup languages output.
Otherwise, use htmlentities($var, ENT_QUOTES | ENT_HTML5, $charset) and make sure the rest of your document uses the same character set as $charset. In most cases, 'UTF-8' is the desired character set.
Also, make sure you escape on output, not on input.
Many frameworks help handle XSS in various ways. When rolling your own or if there's some XSS concern, we can leverage filter_input_array (available in PHP 5 >= 5.2.0, PHP 7.)
I typically will add this snippet to my SessionController, because all calls go through there before any other controller interacts with the data. In this manner, all user input gets sanitized in 1 central location. If this is done at the beginning of a project or before your database is poisoned, you shouldn't have any issues at time of output...stops garbage in, garbage out.
/* Prevent XSS input */
$_GET = filter_input_array(INPUT_GET, FILTER_SANITIZE_STRING);
$_POST = filter_input_array(INPUT_POST, FILTER_SANITIZE_STRING);
/* I prefer not to use $_REQUEST...but for those who do: */
$_REQUEST = (array)$_POST + (array)$_GET + (array)$_REQUEST;
The above will remove ALL HTML & script tags. If you need a solution that allows safe tags, based on a whitelist, check out HTML Purifier.
If your database is already poisoned or you want to deal with XSS at time of output, OWASP recommends creating a custom wrapper function for echo, and using it EVERYWHERE you output user-supplied values:
//xss mitigation functions
function xssafe($data,$encoding='UTF-8')
{
return htmlspecialchars($data,ENT_QUOTES | ENT_HTML401,$encoding);
}
function xecho($data)
{
echo xssafe($data);
}
<?php
function xss_clean($data)
{
// Fix &entity\n;
$data = str_replace(array('&','<','>'), array('&amp;','&lt;','&gt;'), $data);
$data = preg_replace('/(&#*\w+)[\x00-\x20]+;/u', '$1;', $data);
$data = preg_replace('/(&#x*[0-9A-F]+);*/iu', '$1;', $data);
$data = html_entity_decode($data, ENT_COMPAT, 'UTF-8');
// Remove any attribute starting with "on" or xmlns
$data = preg_replace('#(<[^>]+?[\x00-\x20"\'])(?:on|xmlns)[^>]*+>#iu', '$1>', $data);
// Remove javascript: and vbscript: protocols
$data = preg_replace('#([a-z]*)[\x00-\x20]*=[\x00-\x20]*([`\'"]*)[\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iu', '$1=$2nojavascript...', $data);
$data = preg_replace('#([a-z]*)[\x00-\x20]*=([\'"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iu', '$1=$2novbscript...', $data);
$data = preg_replace('#([a-z]*)[\x00-\x20]*=([\'"]*)[\x00-\x20]*-moz-binding[\x00-\x20]*:#u', '$1=$2nomozbinding...', $data);
// Only works in IE: <span style="width: expression(alert('Ping!'));"></span>
$data = preg_replace('#(<[^>]+?)style[\x00-\x20]*=[\x00-\x20]*[`\'"]*.*?expression[\x00-\x20]*\([^>]*+>#i', '$1>', $data);
$data = preg_replace('#(<[^>]+?)style[\x00-\x20]*=[\x00-\x20]*[`\'"]*.*?behaviour[\x00-\x20]*\([^>]*+>#i', '$1>', $data);
$data = preg_replace('#(<[^>]+?)style[\x00-\x20]*=[\x00-\x20]*[`\'"]*.*?s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:*[^>]*+>#iu', '$1>', $data);
// Remove namespaced elements (we do not need them)
$data = preg_replace('#</*\w+:\w[^>]*+>#i', '', $data);
do
{
// Remove really unwanted tags
$old_data = $data;
$data = preg_replace('#</*(?:applet|b(?:ase|gsound|link)|embed|frame(?:set)?|i(?:frame|layer)|l(?:ayer|ink)|meta|object|s(?:cript|tyle)|title|xml)[^>]*+>#i', '', $data);
}
while ($old_data !== $data);
// we are done...
return $data;
}
You are also able to set some XSS related HTTP response headers via header(...)
X-XSS-Protection "1; mode=block"
to be sure, the browser XSS protection mode is enabled.
Content-Security-Policy "default-src 'self'; ..."
to enable browser-side content security. See this one for Content Security Policy (CSP) details: http://content-security-policy.com/
Especially setting up CSP to block inline-scripts and external script sources is helpful against XSS.
for a general bunch of useful HTTP response headers concerning the security of you webapp, look at OWASP: https://www.owasp.org/index.php/List_of_useful_HTTP_headers
Use htmlspecialchars on PHP. On HTML try to avoid using:
element.innerHTML = “…”;
element.outerHTML = “…”;
document.write(…);
document.writeln(…);
where var is controlled by the user.
Also obviously try avoiding eval(var),
if you have to use any of them then try JS escaping them, HTML escape them and you might have to do some more but for the basics this should be enough.
The best way to protect your input it's use htmlentities function.
Example:
htmlentities($target, ENT_QUOTES, 'UTF-8');
You can get more information here.

Share page via url in safe way?

My index.php is making ajax post calls to ajax.php and getting echoed json as result, then parsed and displayed with js. So it is POST. I want to be able to share that result, via link like mydomain/index.php?q=foo&q1=foo1.
Here is basic pseudo-code scenario that isn't safe, I want suggestion how to achieve this in safe manner?
//index.php
//js
$.post('ajax.php', querystring, function(){
collect_result = result;
});
//ajax.php
parse($_POST);
echo json_encode(result);
// I want to be able to share result in way
http://.../index.php?q=foo&q1=foo1
//index.php
if(!empty($_GET['q]))
$querystr = http_build_query($_GET, '', '&');
<div id="div1" style="display:none"><?php echo $querystr; ?></div>
//then get it wuth jquery and make ajax.post()
$.post('ajax.php', $('#div1').html(), function(){
collect_result = result;
});
//BUT THEN USER IS ABLE TO DIRECT INJECT CODE INTO MY HTML (XSS)
//IS THERE SAFE WAY TO DO THIS SHARING VIA LINK???
Not sure if you refer to this, but as mentioned in w3schools:
The htmlspecialchars() function converts special characters to HTML entities. This means that it will replace HTML characters like < and > with < and >. This prevents attackers from exploiting the code by injecting HTML or Javascript code (Cross-site Scripting attacks) in forms.
I work with ajax on a daily basis dealing with this dilema pretty often. There are times where I have to even put password and user in a url. When sensitive data gets visible like that, I encrypt the variables. I use a php encryption method with a special key, and I decode it when I receive the variable by POST. If your are interested in this method you can look at the cbc encryption/decryption. I am sure there are others but cbc seems to be the safest. (be sure to enable the mcrypt in the php).

How to access the DOM of a user selected web address

I need to do what a bookmarklet does but from my page directly.
I need to pull the document.title property of a web page given that url.
So say a user types in www.google.com, I want to be able to somehow pull up google.com maybe in an iframe and than access the document.title property.
I know that bookmarklets ( the javacript that runs from bookmark bar ) can access the document.title property of any site the user is on and then ajax that information to the server.
This is essentially what I want to do but from my web page directly with out the use of a bookmarklet.
According to This question
You can achive this using PHP, try this code :
<?php
function getTitle($Url){
$str = file_get_contents($Url);
if(strlen($str)>0){
preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
return $title[1];
}
}
//Example:
echo getTitle("http://www.washingtontimes.com/");
?>
However, i assume it is possible to read file content with JS and do the same logic of searching for the tags.
Try searching here
Unfortunately, its not that easy. For security reasons, JavaScript is not allowed to access the document object of a frame or window that is not on the same domain. This sort of thing has to be done with a request to a backend PHP script that can fetch the requested page, go through the DOM, and retrieve the text in the <title> tag. If you don't have that capability, what you're asking will be much harder.
Here is the basic PHP script, which will fetch the page and use PHP's DOM extension to parse the page's title:
<?php
$html = file_get_contents($_GET["url"]);
$dom = new DOMDocument;
$dom->loadXML($html);
$titles = $dom->getElementsByTagName('title');
foreach ($titles as $title) {
echo $title->nodeValue;
}
?>
Demo: http://www.dstrout.net/pub/title.htm
You could write a server side script that would retrieve the page for you (i.e. using curl) and pars the dom and return the desired properties as json. Then call it with ajax.

How do I render javascript from another site, inside a PHP application?

What I'm trying to do is read a specific line from a webpage from inside of my PHP application. This is my experimental setup thus far:
<?php
$url = "http://www.some-web-site.com";
$file_contents = file_get_contents($url);
$findme = 'text to be found';
$pos = strpos($file_contents, $findme);
if ($pos == false) {
echo "The string '$findme' was not found in the string";
} else {
echo "The string '$findme' was found in the string";
echo " and exists at position $pos";
}
?>
The "if" statements contain echo operators for now, this will change to database operators later on, the current setup is to test functionality.
Basically the problem is, with using this method any java on the page is returned as script. What I need is the text that the script is supposed to render inside the browser. Is there any way to do this within PHP?
What I'm ultimately trying to achieve is updating stock from within an ecommerce site via reading the stock level from the site's supplier. The supplier does not use RSS feeds for this.
cURL does not have a javascript parser. as such, if the content you are trying to read is placed in the page via Javascript after initial page render, then it will not be accesible via cURL.
The result of the script is supposed executed and return back to your script.
PHP doesn't support any feature about web browser itself.
I suggest you try to learn about "web crawler" and "webbrowsers" which are included in .NET framework ( not PHP )
so that you can use the exec() command in php to call it.
try to find out the example code of web crawler and web browsers on codeproject.com
hope it works.
You can get the entire web page as a file like this:
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$returned_content = get_data('http://example.com/page.htm');
$my_file = 'file.htm';
$handle = fopen($my_file, 'w') or die('Cannot open file: '.$my_file);
fwrite($handle, $returned_content);
Then I suppose you can use a class such as explained in this link below as a guide to separate the javascript from the html (its in the head tags usually). for linked(imported) .js files you would have to repeat the function for those urls, and also for linked/imported css. You can also grab images if you need to save them as files.
http://www.digeratimarketing.co.uk/2008/12/16/curl-page-scraping-script/

Categories

Resources