Universal website crawler using PHP [duplicate] - javascript

This question already has answers here:
How to show google.com in an iframe?
(9 answers)
Closed 9 years ago.
I want to create a universal website crawler using PHP.
Using my web application, a user will enter any URL, specify what they need to extract from that site, and click a Start button.
Then my web application will begin fetching data from the source website.
I am loading the page in an iframe and, using jQuery, I get the class and tag names of the specific area the user selects.
But when I load an external website like eBay or Amazon, it does not work, as these sites are restricted. Is there any way to resolve this issue so I can load any site in an iframe? Or is there an alternative way to achieve what I want?
I am actually inspired by mozenda, a software developed in .NET, http://www.mozenda.com/video01-overview/.
They load a site in a browser control and it's almost the same thing.

You can't crawl a site on the client side if the target website returns the "X-Frame-Options: SAMEORIGIN" response header, because the browser will refuse to render it in your iframe (see @mc10's duplicate link in the question comments). You must crawl the target site using server-side functionality.
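Whether a page can be framed at all is decided by its response headers, so it can be checked up front. A minimal sketch, assuming the headers have already been fetched server-side into a plain object; `isFramingBlocked` is a hypothetical helper name, and as a simplification it treats any `frame-ancestors` directive other than `*` as blocking:

```javascript
// Hypothetical helper: decides from response headers whether a browser
// would refuse to render the page inside a third-party iframe.
// `headers` is assumed to be a plain object of header name -> value.
function isFramingBlocked(headers) {
  const normalized = {};
  for (const [name, value] of Object.entries(headers)) {
    normalized[name.toLowerCase()] = String(value);
  }

  // X-Frame-Options: DENY and SAMEORIGIN both block a third-party frame.
  const xfo = (normalized['x-frame-options'] || '').trim().toUpperCase();
  if (xfo === 'DENY' || xfo === 'SAMEORIGIN') {
    return true;
  }

  // Content-Security-Policy: a frame-ancestors directive restricts framing.
  // Simplification: anything other than "*" is treated as blocking.
  const csp = normalized['content-security-policy'] || '';
  for (const directive of csp.split(';')) {
    const parts = directive.trim().split(/\s+/);
    if (parts[0].toLowerCase() === 'frame-ancestors') {
      return !parts.slice(1).includes('*');
    }
  }
  return false;
}
```

If this returns true for a target site, the iframe approach cannot work for that site and the page has to be fetched on the server instead.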
The following solution might be suitable if wget has all of the options that you need. wget -r recursively crawls a site and downloads the documents. It has many useful options, such as converting the links in downloaded pages so they point to the local copies (-k / --convert-links).
Note: wget must be installed on your system for this to work. I don't know which operating system you're running, but on Ubuntu you can install it with sudo apt-get install wget.
See wget --help for additional options.
<?php
$website_url = $_GET['user_input_url'];
//doesn't work for ipv6 addresses
//http://php.net/manual/en/function.filter-var.php
if( filter_var($website_url, FILTER_VALIDATE_URL) !== false ){
$command = "wget -r " . escapeshellarg( $website_url );
system( $command );
//iterate through downloaded files and folders
}else{
//handle invalid url
}

You can sub in what element you're looking for in the second foreach loop within the following script. As is, the script starts at CNN's homepage, follows external links to visit up to 100 domains, and writes every domain whose page contains a script tag mentioning google-analytics.com/ga.js to a text file named "cnnLinks.txt" in the same folder in which this file is located.
Just change the $pre, $base, and $post variables to whatever URL you want to crawl! I separated them like that so I can switch between common websites faster.
<?php
set_time_limit(0);
$pre = "http://www.";
$base = "cnn";
$post = ".com";
$domain = $pre.$base.$post;
$content = "google-analytics.com/ga.js";
$content_tag = "script";
$output_file = "cnnLinks.txt";
$max_urls_to_check = 100;
$rounds = 0;
$domain_stack = array();
$max_size_domain_stack = 1000;
$checked_domains = array();
while ($domain != "" && $rounds < $max_urls_to_check) {
$doc = new DOMDocument();
@$doc->loadHTMLFile($domain);
$found = false;
foreach($doc->getElementsByTagName($content_tag) as $tag) {
if (strpos($tag->nodeValue, $content) !== false) {
$found = true;
break;
}
}
$checked_domains[$domain] = $found;
foreach($doc->getElementsByTagName('a') as $link) {
$href = $link->getAttribute('href');
if (strpos($href, 'http://') !== false && strpos($href, $domain) === false) {
$href_array = explode("/", $href);
if (count($domain_stack) < $max_size_domain_stack &&
!isset($checked_domains["http://".$href_array[2]])) {
array_push($domain_stack, "http://".$href_array[2]);
}
}
}
$domain_stack = array_values(array_unique($domain_stack));
// take the next domain from the front of the queue; "" ends the loop once it is empty
$domain = array_shift($domain_stack) ?? "";
$rounds++;
}
$found_domains = "";
foreach ($checked_domains as $key => $value) {
if ($value) {
$found_domains .= $key."\n";
}
}
file_put_contents($output_file, $found_domains);
?>
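Stripped of the DOM handling, the loop above is a breadth-first crawl: a queue of domains still to visit plus a record of those already checked. The same control flow can be sketched in JavaScript; `crawl`, `getLinks`, and the sample link graph are hypothetical stand-ins for the page loading, not part of the script above:

```javascript
// Breadth-first crawl over a queue with a visited set.
// `getLinks(url)` is a hypothetical callback that returns the outbound
// links of a page; here it reads from an in-memory map instead of the network.
function crawl(startUrl, getLinks, maxPages) {
  const visited = new Set();
  const queue = [startUrl];

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue; // skip anything already checked
    visited.add(url);

    for (const link of getLinks(url)) {
      if (!visited.has(link) && !queue.includes(link)) {
        queue.push(link);
      }
    }
  }
  return [...visited];
}

// Sample link graph standing in for real pages.
const graph = {
  'http://a.example': ['http://b.example', 'http://c.example'],
  'http://b.example': ['http://a.example', 'http://d.example'],
  'http://c.example': [],
  'http://d.example': [],
};
const order = crawl('http://a.example', (url) => graph[url] || [], 3);
console.log(order);
```

The `maxPages` limit plays the role of `$max_urls_to_check`: the crawl stops once that many pages have been visited, even if the queue is not empty.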

Take a look at using the file_get_contents function in PHP.
You may have better success in retrieving the HTML for a given site like this:
$html = file_get_contents('http://www.ebay.com');

Related

How to Detect If JS is Running in Website Builder?

I want to display forums inside websites where my javascript (and HTML and CSS) is embedded, but if the javascript is running inside a website builder, I just want to have some text telling the user their forums are installed here (in the embedded DIV) and not try to display any forums. My only idea is to look at the URL and if I see a known website builder, then run the website builder code, but I would need a large list of all website builder URLs. Does anyone have such a list or is there a better solution? My current code looks like this:
var hostURL = window.location.href;
if (hostURL == "about:srcdoc") hostURL = window.parent.location.href;
if (hostURL.indexOf("websites.godaddy.com") > -1 || // godaddy
hostURL.indexOf(".preview.editmysite.com") > -1) { // weebly
displayWebsiteBuilderInfo();
return;
}
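A substring test on the whole URL also matches when the builder's host name merely appears in a path or query string. A minimal sketch that compares only the host name instead; `isWebsiteBuilderHost` is a hypothetical helper, and the two suffixes are just the ones already in the snippet above, not a complete list:

```javascript
// Host-name suffixes of known website builders. Only the two from the
// snippet above are listed; further entries would have to be collected.
const BUILDER_HOST_SUFFIXES = ['websites.godaddy.com', 'preview.editmysite.com'];

// Hypothetical helper: true if the URL's host is one of the builder hosts
// or a subdomain of one.
function isWebsiteBuilderHost(url) {
  let host;
  try {
    host = new URL(url).hostname;
  } catch (e) {
    return false; // not a parseable absolute URL
  }
  return BUILDER_HOST_SUFFIXES.some(
    (suffix) => host === suffix || host.endsWith('.' + suffix)
  );
}
```

This still depends on a hand-maintained list, so it narrows false positives but does not remove the need to know each builder's host.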
Here's what I did, but I'm not sure if it's a good solution (and it's not a solution for the original question):
In the PHP code that handles the request to get the forums data I read the content at the referer URL (comes from the client - window.location.href) to see if the javascript is there. If it's not there, assume the request came from a website builder. Then if isWebsiteBuilder is true back at the client, call displayWebsiteBuilderInfo();
Here's the PHP code:
$siteContent = @file_get_contents($referer);
$siteContent = htmlspecialchars_decode($siteContent);
$idx = strpos($siteContent, "<script async src=\"https://www.bubblecritic.com/js/embed/the_js.js\"></script>");
if ($idx === false) $isWebsiteBuilder = true;

Reading directory contents from a client's computer - PHP

I want to recursively read the contents of a folder chosen by the client on my site.
I have used opendir() and scandir(), but they are unable to read directory contents from the client's computer.
Is there any way that I can read the file names from a visitor's directory?
function ListIn($dir, $prefix = '') {
$dir = rtrim($dir, '\\/');
$result = array();
foreach (scandir($dir) as $f) {
if ($f !== '.' and $f !== '..') {
if (is_dir("$dir/$f")) {
$result = array_merge($result, ListIn("$dir/$f", "$prefix$f/"));
} else {
$result[] = $prefix.$f;
}
}
}
return $result;
}
I need to implement this in either php or javascript.
This is possible with the File APIs that browsers now provide to JavaScript:
http://www.html5rocks.com/en/tutorials/file/dndfiles/
However, if you need raw read/write access, I suggest you read up on the Chrome platform APIs.
You cannot do this with PHP or any other server side technology.
You might be able to do this with a browser plugin or a flash app.
Ask yourself why you want to do this?
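For a folder the visitor picks explicitly, the browser hands the chosen files to JavaScript. A minimal sketch, assuming an `<input type="file" webkitdirectory>` element (a non-standard but widely supported attribute); `listRelativePaths` and the `folder-picker` id are hypothetical names:

```javascript
// Collects the relative paths of the files in a picked folder.
// `files` is a FileList (or any array-like of File objects); each File from
// a webkitdirectory input carries its path in `webkitRelativePath`.
function listRelativePaths(files) {
  return Array.from(files, (file) => file.webkitRelativePath || file.name).sort();
}

// Browser-only wiring, assuming the page contains
// <input type="file" id="folder-picker" webkitdirectory>
if (typeof document !== 'undefined') {
  const picker = document.getElementById('folder-picker');
  if (picker) {
    picker.addEventListener('change', () => {
      console.log(listRelativePaths(picker.files));
    });
  }
}
```

Only files inside the folder the user selected are visible to the page, and the server sees nothing unless the page uploads the list itself.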

How to scrape page links from dynamic web page using PHP?

I'd like to scrape the dynamically created URLs in this web page's menu using PHP:
http://groceries.iceland.co.uk/
I have previously used something like this:
<?php
$baseurls = array("http://groceries.iceland.co.uk/");
foreach ($baseurls as $source)
{
$html = file_get_contents($source);
$start = strpos($html,'<nav id="mainNavigation"');
$end = strpos($html,'</nav>',$start);
$mainarea = substr($html,$start,$end-$start);
$dom = new DOMDocument();
@$dom->loadHTML($mainarea);
// grab all the urls on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++)
{
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
}
}
?>
but it's not doing the job for this particular page. For example, my code returns a url such as:
groceries.iceland.co.uk//frozen-chips-and-potato-products
but I want it to give me:
groceries.iceland.co.uk//frozen/chips-and-potato-products/c/FRZCAP?q=:relevance&view=list
The browser adds "/c/FRZCAP?q=:relevance&view=list" to the end and this is what I want.
Hope you can help
Thanks
Edit: Just to confirm, I had a look at the website you're trying to scrape with JavaScript turned off, and it appears that the main navigation URLs are generated using JavaScript, so you will be unable to scrape the page without using a headless browser.
Per @Sam and @halfer's comments, if you need to scrape a site that has dynamic URLs generated by JavaScript, then you will need to use a scraper that supports JavaScript.
If you want to do the bulk of your development in PHP, then I recommend not trying to use a headless browser via PHP and instead relying on a service that can scrape a JavaScript rendered page and return the contents for you.
The best one that I've found, and one that we use in our projects, is https://phantomjscloud.com/

Get webpage and read through it using javascript

Hi, I have a quick question: say you would like to connect to a website and search it for the links it contains. How do you do this with JavaScript?
I would like to do something like this:
var everythingAdiffrentPageContains; // go to some link, e.g. www.msn.se, and store the page in this variable
var pageLinks = [];
var anchors = everythingAdiffrentPageContains.getElementsByTagName('a');
var numAnchors = anchors.length;
for(var i = 0; i < numAnchors; i++) {
pageLinks.push(anchors[i].href);
}
We can assume here that we have access rights to the site, so this is not a concern.
In other words, I would like to go to some site and store all of that site's hyperlinks in an array. How would you do this in JavaScript?
Thanks
EDIT: As was pointed out, I'm not trying to connect to another domain. I'm trying to connect to another Apache web server inside my LAN that hosts a website I would like to scan for links.
Unfortunately, I do not have PHP on my web server :/ But a simple JavaScript would do it,
for example go to X:/folder/example.html,
read it, and store the links.
Unfortunately, you can't do this. "We can assume here that we have access rights to the site" is a false assumption from a JavaScript point of view if the page is on another domain. You simply can't access content on another domain (not HTML content, anyway) via JavaScript. This is prevented by the same-origin policy, which is in place for several security reasons.
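Two URLs share an origin only when scheme, host, and port all match, which is why another web server on the same LAN still counts as a different origin. A small sketch using the standard URL class; `isSameOrigin` and the example addresses are hypothetical:

```javascript
// True when both URLs have the same scheme, host, and port.
function isSameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

// Same server, different pages.
console.log(isSameOrigin('http://192.168.0.10/page.html', 'http://192.168.0.10/links.html')); // true
// Different host on the same LAN.
console.log(isSameOrigin('http://192.168.0.10/page.html', 'http://192.168.0.11/example.html')); // false
// Same host, different port.
console.log(isSameOrigin('http://192.168.0.10/', 'http://192.168.0.10:8080/')); // false
```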
I suggest you use a JS framework that helps you retrieve elements and work with the DOM easily.
For example, using MooTools you could achieve this with code like this:
var req = new Request.HTML({
url:'./retrieve.php?url=YOURURL', //create a server script to "retrieve" the html of another domain page
onSuccess: function(tree,DOMelements) {
var links = [];
DOMelements.getElements('a').each(function(element){
links.push(element.get('href'));
});
}
});
req.send();
The retrieve.php page should be written for example in this way:
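<?php
$url = $_GET['url'] ?? '';
// only proxy well-formed http(s) URLs; otherwise this script could be used to read local files
if (filter_var($url, FILTER_VALIDATE_URL) === false || !preg_match('#^https?://#i', $url)) {
    http_response_code(400);
    exit;
}
header('Content-type: application/xml');
echo file_get_contents($url);
?>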
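The same request can be made without a framework. A minimal sketch, assuming the `retrieve.php` proxy above is in place and the code runs in a browser (it relies on `fetch` and `DOMParser`); `collectLinks` and `fetchLinks` are hypothetical names:

```javascript
// Returns the href attribute of every anchor in a parsed document.
function collectLinks(doc) {
  return Array.from(doc.querySelectorAll('a[href]'), (a) => a.getAttribute('href'));
}

// Fetches a page through the same-origin proxy and extracts its links.
// Browser-only: DOMParser is not available in plain Node.js.
async function fetchLinks(targetUrl) {
  const response = await fetch('./retrieve.php?url=' + encodeURIComponent(targetUrl));
  if (!response.ok) {
    throw new Error('Proxy request failed with status ' + response.status);
  }
  const html = await response.text();
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return collectLinks(doc);
}
```

A call such as `fetchLinks('http://www.msn.se').then(console.log)` would then log the array of links, with the proxy doing the cross-domain part on the server.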

Relative urls for Javascript files

I have some code in a javascript file that needs to send queries back to the server. The question is, how do I find the url for the script that I am in, so I can build a proper request url for ajax.
I.e., the same script is included on /, /help, /whatever, and so on, while it will always need to request from /data.json. Additionally, the same site is run on different servers, where the /-folder might be placed differently. I have means to resolve the relative url where I include the Javascript (ez-publish template), but not within the javascript file itself.
Are there small scripts that will work on all browsers made for this?
For this I like to put <link> elements in the page's <head>, containing the URLs to use for requests. They can be generated by your server-side language so they always point to the right view:
<link id="link-action-1" href="${reverse_url ('action_1')}"/>
becomes
<link id="link-action-1" href="/my/web/root/action-1/"/>
and can be retrieved by Javascript with:
document.getElementById ('link-action-1').href;
document.location.href will give you the current URL, which you can then manipulate using JavaScript's string functions.
There's no way that the client can determine the webapp root without being told by the server as it has no knowledge of the server's configuration. One option you can try is to use the base element inside the head element, getting the server to generate it dynamically rather than hardcoding it (so it shows the relevant URL for each server):
<base href="http://path/to/webapp/root/" />
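All relative URLs will then be resolved against this base, so you would simply make your request to data.json. Note that the URL must not start with a slash: a root-relative URL such as /data.json is resolved against the host only and ignores the path in the base element. You also need to make sure that all other links in the application take the base into account.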
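How a request URL resolves against a base can be checked with the standard URL class, which follows the same resolution rules as the browser. A small sketch; the base address is a made-up example standing in for the value of the base element (available in a page as `document.baseURI`). A path with a leading slash is resolved against the host and skips the base path:

```javascript
// Made-up webapp root, standing in for the value of <base href="...">.
const base = 'http://example.com/my/web/root/';

// A relative path is appended to the base path.
const relative = new URL('data.json', base).href;
// A root-relative path ignores the base path entirely.
const rootRelative = new URL('/data.json', base).href;

console.log(relative);     // http://example.com/my/web/root/data.json
console.log(rootRelative); // http://example.com/data.json
```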
If the script knows its own filename, you can use document.getElementsByTagName(). Iterate through the list until you find the script that matches yours, and extract the full (or relative) url that way.
Here's an example:
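function getScriptUrl ( name ) {
    var scripts = document.getElementsByTagName('script');
    // escape regex metacharacters so that "demo.js" does not also match "demoXjs"
    var escaped = name.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    var re = new RegExp("(/|^)" + escaped + "$");
    var src;
    for( var i = 0; i < scripts.length; i++){
        src = scripts[i].getAttribute('src');
        // inline scripts have no src attribute, so getAttribute returns null
        if( src && re.test(src) )
            return src;
    }
    return null;
}
console.log( 'found ' + getScriptUrl('demo.js') );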
Take into consideration that this approach is subject to filename collisions.
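Browsers that support `document.currentScript` can skip the filename search, since that property points at the script element currently being executed. A minimal sketch; `resolveNextToScript` is a hypothetical helper and the URLs are made-up examples:

```javascript
// Resolves a file name relative to the URL of a script.
function resolveNextToScript(scriptUrl, fileName) {
  return new URL(fileName, scriptUrl).href;
}

// In a browser, read the script's own URL while it is first executing;
// document.currentScript is null inside callbacks and in module scripts,
// and its src is empty for inline scripts.
if (typeof document !== 'undefined' && document.currentScript && document.currentScript.src) {
  console.log(resolveNextToScript(document.currentScript.src, '../data.json'));
}

// Stand-alone example with a made-up script URL.
const resolved = resolveNextToScript('http://example.com/app/js/demo.js', '../data.json');
console.log(resolved); // http://example.com/app/data.json
```

Because the script's URL comes from the element itself, this avoids the filename collisions mentioned above.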
I include the following code in my library's main entry point (main.php):
/**
* Build current url, depending on protocol (http/https),
* port, server name and path suffix
*/
$is_https = isset($_SERVER["HTTPS"]) && $_SERVER["HTTPS"] == "on";
$site_root = $is_https ? 'https' : 'http';
$site_root .= "://" . $_SERVER["SERVER_NAME"];
// append the port only when it is not the default for the scheme
if ($_SERVER["SERVER_PORT"] != ($is_https ? "443" : "80"))
    $site_root .= ":" . $_SERVER["SERVER_PORT"];
$site_root .= $g_config["paths"]["site_suffix"];
$g_config["paths"]["site_root"] = $site_root;
$g_config is a global array containing configuration options. So site_suffix might look like "/sites_working/thesite/public_html" on your development box, and just "/" on a server with a virtual host (domain name).
This method is also good because the URL is built from whatever host name the request used: if somebody types in the IP address of your development box, that IP address is used to build the path to the JavaScript folder, and if you use "localhost", then "localhost" is used.
And because it also detects SSL, you won't have to worry about whether your resources will be sent over HTTP or HTTPS if you ever add SSL support to your server.
Then, in your template, either use
<link id="site_root" href="<?php echo $g_config["paths"]["site_root"] ?>"/>
Or
<script type = "text/javascript">
var SiteRoot = "<?php echo $g_config["paths"]["site_root"]; ?>";
</script>
I suppose the latter would be faster.
