How to scrape page links from a dynamic web page using PHP?

I'd like to scrape the dynamically created URLs in this web page's menu using PHP:
http://groceries.iceland.co.uk/
I have previously used something like this:
<?php
$baseurls = array("http://groceries.iceland.co.uk/");
foreach ($baseurls as $source)
{
    $html = file_get_contents($source);

    // cut out the main navigation block
    $start = strpos($html, '<nav id="mainNavigation"');
    $end = strpos($html, '</nav>', $start);
    $mainarea = substr($html, $start, $end - $start);

    $dom = new DOMDocument();
    @$dom->loadHTML($mainarea); // @ suppresses warnings from malformed markup

    // grab all the urls on the page
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0; $i < $hrefs->length; $i++)
    {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href');
    }
}
?>
but it's not doing the job for this particular page. For example, my code returns a url such as:
groceries.iceland.co.uk//frozen-chips-and-potato-products
but I want it to give me:
groceries.iceland.co.uk//frozen/chips-and-potato-products/c/FRZCAP?q=:relevance&view=list
The browser adds "/c/FRZCAP?q=:relevance&view=list" to the end and this is what I want.
Hope you can help
Thanks

Edit: Just to confirm, I had a look at the website you're trying to scrape with JavaScript turned off, and it appears that the main nav URLs are generated using JavaScript, so you will be unable to scrape the page without using a headless browser.
Per @Sam's and @halfer's comments, if you need to scrape a site whose URLs are generated dynamically by JavaScript, then you will need to use a scraper that supports JavaScript.
If you want to do the bulk of your development in PHP, then I recommend not trying to drive a headless browser from PHP and instead relying on a service that can scrape a JavaScript-rendered page and return its contents for you.
The best one that I've found, and one that we use in our projects, is https://phantomjscloud.com/
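Once you get the rendered HTML back from such a service, the rest can stay in plain PHP. Here is a rough sketch: the rendering-service endpoint below is only a placeholder (check the service's documentation for its real API), and the rest reuses DOMDocument/DOMXPath to pull the menu links out of the rendered markup.
<?php
// Sketch: fetch the JavaScript-rendered HTML from a rendering service,
// then pull the menu links out with DOMXPath.
// The $renderEndpoint URL is a placeholder, not a real API -- substitute the
// endpoint and API key from the service you choose.
$target = 'http://groceries.iceland.co.uk/';
$renderEndpoint = 'https://example-render-service/api/?url=' . urlencode($target); // hypothetical

$html = file_get_contents($renderEndpoint); // rendered HTML, JavaScript already executed

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$urls = array();
// restrict to the main navigation block, as in the original code
foreach ($xpath->query('//nav[@id="mainNavigation"]//a') as $a) {
    $urls[] = $a->getAttribute('href');
}
print_r($urls);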

Related

How to parse javascript variable array embedded in http://up-for-grabs.net/#/?

I am trying to parse http://up-for-grabs.net/#/ to get its content into a CSV file using PowerShell. I have written the code below so far:
$URL = "http://up-for-grabs.net/#/"
$HTML = Invoke-WebRequest -Uri $URL
$script_blocks = $HTML.ParsedHtml.getElementsByTagName("script") | Where { $_.type -eq 'text/javascript' }
$content = ""
foreach ($script_block in $script_blocks)
{
    if ($script_block.innerHTML -ne $null -and `
        $script_block.innerHTML.trim().StartsWith("var files"))
    {
        $content = $script_block.innerHTML.trim()
    }
}
Looking further into the content, it appears to be a variable array embedded in JavaScript whose initial lines are formatted as follows. The original is one array with no spaces or new lines; those are my addition to improve readability.
<script type="text/javascript">
var files = {
"aspnet-razor-4":{"name":"ASP.NET Razor 4","desc":"Parser and code generator for CSHTML files used in view pages for MVC web apps.","site":"https://github.com/aspnet/Razor","tags":["Microsoft","ASP.NET","Razor","MVC"], "upforgrabs":{"name":"up-for-grabs","link":"https://github.com/aspnet/Razor/labels/up-for-grabs"}},
"fsharpdatadbpedia":{"name":"FSharp.Data.DbPedia","desc":"FSharp.Data.DbPedia - An F# type provider for DBpedia","site":"https://github.com/fsprojects/FSharp.Data.DbPedia","tags":[".NET","DbPedia","F#"],"upforgrabs":{"name":"up-for-grabs","link":"https://github.com/fsprojects/FSharp.Data.DbPedia/labels/up-for-grabs"}},
"makesharp":{"name":"Make#","desc":"Use C# scripts to automate the building process","site":"https://github.com/sapiens/MakeSharp","tags":[".Net","C#","make","build","automation","tools"],"upforgrabs":{"name":"up-for-grabs","link":"https://github.com/sapiens/MakeSharp/labels/up-for-grabs"}},
"stateprinter":{"name":"StatePrinter","desc":"Automating unittest asserts and ToString() coding.","site":"https://github.com/kbilsted/StatePrinter","tags":["TDD","Unit Testing","TDD",".NET","C#","ToString","Debugging"],"upforgrabs":{"name":"Help wanted","link":"https://github.com/kbilsted/StatePrinter/labels/Help%20wanted"}}
</script>
This is immediately followed by:
var projects = new Array();
for (var fileName in files) {
    projects.push(files[fileName]);
}
How can I achieve similarly quick parsing in PowerShell without writing a lot of string-tokenization code?
After some research, I figured out that this is JSON content, for which the PowerShell cmdlet ConvertFrom-Json needs to be used. I do not want to copy the whole script here; please look at this GitHub location to see how to use this cmdlet effectively. Basically, you need to remember that the object returned by this cmdlet is a custom object which needs to be enumerated to get at the various properties. It is not an array, so only foreach will work to uncover the content. A small code sample is below:
$file_json = $file_string | ConvertFrom-Json   # $file_string holds the JSON text taken from the script block
$delim = " ; "
foreach ($item in $file_json | gm)
{
    $props = $file_json.$($item.Name)
    if ($props.MemberType) { continue }        # skip methods; keep only the project entries
    $row  = $props.name.ToString()
    $row += $delim + $props.desc.ToString()
    $row += $delim + $props.site.ToString()
}
Searching for this cmdlet on the web will give you more details on how to deal with this conversion.

Laravel 4 render() PDF with Javascript

I need to create a PDF from an HTML view that is rendered with JavaScript APIs.
This is my code in PHP (Laravel 4):
$con = Contrato::find($id);
$html = (string) View::make('contratos.contratopdf')->with('con',$con)->render();
return PDF::load(utf8_decode($html), 'A5', 'landscape')->show();
In the view I have this script
<script src="http://test.rentacar.cl/js/lib/jquery.signaturepad.min.js"></script>
This JS changes the DOM in normal HTML, but when I render the PDF it doesn't work.
Thanks!!
PhantomJS works nicely, take a look at this package: https://github.com/jonnnnyw/php-phantomjs
Full documentation here: http://jonnnnyw.github.io/php-phantomjs/
Example:
<?php
use JonnyW\PhantomJs\Client;
$client = Client::getInstance();
$request = $client->getMessageFactory()->createCaptureRequest('http://google.com');
$response = $client->getMessageFactory()->createResponse();
$file = '/path/to/save/your/screen/capture/file.jpg';
$request->setCaptureFile($file);
$client->send($request, $response);
From my experience with dompdf (a very popular PDF library), such libraries don't support JavaScript interpretation, because PHP has no way to execute the page's JavaScript. Here are two workarounds I think you can try:
1. Render the HTML server-side: move your DOM manipulation into PHP so the elements already exist in the markup (recommended).
2. Use PHP with PhantomJS: have PHP call PhantomJS to render the page, JavaScript included, and save the result as a PDF.
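For the second workaround, here is a minimal sketch of calling PhantomJS from PHP. It assumes the phantomjs binary is installed on the server and that rasterize.js (the example script bundled with PhantomJS) has been copied next to your code; the route used for the contract view is made up for illustration.
<?php
// Sketch: render a URL (JavaScript included) to PDF by shelling out to PhantomJS.
// Assumes `phantomjs` is on the PATH and rasterize.js (from PhantomJS's examples/
// directory) sits next to this script.
$url    = 'http://test.rentacar.cl/contratos/' . $id;  // hypothetical route serving the contract view
$output = storage_path() . '/contrato-' . $id . '.pdf';

$cmd = 'phantomjs rasterize.js ' . escapeshellarg($url) . ' ' . escapeshellarg($output) . ' A5';
exec($cmd, $cmdOutput, $status); // paper size/orientation handling depends on your rasterize.js version

if ($status === 0) {
    return Response::download($output); // Laravel 4 response helper
}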

PHP & code injection

The website for a client of mine keeps getting "hacked" (I didn't build the website). The hacked pages contain a JS script that loads an image and audio from YouTube (lol). Every page was modified, and every page has a "news banner". I'm pretty sure the problem is this part:
<?php
$ul = new NewsList;
$ul->Load(3);
if($ul->Current() == null){ ?>
<?php }
else{
    for(; $ul->Current() != null; $ul->Next()){
        $new = $ul->Current();
The complete implementation of this NewsList is here: http://pastebin.com/WuWjcJ4p
I'm not a PHP programmer, so I don't see where the problem is. I'm not asking anyone to explain every line, just for some advice. Thank you.
Sounds like an SQL injection.
I believe the loadById() method is injectable (depending on how you call it).
Here is a way to strengthen it:
function LoadById($id){
    $this->news = array();
    $this->current = 0;
    $this->total = 0;
    $ndb = new NewsDB('news');
    // intval() forces the id to an integer, so nothing can be injected into the query
    $result = $ndb->_query("SELECT * FROM ".$ndb->table." WHERE id = " . intval($id));
    $new = mysql_fetch_assoc($result);
    $n = new News($new['id'], $new['titolo'], $new['data'], $new['contenuto'], $new['img']);
    array_push($this->news, $n);
    unset($n);
    $this->total = 1;
}
Someone might have stolen the admin passwords using this security flaw and edited the articles from the back office.
So I suggest you change this code, then change the passwords, delete all PHP sessions, and finally edit your articles to remove this "news banner".
Note that it might also be a stored XSS.
Do you have a system that allows commenting on the news?
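For completeness, here is a sketch of the same lookup done with a prepared statement, assuming a PDO connection is available (the original code uses the old mysql_* functions); with a bound parameter the id can never break out of the query.
<?php
// Sketch: parameterised replacement for LoadById() using PDO (assumed available).
// The table name 'news' mirrors the NewsDB('news') call in the original code.
function loadNewsById(PDO $pdo, $id)
{
    $stmt = $pdo->prepare('SELECT * FROM news WHERE id = ?');
    $stmt->execute(array((int) $id));       // the value is bound, never concatenated into SQL
    return $stmt->fetch(PDO::FETCH_ASSOC);  // a single row, as in the original LoadById()
}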

Universal website crawler using PHP [duplicate]

This question already has answers here:
How to show google.com in an iframe?
(9 answers)
Closed 9 years ago.
I want to create a universal website crawler using PHP.
Using my web application, a user will input any URL, provide input on what they need to get from the given site, and click a Start button.
Then my web application will begin to get data from the source website.
I am loading the page in an iframe, and using jQuery I get the class and tag names of the specific area from the user.
But when I load an external website like eBay or Amazon, it does not work, as these sites are restricted. Is there any way to resolve this issue so I can load any site in an iframe? Or is there any alternative to what I want to achieve?
I am actually inspired by Mozenda, a piece of software developed in .NET: http://www.mozenda.com/video01-overview/
They load a site in a browser control, and it's almost the same thing.
You can't crawl a site on the client side if the target website is returning the "X-Frame-Options: SAMEORIGIN" response header (see @mc10's duplicate link in the question comments). You must crawl the target site using server-side functionality.
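If you want to verify this for a particular site from PHP, here is a quick sketch of checking the response headers; eBay is just an example target.
<?php
// Sketch: check whether a target site sends X-Frame-Options,
// which is what stops it from loading inside your iframe.
$headers = get_headers('https://www.ebay.com', 1);
foreach ($headers as $name => $value) {
    // header names can differ in case depending on the server
    if (is_string($name) && strcasecmp($name, 'X-Frame-Options') === 0) {
        echo "Framing blocked: " . (is_array($value) ? implode(', ', $value) : $value) . "\n";
    }
}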
The following solution might be suitable if wget has all of the options that you need. wget -r will recursively crawl a site and download the documents. It has many useful options, like translating absolute embedded urls to relative, local ones.
Note: wget must be installed in your system for this to work. I don't know which operating system you're running this on, but on Ubuntu, it's sudo apt-get install wget to install wget.
See: wget --help for additional options.
<?php
$website_url = $_GET['user_input_url'];

//doesn't work for ipv6 addresses
//http://php.net/manual/en/function.filter-var.php
if( filter_var($website_url, FILTER_VALIDATE_URL) !== false ){
    // note: PHP concatenates strings with ".", not "+"
    $command = "wget -r " . escapeshellarg( $website_url );
    system( $command );
    //iterate through downloaded files and folders
}else{
    //handle invalid url
}
You can sub in whatever element you're looking for in the second foreach loop within the following script. As-is, the script follows links out from CNN's homepage, checks up to 100 sites, and writes the matching domains into a text file named "cnnLinks.txt" in the same folder as this file.
Just change the $pre, $base, and $post variables to whatever URL you want to crawl! I separated them like that so you can switch between common websites faster.
<?php
set_time_limit(0);
$pre  = "http://www.";
$base = "cnn";
$post = ".com";
$domain = $pre.$base.$post;
$content = "google-analytics.com/ga.js";
$content_tag = "script";
$output_file = "cnnLinks.txt";
$max_urls_to_check = 100;
$rounds = 0;
$domain_stack = array();
$max_size_domain_stack = 1000;
$checked_domains = array();

while ($domain != "" && $rounds < $max_urls_to_check) {
    $doc = new DOMDocument();
    @$doc->loadHTMLFile($domain); // @ suppresses warnings from malformed markup

    // check whether the page contains the content we're looking for
    $found = false;
    foreach ($doc->getElementsByTagName($content_tag) as $tag) {
        if (strpos($tag->nodeValue, $content)) {
            $found = true;
            break;
        }
    }
    $checked_domains[$domain] = $found;

    // collect links to domains we haven't visited yet
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (strpos($href, 'http://') !== false && strpos($href, $domain) === false) {
            $href_array = explode("/", $href);
            if (count($domain_stack) < $max_size_domain_stack &&
                !isset($checked_domains["http://".$href_array[2]])) { // isset() avoids notices on unvisited domains
                array_push($domain_stack, "http://".$href_array[2]);
            }
        }
    }
    $domain_stack = array_unique($domain_stack);

    // take the next domain off the stack
    $domain = $domain_stack[0];
    unset($domain_stack[0]);
    $domain_stack = array_values($domain_stack);
    $rounds++;
}

// write out every domain where the content was found
$found_domains = "";
foreach ($checked_domains as $key => $value) {
    if ($value) {
        $found_domains .= $key."\n";
    }
}
file_put_contents($output_file, $found_domains);
?>
Take a look at using the file_get_contents function in PHP.
You may have better success in retrieving the HTML for a given site like this:
$html = file_get_contents('http://www.ebay.com');
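From there you can parse the markup server-side instead of in an iframe. Here is a minimal sketch that collects every link on the fetched page; substitute whatever element or class the user asked for.
<?php
// Sketch: fetch a page server-side and collect its links,
// which sidesteps the iframe / X-Frame-Options restriction entirely.
$html = file_get_contents('http://www.ebay.com');

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from imperfect real-world markup

$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}
print_r($links);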

Get webpage and read through it using javascript

Hi, I have a quick question: say that you would like to connect to a website and search it for the links it contains, how do you do this with JavaScript?
I would like to do something like this:
var everythingAdiffrentPageContains = // go to some link, e.g. www.msn.se, and store it in this variable
var pageLinks = [];
var anchors = everythingAdiffrentPageContains.getElementsByTagName('a');
var numAnchors = anchors.length;
for (var i = 0; i < numAnchors; i++) {
    pageLinks.push(anchors[i].href);
}
We can assume here that we have access rights to the site, so that is not a concern.
In other words, I would like to go to some site and store all of that site's hyperlinks in an array; how would you do this in JavaScript?
Thanks
EDIT: As pointed out, I'm not trying to connect to another domain. I'm trying to connect to another Apache web server inside my LAN that hosts a website I would like to scan for links.
Unfortunately I do not have PHP on my web server :/ But a simple JavaScript would do it,
for example: go to X:/folder/example.html,
read it, and store the links.
Unfortunately, you can't do this. "We can assume here that we have access rights to the site"... that's a false assumption from a JavaScript point of view if the page is on another domain. You simply can't access content on another domain (not HTML content, anyway) via JavaScript. It's prevented by the same-origin policy, which is in place for several security reasons.
I suggest you use a JS framework that helps you retrieve elements and work with the DOM easily.
For example, using MooTools you could achieve this by writing some code like this:
var req = new Request.HTML({
    url: './retrieve.php?url=YOURURL', // create a server script to "retrieve" the html of another domain page
    onSuccess: function(tree, DOMelements) {
        var links = [];
        DOMelements.getElements('a').each(function(element){
            links.push(element.get('href'));
        });
    }
});
req.send();
The retrieve.php page could be written, for example, this way:
<?php
$url = $_GET['url'];
header('Content-type: application/xml');
echo file_get_contents($url);
?>
