Get webpage and read through it using javascript - javascript

Hi, I have a quick question: say you would like to connect to a website and search it for the links it contains, how do you do this with JavaScript?
I would like to do something like this
var everythingADifferentPageContains = // go to some link, e.g. www.msn.se, and store it in this variable
var pageLinks = [];
var anchors = everythingADifferentPageContains.getElementsByTagName('a');
var numAnchors = anchors.length;
for (var i = 0; i < numAnchors; i++) {
    pageLinks.push(anchors[i].href);
}
We can assume here that we have access rights to the site, so this is not a concern.
In other words, I would like to go to some site and store all of that site's hyperlinks in an array. How would you do this in JavaScript?
Thanks
EDIT: Since it was pointed out, I'm not trying to connect to another domain. I'm trying to connect to another Apache web server inside my LAN that hosts a website that I would like to scan for links.
Unfortunately I do not have PHP on my web server :/ But a simple JavaScript solution would do it:
for example, go to X:/folder/example.html, read it, and store the links.

Unfortunately, you can't do this. "We can assume here that we have access rights to the site"... that's a false assumption from a JavaScript point of view if the page is on another domain. You simply can't access content on another domain (not HTML content, anyway) via JavaScript. It's prevented by the same-origin policy, which is in place for several security reasons.
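That said, if the target page really is served from the same origin as your script (same protocol, host and port), or the other server sends permissive CORS headers, you can fetch it and parse it in the browser. A minimal sketch, assuming a placeholder path /folder/example.html on the same origin:
var xhr = new XMLHttpRequest();
xhr.open('GET', '/folder/example.html', true); // placeholder path, same origin assumed
xhr.onload = function () {
    // Parse the fetched HTML and collect every hyperlink it contains.
    var doc = new DOMParser().parseFromString(xhr.responseText, 'text/html');
    var anchors = doc.getElementsByTagName('a');
    var pageLinks = [];
    for (var i = 0; i < anchors.length; i++) {
        pageLinks.push(anchors[i].href);
    }
    console.log(pageLinks);
};
xhr.send();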

I suggest you use a JS framework that helps you retrieve elements and work with the DOM easily.
For example, using MooTools you could achieve this with code like this:
var req = new Request.HTML({
    url: './retrieve.php?url=YOURURL', // create a server script to "retrieve" the html of another domain page
    onSuccess: function(tree, DOMelements) {
        var links = [];
        DOMelements.getElements('a').each(function(element){
            links.push(element.get('href'));
        });
    }
});
req.send();
The retrieve.php script could be written, for example, like this:
<?php
$url = $_GET['url'];
header('Content-type: application/xml');
echo file_get_contents($url);
?>

Related

How to Detect If JS is Running in Website Builder?

I want to display forums inside websites where my JavaScript (and HTML and CSS) is embedded, but if the JavaScript is running inside a website builder, I just want to show some text telling the user their forums are installed here (in the embedded DIV) and not try to display any forums. My only idea is to look at the URL, and if I see a known website builder, run the website-builder code, but I would need a large list of all website builder URLs. Does anyone have such a list, or is there a better solution? My current code looks like this:
var hostURL = window.location.href;
if (hostURL == "about:srcdoc") hostURL = window.parent.location.href;
if (hostURL.indexOf("websites.godaddy.com") > -1 ||      // godaddy
    hostURL.indexOf(".preview.editmysite.com") > -1) {   // weebly
    displayWebsiteBuilderInfo();
    return;
}
Here's what I did, but I'm not sure if it's a good solution (and it's not a solution for the original question):
In the PHP code that handles the request to get the forums data, I read the content at the referer URL (which comes from the client's window.location.href) to see whether my JavaScript is there. If it's not there, I assume the request came from a website builder. Then, if isWebsiteBuilder is true back at the client, I call displayWebsiteBuilderInfo().
Here's the PHP code:
$siteContent = @file_get_contents($referer);
$siteContent = htmlspecialchars_decode($siteContent);
$idx = strpos($siteContent, "<script async src=\"https://www.bubblecritic.com/js/embed/the_js.js\"></script>");
if ($idx === false) $isWebsiteBuilder = true;
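On the client side, the flag computed by that PHP handler can drive the branch described above. A minimal sketch, assuming a hypothetical /get_forums.php endpoint that returns JSON containing an isWebsiteBuilder field:
// Ask the server for the forums data and honor the isWebsiteBuilder flag
// it computed from the referer check. The endpoint name and JSON shape
// are assumptions for illustration only.
var xhr = new XMLHttpRequest();
xhr.open('GET', '/get_forums.php', true);
xhr.onload = function () {
    var data = JSON.parse(xhr.responseText);
    if (data.isWebsiteBuilder) {
        displayWebsiteBuilderInfo();   // show the "forums are installed here" text
    } else {
        displayForums(data);           // hypothetical function that renders the forums
    }
};
xhr.send();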

Can I scrape this site using just Node?

I'm very new to JavaScript, so be patient.
I've been trying to scrape a site and get all the product URLs into a list that I will use later in another function, like this:
var url = 'https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx';
var http = require('http-get');
var request = require("request");
var cheerio = require("cheerio");

function getURLS(url) {
    request(url, function(err, resp, body){
        var linklist = [];
        var $ = cheerio.load(body);
        var links = $('#productResults a');
        for (var valor in links) {
            if (links[valor].attribs && links[valor].attribs.href && linklist.indexOf(links[valor].attribs.href) == -1) {
                linklist.push(links[valor].attribs.href);
            }
        }
        var extended_links = [];
        linklist.forEach(function(link){
            var extended_link = 'https://www.fromuthtennis.com/frm/' + link;
            extended_links.push(extended_link);
        });
        console.log(extended_links);
    });
}
This does work unless you go to the second page of items like this:
var url = 'https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]';
var http = require('http-get');
var request = require("request");
var cheerio = require("cheerio"); // etc...
As far as I know, this happens because the content on the page is loaded dynamically.
To get the contents of the page, I believe I need to use PhantomJS, because that would allow me to get the HTML after the page has fully loaded, so I installed the phantomjs-node module. I want to use Node.js to get the URL list because the rest of my code is written in it.
I've been reading a lot about PhantomJS, but using phantomjs-node is tricky and I still don't understand how I could get the URL list with it, because I'm very new to JavaScript and coding in general.
If someone could guide me a little bit I'd appreciate it a lot.
Yes, you can. That page looks like it implements Google's AJAX crawling scheme.
Basically it allows websites to generate crawler-friendly content for Google. Whenever you see a URL like this:
https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]
You need to convert it to this:
https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx?_escaped_fragment_=Filter%3D%5Bpagenum%3D2*ava%3D1%5D
The conversion is simple: take the base path https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx and add a query parameter _escaped_fragment_ whose value is the URL fragment Filter=[pagenum=2*ava=1] encoded into Filter%3D%5Bpagenum%3D2*ava%3D1%5D using standard URI encoding.
You can read the full specification here: https://developers.google.com/webmasters/ajax-crawling/docs/specification
Note: This does not apply to all websites, only websites that implement Google's AJAX crawling scheme. But you're in luck in this case.
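The conversion can be expressed as a small helper function. A minimal sketch (the function name is just for illustration, and it assumes the base URL has no existing query string):
// Turn a hash-fragment URL into its _escaped_fragment_ equivalent,
// as described above.
function toEscapedFragmentURL(url) {
    var hashIndex = url.indexOf('#');
    if (hashIndex === -1) return url; // nothing to convert
    var base = url.substring(0, hashIndex);
    var fragment = url.substring(hashIndex + 1);
    return base + '?_escaped_fragment_=' + encodeURIComponent(fragment);
}

// toEscapedFragmentURL('https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]')
// -> 'https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx?_escaped_fragment_=Filter%3D%5Bpagenum%3D2*ava%3D1%5D'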
You can see any product you want without relying on dynamic content by using this URL:
https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID={product_id}
For example to see product 37023:
https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID=37023
All you have to do is for (var productid = 0; productid < 40000; productid++) { request... }.
Another approach is to use the phantom module (https://www.npmjs.com/package/phantom). It will let you run PhantomJS directly from your Node.js app.
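A rough sketch of that approach with the phantom module (version 4 style promise API; the #productResults a selector is taken from the question and may need adjusting, and you may still need to wait for the AJAX content to finish loading):
// Load the dynamically rendered page in headless PhantomJS and pull the
// product links out of the live DOM.
var phantom = require('phantom');

async function getDynamicURLS(url) {
    var instance = await phantom.create();
    var page = await instance.createPage();
    await page.open(url);
    var links = await page.evaluate(function () {
        // Runs inside the headless browser, after the page's own scripts ran.
        var anchors = document.querySelectorAll('#productResults a');
        return Array.prototype.map.call(anchors, function (a) { return a.href; });
    });
    await instance.exit();
    return links;
}

getDynamicURLS('https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]')
    .then(function (links) { console.log(links); });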

How to look up URL names in Javascript

How can you use Javascript to parse out the URL of a page?
First of all, you have to decide whether you want to do this on the client or on the server. On the server, you can load the XML and use XPath to locate the part of the XML DOM tree that contains the site:
//site/name[text() = 'Blah00']
When using JavaScript on the client, a better solution would be to have a server which keeps the current status per site (use a database or some in-memory structure). Then use AJAX requests to ask the server for the information for a certain site. jQuery will make your life much easier.
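A minimal sketch of that client-side approach with jQuery (the /site-status endpoint and its JSON response are assumptions for illustration):
// Ask a hypothetical /site-status endpoint for one site's current info.
$.getJSON('/site-status', { site: 'Blah00' }, function (data) {
    // 'data' is whatever JSON the server returns for that site.
    console.log(data);
});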
I have solved this:
<script>
function mySiteURL() {
    var myURL = window.location.href;
    var dashIndex = myURL.lastIndexOf("-");
    var dotIndex = myURL.lastIndexOf(".");
    var result = myURL.substring(dashIndex + 1, dotIndex);
    return result;
}
</script>
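For example, with a hypothetical page URL such as http://example.com/status/site-Blah00.html, the function returns the text between the last "-" and the last ".":
// With window.location.href === "http://example.com/status/site-Blah00.html"
// (a hypothetical URL), mySiteURL() returns "Blah00".
var siteName = mySiteURL();
console.log(siteName);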

How to scrape page links from dynamic web page using PHP?

I'd like to scrape the dynamically created URLs in this web page's menu using PHP:
http://groceries.iceland.co.uk/
I have previously used something like this:
<?php
$baseurls = array("http://groceries.iceland.co.uk/");
foreach ($baseurls as $source)
{
    $html = file_get_contents($source);
    $start = strpos($html, '<nav id="mainNavigation"');
    $end = strpos($html, '</nav>', $start);
    $mainarea = substr($html, $start, $end - $start);
    $dom = new DOMDocument();
    @$dom->loadHTML($mainarea);
    // grab all the urls on the page
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0; $i < $hrefs->length; $i++)
    {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href');
    }
}
?>
but it's not doing the job for this particular page. For example, my code returns a url such as:
groceries.iceland.co.uk//frozen-chips-and-potato-products
but I want it to give me:
groceries.iceland.co.uk//frozen/chips-and-potato-products/c/FRZCAP?q=:relevance&view=list
The browser adds "/c/FRZCAP?q=:relevance&view=list" to the end and this is what I want.
Hope you can help
Thanks
Edit: Just to confirm, I had a look at the website you're trying to scrape with JavaScript turned off, and it appears that the main navigation URLs are generated using JavaScript, so you will be unable to scrape the page without using a headless browser.
Per @Sam and @halfer's comments, if you need to scrape a site that has dynamic URLs generated by JavaScript, then you will need to use a scraper that supports JavaScript.
If you want to do the bulk of your development in PHP, then I recommend not trying to use a headless browser via PHP and instead relying on a service that can scrape a JavaScript rendered page and return the contents for you.
The best one that I've found, and one that we use in our projects, is https://phantomjscloud.com/

How to get hyperlink IDs in another HTML file

I am doing a project that requires a web site. On this site I have to draw a state diagram for hyperlinks, that is, how the hyperlinks are attached to one another on the site. I am using HTML. How do I get a hyperlink's ID in another HTML file? I know about document.getElementById.
Thanks in advance
That would require a way to access another HTML file through AJAX, which is not possible if it isn't on your domain or if CORS isn't enabled.
There are, however, quite a few things you could do:
Use your own server-side as proxy for fetching the HTML file.
Do the processing on the server-side and let JavaScript plot the data.
Do everything on the server-side.
If you'd like to get the IDs of the links, you should use an HTML parser. Modern browsers include one, called DOMParser. You'd do something like this:
var parser = new DOMParser();
var doc = parser.parseFromString(yourHTMLSource, 'text/html');
var links = doc.getElementsByTagName('a');
for (var i = 0, length = links.length; i < length; i++) {
    links[i].getAttribute('id'); // -> Returns the ID of the link, if any
}
As I remember it, IE doesn't support this, but it has its own mechanism for HTML parsing with some different methods, which is still relatively easy to use.
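To obtain yourHTMLSource in the first place (for a file on your own domain, or one served with CORS enabled), a minimal sketch using fetch; other.html is just a placeholder path:
// Fetch another HTML file from the same origin and collect the IDs of its links.
fetch('other.html')
    .then(function (response) { return response.text(); })
    .then(function (htmlSource) {
        var doc = new DOMParser().parseFromString(htmlSource, 'text/html');
        var links = doc.getElementsByTagName('a');
        var ids = [];
        for (var i = 0; i < links.length; i++) {
            if (links[i].id) ids.push(links[i].id);
        }
        console.log(ids); // IDs of all hyperlinks in other.html
    });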
