Is it possible to write a web crawler in JavaScript?

I want to crawl a page, find the hyperlinks on it, follow those hyperlinks, and capture data from the pages they lead to.

Generally, browser JavaScript can only crawl within the domain of its origin, because fetching pages would be done via Ajax, which is restricted by the Same-Origin Policy.
If the page running the crawler script is on www.example.com, then that script can crawl all the pages on www.example.com, but not the pages of any other origin (unless some edge case applies, e.g., the Access-Control-Allow-Origin header is set for pages on the other server).
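As a rough illustration of the restriction described above, a browser-side crawler could filter the links it finds down to those sharing its own origin before trying to fetch them. This is only a sketch: the helper name `sameOriginLinks` and the sample URLs are assumptions made for the example, and it relies on nothing beyond the standard `URL` class.

```javascript
// Keep only the links that share the origin (scheme + host + port) of the
// page running the crawler. `sameOriginLinks` is a hypothetical helper name.
function sameOriginLinks(pageUrl, hrefs) {
    const origin = new URL(pageUrl).origin;
    const result = [];
    for (const href of hrefs) {
        let resolved;
        try {
            // Resolve relative links against the current page.
            resolved = new URL(href, pageUrl);
        } catch (e) {
            continue; // skip malformed links
        }
        if (resolved.origin === origin) {
            result.push(resolved.href);
        }
    }
    return result;
}

const links = sameOriginLinks('http://www.example.com/index.html', [
    '/about',
    'products/1',
    'http://other.example.org/page',
    'https://www.example.com/secure'
]);
console.log(links);
```

Note that the `https:` link is dropped as well: a different scheme counts as a different origin, even on the same host.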
If you really want to write a fully-featured crawler in browser JS, you could write a browser extension: for example, Chrome extensions are packaged web applications that run with special permissions, including cross-origin Ajax. The difficulty with this approach is that you'll have to write multiple versions of the crawler if you want to support multiple browsers. (If the crawler is just for personal use, that's probably not an issue.)

It is possible if you use server-side JavaScript.
You should take a look at Node.js.
An example of a crawler can be found at the link below:
http://www.colourcoding.net/blog/archive/2010/11/20/a-node.js-web-spider.aspx
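To give a feel for what such a server-side crawler involves, here is a minimal sketch that is not taken from the linked article. It walks a site breadth-first, stays on the starting origin, and keeps the HTML of each page. The function names `extractLinks` and `crawl`, the `maxPages` limit, and the regex-based link extraction are all assumptions made for illustration; a real HTML parser would be more robust than the regular expression used here.

```javascript
// Extract href values from anchor tags with a simple regular expression.
// This is a simplification; a real HTML parser is more robust.
function extractLinks(html, baseUrl) {
    const links = [];
    const pattern = /<a\s[^>]*href\s*=\s*["']([^"']+)["']/gi;
    let match;
    while ((match = pattern.exec(html)) !== null) {
        try {
            const url = new URL(match[1], baseUrl);
            url.hash = ''; // ignore fragments so the same page is not queued twice
            links.push(url.href);
        } catch (e) {
            // skip hrefs that cannot be parsed as URLs
        }
    }
    return links;
}

// Breadth-first crawl starting at startUrl. `fetchPage` is any async function
// that takes a URL and resolves to the page's HTML (hypothetical parameter).
async function crawl(startUrl, fetchPage, maxPages = 50) {
    const origin = new URL(startUrl).origin;
    const visited = new Set();
    const queue = [startUrl];
    const pages = {};
    while (queue.length > 0 && visited.size < maxPages) {
        const url = queue.shift();
        if (visited.has(url)) continue;
        visited.add(url);
        let html;
        try {
            html = await fetchPage(url);
        } catch (e) {
            continue; // skip pages that fail to load
        }
        pages[url] = html; // "capture data" step: here we simply keep the HTML
        for (const link of extractLinks(html, url)) {
            // stay on the starting site and avoid revisiting pages
            if (new URL(link).origin === origin && !visited.has(link)) {
                queue.push(link);
            }
        }
    }
    return pages;
}
```

On Node 18 or later, which ships a global `fetch`, this could be started with something like `crawl('http://www.example.com/', (url) => fetch(url).then((res) => res.text()))`. Passing the fetch function in as a parameter keeps the crawl logic independent of the HTTP client.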

Google's Chrome team released Puppeteer in August 2017, a Node library that provides a high-level API for both headless and non-headless Chrome (headless Chrome has been available since version 59).
It uses a bundled version of Chromium, so it should work out of the box. If you want to use a specific Chrome version, you can do so by launching Puppeteer with an executable path as a parameter, such as:
const browser = await puppeteer.launch({executablePath: '/path/to/Chrome'});
An example of navigating to a webpage and taking a screenshot out of it shows how simple it is (taken from the GitHub page):
const puppeteer = require('puppeteer');
(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');
    await page.screenshot({path: 'example.png'});
    await browser.close();
})();

We can crawl pages with server-side JavaScript with the help of headless WebKit. For crawling, there are a few libraries such as PhantomJS and CasperJS; there is also a newer wrapper around PhantomJS called Nightmare JS that makes the work easier.

There are ways to circumvent the same-origin policy with JS. I wrote a crawler for Facebook that gathered information from the profiles of my friends and my friends' friends and allowed filtering the results by gender, current location, age, marital status (you catch my drift). It was simple. I just ran it from the console. That way your script gets the privilege to make requests on the current domain. You can also make a bookmarklet to run the script from your bookmarks.
Another way is to provide a PHP proxy. Your script accesses the proxy on the current domain, and the proxy requests files from another domain with PHP. Just be careful with those: they might get hijacked and used as a public proxy by a third party if you are not careful.
Good luck, maybe you'll make a friend or two in the process like I did :-)

My typical setup is to use a browser extension with cross-origin privileges set, which injects both the crawler code and jQuery.
Another take on JavaScript crawlers is to use a headless browser like PhantomJS or CasperJS (which boosts Phantom's powers).

This is what you need http://zugravu.com/products/web-crawler-spider-scraping-javascript-regular-expression-nodejs-mongodb
They use NodeJS, MongoDB and ExtJs as GUI

Yes, it is possible.
Use Node.js (it's server-side JS).
Node.js comes with npm, a package manager that handles third-party modules.
Use PhantomJS from Node.js (PhantomJS is a third-party module that can crawl through websites).

There is a client-side approach for this, using the Firefox Greasemonkey extension. With Greasemonkey you can create scripts that are executed each time you open specified URLs.
Here is an example.
If you have URLs like these:
http://www.example.com/products/pages/1
http://www.example.com/products/pages/2
then you can use something like this to open all the pages containing a product list (execute this manually):
// open product list pages 1 to 4, one every 15 seconds
for (let i = 1; i < 5; i++) {
    setTimeout(function () {
        window.open('http://www.example.com/products/pages/' + i, '_blank');
    }, 15000 * i);
}
Then you can create a script that opens every product in a new window for each product list page, and include this URL pattern in Greasemonkey for it:
http://www.example.com/products/pages/*
Then add a script for each product page that extracts the data, calls a web service passing the data, closes the window, and so on.

I made an example JavaScript crawler on GitHub.
It's event-driven and uses an in-memory queue to store all the resources (i.e. URLs).
How to use it in your Node environment:
var Crawler = require('../lib/crawler')
var crawler = new Crawler('http://www.someUrl.com');
// crawler.maxDepth = 4;
// crawler.crawlInterval = 10;
// crawler.maxListenerCurrency = 10;
// crawler.redisQueue = true;
crawler.start();
Here I'm just showing you the two core methods of a JavaScript crawler.
Crawler.prototype.run = function() {
    var crawler = this;
    process.nextTick(() => {
        //the run loop
        crawler.crawlerIntervalId = setInterval(() => {
            crawler.crawl();
        }, crawler.crawlInterval);
        //kick off first one
        crawler.crawl();
    });
    crawler.running = true;
    crawler.emit('start');
}
Crawler.prototype.crawl = function() {
    var crawler = this;
    if (crawler._openRequests >= crawler.maxListenerCurrency) return;
    //go get the item
    crawler.queue.oldestUnfetchedItem((err, queueItem, index) => {
        if (queueItem) {
            //got the item start the fetch
            crawler.fetchQueueItem(queueItem, index);
        } else if (crawler._openRequests === 0) {
            crawler.queue.complete((err, completeCount) => {
                if (err)
                    throw err;
                crawler.queue.getLength((err, length) => {
                    if (err)
                        throw err;
                    if (length === completeCount) {
                        //no open Request, no unfetcheditem stop the crawler
                        crawler.emit("complete", completeCount);
                        clearInterval(crawler.crawlerIntervalId);
                        crawler.running = false;
                    }
                });
            });
        }
    });
};
Here is the GitHub link: https://github.com/bfwg/node-tinycrawler.
It is a JavaScript web crawler written in under 1000 lines of code.
This should put you on the right track.

You can make a web crawler driven by a remote JSON file that opens every link on a page in a new tab as soon as each tab loads, skipping links that have already been opened. If you set this up as a browser extension running in a bare-bones browser (nothing running except the web browser and an internet configuration program) and had the machine shipped and installed somewhere with good internet, you could build a database of web pages with an old computer. The extension would just need to retrieve the content of each tab. You could do that for about $2000, contrary to most estimates of search engine costs. Your algorithm would basically need to rank pages by how often a term appears in the page's innerText property, keywords, and description. You could also set up another PC to recrawl old pages from the one-time database and add more. I'd estimate it would take about 3 months and $20000, maximum.

Axios + Cheerio
You can do this with axios and cheerio. Check the axios docs for the response format.
const cheerio = require('cheerio');
const axios = require('axios');
// crawl
// get url
var url = 'http://amazon.com';
axios.get(url)
    .then((res) => {
        // response format
        var body = res.data;
        var statusCode = res.status;
        var statusText = res.statusText;
        var headers = res.headers;
        var request = res.request;
        var config = res.config;
        // load the HTML into a jQuery-like API
        let $ = cheerio.load(body);
        // example: meta tags
        var title = $('meta[name=title]').attr('content');
        // fall back to the <title> element when there is no title meta tag
        if (title === undefined || title === 'undefined') {
            title = $('title').text();
        }
        var description = $('meta[name=description]').attr('content');
        var keywords = $('meta[name=keywords]').attr('content');
        var author = $('meta[name=author]').attr('content');
        var type = $('meta[http-equiv=content-type]').attr('content');
        var favicon = $('link[rel="shortcut icon"]').attr('href');
    }).catch(function (e) {
        console.log(e);
    });
Node-Fetch + Cheerio
You can do the same thing with node-fetch and cheerio.
// node-fetch v2 supports require(); v3 is ESM-only and needs import instead
const fetch = require('node-fetch');
const cheerio = require('cheerio');
var url = 'http://amazon.com';
fetch(url, {
    method: "GET",
}).then(function (response) {
    // response.text() returns a promise that resolves to the HTML
    return response.text();
}).then(function (html) {
    // load the HTML into a jQuery-like API
    let $ = cheerio.load(html);
    // meta tags
    var title = $('meta[name=title]').attr('content');
    // fall back to the <title> element when there is no title meta tag
    if (title === undefined || title === 'undefined') {
        title = $('title').text();
    }
    var description = $('meta[name=description]').attr('content');
    var keywords = $('meta[name=keywords]').attr('content');
    var author = $('meta[name=author]').attr('content');
    var type = $('meta[http-equiv=content-type]').attr('content');
    var favicon = $('link[rel="shortcut icon"]').attr('href');
}).catch((error) => {
    console.error('Error:', error);
});

Related

How to create an app that opens external URL then scroll the page down?

I want to create an app using Node.js and gulp that opens a specific URL and then scrolls the page down to the end. Is that possible?
Here is my code inside gulpfile.js
const {series, src, dest} = require('gulp');
var fs = require('fs');
var path = require('path');
var open = require('open');
var https = require('https');
async function getLinks(params) {
    var pageLink = 'https://youtube.com';
    var links = [];
    open(pageLink, {app: 'chrome'});
    https.get(pageLink, (res) => {
        let rawHtml = '';
        res.on('data', (chunk) => { rawHtml += chunk; });
        res.on('end', () => {
            try {
                console.log(rawHtml);
            } catch (e) {
                console.error(e.message);
            }
        });
    });
}
exports.default = getLinks;
I would be thankful for some help!
I do not believe that exactly what you are looking for is currently possible without extensions or some external program, but there are some alternatives that may help you accomplish part of what you are aiming to do.
IDs
If the page you are linking to has IDs on the elements you want to link to, you can append the ID to the end of the URL. For example, if you wanted to link to an element with the ID pricing, your link would look like this:
https://example.com#pricing
Obviously this is only useful on some pages, and only with elements that have IDs.
Text Fragments
These are slightly closer to what you may be looking for in that it allows you to link to anywhere on a page, regardless of whether the element has an ID or not. Here is an example of how you would link to the More information... text on example.com:
http://example.com/#:~:text=more%20information...
Unfortunately this still has some cons, chiefly in the browser support arena. According to caniuse, only Chromium browsers currently support the feature. (That is 71% of users though.)
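Both kinds of link described above can be built programmatically. The following is a small sketch, not part of any library: the helper names `linkToId` and `linkToText` are made up for this example, and the extra escaping of "-" reflects my reading of the text fragment syntax, where a dash separates the optional prefix and suffix.

```javascript
// Build a link to an element ID, e.g. https://example.com/#pricing
function linkToId(pageUrl, id) {
    const url = new URL(pageUrl);
    url.hash = id;
    return url.href;
}

// Build a text-fragment link. encodeURIComponent leaves "-" unescaped, but
// "-" separates prefix/suffix in the text directive syntax, so escape it too.
function linkToText(pageUrl, text) {
    const url = new URL(pageUrl);
    url.hash = ''; // drop any existing fragment
    const encoded = encodeURIComponent(text).replace(/-/g, '%2D');
    return url.href + '#:~:text=' + encoded;
}

console.log(linkToId('https://example.com', 'pricing'));
console.log(linkToText('http://example.com/', 'more information...'));
```

Opening either link from Node (for example with the open package used in the question) would make the browser scroll to the target, provided the page has the ID or the browser supports text fragments.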

Can I scrape this site using just Node?

I'm very new to JavaScript, so be patient.
I've been trying to scrape a site and get all the product URLs into a list that I will use later in another function, like this:
url='https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx'
var http = require('http-get');
var request = require("request");
var cheerio = require("cheerio");
function getURLS(url) {
    request(url, function(err, resp, body){
        var linklist = [];
        $ = cheerio.load(body);
        var links = $('#productResults a');
        for(valor in links) {
            if(links[valor].attribs && links[valor].attribs.href && linklist.indexOf(links[valor].attribs.href) == -1){
                linklist.push(links[valor].attribs.href);
            }
        }
        var extended_links = [];
        linklist.forEach(function(link){
            extended_link = 'https://www.fromuthtennis.com/frm/' + link;
            extended_links.push(extended_link);
        })
        console.log(extended_links);
    })
};
This does work unless you go to the second page of items like this:
url='https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]'
var http = require('http-get');
var request = require("request");
var cheerio = require("cheerio"); //etc...
As far as I know, this happens because the content on the page is loaded dynamically.
To get the contents of the page, I believe I need to use PhantomJS because that would allow me to get the HTML after the page has fully loaded, so I installed the phantomjs-node module. I want to use Node.js to get the URL list because the rest of my code is written in it.
I've been reading a lot about PhantomJS, but using phantomjs-node is tricky and I still don't understand how I could get the URL list with it, because I'm very new to JavaScript and to coding in general.
If someone could guide me a little bit, I'd appreciate it a lot.
Yes, you can. That page looks like it implements Google's Ajax Crawling URL.
Basically it allows websites to generate crawler friendly content for Google. Whenever you see a URL like this:
https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]
You need to convert it to this:
https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx?_escaped_fragment_=Filter%3D%5Bpagenum%3D2*ava%3D1%5D
The conversion is simple: take the base path, https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx, and add a query parameter _escaped_fragment_ whose value is the URL fragment Filter=[pagenum=2*ava=1], encoded as Filter%3D%5Bpagenum%3D2*ava%3D1%5D using standard URI encoding.
You can read the full specification here: https://developers.google.com/webmasters/ajax-crawling/docs/specification
Note: This does not apply to all websites, only to websites that implement Google's Ajax Crawling URL scheme. But you're in luck in this case.
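The conversion described above can be sketched as a small function. The name `toEscapedFragmentUrl` is made up for this example; stripping a leading "!" is an assumption that covers the "#!" form used in the specification as well as the plain "#" form seen on this site.

```javascript
// Convert a fragment-based Ajax-crawling URL into its _escaped_fragment_
// equivalent. `toEscapedFragmentUrl` is a hypothetical helper name.
function toEscapedFragmentUrl(url) {
    const hashIndex = url.indexOf('#');
    if (hashIndex === -1) return url; // nothing to convert
    const base = url.slice(0, hashIndex);
    let fragment = url.slice(hashIndex + 1);
    // The specification uses "#!" fragments; drop the "!" if present.
    if (fragment.startsWith('!')) fragment = fragment.slice(1);
    // Append to an existing query string if the base already has one.
    const separator = base.includes('?') ? '&' : '?';
    return base + separator + '_escaped_fragment_=' + encodeURIComponent(fragment);
}

const converted = toEscapedFragmentUrl(
    'https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]'
);
console.log(converted);
```

The resulting URL can then be passed to request and cheerio exactly like the first page, with no headless browser involved.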
You can see any product you want without relying on dynamic content by using this URL:
https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID={product_id}
For example to see product 37023:
https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID=37023
All you have to do is for(var productid=0;productid<40000;productid++) {request...}.
Another approach is to use the phantom module (https://www.npmjs.com/package/phantom). It lets you run PhantomJS commands directly from your Node.js app.

Scraping authenticated website in node.js

I want to scrape my college website (moodle) with node.js but I haven't found a headless browser able to do it. I have done it in python in just 10 lines of code using RoboBrowser:
from robobrowser import RoboBrowser
url = "https://cas.upc.edu/login?service=https%3A%2F%2Fatenea.upc.edu%2Fmoodle%2Flogin%2Findex.php%3FauthCAS%3DCAS"
browser = RoboBrowser()
browser.open(url)
form = browser.get_form()
form['username'] = 'myUserName'
form['password'] = 'myPassword'
browser.submit_form(form)
browser.open("http://atenea.upc.edu/moodle/")
print browser.parsed
The problem is that the website requires authentication. Can you help me? Thanks!
P.S.: I think this could be useful, https://www.npmjs.com/package/form-scraper, but I can't get it working.
Assuming you want to read a 3rd party website, and 'scrape' particular pieces of information, you could use a library such as cheerio to achieve this in Node.
Cheerio is a "lean implementation of core jQuery designed specifically for the server". This means that given a String representation of a DOM (or part thereof), cheerio can traverse it in much the same way as jQuery can.
An example from Max Ogden shows how you can use the request module to grab HTML from a remote server and then pass it to cheerio:
var $ = require('cheerio')
var request = require('request')
function gotHTML(err, resp, html) {
    if (err) return console.error(err)
    var parsedHTML = $.load(html)
    // get all links and keep the ones that point to .png images
    var imageURLs = []
    parsedHTML('a').map(function(i, link) {
        var href = parsedHTML(link).attr('href')
        // skip anchors without an href and links that are not .png files
        if (!href || !href.match(/\.png$/)) return
        imageURLs.push(domain + href)
    })
}
var domain = 'http://substack.net/images/'
request(domain, gotHTML)
Selenium has support for multiple languages and multiple platforms and multiple browsers.

Real-time basic web analytics with Javascript

I need to develop an in-house real-time analytics solution (similar to GA or mixpanel for example) that collects:
Information from the website itself (URL)
Information from the user's browser (lang, device, OS etc.)
Information from the referring source etc.
... and sends this data to the server with a single-pixel image request. Similar to how GA and other solutions work:
Google Analytics works by the inclusion of a block of JavaScript code
on pages in your website. When users to your website view a page, this
JavaScript code references a JavaScript file which then executes the
tracking operation for Analytics. The tracking operation retrieves
data about the page request through various means and sends this
information to the Analytics server via a list of parameters attached
to a single-pixel image request.
I wonder if there's any open source project available that does this part which I could use as a base to build on. There's Piwik, but it's too feature-packed and too heavy for my requirements.
Edited to add: I'm doing something specific with the data, otherwise I'd just use the existing solutions.
Try
var img = new Image();
img.width = img.height = 1;
var nav = window.navigator;
var data = {};
var _plugins = {};
Array.prototype.slice.call(nav.plugins).forEach(function(v, k) {
    _plugins[v.name.toLowerCase().replace(/\s/, "-")] = {
        "name": v.name,
        "description": v.description,
        "filename": v.filename
    }
});
data.url = window.location.href;
data.ref = document.referrer;
// navigator's properties live on its prototype, so JSON.stringify(navigator)
// would yield "{}"; copy the needed fields into a plain object instead
data.nav = {
    language: nav.language,
    platform: nav.platform,
    userAgent: nav.userAgent
};
data._plugins = _plugins;
// set `img` `dataset` with `data` ,
// send `img` to server , decode `img` `dataset` at server
img.dataset.stats = JSON.stringify(data);
document.write(
    img.dataset.stats
);
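The snippet above collects the data but does not show the single-pixel image request from the question. One way that step might look is sketched below; the endpoint URL, the field names, and the helper names `buildPixelUrl` and `track` are all hypothetical, and the server is assumed to log the query parameters and respond with a 1x1 image.

```javascript
// Build the URL of a single-pixel image request carrying the collected data
// as query parameters. The endpoint and field names are hypothetical.
function buildPixelUrl(endpoint, data) {
    const params = new URLSearchParams();
    for (const [key, value] of Object.entries(data)) {
        // leave out empty fields to keep the request short
        if (value !== undefined && value !== null && value !== '') {
            params.append(key, String(value));
        }
    }
    return endpoint + '?' + params.toString();
}

// In a browser, the data could be gathered and sent like this.
function track(endpoint) {
    const data = {
        url: window.location.href,
        ref: document.referrer,
        lang: navigator.language,
        ua: navigator.userAgent
    };
    // Assigning src triggers the GET request for the image.
    new Image(1, 1).src = buildPixelUrl(endpoint, data);
}
```

A page would then call something like `track('https://stats.example.com/pixel.gif')` once it has loaded. Keeping the URL construction in its own function makes it easy to check outside a browser.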
There are two big open source analytics solutions.
Piwik, as you mentioned, is a well-documented and pretty mature solution. Digging into the code to see how Piwik does things will give you some insights.
Open Web Analytics is the other big player. It is a simpler tool that will help you understand how basic tracking is done.
Depending on the data you want to track, I would also suggest taking a look at this tutorial, which uses sockets to track real-time data.
Last but not least, you can also check what Crazy Egg does if you want to track user interactivity.

Determining session ID from WebDriverJS

I'm trying to run WebDriverJS on the browser, but the documentation is somewhat vague on how to get it controlling the host browser. Here, it says:
Launching a browser to run a WebDriver test against another browser is a tad redundant
(compared to simply using node). Instead, using WebDriverJS in the browser is intended for
automating the browser actually running the script. This can be accomplished as long as the URL for the server and session ID for the browser are known. While these values may be
passed to the builder directly, they may also be defined using the wdurl and wdsid
"environment variables", which are parsed from the loading page's URL query data:
<!-- Assuming HTML URL is /test.html?wdurl=http://localhost:4444/wd/hub&wdsid=foo1234 -->
<!DOCTYPE html>
<script src="webdriver.js"></script>
<input id="input" type="text"/>
<script>
// Attaches to the server and session controlling this browser.
var driver = new webdriver.Builder().build();
var input = driver.findElement(webdriver.By.tagName('input'));
input.sendKeys('foo bar baz').then(function() {
assertEquals('foo bar baz',
document.getElementById('input').value);
});
</script>
I want to open up my test page from Node.js, and then run the commands included in the client-side script. However, I don't know how I would be able to extract the session ID (wdsid query parameter) when I build the session. Does anyone have any idea?
Finally figured it out after a lot of experimentation and reading through the WebDriverJS source code.
var webdriver = require('./assets/webdriver');
var driver = new webdriver.Builder().
    usingServer('http://localhost:4444/wd/hub').
    withCapabilities({
        'browserName': 'chrome',
        'version': '',
        'platform': 'ANY',
        'javascriptEnabled': true
    }).
    build();
var testUrl = 'http://localhost:3000/test',
    hubUrl = 'http://localhost:4444/wd/hub',
    sessionId;
driver.session_.then(function(sessionData) {
    sessionId = sessionData.id;
    driver.get(testUrl + '?wdurl=' + hubUrl + '&wdsid=' + sessionId);
});
driver.session_ returns a Promise object which will contain the session data and other information upon instantiation. Using .then(callback(sessionData)) will allow you to manipulate the data as you wish.
My Selenium version: 4.1.0
Get the session ID with await:
const get_session_id = async (SeleniumDriver) => {
    const res1 = await SeleniumDriver.getSession();
    return res1.getId();
}
Call it in another function:
await get_session_id(SeleniumDriver);
