I am web scraping using Node.js/TypeScript.
I have a problem with Puppeteer: I get the fully rendered page (which I verify by running await page.content()). I printed the content and found that it contains 26 'a' tags (links). However, when I query them with Puppeteer, I only get 20.
What is stranger is that sometimes I get all the 'a' tags on the page and sometimes I get fewer than are actually there, all without changing the code. It seems to be somewhat random.
I've seen some suggestions online saying to use a waitForSelector method or something along those lines: basically, before searching for tags, it ensures an element is on the page (I've sketched that approach after my snippet below). I don't think this would help in my case, because Puppeteer clearly already has everything it needs, as shown by the await page.content() call.
Does anyone know why this may be happening? Thanks! A simplified snippet of my code is below.
const getLinksFromPage = async (
browser: puppeteer.Browser,
url: string
) => {
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle0' });
const html = await page.content(); // this code gets the content and prints it
console.log(html); // so I can verify number of 'a' tags
const rawLinks = await page.$$eval('a', (elements: Element[]) => {
return elements
.map((element: Element) => element.getAttribute('href')!)
});
await page.close();
return rawLinks
};
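For reference, the wait-based approach I've seen suggested would look roughly like this (just a sketch; the selector and the expected link count are placeholders, and I haven't verified that it solves the problem):

const getLinksFromPageWithWait = async (
  browser: puppeteer.Browser,
  url: string
) => {
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  // wait until at least one link is attached before querying
  await page.waitForSelector('a');
  // or, more strictly, wait until the link count reaches the expected number
  await page.waitForFunction(
    (min: number) => document.querySelectorAll('a').length >= min,
    {},
    26 // placeholder: the number of links I expect on this page
  );
  const rawLinks = await page.$$eval('a', (elements: Element[]) =>
    elements.map((element: Element) => element.getAttribute('href')!)
  );
  await page.close();
  return rawLinks;
};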
I have a GitHub Pages project that I am trying to create. I've got it working great locally, but of course when I publish it, it fails.
The problem is in this bit of JavaScript, which is supposed to pull some data from a JSON file in the repo to build the contents of a certain page:
(async function(){
const response = await fetch(`https://GITUSER.github.io/GITREPO/tree/gh-pages/data/file.json`);//Error gets thrown here, because the asset does not exist in the current code state.
const docData = await response.json();
const contentTarget = document.getElementById('doc-target');
const tocTarget = document.getElementById('toc-target')
createContent(tocTarget,contentTarget,docData);
})();
Now, the problem is that Pages won't load the asset, because it doesn't know it needs it until the function is called. Is there a way to get this asset served by Pages so it can be requested with the Fetch API? Or is this beyond the capabilities of GitHub Pages?
Edited: Added some additional code for context.
Try using raw.githubusercontent.com like this:
(async function(){
const response = await fetch('https://raw.githubusercontent.com/{username}/{repo}/{branch}/{file}')
const docData = await response.json();
const contentTarget = document.getElementById('doc-target');
const tocTarget = document.getElementById('toc-target')
createContent(tocTarget,contentTarget,docData);
})();
And it should work.
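Alternatively, since the JSON file lives in the same repo that Pages publishes, you should be able to fetch it by its path on the published site instead of a github.com /tree/ URL (a sketch, assuming data/file.json is committed on the branch Pages deploys from):

(async function(){
    // Project sites are served under /GITREPO/, so the file is available at
    // https://GITUSER.github.io/GITREPO/data/file.json
    const response = await fetch('/GITREPO/data/file.json');
    const docData = await response.json();
    const contentTarget = document.getElementById('doc-target');
    const tocTarget = document.getElementById('toc-target');
    createContent(tocTarget, contentTarget, docData);
})();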
I am using Puppeteer (a headless-browser library) to scrape a website. Once I access the URL, how can I load jQuery so that I can use it inside my page.evaluate() function?
All I have right now is a .js file running the code below. It navigates to my URL as intended, but then I get an error in page.evaluate(), since it seems that jQuery isn't being loaded the way I expected by the addScriptTag call: await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'})
Any ideas how I can load jQuery correctly here, so that I can use it inside my page.evaluate() function?
const puppeteer = require('puppeteer');

(async() => {
let url = "[website url I'm scraping]"
let browser = await puppeteer.launch({headless:false});
let page = await browser.newPage();
await page.goto(url, {waitUntil: 'networkidle2'});
// code below doesn't seem to load jQuery, since I get an error in page.evaluate()
await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'})
await page.evaluate( () => {
// want to use jQuery here to do access DOM
var classes = $( "td:contains('Lec')")
classes = classes.not('.Comments')
classes = classes.not('.Pct100')
classes = Array.from(classes)
});
})();
You are on the right path.
Also I don't see any jQuery code being used in your evaluate function.
There is no document.getElement function.
The best way would be to add a local copy of jQuery to avoid any cross-origin errors.
More details can be found in the already answered question here.
UPDATE: I tried a small snippet to test jQuery. The Puppeteer version is 10.4.0.
(async () => {
const browser = await puppeteer.launch({headless:false});
const page = await browser.newPage();
await page.goto('https://google.com',{waitUntil: 'networkidle2'});
await page.addScriptTag({path: "jquery.js"})
await page.evaluate( () => {
let wrapper = $(".L3eUgb");
wrapper.css("background-color","red");
})
await page.screenshot({path:"hello.png"});
await browser.close();
})();
The screenshot (hello.png) shows the Google homepage with the wrapper's background turned red, so the jQuery code is definitely working.
Also check whether the host website already has its own jQuery instance. In that case you would need to use jQuery's noConflict:
$.noConflict();
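For example, something along these lines (a rough sketch; the selectors are taken from the question, and the page is assumed to already ship its own jQuery before yours is injected):

await page.addScriptTag({path: "jquery.js"});
await page.evaluate(() => {
    // noConflict(true) hands $ and jQuery back to the page's own copy
    // and returns a private reference to the injected one
    const jq = jQuery.noConflict(true);
    const cells = jq("td:contains('Lec')").not('.Comments').not('.Pct100');
    return cells.length;
});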
Fixed it!
I realized I had forgotten to include the code where I do some extra navigation clicks after going to my initial URL; the problem was that I was adding the script tag on the initial URL instead of after navigating to my final destination URL.
I also needed to use
await page.waitForNavigation({waitUntil: 'networkidle2'})
before adding the script tag, so that the page had fully loaded first.
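Roughly, the working order ended up looking like this (a sketch with placeholder selectors rather than my exact code):

await page.goto(url, {waitUntil: 'networkidle2'});
// extra navigation clicks to reach the final destination page (placeholder selector)
await page.click('#some-nav-link');
await page.waitForNavigation({waitUntil: 'networkidle2'});
// only now inject jQuery, on the page I actually want to scrape
await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'});
await page.evaluate(() => {
    let classes = $("td:contains('Lec')").not('.Comments').not('.Pct100');
    return Array.from(classes).length;
});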
I am trying to scrape a specific string from the web page below:
https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;
The info I want to get from this web page's source is the serial number in the strings below (something I can find by right-clicking -> "View page source"):
name="nr_rooms_4377601_232287150_0_1_0"/ name="nr_rooms_4377601_232287150_1_1_0"
I am using "puppeteer" and below is my code :
const puppeteer = require('puppeteer');
(async() => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
//await page.goto('https://example.com');
const response = await page.goto("My-url-above");
let bodyHTML = await page.evaluate(() => document.body.innerHTML);
let outbodyHTML = await page.evaluate(() => document.body.outerHTML);
console.log(await response.text());
console.log(await page.content());
await browser.close();
})()
But I cannot find the strings I am looking for in either response.text() or page.content().
Am I using the wrong methods on page?
How can I dump the actual page source of the web page, exactly the same as what I see when I right-click and view the source?
If you investigate where these strings appear, you can see that they are in <select> elements with a specific class (.hprt-nos-select):
<select
class="hprt-nos-select"
name="nr_rooms_4377601_232287150_0_1_0"
data-component="hotel/new-rooms-table/select-rooms"
data-room-id="4377601"
data-block-id="4377601_232287150_0_1_0"
data-is-fflex-selected="0"
id="hprt_nos_select_4377601_232287150_0_1_0"
aria-describedby="room_type_id_4377601 rate_price_id_4377601_232287150_0_1_0 rate_policies_id_4377601_232287150_0_1_0"
>
You should wait until this element is loaded into the DOM; then it will show up in page.content() as well:
await page.waitForSelector('.hprt-nos-select', { timeout: 0 });
BUT your issue actually lies in the fact that the URL you are visiting has some extra URL parameters: ?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl; which are not taken into account by Puppeteer (take a full-page screenshot and you will see that the page still shows the default hotel search form without the specific hotel offers, not the ones you are expecting).
You should interact with the search form with Puppeteer (page.click(), page.type(), etc.) to set the dates and the origin country yourself to achieve the expected page content.
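Something along these lines (only a sketch; the selectors are guesses and need to be checked against the live page, which changes often):

await page.goto('https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html', { waitUntil: 'networkidle2' });
// open the date picker and choose check-in / check-out (guessed selectors)
await page.click('.xp__dates');
await page.click('td[data-date="2020-09-19"]');
await page.click('td[data-date="2020-09-20"]');
// submit the availability search and wait for the room table to render
await page.click('.xp__button button[type="submit"]');
await page.waitForSelector('.hprt-nos-select', { timeout: 0 });
// collect the name attributes you are after
const names = await page.$$eval('.hprt-nos-select', els => els.map(el => el.getAttribute('name')));
console.log(names);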
I want to crawl a page, check for the hyperlinks on that page, follow those hyperlinks, and capture data from the resulting pages.
Generally, browser JavaScript can only crawl within the domain of its origin, because fetching pages would be done via Ajax, which is restricted by the Same-Origin Policy.
If the page running the crawler script is on www.example.com, then that script can crawl all the pages on www.example.com, but not the pages of any other origin (unless some edge case applies, e.g., the Access-Control-Allow-Origin header is set for pages on the other server).
If you really want to write a fully-featured crawler in browser JS, you could write a browser extension: for example, Chrome extensions are packaged web applications that run with special permissions, including cross-origin Ajax. The difficulty with this approach is that you'll have to write multiple versions of the crawler if you want to support multiple browsers. (If the crawler is just for personal use, that's probably not an issue.)
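For the same-origin case described above, a bare-bones crawl can be done entirely with fetch and DOMParser (a minimal sketch, not production code):

// Breadth-first crawl of same-origin pages, starting from a given URL.
async function crawlSameOrigin(startUrl, maxPages = 20) {
    const seen = new Set([startUrl]);
    const queue = [startUrl];
    while (queue.length > 0 && seen.size <= maxPages) {
        const url = queue.shift();
        const html = await (await fetch(url)).text();
        const doc = new DOMParser().parseFromString(html, 'text/html');
        // ...capture whatever data you need from doc here...
        for (const a of doc.querySelectorAll('a[href]')) {
            const next = new URL(a.getAttribute('href'), url);
            if (next.origin === location.origin && !seen.has(next.href)) {
                seen.add(next.href);
                queue.push(next.href);
            }
        }
    }
    return [...seen];
}
// usage, from a page on the site you want to crawl:
// crawlSameOrigin(location.href).then(urls => console.log(urls));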
If you use server-side JavaScript it is possible.
You should take a look at Node.js.
An example of a crawler can be found in the link below:
http://www.colourcoding.net/blog/archive/2010/11/20/a-node.js-web-spider.aspx
Google's Chrome team released Puppeteer in August 2017, a Node library which provides a high-level API for both headless and non-headless Chrome (headless Chrome has been available since version 59).
It uses a bundled version of Chromium, so it is guaranteed to work out of the box. If you want to use a specific Chrome version, you can do so by launching Puppeteer with an executable path as a parameter, such as:
const browser = await puppeteer.launch({executablePath: '/path/to/Chrome'});
An example of navigating to a web page and taking a screenshot of it shows how simple it is (taken from the GitHub page):
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
await page.screenshot({path: 'example.png'});
await browser.close();
})();
You can crawl pages using JavaScript on the server side with the help of a headless WebKit. For crawling, there are a few libraries such as PhantomJS and CasperJS; there is also a newer wrapper around PhantomJS called NightmareJS which makes the work easier.
There are ways to circumvent the same-origin policy with JS. I wrote a crawler for Facebook that gathered information from the profiles of my friends and my friends' friends and allowed filtering the results by gender, current location, age, marital status (you catch my drift). It was simple. I just ran it from the console. That way your script gets the privilege to make requests on the current domain. You can also make a bookmarklet to run the script from your bookmarks.
Another way is to provide a PHP proxy. Your script accesses the proxy on the current domain and requests files from another domain through PHP. Just be careful with those: they might get hijacked and used as a public proxy by a third party if you are not careful.
Good luck, maybe you make a friend or two in the process like I did :-)
My typical setup is a browser extension with cross-origin privileges granted, which injects both the crawler code and jQuery.
Another take on JavaScript crawlers is to use a headless browser like PhantomJS or CasperJS (which builds on PhantomJS's capabilities).
This is what you need: http://zugravu.com/products/web-crawler-spider-scraping-javascript-regular-expression-nodejs-mongodb
They use Node.js, MongoDB and ExtJS as the GUI.
Yes, it is possible:
Use Node.js (it's server-side JS).
There is npm (a package manager that handles third-party modules) in Node.js.
Use PhantomJS with Node.js (PhantomJS is a third-party module that can crawl through websites).
There is a client-side approach for this, using the Firefox Greasemonkey extension. With Greasemonkey you can create scripts to be executed each time you open specified URLs.
Here is an example:
If you have URLs like these:
http://www.example.com/products/pages/1
http://www.example.com/products/pages/2
then you can use something like this to open all the pages containing the product list (execute this manually):
var j = 0;
for(var i=1;i<5;i++)
{
setTimeout(function(){
j = j + 1;
window.open('http://www.example.com/products/pages/' + j, '_blank');
}, 15000 * i);
}
Then you can create a script that opens each product in a new window for every product-list page, and include this URL pattern in Greasemonkey for it:
http://www.example.com/products/pages/*
And then a script for each product page to extract the data, call a web service passing that data, close the window, and so on (a sketch of such a script is below).
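A minimal sketch of such a product-page script (the selectors and the web-service URL are placeholders, not anything from the original setup):

// Greasemonkey-style script matched against http://www.example.com/products/* (placeholder pattern)
(function () {
    // placeholder selectors: adjust to the real product-page markup
    var data = {
        name: document.querySelector('.product-name').textContent.trim(),
        price: document.querySelector('.product-price').textContent.trim(),
        url: location.href
    };
    // send the extracted data to your own collector endpoint (placeholder URL)
    fetch('https://collector.example.com/products', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(data)
    }).then(function () {
        // the window was opened by the list-page script, so it may close itself
        window.close();
    });
})();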
I made an example JavaScript crawler on GitHub.
It's event-driven and uses an in-memory queue to store all the resources (i.e. URLs).
How to use it in your Node environment:
var Crawler = require('../lib/crawler')
var crawler = new Crawler('http://www.someUrl.com');
// crawler.maxDepth = 4;
// crawler.crawlInterval = 10;
// crawler.maxListenerCurrency = 10;
// crawler.redisQueue = true;
crawler.start();
Here I'm just showing you two core methods of a JavaScript crawler.
Crawler.prototype.run = function() {
var crawler = this;
process.nextTick(() => {
//the run loop
crawler.crawlerIntervalId = setInterval(() => {
crawler.crawl();
}, crawler.crawlInterval);
//kick off first one
crawler.crawl();
});
crawler.running = true;
crawler.emit('start');
}
Crawler.prototype.crawl = function() {
var crawler = this;
if (crawler._openRequests >= crawler.maxListenerCurrency) return;
//go get the item
crawler.queue.oldestUnfetchedItem((err, queueItem, index) => {
if (queueItem) {
//got the item start the fetch
crawler.fetchQueueItem(queueItem, index);
} else if (crawler._openRequests === 0) {
crawler.queue.complete((err, completeCount) => {
if (err)
throw err;
crawler.queue.getLength((err, length) => {
if (err)
throw err;
if (length === completeCount) {
//no open Request, no unfetcheditem stop the crawler
crawler.emit("complete", completeCount);
clearInterval(crawler.crawlerIntervalId);
crawler.running = false;
}
});
});
}
});
};
Here is the GitHub link: https://github.com/bfwg/node-tinycrawler.
It is a JavaScript web crawler written in under 1,000 lines of code.
This should put you on the right track.
You can make a web crawler driven from a remote JSON file that opens all the links from a page in new tabs as soon as each tab loads, except for links that have already been opened. If you set this up as a browser extension running in a bare-bones environment (nothing running except the web browser and an internet configuration program) and had it shipped and installed somewhere with a good connection, you could build a database of web pages with an old computer; it would just need to retrieve the content of each tab. You could do that for about $2,000, contrary to most estimates of search-engine costs. Your algorithm would then rank pages based on how often a term appears in the page's innerText property, keywords, and description. You could also set up another PC to re-crawl old pages from the one-time database and add more. I'd estimate it would take about three months and $20,000, maximum.
Axios + Cheerio
You can do this with axios and cheerio. Check the axios docs for the response format.
const cheerio = require('cheerio');
const axios = require('axios');
//crawl
//get url
var url = 'http://amazon.com';
axios.get(url)
.then((res) => {
//response format
var body = res.data;
var statusCode = res.status;
var statusText = res.statusText;
var headers = res.headers;
var request = res.request;
var config = res.config;
//jquery
let $ = cheerio.load(body);
//example
//meta tags
var title = $('meta[name=title]').attr('content');
if(title == undefined || title == 'undefined'){
title = $('title').text();
}
var description = $('meta[name=description]').attr('content');
var keywords = $('meta[name=keywords]').attr('content');
var author = $('meta[name=author]').attr('content');
var type = $('meta[http-equiv=content-type]').attr('content');
var favicon = $('link[rel="shortcut icon"]').attr('href');
}).catch(function (e) {
console.log(e);
});
Node-Fetch + Cheerio
You can do the same thing with node-fetch and cheerio.
const cheerio = require('cheerio');
const fetch = require('node-fetch'); // or the built-in global fetch on Node 18+

var url = 'http://amazon.com'; // same example URL as above

fetch(url, {
method: "GET",
}).then(function(response){
    // response.text() returns a promise; returning it resolves
    // to the HTML string in the next .then
    return response.text();
})
.then(function(res) {
//response html
var html = res;
//jquery
let $ = cheerio.load(html);
//meta tags
var title = $('meta[name=title]').attr('content');
if(title == undefined || title == 'undefined'){
title = $('title').text();
}
var description = $('meta[name=description]').attr('content');
var keywords = $('meta[name=keywords]').attr('content');
var author = $('meta[name=author]').attr('content');
var type = $('meta[http-equiv=content-type]').attr('content');
var favicon = $('link[rel="shortcut icon"]').attr('href');
})
.catch((error) => {
console.error('Error:', error);
});