I am using Puppeteer (a headless browser) to scrape a website. Once I navigate to the URL, how can I load jQuery so that I can use it inside my page.evaluate() function?
All I have right now is a single .js file running the code below. It goes to my URL as intended, but then I get an error inside page.evaluate(), because jQuery doesn't seem to be loaded by this call: await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'})
Any ideas how I can load jQuery correctly here, so that I can use jQuery inside my page.evaluate() function?
const puppeteer = require('puppeteer');

(async () => {
  let url = "[website url I'm scraping]";
  let browser = await puppeteer.launch({ headless: false });
  let page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });
  // code below doesn't seem to load jQuery, since I get an error in page.evaluate()
  await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.2.1.min.js' });
  await page.evaluate(() => {
    // want to use jQuery here to access the DOM
    var classes = $("td:contains('Lec')");
    classes = classes.not('.Comments');
    classes = classes.not('.Pct100');
    classes = Array.from(classes);
  });
})();
You are on the right path.
Also, I don't see any jQuery code being used in your evaluate function, and there is no document.getElement function (the DOM methods are document.querySelector / document.getElementById).
The best way would be to add a local copy of jQuery to avoid any cross-origin errors.
More details can be found in the already answered question here.
UPDATE: I tried a small snippet to test jQuery. The Puppeteer version is 10.4.0.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://google.com', { waitUntil: 'networkidle2' });
  // load a local copy of jQuery into the page
  await page.addScriptTag({ path: "jquery.js" });
  await page.evaluate(() => {
    let wrapper = $(".L3eUgb");
    wrapper.css("background-color", "red");
  });
  await page.screenshot({ path: "hello.png" });
  await browser.close();
})();
The resulting screenshot (hello.png) shows the Google homepage wrapper turned red, so the jQuery code is definitely working.
Also check whether the host website already has its own jQuery instance. In that case you would need to use jQuery's noConflict:
$.noConflict();
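For example, a minimal sketch of that approach inside page.evaluate() (reusing the selector from the question for illustration):
await page.addScriptTag({ path: "jquery.js" });
await page.evaluate(() => {
  // Release the $ alias back to the site's own library; keep our jQuery under a new name.
  const jq = $.noConflict();
  const cells = jq("td:contains('Lec')").not('.Comments').not('.Pct100');
  return cells.length;
});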
Fixed it!
I realized I had left out the code where I perform some extra navigation clicks after going to my initial URL. The problem was that I was adding the script tag on the initial URL instead of after navigating to my final destination URL.
I also needed to use
await page.waitForNavigation({waitUntil: 'networkidle2'})
before adding the script tag so that the page was fully loaded before adding the script.
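Put together, the working order looks roughly like this (a sketch; the click selector is a hypothetical stand-in for the extra navigation):
await page.goto(url, { waitUntil: 'networkidle2' });
await page.click('#link-to-final-page');            // hypothetical selector for the extra navigation
await page.waitForNavigation({ waitUntil: 'networkidle2' });
// only now, on the destination page, inject and use jQuery
await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.2.1.min.js' });
await page.evaluate(() => {
  const classes = $("td:contains('Lec')").not('.Comments').not('.Pct100');
  return classes.length;
});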
Related
I am web scraping using NodeJS/TypeScript.
I have a problem with puppeteer: I do get the fully rendered page (which I verify by running await page.content()). I printed the content and found that it contains 26 'a' tags (links). However, when I search with puppeteer, I only get 20.
What is even stranger is that sometimes I get all the 'a' tags on the page and sometimes I get fewer 'a' tags than are on the page, all without changing the code! It seems to be somewhat random.
I've seen some suggestions online saying to use a waitForElement method or something along those lines. Basically, before searching for tags, it ensures an element is on the page. I don't think this would help in my case because clearly puppeteer is getting everything it needs as shown by the await page.content() method.
Does anyone know why this may be happening? Thanks! A simplified snippet of my code is below.
import puppeteer from 'puppeteer';

const getLinksFromPage = async (
  browser: puppeteer.Browser,
  url: string
) => {
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  const html = await page.content(); // this gets the content and prints it
  console.log(html);                 // so I can verify the number of 'a' tags
  const rawLinks = await page.$$eval('a', (elements: Element[]) =>
    elements.map((element: Element) => element.getAttribute('href')!)
  );
  await page.close();
  return rawLinks;
};
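For reference, the waitForSelector approach mentioned above would look something like this in the snippet (just a sketch; it isn't established here that it fixes the flakiness):
// Wait for at least one <a> to be attached before collecting hrefs.
await page.waitForSelector('a');
const rawLinks = await page.$$eval('a', (elements) =>
  elements.map((element) => element.getAttribute('href'))
);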
I have a NodeJS TypeScript project, and I am trying to get all the 'p' tags from a dynamically rendered website (not static HTML; it makes multiple requests to the backend to fetch data and render the page). I have ["es6", "dom"] in my tsconfig lib, and the following code (this is all the code in the project so far):
import puppeteer from 'puppeteer';

const getLinks = async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.goto('https://webscraper.io/test-sites', { waitUntil: 'networkidle0' });
  const links = await page.evaluate(() => document.querySelectorAll('p'));
  console.log(links);
  await browser.close();
};

getLinks();
However, I keep getting undefined when I print links. I assume this is because the program can't find any 'p' tags. Why is this?
Note: the url provided is just an example. I have tried across multiple different sites and I still get undefined.
Any help is appreciated! Thanks!
Don't use page.evaluate to get element handles; use waitForSelector/waitForXPath/$x/$$ instead (see the Puppeteer docs for the differences between them: https://devdocs.io/puppeteer/index#pageselector-1):
const links: ElementHandle[] = await page.$$('p');
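If you ultimately want text or attribute values rather than the handles themselves, a small sketch combining waitForSelector with $$ and $$eval (both standard Puppeteer methods) might look like:
await page.waitForSelector('p');     // wait until at least one <p> is in the DOM
const handles = await page.$$('p');  // ElementHandle[] for further interaction
const texts = await page.$$eval('p', els => els.map(el => el.textContent));
console.log(texts);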
I'm trying to figure out a way to scrape the next-page link from a webpage using XPath within puppeteer. When I execute the script, it produces a gibberish result even though the XPath is correct. How can I fix it?
const puppeteer = require("puppeteer");

const base = "https://www.timesbusinessdirectory.com";
let url = "https://www.timesbusinessdirectory.com/company-listings";

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const [page] = await browser.pages();
  await page.goto(url, { waitUntil: 'networkidle2' });
  page.waitForSelector(".company-listing");
  const nextPageLink = await page.$x("//a[@aria-label='Next'][./span[@aria-hidden='true'][contains(.,'Next')]]", item => item.getAttribute("href"));
  url = base.concat(nextPageLink);
  console.log("========================>", url);
  await browser.close();
})();
Current output:
https://www.timesbusinessdirectory.comJSHandle#node
Expected output:
https://www.timesbusinessdirectory.com/company-listings?page=2
First of all, there's a missing await on page.waitForSelector(".company-listing");. Not awaiting this defeats the point of the call entirely, but it could be that it incidentally works since the very strict waitUntil: "networkidle2" covers the selector you're interested in anyway, or the xpath is statically present (I didn't bother to check).
Generally speaking, if you're using waitForSelector right after a page.goto, waitUntil: "networkidle2" only slows you down. Only keep it if there's some content you need on the page other than the waitForSelector target, otherwise you're waiting for irrelevant requests that are pulling down images, scripts and data potentially unrelated to your primary target. If it's a slow-loading page, then increasing the timeout on your waitFor... is the typical next step.
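For instance (a sketch; the 60-second timeout is an arbitrary value):
await page.goto(url); // skip networkidle; the waitForSelector below covers what we need
await page.waitForSelector(".company-listing", { timeout: 60000 });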
Another note is that it's a bit odd to waitForSelector on a CSS target, then immediately select an XPath afterwards. It seems more precise to call waitForXPath and then $x with the exact same XPath pattern.
Next, let's look at the docs for page.$x:
page.$x(expression)
expression <string> Expression to evaluate.
returns: <Promise<Array<ElementHandle>>>
The method evaluates the XPath expression relative to the page document as its context node. If there are no such elements, the method resolves to an empty array.
Shortcut for page.mainFrame().$x(expression)
So, unlike evaluate, $eval and $$eval, $x takes a single parameter and resolves to an array of ElementHandles. Your second callback parameter doesn't get you the href the way you think; that pattern only works on the eval-family functions.
In addition to consulting the docs, you can also console.log the returned value to confirm the behavior. The JSHandle#node you're seeing in the URL isn't gibberish; it's the stringified form of the JSHandle object and provides information you can cross-check against the docs.
The solution is to grab the first ElementHandle from the array returned by $x, then call evaluate on that handle with your original callback:
const puppeteer = require("puppeteer");

const url = "https://www.timesbusinessdirectory.com/company-listings";

let browser;
(async () => {
  browser = await puppeteer.launch({ headless: true });
  const [page] = await browser.pages();
  await page.goto(url);
  const xp = `//a[@aria-label='Next']
              [./span[@aria-hidden='true'][contains(.,'Next')]]`;
  await page.waitForXPath(xp);
  const [nextPageLink] = await page.$x(xp);
  const href = await nextPageLink.evaluate(el => el.getAttribute("href"));
  console.log(href); // => /company-listings?page=2
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
As an aside, there's also el => el.href for grabbing the href. .href includes the base URL here, so you won't need to concatenate. In general, the two differ in more ways than just absolute vs. relative paths, so it's good to know about both options.
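As a quick illustration of the difference (values taken from this example):
const relative = await nextPageLink.evaluate(el => el.getAttribute("href"));
// => "/company-listings?page=2"
const absolute = await nextPageLink.evaluate(el => el.href);
// => "https://www.timesbusinessdirectory.com/company-listings?page=2"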
I am trying to scrape a specific string from the webpage below:
https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;
The info I want to get from this page's source is the serial number in the strings below (something I can find when I right-click -> "View Page Source"):
name="nr_rooms_4377601_232287150_0_1_0"
name="nr_rooms_4377601_232287150_1_1_0"
I am using "puppeteer" and below is my code:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  //await page.goto('https://example.com');
  const response = await page.goto("My-url-above");
  let bodyHTML = await page.evaluate(() => document.body.innerHTML);
  let outbodyHTML = await page.evaluate(() => document.body.outerHTML);
  console.log(await response.text());
  console.log(await page.content());
  await browser.close();
})()
But I cannot find the strings I am looking for in response.text() or page.content().
Am I using the wrong page methods?
How can I dump the actual page source, exactly what I see when I right-click and choose "View Page Source"?
If you investigate where these strings appear, you can see that they are in <select> elements with a specific class (.hprt-nos-select):
<select
  class="hprt-nos-select"
  name="nr_rooms_4377601_232287150_0_1_0"
  data-component="hotel/new-rooms-table/select-rooms"
  data-room-id="4377601"
  data-block-id="4377601_232287150_0_1_0"
  data-is-fflex-selected="0"
  id="hprt_nos_select_4377601_232287150_0_1_0"
  aria-describedby="room_type_id_4377601 rate_price_id_4377601_232287150_0_1_0 rate_policies_id_4377601_232287150_0_1_0"
>
You would wait until this element is loaded into the DOM, then it will be visible in the page source as well:
await page.waitForSelector('.hprt-nos-select', { timeout: 0 });
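Once that selector resolves, the name attributes themselves could be pulled out directly, for example (a sketch based on the markup above):
const roomNames = await page.$$eval('.hprt-nos-select', els =>
  els.map(el => el.getAttribute('name'))
);
console.log(roomNames); // e.g. ["nr_rooms_4377601_232287150_0_1_0", ...]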
BUT your issue actually lies in the fact that the URL you are visiting has some extra URL parameters (?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;) which are not taken into account by puppeteer. If you take a full-page screenshot you will see that the page still shows the default hotel search form, without the specific hotel offers you are expecting.
You should interact with the search form with puppeteer (page.click() etc.) to set the dates and the origin country yourself to achieve the expected page content.
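A very rough sketch of that idea is below; the selectors are purely hypothetical placeholders, since the real ones would need to be looked up in the page's markup:
// Hypothetical selectors; inspect the live page to find the real ones.
await page.click('.checkin-field');              // open the date picker
await page.click('[data-date="2020-09-19"]');    // pick the check-in date
await page.click('[data-date="2020-09-20"]');    // pick the check-out date
await page.click('.search-submit');              // run the availability search
await page.waitForSelector('.hprt-nos-select');  // wait for the room table to render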
I'm trying to select and console.log() the NodeList of all the links of a website in the terminal. However, I'm unable to do so when accessing certain websites: google.com, facebook.com, instagram.com.
I know the elements are there, because I can log them in the actual Chromium console (which loads separately) using document.querySelectorAll('a'). But when I try to extract and log the links in the Node terminal, using
const links = await page.evaluate(() => document.querySelectorAll('a'))
console.log(links)
I get undefined
However, this is not the case for most websites, for example yahoo.com and linkedin.com, where my code works. Here it is:
const puppeteer = require('puppeteer');

const URL = 'https://instagram.com/';

const scrape = async () => {
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.setViewport({
    width: 1240,
    height: 680
  });
  await page.goto(URL, { waitUntil: 'domcontentloaded' });
  await page.waitFor(6000);
  const links = await page.evaluate(() => document.querySelectorAll('a'));
  console.log(links);
  await page.screenshot({
    path: 'ig.png'
  });
  await browser.close();
};

scrape();
I tried adding a bypassBotDetectionSystem() function, as suggested in this article, but it didn't work. I don't think that is the issue because, like I said, I can easily navigate around in Chromium.
Thanks for the help!
You are trying to return DOM elements from the page.evaluate method, but this is impossible: if the function passed to page.evaluate returns a non-serializable value, then page.evaluate resolves to undefined, as in your case.
You can use the page.$$ method instead if you want to get an array of ElementHandles.
Example:
const links = await page.$$('a'); // returns <Promise<Array<ElementHandle>>>
But if you just want to get all values of an attribute (e.g. href), you can use the page.$$eval method; it runs Array.from(document.querySelectorAll(selector)) within the page and passes the result as the first argument to pageFunction.
Example:
const hrefs = await page.$$eval('a', links => links.map(link => link.href));
console.log(hrefs);
Or equivalently:
const hrefs = await page.$$eval('a', anchors => [].map.call(anchors, a => a.href));