How do I evaluate a JavaScript variable when web scraping using Cheerio - javascript

When I used Cheerio to scrape https://www.bankofamerica.com/home-equity/assumptions-home-equity/?loanType=homeEquity&state=CA, I only receive a variable name instead of the variable value.
Code:
const cheerio = require("cheerio");

// The page from the question.
const URL = "https://www.bankofamerica.com/home-equity/assumptions-home-equity/?loanType=homeEquity&state=CA";

const BankofAmericaScraper = async (browser) => {
  const date = new Date().toLocaleDateString();
  const page = await browser.newPage();
  await page.goto(URL, {
    waitUntil: ["load"],
    timeout: 0,
  });
  // Grab the raw HTML and hand it to Cheerio.
  const MortgagesPage = await page.content();
  const $ = cheerio.load(MortgagesPage);
  // Dig the rate out of the third ".col-num-2" element.
  const step1 = Object.values($(".col-num-2")[2])[5];
};
I get {{ percentage rates.product.currentRate }} and not 6.650.
How do I access the variable? I'm using a headless browser to evaluate it.

Short answer: with Cheerio, you can't.
Right off the bat, the Cheerio documentation states:
Cheerio parses markup and provides an API for traversing/manipulating
the resulting data structure. It does not interpret the result as a
web browser does. Specifically, it does not produce a visual
rendering, apply CSS, load external resources, or execute JavaScript.
The trade-off is speed: Cheerio returns data much faster than libraries that emulate the page, precisely because it skips all of that work.
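Since you're already driving a headless browser (Puppeteer), the practical fix is to let the page's own JavaScript finish and then read the rendered DOM, instead of parsing the raw HTML with Cheerio. A minimal sketch, reusing the URL and selector from your snippet (whether the rate really lands in the third .col-num-2 element is an assumption taken from your code, not something verified here):
const BankofAmericaScraper = async (browser) => {
  const page = await browser.newPage();
  await page.goto(URL, { waitUntil: "networkidle2", timeout: 0 });
  // Wait until the page's scripts have replaced the {{ ... }} placeholders.
  await page.waitForFunction(() => !document.body.innerText.includes("{{"));
  // Read the rendered text straight from the live DOM.
  return page.evaluate(() =>
    document.querySelectorAll(".col-num-2")[2].textContent.trim()
  );
};
If you still prefer Cheerio's API for traversal, you can feed it the rendered markup after the waitForFunction call with cheerio.load(await page.content()).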

Related

GitHub Pages, how to fetch a file in JS from a repo

I have a GitHub Pages project that I am trying to create. I've got it working great locally, but of course when I publish it, it fails.
The problem is in this bit of javascript, which is supposed to pull some data from a json file in the repo to build the contents of a certain page:
(async function () {
  // Error gets thrown here, because the asset does not exist in the current code state.
  const response = await fetch(`https://GITUSER.github.io/GITREPO/tree/gh-pages/data/file.json`);
  const docData = await response.json();
  const contentTarget = document.getElementById('doc-target');
  const tocTarget = document.getElementById('toc-target');
  createContent(tocTarget, contentTarget, docData);
})();
Now, the problem is that Pages won't load the asset, because it doesn't know it needs it until the function is called. Is there a way to get this asset loaded by Pages so it can be fetched with the fetch API? Or is this beyond the capabilities of GitHub Pages?
Edited: Added some additional code for context.
Your URL mixes the published github.io domain with a github.com web-UI path (/tree/gh-pages/), so the request 404s instead of returning the raw JSON. Try using raw.githubusercontent.com like this:
(async function () {
  const response = await fetch('https://raw.githubusercontent.com/{username}/{repo}/{branch}/{file}');
  const docData = await response.json();
  const contentTarget = document.getElementById('doc-target');
  const tocTarget = document.getElementById('toc-target');
  createContent(tocTarget, contentTarget, docData);
})();
That endpoint serves the file's raw contents, so it should work.
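Alternatively, anything committed to the published branch is itself served by GitHub Pages as a static asset, so (assuming data/file.json lives on the gh-pages branch) a relative fetch avoids hard-coding the host entirely:
(async function () {
  // Served by GitHub Pages from the same branch as the page itself.
  const response = await fetch('data/file.json');
  const docData = await response.json();
  const contentTarget = document.getElementById('doc-target');
  const tocTarget = document.getElementById('toc-target');
  createContent(tocTarget, contentTarget, docData);
})();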

Can't find any tags using Node.js, Puppeteer, and document.querySelector

I have a Node.js TypeScript project, and I am trying to get all the 'p' tags from a dynamically rendered website (not static HTML; the page makes multiple requests to a backend for data and renders client-side). I am using TypeScript with ["es6", "dom"] in my lib, and I have the following code (this is all the code in the project so far):
import puppeteer from 'puppeteer';

const getLinks = async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.goto('https://webscraper.io/test-sites', { waitUntil: 'networkidle0' });
  const links = await page.evaluate(() => document.querySelectorAll('p'));
  console.log(links);
  await browser.close();
};

getLinks();
However, I keep getting undefined when I print links. I assume this is because the program can't find any 'p' tags. Why is this?
Note: the url provided is just an example. I have tried across multiple different sites and I still get undefined.
Any help is appreciated! Thanks!
page.evaluate can only return values that survive serialization between the browser and Node; DOM nodes aren't serializable, so the NodeList comes back empty-handed. Don't use page.evaluate to get elements; use waitForSelector/waitForXPath/$x/$$ instead (see the Puppeteer docs for the differences between them: https://devdocs.io/puppeteer/index#pageselector-1):
const links: ElementHandle[] = await page.$$("p");
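If you only need serializable data (text, hrefs, attribute values) rather than the element handles themselves, page.$$eval runs a callback inside the page and returns plain values. A minimal sketch:
const texts = await page.$$eval('p', els => els.map(el => el.textContent));
console.log(texts); // an array of strings, safe to pass back to Node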

How can I use jQuery with Puppeteer?

So, I am using Puppeteer (a headless browser) to scrape a website, and when I access that URL, how can I load jQuery so that I can use it inside my page.evaluate() function?
All I have now is a .js file, and I'm running the code below. It goes to my URL as intended, but then I get an error in page.evaluate(), since it seems jQuery isn't loading as I thought it would from the await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'}) call.
Any ideas how I can load jQuery correctly here, so that I can use jQuery inside my page.evaluate() function?
const puppeteer = require('puppeteer');

(async () => {
  let url = "[website url I'm scraping]";
  let browser = await puppeteer.launch({ headless: false });
  let page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });
  // code below doesn't seem to load jQuery, since I get an error in page.evaluate()
  await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.2.1.min.js' });
  await page.evaluate(() => {
    // want to use jQuery here to access the DOM
    var classes = $("td:contains('Lec')");
    classes = classes.not('.Comments');
    classes = classes.not('.Pct100');
    classes = Array.from(classes);
  });
})();
You are on the right path.
The best way would be to add a local copy of jQuery, to avoid any cross-origin errors.
More details can be found in the already-answered question here.
UPDATE: I tried a small snippet to test jQuery. The Puppeteer version is 10.4.0.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://google.com', { waitUntil: 'networkidle2' });
  // Inject the local copy of jQuery into the page.
  await page.addScriptTag({ path: "jquery.js" });
  await page.evaluate(() => {
    let wrapper = $(".L3eUgb");
    wrapper.css("background-color", "red");
  });
  await page.screenshot({ path: "hello.png" });
  await browser.close();
})();
[Screenshot: the Google homepage with its main wrapper turned red.]
So the jQuery code is definitely working.
Also check whether the host website already has its own jQuery instance. In that case you would need to use jQuery.noConflict:
$.noConflict();
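A short sketch of how that could look inside evaluate (assuming the page defines its own $ and you want to keep the injected copy under a different name):
await page.evaluate(() => {
  // Hand $ and jQuery back to the page's own copies; keep the injected one as jq.
  const jq = jQuery.noConflict(true);
  jq(".L3eUgb").css("background-color", "red");
});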
Fixed it!
I realized I forgot to include the code where I did some extra navigation clicks after going to my initial URL; the problem was that I was adding the script tag on the initial URL instead of after navigating to my final destination URL.
I also needed to use
await page.waitForNavigation({ waitUntil: 'networkidle2' })
before adding the script tag, so the page was fully loaded before the script was injected.
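Putting that together, the working order looks roughly like this (the click selector is a hypothetical placeholder for the extra navigation step):
await page.goto(url, { waitUntil: 'networkidle2' });
// Click and wait for the resulting navigation together, to avoid a race.
await Promise.all([
  page.waitForNavigation({ waitUntil: 'networkidle2' }),
  page.click('#some-nav-link'), // hypothetical selector
]);
// Only now inject jQuery, so it attaches to the final destination page.
await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.2.1.min.js' });
await page.evaluate(() => {
  const classes = Array.from($("td:contains('Lec')").not('.Comments').not('.Pct100'));
});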

Can't extract next page link using XPath within Puppeteer

I'm trying to figure out a way to scrape the next-page link from a webpage using XPath within Puppeteer. When I execute the script, I can see that it gets a gibberish result even though the XPath is correct. How can I fix it?
const puppeteer = require("puppeteer");
const base = "https://www.timesbusinessdirectory.com";
let url = "https://www.timesbusinessdirectory.com/company-listings";

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const [page] = await browser.pages();
  await page.goto(url, { waitUntil: 'networkidle2' });
  page.waitForSelector(".company-listing");
  const nextPageLink = await page.$x("//a[@aria-label='Next'][./span[@aria-hidden='true'][contains(.,'Next')]]", item => item.getAttribute("href"));
  url = base.concat(nextPageLink);
  console.log("========================>", url);
  await browser.close();
})();
Current output:
https://www.timesbusinessdirectory.comJSHandle@node
Expected output:
https://www.timesbusinessdirectory.com/company-listings?page=2
First of all, there's a missing await on page.waitForSelector(".company-listing");. Not awaiting this defeats the point of the call entirely, but it could be that it incidentally works since the very strict waitUntil: "networkidle2" covers the selector you're interested in anyway, or the xpath is statically present (I didn't bother to check).
Generally speaking, if you're using waitForSelector right after a page.goto, waitUntil: "networkidle2" only slows you down. Only keep it if there's some content you need on the page other than the waitForSelector target, otherwise you're waiting for irrelevant requests that are pulling down images, scripts and data potentially unrelated to your primary target. If it's a slow-loading page, then increasing the timeout on your waitFor... is the typical next step.
Another note: it's a bit odd to waitForSelector on a CSS target, then select an XPath immediately afterwards. It seems more precise to waitForXPath and then $x, using the exact same XPath pattern for both calls.
Next, let's look at the docs for page.$x:
page.$x(expression)
expression <string> Expression to evaluate.
returns: <Promise<Array<ElementHandle>>>
The method evaluates the XPath expression relative to the page document as its context node. If there are no such elements, the method resolves to an empty array.
Shortcut for page.mainFrame().$x(expression)
So, unlike evaluate, $eval and $$eval, $x takes one parameter and resolves to an array of ElementHandles. Your second-parameter callback doesn't get you the href like you think; that only works on the eval-family functions.
In addition to consulting the docs, you can also console.log the returned value to confirm the behavior. The JSHandle@node you're seeing in the URL isn't gibberish; it's the stringified form of the JSHandle object, and it provides information you can cross-check against the docs.
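For instance, a quick check (not part of the fix):
const result = await page.$x("//a[@aria-label='Next']");
console.log(String(result[0])); // "JSHandle@node": an ElementHandle, not a string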
The solution is to grab the first elementHandle from the array returned by the function and then evaluate on that handle using your original callback:
const puppeteer = require("puppeteer");

const url = "https://www.timesbusinessdirectory.com/company-listings";

let browser;
(async () => {
  browser = await puppeteer.launch({ headless: true });
  const [page] = await browser.pages();
  await page.goto(url);
  const xp = `//a[@aria-label='Next']
    [./span[@aria-hidden='true'][contains(.,'Next')]]`;
  await page.waitForXPath(xp);
  const [nextPageLink] = await page.$x(xp);
  const href = await nextPageLink.evaluate(el => el.getAttribute("href"));
  console.log(href); // => /company-listings?page=2
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
As an aside, there's also el => el.href for grabbing the link. The .href property resolves against the base URL, so you won't need to concatenate. More generally, getAttribute("href") returns the literal attribute text while .href is the resolved property, so it's good to know about both options.
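On the same handle, for example:
const rel = await nextPageLink.evaluate(el => el.getAttribute("href"));
const abs = await nextPageLink.evaluate(el => el.href);
console.log(rel); // "/company-listings?page=2"
console.log(abs); // "https://www.timesbusinessdirectory.com/company-listings?page=2"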

Get complete web page source HTML with Puppeteer, but some part is always missing

I am trying to scrape a specific string from the web page below:
https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;
The info I want to get from the page source is the serial numbers in the strings below (something I can find by right-clicking -> "View Page Source"):
name="nr_rooms_4377601_232287150_0_1_0" / name="nr_rooms_4377601_232287150_1_1_0"
I am using Puppeteer, and below is my code:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  //await page.goto('https://example.com');
  const response = await page.goto("My-url-above");
  let bodyHTML = await page.evaluate(() => document.body.innerHTML);
  let outbodyHTML = await page.evaluate(() => document.body.outerHTML);
  console.log(await response.text());
  console.log(await page.content());
  await browser.close();
})();
But I cannot find the strings I am looking for in response.text() or page.content().
Am I using the wrong methods on page?
How can I dump the actual page source, exactly what I see when I right-click and choose "View Page Source"?
If you investigate where these strings appear, you can see that they live in <select> elements with a specific class (.hprt-nos-select):
<select
  class="hprt-nos-select"
  name="nr_rooms_4377601_232287150_0_1_0"
  data-component="hotel/new-rooms-table/select-rooms"
  data-room-id="4377601"
  data-block-id="4377601_232287150_0_1_0"
  data-is-fflex-selected="0"
  id="hprt_nos_select_4377601_232287150_0_1_0"
  aria-describedby="room_type_id_4377601 rate_price_id_4377601_232287150_0_1_0 rate_policies_id_4377601_232287150_0_1_0"
>
You would wait until this element is loaded into the DOM; then it will show up in page.content() as well:
await page.waitForSelector('.hprt-nos-select', { timeout: 0 });
BUT your issue actually lies in the fact that the URL you are visiting has some extra URL parameters (?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;) which are not taken into account by Puppeteer: take a full-page screenshot and you will see that the page still shows the default hotel search form, not the specific hotel offers you are expecting.
You should interact with the search form with Puppeteer (page.click() etc.) to set the dates and the origin country yourself to get the expected page content; a rough sketch follows below.
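A rough sketch of that interaction (every selector below is a hypothetical placeholder; inspect the live search form for the real ones):
await page.goto('https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html');
// Hypothetical selectors: find the real ones in the page's search form.
await page.click('.date-picker-open');        // open the date picker
await page.click('[data-date="2020-09-19"]'); // check-in
await page.click('[data-date="2020-09-20"]'); // check-out
await page.click('.search-submit');           // submit the search
// Now the offer table should render, including the nr_rooms_* selects.
await page.waitForSelector('.hprt-nos-select', { timeout: 0 });
console.log(await page.content());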
