Getting <span> text on web scraping - javascript

I'm using Puppeteer and jsDOM to scrape this site: https://www.lcfc.com/matches/results.
I want the names of the teams of every match, so on the console I use this:
document.querySelectorAll('.match-item__team-container span')
.forEach(element => console.log(element.textContent));
On the console, the names prints ok but when I use this on my code it returns nothing.
This is my code:
const puppeteer = require('puppeteer');
const jsdom = require('jsdom');
(async () => {
try {
const browser = await puppeteer.launch() ;
const page = await browser.newPage();
const response = await page.goto('https://www.lcfc.com/matches/results');
const body = await response.text();
const { window: { document } } = new jsdom.JSDOM(body);
document.querySelectorAll('.match-item__team-container span')
.forEach(element => console.log(element.textContent));
await browser.close();
} catch (error) {
console.error(error);
}
})();
And I don't have any error. Some suggestion? Thank you.
I tried with this code now, but still not working. I show the code and a picture of the console:
const puppeteer = require('puppeteer');
(async () => {
try {
const browser = await puppeteer.launch() ;
const page = await browser.newPage();
await page.waitForSelector('.match-item__team-container span');
const data = await page.evaluate(() => {
document.querySelectorAll('.match-item__team-container span')
.forEach(element => console.log(element.textContent));
});
//listen to console events in the chrome tab and log it in nodejs process
page.on('console', consoleObj => console.log(consoleObj.text()));
await browser.close();
} catch (error) {
console.log(error);
}
})();

Do it puppeter way and use evaluate to run your code after waiting for the selector to appear via waitForSelector
await page.waitForSelector('.match-item__team-container span');
const data = await page.evaluate(() => {
document.querySelectorAll('.match-item__team-container span')
.forEach(element => console.log(element.textContent));
//or return the values of the selected item
return somevalue;
});
//listen to console events in the chrome tab and log it in nodejs process
page.on('console', consoleObj => console.log(consoleObj.text()));
evaluate runs your code inside the active tab of the chrome so you will not need jsDOM to parse the response.
UPDATE
The new timeout issue is because the page is taking too long to load: use {timeout : 0}
const data = await page.evaluate(() => {
document.querySelectorAll('.match-item__team-container span')
.forEach(element => console.log(element.textContent));
//or return the values of the selected item
return somevalue;
},{timeout:60000});

Related

Puppeteer evaluate function on load to show alert

I am a newbie in puppeteer and I can't understand why the following code cannot work. any explanation would be appreciated.
const puppeteer=require("puppeteer");
(async () => {
const browser = await puppeteer.launch({headless:false});
const page = await browser.newPage();
await page.goto('https://bet254.com');
await page.evaluate(async ()=> {
window.addEventListener("load",(event)=>{
document.alert("Loaded!");
})
});
})();
I was expecting an alert after loading. But nothing happened! How can I add a listener to show an alert on page load?
page.goto already waits for the page to load, so by the time your evalute runs, you can't re-wait for the page to load, so the load event will never fire.
Another problem is that document.alert isn't a function. You may be thinking of document.write or window.alert. In any case, neither function is particularly useful for debugging, so I suggest sticking to console.log unless you have a very compelling reason not to.
When working with Puppeteer, it's important to isolate problems by running your evaluate code by hand in the browser without Puppeteer, otherwise you might have no idea whether it's Puppeteer or the browser code that's failing.
Anything logged in evaluate won't be shown in your Node stdout or stderr, so you'll probably want to monitor that with a log listener. You'll need to look at both Node and the browser console for errors.
Depending on what you're trying to accomplish, page.evaluateOnNewDocument(pageFunction[, ...args]) will let you attach code to evaluate whenever you navigate, which might be what you're trying for here.
Here's an example of alerting headfully:
const puppeteer = require("puppeteer"); // ^19.6.3
let browser;
(async () => {
browser = await puppeteer.launch({headless: false});
const [page] = await browser.pages();
await page.evaluateOnNewDocument(() => {
window.addEventListener("load", event => {
alert("Loaded!");
});
});
await page.goto("https://www.example.com", {waitUntil: "load"});
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
Using console.log headlessly, with the log redirected to Node:
const puppeteer = require("puppeteer");
const onPageConsole = msg =>
Promise.all(msg.args().map(e => e.jsonValue()))
.then(args => console.log(...args));
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
page.on("console", onPageConsole);
await page.evaluateOnNewDocument(() => {
window.addEventListener("load", event => {
console.log("Loaded!");
});
});
await page.goto("https://www.example.com", {waitUntil: "load"});
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
If all you're trying to do is run some code in the browser after load, then you might not even need to attach a listener to the load event at all:
const onPageConsole = msg =>
Promise.all(msg.args().map(e => e.jsonValue()))
.then(args => console.log(...args));
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
page.on("console", onPageConsole);
await page.goto("https://www.example.com", {waitUntil: "load"});
await page.evaluate(() => console.log("Loaded!"));
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
Or if the code you want to run is purely Node just use normal control flow:
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto("https://www.example.com", {waitUntil: "load"});
console.log("Loaded!");
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
By the way, there's no need to make a function async unless you have await in it somewhere.
See also Puppeteer wait until page is completely loaded.

Puppeteer not allowing me to run multiple functions in different blocks of code

I'm in the process of making an Autocheckout bot, I'm attempting to make the section that checks if the item is in stock and I want to make it all different functions in different code blocks. The problem is I cant get it to run.
When I wrap the function in () only the first function runs while the second one does nothing.
Here is the code without the () around the functions, anyone know what I'm doing wrong?
const puppeteer = require ('puppeteer');
const puppeteerExtra = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
const rand_url = "https://www.walmart.com/ip/Cyberpunk-2077-Warner-Bros-PlayStation-4/786104378";
async function initBrowser(){
const browser = await puppeteer.launch({args: ["--incognito"],headless:false}); //Launches browser in incognito
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage(); //Ensures the new page is also incognito
await page.evaluateOnNewDocument(() => {delete navigator.__proto__.webdriver;});
await page.goto(rand_url); //goes to given link
return page;
};
async function checkstock(page){
await page.reload();
let content = await page.evaluate(() => document.body.innerHTML)
$("link[itemprop ='availability']", content).each(function(){
let out_of_stock = $(this).attr('href').toLowerCase().includes("outofstock");
if(out_of_stock){
console.log("Out of Stock");
} else{
await browser.close();
console.log("In Stock")
//await page.waitForSelector("button[class='button spin-button prod-ProductCTA--primary button--primary']", {visible: true,}); //Waits for Add to Cart Button
//await page.$eval("button[class='button spin-button prod-ProductCTA--primary button--primary']", elem => elem.click()); //Clicks Add to cart button
}
});
};
To execute the code do it as follow, but you will get ReferenceError: $ is not defined.
const puppeteer = require ('puppeteer');
const puppeteerExtra = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
const rand_url = "https://www.walmart.com/ip/Cyberpunk-2077-Warner-Bros-PlayStation-4/786104378";
async function initBrowser(){
const browser = await puppeteer.launch({args: ["--incognito"],headless:false}); //Launches browser in incognito
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage(); //Ensures the new page is also incognito
await page.evaluateOnNewDocument(() => {delete navigator.__proto__.webdriver;});
await page.goto(rand_url); //goes to given link
return page;
};
async function checkstock(page){
await page.reload();
let content = await page.evaluate(() => document.body.innerHTML)
console.error(content);
$("link[itemprop ='availability']", content).each(async function(){
let out_of_stock = $(this).attr('href').toLowerCase().includes("outofstock");
if(out_of_stock){
console.log("Out of Stock");
} else{
await browser.close();
}
});
};
(async () => {
const page = await initBrowser()
await checkstock(page)
})()
I debugged your code, and after add to launch.json:
"outputCapture": "std"
I noticed that there is an error in the following line:
await browser.close();
^^^^^
SyntaxError: await is only valid in async function
You need to add async
$("link[itemprop ='availability']", content).each(async function(){

How to reload and wait for an element to appear?

I tried searching for this answer but there doesn't seem to be an answer on the Internet. What I want to do is use node js to reload a page until it finds the element with the query I want. I will be using puppeteer for other parts of the program if that will help.
Ok, I used functions from both answers and came up with this, probably unoptimized code:
const puppeteer = require("puppeteer");
(async () => {
try {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("http://127.0.0.1:5500/main.html");
await page.waitForSelector("#buy-button");
console.log("worked");
} catch (err) {
console.log(`ERROR: ${err}`);
}
})();
But what I don't know how to do is to reload the page, and keep reloading until the id I want is there. For example, keep reloading youtube until the video you want is there(unpractical example, but I think it gets the point across).
Here's how I solved waiting for an element in puppeteer and reloading the page if it wasn't found;
async waitForSelectorWithReload(selector: string) {
const MAX_TRIES = 5;
let tries = 0;
while (tries <= MAX_TRIES) {
try {
const element = await this.page.waitForSelector(selector, {
timeout: 5000,
});
return element;
} catch (error) {
if (tries === MAX_TRIES) throw error;
tries += 1;
void this.page.reload();
await this.page.waitForNavigation({ waitUntil: 'networkidle0' });
}
}
}
And can be used as;
await waitForSelectorWithReload("input#name")
You can use "waitUntil: "networkidle2" to make sure the page is done loading. Obviously change the url, unless you are actually using evil.com
const puppeteer = require("puppeteer"); // include library
(async () =>{
const browser = await puppeteer.launch(); // run browser
const page = await browser.newPage(); // create new tab
await page.goto(
`http://www.evil.com`,
{
waitUntil: "networkidle2",
}
);
// do your stuff here
await browser.close();
})();
const puppeteer = require('puppeteer');
puppeteer.launch().then(async browser => {
const page = await browser.newPage();
page
.waitForSelector('#myId')
.then(() => console.log('got it'));
browser.close();
});

How to get all html data after all scripts and page loading is done? (puppeteer)

Finally I figured how to use Node.js. Installed all libraries/extensions. So puppeteer is working, but as it was previous with Xmlhttp... it gets only template/body of the page, without needed information. All scripts on the page engage after few second it had been opened in browser (Web app?). I need to get information inside certain tags after Whole page is loaded. Also, I would ask, if it possible to have pure JavaScript, because I do not use jQuery like code. So it doubles difficulty for me...
Here what I have so far.
const puppeteer = require('puppeteer');
const $ = require('cheerio');
let browser;
let page;
const url = "really long link with latitude and attitude";
(async () => puppeteer
.launch()
.then(await function(browser) {
return browser.newPage();
})
.then(await function(page) {
return page.goto(url).then(function() {
return page.content();
});
})
.then(await function(html) {
$('strong', html).each(function() {
console.log($(this).text());
});
})
.catch(function(err) {
//handle error
}))();
I get only template default body elements inside strong tag. But it should contain a lot more data than just 10 items.
If you want full html same as inspect? Here it is:
const puppeteer = require('puppeteer');
(async function main() {
try {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto('https://example.org/', { waitUntil: 'networkidle0' });
const data = await page.evaluate(() => document.querySelector('*').outerHTML);
console.log(data);
await browser.close();
} catch (err) {
console.error(err);
}
})();
let bodyHTML = await page.evaluate(() => document.documentElement.outerHTML);
This
Some notes:
You need not cheerio with puppeteer and you need not reparse page.content(): you already have the full DOM with all scripts run and you can evaluate any code in window context like in a browser using page.evaluate() and transferring serializable data between web API context and Node.js API context.
Try to use async/await only, this will simplify your code and flow.
If you need to wait till all the scripts and other dependencies are loaded, use waitUntil: 'networkidle0' in page.goto().
If you suspect that document scripts need some time till the needed state, use various test functions like page.waitForSelector() or fall back to page.waitFor(milliseconds).
Here is a simple script that outputs all tag names in a page.
'use strict';
const puppeteer = require('puppeteer');
(async function main() {
try {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto('https://example.org/', { waitUntil: 'networkidle0' });
const data = await page.evaluate(
() => Array.from(document.querySelectorAll('*'))
.map(elem => elem.tagName)
);
console.log(data);
await browser.close();
} catch (err) {
console.error(err);
}
})();
You can specify your task in more details and we can try to write something more appropriate.
Script for www.bezrealitky.cz (task from a comment below):
'use strict';
const fs = require('fs');
const puppeteer = require('puppeteer');
(async function main() {
try {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
page.setDefaultTimeout(0);
await page.goto('https://www.bezrealitky.cz/vyhledat?offerType=pronajem&estateType=byt&disposition=&ownership=&construction=&equipped=&balcony=&order=timeOrder_desc&boundary=%5B%5B%7B%22lat%22%3A50.171436864513%2C%22lng%22%3A14.506905276796942%7D%2C%7B%22lat%22%3A50.154133576294%2C%22lng%22%3A14.599004629591036%7D%2C%7B%22lat%22%3A50.14524430128%2C%22lng%22%3A14.58773054712799%7D%2C%7B%22lat%22%3A50.129307131988%2C%22lng%22%3A14.60087568578706%7D%2C%7B%22lat%22%3A50.122604734575%2C%22lng%22%3A14.659116306376973%7D%2C%7B%22lat%22%3A50.106512499343%2C%22lng%22%3A14.657434650206028%7D%2C%7B%22lat%22%3A50.090685542974%2C%22lng%22%3A14.705099547441932%7D%2C%7B%22lat%22%3A50.072175921973%2C%22lng%22%3A14.700004206235008%7D%2C%7B%22lat%22%3A50.056898491904%2C%22lng%22%3A14.640206899053055%7D%2C%7B%22lat%22%3A50.038528576841%2C%22lng%22%3A14.666852728301023%7D%2C%7B%22lat%22%3A50.030955909657%2C%22lng%22%3A14.656128752460972%7D%2C%7B%22lat%22%3A50.013435368522%2C%22lng%22%3A14.66854956530301%7D%2C%7B%22lat%22%3A49.99444182116%2C%22lng%22%3A14.640153080292066%7D%2C%7B%22lat%22%3A50.010839032542%2C%22lng%22%3A14.527474219359988%7D%2C%7B%22lat%22%3A49.970771602447%2C%22lng%22%3A14.46224174052395%7D%2C%7B%22lat%22%3A49.970669964027%2C%22lng%22%3A14.400648545303966%7D%2C%7B%22lat%22%3A49.941901176098%2C%22lng%22%3A14.395563234671044%7D%2C%7B%22lat%22%3A49.948384148423%2C%22lng%22%3A14.337635637038034%7D%2C%7B%22lat%22%3A49.958376114735%2C%22lng%22%3A14.324977842107955%7D%2C%7B%22lat%22%3A49.9676286223%2C%22lng%22%3A14.34491711110104%7D%2C%7B%22lat%22%3A49.971859099005%2C%22lng%22%3A14.326815050839059%7D%2C%7B%22lat%22%3A49.990608728081%2C%22lng%22%3A14.342731259186962%7D%2C%7B%22lat%22%3A50.002211140429%2C%22lng%22%3A14.29483886971002%7D%2C%7B%22lat%22%3A50.023596577558%2C%22lng%22%3A14.315872285282012%7D%2C%7B%22lat%22%3A50.058309376419%2C%22lng%22%3A14.248086830069042%7D%2C%7B%22lat%22%3A50.073179111%2C%22lng%22%3A14.290193274400963%7D%2C%7B%22lat%22%3A50.102973823639%2C%22lng%22%3A14.224439442359994%7D%2C%7B%22lat%22%3A50.130060800171%2C%22lng%22%3A14.302396419107936%7D%2C%7B%22lat%22%3A50.116019827009%2C%22lng%22%3A14.360785349547996%7D%2C%7B%22lat%22%3A50.148005694843%2C%22lng%22%3A14.365662825877052%7D%2C%7B%22lat%22%3A50.14142969454%2C%22lng%22%3A14.394903042943952%7D%2C%7B%22lat%22%3A50.171436864513%2C%22lng%22%3A14.506905276796942%7D%2C%7B%22lat%22%3A50.171436864513%2C%22lng%22%3A14.506905276796942%7D%5D%5D&hasDrawnBoundary=1&mapBounds=%5B%5B%7B%22lat%22%3A50.289447077141126%2C%22lng%22%3A14.68724263943227%7D%2C%7B%22lat%22%3A50.289447077141126%2C%22lng%22%3A14.087801111111958%7D%2C%7B%22lat%22%3A50.039169221047985%2C%22lng%22%3A14.087801111111958%7D%2C%7B%22lat%22%3A50.039169221047985%2C%22lng%22%3A14.68724263943227%7D%2C%7B%22lat%22%3A50.289447077141126%2C%22lng%22%3A14.68724263943227%7D%5D%5D&center=%7B%22lat%22%3A50.16447196305031%2C%22lng%22%3A14.387521875272125%7D&zoom=11&locationInput=praha&limit=15');
await page.waitForSelector('#search-content button.btn-icon');
while (await page.$('#search-content button.btn-icon') !== null) {
const articlesForNow = (await page.$$('#search-content article')).length;
console.log(`Articles for now: ${articlesForNow}. Getting more...`);
await Promise.all([
page.evaluate(
() => { document.querySelector('#search-content button.btn-icon').click(); }
),
page.waitForFunction(
old => document.querySelectorAll('#search-content article').length > old,
{},
articlesForNow
),
]);
}
const articlesAll = (await page.$$('#search-content article')).length;
console.log(`All articles: ${articlesAll}.`);
fs.writeFileSync('full.html', await page.content());
fs.writeFileSync('articles.html', await page.evaluate(
() => document.querySelector('#search-content div.b-filter__inner').outerHTML
));
fs.writeFileSync('articles.txt', await page.evaluate(
() => [...document.querySelectorAll('#search-content article')]
.map(({ innerText }) => innerText)
.join(`\n${'-'.repeat(50)}\n`)
));
console.log('Saved.');
await browser.close();
} catch (err) {
console.error(err);
}
})();
Just one line:
const html = await page.content();
Details:
import puppeteer from 'puppeteer'
const test = async (url) => {
const browser = await puppeteer.launch({ headless: false })
const page = await browser.newPage()
await page.goto(url, { waitUntil: 'networkidle0' })
const html = await page.content()
console.log(html)
}
await test('https://stackoverflow.com/')

Puppeteer - how to use page.evaluateHandle

I have some trouble using the newest version of puppeteer.
I'm using puppeteer version 0.13.0.
I have a site with this element:
<div class="header">hey there</div>
I'm trying to run this code:
const headerHandle = await page.evaluateHandle(() => {
const element = document.getElementsByClassName('header');
return element;
});
Now the headerHandle is a JSHandle with a description: 'HTMLCollection(0)'.
If I try to run
headerHandle.getProperties() and try to console.log I get Promise { <pending> }.
If I just try to get the element like this:
const result = await page.evaluate(() => {
const element = document.getElementsByClassName('header');
return Promise.resolve(element);
});
I get an empty object.
How do I get the actual element or the value of the element?
Puppeteer has changed the way evaluate works, the safest way to retrieve DOM elements is by creating a JSHandle, and passing that handle to the evaluate function:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com', { waitUntil: 'networkidle2' });
const jsHandle = await page.evaluateHandle(() => {
const elements = document.getElementsByTagName('h1');
return elements;
});
console.log(jsHandle); // JSHandle
const result = await page.evaluate(els => els[0].innerHTML, jsHandle);
console.log(result); // it will log the string 'Example Domain'
await browser.close();
})();
For reference: evalute docs, issue #1590, issue #1003 and PR #1098
Fabio's approach is good to have for working with arrays, but in many cases you don't need the nodes themselves, just their serializable contents or properties. In OP's case, there's only one element being selected, so the following works more directly (with less straightforward approaches shown for comparison):
const puppeteer = require("puppeteer"); // ^19.1.0
const html = `<!DOCTYPE html><html><body>
<div class="header">hey there</div>
</body></html>`;
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
const text = await page.$eval(".header", el => el.textContent);
console.log(text); // => hey there
// or, less directly:
const text2 = await page.evaluate(() => {
// const el = document.getElementsByClassName(".header")[0] // take the 0th element
const el = document.querySelector(".header"); // ... better still
return el.textContent;
});
console.log(text2); // => hey there
// even less directly, similar to OP:
const handle = await page.evaluateHandle(() =>
document.querySelector(".header")
);
const text3 = await handle.evaluate(el => el.textContent);
console.log(text3); // => hey there
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
Getting the text from multiple elements is also straightforward, not requiring handles:
const html = `<!DOCTYPE html><html><body>
<div class="header">foo</div>
<div class="header">bar</div>
<div class="header">baz</div>
</body></html>`;
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
const text = await page.$$eval(
".header",
els => els.map(el => el.textContent)
);
console.log(text);
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
As Fabio's approach attests, things get trickier when working with multiple elements when you want to use the handles in Puppeteer. Unlike the ElementHandle[] return of page.$$, page.evaluateHandle's JSHandle return isn't iterable, even if the handle point to an array. It's only expandable into an array back into the browser.
One workaround is to return the length, optionally attach the selector array to the window (or re-query it multiple times), then run a loop and call evaluateHandle to return each ElementHandle:
// ...
await page.setContent(html);
const length = await page.$$eval(".header", els => {
window.els = els;
return els.length;
});
const nodes = [];
for (let i = 0; i < length; i++) {
nodes.push(await page.evaluateHandle(i => window.els[i], i));
}
// now you can loop:
for (const el of nodes) {
console.log(await el.evaluate(el => el.textContent));
}
// ...
See also Puppeteer find list of shadowed elements and get list of ElementHandles which, in spite of the shadow DOM in the title, is mostly about working with arrays of handles.

Categories

Resources