JS Image scraper

I thought making a basic image scraper would be a fun project. The code below works in the browser console on the website, but I don't know how to get it to work from my app.js.
var anchors = document.getElementsByTagName('a');
var hrefs = [];
for (var i = 0; i < anchors.length; i++) {
  var src = anchors[i].href;
  if (src.endsWith(".jpeg")) {
    hrefs.push(anchors[i].href);
  }
}
console.log(hrefs);
I thought using Puppeteer was a good idea, but my knowledge is too limited to determine whether that's right or not. This is my Puppeteer code:
const puppeteer = require("puppeteer");

async function scrape(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  var anchors = await page.evaluate(() => document.getElementsByTagName('a'));
  var hrefs = [];
  for (var i = 0; i < anchors.length; i++) {
    var img = anchors[i].href;
    if (img.endsWith(".jpeg")) {
      hrefs.push(anchors[i].href);
    }
  }
  console.log({hrefs}, {img});
  browser.close();
}
I understand that the last part of the code is wrong, but I can't find a solid answer on what should be written instead.
Thank you for taking the time.

page.evaluate() can only transfer serializable values (roughly, the values JSON can handle). As document.getElementsByTagName() returns a collection of DOM elements that are not serializable (they contain methods and circular references), each element in the collection is replaced with an empty object. You need to return either a serializable value (for example, an array of texts or href attributes) or use something like page.$$(selector) and the ElementHandle API.
The Web API is not available outside the .evaluate() argument function, so you need to place all of the Web API calls inside the .evaluate() argument function and return serializable data from it.
const puppeteer = require("puppeteer");

async function scrape(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const data = await page.evaluate(() => {
    const anchors = document.getElementsByTagName('a');
    const hrefs = [];
    for (let i = 0; i < anchors.length; i++) {
      const img = anchors[i].href;
      if (img.endsWith(".jpeg")) {
        hrefs.push(img);
      }
    }
    return hrefs;
  });
  console.log(data);
  await browser.close();
}
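As mentioned above, the extraction can also be delegated to page.$$eval(), which runs its callback in the page and hands back only serializable data. A minimal sketch of the same filter, assuming it runs inside scrape() after page.goto(url):

// Map all anchors to their hrefs in the page, keep only .jpeg links
const hrefs = await page.$$eval('a', (anchors) =>
  anchors
    .map((anchor) => anchor.href)
    .filter((href) => href.endsWith('.jpeg'))
);
console.log(hrefs);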

Related

Get entire Playwright page in html and Text

I am using Playwright in Node.js and I am having some problems getting the page text or HTML. I just want to get the page as a string, like: <html><div class="123"><a>link</a>something</div><div>somethingelse</div></html>
const browser = await playwright.chromium.launch({
  headless: true,
});
const page = await browser.newPage();
await page.goto(url);
I was trying to use const pageText = page.$('div').innerText; and also const pageText2 = await page.$$eval('div', el => el.innerText);
But neither works; both just give me undefined.
For the full html of the page, this is what you need: const html = await page.content()
To get the inner text of the div, this should work: const pageText = await page.innerText('div')
See:
https://playwright.dev/docs/api/class-page#page-content
https://playwright.dev/docs/api/class-page#page-inner-text
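Putting both calls together, a minimal sketch (assuming url holds the target address):

const playwright = require('playwright');

const browser = await playwright.chromium.launch({
  headless: true,
});
const page = await browser.newPage();
await page.goto(url);
// Full HTML of the page as one string
const html = await page.content();
// Inner text of the first matching div
const pageText = await page.innerText('div');
console.log(html);
console.log(pageText);
await browser.close();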

Scraping a web page for .pdf links and writing all of the matching links to a text file in Node.js

I am an aspiring developer. As one of my projects, I am learning how to do web scraping. The goal here is to scrape a given webpage for any links that are PDFs and save those links to a text file in Node.js. With the given code, I am successfully console-logging all of the matching links, but only one link ends up written to my text file. Can someone steer me in the right direction?
const puppeteer = require("puppeteer");
const fs = require("fs/promises");

let myNewURL =
  "https://www.renault.co.il/cars/Zoe/index.html?fbclid=IwAR1RtxbC_U2fImp9_KXJuQ869h5Wv77fyZVj8uBOU86rU90wb2L_NfrNppc";

async function scrapeSite(url) {
  console.log("firing");
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url);
  // this gives back an actual array, not a node list
  const linkCollection = await page.$$eval("a", (links) => {
    return links.map((link) => {
      return link.href;
    });
  });
  for (const link of linkCollection) {
    if (link.includes(".pdf")) {
      console.log(link);
      await fs.writeFile("pdfLinks.txt", link);
    }
  }
  await browser.close();
}

scrapeSite(myNewURL);
fs.writeFile overwrites the original file with each call. Try fs.appendFile instead. I've also added a newline (\n) at the end so the links are on individual lines:
for (const link of linkCollection) {
  if (link.includes(".pdf")) {
    console.log(link);
    await fs.appendFile("pdfLinks.txt", link + '\n');
  }
}
Alternatively, you could collect the links into an array first, then write them all together:
const pdfLinks = [];
for (const link of linkCollection) {
  if (link.includes(".pdf")) {
    console.log(link);
    pdfLinks.push(link);
  }
}
const output = pdfLinks.join('\n');
await fs.writeFile("pdfLinks.txt", output);
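The same collect-then-write idea can be written more compactly with Array.prototype.filter; a minimal sketch:

// Keep only the PDF links, then write them all in one call
const pdfLinks = linkCollection.filter((link) => link.includes(".pdf"));
await fs.writeFile("pdfLinks.txt", pdfLinks.join("\n"));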
Use the append flag instead:
await fs.writeFile("pdfLinks.txt", link + '\n', { flag: 'a' });

Puppeteer returning undefined (JS) using XPath

I'm trying to scrape an element on this website (the URL is in the code below).
My JS code:
const puppeteer = require("puppeteer");

const url = 'https://magicseaweed.com/Bore-Surf-Report/1886/'
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const title = await page.$x('/html/body/div[1]/div[2]/div[2]/div/div[2]/div[2]/div[2]/div/div/div[1]/div/header/h3/div[1]/span[1]')
let text = await page.evaluate(res => res.textContext, title[0])
console.log(text) // UNDEFINED
text is undefined. What is the problem here? Thanks.
I think you need to fix one or two issues in your code:
textContent vs textContext
the XPath
For the content you want, the XPath should be:
const title = await page.$x('/html/body/div[1]/div[2]/div[2]/div/div[2]/div[2]/div[2]/div/div/div[1]/div/div[1]/div[1]/div/div[2]/ul[1]/li[1]/text()')
And to get the content of this:
const text = await page.evaluate(el => {
  return el.textContent.trim()
}, title[0])
Notice you need to send title[0] as an argument to the page function.
Or, if you don't need to use XPath, it seems you could find the element directly by class name:
// assumes $ (jQuery) is available on the page
const rating = await page.evaluate(() => {
  return $('.rating.rating-large.clearfix > li.rating-text')[0].textContent.trim()
})
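One more hedge: if the element renders after navigation, page.$x() may come back empty. In Puppeteer versions that provide it, page.waitForXPath() waits for the node and returns a handle; a sketch using the question's original XPath:

// Wait for the node to exist before reading it
const handle = await page.waitForXPath('/html/body/div[1]/div[2]/div[2]/div/div[2]/div[2]/div[2]/div/div/div[1]/div/header/h3/div[1]/span[1]');
const text = await page.evaluate(el => el.textContent.trim(), handle);
console.log(text);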

Web Scrape with Puppeteer within a table

I am trying to scrape this page.
https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214
I want to be able to find the grade counts for PSA 9 and 10. If you look at the HTML of the page, you will notice that PSA does a very bad job (IMO) of displaying the data: every tr is a player, and the first td is a card number. Let's just say I want to get card number 1, which in this case is Kevin Garnett.
There are a total of four cards, so those are the only four cards I want to display.
Here is the code I have.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214");
  const tr = await page.evaluate(() => {
    const tds = Array.from(document.querySelectorAll('table tr'))
    return tds.map(td => td.innerHTML)
  });
  const getName = tr.map(name => {
    //const thename = Array.from(name.querySelectorAll('td.card-num'))
    console.log("\n\n" + name + "\n\n");
  })
  await browser.close();
})();
I will get each TR printed, but I can't seem to dive into those TRs. You can see I have a line commented out; I tried that but got an error. As of right now, I am not selecting by player dynamically. The easiest way, I would think, is to write a function that selects the TR whose td.card-num equals 1 for Kevin.
Any help with this would be amazing.
Thanks
Short answer: You can just copy and paste that into Excel and it pastes perfectly.
Long answer: If I'm understanding this correctly, you'll need to map over all of the tr elements and then, within each tr, map each td. I use cheerio as a helper. To complete it with puppeteer, just do html = await page.content() and then pass html into the cleaner I've written below:
const cheerio = require("cheerio");
const fs = require("fs");

const test = (html) => {
  // Alternatively, read the HTML from a local file:
  // const data = fs.readFileSync("./test.html");
  // const html = data.toString();
  const $ = cheerio.load(html);
  const array = $("tr").map((index, element) => {
    const card_num = $(element).find(".card-num").text().trim();
    const player = $(element).find("strong").text();
    // Collect every td's span text for this row
    const mini_array = $(element).find("td").map((ind, elem) => {
      const hello = $(elem).find("span").text().trim();
      return hello;
    });
    return {
      card_num,
      player,
      column_nine: mini_array[13],
      column_ten: mini_array[14],
      total: mini_array[15]
    };
  });
  console.log(array[2]);
};

// Call with the page HTML, e.g. test(await page.content())
The code above will output the following:
{
  card_num: '1',
  player: 'Kevin Garnett',
  column_nine: '1-0',
  column_ten: '0--',
  total: '100'
}
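To wire this into Puppeteer as described, a minimal sketch that fetches the rendered HTML and passes it to the cleaner above:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214");
  // Hand the rendered HTML to the cheerio-based cleaner
  const html = await page.content();
  test(html);
  await browser.close();
})();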

Pagination with chromeless in Node.js

I am using the chromeless headless browser on AWS Lambda.
I'm trying to figure out how to paginate content, but I'm new to node and async/await.
This is my code:
const Chromeless = require('chromeless').default

async function run() {
  const chromeless = new Chromeless({})
  var results = [];
  const instance = await chromeless
    .goto('https://www.privacyshield.gov/list')
  for (i = 0; i < 3; i++) {
    console.log('in for');
    instance
      .html()
      .click('a.btn-navigate:contains("Next Results")')
      .wait(3000)
    results.push(html)
  }
  await chromeless.end()
}

run().catch(console.error.bind(console))
but I get the error:
TypeError: Cannot read property 'html' of undefined
which means instance is not defined outside of await. I don't want to create separate instances in each for-loop iteration, as I would lose my position on the page.
It took some time to figure out, but it was interesting; this is my first async/await code in Node too.
const { Chromeless } = require('chromeless')

async function run() {
  const chromeless = new Chromeless()
  let curpos = chromeless
  chromeless.goto('https://www.privacyshield.gov/list')
    .press(13)
    .wait(3000);
  const page1 = await curpos.html()

  curpos = curpos.click('a.btn-navigate')
    .wait(3000);
  const page2 = await curpos.html()

  curpos = curpos.click('a.btn-navigate')
    .wait(3000);
  const page3 = await curpos.html()

  console.log(page1)
  console.log("\n\n\n\n\n\n\n")
  console.log(page2)
  console.log("\n\n\n\n\n\n\n")
  console.log(page3)

  await chromeless.end()
}

run().catch(console.error.bind(console))
I hope you can take it to a loop from there; a sketch follows below.
Interestingly, I was able to convert it into ES5 code and debug it that way.
Hope it helps.
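For reference, an untested sketch of how the same steps might fold into a loop, using only the chromeless calls shown above:

const { Chromeless } = require('chromeless')

async function run() {
  const chromeless = new Chromeless()
  let curpos = chromeless
    .goto('https://www.privacyshield.gov/list')
    .press(13)
    .wait(3000)
  const pages = []
  for (let i = 0; i < 3; i++) {
    // Await the current page's HTML, then queue a click to the next page
    pages.push(await curpos.html())
    curpos = curpos.click('a.btn-navigate').wait(3000)
  }
  await chromeless.end()
  return pages
}

run().then((pages) => pages.forEach((p) => console.log(p))).catch(console.error.bind(console))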
