How do I continuously listen for a new item while scraping a website - javascript

I am using puppeteer to scrape a website that is being live updated, to report the latest item elsewhere.
Currently I'm thinking of accomplishing this by running my async scrape with setInterval and comparing whether the last item has changed, checking every 30 seconds. I assume there has to be a better way of doing this than that.
Here is my current code:
const puppeteer = require('puppeteer');

const playtracker = async () => {
  console.log('loading');
  const browser = await puppeteer.launch({});
  const page = await browser.newPage();
  await page.goto('URL-Being-Scraped');
  await page.waitForSelector('.playlist-tracklist-view');
  let html = await page.$$eval('.playlist-tracklist-view > .playlist-track', tracks => {
    tracks = tracks.filter(track => track.querySelector('.playlist-trackname').textContent);
    tracks = tracks.map(el => el.querySelector('.playlist-trackname').textContent);
    return tracks;
  });
  console.log('logging', html[html.length - 1]);
  await browser.close();
};

setInterval(playtracker, 30000);

There is an API called "MutationObserver". You can check it out on MDN: https://developer.mozilla.org/en-US/docs/Web/API/MutationObserver
What it does is basically run whatever you want whenever the specific element changes. Let's say you have a list you want to listen to. What you would do is:
const listElement = document.querySelector(/* your list selector */);

const callbackFunc = function foo() {
  // do something
};

const yourMutationObserver = new MutationObserver(callbackFunc);
yourMutationObserver.observe(listElement, { childList: true, subtree: true });
You can disconnect your MutationObserver with the yourMutationObserver.disconnect() method whenever you want.
This answer could also help if you're confused about how to implement it: https://stackoverflow.com/a/48145840/14138428
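If you want to keep the page open and push updates out of the browser context instead of relaunching Puppeteer every 30 seconds, here is a minimal sketch (reusing the selectors from the question, and assuming the list container stays in the DOM) that combines a MutationObserver with Puppeteer's page.exposeFunction:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('URL-Being-Scraped');
  await page.waitForSelector('.playlist-tracklist-view');

  // Called from the page context whenever the observer fires.
  await page.exposeFunction('onTrackAdded', trackName => {
    console.log('new track:', trackName);
  });

  // Set up a MutationObserver inside the page that reports the latest track.
  await page.evaluate(() => {
    const list = document.querySelector('.playlist-tracklist-view');
    const observer = new MutationObserver(() => {
      const tracks = document.querySelectorAll('.playlist-tracklist-view > .playlist-track .playlist-trackname');
      if (tracks.length) {
        window.onTrackAdded(tracks[tracks.length - 1].textContent);
      }
    });
    observer.observe(list, { childList: true, subtree: true });
  });
})();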

Related

Puppeteer- How to .click() a single button out of a grid of buttons with same classname?

I'm developing a Nike SNKRS BOT to buy shoes with Puppeteer and Node.js.
I'm having trouble distinguishing and clicking (.click()) the size button (see the screenshot of the HTML in devtools and of the front-end buttons).
Here is my code; I'm not experienced, so I have tried everything:
const xpathButton = '//*[@id="root"]/div/div/div[1]/div/div[1]/div[2]/div/section[1]/div[2]/aside/div/div[2]/div/div[2]/ul/li[1]/button'
const puppeteer = require('puppeteer')
const productUrl = 'https://www.nike.com/it/launch/t/air-max-97-coconut-milk-black'
const idAcceptCookies = "button[class='ncss-btn-primary-dark btn-lg']"
async function givePage() {
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage();
  return page;
}

async function addToCart(page) {
  await page.goto(productUrl);
  await page.waitForSelector(idAcceptCookies);
  await page.click(idAcceptCookies);

  // this is where the issues begin
  // attempt 1
  await page.evaluate(() => document.getElementsByClassName('size-grid-dropdown size-grid-button')[1].click());

  // attempt 2
  const sizeButton = "button[class='size-grid-dropdown size-grid-button'] button[name='42']";
  await page.waitForSelector(sizeButton);
  await page.click(sizeButton);
}

// attempt 3
await page.click(xpathButton)

// attempt 4
document.evaluate("//button[contains( ., '36')]", document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue

async function checkout() {
  var page = await givePage();
  await addToCart(page)
}

checkout()
Attempt number 2 looks like the best approach, except your selector is wrong. The button does not have a name attribute, according to your screenshot, so you will need another approach, closer to attempt 3.
You can use puppeteer to select an element with XPath, and XPath lets you select by an element's text content.
Try this:
await page.waitForXPath('//button[contains(text(), "EU 36")]')
const [button] = await page.$x('//button[contains(text(), "EU 36")]')
await button.click()
Because the XPath selector returns an array of element handles, I destructure the first element in the array (which should be the only match) and assign it to the variable button. That element handle can then be clicked.

Unable to get isDisabled() to work in Playwright

I need to check that a button is disabled (checking for a last page of a table). There are two with the same id (top and bottom of the table).
const nextPageButtons = await this.page.$$('button#_btnNext'); // nextPageButtons.length is 2, checked via console.log
const nextPageButtonState = await nextPageButtons[0].isDisabled();
But when I do the above I get: elementHandle.isDisabled: Unable to adopt element handle from a different document.
Why doesn't this work?
So, this works:
const nextPageButtons = await this.page.$$('button#_btnNext');
const nextPageButton1 = nextPageButtons[0];
const nextPageButton1State = await nextPageButton1.isDisabled();
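On a reasonably recent Playwright version, a sketch using the Locator API is another option; locators are resolved at action time, so there is no element handle to adopt from a different document. The #_btnNext id is taken from the question:
// Re-resolve the element at call time instead of holding an element handle.
const nextPageButton = this.page.locator('button#_btnNext').first();
const onLastPage = await nextPageButton.isDisabled();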

Web Scrape with Puppeteer within a table

I am trying to scrape this page.
https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214
I want to be able to find the grade count for PSA 9 and 10. If we look at the HTML of the page, you will notice that PSA does a very bad job (IMO) at displaying the data. Every TR is a player. And the first TD is a card number. Let's just say I want to get Card Number 1 which in this case is Kevin Garnett.
There are a total of four cards, so those are the only four cards I want to display.
Here is the code I have.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214");

  const tr = await page.evaluate(() => {
    const tds = Array.from(document.querySelectorAll('table tr'));
    return tds.map(td => td.innerHTML);
  });

  const getName = tr.map(name => {
    // const thename = Array.from(name.querySelectorAll('td.card-num'))
    console.log("\n\n" + name + "\n\n");
  });

  await browser.close();
})();
I get each TR printed, but I can't seem to dive into those TRs. You can see I have a line commented out; I tried that but got an error. Right now I am not selecting by player dynamically. The easiest way, I think, would be a function that selects the TR whose td.card-num equals 1, for Kevin.
Any help with this would be amazing.
Thanks
Short answer: You can just copy and paste that table into Excel and it pastes perfectly.
Long answer: If I'm understanding this correctly, you'll need to map over all of the tr elements and then, within each tr, map each td. I use cheerio as a helper. To complete it with Puppeteer, just do html = await page.content() and then pass html into the cleaner I've written below:
const cheerio = require("cheerio");
const fs = require("fs");

const test = (html) => {
  // const data = fs.readFileSync("./test.html");
  // const html = data.toString();
  const $ = cheerio.load(html);
  const array = $("tr").map((index, element) => {
    const card_num = $(element).find(".card-num").text().trim();
    const player = $(element).find("strong").text();
    const mini_array = $(element).find("td").map((ind, elem) => {
      const hello = $(elem).find("span").text().trim();
      return hello;
    });
    return {
      card_num,
      player,
      column_nine: mini_array[13],
      column_ten: mini_array[14],
      total: mini_array[15]
    };
  });
  console.log(array[2]);
};

// test(html) — call with the HTML string returned by page.content() (see the sketch below)
The code above will output the following:
{
  card_num: '1',
  player: 'Kevin Garnett',
  column_nine: '1-0',
  column_ten: '0--',
  total: '100'
}
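For completeness, a sketch of the wiring described above: fetch the page with Puppeteer, grab the rendered HTML with page.content(), and hand it to the cleaner (assuming test is defined in, or imported into, the same file):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214');

  // Grab the rendered HTML and hand it to the cheerio-based cleaner above.
  const html = await page.content();
  test(html);

  await browser.close();
})();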

StaleElementReferenceError on iterations

My application gets a list of IDs from the db. I iterate over these with a cursor and, for every ID, I plug it into a URL with Selenium to get specific items on a page. This is doing a search on a keyword and getting the most relevant item for that search. There are around 1000 results from the db. At random iterations, one of the driver actions throws a StaleElementReferenceError with the full message of:
stale element reference: element is not attached to the page document\n (Session info: chrome=77.0.3865.75)
Looking at the official docs I can see that the 2 common causes for this are:
The element has been deleted entirely.
The element is no longer attached to the DOM.
With the former being the most frequent cause.
index.js
const { MongoClient, ObjectID } = require('mongodb')
const fs = require('fs')
const path = require('path')
const { Builder, Capabilities, until, By } = require('selenium-webdriver')
const chrome = require('selenium-webdriver/chrome')
require('dotenv').config()
async function init() {
  try {
    const chromeOpts = new chrome.Options()
    const ids = fs.readFileSync(path.resolve(__dirname, '..', 'data', 'primary_ids.json'), 'utf8')
    const client = await MongoClient.connect(process.env.DB_URL || 'mongodb://localhost:27017/test', {
      useNewUrlParser: true
    })
    const db = client.db(process.env.DB_NAME || 'test')
    const productCursor = db.collection('product').find(
      {
        accountId: ObjectID(process.env.ACCOUNT_ID),
        primaryId: {
          $in: JSON.parse(ids)
        }
      },
      {
        _id: 1,
        primaryId: 1
      }
    )
    const resultsSelector = 'body #wrapper div.src-routes-search-style__container--2g429 div.src-routes-search-style__products--3rsz9'
    const mostRelevantSelector = `${resultsSelector}
      > div:nth-child(2)
      > div.src-routes-search-product-item-raw-style__product--3vH_O:nth-child(1)`
    const titleContainerSelector = `${mostRelevantSelector}
      > div.src-routes-search-product-item-raw-style__mainPart--1HEWx
      > div.src-routes-search-product-item-raw-style__containerText--3NefD
      > div.src-routes-search-product-item-raw-style__description--3swql
      > div.src-routes-search-product-item-raw-style__titleContainer--tazkH`
    const productImageSelector = `${mostRelevantSelector}
      > div.src-routes-search-product-item-raw-style__mainPart--1HEWx
      > div.src-routes-search-product-item-raw-style__containerImages--1PfdF
      > a.src-routes-search-product-item-raw-style__productImage--1Y42Y
      > img`
    const linkSelector = `${titleContainerSelector} > a`
    const primaryIdSelector = `${titleContainerSelector} > p`
    chromeOpts.setChromeBinaryPath('/usr/local/bin')
    const driver = await new Builder()
      .withCapabilities(Capabilities.chrome())
      .forBrowser('chrome')
      .build()
    let newProds = {}
    let product
    let i = 0
    while (await productCursor.hasNext()) {
      i += 1
      product = await productCursor.next()
      let searchablePrimaryId = product.primaryId
      let link
      let primaryId
      let pId
      let href
      let img
      let imgSrc
      if (product.primaryId.includes('#')) {
        searchablePrimaryId = product.primaryId.substr(0, product.primaryId.indexOf('#'))
      }
      if (searchablePrimaryId.includes('-')) {
        searchablePrimaryId = searchablePrimaryId.substr(0, searchablePrimaryId.indexOf('-'))
      }
      await driver.get(`https://icecat.biz/en/search?keyword=${encodeURIComponent(searchablePrimaryId.toLowerCase())}`)
      link = await driver.wait(until.elementLocated(By.css(linkSelector)), 10000) // wait 10 seconds
      img = await driver.wait(until.elementLocated(By.css(productImageSelector)), 10000)
      imgSrc = await img.getAttribute('src')
      primaryId = await driver.wait(until.elementLocated(By.css(primaryIdSelector)), 10000)
      pId = await primaryId.getText()
      href = await link.getAttribute('href')
      const iceCatId = href.substr(href.lastIndexOf('-') + 1, href.length)
      const _iceCatId = iceCatId.substr(0, iceCatId.indexOf('.html'))
      const idFound = (searchablePrimaryId.toUpperCase() === pId.toUpperCase()) && !imgSrc.includes('logo-fullicecat')
      newProds[product._id.toString()] = {
        primaryId: product.primaryId,
        iceCatId: idFound ? _iceCatId : 'N/A'
      }
    }
    const foundProducts = Object.values(newProds).filter(prod => prod.iceCatId !== 'N/A')
    console.log(`\nFound ${foundProducts.length}/${JSON.parse(ids).length}`)
    fs.writeFileSync(path.resolve(__dirname, '..', 'data', 'new_products.json'), JSON.stringify(newProds, null, 4), 'utf8')
    driver.quit()
  } catch (err) {
    throw err
  }
}

init()
  .then(res => {
    console.log(res)
  })
  .catch(err => {
    console.error(err)
  })
To debug, I put a try...catch around each of the driver actions to see which specific action was failing, but that didn't help because it was never a consistent action: sometimes it was one of the elementLocated lines, other times the getAttribute call.
The latter case is what confuses me: surely Selenium has already found the element on the page (e.g. link), yet it is unable to call getAttribute() on it? I imagine I must be doing something wrong with how I set up Selenium to handle the iterations. The loop never gets past around 110 iterations.
In your case the second cause applies: the element is no longer attached to the DOM. If a WebElement is located and the DOM is refreshed afterwards, that element becomes stale; even if the DOM hasn't changed, the same locator will return a new WebElement.
Normally driver.get() blocks until the page is fully loaded, but this site runs JavaScript to load the search results. You can verify this by running document.readyState in the developer tools console: you will see "complete" while the search results are still loading.
The page shows a spinner before the results are rendered, so it should be enough to wait for the spinner to appear and then become stale before scraping the page:
await driver.get(`https://icecat.biz/en/search?keyword=${encodeURIComponent(searchablePrimaryId.toLowerCase())}`)
let spinner = await driver.wait(until.elementLocated(By.className('src-routes-search-style__loader---acti')), 10000)
await driver.wait(until.stalenessOf(spinner), 10000)
link = await driver.wait(until.elementLocated(By.css(linkSelector)), 10000)
You aren't waiting for the Ajax requests to finish. The website fetches and refreshes the DOM once you scroll to the end, and it also keeps polling the index every few seconds, so the DOM probably keeps updating. You could hold the AJAX requests, get your results, process them, and enable AJAX again.
Could you try removing await from imgSrc = await img.getAttribute('src')? The wait for img is already handled on the previous line.
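If waiting for the spinner is not enough on its own, another common workaround is to re-locate and retry a step when it throws a stale element error. A sketch of a hypothetical helper (the name withStaleRetry is made up for illustration) that could wrap the fragile steps inside the while loop:
// Hypothetical helper: retry a scraping step a few times if the element went stale.
async function withStaleRetry(fn, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      // The DOM was re-rendered between locating and using the element; try again.
      if (err.name !== 'StaleElementReferenceError' || i === attempts - 1) throw err;
    }
  }
}

// Usage inside the loop, re-locating the element on every attempt:
imgSrc = await withStaleRetry(async () => {
  const img = await driver.wait(until.elementLocated(By.css(productImageSelector)), 10000);
  return img.getAttribute('src');
});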

Problems when using Chosen Selectors in external helper functions

I use TestCafe to test a website that uses the jQuery plugin Chosen, and
I want to make an assertion in my test code that depends on a value returned by an external helper function (getSelectedOption).
This function takes a Chosen selector as a parameter and should return the selected value to the assertion, but it always returns the first element of the list instead of the chosen one.
When I use the function's code inline in my test, everything works fine.
It seems that the function doesn't see the actual state of the HTML and can't tell that an element is already selected.
This is a snippet from the test code:
await t
  .click(await getOptionByText('salutation', 'Frau'))
  .expect(await getSelectedOption('gender')).eql('weiblich')
This is a snippet from the external functions:
export const getChosenSelectorFromName = selectName => `#${selectName}_chosen`;

export const getSelectedOption = async selectName => {
  const selectedOptionText = await Selector(getChosenSelectorFromName(selectName))
    .find('.chosen-single')
    .innerText;
  return selectedOptionText.toLowerCase().trim();
};

export const getOptionByText = async (selectName, optionText) => {
  const chosenSelectorString = getChosenSelectorFromName(selectName);
  await t.click(Selector(chosenSelectorString));
  return await Selector(chosenSelectorString)
    .find('.chosen-drop')
    .find('li')
    .withText(optionText);
};
When I use code similar to the getSelectedOption function inline in my test, everything works fine:
const genderSelect = Selector('#gender_chosen');

await t
  .click(await getOptionByText('salutation', 'Frau'))
  .expect(genderSelect.innerText).eql('WEIBLICH')
If you call await Selector(<some value>), TestCafe immediately retrieves the data from the web page at that moment.
You can instead tell TestCafe to keep re-querying the page until the value becomes equal to the expected one.
To do that, move the DOM manipulation into a ClientFunction:
import { Selector, ClientFunction } from "testcafe";

fixture `Fixture`
  .page('https://harvesthq.github.io/chosen/');

const getChosenSelectorFromName = selectName => `#${selectName}_chosen`;

const getSelectedOption = ClientFunction(selector => {
  var choosenDiv = document.querySelector(selector);
  var singleValueEl = choosenDiv.querySelector('.chosen-single');
  return singleValueEl.innerText;
});

test('test', async t => {
  await t.expect(getSelectedOption('.chosen-container')).eql('Choose a Country...');
});
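Alternatively, and in line with the inline code that already worked, a sketch of a helper that returns the Selector's innerText property without awaiting it, so t.expect() can keep re-evaluating it through TestCafe's assertion retry mechanism (note the expected value then has to match the raw innerText, e.g. 'WEIBLICH'):
import { Selector } from 'testcafe';

const getChosenSelectorFromName = selectName => `#${selectName}_chosen`;

// Return the unawaited innerText property so t.expect() can re-evaluate it
// with TestCafe's built-in assertion retries.
export const getSelectedOption = selectName =>
  Selector(getChosenSelectorFromName(selectName)).find('.chosen-single').innerText;

// In the test:
// await t.expect(getSelectedOption('gender')).eql('WEIBLICH');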
