pagination with chromeless in node.js - javascript

I am using the chromeless headless browser on AWS Lambda.
I'm trying to figure out how to paginate content, but I'm new to node and async/await.
This is my code:
const Chromeless = require('chromeless').default
async function run() {
const chromeless = new Chromeless({})
var results = [];
const instance = await chromeless
.goto('https://www.privacyshield.gov/list')
for (i = 0; i < 3; i++)
{
console.log('in for');
instance
.html()
.click('a.btn-navigate:contains("Next Results")')
.wait(3000)
results.push(html)
}
await chromeless.end()
}
run().catch(console.error.bind(console))
but I get the error:
TypeError: Cannot read property 'html' of undefined
which means instance is not defined outside of await. I don't want to create separate instances in each for loop iteration, as I would lose my position on the page.

It took some time to figure out, but it was interesting; this is my first async/await code in Node too.
const { Chromeless } = require('chromeless')
async function run() {
const chromeless = new Chromeless()
let curpos = chromeless
chromeless.goto('https://www.privacyshield.gov/list')
.press(13)
.wait(3000);
const page1 = await curpos.html()
curpos = curpos.click('a.btn-navigate')
.wait(3000);
const page2 = await curpos.html()
curpos = curpos.click('a.btn-navigate')
.wait(3000);
const page3 = await curpos.html()
console.log(page1)
console.log("\n\n\n\n\n\n\n")
console.log(page2)
console.log("\n\n\n\n\n\n\n")
console.log(page3)
await chromeless.end()
}
run().catch(console.error.bind(console))
I hope you can turn it into a loop from there.
Interestingly, I was able to convert it into ES5 code and debug it.
Hope it helps.
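One way the repeated click-and-read step might be folded into a loop is sketched below. It assumes, as the code above does, that a single Chromeless instance can be awaited repeatedly and keeps its position on the page. The names collectPages, pageCount, nextSelector and waitMs are illustrative only, not part of Chromeless.

```javascript
// Sketch: gather the HTML of several result pages with one Chromeless instance.
// `browser` is assumed to be a Chromeless instance with goto(...) already queued.
async function collectPages(browser, pageCount, nextSelector = 'a.btn-navigate', waitMs = 3000) {
  const pages = [];
  for (let i = 0; i < pageCount; i++) {
    // Awaiting html() runs the queued commands and yields the current page HTML.
    pages.push(await browser.html());
    // Click "next" only if another page is still wanted.
    if (i < pageCount - 1) {
      await browser.click(nextSelector).wait(waitMs);
    }
  }
  return pages;
}
```

Inside run() this could be called as const pages = await collectPages(chromeless, 3) after the goto(...) chain, followed by await chromeless.end().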

Related

JS Image scraper

I thought making a basic image scraper would be a fun project. The code down below works in the console on the website but I don't know how to get it to work from my app.js.
var anchors = document.getElementsByTagName('a');
var hrefs = [];
for(var i=0; i < anchors.length; i++){
var src = anchors[i].href;
if(src.endsWith(".jpeg")) {
hrefs.push(anchors[i].href);
}} console.log(hrefs);
I thought using puppeteer was a good idea but my knowledge is too limited to determine whether that's right or not. This is my puppeteer code:
const puppeteer = require("puppeteer");
async function scrape(url) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
var anchors = await page.evaluate(() => document.getElementsByTagName('a'));
var hrefs = [];
for(var i=0; i < anchors.length; i++){ var img = anchors[i].href;
if(img.endsWith(".jpeg")) {
hrefs.push(anchors[i].href);
}} console.log({hrefs}, {img});
browser.close();
}
I understand that the last part of the code is wrong but I can't find a solid answer to what to be written instead.
Thank you for taking your time.
page.evaluate() can only transfer serializable values (roughly, the values JSON can handle). As document.getElementsByTagName() returns a collection of DOM elements that are not serializable (they contain methods and circular references), each element in the collection is replaced with an empty object. You need to either return a serializable value (for example, an array of texts or href attributes) or use something like page.$$(selector) and the ElementHandle API.
The Web API is not defined outside of the .evaluate() argument function, so you need to place all the Web API code inside that function and return serializable data from it.
const puppeteer = require("puppeteer");
async function scrape(url) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const data = await page.evaluate(() => {
const anchors = document.getElementsByTagName('a');
const hrefs = [];
for (let i = 0; i < anchors.length; i++) {
const img = anchors[i].href;
if (img.endsWith(".jpeg")) {
hrefs.push(img);
}
}
return hrefs;
});
console.log(data);
await browser.close();
}
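The same idea can be written more compactly with page.$$eval(), which runs a callback over all elements matching a selector inside the page and hands back only the callback's serializable result. This is a minimal sketch that assumes page is a Puppeteer page that has already navigated to the target URL; collectLinks and its extension parameter are names made up for this example.

```javascript
// Sketch: return the href of every <a> on the page that ends with `extension`.
// `page` is assumed to be a Puppeteer Page that has already loaded the target URL.
async function collectLinks(page, extension = '.jpeg') {
  // The callback runs in the browser; only the array of strings crosses back to Node.
  return page.$$eval(
    'a',
    (anchors, ext) => anchors.map(a => a.href).filter(href => href.endsWith(ext)),
    extension
  );
}
```

In the scrape() function above, this could replace the page.evaluate() call with const data = await collectLinks(page).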

How do I continuously listen for a new item while scraping a website

I am using puppeteer to scrape a website that is being live updated, to report the latest item elsewhere.
Currently, the way I was thinking of accomplishing this is to run a setInterval call on my async scrape and compare whether the last item has changed, checking every 30 seconds. I assume there has to be a better way of doing this than that.
Here is my current code:
const puppeteer = require('puppeteer');
playtracker = async () => {
console.log('loading');
const browser = await puppeteer.launch({});
const page = await browser.newPage();
await page.goto('URL-Being-Scraped');
await page.waitForSelector('.playlist-tracklist-view');
let html = await page.$$eval('.playlist-tracklist-view > .playlist-track', tracks => {
tracks = tracks.filter(track => track.querySelector('.playlist-trackname').textContent);
tracks = tracks.map(el => el.querySelector('.playlist-trackname').textContent);
return tracks;
});
console.log('logging', html[html.length-1]);
};
setInterval(playtracker, 30000)
There is an API called "MutationObserver". You can check it out on MDN. Here's the link: https://developer.mozilla.org/en-US/docs/Web/API/MutationObserver
Basically, it runs a callback of your choice whenever a specific element changes. Let's say you have a list you want to listen to. What you would do is:
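const listElement = document.querySelector('<list element selector>'); // placeholder selector
const callbackFunc = function foo(mutations) {
  // do something
}
const yourMutationObserver = new MutationObserver(callbackFunc)
// observe() requires an options object; childList reports added or removed child nodes
yourMutationObserver.observe(listElement, { childList: true })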
You can disconnect your mutation observer with the yourMutationObserver.disconnect() method whenever you want.
This could help too if you are confused about how to implement it: https://stackoverflow.com/a/48145840/14138428
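With Puppeteer, the observer has to run inside the page, so its callback needs a way to reach Node. One possible wiring uses page.exposeFunction() to publish a Node callback to the page. The sketch below assumes new tracks are added somewhere under .playlist-tracklist-view; the names watchTracks and reportTrack are hypothetical.

```javascript
// Sketch: call `onTrack(name)` in Node whenever the track list changes in the page.
// `page` is assumed to be a Puppeteer Page that has already loaded the playlist.
async function watchTracks(page, onTrack) {
  // Makes window.reportTrack available in the page; calls are forwarded to Node.
  await page.exposeFunction('reportTrack', onTrack);
  await page.evaluate(() => {
    const list = document.querySelector('.playlist-tracklist-view');
    const observer = new MutationObserver(() => {
      const names = list.querySelectorAll('.playlist-trackname');
      const last = names[names.length - 1];
      if (last) window.reportTrack(last.textContent);
    });
    // childList + subtree: fire when nodes are added or removed anywhere under the list.
    observer.observe(list, { childList: true, subtree: true });
  });
}
```

In playtracker this would replace the setInterval polling, for example await watchTracks(page, name => console.log('logging', name)), with the browser left open for as long as updates are wanted.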

Caching custom emojis from another shard

I'm trying to get a larger Discord bot of mine to save all custom emojis it grabs from another shard to a cache so I can serve better response times for each shard. To give a theoretical example, my bot spawns 4 shards and only one shard serves the guild that contains all the custom emojis that I want to use across all shards. I am using this function to grab the emojis, but I need to await each one, and it can make my response times up to 15 seconds as there are many emojis I need to grab:
function findEmoji(id) {
const temp = this.emojis.get(id);
if (!temp) return null;
const emoji = Object.assign({}, temp);
if (emoji.guild) emoji.guild = emoji.guild.id;
emoji.require_colons = emoji.requiresColons;
return emoji;
}
async function grabEmoji(emojiID) {
const emojiArray = await client.shard.broadcastEval(`(${findEmoji}).call(this, '${emojiID}')`);
const foundEmoji = emojiArray.find(emoji => emoji);
if (!foundEmoji) return;
const raw = await client.rest.makeRequest('get', Discord.Constants.Endpoints.Guild(foundEmoji.guild).toString(), true);
const guild = new Discord.Guild(client, raw);
const emoji = new Discord.Emoji(guild, foundEmoji);
return await emoji;
}
// then when I send the message, I call the function with the said ID of the emoji I want:
await grabEmoji("530800350656200705");
On the other hand, when I remove await, it will give me listener errors (maxListeners reached) or whatever, and then display "null".
Here is what I have tried, but I haven't been able to get it to work.
const emojiMap = new Map();
createMap();
async function createMap() {
let woodenPick = await grabEmoji("601256699629797379"),
stonePick = await grabEmoji("601256431076769803"),
ironPick= await grabEmoji("601257055285673987"),
goldPick = await grabEmoji("601256566670491658"),
diamondPick= await grabEmoji("601256973798998016"),
emeraldPick = await grabEmoji("601256896577404938"),
rubyPick = await grabEmoji("601256312696733696"),
ultimatePick= await grabEmoji("629817042954092545"),
sandstonePick = await grabEmoji("629817043142705152"),
aquaPick = await grabEmoji("629817733902761985"),
techPick = await grabEmoji("502940161085014046"),
stone = await grabEmoji("502940717883064321"),
coal = await grabEmoji("502940528149659649"),
iron =await grabEmoji("502940160942669824"),
gold = await grabEmoji("493801062856392705"),
diamond= await grabEmoji("493805522466766849"),
obsidian =await grabEmoji("493801062671581184"),
emerald = await grabEmoji("630846535025819649"),
ruby =await grabEmoji("502940161001259018"),
lapis = await grabEmoji("502940160988807188"),
redstone= await grabEmoji("632411168601931822"),
silver = await grabEmoji("632413243503149087"),
neonite = await grabEmoji("632413243708801024"),
pulsatingStar= await grabEmoji("632404511759138816"),
sapphire = await grabEmoji("642799734192341013"),
developerBadge = await grabEmoji("642799734209118221"),
staffBadge = await grabEmoji("642799734209118221"),
donatorBadge = await grabEmoji("642799734247129089"),
contributorBadge = await grabEmoji("642799734247129089");
emojiMap.set(['woodenPick', woodenPick])
emojiMap.set(['stonePick', stonePick])
emojiMap.set(['ironPick', ironPick])
emojiMap.set(['goldPick', goldPick])
emojiMap.set(['diamondPick', diamondPick])
emojiMap.set(['emeraldPick', emeraldPick])
emojiMap.set(['rubyPick', rubyPick])
emojiMap.set(['ultimatePick', ultimatePick])
emojiMap.set(['sandstonePick', sandstonePick])
emojiMap.set(['aquaPick', aquaPick])
emojiMap.set(['techPick', techPick])
emojiMap.set(['stone', stone])
emojiMap.set(['coal', coal])
emojiMap.set(['iron', iron])
emojiMap.set(['gold', gold])
emojiMap.set(['diamond', diamond])
emojiMap.set(['obsidian', obsidian])
emojiMap.set(['emerald', emerald])
emojiMap.set(['ruby', ruby])
emojiMap.set(['lapis', lapis])
emojiMap.set(['redstone', redstone])
emojiMap.set(['silver', silver])
emojiMap.set(['neonite', neonite])
emojiMap.set(['pulsatingStar', pulsatingStar])
emojiMap.set(['sapphire', sapphire])
emojiMap.set(['developerBadge', developerBadge])
emojiMap.set(['staffBadge', staffBadge])
emojiMap.set(['donatorBadge', donatorBadge])
emojiMap.set(['contributorBadge', contributorBadge])
}
client.on('message', ... //rest of the code continues for my command handler.
//grab emojis
let woodenPick = emojiMap.get('woodenPick')
let stonePick = emojiMap.get('stonePick')
let ironPick = emojiMap.get('ironPick')
let goldPick = emojiMap.get('goldPick')
let diamondPick = emojiMap.get('diamondPick')
let emeraldPick = emojiMap.get('emeraldPick')
let rubyPick = emojiMap.get('rubyPick')
let ultimatePick = emojiMap.get('ultimatePick')
let sandstonePick = emojiMap.get('sandstonePick')
let aquaPick = emojiMap.get('aquaPick')
let techPick = emojiMap.get('techPick')
let stone = emojiMap.get('stone')
let coal = emojiMap.get('coal')
let iron = emojiMap.get('iron')
let gold = emojiMap.get('gold')
let diamond = emojiMap.get('diamond')
let obsidian = emojiMap.get('obsidian')
let emerald = emojiMap.get('emerald')
let ruby = emojiMap.get('ruby')
let lapis = emojiMap.get('lapis')
let redstone = emojiMap.get('redstone')
let silver = emojiMap.get('silver')
let neonite = emojiMap.get('neonite')
let pulsatingStar = emojiMap.get('pulsatingStar')
let sapphire = emojiMap.get('sapphire')
let developerBadge= emojiMap.get('developerBadge')
let staffBadge = emojiMap.get('staffBadge')
let donatorBadge = emojiMap.get('donatorBadge')
let contributorBadge = emojiMap.get('contributorBadge')
Doing that returns undefined.
Does anyone have any ideas? I'm directly saving the emoji object to the map thinking I can just use it later.
Getting discord.js's abstraction of an Emoji object in order to display it in a message is extremely unnecessary, but I can't blame you as discord.js tends to nudge its users towards these kinds of practices.
You already know the emoji names and IDs. There is no other information that you need to get from your other shards that your bot doesn't already have. In Discord messages, custom emoji are represented like this:
Custom Emoji - <:NAME:ID> -> <:mmLol:216154654256398347>
Custom Emoji (Animated) - <a:NAME:ID> -> <a:b1nzy:392938283556143104>
source: discord api docs reference
Hence, you don't need to make any requests, any broadcast evals, or anything of the sort: you only need static data. Like this:
let emojiMap = {
woodenPick: "601256699629797379",
stonePick: "601256431076769803",
//etc
};
I recommend a util function for putting the emoji in a message:
function getEmoji(name) {
return `<:${name}:${emojiMap[name]}>`;
}
Use it like this:
await msg.react(emojiMap.woodenPick); //might need to be :name:id
//etc, probably use an array for that (or Object.keys(emojiMap))
//make embed
let description = `${getEmoji("woodenPick")} --> Wooden Pickaxe\netc...`;
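As a side note on the Map attempt in the question: Map.prototype.set() takes the key and the value as two separate arguments. Passing a single array stores that array as the key with an undefined value, so a later get('woodenPick') finds nothing. A small sketch of the difference, using a placeholder string instead of a real emoji object:

```javascript
// Sketch: why emojiMap.get('woodenPick') came back undefined.
const broken = new Map();
broken.set(['woodenPick', 'emoji-object']); // the array becomes the key; the value is undefined

const fixed = new Map();
fixed.set('woodenPick', 'emoji-object');    // key and value as separate arguments

// The [key, value] pair form is what the Map constructor accepts instead.
const fromPairs = new Map([['woodenPick', 'emoji-object']]);

console.log(broken.get('woodenPick'));    // undefined
console.log(fixed.get('woodenPick'));     // emoji-object
console.log(fromPairs.get('woodenPick')); // emoji-object
```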

StaleElementReferenceError on iterations

My application gets a list of IDs from the db. I iterate over these with a cursor & for every ID, I plug it into a URL with Selenium to get specific items on a page. This is doing a search on a keyword & getting the most relevant item to that search. There are around 1000 results from the db. At random iterations, 1 of the driver actions will throw a StaleElementReferenceError with the full message of:
stale element reference: element is not attached to the page document\n (Session info: chrome=77.0.3865.75)
Looking at the official docs I can see that the 2 common causes for this are:
The element has been deleted entirely.
The element is no longer attached to the DOM.
With the former being the most frequent cause.
index.js
const { MongoClient, ObjectID } = require('mongodb')
const fs = require('fs')
const path = require('path')
const { Builder, Capabilities, until, By } = require('selenium-webdriver')
const chrome = require('selenium-webdriver/chrome')
require('dotenv').config()
async function init() {
try {
const chromeOpts = new chrome.Options()
const ids = fs.readFileSync(path.resolve(__dirname, '..', 'data', 'primary_ids.json'), 'utf8')
const client = await MongoClient.connect(process.env.DB_URL || 'mongodb://localhost:27017/test', {
useNewUrlParser: true
})
const db = client.db(process.env.DB_NAME || 'test')
const productCursor = db.collection('product').find(
{
accountId: ObjectID(process.env.ACCOUNT_ID),
primaryId: {
$in: JSON.parse(ids)
}
},
{
_id: 1,
primaryId: 1
}
)
const resultsSelector = 'body #wrapper div.src-routes-search-style__container--2g429 div.src-routes-search-style__products--3rsz9'
const mostRelevantSelector = `${resultsSelector}
> div:nth-child(2)
> div.src-routes-search-product-item-raw-style__product--3vH_O:nth-child(1)`
const titleContainerSelector = `${mostRelevantSelector}
> div.src-routes-search-product-item-raw-style__mainPart--1HEWx
> div.src-routes-search-product-item-raw-style__containerText--3NefD
> div.src-routes-search-product-item-raw-style__description--3swql
> div.src-routes-search-product-item-raw-style__titleContainer--tazkH`
const productImageSelector = `${mostRelevantSelector}
> div.src-routes-search-product-item-raw-style__mainPart--1HEWx
> div.src-routes-search-product-item-raw-style__containerImages--1PfdF
> a.src-routes-search-product-item-raw-style__productImage--1Y42Y
> img`
const linkSelector = `${titleContainerSelector} > a`
const primaryIdSelector = `${titleContainerSelector} > p`
chromeOpts.setChromeBinaryPath('/usr/local/bin')
const driver = await new Builder()
.withCapabilities(Capabilities.chrome())
.forBrowser('chrome')
.build()
let newProds = {}
let product
let i = 0
while (await productCursor.hasNext()) {
i += 1
product = await productCursor.next()
let searchablePrimaryId = product.primaryId
let link
let primaryId
let pId
let href
let img
let imgSrc
if (product.primaryId.includes('#')) {
searchablePrimaryId = product.primaryId.substr(0, product.primaryId.indexOf('#'))
}
if (searchablePrimaryId.includes('-')) {
searchablePrimaryId = searchablePrimaryId.substr(0, searchablePrimaryId.indexOf('-'))
}
await driver.get(`https://icecat.biz/en/search?keyword=${encodeURIComponent(searchablePrimaryId.toLowerCase())}`)
link = await driver.wait(until.elementLocated(By.css(linkSelector)), 10000) // wait 10 seconds
img = await driver.wait(until.elementLocated(By.css(productImageSelector)), 10000)
imgSrc = await img.getAttribute('src')
primaryId = await driver.wait(until.elementLocated(By.css(primaryIdSelector)), 10000)
pId = await primaryId.getText()
href = await link.getAttribute('href')
const iceCatId = href.substr(href.lastIndexOf('-') + 1, href.length)
const _iceCatId = iceCatId.substr(0, iceCatId.indexOf('.html'))
const idFound = (searchablePrimaryId.toUpperCase() === pId.toUpperCase()) && !imgSrc.includes('logo-fullicecat')
newProds[product._id.toString()] = {
primaryId: product.primaryId,
iceCatId: idFound ? _iceCatId : 'N/A'
}
}
const foundProducts = Object.values(newProds).filter(prod => prod.iceCatId !== 'N/A')
console.log(`\nFound ${foundProducts.length}/${JSON.parse(ids).length}`)
fs.writeFileSync(path.resolve(__dirname, '..', 'data', 'new_products.json'), JSON.stringify(newProds, null, 4), 'utf8')
driver.quit()
} catch(err) {
throw err
}
}
init()
.then(res => {
console.log(res)
})
.catch(err => {
console.error(err)
})
To debug, I put a try...catch around each of the driver actions to see which specific action was failing, but that didn't help as it was never a consistent action that failed. For example, sometimes it would have been one of the elementLocated lines, and other times it would have just been the getAttribute action.
The latter case is what confuses me: surely selenium has found the element on the page (i.e. link), so why is it unable to do getAttribute('src') on the element? I imagine I must be doing something wrong with how I am setting up selenium to handle iterations. The iterations never get higher than 110.
In your case it is the second cause: the element is no longer attached to the DOM. If a WebElement is located and the DOM is refreshed afterwards, this element becomes stale even if the DOM hasn't changed; the same locator will return a new WebElement.
Normally, driver.get() will block until the page is fully loaded; however, this site is running JavaScript to load the search results. You can test it by running document.readyState in the developer tools console: you will see "complete" while the search results are still loading.
The page shows a spinner before the results are located. Hopefully it will be enough to wait for it to appear and become stale before scraping the page:
await driver.get(`https://icecat.biz/en/search?keyword=${encodeURIComponent(searchablePrimaryId.toLowerCase())}`)
// until.elementIsVisible() expects a WebElement, so locate the spinner first
let spinner = await driver.wait(until.elementLocated(By.className('src-routes-search-style__loader---acti')), 10000)
// then wait until the spinner is removed from the DOM
await driver.wait(until.stalenessOf(spinner), 10000)
link = await driver.wait(until.elementLocated(By.css(linkSelector)), 10000)
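Because re-running the same locator returns a fresh WebElement, another option is to repeat the whole locate-and-read step when it fails with a stale reference. The helper below is a generic sketch, not part of selenium-webdriver; retryOnStale and attempts are made-up names, and it assumes the thrown error carries the name StaleElementReferenceError, as in the error quoted in the question.

```javascript
// Sketch: re-run `action` when it fails because an element went stale.
// `action` should locate the element AND read from it, so each retry gets a fresh element.
async function retryOnStale(action, attempts = 3) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await action();
    } catch (err) {
      // Any other kind of error is not retried.
      if (err.name !== 'StaleElementReferenceError') throw err;
      lastError = err;
    }
  }
  // Every attempt hit a stale element: surface the last error.
  throw lastError;
}
```

It could wrap a step from the question like this: imgSrc = await retryOnStale(async () => (await driver.wait(until.elementLocated(By.css(productImageSelector)), 10000)).getAttribute('src')).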
You don't wait for the Ajax requests to finish. The website retrieves data and refreshes the DOM once you go to the end, and also keeps calling index every few seconds, so the DOM probably keeps updating. You can probably hold the AJAX requests, get your results, process them, and enable AJAX again.
Could you try removing "await" from imgSrc = await img.getAttribute('src')? The wait for img is already handled in the previous line.

Register 2 senderIds in JavaScript FCM

On a web page, I'm trying to create 2 firebase apps, with different names, each one associated with a different senderId.
I'm basically doing this:
const init = async ()=>{
const senderId1 = "SENDERID_1";
const senderId2 = "SENDERID_2";
const firebase1 = firebase.initializeApp({"messagingSenderId": senderId1},`name${senderId1}`);
const firebase2 = firebase.initializeApp({"messagingSenderId": senderId2},`name${senderId2}`);
const messaging1 = firebase1.messaging();
const messaging2 = firebase2.messaging();
await messaging1.requestPermission();
await messaging2.requestPermission();
const token1 = await messaging1.getToken(senderId1,"FCM");
const token2 = await messaging2.getToken(senderId2,"FCM");
document.querySelector("#token1").innerHTML = token1;
document.querySelector("#token2").innerHTML = token2;
document.querySelector("#areTheSame").innerHTML = (token1 == token2);
};
init();
Here's a page that exemplifies this behaviour.
The code doesn't generate any errors but token1 is always the same as token2. Obviously I need them to be different. Seems like a caching issue?
Does anyone have any idea if there's a workaround for this?
Thanks in advance
