I'm very new to JavaScript, and to Puppeteer as well.
I'm trying to grab some innerHTML from a series of web pages inside a forum. The pages' URLs follow a pattern that has a prefix and '/page-N' at the end, N being the page number.
So I decided to loop through the pages using a for loop and template literals to load a new page URL on each loop, until I reach the final number of pages, contained in the variable C.numberOfPages.
The problem is that the code inside the page.evaluate() function is not working. When I run my code I get a TypeError: Cannot read property of undefined. I've checked, and the source of the problem is that document.getElementById('discussion_subentries') is returning undefined.
I've tested the same code that is inside the page.evaluate() function in Chrome DevTools and it works fine, returning the innerHTML I wanted. All of those chained .children[] lookups were necessary because of the structure of the page I'm scraping, and they work fine in the browser, returning the proper value.
So how do I make it work in my Puppeteer script?
for (let i = 1; i <= C.numberOfPages; i++) {
  let URL = `${C.url}page-${i}`;
  await page.goto(URL);
  await page.waitForSelector('#discussion_subentries');
  let pageData = await page.evaluate(() => {
    let discussionEntries = document.getElementById('discussion_subentries')
      .children[1];
    let discussionEntryMessages = [];
    for (let j = 0; j < discussionEntries.childElementCount; j++) {
      let thisEntryMessage =
        discussionEntries.children[j].children[0].children[1].children[1]
          .children[1].innerHTML;
      discussionEntryMessages.push(thisEntryMessage);
    }
    return discussionEntryMessages;
  });
  entryData.discussionEntryMessages.push(pageData);
}
page.evaluate() is not the problem; it behaves exactly as the DevTools console does. The problem is most probably that waitForSelector is not doing its job and does not wait for the element to be properly loaded before continuing. To confirm that, try debugging with a fixed sleep in place of waitForSelector.
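The debugging step suggested above could look roughly like the sketch below. The names `sleep` and `loadAndCheck` are made up for illustration, `page` is assumed to be a Puppeteer Page object, and the 3000 ms delay is an arbitrary guess rather than a recommended value.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Debugging variant of the loop body from the question.
// `page` is assumed to be a Puppeteer Page; `delayMs` is an arbitrary default.
async function loadAndCheck(page, url, delayMs = 3000) {
  await page.goto(url);
  // Temporary stand-in for page.waitForSelector('#discussion_subentries')
  await sleep(delayMs);
  // Report whether the element exists before anything touches .children
  return page.evaluate(
    () => document.getElementById('discussion_subentries') !== null
  );
}
```

If this returns true with the fixed delay but the original loop still fails, the timing of the wait is the likely culprit. If it returns false, the element is probably not in the page that Puppeteer loaded at all.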
I have been trying to download all USA and Canada server files from the NordVPN website: https://nordvpn.com/ovpn/
I tried to download them manually, but it is time-consuming to scroll down every time and identify each US-related server, so I wrote some simple JavaScript that can be run in the Chrome Inspect Element console:
var servers = document.getElementsByClassName("mr-2");
var inc_servers = [];
for (var i = 0; i < servers.length; i++) {
  var main_server = servers[i];
  var server = servers[i].innerText;
  if(server.includes("ca")){
    var parent_server = main_server.parentElement.parentElement;
    parent_server.querySelector(".Button.Button--primary.Button--small").click();
    inc_servers.push(server);
  }
}
console.log(JSON.stringify(inc_servers));
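I also wrote a simple Python script that automatically clicks "Save" in the file dialog:
import pywinauto

while True:
    try:
        # Attach to the first open "Save As" dialog and click its Save button
        app2 = pywinauto.Application().connect(title=u'Save As', found_index=0)
        app2.SaveAs.Save.click()
    except Exception:
        # No dialog is open yet; keep polling
        pass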
The JavaScript finds all the elements and works up to that point. However, when I let it click each element, it returns an error, maybe because there are too many elements:
VM15575:8 Throttling navigation to prevent the browser from hanging. See https://crbug.com/1038223. Command line switch --disable-ipc-flooding-protection can be used to bypass the protection
Is there a better alternative for my problem? Or is there a way to fix the error message above? I tried running this command in my command prompt: switch --disable-ipc-flooding-protection
but it returns also an error: 'switch' is not recognized as an internal or external command,
operable program or batch file.
I only know basic JavaScript and Python. Thanks.
So right off the bat, your program is simply downloading files too fast.
Adding a small delay between each file download allows your JavaScript to run.
var servers = document.getElementsByClassName("mr-2");
var inc_servers = [];
for (var i = 0; i < servers.length; i++) {
  var main_server = servers[i];
  var server = servers[i].innerText;
  if(server.includes("ca")){
    var parent_server = main_server.parentElement.parentElement;
    // Add 1 second delay between each download (fast enough on my computer.. Might be too fast for yours.)
    await new Promise(resolve => setTimeout(resolve, 1000));
    parent_server.querySelector(".Button.Button--primary.Button--small").click();
  }
}
// Remove the logging. Just tell the user that it's worked
console.log("Done downloading all files.");
This is more of a temporary solution, but this script seems like it only needs to be run once, so it'll work for now.
(Your Python code runs fine; nothing to do there.)
Hope this helped.
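One side note on the filter: server.includes("ca") matches any entry whose text contains "ca" anywhere, and it does not pick up the US servers the question also asks for. A stricter check might look like the sketch below. The function name `pickServers` is made up, and the hostname shape it expects (a country code followed by digits and a dot, such as "us1234.nordvpn.com") is an assumption about how the entries are labelled on the page.

```javascript
// Hypothetical helper: keep only names that START with one of the country codes.
// The "<code><digits>." hostname shape is an assumption, not taken from the site.
function pickServers(names, countryCodes) {
  const pattern = new RegExp(`^(${countryCodes.join('|')})\\d+\\.`, 'i');
  return names.filter((name) => pattern.test(name.trim()));
}

// Example with made-up names:
const sample = ['us1.nordvpn.com', 'ca2.nordvpn.com', 'de5.nordvpn.com', 'uk-ca1.example'];
console.log(pickServers(sample, ['us', 'ca']));
// → [ 'us1.nordvpn.com', 'ca2.nordvpn.com' ]
```

In the console script, the same test could replace the includes("ca") condition before the delayed click.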
I've got a Google Apps Script WebApp that relies on an array of objects that are generated from a Google Spreadsheet. The app uses jquery and miniSearch to provide user functionality.
Currently, I run the server-side function with a success handler at the beginning of the HTML tag and update a "global" variable with the array of objects declared before it.
Index.html:
<script>
  let data
  google.script.run
    .withSuccessHandler(payload=>{
      data = payload}).getLinks() //[{link:body}, {link1:body1}]
  setTimeout(()=>{
    const documents = data
    miniSearch = new MiniSearch(...)
    miniSearch.addAll(documents)}, 2500)
  ...
</script>
Code.gs
function getLinks(){
  .
  .
  .
  // getValues() turns the Range into a 2D array that can be indexed below
  let values = sheet.getRange(1, 1, lastRow, lastCol).getValues()
  for (let row = 0; row < lastRow; row++) {
    let entry = new Data(row + 1, values[row][0], values[row][1], values[row][2], values[row][3], values[row][4], values[row][5], values[row][6])
    allTitles.push(entry)
  }
  return allTitles
}
I simulate waiting for google.script.run to finish by putting a setTimeout of 2500 ms around the miniSearch indexing (which relies on the aforementioned array), and most of the time it works.
The problem is that before the app is run for the first time in a given period, the contents are not cached and the server call takes longer than the setTimeout. As expected, the search function then breaks, since it has no data to run on.
My question is: how do I make the code wait and confirm that google.script.run has returned the needed data?
I have tried doing it with regular promises or async/await functions, but to my understanding, Google runs its server functions asynchronously and there's no way to tell (in code, that is) whether it has finished. I've also tried running it as $(function(){google.script.run..}) to load the contents as soon as the DOM is loaded, but to no avail.
The only way to make sure it has finished is to do the work inside the success handler, like this. If it's still unresponsive, then the problem lies in getLinks, Data, or whatever miniSearch is.
<script>
  // let, not const: the variable is reassigned inside the handler
  let documents = null;
  google.script.run.withSuccessHandler( function(payload) {
    // Runs only after getLinks() has returned on the server
    documents = payload;
    miniSearch = new MiniSearch(...);
    miniSearch.addAll(documents);
  }).getLinks(); //[{link:body}, {link1:body1}]
  ...
</script>
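Since the question mentions promises and async/await, the success handler can also be wrapped in a Promise. This is only a sketch: `runServer` is a made-up helper name, and it assumes the `google` global that Apps Script provides to the page. The only real calls it relies on are `withSuccessHandler` and `withFailureHandler`.

```javascript
// Hypothetical wrapper: returns a Promise that settles when the server call does.
// Assumes the `google` global provided to Apps Script HTML pages.
function runServer(functionName, ...args) {
  return new Promise((resolve, reject) => {
    google.script.run
      .withSuccessHandler(resolve) // resolves with the server function's return value
      .withFailureHandler(reject)  // rejects with the error thrown on the server
      [functionName](...args);
  });
}
```

With that in place, the page code could do `const documents = await runServer('getLinks')` inside an async function and build the search index on the next line, with no setTimeout involved.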
I'm trying to set an HTML input to read-only using ExecuteScriptAsync. I can make it work, but it's not an ideal scenario, so I'm wondering if anyone knows why it doesn't work the way I would expect it to.
I'm using Cef3, version 63.
I tried to see if it's a timing issue, and it doesn't appear to be.
I tried invalidating the view of the browser but that doesn't seem to help.
The code I currently have, which works:
public void SetReadOnly()
{
    var script = @"
    (function(){
        var labelTags = document.getElementsByTagName('label');
        var searchingText = 'Notification Initiator';
        var found;
        for (var i=0; i<labelTags.length; i++)
        {
            if(labelTags[i].textContent == searchingText)
            {
                found = labelTags[i]
                break;
            }
        }
        if(found)
        {
            found.innerHTML='Notification Initiator (Automatic)';
            var input;
            input = found.nextElementSibling;
            if(input)
            {
                input.setAttribute('readonly', 'readonly');
            }
        }})()
    ";
    _viewer.Browser.ExecuteScriptAsync(script);
    _viewer.Browser.ExecuteScriptAsync(script);
}
Now, if I remove
found.innerHTML='Notification Initiator (Automatic)';
the input is no longer shown as read-only. The HTML source of the loaded webpage does show it as read-only, but it seems like the frame doesn't get re-rendered once that property is set.
Another issue is that I'm executing the script twice. If I run it only once I don't get the desired result. I'm thinking this could be a problem with V8 Context that is required for the script to run. Apparently running the script will create the context, so that could be the reason why running it twice works.
I have been trying to figure this out for hours and haven't found anything that would explain this weird behaviour. Does anyone have a clue?
Thanks!
I wrote a script that runs from an eBay listing iframe. It works fine: it runs on $(document).ready(), sends an AJAX request to a remote server, gets some info, and manipulates the DOM in the 'success' callback. Everything works perfectly.
However, I added a piece of code, which should get the document.referrer, and extract some keywords from it, if they exist. Specifically, if a user searches ebay for a product, and clicks on my product from the results, the function extracts the keywords he entered.
Now, the problem is, that function is not running on page load at all. It seems like it blocks the script when it comes to that part. This is the function:
function getKeywords(){
  var index = window.parent.document.referrer.indexOf('_nkw');
  if (index >= 0){
    var currentIndex = index + 5;
    var keywords = '';
    while (window.parent.document.referrer[currentIndex] !== '&'){
      keywords += window.parent.document.referrer[currentIndex++];
    }
    keywords = keywords.split('+');
    return keywords;
  }
}
And I tried calling two logs right after it:
console.log('referrer: ' + window.parent.document.referrer);
console.log(getKeywords());
Neither of them works. It's as if the script stops completely when it reaches the 'window.parent.document.referrer' part.
But, when I put this all in a console, and run it, it works perfectly. It logs the right referrer, and the right keywords.
Does anybody know what might be the issue here?
The reason it works in the console is that the window object there is the outer window and not your iframe. Besides that, in the console:
window.parent === window
// ==> true
So in fact you are running window.document.referrer and not your frame's window.parent.document.referrer.
If you want to access your frame's window object, you should do something like
var myFrame = document.getElementsByClassName('my-frame-class')[0];
myFrame.contentWindow === window
// ==> false
myFrame.contentWindow.parent.window === window
// ==> true
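This might help you debug your problem, but I suspect the browser is simply preventing the inner iframe from accessing the parent's window object. If the frame and the parent page are served from different origins, the same-origin policy makes window.parent.document throw a security error, which would stop the script at exactly that line.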
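Separately from the cross-frame access question, the keyword extraction itself can be written without the manual character loop, which never terminates when no '&' follows the _nkw value. The sketch below takes the referrer as a plain string argument; `extractKeywords` is a made-up name and the example URL is invented for illustration.

```javascript
// Hypothetical replacement for getKeywords(): parses the _nkw query parameter
// from a referrer string instead of scanning it character by character.
function extractKeywords(referrer) {
  let params;
  try {
    params = new URL(referrer).searchParams;
  } catch (err) {
    return []; // empty or malformed referrer
  }
  const raw = params.get('_nkw'); // '+' is decoded to a space here
  if (!raw) return [];
  return raw.split(/\s+/).filter(Boolean);
}

console.log(extractKeywords('https://www.ebay.com/sch/i.html?_nkw=red+shoes&_sacat=0'));
// → [ 'red', 'shoes' ]
```

Which referrer string to pass in still depends on what the frame is allowed to read, so this only addresses the parsing step.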
I'm trying to load successive URLs using JavaScript from the Firefox developer console.
So far, I've tried with different versions of this code:
function redirect() {
  var urls = ["http://www.marca.com", "http://www.yahoo.es"]
  for (i = 0; i < urls.length; i++) {
    setTimeout(location.assign(urls[i], 5000));
  }
}
But the result of this code is that it only redirects to the last URL in the array. Every page should be fully loaded before moving on to the next one.
I've also tried using window.onload, but with no luck either. It's always the last URL that gets loaded.
I guess this must be something very basic (I'm new to JavaScript), but I can't find any solution to it.
Any help or hints about what I'm doing wrong here would be much appreciated. Thanks in advance!
Inspecting the console after running your code, it appears that a request is sent to the first URL but is soon aborted when the loop runs for the second time and instead it redirects to the latest URL.
My suggestion would be to open the pages in different tabs, if that works for you.
You can do:
for (i = 0; i < urls.length; i++) {
  window.open(urls[i],"_blank");
}
This will open the pages in new tabs.
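A note on why the original loop behaves as described: in setTimeout(location.assign(urls[i], 5000)) the closing parenthesis is misplaced, so location.assign is called immediately on every iteration and setTimeout never receives a callback or a delay. If the tabs should open some seconds apart, a sketch along these lines may help. `openStaggered` is a made-up name, and its `open` and `schedule` parameters exist only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: opens each URL in a new tab, `delayMs` apart.
// `open` and `schedule` default to the browser functions and are injectable.
function openStaggered(
  urls,
  delayMs,
  open = (url) => window.open(url, '_blank'),
  schedule = setTimeout
) {
  urls.forEach((url, index) => {
    // Pass a function to the scheduler, not the result of calling one
    schedule(() => open(url), index * delayMs);
  });
}
```

In the console this would be called as openStaggered(["http://www.marca.com", "http://www.yahoo.es"], 5000). Browsers may block pop-ups that are not triggered directly by a user action, so the later tabs might need pop-ups to be allowed for the page.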