Repeatedly clicking an element on a page using Electron (NightmareJS) - javascript

I'm writing a page scraper for a dynamic web page. The page has an initial load and then loads the remainder of the content after a short load time.
I've accounted for the load and have successfully scraped the HTML from the page, but the page doesn't load ALL the content at once. Instead it loads a specified amount of content via GET request URL and then has a "Get more" button on the page. My objective is to click this "Get More" button until all the content is loaded on the page. For those wondering, I don't wish to load all the content at once via GET URL because of impact to their server.
I'm stuck forming the loop or iteration that would allow me to repeatedly click on the page.
const NIGHTMARE = require("nightmare");
const BETHESDA = NIGHTMARE({ show: true });
BETHESDA
// Open the bethesda web page. Web page will contain 20 mods to start.
.goto("https://bethesda.net/en/mods/skyrim?number_results=40&order=desc&page=1&platform=XB1&product=skyrim&sort=published&text=")
// Bethesda website serves all requested mods at once. Each mod has the class "tile". Wait for any tile class to appear, then proceed.
.wait(".tile");
let additionalModsPresent = true;
while(additionalModsPresent) {
setTimeout(function() {
BETHESDA
.wait('div[data-is="main-mods-pager"] > button')
.click('div[data-is="main-mods-pager"] > button')
}, 10000)
additionalModsPresent = false;
}
// let moreModsBtn = document.querySelector('div[data-is="main-mods-pager"] > button');
// .end()
BETHESDA.catch(function (error) {
console.error('Search failed:', error);
});
My thinking thus far has been to use a while loop that attempts to click the button after some interval of time. If an error occurs, it's likely because the button doesn't exist. The issue I'm having is that I can't seem to get the click to work inside of a setTimeout or setInterval. I believe there is some sort of scoping issue but I don't know what exactly is going on.
If I can get the click method to work in setInterval or something similar, the issue would be solved.
Thoughts?

You can refer to the issue "Problem running nightmare in loops": https://github.com/segmentio/nightmare/issues/522
I modified your code following the guidelines given there. It seems to work fine:
const NIGHTMARE = require("nightmare");
const BETHESDA = NIGHTMARE({
    show: true
});
BETHESDA
    // Open the bethesda web page. Web page will contain 20 mods to start.
    .goto("https://bethesda.net/en/mods/skyrim?number_results=40&order=desc&page=1&platform=XB1&product=skyrim&sort=published&text=")
    // Bethesda website serves all requested mods at once. Each mod has the class "tile". Wait for any tile class to appear, then proceed.
    .wait(".tile");
next();
function next() {
    BETHESDA.wait('div[data-is="main-mods-pager"] > button')
        .click('div[data-is="main-mods-pager"] > button')
        .then(function() {
            console.log("click done");
            next();
        })
        .catch(function(err) {
            console.log(err);
            console.log("All done.");
        });
}
Eventually, wait() should time out once the button no longer exists, and you can handle that error in the catch() block. Beware that it goes on and on :) I did not wait until the end (you might run out of memory).
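A bounded variant of the recursive loop above can be sketched as follows. This is only a sketch under assumptions: clickUntilGone and maxClicks are hypothetical names, and browser is assumed to be any object whose exists(selector) and click(selector) calls can be awaited, which a Nightmare instance should satisfy because its action chains are thenable.

```javascript
// Sketch: click a "load more" button until it disappears or a click cap is hit.
// `browser` is an assumed object with awaitable exists(selector) and
// click(selector) methods; `maxClicks` is a hypothetical safety cap.
async function clickUntilGone(browser, selector, maxClicks) {
  let clicks = 0;
  while (clicks < maxClicks) {
    // Stop as soon as the button is no longer in the DOM.
    const present = await browser.exists(selector);
    if (!present) break;
    await browser.click(selector);
    clicks += 1;
  }
  // Report how many clicks were actually performed.
  return clicks;
}
```

Called as clickUntilGone(BETHESDA, 'div[data-is="main-mods-pager"] > button', 50), this would stop after at most 50 clicks instead of running until memory is exhausted. On a real page a short wait after each click may also be needed so the next batch has time to load before the button is checked again.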

Related

Cypress foolishly waits for the new page to load after the button has been clicked and even the new page's core actionable resource is ready

I know this has been asked on SO and GitHub already, but none of the solutions suit my scenario: I am testing a successful SSO login by performing keypress/button-click actions on the Google Auth and Identity Auth SSO options.
With these options the final login page URL is too long and not known in its entirety, so it can't be reliably verified with Cypress URL-matching methods. I want Cypress to click the button, wait for the next page only until a certain core actionable element is present, and then stop waiting: type the credentials (username) on that long-URL login page, press Enter to go to the password page, fill in the password, press Enter, and let the rest of the steps continue "naturally".
Below is the code I have. As mentioned in the code comment, after the button click Cypress waits for the new page to fully load even though the new page already has an actionable item, and increasing defaultCommandTimeout to 20000 doesn't help:
describe('Test if <ProjectName> Search page is reachable', () => {
it('Visits the <ProjectName> Search page', () => {
cy.visit('/')
cy.window().then(w => w.beforeReload = true)
// Sign in with Google button, cypress timeouts after this button has been clicked,
// even if the new page has actionable item("identifier" field) ready for use
cy.get('button.<sso-class>').first().click()
cy.window().should('not.have.prop', 'beforeReload')
cy.get('input[name=identifier]').type('<GoogleUsername>').type('{enter}')
cy.get('input[name=password]', {timeout: 4000}).type('<GooglePassword>').type('{enter}')
cy.url().should('eq', '/search')
})
})
After waiting, it times out and throws the error below:
(page load)
--waiting for new page to load--
CypressError
Cypress command timeout of 20000ms exceeded.
Because this error occurred during a after all hook we are skipping all of the remaining tests.
As you can see in the code, I have already tried one fix mentioned in this GH (GitHub) post. The other fix is not suitable because, as I said, I don't know the final login page URL of the Google/Identity server, and it is too long, with too many query strings, to be reliable for login page-load checks.
Does anyone have any better suggestions?

How can all the individual executions of a content.js on different frames of the same page communicate with each other?

So, I'm building an extension that autofills different types of forms. As it's not apparent from the url which form is used on a particular website, I need to match all the urls in my manifest. I'm trying to detect the form by the 'src'-attribute in the web page.
Some of the fields of a certain form are not in the first frame. So "all_frames" has to be true in my manifest. That means content.js fires once for each frame or iFrame.
content.js:
async function checkForPaymentType(value, attribute = 'src') {
return document.querySelectorAll(`[${attribute}*="${value}"]`);
}
let hasThisForm;
document.addEventListener('DOMContentLoaded', () => {
checkForPaymentType('formJs.js').then((value) => {
if(value.length) {
hasThisForm = true;
}
if(hasThisForm)
fillForm();
});
});
The problem now is that only the first frame has the src="formJs.js" attribute on one of its elements, so only the fields in the first frame get filled out.
My solution idea
What I am trying to do is some sort of global boolean variable ('hasThisForm') that can only be set true. So once the first frame detected that there is this form on the website the other frames fire fillForm() as well.
Problems
1. I'm not able to set a variable that can be read from all of the executions.
2. I need the other executions of content.js to wait for the first one.
Another solution would be to have some sort of window.querySelectorAll, so every execution of content.js searches in the whole page and not just in its frame.
Thanks in advance:)
So I figured it out.
As pointed out in the comment from @wOxxOm and in this question (How to get all frames to access one single global variable), you need to manage everything via the background page.
You want to set up a listener in every Frame and send a message only from the top frame (to the background page which sends it back to the whole tab).
After hours of trying I noticed that the listeners weren't even ready when the message from the topFrame was sent. I put in a sleeper before sending the message which is not the ideal way I guess. But there is no "listenerIsSet-Event".
This is my code:
content.js
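// Sleep is not defined in the original snippet; a minimal
// promise-based helper is assumed here.
const Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

document.addEventListener('DOMContentLoaded', () => {
    // Every frame registers a listener for the relayed message.
    chrome.runtime.onMessage.addListener(
        function (msgFromTopFrame) {
            console.log(msgFromTopFrame);
        });
    // Only the top frame sends the message, after giving the
    // other frames time to register their listeners.
    if (window === top) {
        Sleep(1000).then(() => {
            const msgToOtherFrames = {'greeting': 'hello'};
            chrome.runtime.sendMessage(msgToOtherFrames);
        });
    }
});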
background.js
chrome.runtime.onMessage.addListener((msg, sender) => {
    // Relay the top frame's message to every frame in the sending tab.
    if ('greeting' in msg) {
        chrome.tabs.sendMessage(sender.tab.id, msg);
    }
});
You probably want to execute some code depending on the value received. You can write it only once in the listener. It will execute in all frames including the top frame as the background.js sends it to all frames in the tab.
Note:
There may be some errors with the dicts/keys in the messages "sent around" in this code. Just log the message in all the listeners to get the right expressions.

How to waitFor when page refreshes in Puppeteer?

I have an app I'm working with that is behaving like this... You visit a url /refresh, and it loads the page with a loader/spinner/bar showing for like 5 seconds, then it refreshes the page after it's done. It does this so it can load the latest data that was computed during /refresh.
Right now I am just setting a timeout longer than the loader will most likely stay around, but this is brittle because a bad network connection could put it over the line.
How can I instead "watch" for when the refresh happens? What technique would you recommend. It seems to start to get hairy pretty fast.
Into the nitty gritty, when the loader is showing, when it finishes it is gone for like a half a second before the page reload. So I can't just wait til the loader is gone. It seems like I need to keep some sort of state variable around in the DOM like in localStorage, but can't pinpoint it. Would love some help.
Well, you could "watch" for the element that displays the data using page.$(selector), or, if there is no such element, you could wait for the response to a specific request:
const waitForResponse = (page, url) => {
    return new Promise(resolve => {
        page.on("response", function callback(response) {
            if (response.url() === url) {
                resolve(response);
                page.removeListener("response", callback);
            }
        });
    });
};
const res = await waitForResponse(page, "url of the request you want to wait for");
Wait for Network request before continuing process
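Because the helper above never settles if the expected request is not made, a variant with a timeout may be useful. The following is a minimal sketch, not part of the original answer: waitForResponseWithTimeout and timeoutMs are hypothetical names, and page is assumed to emit "response" events and support on/removeListener, as in the snippet above.

```javascript
// Sketch: resolve with the first response whose URL matches, or reject
// after timeoutMs. `page` is assumed to be an event emitter that fires
// "response" events; `timeoutMs` is a hypothetical parameter.
function waitForResponseWithTimeout(page, url, timeoutMs) {
  return new Promise((resolve, reject) => {
    // Give up and detach the listener if nothing matches in time.
    const timer = setTimeout(() => {
      page.removeListener("response", onResponse);
      reject(new Error("Timed out waiting for " + url));
    }, timeoutMs);

    function onResponse(response) {
      if (response.url() === url) {
        clearTimeout(timer);
        page.removeListener("response", onResponse);
        resolve(response);
      }
    }

    page.on("response", onResponse);
  });
}
```

Recent Puppeteer versions also ship a built-in page.waitForResponse, which may make a hand-written helper like this unnecessary.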

Finding Window and Navigating to URL with Crossrider

I'm rather new to Javascript and Crossrider. I believe what I'm trying to do is a rather simple thing - maybe I missed something here?
I am writing an extension that automatically logs you into Dropbox and at a later time will log you out. I can log the user into Dropbox automatically, but now my client wants me to automatically log those people out of dropbox by FINDING the open Dropbox windows and logging each one of them out.
He says he's seen it and it's possible.
Basically what I want is some code that allows me to get the active tabs, and set the location.href of those tabs. Or even close them. So far this is what I got:
//background.js:
appAPI.ready(function($) {
// Initiate background timer
backgroundTimer();
// Function to run background task every 10 seconds
function backgroundTimer() {
if (appAPI.db.get('logout') == true)
{
// retrieves the array of tabs
appAPI.tabs.getAllTabs(function(allTabInfo)
{
// loop through tabs
for (var i=0; i<allTabInfo.length; i++)
{
//is this dropbox?
if (allTabInfo[i].tabUrl.indexOf('www.dropbox.com')!=-1)
{
appAPI.tabs.setActive(allTabInfo[i].tabId);
//gives me something like chrome-extension://...
window.alert(window.location.href);
//code below doesn't work
//window.location.href = 'https://www.dropbox.com/logout';
}
}
appAPI.db.set('logout',false);
});
window.alert('logged out.');
}
setTimeout(function() {
backgroundTimer();
}, 10 * 1000);
}
});
When I do appAPI.tabs.setActive(allTabInfo[i].tabId); and then window.alert(window.location.href); I get as address "chrome-extension://xxx" - which I believe is the address of my extension, which is totally not what I need, but rather the URL of the active window! More than that, I need to navigate the current window to the log out page... or at least refresh it. Can anybody help, please?
-Rowan R. J.
P.S.
Earlier I tried saving the window reference of the dropbox URL I opened, but I couldn't save the window reference into the appAPI.db, so I changed technique. Help!
In general, your use of the Crossrider APIs looks good.
The issue here is that you are trying to use window.location.href to get the address of the active tab. However, in the background scope, the window object relates to the background page/tab and not the active tab; hence you receive the URL of the background page. [Note: Scopes can't directly interact with each other's objects]
Since your objective is to change the URL of, or close, the active Dropbox tab, you can achieve this using messaging between scopes. So, in your example you can send a message from the background scope to the extension page scope with the request to log out. For example (and I've taken the liberty of simplifying the code):
background.js:
appAPI.ready(function($) {
    appAPI.setInterval(function() {
        if (appAPI.db.get('logout')) {
            appAPI.tabs.getAllTabs(function(allTabInfo) {
                for (var i=0; i<allTabInfo.length; i++) {
                    if (allTabInfo[i].tabUrl.indexOf('www.dropbox.com')!=-1) {
                        // Send a message to all tabs using tabId as an identifier
                        appAPI.message.toAllTabs({
                            action: 'logout',
                            tabId: allTabInfo[i].tabId
                        });
                    }
                }
                appAPI.db.set('logout',false);
            });
        }
    }, 10 * 1000);
});
extension.js:
appAPI.ready(function($) {
    // Listen for messages
    appAPI.message.addListener(function(msg) {
        // Logout if the tab ids match
        if (msg.action === 'logout' && msg.tabId === appAPI.getTabId()) {
            // change URL or close code
        }
    });
});
Disclaimer: I am a Crossrider employee

Navigating / scraping hashbang links with javascript (phantomjs)

I'm trying to download the HTML of a website that is almost entirely generated by JavaScript. So, I need to simulate browser access and have been playing around with PhantomJS. Problem is, the site uses hashbang URLs and I can't seem to get PhantomJS to process the hashbang -- it just keeps calling up the homepage.
The site is http://www.regulations.gov. The default takes you to #!home. I've tried using the following code (from here) to try and process different hashbangs.
if (phantom.state.length === 0) {
if (phantom.args.length === 0) {
console.log('Usage: loadreg_1.js <some hash>');
phantom.exit();
}
var address = 'http://www.regulations.gov/';
console.log(address);
phantom.state = Date.now().toString();
phantom.open(address);
} else {
var hash = phantom.args[0];
document.location = hash;
console.log(document.location.hash);
var elapsed = Date.now() - new Date().setTime(phantom.state);
if (phantom.loadStatus === 'success') {
if (!first_time) {
var first_time = true;
if (!document.addEventListener) {
console.log('Not SUPPORTED!');
}
phantom.render('result.png');
var markup = document.documentElement.innerHTML;
console.log(markup);
phantom.exit();
}
} else {
console.log('FAIL to load the address');
phantom.exit();
}
}
This code produces the correct hashbang (for instance, I can set the hash to '#!contactus') but it doesn't dynamically generate any different HTML, just the default page. It does, however, correctly output that hash when I call document.location.hash.
I've also tried to set the initial address to the hashbang, but then the script just hangs and doesn't do anything. For example, if I set the url to http://www.regulations.gov/#!searchResults;rpp=10;po=0 the script just hangs after printing the address to the terminal and nothing ever happens.
The issue here is that the content of the page loads asynchronously, but you're expecting it to be available as soon as the page is loaded.
In order to scrape a page that loads content asynchronously, you need to wait to scrape until the content you're interested in has been loaded. Depending on the page, there might be different ways of checking, but the easiest is just to check at regular intervals for something you expect to see, until you find it.
The trick here is figuring out what to look for - you need something that won't be present on the page until your desired content has been loaded. In this case, the easiest option I found for top-level pages is to manually input the H1 tags you expect to see on each page, keying them to the hash:
var titleMap = {
'#!contactUs': 'Contact Us',
'#!aboutUs': 'About Us'
// etc for the other pages
};
Then in your success block, you can set a recurring timeout to look for the title you want in an h1 tag. When it shows up, you know you can render the page:
if (phantom.loadStatus === 'success') {
    // poll every 300 milliseconds
    var timeoutId = window.setInterval(function () {
        // check for title element you expect to see
        var h1s = document.querySelectorAll('h1');
        if (h1s.length === 0) {
            console.log('No H1 tags found.');
            return;
        }
        // h1s is a node list, not an array, hence the
        // Array.prototype call here
        var found = Array.prototype.some.call(h1s, function (h1) {
            return h1.textContent.trim() === titleMap[hash];
        });
        if (found) {
            // we found it!
            console.log('Found H1: ' + titleMap[hash]);
            phantom.render('result.png');
            console.log("Rendered image.");
            // stop the cycle
            window.clearInterval(timeoutId);
            phantom.exit();
        } else {
            console.log('Found H1 tags, but not ' + titleMap[hash]);
        }
    }, 300);
}
The above code works for me. But it won't work if you need to scrape search results - you'll need to figure out an identifying element or bit of text that you can look for without having to know the title ahead of time.
Edit: Also, it looks like the newest version of PhantomJS now triggers an onResourceReceived event when it gets new data. I haven't looked into this, but you might be able to bind a listener to this event to achieve the same effect.
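The check-at-regular-intervals idea described above can be generalized into a small helper that also gives up after a deadline. This is a minimal sketch under assumptions: pollUntil, intervalMs and timeoutMs are hypothetical names, and it is written for a runtime with Promise support (such as Node.js), which older PhantomJS builds may lack.

```javascript
// Sketch: call `check` every intervalMs until it returns a truthy value,
// rejecting once timeoutMs has elapsed. All names here are hypothetical.
function pollUntil(check, intervalMs, timeoutMs) {
  return new Promise(function (resolve, reject) {
    var started = Date.now();
    var timerId = setInterval(function () {
      var result = check();
      if (result) {
        // Condition met: stop polling and hand back the value.
        clearInterval(timerId);
        resolve(result);
      } else if (Date.now() - started >= timeoutMs) {
        // Deadline passed: stop polling and report the failure.
        clearInterval(timerId);
        reject(new Error('Condition not met within ' + timeoutMs + ' ms'));
      }
    }, intervalMs);
  });
}
```

In a browser context, check could be a function that looks for the expected h1 text and returns the matching element, so the same helper would cover search results as well once a suitable identifying element is known.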
