The "await" property of "async" function sleeps after an instance

I am working on a scraper using PhantomJS with Node.js. PhantomJS loads the page through an async function, like this: var status = await page.open(url). Sometimes, because of a slow internet connection, the page takes longer to load, and after a while the page status is never returned, so I cannot check whether the page has loaded. page.open() hangs without returning anything, and all further execution waits on it.
So my basic question is: is there any way to keep page.open(url) alive, given that the rest of the code waits until the page is loaded?
My code is:
const phantom = require('phantom');

(async () => {
    // await is only valid inside an async function in a CommonJS script
    const ph_instance = await phantom.create();
    const ph_page = await ph_instance.createPage();
    const status = await ph_page.open("https://www.cscscholarship.org/");
    if (status === 'success') {
        console.log("Page is loaded successfully !");
        //do more stuff
    }
})();

From your comment, it seems the request might be timing out (because of the occasionally slow internet). You can verify this by adding an onResourceTimeout handler to your code (link: http://phantomjs.org/api/webpage/handler/on-resource-timeout.html).
It would look something like this:
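// onResourceTimeout is a page-level handler; with the phantom Node package
// it is registered on the page object via on()
await ph_page.on('onResourceTimeout', (request) => {
    console.log('Timeout caught:' + JSON.stringify(request));
});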
And if that turns out to be the case, you can increase the default resource timeout setting (link: http://phantomjs.org/api/webpage/property/settings.html) before opening the page, like this:
await ph_page.setting('resourceTimeout', 60000); // 60 seconds, set on the page
Edit: I know the question is about phantom, but I also wanted to mention another framework I've used for scraping projects before, called Puppeteer (link: https://pptr.dev/). I personally found its API easier to understand and code against, and it is currently maintained, unlike PhantomJS, which is no longer maintained (its last release was two years ago).
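Independently of PhantomJS's own timeout settings, the await itself can be guarded so that it never hangs forever. Below is a minimal sketch, assuming a hypothetical helper named withTimeout (it is not part of the phantom package) that races any promise against a timer:

```javascript
// Hypothetical helper (not part of the phantom package): rejects if
// `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch (assumes ph_page from the question, inside an async function):
//   const status = await withTimeout(ph_page.open(url), 30000, 'page.open');
```

Note that this only stops the caller from waiting. It does not cancel the underlying page load, so after a timeout you would probably want to close the page or retry.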

Related

Apify web scraper task not stable. Getting different results between runs minutes apart

I'm building a very simple scraper to get the 'now playing' info from an online radio station I like to listen to.
It's stored in a simple p element on their site:
[Image: data html location]
Now using the standard apify/web-scraper I run into a strange issue. The scraping sometimes works, but sometimes doesn't using this code:
async function pageFunction(context) {
    const { request, log, jQuery } = context;
    const $ = jQuery;
    const nowPlaying = $('p.js-playing-now').text();
    return {
        nowPlaying
    };
}
If the scraper works I get this result:
[{"nowPlaying": "Hangover Hotline - hosted by Lamebrane"}]
But if it doesn't I get this:
[{"nowPlaying": ""}]
And there is only a 5 minute difference between the two scrapes. The website doesn't change, the data is always presented in the same way. I tried checking all the boxes to circumvent security and different mixes of options (Use Chrome, Use Stealth, Ignore SSL errors, Ignore CORS and CSP) but that doesn't seem to fix it unfortunately.
Any suggestions on how I can get this scraping task to consistently return the data I need?
It would be great if you could attach the URL; it would help me find the problem.
From the information you provided, I guess that the data you want is loaded asynchronously. You can use the context.waitFor() function.
async function pageFunction(context) {
    const { request, log, jQuery } = context;
    const $ = jQuery;
    await context.waitFor(() => !!$('p.js-playing-now').text());
    const nowPlaying = $('p.js-playing-now').text();
    return {
        nowPlaying
    };
}
You can pass it a function, and it will wait until the function returns a truthy value. You can check the docs.
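To illustrate the idea behind waiting on a predicate, here is a minimal polling sketch. The helper name waitForCondition and its options are assumptions made for illustration; this is not how Apify implements context.waitFor(), only the general technique of re-checking a condition until it holds or a deadline passes:

```javascript
// Hypothetical helper illustrating the polling idea; this is not
// Apify's implementation of context.waitFor().
function waitForCondition(predicate, { timeoutMs = 5000, intervalMs = 100 } = {}) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    const check = () => {
      let result;
      try {
        result = predicate();
      } catch (err) {
        reject(err); // a throwing predicate aborts the wait
        return;
      }
      if (result) {
        resolve(result); // condition met: hand back the truthy value
      } else if (Date.now() >= deadline) {
        reject(new Error(`Condition not met within ${timeoutMs} ms`));
      } else {
        setTimeout(check, intervalMs); // try again later
      }
    };
    check();
  });
}
```

With something like this, an empty nowPlaying value turns into an explicit timeout error instead of a silent empty string, which makes the unstable runs easier to spot.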

What's the correct way to handle removing a potentially busy file in NodeJS?

I have a NodeJS server managing some files. It watches for a known filename written by an external process and, once the file arrives, reads it and then deletes it. However, the server sometimes attempts the read/delete before the file has been "unlocked" from its previous use, so the operation will occasionally fail. What I'd like to do is retry as soon as possible, either the moment the file is released or by polling at a fast pace.
I'd rather avoid a long sleep where possible, because this needs to be handled ASAP and every second counts.
fs.watchFile(input_json_file, {interval: 10}, function(current_stats, previous_stats) {
    var json_data = "";
    try {
        var file_cont = fs.readFileSync(input_json_file); // < TODO: async this
        json_data = JSON.parse(file_cont.toString());
        fs.unlinkSync(input_json_file); // synchronous, so a failure lands in the catch below
    } catch (error) {
        console.log("The JSON in the file could not be parsed. File will continue to be watched.");
        console.log(error);
        return;
    }
    // Else, this has loaded properly.
    fs.unwatchFile(input_json_file);
    // ... do other things with the file's data.
});
// set a timeout for the file watching, just in case
setTimeout(fs.unwatchFile, CLEANUP_TIMEOUT, input_json_file);
I expect "EBUSY: resource busy or locked" to turn up occasionally, but fs.watchFile isn't always called when the file is unlocked.
I thought of creating a function and then calling it with a delay of 1-10ms, where it could call itself if that fails too, but that feels like a fast route to a... cough stack overflow.
I'd also like to steer clear of synchronous methods so that this scales nicely, but being relatively new to NodeJS all the callbacks are starting to turn into a maze.
This may be overkill for your case, but you can create your own filesystem with full control over it, so that other programs write their data directly to your program. Just search for fuse and fuse-binding.
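As for the short-delay retry idea from the question, here is a minimal sketch of how it could look with promises. The helper name readAndDeleteJson and its retries/delayMs options are assumptions for illustration. Because each retry is awaited in a loop rather than nested in a deeper call, the stack does not grow, and the event loop stays free between attempts:

```javascript
const fs = require('fs').promises;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: reads, parses and deletes a JSON file, retrying
// while the file is still locked (EBUSY) or not yet fully written.
async function readAndDeleteJson(path, { retries = 50, delayMs = 10 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      const data = JSON.parse(await fs.readFile(path, 'utf8'));
      await fs.unlink(path);
      return data;
    } catch (err) {
      const retryable = err.code === 'EBUSY' || err instanceof SyntaxError;
      if (!retryable || attempt >= retries) throw err;
      await sleep(delayMs); // yields to the event loop; no recursion involved
    }
  }
}
```

Treating a SyntaxError as retryable is a design choice made here on the assumption that a half-written file is the usual cause of a parse failure. If malformed files are a real possibility, the retries limit is what keeps the loop from spinning forever.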

Re-using same instance again webdriverJS

I am really new to Selenium. I managed to open a website using the Node.js code below:
var webdriver = require('selenium-webdriver');
var driver = new webdriver.Builder()
    .forBrowser('chrome')
    .build();
console.log(driver);
driver.get('https://web.whatsapp.com');
//perform all other operations here.
https://web.whatsapp.com is opened, and I manually scan a QR code to log in. I have several JavaScript files that perform actions inside web.whatsapp.com, such as deleting or clearing chats.
If I get an error, I debug it, and when I run the script again using node test.js, it takes another 2 minutes to load the page and repeat the steps I need. I just want to reuse the already opened tab and continue my script, instead of having a new window open.
Edit day 2: Still searching for a solution. I tried the code below to save the driver object and reuse it. Is this the correct approach? I get a JSON parse error, though.
var o = new chrome.Options();
o.addArguments("user-data-dir=/Users/vishnu/Library/Application Support/Google/Chrome/Profile 2");
o.addArguments("disable-infobars");
o.addArguments("--no-first-run");
var driver = new webdriver.Builder().withCapabilities(webdriver.Capabilities.chrome()).setChromeOptions(o).build();
var savefile = fs.writeFile('data.json', JSON.stringify(util.inspect(driver)) , 'utf-8');
var parsedJSON = require('./data.json');
console.log(parsedJSON);
It took me some time and a couple of different approaches, but I managed to work up something that I think solves your problem and allows you to develop tests in a rather nice way.
Because it does not directly answer the question of how to re-use a browser session in Selenium (using their JavaScript API), I will first present my proposed solution and then briefly discuss the other approaches I tried. It may give someone else an idea and help them to solve this problem in a nicer/better way. Who knows. At least my attempts will be documented.
Proposed solution (tested and works)
Because I did not manage to actually reuse a browser session (see below), I figured I could try something else. The approach will be the following.
Idea
Have a main loop in one file (say init.js) and tests in a separate file (test.js).
The main loop opens a browser instance and keeps it open. It also exposes some sort of CLI that allows one to run tests (from test.js), inspect errors as they occur and to close the browser instance and stop the main loop.
The test in test.js exports a test function that is being executed by the main loop. It is passed a driver instance to work with. Any errors that occur here are being caught by the main loop.
Because the browser instance is opened only once, we have to do the manual process of authenticating with WhatsApp (scanning a QR code) only once. After that, running a test will reload web.whatsapp.com, but it will have remembered that we authenticated and thus immediately be able to run whatever tests we define in test.js.
In order to keep the main loop alive, it is vital that we catch each and every error that might occur in our tests. I unfortunately had to resort to uncaughtException for that.
Implementation
This is the implementation of the above idea that I came up with. It is possible to make this much fancier if you want to. I went for simplicity here (I hope I managed).
init.js
This is the main loop from the above idea.
var webdriver = require('selenium-webdriver'),
    by = webdriver.By,
    until = webdriver.until,
    driver = null,
    prompt = '> ',
    testPath = 'test.js',
    lastError = null;
function initDriver() {
    return new Promise((resolve, reject) => {
        // already opened a browser? done
        if (driver !== null) {
            resolve();
            return;
        }
        // open a new browser, let user scan QR code
        driver = new webdriver.Builder().forBrowser('chrome').build();
        driver.get('https://web.whatsapp.com');
        process.stdout.write("Please scan the QR code within 30 seconds...\n");
        driver.wait(until.elementLocated(by.className('chat')), 30000)
            .then(() => resolve())
            .catch((timeout) => {
                process.stdout.write("\b\bTimed out waiting for code to" +
                    " be scanned.\n");
                driver.quit();
                reject();
            });
    });
}
function recordError(err) {
    process.stderr.write(err.name + ': ' + err.message + "\n");
    lastError = err;
    // let user know that test failed
    process.stdout.write("Test failed!\n");
    // indicate we are ready to read the next command
    process.stdout.write(prompt);
}
process.stdout.write(prompt);
process.stdin.setEncoding('utf8');
process.stdin.on('readable', () => {
    var chunk = process.stdin.read();
    if (chunk === null) {
        // happens on initialization, ignore
        return;
    }
    // do various different things for different commands
    var line = chunk.trim(),
        cmds = line.split(/\s+/);
    switch (cmds[0]) {
        case 'error':
            // print last error, when applicable
            if (lastError !== null) {
                console.log(lastError);
            }
            // indicate we are ready to read the next command
            process.stdout.write(prompt);
            break;
        case 'run':
            // open a browser if we didn't yet, execute tests
            initDriver().then(() => {
                // carefully load test code, report SyntaxError when applicable
                var file = (cmds.length === 1 ? testPath : cmds[1] + '.js');
                try {
                    var test = require('./' + file);
                } catch (err) {
                    recordError(err);
                    return;
                } finally {
                    // force node to read the test code again when we
                    // require it in the future
                    delete require.cache[__dirname + '/' + file];
                }
                // carefully execute tests, report errors when applicable
                test.execute(driver, by, until)
                    .then(() => {
                        // indicate we are ready to read the next command
                        process.stdout.write(prompt);
                    })
                    .catch(recordError);
            }).catch(() => process.stdin.destroy());
            break;
        case 'quit':
            // close browser if it was opened and stop this process
            if (driver !== null) {
                driver.quit();
            }
            process.stdin.destroy();
            return;
    }
});
// some errors somehow still escape all catches we have...
process.on('uncaughtException', recordError);
test.js
This is the test from the above idea. I wrote some things just to test the main loop and some WebDriver functionality. Pretty much anything is possible here. I have used promises to make test execution work nicely with the main loop.
var driver, by, until,
    timeout = 5000;
function waitAndClickElement(selector, index = 0) {
    driver.wait(until.elementLocated(by.css(selector)), timeout)
        .then(() => {
            driver.findElements(by.css(selector)).then((els) => {
                var element = els[index];
                driver.wait(until.elementIsVisible(element), timeout);
                element.click();
            });
        });
}
exports.execute = function(d, b, u) {
    // make globally accessible for ease of use
    driver = d;
    by = b;
    until = u;
    // actual test as a promise
    return new Promise((resolve, reject) => {
        // open site
        driver.get('https://web.whatsapp.com');
        // make sure it loads fine
        driver.wait(until.elementLocated(by.className('chat')), timeout);
        driver.wait(until.elementIsVisible(
            driver.findElement(by.className('chat'))), timeout);
        // open menu
        waitAndClickElement('.icon.icon-menu');
        // click profile link
        waitAndClickElement('.menu-shortcut', 1);
        // give profile time to animate
        // this prevents an error from occurring when we try to click the close
        // button while it is still being animated (workaround/hack!)
        driver.sleep(500);
        // close profile
        waitAndClickElement('.btn-close-drawer');
        driver.sleep(500); // same for hiding profile
        // click some chat
        waitAndClickElement('.chat', 3);
        // let main script know we are done successfully
        // we do so after all other webdriver promise have resolved by creating
        // another webdriver promise and hooking into its resolve
        driver.wait(until.elementLocated(by.className('chat')), timeout)
            .then(() => resolve());
    });
};
Example output
Here is some example output. The first invocation of run test will open up an instance of Chrome. Other invocations will use that same instance. When an error occurs, it can be inspected as shown. Executing quit will close the browser instance and quit the main loop.
$ node init.js
> run test
> run test
WebDriverError: unknown error: Element <div class="chat">...</div> is not clickable at point (163, 432). Other element would receive the click: <div dir="auto" contenteditable="false" class="input input-text">...</div>
(Session info: chrome=57.0.2987.133)
(Driver info: chromedriver=2.29.461571 (8a88bbe0775e2a23afda0ceaf2ef7ee74e822cc5),platform=Linux 4.9.0-2-amd64 x86_64)
Test failed!
> error
<prints complete stacktrace>
> run test
> quit
You can run tests in other files by simply calling them. Say you have a file test-foo.js, then execute run test-foo in the above prompt to run it. All tests will share the same Chrome instance.
Failed attempt #1: saving and restoring storage
When inspecting the page using my browser's developer tools, I noticed that it appears to use localStorage. It is possible to export this as JSON and write it to a file. On the next invocation, this file can be read, parsed and written to the new browser instance's storage before reloading the page.
Unfortunately, WhatsApp still required me to scan the QR code. I tried to figure out what I had missed (cookies, sessionStorage, ...), but did not manage to. It is possible that WhatsApp registers the browser as being disconnected after some time has passed, or that it uses other browser properties (session ID?) to recognize the browser. This is pure speculation on my part, though.
Failed attempt #2: switching session/window
Every browser instance started via WebDriver has a session ID. This ID can be retrieved, so I figured it may be possible to start a session and then connect to it from the test cases, which would then be run from a separate file (you can see this is the predecessor of the final solution). Unfortunately, I have not been able to figure out a way to set the session ID. This may actually be a security concern, I am not sure. People more expert in the usage of WebDriver might be able to clarify here.
I did find out that it is possible to retrieve a list of window handles and switch between them. Unfortunately, windows are only shared within a single session and not across sessions.

HTML5 Webworker Startup Synchronization Guarantees

I have a bit of javascript I want to run in a webworker, and I am having a hard time understanding the correct approach to getting them to work in lock-step. I invoke the WebWorker from the main script as in the following simplified script:
// main.js
var werker = new Worker("webWorkerScaffold.js");
// #1
werker.onmessage = function(msgObj){
    console.log("Worker Reply");
    console.log(msgObj);
    doSomethingWithMsg(msgObj);
};
werker.onerror = function(err){
    console.log("Worker Error:");
    console.log(err);
};
// send an object so that msgObj.data.msg is defined in the worker
werker.postMessage({ msg: "begin" });
Then the complementary worker script looks like the following:
// webWorkerScaffold.js
var doWorkerStuffs = function(msg){}; // Omitted
// #2
onmessage = function (msgObj){
    // Messages in will always be json
    if (msgObj.data.msg === "begin")
        doWorkerStuffs();
};
This code (the actual version) works as expected, but I am having a difficult time confirming it will always perform correctly. Consider the following:
1. The "new Worker()" call is made, spawning a new thread.
2. The spawned thread is slow to load (let's say it hangs at "// #2").
3. The parent thread does "werker.postMessage..." with no recipient.
4. ... ?
The same applies in the reverse direction: I might change the worker script to announce itself once it is set up internally, but in that scenario the main thread could hang at "// #1" and miss the incoming message because it doesn't have its comms up yet.
Is there some way to guarantee that these scripts move forward in a lock-step way?
What I am really looking for is a zmq-like REP/REQ semantic, where one or the other blocks (or calls back) when 1:1 transactions can take place.
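To make the REQ/REP idea concrete, here is one possible sketch of request/reply correlation on top of postMessage. The helper names createRequester and serveRequests, and the { id, payload } / { replyTo, payload } message shapes, are assumptions invented for this illustration. The sketch only relies on the endpoint exposing postMessage() and addEventListener('message', ...), as a Worker object and a worker's global scope both do:

```javascript
// Hypothetical helper: wraps any endpoint exposing postMessage() and
// addEventListener('message', ...) (a Worker on the main thread, or
// `self` inside a worker) in a promise-based request/reply API.
function createRequester(endpoint) {
  const pending = new Map(); // request id -> resolve callback
  let nextId = 0;

  endpoint.addEventListener('message', (evt) => {
    const { replyTo, payload } = evt.data || {};
    const resolve = pending.get(replyTo);
    if (resolve) {
      pending.delete(replyTo);
      resolve(payload);
    }
  });

  // Returns a promise that settles when the matching reply arrives.
  return function request(payload) {
    return new Promise((resolve) => {
      const id = nextId++;
      pending.set(id, resolve);
      endpoint.postMessage({ id, payload });
    });
  };
}

// The receiving side answers every request by echoing its id back.
function serveRequests(endpoint, handler) {
  endpoint.addEventListener('message', (evt) => {
    const { id, payload } = evt.data || {};
    if (id === undefined) return; // not a request
    endpoint.postMessage({ replyTo: id, payload: handler(payload) });
  });
}
```

On the main thread this might be used as const request = createRequester(werker); followed by request("begin").then(...), with serveRequests(self, doWorkerStuffs) in the worker script. Each call then gets a 1:1 reply instead of relying on both sides happening to be ready at the same moment.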

Running JS in a killable 'thread'; detecting and canceling long-running processes

Summary: How can I execute a JavaScript function, but then "execute" (kill) it if it does not finish within a timeframe (e.g. 2 seconds)?
Details
I'm writing a web application for interactively writing and testing PEG grammars. Unfortunately, the JavaScript library I'm using for parsing using a PEG has a 'bug' where certain poorly-written or unfinished grammars cause infinite execution (not even detected by some browsers). You can be happily typing along, working on your grammar, when suddenly the browser locks up and you lose all your hard work.
Right now my code is (very simplified):
grammarTextarea.onchange = generateParserAndThenParseInputAndThenUpdateThePage;
I'd like to change it to something like:
grammarTextarea.onchange = function(){
var result = runTimeLimited( generateParserAndThenParseInput, 2000 );
if (result) updateThePage();
};
I've considered using an iframe or other tab/window to execute the content, but even this messy solution is not guaranteed to work in the latest versions of major browsers. However, I'm happy to accept a solution that works only in latest versions of Safari, Chrome, and Firefox.
Web workers provide this capability—as long as the long-running function does not require access to the window or document or closures—albeit in a somewhat-cumbersome manner. Here's the solution I ended up with:
main.js
var worker, activeMsgs, userTypingTimeout, deathRowTimer;
killWorker(); // Also creates the first one
grammarTextarea.onchange = grammarTextarea.oninput = function(){
    // Wait until the user has not typed for 500ms before parsing
    clearTimeout(userTypingTimeout);
    userTypingTimeout = setTimeout(askWorkerToParse,500);
};
function askWorkerToParse(){
    worker.postMessage({action:'parseInput',input:grammarTextarea.value});
    activeMsgs++; // Another message is in flight
    clearTimeout(deathRowTimer); // Restart the timer
    deathRowTimer = setTimeout(killWorker,2000); // It must finish quickly
}
function killWorker(){
    if (worker) worker.terminate(); // This kills the thread
    worker = new Worker('worker.js'); // Create a new worker thread
    activeMsgs = 0; // No messages are pending on this new one
    worker.addEventListener('message',handleWorkerResponse,false);
}
function handleWorkerResponse(evt){
    // If this is the last message, it responded in time: it gets to live.
    if (--activeMsgs==0) clearTimeout(deathRowTimer);
    // **Process the evt.data.results from the worker**
}
worker.js
importScripts('utils.js'); // Each worker is a blank slate; must load libs
self.addEventListener('message',function(evt){
    var data = evt.data;
    switch(data.action){
        case 'parseInput':
            // Actually do the work (which sometimes goes bad and locks up)
            var parseResults = parse(data.input);
            // Send the results back to the main thread.
            self.postMessage({kind:'parse-results',results:parseResults});
            break;
    }
},false);
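If something closer to the runTimeLimited call from the question is wanted, the same terminate-on-timeout idea can be wrapped in a promise. This is only a sketch under assumptions: the signature below is invented for illustration, createWorker is a caller-supplied factory such as () => new Worker('worker.js'), and, unlike the solution above, it spawns a fresh worker for every call instead of reusing one:

```javascript
// Hypothetical helper: `createWorker` is a factory such as
// () => new Worker('worker.js'). Resolves with the first reply, or
// rejects and terminates the worker if no reply arrives within `ms`.
function runTimeLimited(createWorker, message, ms) {
  return new Promise((resolve, reject) => {
    const worker = createWorker();
    const timer = setTimeout(() => {
      worker.terminate(); // kills the runaway computation
      reject(new Error(`Worker did not reply within ${ms} ms`));
    }, ms);
    worker.onmessage = (evt) => {
      clearTimeout(timer);
      worker.terminate(); // one-shot worker: done after the first reply
      resolve(evt.data);
    };
    worker.onerror = (err) => {
      clearTimeout(timer);
      worker.terminate();
      reject(err);
    };
    worker.postMessage(message);
  });
}
```

A call might then look like runTimeLimited(() => new Worker('worker.js'), {action:'parseInput',input:grammarTextarea.value}, 2000).then(updateThePage). Creating a worker per call costs some startup time (the worker has to run importScripts again), which is the trade-off against the reuse approach shown above.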
