Make NodeJS/JSDom wait for full rendering before scraping

I'm trying to scrape data from a website that I need to log into. Unfortunately, I'm getting different results using JSDom/NodeJS than I would if I were to use a web browser such as Firefox. In particular, I'm not getting the login form with the username, password and submit button.
I understand that much of JavaScript is asynchronous. However, I thought the "done" callback of JSDom waits for the full rendering of the page. What I'd like to do is simulate an HTTPS GET and wait until document.ready has finished.
var jsdom = require("jsdom");
var jsdom_global = require("jsdom-global");
var fs = require("fs");
var jquery = fs.readFileSync("./jquery-3.1.1.min.js", "utf-8");

jsdom.env({
    url: "https://wemc.smarthub.coop/Login.html#login:",
    src: [jquery],
    done: function (err, window) {
        var $ = window.$;
        if ($("button#LoginSubmitButton").length) {
            console.log('Click button found');
        } else {
            console.log('Click button not found');
        }
        // The following text boxes are not coming back:
        // $("input#LoginUsernameTextBox")
        // $("input#LoginPasswordTextBox")
        // If I enable the line below, I see a lot less than I would if I
        // do a view source in any reasonable browser.
        //console.log($("body").html());
    }
});

Usually this happens because JSDOM doesn't execute the JavaScript when it hits the page. In that case, the only elements returned will be the server-rendered HTML.
You could try a headless browser module such as PhantomJS and see how that goes for you. There's a section about the distinction between the two at the bottom of the JSDOM GitHub page.
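If Puppeteer is an option (my suggestion, not something from the original answer), a minimal sketch would be to let headless Chrome render the page and wait for the login form before querying it:
// Sketch only: assumes Puppeteer is installed (npm install puppeteer)
// and that the selectors from the question are correct.
const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto("https://wemc.smarthub.coop/Login.html#login:", {
        waitUntil: "networkidle0" // wait until no network requests are in flight
    });
    // Wait for the client-side rendered form before touching the DOM
    await page.waitForSelector("button#LoginSubmitButton");
    const hasUsername = (await page.$("input#LoginUsernameTextBox")) !== null;
    console.log("Username box found:", hasUsername);
    await browser.close();
})();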

Related

Where and how can I add a wait for page to load in my script?

I'm very new to JS and I'm having an issue at the moment where my axe Chrome extension isn't matching up with the script below. A suggestion was to wait for the page to load, but I'm a bit confused about what to use and where to use it. I'm hoping that waiting for the page load will allow me to see the AA issues that aren't being pulled through.
// Assumed imports for this snippet (e.g. selenium-webdriver and axe-webdriverjs); adjust to your setup
const WebDriver = require('selenium-webdriver');
const AxeBuilder = require('axe-webdriverjs');

const driver = new WebDriver.Builder().forBrowser('chrome').build();
driver.get('MySite').then(() => {
    new AxeBuilder(driver)
        .withTags(['wcag2a', 'wcag2aa', 'wcag21a', 'wcag21aa', 'best-practice', 'wcag***', 'act', 'section508', 'section508.*.*', 'experimental', 'cat.*', 'color-contrast'])
        .analyze((err, results) => {
            if (err) {
                // Handle error somehow
            }
            console.log(results.violations);
        });
});
For classic scripts, if the async attribute is present, then the classic script will be fetched in parallel to parsing and evaluated as soon as it is available.
read this:
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/script#attr-async
To wait for the page to load, wrap your script in the following:
window.onload = function() {
    // ...your code here...
}
See onload documentation.
Alternatively, move the <script> tag to the bottom of the HTML (and don't use the async attribute).
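If you'd rather keep the wait inside the Selenium script itself, one rough sketch (my own wiring, assuming selenium-webdriver's driver.wait/executeScript and an arbitrary 10-second timeout) is to poll document.readyState before running the axe analysis:
// Sketch: wait until the document reports it has finished loading
// before running the axe analysis.
driver.get('MySite')
    .then(() => driver.wait(
        () => driver.executeScript('return document.readyState')
                    .then(state => state === 'complete'),
        10000 // arbitrary timeout in milliseconds
    ))
    .then(() => {
        new AxeBuilder(driver).analyze((err, results) => {
            if (err) throw err;
            console.log(results.violations);
        });
    });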

Apify web scraper task not stable. Getting different results between runs minutes apart

I'm building a very simple scraper to get the 'now playing' info from an online radio station I like to listen to.
It's stored in a simple p element on their site.
Now, using the standard apify/web-scraper, I run into a strange issue. The scraping sometimes works and sometimes doesn't, using this code:
async function pageFunction(context) {
    const { request, log, jQuery } = context;
    const $ = jQuery;
    const nowPlaying = $('p.js-playing-now').text();
    return {
        nowPlaying
    };
}
If the scraper works I get this result:
[{"nowPlaying": "Hangover Hotline - hosted by Lamebrane"}]
But if it doesn't I get this:
[{"nowPlaying": ""}]
And there is only a 5 minute difference between the two scrapes. The website doesn't change, the data is always presented in the same way. I tried checking all the boxes to circumvent security and different mixes of options (Use Chrome, Use Stealth, Ignore SSL errors, Ignore CORS and CSP) but that doesn't seem to fix it unfortunately.
Any suggestions on how I can get this scraping task to constantly return the data I need?
It would be great if you could attach the URL; it would help me figure out the problem.
With the information you provided, I guess that the data you want is loaded asynchronously. You can use the context.waitFor() function.
async function pageFunction(context) {
    const { request, log, jQuery } = context;
    const $ = jQuery;
    await context.waitFor(() => !!$('p.js-playing-now').text());
    const nowPlaying = $('p.js-playing-now').text();
    return {
        nowPlaying
    };
}
You can pass a function to waitFor(), and it will wait until that function returns a truthy value. You can check the docs.
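As a side note (based on my reading of the Web Scraper docs, so treat it as an assumption), waitFor() also accepts a CSS selector or a number of milliseconds, so the same wait could be written as:
// Wait for the element to appear (selector form)...
await context.waitFor('p.js-playing-now');
// ...or simply pause for a fixed amount of time:
// await context.waitFor(5000);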

Stop fs.createWriteStream creating writeable stream when file is deleted

Folks: I'm creating an Angular/Node app, where users download files via selecting a related thumbnail.
As files download, a small list is shown with the download progress - using status-bar.
When the file is downloaded a success message is shown.
Each item in the list has a delete button which removes the files when clicked. All of this works fine.
Question: Similar to this post, when the delete button is clicked the idea is to stop the download, which is why I thought I'd just delete the file.
However, I'm using fs.createWriteStream, and when the file is deleted the stream appears to continue regardless of the file no longer being there. This then causes the file.on('finish', function() { ... }) state to kick in and show the success message.
To tackle this, I check whether the file path exists when the finish state kicks in, so the success message is displayed correctly. This feels pretty hacky, especially when large files are downloading.
Is there a way to cancel the stream from progressing when the file is deleted?
Following your comment 'yes, just like that', I have one question. You are obviously creating the file on the client's system and writing to it in streams. How are you doing that from the browser? Are you using an API that gives you access to Node's core modules in the browser, like Browserify?
Having said that, if my understanding is correct, you can achieve it in the following way:
var http = require("http"),
    fs = require("fs"),
    stream = require("stream"),
    util = require("util"),
    abortStream = false, // When the user clicks on delete, update this flag to true
    ws,
    Transform;

ws = fs.createWriteStream('./op.jpg');

// Transform streams read input, process data [n times], output processed data
// readStream ---pipe---> transformStream1 ---pipe---> ...transformStreamN ---pipe---> outputStream
// #api https://nodejs.org/api/stream.html#stream_class_stream_transform
// #exmpl https://strongloop.com/strongblog/practical-examples-of-the-new-node-js-streams-api/
Transform = stream.Transform || require("readable-stream").Transform;

function InterruptedStream(options) {
    if (!(this instanceof InterruptedStream)) {
        return new InterruptedStream(options);
    }
    Transform.call(this, options);
}
util.inherits(InterruptedStream, Transform);

InterruptedStream.prototype._transform = function (chunkdata, encoding, done) {
    // This is just for illustration, giving you the idea.
    // Do not hard-code the condition here.
    // Suggested to pass the condition in via the constructor call, maybe.
    if (abortStream === true) {
        // Take care of this part.
        // Your logic might try to write to the stream after it is closed.
        // You can catch the exception, but better not to write in the first place.
        this.end(); // Stops the stream
        return done();
    }
    this.push(chunkdata, encoding);
    done();
};

var is = new InterruptedStream();
is.pipe(ws);

// Download a large file
http.get("http://www.zastavki.com/pictures/1920x1200/2011/Space_Huge_explosion_031412_.jpg", function (res) {
    res.on('data', function (data) {
        is.write(data);
    });

    // Simulates a click on the delete button after 2 seconds
    setTimeout(function () {
        abortStream = true;
        res.destroy();
        // Delete the file, I think you have the logic in place
    }, 2000);

    res.on('end', function () {
        console.log("end");
    });
});
The above code snippet gives a rough idea of how it can be done. You can copy-paste it, run it (it will work), and make changes.
If we are not on the same page, please let me know and I'll try to rectify my answer.
I think you can emit an event when your file is deleted and capture that event on the write stream:
var wt = fs.createWriteStream(/* path to the file being written */);
wt.on('eventName', function () {
    wt.emit('close');
});
This will close your writable stream.
The delete event should be fired from the client side.
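For completeness, here is a minimal sketch of the cancellation idea from the question's side (the names download, destPath and isCancelled are mine, purely illustrative): destroy both the HTTP response and the write stream when the user clicks delete, so the 'finish' event never fires:
// Sketch only: isCancelled() is assumed to be a flag you flip when the user clicks delete.
var http = require("http");
var fs = require("fs");

function download(url, destPath, isCancelled) {
    var ws = fs.createWriteStream(destPath);
    http.get(url, function (res) {
        res.pipe(ws);
        var timer = setInterval(function () {
            if (isCancelled()) {
                clearInterval(timer);
                res.destroy();                        // stop receiving data
                ws.destroy();                         // stop writing; 'finish' will not fire
                fs.unlink(destPath, function () {});  // remove the partial file
            }
        }, 250);
        ws.on("finish", function () { clearInterval(timer); });
    });
}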

ReportViewer Web Form causes page to hang

I was asked to take a look at what should be a simple problem with one of our web pages for a small dashboard web app. This app just shows some basic state info for underlying backend apps which I work heavily on. The issue is as follows:
On a page where a user can input parameters and request to view a report with the given user input, a button invokes a JS function which opens a new page in the browser to show the rendered report. The code looks like this:
$('#btnShowReport').click(function () {
    document.getElementById("Error").innerHTML = "";
    var exists = CheckSession();
    if (exists) {
        window.open('<%=Url.Content("~/Reports/Launch.aspx?Report=Short&Area=1") %>');
    }
});
The page that is then opened has the following code which is called from Page_Load:
rptViewer.ProcessingMode = ProcessingMode.Remote
rptViewer.AsyncRendering = True
rptViewer.ServerReport.Timeout = CInt(WebConfigurationManager.AppSettings("ReportTimeout")) * 60000
rptViewer.ServerReport.ReportServerUrl = New Uri(My.Settings.ReportURL)
rptViewer.ServerReport.ReportPath = "/" & My.Settings.ReportPath & "/" & Request("Report")
'Set the report to use the credentials from web.config
rptViewer.ServerReport.ReportServerCredentials = New SQLReportCredentials(My.Settings.ReportServerUser, My.Settings.ReportServerPassword, My.Settings.ReportServerDomain)
Dim myCredentials As New Microsoft.Reporting.WebForms.DataSourceCredentials
myCredentials.Name = My.Settings.ReportDataSource
myCredentials.UserId = My.Settings.DatabaseUser
myCredentials.Password = My.Settings.DatabasePassword
rptViewer.ServerReport.SetDataSourceCredentials(New Microsoft.Reporting.WebForms.DataSourceCredentials(0) {myCredentials})
rptViewer.ServerReport.SetParameters(parameters)
rptViewer.ServerReport.Refresh()
I have omitted some code which builds up the parameters for the report, but I doubt any of that is relevant.
The problem is that, when the user clicks the show report button and this new page opens up, depending on the types of parameters they use the report could take quite some time to render, and in the meantime the original page becomes completely unresponsive. The moment the report page actually renders, the main page begins functioning again. Where should I start (Google keywords, ReportViewer properties, etc.) if I want to fix this behavior so that the other page can load asynchronously without affecting the main page?
Edit -
I tried doing the following, which was in a linked answer in a comment here:
$.ajax({
    context: document.body,
    async: true, //NOTE THIS
    success: function () {
        window.open(Address);
    }
});
this replaced the window.open call. This seems to work, but when I check out the documentation, trying to understand what this is doing I found this:
The .context property was deprecated in jQuery 1.10 and is only maintained to the extent needed for supporting .live() in the jQuery Migrate plugin. It may be removed without notice in a future version.
I removed the context property entirely and it didn't seem to affect the code at all... Is it OK to use this ajax call in this way to open up the other window, or is there a better approach?
Using a timeout should open the window without blocking your main page
$('#btnShowReport').click(function () {
    document.getElementById("Error").innerHTML = "";
    var exists = CheckSession();
    if (exists) {
        setTimeout(function () {
            window.open('<%=Url.Content("~/Reports/Launch.aspx?Report=Short&Area=1") %>');
        }, 0);
    }
});
This is a long shot, but have you tried opening the window with a blank URL first, and subsequently changing the location?
$("#btnShowReport").click(function(){
If (CheckSession()) {
var pop = window.open ('', 'showReport');
pop = window.open ('<%=Url.Content("~/Reports/Launch.aspx?Report=Short&Area=1") %>', 'showReport');
}
})
use
$('#btnShowReport').click(function () {
    document.getElementById("Error").innerHTML = "";
    var exists = CheckSession();
    if (exists) {
        window.location.href = '<%=Url.Content("~/Reports/Launch.aspx?Report=Short&Area=1") %>';
    }
});
it will work.

Efficient scrolling of piped output in a browser window

I have a custom browser plugin (built with FireBreath) that invokes a local process on the user's machine and pipes stdout back to the browser. To do this, I'm running the process through a popen() call, and as I read data from the pipe I fire a JSAPI event and send it back to the browser.
In the browser I append the output to a div as pre-formatted text and tell the div to scroll to the bottom.
Code in the browser plugin:
FILE* in;
if (!(in = _popen(command_string, "r")))
{
    return NULL;
}

while (fgets(buff, sizeof(buff), in) != NULL)
{
    send_output_to_browser(buff);
}
HTML & Javascript/jQuery:
<pre id="sync_status_window" style="overflow:scroll">
<span id="sync_output"></span>
</pre>
var onPluginTextReceived = function (text)
{
    $('#sync_output').append(text);
    var objDiv = document.getElementById('sync_status_window');
    objDiv.scrollTop = objDiv.scrollHeight;
}
This method works for the browsers I need it to (this is a limited use internal tool), but it's frustratingly laggy. My process usually finishes about 30-60 seconds before the output window finishes scrolling. So, how do I make this more efficient? Is there a better way to pipe this text back to the browser?
There are two optimizations I see potential in:

1. Keep a reference to your pre and span; you keep repeating the DOM tree search, which is quite costly.
2. Chunk up the output, either on the C side (preferable) or on the JS side.
A quick hack (without removing the dependency on jQuery, which should be done) could look like this:
// Higher or global scope
var pluginBuffer = [];
var pluginTimeout = false;
var sync_status_window = document.getElementById('sync_status_window');

function onPluginTextReceived(text)
{
    pluginBuffer[pluginBuffer.length] = text;
    if (!pluginTimeout) pluginTimeout = window.setTimeout(onPluginTimer, 333);
}

function onPluginTimer()
{
    var txt = pluginBuffer.join('');
    pluginBuffer = [];
    pluginTimeout = false;
    $('#sync_output').append(txt);
    sync_status_window.scrollTop = sync_status_window.scrollHeight;
}
Adapt to your needs; I chose 333 ms for roughly 3 updates per second.
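To also drop the jQuery dependency, as suggested above, the append inside onPluginTimer() could talk to the DOM directly (assuming a cached reference to the span, mirroring the one kept for the status window):
// Cache the span once at a higher scope
var sync_output = document.getElementById('sync_output');

function onPluginTimer()
{
    var txt = pluginBuffer.join('');
    pluginBuffer = [];
    pluginTimeout = false;
    // Append as a text node instead of going through jQuery
    sync_output.appendChild(document.createTextNode(txt));
    sync_status_window.scrollTop = sync_status_window.scrollHeight;
}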
