I'm trying to download the HTML of a website that is almost entirely generated by JavaScript. So, I need to simulate browser access and have been playing around with PhantomJS. Problem is, the site uses hashbang URLs and I can't seem to get PhantomJS to process the hashbang -- it just keeps calling up the homepage.
The site is http://www.regulations.gov. The default takes you to #!home. I've tried using the following code (from here) to try and process different hashbangs.
if (phantom.state.length === 0) {
if (phantom.args.length === 0) {
console.log('Usage: loadreg_1.js <some hash>');
phantom.exit();
}
var address = 'http://www.regulations.gov/';
console.log(address);
phantom.state = Date.now().toString();
phantom.open(address);
} else {
var hash = phantom.args[0];
document.location = hash;
console.log(document.location.hash);
var elapsed = Date.now() - new Date().setTime(phantom.state);
if (phantom.loadStatus === 'success') {
if (!first_time) {
var first_time = true;
if (!document.addEventListener) {
console.log('Not SUPPORTED!');
}
phantom.render('result.png');
var markup = document.documentElement.innerHTML;
console.log(markup);
phantom.exit();
}
} else {
console.log('FAIL to load the address');
phantom.exit();
}
}
This code produces the correct hashbang (for instance, I can set the hash to '#!contactus') but it doesn't dynamically generate any different HTML, just the default page. It does, however, correctly output that hash when I call document.location.hash.
I've also tried to set the initial address to the hashbang, but then the script just hangs and doesn't do anything. For example, if I set the url to http://www.regulations.gov/#!searchResults;rpp=10;po=0 the script just hangs after printing the address to the terminal and nothing ever happens.
The issue here is that the content of the page loads asynchronously, but you're expecting it to be available as soon as the page is loaded.
In order to scrape a page that loads content asynchronously, you need to wait to scrape until the content you're interested in has been loaded. Depending on the page, there might be different ways of checking, but the easiest is just to check at regular intervals for something you expect to see, until you find it.
The trick here is figuring out what to look for - you need something that won't be present on the page until your desired content has been loaded. In this case, the easiest option I found for top-level pages is to manually input the H1 tags you expect to see on each page, keying them to the hash:
var titleMap = {
'#!contactUs': 'Contact Us',
'#!aboutUs': 'About Us'
// etc for the other pages
};
Then in your success block, you can set a recurring timeout to look for the title you want in an h1 tag. When it shows up, you know you can render the page:
if (phantom.loadStatus === 'success') {
// set a recurring timeout for 300 milliseconds
var timeoutId = window.setInterval(function () {
// check for title element you expect to see
var h1s = document.querySelectorAll('h1');
if (h1s) {
// h1s is a node list, not an array, hence the
// weird syntax here
Array.prototype.forEach.call(h1s, function(h1) {
if (h1.textContent.trim() === titleMap[hash]) {
// we found it!
console.log('Found H1: ' + h1.textContent.trim());
phantom.render('result.png');
console.log("Rendered image.");
// stop the cycle
window.clearInterval(timeoutId);
phantom.exit();
}
});
console.log('Found H1 tags, but not ' + titleMap[hash]);
}
console.log('No H1 tags found.');
}, 300);
}
The above code works for me. But it won't work if you need to scrape search results - you'll need to figure out an identifying element or bit of text that you can look for without having to know the title ahead of time.
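For pages where you don't know the title ahead of time, the same polling idea can be pulled out into a small helper. This is only a sketch: waitFor and its parameter names are made up for illustration, and the check you pass in (for example, looking for a result-row selector) is something you would have to work out for the page in question.

```javascript
// Poll `check` every `intervalMs` until it returns a truthy value or
// `timeoutMs` elapses. The helper name and parameters are illustrative.
function waitFor(check, onReady, onTimeout, timeoutMs, intervalMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    var result = check();
    if (result) {
      // Found what we were waiting for: stop polling and hand it over.
      clearInterval(timer);
      onReady(result);
    } else if (Date.now() - start >= timeoutMs) {
      // Give up so the script cannot hang forever.
      clearInterval(timer);
      onTimeout();
    }
  }, intervalMs);
  return timer;
}

// Hypothetical usage inside the success block:
// waitFor(
//   function () { return document.querySelector('h1'); },
//   function () { phantom.render('result.png'); phantom.exit(); },
//   function () { console.log('Timed out'); phantom.exit(); },
//   10000, 300);
```

The timeout branch is the main difference from the loop above: if the expected content never shows up, the script exits instead of polling indefinitely.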
Edit: Also, it looks like the newest version of PhantomJS now triggers an onResourceReceived event when it gets new data. I haven't looked into this, but you might be able to bind a listener to this event to achieve the same effect.
I'm attempting to use javascript to determine if the user is using a certain language and if they're not using english then for the page to load a different page BUT with the params of which I've grabbed from the url.
I have been able to load the page with the params but I keep falling into a loop reloading the page, even after skimming through the countless other examples, such as: this or this.
function locateUserLanguage() {
var languageValue = (navigator.languages ? navigator.languages[0] : (navigator.language || navigator.userLanguage)).split('-');
var url = window.location.href.split('?');
var baseUrl = url[0];
var urlParams = url[1];
if (languageValue[0] === 'en') {
console.log('no redirect needed, stay here.');
} else {
// I tried to set location into a variable but also wasn't working.
// var newURL = window.location.href.replace(window.location.href, 'https://www.mysite.dog/?' + urlParams);
window.location.href = 'https://www.mysite.dog/?' + urlParams
}
} locateUserLanguage();
I've attempted to place a return true; as well as return false; but neither stop the loop.
I've tried window.location.replace(); and setting the window.location.href straight to what I need, but it's continuing to loop.
There is a possibility that the script in which this function is written is executed in both of your pages (english and non-english) on load. So, as soon as the page is loaded, locateUserLanguage function is executed in both english and non-english website causing the infinite loop.
You need to put a check before you call locateUserLanguage function.
Suppose english website has url = "www.myside.com" and non-english website has url "www.myside.aus". So the condition needs to be
if (window.location.host === "www.myside.com") { locateUserLanguage() }
This will make sure that locateUserLanguage is called only in english website.
Another approach is to load this script only on the English website, which avoids the need for the conditional statement.
Hope it helps. Revert for any doubts.
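As a rough sketch of the same guard, the decision can be kept in one small function that returns the redirect target, or null when the page should stay put. The function name getRedirectUrl and the way the target URL is passed in are assumptions for illustration; the host comparison is what stops the loop.

```javascript
// Returns the URL to redirect to, or null when no redirect is needed.
// `targetBase` (the non-English site) is passed in by the caller.
function getRedirectUrl(currentHref, language, targetBase) {
  var current = new URL(currentHref);
  var target = new URL(targetBase);
  var primary = (language || "en").split("-")[0].toLowerCase();

  // Already on the target site: redirecting again would cause the loop.
  if (current.host === target.host) return null;
  // English users stay where they are.
  if (primary === "en") return null;

  // Carry the original query string over to the target.
  target.search = current.search;
  return target.href;
}

// Browser usage (not executed outside a browser):
// var next = getRedirectUrl(window.location.href, navigator.language,
//                           "https://www.mysite.dog/");
// if (next) window.location.replace(next);
```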
To give you some background, many (if not all) websites load their images one by one, so if there are a lot of images, and/or you have a slow computer, most of the images won't show up. This is avoidable for the most part; however, if you're running a script to extract image URLs, then you don't need to see the image, you just want its URL. My question is as follows:
Is it possible to trick a webpage into thinking an image is done loading so that it will start loading the next one?
Typically browser will not wait for one image to be downloaded before requesting the next image. It will request all images simultaneously, as soon as it gets the srcs of those images.
Are you sure that the images are indeed waiting for previous image to download or are they waiting for a specific time interval?
In case if you are sure that it depends on download of previous image, then what you can do is, route all your requests through some proxy server / firewall and configure it to return an empty file with HTTP status 200 whenever an image is requested from that site.
That way the browser (or actually the website code) will assume that it has downloaded the image successfully.
how do I do that? – Jack Kasbrack
That's actually a very open ended / opinion based question. It will also depend on your OS, browser, system permissions etc. Assuming you are using Windows and have sufficient permissions, you can try using Fiddler. It has an AutoResponder functionality that you can use.
(I've no affiliation with Fiddler / Telerik as such. I'm suggesting it only as an example and because I've used it in the past and know that it can be used for the aforementioned purpose. There will be many more products that provide similar functionality and you should use the product of your choice.)
Use a plugin called Lazy Load. It loads the rest of the web page first and defers the images, loading each image only when the user scrolls to it.
To extract all image URLs to a text file maybe you could use something like this,
If you execute this script inside any website it will list the URLs of the images
document.querySelectorAll('*[src]').forEach((item) => {
const isImage = item.src.match(/(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|jpeg|gif|png|svg)/g);
if (isImage) console.log(item.src);
});
You could also use the same idea to read Style from elements and get images from background url or something, like that:
document.querySelectorAll('*').forEach((item) => {
const computedItem = getComputedStyle(item);
Object.keys(computedItem).forEach((attr) => {
const style = computedItem[attr];
const image = style.match(/(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|jpeg|gif|png|svg)/g);
if (image) console.log(image[0]);
});
});
So, at the end of the day you could do some function like that, which will return an array of all images on the site
function getImageURLS() {
let images = [];
document.querySelectorAll('*').forEach((item) => {
const computedItem = getComputedStyle(item);
Object.keys(computedItem).forEach((attr) => {
const style = computedItem[attr];
const image = style.match(/(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|jpeg|gif|png|svg)/g);
if (image) images.push(image[0]);
});
});
document.querySelectorAll('*[src]').forEach((item) => {
const isImage = item.src.match(/(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|jpeg|gif|png|svg)/g);
if (isImage) images.push(item.src);
});
return images;
}
It can probably be optimized but, well you get the idea..
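One possible tidy-up, sketched here under my own assumptions, is to separate the URL test from the DOM walk and to drop duplicates. The regex below is a simplified stand-in for the one above (it only accepts http/https URLs whose path ends in one of the listed extensions), and uniqueImageUrls is a made-up name.

```javascript
// Matches common image extensions at the end of the path, optionally
// followed by a query string or fragment. The extension list is an assumption.
var IMAGE_URL = /^https?:\/\/[^\s?#]+\.(?:jpe?g|gif|png|svg)(?:[?#]\S*)?$/i;

// Keep only strings that look like image URLs, without duplicates,
// preserving the order in which they were first seen.
function uniqueImageUrls(candidates) {
  var seen = new Set();
  candidates.forEach(function (url) {
    if (typeof url === "string" && IMAGE_URL.test(url)) {
      seen.add(url);
    }
  });
  return Array.from(seen);
}

// Browser usage (not executed outside a browser):
// var srcs = Array.from(document.querySelectorAll('*[src]'))
//   .map(function (el) { return el.src; });
// console.log(uniqueImageUrls(srcs));
```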
If you just want to extract images once. You can use some tools like
1) Chrome Extension
2) Software
3) Online website
If you want to run it multiple times. Probably use the above code https://stackoverflow.com/a/53245330/4674358 wrapped in if condition
if (document.readyState === "complete") {
extractURL();
}
else {
// Add load or DOMContentLoaded event listeners here: for example,
window.addEventListener("load", function () {
extractURL();
}, false);
// or
/*document.addEventListener("DOMContentLoaded", function () {
extractURL();
}, false);*/
}
function extractURL() {
// code mentioned above
}
You want the "DOMContentLoaded" event (see the docs). It fires as soon as the document is fully parsed, but before everything has been loaded.
let addIfImage = (list, image) => image.src.match(/(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|jpeg|gif|png|svg)/g) ?
[image.src, ...list] :
list;
let getSrcFromTags= (tag = 'img') => Array.from(document.getElementsByTagName(tag))
.reduce(addIfImage, []);
if (document.readyState === "loading") {
document.addEventListener("DOMContentLoaded", doSomething);
} else { // `DOMContentLoaded` already fired
doSomething();
}
I am using this, works as expected:
var imageLoading = function(n) {
var image = document.images[n];
var downloadingImage = new Image();
downloadingImage.onload = function(){
image.src = this.src;
console.log('Image ' + n + ' loaded');
if (document.images[++n]) {
imageLoading(n);
}
};
downloadingImage.src = image.getAttribute("data-src");
}
document.addEventListener("DOMContentLoaded", function(event) {
setTimeout(function() {
imageLoading(0);
}, 0);
});
And change every src attribute of image element to data-src
So I am working on a userscript, and there is one major step I'm trying to find the easiest solution for, since I am very new to JavaScript coding... I'm trying to write a function that will open a specified URL:
EXAMPLE: Homepage ("http://www.EXAMPLE.com")
(page can be opened as 'Window.open' = Blank, or _self);
...when the parent or (current) URL that is open
EXAMPLE: innner.href = ("www.EXAMPLE.com/new/01262016/blah/blah/blah");
...has text on the HTML document page that reads:
EXAMPLE TEXT from page ("www.EXAMPLE.com/new/01262016/blah/blah/blah");:
"this is the end of the page, please refresh to return back to homepage"
(TEXT: not the real keyword, but want to use phase as a detection for a setTimeout function to return back to home.)
Any help will be much appreciated; you guys are very informative here. Thanks in advance.
I think I have the gist of your question. It is a straightforward, though quite intensive, task to scan the entire text content of a page for specific keywords with JavaScript. However, if the keywords appear more than once (on multiple pages that should not redirect) then your users will get undesirable results.
A simple solution would be to add a class="last-page" attribute to the body-tag of the final page and run a function that checks for this. Something like....
HTML
<body class="last-page"><!--page content--></body>
JS
window.onload = function() {
var interval = 5000; // five seconds
if (document.body.classList.contains('last-page')) {
setTimeout(function() {
window.location.assign('http://the-next-page.com/');
}, interval);
}
};
Alternatively, if you have the ability to wrap the specified text in a uniquely identified html-tag, such as...
<span id="last-page">EXAMPLE TEXT</span>
...then the presence of this tag can be checked on each page load - similar to the function above:
window.onload = function() {
var interval = 5000;
if (document.getElementById('last-page')) {
setTimeout(/* code as before */);
}
};
Yet another solution is to check the page URL against a variable...
window.onload = function() {
var finalURL = 'http://the-last-page.com/blah/...';
if (window.location.href === finalURL) {
/* same as before */
}
};
If this kind of thing is not an option, please leave a comment and I'll add a function that gathers a page's entire text content and compares adjacent words to a pre-defined set of keys.
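In the meantime, here is a minimal sketch of the text-scanning variant. It is a simplification: it looks for one whole phrase rather than comparing adjacent words against a set of keys, and the helper names, the five-second delay and the homepage URL are taken from the examples above or made up for illustration.

```javascript
// Normalise whitespace and case so that line breaks and extra spaces
// in the markup do not prevent a match.
function normalise(text) {
  return String(text).replace(/\s+/g, " ").trim().toLowerCase();
}

// True when `phrase` occurs anywhere in `pageText`.
function pageContainsPhrase(pageText, phrase) {
  return normalise(pageText).indexOf(normalise(phrase)) !== -1;
}

// Browser-only wiring; skipped where there is no DOM.
if (typeof document !== "undefined") {
  window.addEventListener("load", function () {
    var phrase = "this is the end of the page, please refresh to return back to homepage";
    if (pageContainsPhrase(document.body.textContent, phrase)) {
      setTimeout(function () {
        window.location.assign("http://www.EXAMPLE.com");
      }, 5000); // five seconds, as in the examples above
    }
  });
}
```

As noted earlier, this redirects on every page where the phrase appears, so the phrase needs to be specific to the final page.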
I was asked to take a look at what should be a simple problem with one of our web pages for a small dashboard web app. This app just shows some basic state info for underlying backend apps which I work heavily on. The issue is as follows:
On a page where a user can input parameters and request to view a report with the given user input, a button invokes a JS function which opens a new page in the browser to show the rendered report. The code looks like this:
$('#btnShowReport').click(function () {
document.getElementById("Error").innerHTML = "";
var exists = CheckSession();
if (exists) {
window.open('<%=Url.Content("~/Reports/Launch.aspx?Report=Short&Area=1") %>');
}
});
The page that is then opened has the following code which is called from Page_Load:
rptViewer.ProcessingMode = ProcessingMode.Remote
rptViewer.AsyncRendering = True
rptViewer.ServerReport.Timeout = CInt(WebConfigurationManager.AppSettings("ReportTimeout")) * 60000
rptViewer.ServerReport.ReportServerUrl = New Uri(My.Settings.ReportURL)
rptViewer.ServerReport.ReportPath = "/" & My.Settings.ReportPath & "/" & Request("Report")
'Set the report to use the credentials from web.config
rptViewer.ServerReport.ReportServerCredentials = New SQLReportCredentials(My.Settings.ReportServerUser, My.Settings.ReportServerPassword, My.Settings.ReportServerDomain)
Dim myCredentials As New Microsoft.Reporting.WebForms.DataSourceCredentials
myCredentials.Name = My.Settings.ReportDataSource
myCredentials.UserId = My.Settings.DatabaseUser
myCredentials.Password = My.Settings.DatabasePassword
rptViewer.ServerReport.SetDataSourceCredentials(New Microsoft.Reporting.WebForms.DataSourceCredentials(0) {myCredentials})
rptViewer.ServerReport.SetParameters(parameters)
rptViewer.ServerReport.Refresh()
I have omitted some code which builds up the parameters for the report, but I doubt any of that is relevant.
The problem is that, when the user clicks the show report button and this new page opens up, the report could take quite some time to render, depending on the types of parameters they use, and in the meantime the original page becomes completely unresponsive. The moment the report page actually renders, the main page begins functioning again. Where should I start (Google keywords, ReportViewer properties, etc.) if I want to fix this behavior such that the other page can load asynchronously without affecting the main page?
Edit -
I tried doing the following, which was in a linked answer in a comment here:
$.ajax({
context: document.body,
async: true, //NOTE THIS
success: function () {
window.open(Address);
}
});
this replaced the window.open call. This seems to work, but when I check out the documentation, trying to understand what this is doing I found this:
The .context property was deprecated in jQuery 1.10 and is only maintained to the extent needed for supporting .live() in the jQuery Migrate plugin. It may be removed without notice in a future version.
I removed the context property entirely and it didn't seem to affect the code at all... Is it OK to use this ajax call in this way to open up the other window, or is there a better approach?
Using a timeout should open the window without blocking your main page
$('#btnShowReport').click(function () {
document.getElementById("Error").innerHTML = "";
var exists = CheckSession();
if (exists) {
setTimeout(function() {
window.open('<%=Url.Content("~/Reports/Launch.aspx?Report=Short&Area=1") %>');
}, 0);
}
});
This is a long shot, but have you tried opening the window with a blank URL first, and subsequently changing the location?
$("#btnShowReport").click(function () {
if (CheckSession()) {
var pop = window.open('', 'showReport');
pop = window.open('<%=Url.Content("~/Reports/Launch.aspx?Report=Short&Area=1") %>', 'showReport');
}
});
Use
$('#btnShowReport').click(function () {
document.getElementById("Error").innerHTML = "";
var exists = CheckSession();
if (exists) {
window.location.href='<%=Url.Content("~/Reports/Launch.aspx?Report=Short&Area=1") %>';
}
});
It will work.
I can detect when the content of an iframe has loaded using the load event. Unfortunately, for my purposes, there are two problems with this:
If there is an error loading the page (404/500, etc), the load event is never fired.
If some images or other dependencies failed to load, the load event is fired as usual.
Is there some way I can reliably determine if either of the above errors occurred?
I'm writing a semi-web semi-desktop application based on Mozilla/XULRunner, so solutions that only work in Mozilla are welcome.
If you have control over the iframe page (and the pages are on the same domain name), a strategy could be as follows:
In the parent document, initialize a variable var iFrameLoaded = false;
When the iframe document is loaded, set this variable in the parent to true calling from the iframe document a parent's function (setIFrameLoaded(); for example).
check the iFrameLoaded flag using the timer object (set the timer to your preferred timeout limit) - if the flag is still false you can tell that the iframe was not regularly loaded.
I hope this helps.
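The three steps above can be sketched roughly as follows. The function name watchIframeLoad and its callbacks are hypothetical; the function it returns plays the role of setIFrameLoaded() that the iframe document would call on its parent.

```javascript
// Starts a timer; if the returned function is not called before
// `timeoutMs` elapses, the iframe is treated as not having loaded.
function watchIframeLoad(timeoutMs, onLoaded, onError) {
  var settled = false; // plays the role of the iFrameLoaded flag

  var timer = setTimeout(function () {
    if (!settled) {
      settled = true;
      onError();
    }
  }, timeoutMs);

  // The iframe document would call this, e.g. parent.setIFrameLoaded()
  return function setIFrameLoaded() {
    if (settled) return; // too late, or already reported
    settled = true;
    clearTimeout(timer);
    onLoaded();
  };
}

// Hypothetical usage in the parent page:
// window.setIFrameLoaded = watchIframeLoad(10000,
//   function () { console.log('iframe loaded'); },
//   function () { console.log('iframe did not load in time'); });
```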
This is a very late answer, but I will leave it to someone who needs it.
Task: load iframe cross-origin content, emit onLoaded on success and onError on load error.
This is the most cross-browser, origin-independent solution I could come up with. First, though, I will briefly describe the other approaches I tried and why they fall short.
1. iframe. It was a bit of a shock to me that an iframe only has an onload event, which is called both on load and on error, so there is no way to tell whether an error occurred.
2. performance.getEntriesByType('resource'). This method returns the loaded resources, which sounds like what we need. Unfortunately, Firefox always adds a Resource entry to the array whether the load succeeded or failed, and there is no way to tell from the Resource instance whether it succeeded. Also, this method does not work on iOS < 11.
3. script. I tried to load the HTML using a <script> tag. It emits onload and onerror correctly, but sadly only in Chrome.
And when I was ready to give up, my senior colleague told me about the HTML4 <object> tag. It is like the <iframe> tag, except that it has fallback content for when the content fails to load. That sounds like what we need! Sadly, it is not as easy as it sounds.
CODE SECTION
var obj = document.createElement('object');
// we need to specify a fallback (I will mention why later)
obj.innerHTML = '<div style="height:5px"></div>'; // fallback
obj.style.display = 'block'; // so height=5px will work
obj.style.visibility = 'hidden'; // to hide before loaded
obj.data = src;
After this we can set some attributes to <object> like we'd wanted to do with iframe. The only difference, we should use <params>, not attributes, but their names and values are identical.
for (var prop in params) {
if (params.hasOwnProperty(prop)) {
var param = document.createElement('param');
param.name = prop;
param.value = params[prop];
obj.appendChild(param);
}
}
Now, the hard part. Like many similar elements, <object> has no spec for its load callbacks, so each browser behaves differently.
Chrome. On error and on load emits load event.
Firefox. Emits load and error correctly.
Safari. Emits nothing....
Seems like no different from iframe, getEntriesByType, script....
But we have the native browser fallback! Because we set the fallback content (innerHTML) directly, we can tell whether <object> is loaded or not:
function isReallyLoaded(obj) {
return obj.offsetHeight !== 5; // fallback height
}
/**
* Chrome calls always, Firefox on load
*/
obj.onload = function() {
isReallyLoaded(obj) ? onLoaded() : onError();
};
/**
* Firefox on error
*/
obj.onerror = function() {
onError();
};
But what to do with Safari? Good old setTimeout.
var interval = function() {
if (isLoaded) { // some flag
return;
}
if (hasResult(obj)) {
if (isReallyLoaded(obj)) {
onLoaded();
} else {
onError();
}
}
setTimeout(interval, 100);
};
function hasResult(obj) {
return obj.offsetHeight > 0;
}
Yeah.... not so fast. The thing is, when <object> fails it goes through states that the specs do not mention:
Trying to load (size = 0)
Fails (size = any), really
Fallback (size = as in innerHTML)
So, code needs a little enhancement
var interval = function() {
if (isLoaded) { // some flag
return;
}
if (hasResult(obj)) {
if (isReallyLoaded(obj)) {
interval.count++;
// needs less than 400ms to fallback
interval.count > 4 && onLoadedResult(obj, onLoaded);
} else {
onErrorResult(obj, onError);
}
}
setTimeout(interval, 100);
};
interval.count = 0;
setTimeout(interval, 100);
Well, and to start loading
document.body.appendChild(obj);
That is all. I tried to explain code in every detail, so it may look not so foolish.
P.S. WebDev sucks
I had this problem recently and had to resort to setting up a Javascript Polling action on the Parent Page (that contains the IFRAME tag). This JavaScript function checks the IFRAME's contents for explicit elements that should only exist in a GOOD response. This assumes of course that you don't have to deal with violating the "same origin policy."
Instead of checking for all possible errors which might be generated from the many different network resources.. I simply checked for the one constant positive Element(s) that I know should be in a good response.
After a pre-determined time and/or # of failed attempts to detect the expected Element(s), the JavaScript modifies the IFRAME's SRC attribute (to request from my Servlet) a User Friendly Error Page as opposed to displaying the typical HTTP ERROR message. The JavaScript could also just as easily modify the SRC attribute to make an entirely different request.
function checkForContents(){
var contents=document.getElementById('myiframe').contentWindow.document
if(contents){
alert('found contents of myiframe:' + contents);
if(contents.documentElement){
if(contents.documentElement.innerHTML){
alert("Found contents: " +contents.documentElement.innerHTML);
if(contents.documentElement.innerHTML.indexOf("FIND_ME") > -1){
openMediumWindow("woot.html", "mypopup");
}
}
}
}
}
I think that the pageshow event is fired for error pages. Or if you're doing this from chrome, then you can check your progress listener's request to see if it's an HTTP channel, in which case you can retrieve the status code.
As for page dependencies, I think you can only do this from chrome by adding a capturing onerror event listener, and even then it will only find errors in elements, not CSS backgrounds or other images.
Doesn't answer your question exactly, but my search for an answer brought me here, so I'm posting just in case anyone else had a similar query to me.
It doesn't quite use a load event, but it can detect whether a website is accessible and callable (if it is, then the iFrame, in theory, should load).
At first, I thought to do an AJAX call like everyone else, except that it didn't work for me initially, as I had used jQuery. It works perfectly if you do an XMLHttpRequest:
var url = "http://url_to_test.com/";
var xhttp = new XMLHttpRequest();
xhttp.onreadystatechange = function() {
if (this.readyState == 4 && this.status != 200) {
console.log("iframe failed to load");
}
};
xhttp.open("GET", url, true);
xhttp.send();
Edit:
So this method works ok, except that it has a lot of false negatives (picks up a lot of stuff that would display in an iframe) due to cross-origin malarky. The way that I got around this was to do a CURL/Web request on a server, and then check the response headers for a) if the website exists, and b) if the headers had set x-frame-options.
This isn't a problem if you run your own webserver, as you can make your own api call for it.
My implementation in node.js:
app.get('/iframetest',function(req,res){ //Call using /iframetest?url=url - needs to be stripped of http:// or https://
var url = req.query.url;
var request = require('https').request({host: url}, function(response){ //This does an https request - require('http') if you want to do a http request
var headers = response.headers;
if (typeof headers["x-frame-options"] != 'undefined') {
res.send(false); //Headers don't allow iframe
} else {
res.send(true); //Headers don't disallow iframe
}
});
request.on('error',function(e){
res.send(false); //website unavailable
});
request.end();
});
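The header check itself can be pulled out into a small function, which makes it easier to extend. The sketch below is my own variation rather than part of the handler above: it also rejects error status codes and treats a Content-Security-Policy frame-ancestors directive as a refusal. Like the x-frame-options test, it is deliberately conservative and does not try to work out whether your particular origin would be allowed. The function name is made up.

```javascript
// Decide from a response's status code and headers whether the page is
// likely to render inside an iframe. Node's http/https modules expose
// header names in lower case, which this relies on.
function headersAllowFraming(statusCode, headers) {
  // Error responses are treated as "website unavailable".
  if (statusCode >= 400) return false;

  // Any X-Frame-Options value (DENY or SAMEORIGIN) is treated as a refusal.
  if (headers["x-frame-options"] !== undefined) return false;

  // A frame-ancestors directive is treated the same way (conservative).
  var csp = String(headers["content-security-policy"] || "");
  if (/(^|;)\s*frame-ancestors\s/i.test(csp)) return false;

  return true;
}

// Hypothetical usage inside the response callback above:
// res.send(headersAllowFraming(response.statusCode, response.headers));
```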
Give an id to the topmost (body) element of the page that is being loaded in your iframe.
In the load handler of your iframe, check whether getElementById() returns a non-null value.
If it does, the iframe has loaded successfully; otherwise it has failed.
In that case, set frame.src = "about:blank". Make sure to remove the load handler before doing that.
If the iframe is loaded on the same origin as the parent page, then you can do this:
iframeEl.addEventListener('load', function() {
// NOTE: contentDocument is null if a connection error occurs or if
// X-Frame-Options is not SAMEORIGIN (which could happen with
// 4xx or 5xx error pages if the corresponding error handlers
// do not specify SAMEORIGIN). If error handlers do not specify
// SAMEORIGIN, then networkErrorOccurred will incorrectly be set
// to true.
const networkErrorOccurred = !iframeEl.contentDocument;
const serverErrorOccurred = (
!networkErrorOccurred &&
!iframeEl.contentDocument.querySelector('#well-known-element')
);
if (networkErrorOccurred || serverErrorOccurred) {
let errorMessage;
if (networkErrorOccurred) {
errorMessage = 'Error: Network error';
} else if (serverErrorOccurred) {
errorMessage = 'Error: Server error';
} else {
// Assert that the above code is correct.
throw new Error('networkErrorOccurred and serverErrorOccurred are both false');
}
alert(errorMessage);
}
});