With PhantomJS, I want to print the HTML source of a webpage the way Firebug shows it: interpreted, and including iframes.
var page = require('webpage').create();
page.open('http://google.com', function () {
console.log(page.content);
phantom.exit();
});
This only seems to show the interpreted HTML without the iframes' HTML. Using evaluate can't help either, because my iframes are on another domain, so I think JavaScript will not have access to them.
I found that going through frames to get content did not work because page.framesCount in phantomjs counts only the child frames and not the main frame. Here is working code to display the HTML of all frames:
// Apparently framesCount doesn't include the main frame, so add 1
var frameCount = page.framesCount + 1;
var html = page.frameContent + '\n\n';
for (var i = 1; i < frameCount; ++i) {
    page.switchToFrame(i);
    html += page.frameContent + '\n\n';
}
One last important thing: if you don't want the source but want to access the iframe DOM even when it's on another domain, launch PhantomJS like this:
phantomjs --web-security=no
The code to access the iframe body is:
var i = document.getElementsByTagName('iframe');
var body = i[0].contentWindow.document.body;
I am new to web dev and have done a lot of research on my problem, but I still feel stuck:
What I am trying to achieve:
Pass data via URL parameters to an iframe src to be used in the iframe document. The parent page and the iframe are on different domains.
Setup:
I can have one or multiple iframe(s) with src="www.example.com". After the DOM is fully loaded, I try to change the iframe(s) src in order to append some query parameters.
I am using pure JS to get the initial iframe(s) src > concatenate it with the query parameters > save the new URL in a variable > change the iframe(s) src to the URL from the variable.
Problem:
Sometimes the iframe src doesn't change; it seems to be related to the internet connection. Every few refreshes the JavaScript will not succeed in changing the src, but doesn't throw any errors.
My troubleshooting:
If I inspect the page and go to Network (Chrome), I can see there are 2 HTTP requests for the iframe: the first with the initial src and the second with the new src (even if the JS doesn't succeed in changing it).
From here I encountered 3 scenarios:
1. The first HTTP request is cancelled and the second one finishes – everything is fine: the src is changed and the data is passed.
2. Both HTTP requests remain in 'pending' status – the src is still changed and the data is passed.
3. The first HTTP request finishes before the second one starts, and the second one remains in 'pending' status – this is the problem: the src doesn't change, it remains the old one even though the JS seems to have executed properly.
I understand that when the src of an iframe is changed it should cause the iframe to reload, thus triggering the second http request. However, when scenario 3 happens the iframe doesn’t reload. It’s like an http request ‘conflict’ (not sure if it's the correct way to put it).
Why would changing the src not reload the iframe consistently?
I appreciate any suggestions, best practices or possible alternatives to make this work.
Here is my code. I've put comments to describe what I intend to do:
var paramToPass = ['param1', 'param2', 'param3'];
if (paramToPass.length > 0) {
    var myIframe = 'myIframeClass'; // class name of the iframes to find
    var iframes = document.getElementsByClassName(myIframe); // get all iframes on the page with the desired class
    var paramStr = paramToPass.join('&'); // create the query parameter string
    for (var i = 0; i < iframes.length; i++) {
        var initialSrc = iframes[i].getAttribute('src'); // find the initial src of the iframe
        var newSrc;
        if (initialSrc.indexOf('?') > -1) { // initialSrc already contains parameters
            newSrc = initialSrc + '&' + paramStr; // concatenate initial src with parameters
        } else { // initialSrc doesn't contain parameters yet
            newSrc = initialSrc + '?' + paramStr;
        }
        iframes[i].src = newSrc; // change the iframe src
        // iframes[i].setAttribute('src', newSrc); // alternative way to change the src
    }
}
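The separator choice above can be isolated into a small helper, which makes it easier to test on its own; `appendParams` is a hypothetical name, not part of the original code:

```javascript
// Hypothetical helper: append query parameters to a src,
// using '&' when the src already has a query string, '?' otherwise.
function appendParams(src, params) {
  var sep = src.indexOf('?') > -1 ? '&' : '?';
  return src + sep + params.join('&');
}
```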
Thank you!
EDIT:
New, working code. For wider browser compatibility, I've included polyfills for Array.from() and forEach():
var paramToPass = ['param1', 'param2', 'param3'];
if (paramToPass.length > 0) {
    var myIframe = 'myIframeClass';
    var iframes = document.getElementsByClassName(myIframe);
    var paramToPassStr = paramToPass.join('&');
    Array.from(iframes).forEach(function(iframe) {
        iframe.onload = function() { // execute after the iframe is loaded; prevents the http request conflict
            var initialSrc = iframe.src;
            var sep = initialSrc.indexOf("?") > -1 ? "&" : "?"; // choose separator based on whether params already exist in initialSrc
            var newSrc = initialSrc + sep + paramToPassStr;
            iframe.src = newSrc;
            console.log('iframe reload');
            iframe.onload = null; // stops the infinite reload loop
        };
    });
} else { // paramToPass array is empty, log an error
    console.log('no parameters to pass found');
}
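The question mentions polyfills without showing them; a minimal Array.from fallback for array-like values might look like the sketch below. This is an assumption on my part, not the polyfill actually used, and it deliberately skips iterable and mapFn support:

```javascript
// Minimal fallback: copies index/length array-likes (e.g. an HTMLCollection)
// into a real array. Does not handle iterables or Array.from's mapFn argument.
if (!Array.from) {
  Array.from = function (arrayLike) {
    var result = [];
    for (var i = 0; i < arrayLike.length; i++) {
      result.push(arrayLike[i]);
    }
    return result;
  };
}
```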
You should probably put the code that changes the iframe src into its onload handler, so it runs after the iframe has successfully loaded the first time. Then there won't be any conflict.
Array.from(iframes).forEach(function(iframe) {
    iframe.onload = function() {
        var initialSrc = this.src;
        var sep = initialSrc.indexOf("?") > -1 ? "&" : "?";
        var newSrc = initialSrc + sep + paramStr;
        this.src = newSrc;
        this.onload = null;
    };
});
I'm writing an Electron app, but this is a question about JavaScript/HTML5 in general. I want to load local content in a webview and then open iframes from a particular remote resource inside it. Unfortunately I can't, because of the X-Frame-Options header. So I came up with a workaround: load the remote content, erase the DOM, and inject my own local content, using a custom file protocol to embed local resources.
Basically I want to totally erase everything, no matter what is loaded into the webview. I got the DOM-erasing part working with document.write(). But how can I unset all variables that could have been set by that page? Or could I prevent the document from being written to in the first place? Or is there a better, less hacky way to do what I want? This is my current code, which erases the DOM:
It runs from a preload script, before anything else:
(function() {
    var originalProperties = Object.getOwnPropertyNames(window); // global variables before the DOM is loaded
    var injectDOM = function() {
        document.removeEventListener('DOMContentLoaded', injectDOM);
        // trying to erase global variables set by the remote resource, if any. Is there a better way?
        var newProperties = Object.getOwnPropertyNames(window);
        var difference = newProperties.filter(x => originalProperties.indexOf(x) == -1);
        for (var i = 0; i < difference.length; i++) {
            if (window.hasOwnProperty(difference[i])) {
                window[difference[i]] = null;
                delete window[difference[i]];
                // some variables still stay (delete returns false), however they are nulled.
                // But is there a better way to do that, and what about possible attached event listeners?
            }
        }
        var html = '';
        html += '<!DOCTYPE html>';
        html += '<!-- automagically injected -->';
        html += '<html>';
        html += '<head>';
        //html += '<script>alert("test");</script>';
        html += '</head>';
        html += '<body>';
        html += 'hello world';
        html += '</body>';
        html += '</html>';
        document.write(html);
    };
    document.addEventListener('DOMContentLoaded', injectDOM);
})();
I also tried comparing Object.getOwnPropertyNames(window) before and after the DOM was loaded, but something tells me it's not the best way to do it.
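The snapshot-diff step in the preload script can be factored into a plain function, which keeps the DOM-dependent code separate from the list logic; `newGlobals` is a hypothetical name I've introduced for illustration:

```javascript
// Hypothetical helper: names present in `after` but not in `before`,
// i.e. globals added between two Object.getOwnPropertyNames snapshots.
function newGlobals(before, after) {
  return after.filter(function (name) {
    return before.indexOf(name) === -1;
  });
}
```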
Update: I managed to solve the problem more elegantly with @wOxxOm's help. I posted my solution in the original GitHub issue: https://github.com/electron/electron/issues/5036
I'm attempting to write a bookmarklet-like js snippet that can be run from the developer tools console that will provide the src for images in the page:
var x = ["PA633", "PA10", "PA11"];
function nextPage(i) {
  $('#viewport div:first-child').animate({scrollTop: i}, 200);
  i += 1020;
  if (i < 20000) { setTimeout(nextPage, 200, i); }
  for (var index = 0; index < $('div.pageImageDisplay img').length; ++index) {
    var page = /&pg=([A-Z]{2,3}\d+)&/.exec($('div.pageImageDisplay img')[index].src);
    if (page && $.inArray(page[1], x) != -1) { // guard against a non-matching src
      x.splice(x.indexOf(page[1]), 1);
      var embiggen = $('div.pageImageDisplay img')[index].src.replace(/&w=\d+$/, "&w=1200");
      console.log(embiggen);
    }
  }
}
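The fragile spot in the snippet above is the regex match, which yields null when a src has no `pg` parameter; the extraction can be wrapped in a guarded helper (`extractPageId` is a hypothetical name, not part of the original snippet):

```javascript
// Hypothetical helper: pull the page id (e.g. "PA633") out of an image src,
// or return null when the pattern is absent.
function extractPageId(src) {
  var m = /&pg=([A-Z]{2,3}\d+)&/.exec(src);
  return m ? m[1] : null;
}
```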
This script works in that it provides the correct src links for each image. Is there a way to have javascript download/save each link automatically? It's possible to click on each link (Chrome opens these in a new tab), but somewhat tedious to do so.
The proper way to do it would be to have the javascript snippet save the images to the downloads folder itself, but I have a vague notion this isn't possible. Is it possible, and if so how could that be accomplished?
Please note that this javascript won't be included in a web page directly, but is meant specifically to run from the dev tools console.
This requires several different parts to work. First off, it's necessary to add a link to the page (unless you can reuse an existing one) with something like this:
$("body").append($("<a id='xyz'/>"));
Then, you need to set the href of the link to that of the file to be downloaded:
$('#xyz').attr("download", page[1] + ".png");
$('#xyz').attr("href", embiggen);
Note that we can also (within Chrome at least) set the filename automatically, via the download attribute.
Finally, JavaScript can issue a click event to the anchor tag itself with:
$('#xyz')[0].click();
When this runs, it automatically downloads the file. Setting the filename seems to prevent it from popping up the file dialog too.
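The three steps can be rolled into one helper; this is a sketch under my own assumptions (the function name and the removeChild cleanup are my additions, not part of the original answer):

```javascript
// Sketch: create a temporary anchor, set the download filename,
// click it programmatically, then remove the anchor again.
function downloadUrl(url, filename) {
  var a = document.createElement('a');
  a.href = url;
  a.download = filename; // Chrome uses this as the saved filename
  document.body.appendChild(a);
  a.click();
  document.body.removeChild(a);
}
```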
$("body").append($("<a id='xyz'/>"));
The above code gave me the following error in some versions of Chrome:
VM42:1 Uncaught DOMException: Failed to execute '$' on 'CommandLineAPI': '<a id='xyz'/>' is not a valid selector. at <anonymous>:1:18
Please try the following code instead, using plain old JavaScript.
var url = 'your url goes here';
var elem = document.createElement('a');
elem.href = url;
elem.download = url;
elem.id = "downloadAnchor";
document.body.appendChild(elem);
elem.click();
You can check the detailed explanation in this answer.
I'm working on Chrome extension and I have following problem:
var myDiv = document.createElement('div');
myDiv.innerHTML = '<img src="a.png">';
What happens now is that Chrome tries to load the "a.png" resource, even if I don't attach the div element to the document. Is there a way to prevent it?
In the extension I need to get data from a site that doesn't provide any API, so I have to parse the whole HTML to get the necessary data. Writing my own simple HTML parser could be tricky, so I would rather use the native HTML parser. However, in Chrome, when I put the whole source code into a temporary non-attached element (so it would get parsed and I could filter out the necessary data), all the images (and possibly other resources) start to load as well, causing higher traffic or (in case of relative paths) lots of errors in the console.
To prevent the resources from being loaded, you'll need to create your Node in an entirely new #document. You can use document.implementation.createHTMLDocument for this.
var dom = document.implementation.createHTMLDocument(); // make new #document
// now use this to..
var myDiv = dom.createElement('div'); // ..create a <div>
myDiv.innerHTML = '<img src="a.png">'; // ..parse HTML
You can delay parsing/loading the HTML by storing it in a non-standard attribute, then assigning it to innerHTML when the time comes:
myDiv.setAttribute('deferredHtml', '<img src="http://upload.wikimedia.org/wikipedia/commons/4/4e/Single_apple.png">');
global.loadDeferredImage = function() {
if(myDiv.hasAttribute('deferredHtml')) {
myDiv.innerHTML = myDiv.getAttribute('deferredHtml');
myDiv.removeAttribute('deferredHtml');
}
};
... onclick="loadDeferredImage()"
I created jsfiddle illustrating this idea:
http://jsfiddle.net/akhikhl/CbCst/3/
I am trying to write a report generator to collect user comments from a list of external HTML files. User comments are wrapped in <span> elements.
Can this be done using JavaScript?
Here's my attempt:
function generateCommentReport()
{
    var files = document.querySelectorAll('td a'); // files to scan are links in an HTML table
    var outputWindow = window.open(); // output browser window for the report
    for (var i = 0; i < files.length; i++) {
        // open each file in a browser window
        var win = window.open();
        win.location.href = files[i].href;
        // scan the opened window for comments
        var comments = win.document.querySelectorAll('.comment');
        for (var j = 0; j < comments.length; j++) {
            // add to the output report
            outputWindow.document.write(comments[j].innerHTML);
        }
    }
}
You will need to wait for onload on the target window before you can read content from its document.
Also, what type of element holds a comment? In general you can't put a name on just any element. While unknown attributes like a misplaced name may be ignored, you can't guarantee that browsers will take account of them for getElementsByName. (In reality, most browsers do, but IE doesn't.) A class might be a better bet?
Each web browser works in a defined and controlled workspace on the user's computer, where certain things, like the file system, are restricted from code. These are safety standards that ensure no malicious code from the internet runs on your system to phish sensitive information stored on it. A browser only allows such access if the user grants it explicitly.
But for an internet application I can suggest:
- If the list of commands is static, cache it via XML, JSON or cookies [it will be stored on the user's system until it expires]
- If it is dynamic, use Ajax to retrieve it
I think I have the solution to this.
var windows = [];
var report = null;
function handlerFunctionFactory(i,max){
return function (evt){
//Scan opened window for 'comment's
var comments = windows[i].document.querySelectorAll('.comment');
for(var j=0;j<comments.length;j++){
//Add to output report
report.document.write(comments[j].innerHTML);
}
if((i+1)==max){
report.document.write("</body></html>");
report.document.close();
}
windows[i].close();
}
}
function generateReport()
{
var files = document.querySelectorAll('td a'); //The list of files to scan is stored as links in an HTML table
report = window.open(); //Output browser window for report
report.title = 'Comment Report';
report.document.open();
report.document.write('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    + '<html><head><title>Comment Report</title>'
    + '</head><body>');
for(var i = 0; i<files.length; i++){
//Open each file in a browser window
win = window.open();
windows.push(win);
win.location.href = files[i].href;
win.onload = handlerFunctionFactory(i,files.length);
}
}
Any refactoring tips are welcome. I am not entirely convinced that a factory is the best way to bind the onload handlers to an instance, for example.
This works only in Firefox :(
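The factory is needed because `var` loop variables are shared by every closure created in the loop: without it, every onload handler would see the final value of `i`. A pure sketch of the pattern, with hypothetical names, shows the fix in isolation:

```javascript
// Wrapping the loop body in an immediately-invoked function captures
// each index in its own scope, so every handler keeps its own value.
function makeHandlers(n) {
  var handlers = [];
  for (var i = 0; i < n; i++) {
    handlers.push((function (idx) {
      return function () { return idx; };
    })(i));
  }
  return handlers;
}
```

With ES2015, declaring the loop variable with `let` gives each iteration its own binding and makes the factory unnecessary.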