Chrome Extension - Get the true raw HTML document before JavaScript rendering

I'm using a content script (loaded with run_at: document_start) to try to grab the exact source of the page before any DOM modifications are made by JavaScript.
I want the pure HTML - exactly what you'd get from Right Click > View Source in the browser.
I've tried two methods which both nearly work but not quite.
Here's the actual raw source of the page, from Right Click > View Source
<!doctype html>
<html lang="en">
<head>
<title>Raw HTML title</title>
</head>
<body>
<p>Something here.</p>
<script>
document.title = 'Title injected by JS';
</script>
</body>
</html>
Things I've tried:
new XMLSerializer().serializeToString(document)
This produces the following:
<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en"><head>
<title>Raw HTML title</title>
</head>
<body>
<p>Something here.</p>
<script>
document.title = 'Title injected by JS';
</script></body></html>
It's close, but the formatting isn't the same as the original: 'doctype' is capitalised and an xmlns attribute has been added to the <html> tag.
document.documentElement.outerHTML
Produces the following:
<html lang="en"><head>
<title>Raw HTML title</title>
</head>
<body>
<p>Something here.</p>
<script>
document.title = 'Title injected by JS';
</script></body></html>
It's missing the doctype and the formatting is also not as per the original.

It doesn't seem you can get the 'pure' HTML as seen in View Source. The closest you can get is what's returned by
new XMLSerializer().serializeToString(document)
If you run the above in a content script loaded with run_at: document_start (before anything exists in the DOM at all) and monitor for DOM mutations, you can serialize the document as the first mutations arrive with something like this:
var observer = new MutationObserver(function(mutations) {
mutations.forEach(function(mutation) {
var rawHTML = new XMLSerializer().serializeToString(document);
console.log(rawHTML);
});
});
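// subtree: true is needed to see nodes added anywhere below the observed node
var config = { attributes: true, childList: true, characterData: true, subtree: true };
// Observe the document itself; at document_start the <html> element may not exist yet
observer.observe(document, config);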
XMLSerializer() has solid browser support: https://caniuse.com/#feat=xml-serializer
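For the document.documentElement.outerHTML attempt from the question, the missing doctype can be put back by hand from document.doctype. Below is a minimal sketch of that idea; the helper names serializeDoctype and serializeDocument are made up for illustration. Note that the result is a normalized doctype (for example <!DOCTYPE html>), so it still will not match View Source byte for byte.

```javascript
// Hypothetical helper: rebuild a doctype declaration from a DocumentType-like
// object (anything exposing name, publicId and systemId).
function serializeDoctype(doctype) {
  if (!doctype) return "";
  let out = "<!DOCTYPE " + doctype.name;
  if (doctype.publicId) {
    out += ' PUBLIC "' + doctype.publicId + '"';
  } else if (doctype.systemId) {
    out += " SYSTEM";
  }
  if (doctype.systemId) {
    out += ' "' + doctype.systemId + '"';
  }
  return out + ">";
}

// Hypothetical helper: doctype (if any) followed by the serialized root element.
function serializeDocument(doc) {
  const doctype = serializeDoctype(doc.doctype);
  return (doctype ? doctype + "\n" : "") + doc.documentElement.outerHTML;
}

// In a content script this would be called as:
//   var html = serializeDocument(document);
```

Like the XMLSerializer approach, this reflects the parsed DOM at the moment of the call, not the bytes the server sent.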

Related

How do I call the JavaScript function properly?

<!DOCTYPE html>
<html lang="en">
<head>
<title>JavaScript Example</title>
<script>
function displayString() {
return "<h1>Main Heading</h1>"
}
displayString();
document.write("Execute during page load from the head<br>");
</script>
</head>
<body>
<script>
document.write("Execute during page load from the body<br>");
</script>
</body>
</html>
So this is my problem: no matter where I put displayString(), the h1 never shows up in the browser. Can anybody please help me see where I am going wrong? I am new to JavaScript. What I am trying to do is call the function.
You need to write the returned string to the document:
<!DOCTYPE html>
<html lang="en">
<head>
<title>JavaScript Example</title>
<script>
function displayString() {
return "<h1>Main Heading</h1>"
}
document.write(displayString());
document.write("Execute during page load from the head<br>");
</script>
</head>
<body>
<script>
document.write("Execute during page load from the body<br>");
</script>
</body>
</html>
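The reason the original call showed nothing is that a function's return value is simply dropped unless the caller does something with it. A small sketch of the difference, using console.log in place of document.write so it can be run outside a page:

```javascript
function displayString() {
  return "<h1>Main Heading</h1>";
}

// Calling the function on its own builds the string and then discards it.
displayString();

// Capturing the return value makes it available for output.
const heading = displayString();
console.log(heading);
```

In the page itself, document.write(displayString()) does the same thing in one step: the return value is passed straight to document.write.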
No matter where I put the displayString(), the h1 just never seems to
show up on the browser.
If you wish to add a new element to a document, several approaches are available:
document.write (effectively deprecated)
.innerHTML (sometimes useful, but can be slow)
DOM API - recommended approach
The recommended approach is to use the DOM API.
DOM stands for Document Object Model. Essentially it's the markup of your document represented as a tree-like structure of nodes. There are many DOM API functions which allow you to:
add
remove
append
prepend
insert
update
new DOM nodes.
Any DOM node may be added, removed or updated, including:
parent elements
child elements
sibling elements
element attributes
ids, classNames, classLists
custom data-* attributes
text nodes
Here is an example:
function displayMainHeading () {
let mainHeading = document.createElement('h1');
mainHeading.textContent = 'Main Heading';
document.body.prepend(mainHeading);
}
displayMainHeading();
<p>Paragraph 1</p>
<p>Paragraph 2</p>
Further Reading
This is a good primer to get you started:
A Beginners Guide To DOM Manipulation by Iqra Masroor

document.getElementById can't find qUnit DOM elements

I'm trying to get a reference to a DOM object created by qUnit, with no luck. It works just fine with a "home made" DOM element. I have made a test site to illustrate the problem. Turn on Firebug or other logging window when visiting the site.
This is the code of the website:
window.onload = function() {
var qunitTestrunnerToolbar_element = document.getElementById("qunit-testrunner-toolbar");
console.log("qunitTestrunnerToolbar_element: ", qunitTestrunnerToolbar_element);
var test_element = document.getElementById("test_element");
console.log("test_element: ", test_element);
};
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Testing 'require' error</title>
</head>
<body>
<p>See console for output</p>
<script src="index.js"></script>
<script src="http://code.jquery.com/jquery-2.2.0.js"></script>
<p id="test_element">Test element</p>
</body>
</html>
It won't work like this.
Setting QUnit aside, document.getElementById("qunit-testrunner-toolbar"); will return null because no element with that id is present in this HTML.
If you are specifically asking how to get the actual element rather than null:
You could add your original script file to this HTML; then var qunitTestrunnerToolbar_element = document.getElementById("qunit-testrunner-toolbar"); in index.js will log the element. Alternatively, if you can include an <iframe> in your test HTML, you can do
<iframe src="urlWithinYourDomain.html" style="display:none" id="iframeId"></iframe>
and in index.js
var qunitTestrunnerToolbar_element = document.getElementById('iframeId').contentWindow.document.getElementById('qunit-testrunner-toolbar');
if you prefer the HTML way.

DOMDocument stripping tags from inline scripts PHP

This is a strange one but looks like $dom->saveHTML() is stripping tags from inline javascript
$domStr = '
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>my page</title>
<script>
var elem = "<div>some content</div>";
</script>
</head>
<body>
<div>
MY PAGE
</div>
</body>
</html>
';
$doc = new DOMDocument();
libxml_use_internal_errors(true);//prevents tags in js from throwing errors; see php.net manual
$doc->formatOutput = true;
$doc->strictErrorChecking = false;
$doc->preserveWhiteSpace = true;
$doc->loadHTML($domStr);
echo $doc->saveHTML();
exit;
http://sandbox.onlinephpfunctions.com/code/ad59a2a1016b2128e437ef61dbe00f1c511bff8d
if you use libxml_use_internal_errors(true); you will not see what is wrong but if removed you get
<b>Warning</b>: DOMDocument::loadHTML(): Unexpected end tag : div
Same thing happens with
$doc->formatOutput = false;
Any help is appreciated.
I've avoided this by not including any HTML in my inline JavaScript. Instead, I've added <template> elements containing the HTML string I want to manipulate in JS, and then I read that dynamically at runtime. For example:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>my page</title>
</head>
<body>
<div>
MY PAGE
</div>
<template id="content-template">
<div>some content</div>
</template>
<script>
var elem = document.getElementById('content-template').innerHTML;
...
</script>
</body>
</html>
It is probably a bug in DOMDocument.
You have to escape the HTML closing tag inside the JS string, or it gets misinterpreted.
This should work:
var elem = "<div>some content<\/div>";
Alternatively, if you pass 1 as the options argument to loadHTML(), the parser will ignore it.
In a bit of an oddity 1 can mean both LIBXML_SCHEMA_CREATE and LIBXML_ERR_WARNING as these two predefined constants have the same value. Presumably it is meant to be LIBXML_SCHEMA_CREATE which does the following "Create default/fixed value nodes during XSD schema validation".
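For the escaping approach above, the backslash does not change the value of the string: in JavaScript "<\/div>" and "</div>" are the same string, the escape only hides the closing tag from the HTML parser. If the markup is generated rather than typed by hand, the escaping can be automated. Here is a rough sketch; toInlineScriptLiteral is a made-up helper name, not part of any library.

```javascript
// Hypothetical helper: turn arbitrary text into a JavaScript string literal
// that is safer to place inside an inline <script> block.
function toInlineScriptLiteral(text) {
  // JSON.stringify produces a double-quoted string literal.
  // Rewriting "</" as "<\/" keeps the evaluated value identical.
  return JSON.stringify(text).replace(/<\//g, "<\\/");
}

const literal = toInlineScriptLiteral("<div>some content</div>");
console.log("var elem = " + literal + ";");
// prints: var elem = "<div>some content<\/div>";
```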
You're missing the opening <html> tag right after the DOCTYPE declaration.

Nested iframes, AKA Iframe Inception

Using jQuery I am trying to access div id="element".
<body>
<iframe id="uploads">
<iframe>
<div id="element">...</div>
</iframe>
</iframe>
</body>
All iframes are on the same domain with no www / non-www issues.
I have successfully selected elements within the first iframe but not the second nested iframe.
I have tried a few things, this is the most recent (and a pretty desperate attempt).
var iframe = jQuery('#upload').contents();
var iframeInner = jQuery(iframe).find('iframe').contents();
var iframeContent = jQuery(iframeInner).contents().find('#element');
// iframeContent is null
Edit:
To rule out a timing issue I used a click event and waited a while.
jQuery().click(function(){
var iframe = jQuery('#upload').contents().find('iframe');
console.log(iframe.find('#element')); // [] null
});
Any ideas?
Thanks.
Update:
I can select the second iframe like so...
var iframe = jQuery('#upload').contents().find('iframe');
The problem now seems to be that the src is empty as the iframe is generated with javascript.
So the iframe is selected but the content length is 0.
The thing is, the code you provided won't work because the <iframe> element has to have a "src" attribute, like:
<iframe id="uploads" src="http://domain/page.html"></iframe>
It's OK to use .contents() to get the content:
$('#uploads').contents() will give you access to the second iframe, but only if that iframe is inside the http://domain/page.html document that the #uploads iframe loaded.
To test I'm right about this, I created 3 html files named main.html, iframe.html and noframe.html and then selected the div#element just fine with:
$('#uploads').contents().find('iframe').contents().find('#element');
There WILL be a delay during which the element is not available, since you need to wait for the iframe to load its resource. Also, all iframes have to be on the same domain.
Hope this helps ...
Here goes the html for the 3 files I used (replace the "src" attributes with your domain and url):
main.html
<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
<title>main.html example</title>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script>
$(function () {
console.log( $('#uploads').contents().find('iframe').contents().find('#element') ); // nothing at first
setTimeout( function () {
console.log( $('#uploads').contents().find('iframe').contents().find('#element') ); // wait and you'll have it
}, 2000 );
});
</script>
</head>
<body>
<iframe id="uploads" src="http://192.168.1.70/test/iframe.html"></iframe>
</body>
</html>
iframe.html
<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
<title>iframe.html example</title>
</head>
<body>
<iframe src="http://192.168.1.70/test/noframe.html"></iframe>
</body>
</html>
noframe.html
<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
<title>noframe.html example</title>
</head>
<body>
<div id="element">some content</div>
</body>
</html>
var iframeInner = jQuery(iframe).find('iframe').contents();
var iframeContent = jQuery(iframeInner).contents().find('#element');
iframeInner contains the elements from
<div id="element">other markup goes here</div>
and iframeContent will look for elements that are inside of
<div id="element">other markup goes here</div>
(find doesn't match the current element itself, only its descendants), which is why it comes back empty.
Hey, I've got something that seems to do what you want. It involves some dirty copying, but it works. You can find the working code here
Here is the main HTML file:
<!DOCTYPE html>
<html>
<head>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script type="text/javascript">
$(document).ready(function(){
// Declare everything with var so no implicit globals are created
var Iframe = $('#frame1');
Iframe.on('load', function(){
var IframeInner = Iframe.contents().find('iframe');
var IframeInnerClone = IframeInner.clone();
IframeInnerClone.insertAfter($('#insertIframeAfter')).css({display:'none'});
IframeInnerClone.on('load', function(){
var IframeContents = IframeInner.contents();
var YourNestedEl = IframeContents.find('div');
$('<div>Yeepi! I can even insert stuff!</div>').insertAfter(YourNestedEl);
});
});
});
</script>
</head>
<body>
<div id="insertIframeAfter">Hello!!!!</div>
<iframe id="frame1" src="Test_Iframe.html">
</iframe>
</body>
</html>
As you can see, once the first iframe is loaded, I get the second one and clone it. I then reinsert the clone into the DOM so I can get access to its onload event. Once the clone has loaded, I retrieve the content from the non-cloned one (which must have loaded as well, since they use the same src). You can then do whatever you want with the content.
Here is the Test_Iframe.html file:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<div>Test_Iframe</div>
<iframe src="Test_Iframe2.html">
</iframe>
</body>
</html>
and the Test_Iframe2.html file:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<div>I am the second nested iframe</div>
</body>
</html>
You probably have a timing issue. Your document.ready handler is probably firing before the second iframe is loaded. There isn't enough info to help much further, but let us know if that seems like a possible cause.
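You should use delegated event handling for elements which are rendered later, like colorbox, hidden fields or an iframe. In jQuery before 1.9 this was the live method; since its removal, use .on() with a delegated selector:
$(document).on("change", ".inverter-value", function() {
var elem = this;
$.ajax({
url: '/main/invertor_attribute/',
type: 'POST',
async: false,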
data: {id: $(this).val() },
success: function(data){
// code
},
dataType: 'html'
});
});
I think the best way to reach your div:
var your_element=$('iframe#uploads').children('iframe').children('div#element');
It should work well.
If the browser supports iframes, the DOM inside an iframe comes from the src attribute of the respective tag. Content placed inside the iframe tag itself is only used as a fallback mechanism where the browser does not support the iframe tag.
Ref: http://www.w3schools.com/tags/tag_iframe.asp
I guess your problem is that jQuery is not loaded in your iframes.
The safest approach is to rely on pure DOM-based methods to parse your content.
Or else, start with jQuery, and once inside your iframes, test once whether typeof window.jQuery == 'undefined'. If that is true, jQuery is not available inside the iframe, and you should fall back on DOM-based methods.
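As a rough illustration of the pure DOM-based route, here is a sketch that walks down through same-origin frames using querySelector and contentDocument. The helper name findInNestedFrames is invented for this example, and it assumes every frame in the chain is same-origin and has already finished loading; otherwise it just returns null.

```javascript
// Hypothetical helper: follow a chain of frame selectors, then look up the
// target element in the innermost frame's document.
function findInNestedFrames(rootDocument, frameSelectors, targetSelector) {
  let doc = rootDocument;
  for (const selector of frameSelectors) {
    const frame = doc.querySelector(selector);
    // contentDocument is null for cross-origin frames
    if (!frame || !frame.contentDocument) {
      return null;
    }
    doc = frame.contentDocument;
  }
  return doc.querySelector(targetSelector);
}

// In the page from the question this might be called as:
//   var el = findInNestedFrames(document, ['#uploads', 'iframe'], '#element');
```

Because it returns null while the frames are still loading, it would normally be called from a load handler on the outer iframe or retried after a short delay.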

How to create a HTML document DOM object from code?

Currently I'm doing this:
var newdoc = document.implementation.createHTMLDocument("Wrong title");
newdoc.open();
newdoc.write('<!doctype html><html><head><title>Right title</title></head><body><div id="a_div">Right content</div></body></html>');
newdoc.close();
And then I try to get some info about the document loaded, for example:
> newdoc.title
Right title
> newdoc.getElementById("a_div").innerHTML
Right content
The issue is that it only works in Chrome. On Firefox and Opera the DOM does not seem to be loaded after document close. What am I doing wrong?
I wrote this little fiddle to demonstrate the problem: http://jsfiddle.net/uHz2m/
Okay, after reading the docs I noticed createHTMLDocument() does not create a zero byte-length document object but a basic HTML scaffolding like this:
<!DOCTYPE html>
<html>
<head>
<title>Wrong title</title>
</head>
<body></body>
</html>
That's why newdoc.write() does not work as expected.
Instead, I can just take the html element and change its HTML code (corrected fiddle).
var newdoc = document.implementation.createHTMLDocument("Wrong title");
newdoc.documentElement.innerHTML = '\
<!doctype html>\
<html>\
<head>\
<title>Right title</title>\
</head>\
<body>\
<div id="a_div">Right content</div>\
</body>\
</html>';
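Another option, which swaps in a different technique from the one above, is DOMParser: its parseFromString method with the "text/html" type returns a complete Document parsed from a string. A minimal sketch follows; parseHtmlDocument is a made-up wrapper name, and the parser is passed in as an argument so the function does not depend on a browser global.

```javascript
// Hypothetical wrapper: parse an HTML string into a Document using any
// object that implements DOMParser's parseFromString(string, type).
function parseHtmlDocument(html, parser) {
  // "text/html" selects the HTML parser, so the result is a full document
  return parser.parseFromString(html, "text/html");
}

// In a browser this would be used as:
//   var newdoc = parseHtmlDocument(
//     '<!doctype html><html><head><title>Right title</title></head>' +
//     '<body><div id="a_div">Right content</div></body></html>',
//     new DOMParser()
//   );
//   newdoc.title; // expected to be the title from the markup
```

Unlike assigning to documentElement.innerHTML, this route parses the string as a whole document, so the doctype and <html> tags in the input are handled by the parser rather than dropped.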
