Parsing HTML with Firefox - JavaScript

uri = 'http://www.nytimes.com/';
searchuri = 'http://www.google.com/search?';
searchuri += 'q='+ encodeURIComponent(uri) +'&btnG=Search+Directory&hl=en&cat=gwd%2FTop';
req = new XMLHttpRequest();
req.open('GET', searchuri, true);
req.onreadystatechange = function (aEvt) {
  if (req.readyState == 4) {
    if (req.status == 200) {
      searchcontents = req.responseText;
      myHTML = searchcontents;
      var tempDiv = document.createElement('div');
      tempDiv.innerHTML = myHTML.replace(/<script(.|\s)*?\/script>/g, '');
      parsedHTML = tempDiv;
      sitefound = sc_sitefound(uri, parsedHTML);
    }
  }
};
req.send(null);

function sc_sitefound(uri, parsedHTML) {
  alert(parsedHTML);
  gclasses = parsedHTML.getElementsByClassName('g');
  for (var gclass in gclasses) {
    atags = gclass.getElementsByTagName('a');
    alert(atags);
    tag1 = atags[0];
    htmlattribute1 = tag1.getAttribute('html');
    if (htmlattribute1 == uri) {
      sitefound = htmlattribute1;
      return sitefound;
    }
  }
  return null;
}
parsedHTML is a XULElement and gclasses is an HTMLCollection.
If there are many divs of class 'g' in the Google Directory search results, why is gclasses empty?

var tempDiv = document.createElement('div');
If you're in a XUL environment, that's not creating an HTML element node: it'll be a XUL element. Since the innerHTML property is exclusive to HTMLElement and not other XML Elements, setting innerHTML on tempDiv will do nothing (other than adding a custom property containing the HTML string). Consequently there are no elements with class 'g' inside tempDiv... there are no elements at all inside it.
If you have a plain HTML document loaded in the browser, you could try using content.document.createElement to get an HTML wrapper element on which innerHTML will be available. This still isn't a brilliant way to parse a whole page of HTML because the document in question might have <head> content you can't put in a div, and HTTP headers that you'll be throwing away. Probably better to load the target file into an HTMLDocument object of its own. A good way to do that would be using an iframe. See this page for examples of both these approaches.
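For illustration, a minimal sketch of the content.document route, assuming an ordinary HTML page is loaded in the browser window (the variable names reuse the ones from the question):

var tempDiv = content.document.createElement('div'); // a real HTMLDivElement, so innerHTML is available
tempDiv.innerHTML = searchcontents;
var parsedHTML = tempDiv;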
tempDiv.innerHTML = myHTML.replace(/<script(.|\s)*?\/script>/g, '');
It's seven shades of not-a-good-idea to process HTML with regex; this could go wrong in many ways when Google slightly change their page markup. Let the browser do the job of parsing instead. Setting innerHTML does not cause script elements to be executed straight away (further DOM manipulations can, though); you can pick out the unwanted script elements later, if you need to. With the XUL iframe approach you can simply disable JavaScript on the iframe.
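For example, a hedged snippet that removes the script elements after parsing instead of with a regex:

var scripts = tempDiv.getElementsByTagName('script');
while (scripts.length) {
  scripts[0].parentNode.removeChild(scripts[0]); // the live collection shrinks as nodes are removed
}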
for (var gclass in gclasses) {
The for...in loop is for use against Objects used as mappings. It should not be used for iterating a sequence (such as Array, NodeList or in this case HTMLCollection) as it doesn't do what you might expect. For iterating sequences, stick to the standard C-style for (var i = 0; i < sequence.length; i++) loop.
You could also do with adding var declarations for all your other local variables.
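Putting those points together, a sketch of how the function might look (assuming the attribute you want to compare is href rather than html, which is a guess on my part):

function sc_sitefound(uri, parsedHTML) {
  var gclasses = parsedHTML.getElementsByClassName('g');
  for (var i = 0; i < gclasses.length; i++) {
    var atags = gclasses[i].getElementsByTagName('a');
    if (atags.length && atags[0].getAttribute('href') == uri) {
      return atags[0].getAttribute('href');
    }
  }
  return null;
}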


importNode and Microsoft Edge

I have a dynamic page where, with a bar button, I can change the main div content.
Most of the pages are static except one, which contains JavaScript (RGraph charts).
That's why, in order to make it work, I use the following code:
var data = new FormData();
data.append('action', 'charts');

// clean the content
var myNode = document.getElementById("contentView");
while (myNode.firstChild) {
  myNode.removeChild(myNode.firstChild);
}

// set the new content
var div = document.createElement("div");
var t = document.createElement('template');
t.innerHTML = _connectToServer(data);
for (var i = 0; i < t.content.childNodes.length; i++) {
  var node = document.importNode(t.content.childNodes[i], true);
  div.appendChild(node);
}
document.getElementById("contentView").appendChild(div);
The problem is that, as far as I can see (and have read), such code is not compatible with Microsoft Edge, and I would like to make it work in Edge as well.
What's the best way to do that?
Ok, so I took this over to our DOM team to better understand this interop issue.
This is actually a Chrome bug per the spec. What happens is that when you create a template element and set its innerHTML, the content is treated as a DocumentFragment, and a template element's contents can't contain executable script.
Here are the relevant spec links that cover this:
InnerHTML
Template Element
To work around this, as I pointed out earlier, you'll need to create a <script> tag outside of a <template> element and utilize textContent. Here's an example of this: http://jsbin.com/gizayuyape/edit?html,js,output
Or here is the code:
// Using textContent
var body = document.getElementsByTagName('body')[0];
var s = document.createElement('script');
s.textContent = 'document.write("Script textContent");';
body.appendChild(s);
The good news, at least on the interop front, is that Chrome has fixed this issue starting in version 67.
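Applied to the loop from the question, a hedged sketch of that workaround (rebuilding the script nodes is my assumption about how you would wire it in; everything else reuses the question's names):

for (var i = 0; i < t.content.childNodes.length; i++) {
  var node = document.importNode(t.content.childNodes[i], true);
  // template content is inert, so rebuild any <script> via the API so it executes
  if (node.nodeName === 'SCRIPT') {
    var s = document.createElement('script');
    s.textContent = node.textContent;   // copy the inline code
    if (node.src) { s.src = node.src; } // copy src for external scripts
    node = s;
  }
  div.appendChild(node);
}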

Set script tag from API call in header of index.html

I'm trying to implement Dynatrace in my React app, and this works fine if I just add the required script tag manually in my index.html header.
However, I want to make use of Dynatrace's API, which returns the whole script tag element (so I can use it for different environments).
How can I add the script tag to my index.html after calling the API? Creating a script element from code won't work because the response of the API call is a script tag itself (which is returned as a string).
I tried creating a div element, setting the script as its innerHTML, then appending it to the document. But scripts added via innerHTML don't get executed.
const wrapperDiv = document.createElement("div");
wrapperDiv.innerHTML = "<script>alert('simple test')</script>";
document.head.appendChild(wrapperDiv.firstElementChild);
Can this be done?
I found a roundabout way of doing this:
const wrapperDiv = document.createElement("div");
const scriptElement = document.createElement("script");
wrapperDiv.innerHTML = "<script src=... type=...></script>";
for (let i = 0; i < wrapperDiv.firstElementChild.attributes.length; i++) {
  const attr = wrapperDiv.firstElementChild.attributes[i];
  scriptElement.setAttribute(attr.name, attr.value);
}
document.head.appendChild(scriptElement);
In this example I'm only using a src, but this can be done with the inline contents as well. If there is any better way of doing this, please let me know.
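For the inline-code case mentioned above, a hedged one-liner that copies the script body as well as its attributes:

scriptElement.textContent = wrapperDiv.firstElementChild.textContent;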
This can be achieved without the use of eval():
const source = "alert('simple test')";
const wrapperScript = document.createElement("script");
wrapperScript.src = URL.createObjectURL(new Blob([source], { type: 'text/javascript' }));
document.head.appendChild(wrapperScript);
In the code above you basically create a Blob containing your script in order to create an object URL (a representation of a File or Blob object in browser memory).
This solution is based on the idea that a dynamically added <script> is evaluated by the browser when it has a src property.
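As an optional follow-up (my addition, not part of the original answer), you can release the Blob once the script has loaded:

wrapperScript.onload = function () {
  URL.revokeObjectURL(wrapperScript.src); // free the in-memory object URL after execution
};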
Update:
Since the endpoint returns a <script> tag with some useful attributes, the best solution would be to clone the attributes (including src); your current approach is pretty good.

How can I copy text from a website and use it in my own HTML file

I am making a website to look for prices of flights. Every time that I load my HTML file I have to copy the prices from another website that is not mine and insert them in my HTML file.
The source code of the other website indicates that the tag that I am looking for is a span tag, like <span class="amount price-amount">250</span>
So the question is: How can I copy or extract that info and use it or insert it in my HTML file?
I would like to solve it using HTML, CSS, JavaScript and/or Bootstrap.
Client-Side Webscraping
You do this using page stripping. At least that's what I call it. A basic example is:
var xhr = new XMLHttpRequest();
xhr.onreadystatechange = function () {
  if (xhr.readyState === 4) {
    var doc = document.createElement('div');
    doc.innerHTML = xhr.responseText;
    var elems = doc.getElementsByTagName('*'),
        prices = [];
    for (var i = 0; i < elems.length; i += 1) {
      if ((elems[i].getAttribute('class') || '').indexOf('price-amount') > -1 && (elems[i].getAttribute('class') || '').indexOf('amount') > -1) {
        prices.push(elems[i].innerHTML);
      }
    }
  }
};
xhr.open('GET', 'airlinesite.com/path/to/page', true);
xhr.send();
This will get the HTML from airlinesite.com/path/to/page. Then it gets all the elements and loops through them. If an element has both the amount and price-amount classes, its value is pushed into an array. The values end up stored in prices.
For this to work, the target domain must allow cross-origin requests (CORS); otherwise the browser will block the request.
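A tidier way to express the same class check, assuming the same response markup, is to let the selector engine do the matching (this drops into the same readyState === 4 branch as above):

var doc = document.createElement('div');
doc.innerHTML = xhr.responseText;
var prices = [];
var matches = doc.querySelectorAll('.amount.price-amount'); // elements carrying both classes
for (var i = 0; i < matches.length; i++) {
  prices.push(matches[i].textContent);
}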
Use a web scraper; I recommend request and cheerio. This assumes you have Node.js and know how to install packages.
Here's some simple sample code:
var request = require('request');
var cheerio = require('cheerio');

request(this.url, function (error, response, body) {
  if (!error && response.statusCode == 200) {
    // body is the scraped html
    var $ = cheerio.load(body); // the jQuery-like selector
    var price = $('span.price-amount').text(); // the price you want. Use the selector accordingly.
  }
});
Use inspect element: right-click on the page and choose Inspect Element. In the top-left corner of the panel there is a button with an arrow pointing into a box; that is the element picker, so click it. Then select the part of the page you want and the panel will show its markup, so you can just copy and paste it.

DOMParser appending <script> tags to <head>/<body> but not executing

I'm attempting to parse a string into a full HTML document via DOMParser and then overwrite the current page with the processed nodes. The string contains complete markup, including the <!doctype>, <html>, <head> and <body> nodes.
// parse the string into a DOMDocument element:
var parser = new DOMParser();
var doc = parser.parseFromString(data, 'text/html');
// set the parsed head/body innerHTML contents into the current page's innerHTML
document.getElementsByTagName('head')[0].innerHTML = doc.getElementsByTagName('head')[0].innerHTML;
document.getElementsByTagName('body')[0].innerHTML = doc.getElementsByTagName('body')[0].innerHTML;
This works in that it successfully takes the HTML nodes that were parsed and renders them on the page; however, any <script> tags that exist in either the <head> or <body> nodes inside the parsed string fail to execute =[. Testing directly with the html tag (as opposed to head/body) yields the same result.
I've also tried using .appendChild() instead of .innerHTML, but no change:
var elementHtml = document.getElementsByTagName('html')[0];
// remove the existing head/body nodes from the page
while (elementHtml.firstChild) elementHtml.removeChild(elementHtml.firstChild);
// append the parsed head/body tags to the existing html tag
elementHtml.appendChild(doc.getElementsByTagName('head')[0]);
elementHtml.appendChild(doc.getElementsByTagName('body')[0]);
Does anyone know of a way to convert a string to a full HTML page and have the javascript contained within it execute?
If there is an alternative to DOMParser that gives the same results (e.g. overwriting the full document), please feel free to recommend it/them =]
Note:
The reason I'm using this opposed to the much more simple alternative of document.write(data) is because I need to use this in a postMessage() callback in IE under SSL; document.write() is blocked in callback events such as post messages when accessing an SSL page in IE =[
You should use:
const sHtml = '<script>window.alert("Hello!")</script>';
const frag = document.createRange().createContextualFragment(sHtml)
document.body.appendChild( frag );
Using DOMParser() as described in the question will correctly set the <head> and <body> contents of the page, but more work is necessary to get any existing <script> tags to execute.
The basic approach here is to pull a list of all <script> tags in the page after the contents have been set, iterate over that list and dynamically create a new <script> tag with the contents of the existing one and then add the new one to the page.
Example:
// create a DOMParser to parse the HTML content
var parser = new DOMParser();
var parsedDocument = parser.parseFromString(data, 'text/html');

// set the current page's <html> contents to the newly parsed <html> content
document.getElementsByTagName('html')[0].innerHTML = parsedDocument.getElementsByTagName('html')[0].innerHTML;

// get a list of all <script> tags in the new page
var tmpScripts = document.getElementsByTagName('script');
if (tmpScripts.length > 0) {
  // push all of the document's script tags into an array
  // (to prevent dom manipulation while iterating over dom nodes)
  var scripts = [];
  for (var i = 0; i < tmpScripts.length; i++) {
    scripts.push(tmpScripts[i]);
  }
  // iterate over all script tags and create a duplicate tag for each
  for (var i = 0; i < scripts.length; i++) {
    var s = document.createElement('script');
    s.innerHTML = scripts[i].innerHTML;
    // add the new node to the page
    scripts[i].parentNode.appendChild(s);
    // remove the original (non-executing) node from the page
    scripts[i].parentNode.removeChild(scripts[i]);
  }
}
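One caveat with the duplication loop above: it copies only the inline contents, so a script loaded via src would lose its attributes. A hedged addition inside that same loop, before the new node is appended, if the page relies on external scripts:

for (var j = 0; j < scripts[i].attributes.length; j++) {
  var attr = scripts[i].attributes[j];
  s.setAttribute(attr.name, attr.value); // carry over src, type, etc.
}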
Here is a working demo for jQuery 1.8.3 (link to jsFiddle):
var html = "<html><head><script>alert(42);</" + "script></head><body><h1>Hello World</h1></body></html>";
$(function () {
html = $($.parseXML(html));
$("head").append(html.find("script"));
$("body").append(html.find("h1"));
});
Here I used the function $.parseXML(), which you can obviously only use if your HTML is also valid XML. Unfortunately the same code does not work with jQuery 1.9.1 (the <script> tag is not found anymore): http://jsfiddle.net/6cECR/8/ Maybe it's a bug (or a security feature...).

Sanitizing html string with javascript using browser to interpret html

I want to use a white list of tags, attributes and values to sanitize an HTML string before I place it in the DOM. Can I safely construct a DOM element and traverse over it to implement the white-list filter, assuming that no malicious JavaScript can execute until I append the DOM element to the document? Are there pitfalls to this approach?
It doesn't appear that anything will execute until you insert into the document, as per #rvighne's answer, but there are at least these (unusual) exceptions (tested in FF 27.0):
var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var el = document.createElement('div');
el.innerHTML = userInput;
el.addEventListener("click", function (e) {
  if (e.target.nodeName.toLowerCase() === 'a') {
    alert("I will also cause side effects; I shouldn't run on the wrong link!");
  }
});
el.getElementsByTagName('a')[0].click(); // Alerts "boo!" and "I will also cause side effects; I shouldn't run on the wrong link!"
...or...
var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var el = document.createElement('div');
el.innerHTML = userInput;
el.addEventListener("cat", function(e) { this.getElementsByTagName('a')[0].click(); });
var event = new CustomEvent("cat", {"detail":{}});
el.dispatchEvent(event); // Alerts "boo!"
...or... (though setUserData is deprecated, it is still working):
var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var span = document.createElement('span');
span.innerHTML = userInput;
span.setUserData('key', 10, { handle: function (n1, n2, n3, src) {
  src.getElementsByTagName('a')[0].click();
}});
var div = document.createElement('div');
div.appendChild(span);
span.cloneNode(); // Alerts "Boo!"
var imprt = document.importNode(span, true); // Alerts "Boo!"
var adopt = document.adoptNode(span, true); // Alerts "Boo!"
...or during iteration...
var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var span = document.createElement('span');
span.innerHTML = userInput;
var treeWalker = document.createTreeWalker(
  span,
  NodeFilter.SHOW_ELEMENT,
  { acceptNode: function (node) { node.click(); } },
  false
);
var nodeList = [];
while(treeWalker.nextNode()) nodeList.push(treeWalker.currentNode); // Alerts 'Boo!'
But without these kinds of (unusual) event interactions, the act of building into the DOM alone would not, as far as I have been able to detect, cause any side effects (and of course the examples above are contrived and one wouldn't expect to encounter them very often, if at all!).
No script embedded in the HTML can execute until it is put in the document. Try running this code on any page:
var html = "<script>document.body.innerHTML = '';</script>";
var div = document.createElement('div');
div.innerHTML = html;
You will notice that nothing changes. If the "malicious" script in the HTML had been run, the document would have vanished. So, you can use the DOM to sanitize HTML without worrying about bad JS being in the HTML. As long as you snip out the scripts in your sanitizer, of course.
By the way, your approach is pretty safe and smarter than what most people try (parsing it with regex, the poor fools). However, it's best to rely on good, trusted HTML sanitizing libraries for this, like HTML Purifier. Or, if you want to do it client-side, you can use ESAPI-JS (recommended by #Brett Zamir).
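For reference, a minimal sketch of the whitelist walk the question describes (the tag and attribute lists are illustrative assumptions, not a vetted policy, and disallowed elements are dropped outright rather than unwrapped):

var ALLOWED = { A: ['href'], B: [], I: [], P: [], SPAN: [] }; // illustrative whitelist only

function sanitize(html) {
  var root = document.createElement('div');
  root.innerHTML = html; // parsed, but not yet in the document
  var nodes = root.getElementsByTagName('*');
  for (var i = nodes.length - 1; i >= 0; i--) { // walk backwards so removals don't disturb earlier indices
    var el = nodes[i];
    var allowedAttrs = ALLOWED[el.nodeName];
    if (!allowedAttrs) {
      el.parentNode.removeChild(el); // tag not in the whitelist
      continue;
    }
    for (var j = el.attributes.length - 1; j >= 0; j--) {
      var attr = el.attributes[j];
      if (allowedAttrs.indexOf(attr.name.toLowerCase()) === -1) {
        el.removeAttribute(attr.name); // attribute not in the whitelist
      }
    }
  }
  return root.innerHTML;
}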
You can use a "sandboxed" iframe that won't execute anything.
var iframe = document.createElement('iframe');
iframe['sandbox'] = 'allow-same-origin';
From w3schools:
The sandbox attribute enables an extra set of restrictions for the
content in the iframe. When the sandbox attribute is present, it will:
block form submission
block script execution
disable APIs
...
P.S. That's, by the way, exactly how we do it in our Html Sanitizer https://github.com/jitbit/HtmlSanitizer - we use the browser to interpret HTML and convert it to DOM. Feel free to check the code (or actually use the component)
(disclaimer: I'm a contributor to that OSS project)
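For what it's worth, a hedged sketch of how such a sandboxed iframe could be used (the variable untrustedHtml and the overall flow are illustrative assumptions; only the sandbox attribute itself comes from the answer):

var iframe = document.createElement('iframe');
iframe.setAttribute('sandbox', 'allow-same-origin'); // no allow-scripts, so nothing executes inside
iframe.style.display = 'none';
document.body.appendChild(iframe);

iframe.contentDocument.body.innerHTML = untrustedHtml; // parsed with scripting disabled
// ...traverse iframe.contentDocument.body with your whitelist filter here...
var cleaned = iframe.contentDocument.body.innerHTML;

document.body.removeChild(iframe);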
