Javascript: match using dom loaded page - javascript

I am trying to grab all links from a Google search results page using the Chrome console.
First I wanted to get the DOM-loaded source, so I tried the code below.
var source = document.documentElement.innerHTML;
Now when I type source in the console, it shows the correct DOM-loaded source. But if I run alert(source); it shows the default HTML source of the page.
So the problem is that when I run the code below
source.match(/class="r"><a href="(.*?)"/);
it returns null, because the variable source holds the source code from before the DOM was loaded.

You can use the DOM API (e.g. getElementsByTagName) to find all a tags in a page. Take a look:
var anchors = document.getElementsByTagName('A');
var matchingHrefs = Array.prototype.slice.call(anchors).filter(function(a) {
    return a.className == 'r';
}).map(function(a) {
    return a.href;
});
The Array.prototype.slice.call call turns the node list into a regular array.
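To see why the conversion matters, here is a minimal sketch that runs outside a browser. The object fakeNodeList is a hypothetical stand-in for the array-like collection that getElementsByTagName returns; its entries and URLs are invented for illustration.

```javascript
// Hypothetical stand-in for the collection returned by getElementsByTagName:
// indexed entries plus a length property, but no array methods.
const fakeNodeList = {
  0: { className: "r", href: "https://example.com/a" },
  1: { className: "", href: "https://example.com/b" },
  length: 2,
};

// Array-likes have no filter/map of their own.
const hasFilter = typeof fakeNodeList.filter === "function";

// slice.call copies the indexed entries into a real array.
const asArray = Array.prototype.slice.call(fakeNodeList);

// Now the usual array methods are available.
const hrefs = asArray
  .filter(function (a) { return a.className === "r"; })
  .map(function (a) { return a.href; });

console.log(hasFilter, Array.isArray(asArray), hrefs);
```

In current browsers Array.from(anchors) does the same job as the slice.call idiom.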

You probably need to add the /g flag to your regex to match globally.
Like this:
yourHtml.match(/href="([^"]*)"/g)
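One thing to keep in mind: with the /g flag, match returns the whole matched substrings and drops the capture groups. If only the URL inside the quotes is wanted, matchAll may be a better fit. A minimal sketch, where sampleHtml is made-up markup standing in for the page source:

```javascript
// Hypothetical sample markup standing in for the page source.
const sampleHtml =
  '<h3 class="r"><a href="https://example.com/a">A</a></h3>' +
  '<h3 class="r"><a href="https://example.com/b">B</a></h3>';

// Return the first capture group (the URL) of every href="..." match.
function extractHrefs(html) {
  const pattern = /href="([^"]*)"/g;
  return Array.from(html.matchAll(pattern), function (m) {
    return m[1];
  });
}

const wholeMatches = sampleHtml.match(/href="([^"]*)"/g); // full matches, no groups
const urlsOnly = extractHrefs(sampleHtml);                // capture groups only

console.log(wholeMatches);
console.log(urlsOnly);
```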

Related

what's the difference between appending a child element and setting innerHTML

In the first example the script is executed, but in the second it is not, even though the resulting DOM is the same.
// executable
var c = 'alert("append a div in which there is a script element")';
var div = document.createElement('div');
var script_2 = document.createElement('script');
script_2.textContent = c;
div.appendChild(script_2);
document.body.appendChild(div);
// unexecutable although the dom result is same as above case
var d = '<script>alert("append a div that has script tag as innerHTML")';
var div_d = document.createElement('div');
div_d.innerHTML = d;
document.body.appendChild(div_d);
.innerHTML allows you to add as much HTML as you want in one easy call.
.appendChild allows you to add a single element (Or multiple elements if you append a DocumentFragment).
If you use .innerHTML then you need to include the opening and closing tags correctly. Your HTML must be proper.
Elements created with document.createElement automatically get the appropriate opening and closing tags.
Your example for .innerHTML is not properly formed. Instead of:
var d = '<script>alert("append a div that has script tag as innerHTML")';
it should be:
var d = '<script>alert("append a div that has script tag as innerHTML")</script>';
UPDATE:
Interesting!!
I know that, in the past, your second example would have worked. But it seems that, probably for security reasons, the browser no longer allows you to insert <script> through .innerHTML.
I tried on Chrome 62 and it fails. Firefox 57 fails and Safari 11.0.2 fails.
My best guess is that this is a security update.
Look here:
https://developer.mozilla.org/en-US/docs/Web/API/Element/innerHTML
And go down to the Security considerations section.
It reads:
It is not uncommon to see innerHTML used to insert text in a web page. This comes with a security risk.
let name = "John";
// assuming 'el' is an HTML DOM element
el.innerHTML = name; // harmless in this case
// ...
name = "<script>alert('I am John in an annoying alert!')</script>";
el.innerHTML = name; // harmless in this case
Although this may look like a cross-site scripting attack, the result is harmless. HTML5 specifies that a <script> tag inserted via innerHTML should not execute.

JQuery search dom elements just after rendering and replace keys by its corresponding values

I am trying to apply my own localization method to a web application that is under development.
I am using jQuery 2.2.0, without any other framework or third-party library. I need to write expressions in my plain HTML such as:
ex-1:
<span>#{{lang.details}}</span>
ex-2:
<button value='#{{lang.save}}'>
Then, once the DOM is completely rendered, I want to walk through the DOM elements and find all expressions that start with "#{{" and end with "}}", replacing them with the corresponding value from my JSON bundle.
My JSON bundle looks like:
var en = {
    details: "details",
    save: "save"
}
My app.js code:
var lang = en;
// jquery load en json file
// search the dom elements and replace the key surrounded
// by '#{{ }}' with its corresponding value from the bundle
What is the optimal way to find these elements, in either JavaScript or jQuery? Is there any other solution through a third party?
Replacing HTML with JavaScript Variables after Page Load
var doneReplacing = false;
$(function() {
    if (doneReplacing === false) {
        $(document.documentElement).html(function(index, content) {
            return content.replace(/#{{([^}]*)}}/g, function(match, $1) {
                return eval($1);
            });
        });
        doneReplacing = true;
    }
});
Basically what it does is, after the page loads, it searches for the pattern #{{ variable }} and then replaces the match with the value of the variable.
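Since eval runs whatever text sits between the braces, a plain property lookup may be a safer variant of the same idea. The sketch below works on a string only (no DOM); resolvePath and localize are hypothetical helper names, and the bundle mirrors the one in the question.

```javascript
// Bundle with the same shape as in the question.
const en = { details: "details", save: "save" };

// Resolve a dotted path such as "lang.details" against a scope object,
// following own properties only. Returns undefined if any step is missing.
function resolvePath(scope, path) {
  return path.split(".").reduce(function (value, key) {
    if (value != null && Object.prototype.hasOwnProperty.call(value, key)) {
      return value[key];
    }
    return undefined;
  }, scope);
}

// Replace every #{{ path }} placeholder; unknown keys are left untouched.
function localize(text, scope) {
  return text.replace(/#\{\{\s*([^}]*?)\s*\}\}/g, function (match, path) {
    const value = resolvePath(scope, path);
    return value === undefined ? match : String(value);
  });
}

const template = "<span>#{{lang.details}}</span><button value='#{{lang.save}}'>";
const localized = localize(template, { lang: en });
console.log(localized);
```

In the page itself, the same localize function could be passed to the .html() callback in place of the eval-based replacer.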

Using document.implementation.createDocument to create a new HTML document

I am being passed html as a string. My goal is to create a new document from the html that has all the appropriate nodes so that I can do things like call doc.getElementsByTagName on the doc I create and have it work as expected. An example of my code is here.
var doc = window.document.implementation.createDocument
('http://www.w3.org/1999/xhtml', 'html', null);
doc.getElementsByTagName('html')[0].innerHTML =
'<head><script>somejs</script>' +
'<script>var x = 5; var y = 2; var foo = x + y;</script>' +
'</head><body></body>';
var scripts = doc.getElementsByTagName('script');
console.log(scripts[0] + " code = " + scripts[0].innerHTML);
I am having the following issues:
If something inside a script tag contains a character like < (eg in the example above in the "var foo = x + y;" statement change the + to a < symbol), I get an INVALID_STATE_ERR: DOM Exception 11.
Even if nothing inside the script tag uses such characters, when I run the above I get the output "[object Element] code =undefined"
So my questions are:
A. How do I handle characters such as < that give DOM Exception 11 when I try to use them in whatever I am setting the innerHTML to
B. How do I make the document properly parse the script tags and put their code into their innerHTML attribute so that I can later read it.
EDIT: As Ryan P pointed out this code actually works in FF. So if anyone could help me get it working in chrome that would be much appreciated!
Taken from https://github.com/rails/turbolinks:
why don't you try creating the document this way?
doc = document.implementation.createHTMLDocument("");
doc.open("replace");
doc.write(html);
doc.close();
where html is your HTML content.
I haven't tested it and don't know whether you need to escape characters first.
A. You need to convert any < to an HTML entity (&lt;). The rules don't cease to apply just because you're in a script tag.
B. You call your variable 'doc' but try to get the script tags from an undefined variable 'tempDoc'. When I run your code in my browser after changing that variable, it all seems to work fine.
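For point A, a minimal sketch of the escaping step, assuming the markup is going to be parsed as XML (as it is for a document created with the XHTML namespace). escapeForXml is a hypothetical helper name.

```javascript
// Escape the characters that XML parsing treats as markup.
// "&" must be replaced first so the entities added afterwards are not re-escaped.
function escapeForXml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

const scriptSource = "var x = 5; var y = 2; var foo = x < y && true;";
const escapedSource = escapeForXml(scriptSource);
console.log("<script>" + escapedSource + "</script>");
```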

Given a string of html code, how can I go through every tag and remove ones that is not in my whitelist (in JQuery)?

var whitelist = ['a','div','img', 'span'];
Given a block of HTML code, I want to go through every single tag using JQuery
Then, if that tag is NOT in my whitelist, remove it and all its children.
The final string should now be sanitized.
How do I do that?
By the way, this is my current code to remove specific tags (but I decided I want to do whitelist instead)
var canvas = '<div>'+canvas_html+'</div>';
var blacklist = ['script','object','param','embed','applet','app','iframe',
    'form','input', 'link','meta','title','input','button','textarea',
    'head','body','kbd'];
blacklist.forEach(function(r){
    $(canvas).find(r).remove();
});
canvas_html = $(canvas).get('div').html();
Try this:
var whitelist = ['a','div','img', 'span'];
var output = $('<div>'+canvas_html+'</div>').find('*').each(function() {
    if ($.inArray(this.nodeName.toLowerCase(), whitelist) == -1) {
        $(this).remove();
    }
}).end().html(); // .end() steps back to the wrapper <div> so .html() covers the whole cleaned tree
// output contains the HTML with everything except those in the whitelist stripped off
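try:
$(canvas).find(':not(' + whitelist.join(', ') + ')').remove().end().html();
The idea is to turn the whitelist array into the "el1, el2, el3" format, use the :not selector to select the elements that are not in the whitelist, and then delete them. The .end() call steps back to the wrapper so that .html() returns the cleaned markup rather than the markup of a removed element.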
This obviously could be expensive depending on the size of your html and whitelist.
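The selector-building step on its own looks like this. buildNotSelector is a hypothetical helper name; the sketch only builds the string and does not touch jQuery.

```javascript
// Turn a whitelist of tag names into a selector matching everything else.
function buildNotSelector(allowedTags) {
  return ":not(" + allowedTags.join(", ") + ")";
}

const whitelistTags = ["a", "div", "img", "span"];
const notSelector = buildNotSelector(whitelistTags);
console.log(notSelector);
```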
Unfortunately, using jQuery to sanitize HTML in order to prevent XSS is not safe, as jQuery is not just parsing the HTML, but actually creating elements out of it. Even though it doesn't insert these into the DOM, in some cases embedded Javascript will be executed. So, for example, the snippet:
$('<img src="http://i.imgur.com/cncfg.gif" onload="alert(\'gotcha\');"/>')
will trigger an alert.

construct a DOM tree from a string without loading resources (specifically images)

So I am grabbing RSS feeds via AJAX. After processing them, I have a html string that I want to manipulate using various jQuery functionality. In order to do this, I need a tree of DOM nodes.
I can parse a HTML string into the jQuery() function.
I can add it as innerHTML to some hidden node and use that.
I have even tried using mozilla's nonstandard range.createContextualFragment().
The problem with all of these solutions is that when my HTML snippet has an <img> tag, firefox dutifully fetches whatever image is referenced. Since this processing is background stuff that isn't being displayed to the user, I'd like to just get a DOM tree without the browser loading all the images contained in it.
Is this possible with javascript? I don't mind if it's mozilla-only, as I'm already using javascript 1.7 features (which seem to be mozilla-only for now)
The answer is this:
var parser = new DOMParser();
var htmlDoc = parser.parseFromString(htmlString, "text/html");
var jdoc = $(htmlDoc);
console.log(jdoc.find('img'));
If you pay attention to your web requests you'll notice that none are made even though the html string is parsed and wrapped by jquery.
The obvious answer is to parse the string and remove the src attributes from img tags (and similar for other external resources you don't want to load). But you'll have already thought of that and I'm sure you're looking for something less troublesome. I'm also assuming you've already tried removing the src attribute after having jquery parse the string but before appending it to the document, and found that the images are still being requested.
I'm not coming up with anything else, but you may not need to do full parsing; this replacement should do it in Firefox with some caveats:
thestring = thestring.replace(/<img /g, "<img src='' ");
The caveats:
This appears to work in the current Firefox. That doesn't mean that subsequent versions won't choose to handle duplicated src attributes differently.
This assumes the literal string "<img " only ever starts an image tag. As a general-purpose assumption that is shaky: the string could appear in an attribute value on a sufficiently...interesting...page, especially in an inline onclick handler like this: <a href='#' onclick='$("frog").html("<img src=\"spinner.gif\">")'> (Although in that example, the false positive replacement is harmless.)
This is obviously a hack, but in a limited environment with reasonably well-known data...
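A small sketch of the replacement applied to a string with more than one image. Note that replace with a plain string pattern only touches the first occurrence, so the sketch uses a regular expression with the g flag. blankImageSources and the sample markup are hypothetical.

```javascript
// Prepend an empty src attribute to every "<img " occurrence in the string.
function blankImageSources(html) {
  return html.replace(/<img /g, "<img src='' ");
}

const feedHtml = "<p><img src='a.png'> and <img src='b.png'></p>";

// A string pattern replaces only the first occurrence.
const firstOnly = feedHtml.replace("<img ", "<img src='' ");
// The global regex replaces all of them.
const allBlanked = blankImageSources(feedHtml);

console.log(firstOnly);
console.log(allBlanked);
```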
You can use the DOM parser to manipulate the nodes.
Just replace the src attributes, store their original values and add them back later on.
Sample:
(function () {
    var s = "<img src='http://www.google.com/logos/olympics10-skijump-hp.png' /><img src='http://www.google.com/logos/olympics10-skijump-hp.png' />";
    var parser = new DOMParser();
    var dom = parser.parseFromString("<div id='mydiv' >" + s + "</div>", "text/xml");
    var imgs = dom.getElementsByTagName("img");
    var stored = [];
    for (var i = 0; i < imgs.length; i++) {
        var img = imgs[i];
        // remember the original URL and tag the image with its index
        stored.push(img.getAttribute("src"));
        img.setAttribute("myindex", i);
        // drop src entirely; setAttribute("src", null) would set the string "null"
        img.removeAttribute("src");
    }
    $(document.body).append(new XMLSerializer().serializeToString(dom));
    alert("Images appended");
    window.setTimeout(function () {
        alert("loading images");
        // restore the stored URLs, which starts the actual image requests
        $("#mydiv img").each(function () {
            this.src = stored[$(this).attr("myindex")];
        });
        alert("images loaded");
    }, 2000);
})();
