Strict HTML parser in JavaScript


In HTML, block elements can't be children of inline elements. Browsers however are happy to accept this HTML:
<i>foo <h4>bar</h4> fizz</i>
and render it as one would expect; DOMParser does not choke on it either.
But it's not valid and is therefore hard to convert to another schema. Pandoc parses the above as (option1):
<i>foo </i><h4>bar</h4> fizz
which is at least valid but not faithful. Another approach would be (option2):
<i>foo </i><h4><i>bar</i></h4><i> fizz</i>
Is there a way to force DOMParser to parse more strictly, so that the result is option 1 or 2? (It doesn't seem possible.)
Alternatively, what would be the best approach to deal with this, that is, given the first string, get option 1 or 2 as a result? Is there a JS parser that does this (and other strict enforcing of the standard)?
Edit: it turns out the HTML parser of at least Chrome (78.0.3904.108) behaves differently when the content is in a p instead of, say, a div. When the HTML above is in a p then it gets parsed as option 2! But it's left as is when inside a div.
So I guess the question is now: how to enforce the behavior of ps onto divs?

Related

Most efficient way to compare two DOMStrings if they represent the same DOM node in JavaScript

I use React's renderToStaticMarkup function to create HTML markup. Elsewhere in the app, that markup is assigned to the innerHTML property of another DOM node. To keep images and iframes from re-rendering, I want to compare first and only assign when the current innerHTML differs from the new markup:
if (domNode.innerHTML !== newMarkup) {
  domNode.innerHTML = newMarkup;
}
For some reason, Reacts renderToStaticMarkup creates HTML Markup for images as:
<img src="https://placebeard.it/200x150" alt="" class="content-media-image-linked"/>
but the DOM innerHTML has a value of
<img src="https://placebeard.it/200x150" alt="" class="content-media-image-linked">
So basically the difference IS in the trailing / (but this does not need to be the rule of thumb)
I wonder what would be the most efficient/fast way to determine, whether those two DOMStrings represent the same DOM Node.
1. String Comparison
It would be probably enough to replace/remove all occurrences of />
2. Parsing/converting to DOMNodes
This is the safer method, but also much more expensive. I would have to use something like document.createRange().createContextualFragment (see this post) and then use the isEqualNode method.
Does anyone have a better suggestion?

As I think you know, the / in /> at the end of a tag has no meaning whatsoever in HTML (it does in XHTML and JSX), so my temptation would be
Change the /> to > and compare; if they match, they're the same.
If they don't match, parse it and use isEqualNode
The first gives you the quick win in what I assume will be the majority case. The second works the slower but more robust way, allowing attributes to be in a different order without introducing a difference (unless it makes a difference), etc.
When replacing the /> with >, be sure of course to only do so in the right place, which may or may not be tricky depending on the strings you're dealing with (e.g., do they contain nested HTML, etc.). ("Tricky" as in "you can't just use a simple regular expression" if the HTML isn't of a single element like your img example.)
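The two-step check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: normalizeSelfClosing and representsSameNodes are hypothetical names, the regular expression is deliberately naive (it assumes "/>" never occurs inside attribute values or text content), and the slow path assumes a browser with template element support, used here in place of createContextualFragment.

```javascript
// Hypothetical helper: drop the optional "/" (and any whitespace before it)
// at the end of a tag. Naive on purpose: it also rewrites "/>" that appears
// inside attribute values or text, so use it only on markup you control.
function normalizeSelfClosing(markup) {
  return markup.replace(/\s*\/>/g, '>');
}

// Hypothetical helper: true when both strings describe the same nodes.
function representsSameNodes(currentHtml, newMarkup) {
  // Fast path: cheap string comparison after normalization.
  if (normalizeSelfClosing(currentHtml) === normalizeSelfClosing(newMarkup)) {
    return true;
  }
  // Slow path (browser only): parse both strings and compare the trees.
  // isEqualNode ignores attribute order, so it is more forgiving.
  var a = document.createElement('template');
  var b = document.createElement('template');
  a.innerHTML = currentHtml;
  b.innerHTML = newMarkup;
  return a.content.isEqualNode(b.content);
}
```

With the img strings from the question, the fast path already reports a match, so the parsing step would only run for markup that differs in some other way.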

A quick fix for my issue was to sanitize the HTML markup produced by the renderToStaticMarkup call. In my case the markup is generated only occasionally, but the DOM node equality check runs very often, so I went with a plain string equality check.
I tried multiple libraries to achieve that:
sanitize-html looked promising, but was not removing the trailing /
html-minifier worked, but I had issues using it with ES6 imports
I ended up using dompurify

Using grease/tampermonkey (or another extension) to omit certain characters inside web elements

Firstly, I apologize if my terminology here isn't the most accurate; I'm very much a novice when it comes to programming. A forum I frequent has added a bunch of unnecessary, "glitchy" images and text to the page as part of some promotion, but the result is that the forum is now difficult to use and read. I was able to script out most of it using adblock, but there's one last bit that shows up inside the forum elements themselves, and adblock wants to remove the whole element (which breaks the forum). This is part of the code in question, with the URLs changed:
<td class="windowbg" valign="middle" width="42%">&blk34;&blk34;&blk34;&blk34;&blk34;
Thread title <span class="smalltext"></span><img src="example.com/forumicon.gif"></td>
As you can see, the ▓ character shows up a bunch of times for no reason. Is there a way to make my browser ignore this character when it's inside of an element? If there's a way to do this using AdBlock, I am not smart enough to see it.

Here's one way to do it, using a NodeIterator:
var iter = document.createNodeIterator( document.body, NodeFilter.SHOW_TEXT );
var node;
while (node = iter.nextNode()) {
  node.textContent = node.textContent.replace( /[\u2580-\u259f]+/g, '' );
}
This is just plain JavaScript code; you can paste it into the Firefox / Chrome JS console to test it. The regexp /[\u2580-\u259f]+/ matches any sequence of characters in the "Block Elements" Unicode block, including U+2593 Dark Shade (▓). You may want to tweak the regexp to match the characters you want to remove. (Tip: If you don't know what the codes for those characters are, copy and paste them into the "UTF8 String" box on this page.)
Ps. If these characters that you want to remove occur only in a certain part of the document, you can make this code a bit more efficient by replacing the root node (document.body above) with the specific DOM node that you want to remove the characters from. To find the nodes you want, you can use e.g. document.getElementById() or, more generally, document.querySelector() (or even document.querySelectorAll() and loop over the results).
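As a rough sketch of the narrowing suggested above, the loop can be limited to the elements that actually contain the unwanted characters. The selector 'td.windowbg' is taken from the markup in the question; stripBlockChars and cleanMatchingElements are hypothetical helper names, and whether a userscript manager needs extra metadata around this code is not covered here.

```javascript
// Remove every character in the Unicode "Block Elements" range
// (U+2580 to U+259F), which includes U+2593 Dark Shade.
function stripBlockChars(text) {
  return text.replace(/[\u2580-\u259f]+/g, '');
}

// Clean only the text nodes under elements that match `selector`.
// `doc` is expected to be a Document.
function cleanMatchingElements(doc, selector) {
  var cells = doc.querySelectorAll(selector);
  for (var i = 0; i < cells.length; i++) {
    var iter = doc.createNodeIterator(cells[i], NodeFilter.SHOW_TEXT);
    var node;
    while ((node = iter.nextNode())) {
      node.textContent = stripBlockChars(node.textContent);
    }
  }
}

// Only runs where a DOM exists (browser console or userscript).
if (typeof document !== 'undefined') {
  cleanMatchingElements(document, 'td.windowbg');
}
```

Keeping the string cleanup in its own function makes it easy to try the regexp on sample text before touching the page.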

Using variables with jQuery's replaceWith() method

OK, this one seems pretty simple (and it probably is). I am trying to use jQuery's replaceWith() method, but I don't feel like putting all of the replacement HTML into the method call itself (it's about 60 lines of HTML). So I want to put the replacement HTML in a variable named qOneSmall, like so:
var qOneSmall = qOneSmall.html('..........all the html');
but when I try this I get this error back
Uncaught SyntaxError: Unexpected token ILLEGAL
I don't see any reserved words in there..? Any help would be appreciated.

I think the solution is to grab only the element on the page you're interested in. You say you have about 60 lines. If you know exactly what you want to replace, place just that text in a div with id='mySpecialText'. Then use jQuery to find and replace just that.
var replacementText = "....all the HTML";
$("#mySpecialText").text(replacementText);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="mySpecialText">Foo</div>

If you're only looking to replace text then jaj.laney's .text() approach can be used. However, that will not render the string as HTML.
The likely reason your use of .html() fails is that qSmallOne is not a jQuery object. The method cannot be called on arbitrary variables. You can assign the HTML string to a variable and pass that string to the .html() function like this:
var htmlstring = '<em>emphasis</em> and <strong>strong</strong>';
$('#target').html(htmlstring);
To see the difference between using .html() and .text() you can check out this short fiddle.
Edit after seeing the HTML
So there is a lot going on here. I'm just going to group these things into a list of issues
The HTML Strings
So I actually learned something here. The line breaks typed inside the quoted HTML string are breaking the string. The illegal-ness comes from the fact that the string is never properly terminated, since JavaScript thinks it ends at the end of the first line. Strip the line breaks out of your strings and they're perfectly valid.
Variable Names
Minor thing, you've got a typo in qSmallOne. Be sure to check your spelling especially when working with these giant variables. A little diligence up front will save a bunch of headache later.
Selecting the Right Target
Your targets for the change in content are IDs that are in the strings in your variables and not in the actual DOM. While it looks like you're handling this, I found it rather confusing. I would use one containing element with a static ID and target that instead (that way you don't have to remember why you're handling multiple IDs for one container in the future).
Using replaceWith() and html()
.replaceWith() is used to replace an element with something else. This includes the element that is being targeted, so you need to be very aware of what you're wanting to replace. .html() may be a better way to go since it replaces the content within the target, not including the target itself.
I've made these updates and forked your fiddle here.
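To illustrate the point under "The HTML Strings" above, here is a minimal sketch of three ways to keep long markup readable without putting a raw line break inside a quoted string. The three-line markup is a hypothetical stand-in for the roughly 60 lines from the question, and '#target' is just the selector used in the earlier snippet.

```javascript
// Option 1: concatenate one quoted string per line.
var concatenated =
  '<div id="qOne">\n' +
  '<p>Question one</p>\n' +
  '</div>';

// Option 2: join an array of lines.
var joined = [
  '<div id="qOne">',
  '<p>Question one</p>',
  '</div>'
].join('\n');

// Option 3 (ES2015 and later): a template literal may span several lines.
var templated = `<div id="qOne">
<p>Question one</p>
</div>`;

// In a page with jQuery loaded, any of the three can then be passed along:
// $('#target').html(templated);
```

All three variables hold the same string, so the choice is mostly about which browsers need to be supported and which form is easiest to maintain.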

Dynamic Unicode Generation into the DOM

I have a function linking to a method for a JavaScript library I'm working on. Basically taking romanized Korean and converting it to specific Unicode sequences and re-inserting it into the DOM. The resulting strings that it generates are correct, but the re-insertion back into the DOM seems off...
For example: If I have the following in my DOM:
<ko>hangug-eo</ko>
The function is meant to convert it accordingly, replacing hangug-eo with 한국어 to show on the browser:
한국어 within the <ko> tags...
The function that does the string setting within the DOM is as follows:
function (){
var z=document.getElementsByTagName('ko');
for(x=z.length;x--;){
z[x].childNodes[0].data=kimchi.go(z[x].childNodes[0].data);
}
}
However, all this seems to do is place the &# Unicode entities straight into the DOM as literal text, without converting them to their respective characters. So all I'm seeing is &#54620;&#44397;&#50612;
Can anyone please point out what I may be doing wrong?
kimchi.go() is the function that ultimately provides the Unicoded string...

You can always just set the text directly using textContent without having to use HTML entities:
z[x].textContent = '한국어';
But if you need to use HTML entities, just use innerHTML instead
z[x].innerHTML = kimchi.go(z[x].childNodes[0].data);
You can see the latter in the example below.
https://jsfiddle.net/nmL3to8w/1/
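A third possibility, not taken from the answer above, is to decode the numeric entities into real characters first, so that the original assignment to the text node's data keeps working. The sketch below assumes kimchi.go() returns decimal or hexadecimal numeric entities such as &#54620;; decodeNumericEntities is a hypothetical helper, and named entities like &amp; are deliberately left untouched.

```javascript
// Hypothetical helper: turn numeric character references (&#54620; or
// &#xD55C;) into the characters they stand for. String.fromCodePoint
// throws a RangeError for values above 0x10FFFF.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, function (match, code) {
    var isHex = code.charAt(0).toLowerCase() === 'x';
    var codePoint = parseInt(isHex ? code.slice(1) : code, isHex ? 16 : 10);
    return String.fromCodePoint(codePoint);
  });
}

// Possible use inside the loop from the question (browser only):
// z[x].childNodes[0].data = decodeNumericEntities(kimchi.go(z[x].childNodes[0].data));
```

Because the decoded result is plain text, it can be assigned through data or textContent without the string being interpreted as HTML.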

How to tell the type of a JavaScript and/or jQuery object

This question pertains as much to the ECMAScript language implementation we know as JavaScript as it does to jQuery and the developer tools available in most popular browsers.
When you execute a statement like so:
var theElement = $('#theId').closest();
what is the type of theElement?
I assume that in a jQuery situation like above, many jQuery methods including the one above actually return the jQuery object itself, which packages the stuff you actually want to get to. This, so that it may maintain a fluent API and let you join method calls in a single statement like so:
$('#selector').foo().bar().gar().har();
However, in the case of jQuery then, how do you determine what the real underlying type is? For example, if the element returned was a table row with the Id tableRowNumber25, how do you get to that, say, using FireBug.
When I look at either a jQuery returned object or a simple JavaScript object in the watches window of Firebug or any of the Developer Tools in most popular browsers, I see a long laundry list of properties/keys and I don't know which one to look at. In a jQuery object, most of the properties are lambdas.
So, really, my question is -- how do you know the underlying type, how do you know what's actually being returned?

theElement will be a jQuery object: typeof theElement is "object", and theElement instanceof jQuery is true.
If you want the HTML element itself, you have to select it:
console.log(theElement[0]) //Return <div id='theId'>
console.log(theElement.get(0)) //Return <div id='theId'>
If you want the node name, the HTML element has a property called nodeName, which returns the node name in upper case:
console.log(theElement[0].nodeName)// Return DIV

typeof jQueryElementList.get(0) only returns the string "object" for a DOM element; to get the element's type, read its tagName (or nodeName) property instead.
Some browsers might return this as upper or lower case, I think. IE probably uppercases (see Testing the type of a DOM element in JavaScript). Apparently you can also check the nodeType property (jQueryElementList.get(0).nodeType) to determine whether it is an HTML element.
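Putting the two answers together, a small duck-typing helper can report what a value actually is when inspecting it in a console. This is only a sketch: describeValue is a hypothetical name, and the check relies on jQuery objects exposing a jquery version string and DOM nodes exposing nodeType and nodeName.

```javascript
// Hypothetical helper: describe a value as a jQuery object, a DOM node,
// or fall back to the generic Object.prototype.toString tag.
function describeValue(value) {
  if (value && typeof value.jquery === 'string') {
    return 'jQuery object wrapping ' + value.length + ' element(s)';
  }
  if (value && typeof value.nodeType === 'number') {
    return 'DOM node: ' + value.nodeName;
  }
  return Object.prototype.toString.call(value);
}

// Possible use in a page with jQuery loaded:
// console.log(describeValue($('#theId')));
// console.log(describeValue($('#theId')[0]));
```

Checking the jquery property instead of using instanceof keeps the helper usable on pages that load more than one copy of jQuery.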
