So we have an HTML document that's generated from YAML. We then process part of the DOM tree in the browser with remarkablejs. Namely, we get all elements with a "markdown" class, parse each element's text with remarkablejs, and then replace its innerHTML with the result. Unfortunately, either the HTML entities or the HTML tags fail to render correctly.
Is there a recommended way to parse markdown in the browser?
// sample text:
// - markdown list with <b>`bold &`</b>
var elements = document.getElementsByClassName('markdown');
var count = elements.length;
// generates HTML entities correctly but not HTML tags
for (let index = 0; index < count; index++) {
var newText = md.render(elements[index].textContent);
elements[index].innerHTML = newText;
}
// generates HTML tags correctly but not HTML entities
for (let index = 0; index < count; index++) {
var newText = md.render(elements[index].innerHTML);
elements[index].innerHTML = newText;
}
Update: I think I understand what happens. remarkable treats &amp; as & followed by amp;, and therefore what I get back is &amp;amp;, which doesn't render correctly. But then even if I change the input to just &, innerHTML changes that back to &amp;.
Update: I was able to repro: https://jsfiddle.net/v1o7hLnr/. Two things are breaking remarkable: 1) using the innerHTML to get the text to parse, and 2) wrapping the HTML entity in a code block. See here: https://jsfiddle.net/vLr1qbfc/
Here's what's happening (see https://jsfiddle.net/k8hy2mtf/):
We get the text to parse from the innerHTML.
The innerHTML encodes & to &amp;.
Because the text is wrapped in a code block, ex: `x&y`, remarkable receives `x&amp;y`.
remarkable then promptly ignores the HTML entity
(This is why textContent worked.)
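The steps above can be reproduced without a browser. This is only a sketch of the mechanism, not a fix: serializeText and unescapeSerializedText are made-up helper names, and serializeText only approximates what innerHTML does to text content (it escapes &, < and >).

```javascript
// Approximates how innerHTML serializes the text inside an element.
function serializeText(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Reverses that escaping. &amp; is handled last so that "&amp;lt;"
// becomes "&lt;" rather than "<".
function unescapeSerializedText(html) {
  return html
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&');
}

// What the markdown parser receives when the source is read via innerHTML:
console.log(serializeText('`x&y`'));
```

Because the entity now sits inside a code span, the markdown parser treats it as literal text and escapes the & a second time on output.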
So then my original question stands... what's the best way to properly render an HTML document with remarkable? Should we be parsing at build time instead?
I am currently building a Chrome extension which has to find specific pages in a website specifically the Log In / Sign In page, the Sign Up / Register page, the About page and the Contact Us page.
I am trying to achieve this by first getting the list of elements in the page (which I have already done). Now I need to check the innerHTML of each element so that it is a leaf node in the DOM and contains part of the keyword, and I am trying to do this with a regex. I managed to build a regex which successfully returns what's in between a start or end tag of an element (i.e. the tag name along with its attributes), but not the innerHTML. Below is what I have done so far (with the example for the About page):
var list = document.body.getElementsByTagName("*");
var aboutElement = /^[^<.+>].*About.*[^(<.+>]$/i;
for (var i = 0; i < list.length; i++) {
if ((aboutElement.test(list[i].innerHTML)) || (aboutElement.test(list[i].alt))) {
list[i].click();
}
}
Any idea what I should add to it such that it only matches leaf nodes (nodes which do not contain other nodes) and not what's in a start or end tag? I also think that with what I've done it's going to match everything in the innerHTML because of the .* part so I may need to change that as well. Any help would be greatly appreciated!
Thanks to two of the answers in the comments I managed to solve the problem. I used .textContent and changed the regex as shown below and it worked.
var list = document.body.getElementsByTagName("*");
var aboutElement = /^(.*?\s*(\bAbout\b)[^$]*)$/i;
for (var i = 0; i < list.length; i++) {
if ((aboutElement.test(list[i].textContent)) || (aboutElement.test(list[i].alt))) {
list[i].click();
}
}
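For the "leaf nodes only" part of the question, one option is to check that an element has no child elements before testing its text. A minimal sketch, assuming the element exposes the standard children, textContent and alt properties; isLeafMatching is a made-up name, and the keyword is escaped so it can be dropped into a regex safely.

```javascript
// Returns true when `el` has no child elements and its text (or alt
// attribute) contains `keyword` as a whole word, case-insensitively.
function isLeafMatching(el, keyword) {
  var isLeaf = !el.children || el.children.length === 0;
  // Escape regex metacharacters in the keyword before building the pattern.
  var escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var pattern = new RegExp('\\b' + escaped + '\\b', 'i');
  return isLeaf &&
    (pattern.test(el.textContent || '') || pattern.test(el.alt || ''));
}
```

In the loop above this would replace the two aboutElement.test(...) calls, e.g. if (isLeafMatching(list[i], 'About')) { list[i].click(); }.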
I realize that there are several similar questions here but none of the answers solve my case.
I need to be able to take the innerHTML of an element and truncate it to a given character length with the text contents of any inner HTML element taken into account and all HTML tags preserved.
I have found several answers that cover this portion of the question fine as well as several plugins which all do exactly this.
However, in all cases the solution will truncate directly in the middle of any inner elements and then close the tag.
In my case I need the contents of all inner tags to remain intact, essentially allowing any "would be" truncated inner tags to exceed the given character limit.
Any help would be greatly appreciated.
EDIT:
For example:
This is an example of a link inside another element
The above is 51 characters long including spaces. If I wanted to truncate this to 23 characters, we would have to shorten the text inside the </a> tag. Which is exactly what most solutions out there do.
This would give me the following:
This is an example of a
However, for my use case I need to keep any remaining visible tags completely intact and not truncated in any way.
So given the above example, the final output I would like, when attempting to truncate to 23 characters is the following:
This is an example of a link
So essentially we are checking where the truncation takes place. If it is outside of an element we can split the HTML string to exactly that length. If on the other hand it is inside an element, we move to the closing tag of that element, repeating for any parent elements until we get back to the root string and split it there instead.
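To make that rule concrete, here is a rough string-based sketch of it. truncateKeepingTags is a made-up name, and the tokenizer is deliberately naive: it assumes well-formed HTML with no void elements such as <img> or <br>, no comments, and no > inside attribute values.

```javascript
// Truncates to `limit` visible characters, but if the cut point falls
// inside an element, keeps going until that element (and any parents)
// have been closed.
function truncateKeepingTags(html, limit) {
  var tokens = html.match(/<[^>]+>|[^<]+/g) || [];
  var out = '';
  var seen = 0;   // visible characters emitted so far
  var depth = 0;  // number of currently open elements
  for (var i = 0; i < tokens.length; i++) {
    var tok = tokens[i];
    if (tok.charAt(0) === '<') {
      if (tok.charAt(1) === '/') depth--;
      else if (!/\/>$/.test(tok)) depth++;
      out += tok;
      // Limit already reached and we are back at the root: stop here.
      if (seen >= limit && depth === 0) break;
    } else if (depth === 0 && seen + tok.length >= limit) {
      // Cut point is in root-level text: split exactly at the limit.
      out += tok.slice(0, limit - seen);
      break;
    } else {
      out += tok;
      seen += tok.length;
    }
  }
  return out;
}
```

With a hypothetical input such as 'This is an example <a href="#">of a link</a> inside another element' and a limit of 23, this returns 'This is an example <a href="#">of a link</a>'.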
It sounds like you'd like to truncate your HTML string by its text length rather than its markup length. For example, consider the following HTML:
'<b>foo</b> bar'
In this case the HTML is 14 characters in length and the text is 7. You would like to be able to truncate it to X text characters (for example 2) so that the new HTML is now:
'<b>fo</b>'
Disclosure: My answer uses a library I developed.
You could use the HTMLString library - Docs : GitHub.
The library makes this task pretty simple. To truncate the HTML as we've outlined above (e.g. to 2 text characters) using HTMLString you'd use the following code:
var myString = new HTMLString.String('<b>foo</b> bar');
var truncatedString = myString.slice(0, 2);
console.log(truncatedString.html());
EDIT: After additional information from the OP.
The following truncate function truncates to the last full tag and caters for nested tags.
function truncate(str, len) {
// Convert the string to a HTMLString
var htmlStr = new HTMLString.String(str);
// Check the string needs truncating
if (htmlStr.length() <= len) {
return str;
}
// Find the closing tag for the character we are truncating to
var tags = htmlStr.characters[len - 1].tags();
var closingTag = tags[tags.length - 1];
// Find the last character to contain this tag
for (var index = len; index < htmlStr.length(); index++) {
if (!htmlStr.characters[index].hasTags(closingTag)) {
break;
}
}
return htmlStr.slice(0, index);
}
var myString = 'This is an <b>example ' +
'of a link ' +
'inside</b> another element';
console.log(truncate(myString, 23).html());
console.log(truncate(myString, 18).html());
This will output:
This is an <b>example of a link</b>
This is an <b>example of a link inside</b>
Although HTML is notorious for being terribly formed and has edge cases which are impervious to regex, here is a super light way you could hackily handle HTML with nested tags in vanilla JS.
(function(s, approxNumChars) {
  var taggish = /<[^>]+>/g;
  s = s.slice(0, approxNumChars); // ignores tag lengths for solution brevity
  s = s.replace(/<[^>]*$/, ''); // rm any trailing partial tags
  var tags = s.match(taggish) || []; // match() returns null when there are no tags
  // find out which tags are unmatched
  var openTagsSeen = [];
  for (var tag_i = 0; tag_i < tags.length; tag_i++) {
    var tag = tags[tag_i];
    if (tag.charAt(1) !== '/') { // opening tag (closing tags start with "</")
      openTagsSeen.push(tag);
    }
    else {
      // quick version that assumes your HTML is correctly formatted (alas) -- else we would have to check the content inside for matches and loop through the opentags
      openTagsSeen.pop();
    }
  }
  // reverse and close unmatched tags
  openTagsSeen.reverse();
  for (tag_i = 0; tag_i < openTagsSeen.length; tag_i++) {
    s += ('</' + openTagsSeen[tag_i].match(/\w+/)[0] + '>');
  }
  return s + '...';
})
In a nutshell: truncate it (ignores that some chars will be invisible), regex match the tags, push open tags onto a stack, and pop off the stack as you encounter closing tags (again, assumes well-formed); then close any still-open tags at the end.
(If you want to actually get a certain number of visible characters, you can keep a running counter of how many non-tag chars you've seen so far, and stop the truncation when you fill your quota.)
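A sketch of that running-counter variant, under the same "well-formed, no void elements, no comments" assumptions; truncateVisible is an illustrative name and this is just as hacky as the version above.

```javascript
// Keeps at most `limit` visible (non-tag) characters, then closes any
// tags that are still open.
function truncateVisible(html, limit) {
  var tokens = html.match(/<[^>]+>|[^<]+/g) || [];
  var open = [];  // stack of open tag names
  var out = '';
  var seen = 0;   // visible characters emitted so far
  for (var i = 0; i < tokens.length && seen < limit; i++) {
    var tok = tokens[i];
    if (tok.charAt(0) !== '<') {
      var piece = tok.slice(0, limit - seen);
      out += piece;
      seen += piece.length;
    } else if (tok.charAt(1) === '/') {
      open.pop();
      out += tok;
    } else {
      open.push(tok.match(/\w+/)[0]);
      out += tok;
    }
  }
  // Close whatever is still open, innermost first.
  while (open.length) out += '</' + open.pop() + '>';
  return out;
}
```

For example, truncateVisible('<b>foo</b> bar', 2) gives '<b>fo</b>'.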
DISCLAIMER: You shouldn't use this as a production solution, but if you want a super light, personal, hacky solution, this will get basic well-formed HTML.
Since it's blind and lexical, this solution misses a lot of edge cases, including tags that should not be closed, like <img>, but you can hardcode those edge cases or, you know, include a lib for a real HTML parser if you want. Fortunately, since HTML is poorly formed, you won't see it ;)
You've tagged your question regex, but you cannot reliably do this with regular expressions. Obligatory link. So innerHTML is out.
If you're really talking characters, I don't see a way to do it other than to loop through the nodes within the element, recursing into descendant elements, totalling up the lengths of the text nodes you find as you go. When you find the point where you need to truncate, you truncate that text node and then remove all following ones — or probably better, you split that text node into two parts (using splitText) and move the second half into a display: none span (using insertBefore), and then move all subsequent text nodes into display: none spans. (This makes it much easier to undo it.)
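A minimal sketch of that walk, using the simpler "remove everything after the cut" variant rather than the splitText / display: none one. truncateNode is a made-up name; it only relies on nodeType, childNodes, nodeValue and removeChild, and it modifies the tree in place.

```javascript
// Truncates the text under `root` to `max` characters, in place.
// Returns how many characters of the budget are left afterwards.
function truncateNode(root, max) {
  var remaining = max;
  // Copy the child list first, because children are removed while looping.
  var children = Array.prototype.slice.call(root.childNodes);
  for (var i = 0; i < children.length; i++) {
    var child = children[i];
    if (remaining <= 0) {
      root.removeChild(child);          // everything after the cut goes
    } else if (child.nodeType === 3) {  // text node
      if (child.nodeValue.length > remaining) {
        child.nodeValue = child.nodeValue.slice(0, remaining);
      }
      remaining -= child.nodeValue.length;
    } else if (child.nodeType === 1) {  // element: recurse into it
      remaining = truncateNode(child, remaining);
    }
  }
  return remaining;
}
```

In a page this would be called as truncateNode(someElement, 23); working on a clone first (cloneNode(true)) keeps the original around if you need to undo it.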
Thanks to T.J. Crowder I soon came to the realization that the only way to do this with any kind of efficiency is to use the native DOM methods and iterate through the elements.
I've knocked up a quick, reasonably elegant function which does the trick.
function truncate(rootNode, max){
//Use textContent throughout: text nodes have no innerText property, so reading innerText on them would give undefined
var text = 'textContent';
//If total length of characters is less than the limit, short circuit
if(rootNode[text].length <= max){ return; }
var cloneNode = rootNode.cloneNode(true),
currentNode = cloneNode,
//Create DOM iterator to loop only through text nodes
ni = document.createNodeIterator(currentNode, NodeFilter.SHOW_TEXT),
frag = document.createDocumentFragment(),
len = 0;
//loop through text nodes
while (currentNode = ni.nextNode()) {
//if nodes parent is the rootNode, then we are okay to truncate
if (currentNode.parentNode === cloneNode) {
//if we are in the root textNode and the character length exceeds the maximum, truncate the text, add to the fragment and break out of the loop
if (len + currentNode[text].length > max){
currentNode[text] = currentNode[text].substring(0, max - len);
frag.appendChild(currentNode);
break;
}
else{
frag.appendChild(currentNode);
}
}
//If not, simply add the node to the fragment
else{
frag.appendChild(currentNode.parentNode);
}
//Track current character length
len += currentNode[text].length;
}
rootNode.innerHTML = '';
rootNode.appendChild(frag);
}
This could probably be improved, but from my initial testing it is very quick, probably due to using the native DOM methods and it appears to do the job perfectly for me. I hope this helps anyone else with similar requirements.
DISCLAIMER: The above code only deals with HTML tags one level deep; it does not deal with tags inside tags. It could be modified to do so by keeping track of each node's parent and appending the nodes to the correct place in the fragment. As it stands, this is fine for my requirements but may not be useful to others.
I need to collect all links out of text in javascript with regex, separating the actual content of href and the text of the link. So if the link is
<a class="r_lapi" href="someplace/topics/us/john.htm">John Dow</a>
I want to collect the content of href and "John Dow".
The links have class="r_lapi" in them that would identify the links I'm looking for.
What I have right now is:
var link_regex = new RegExp("/<a[^]*</a>/");
var match = content.match(link_regex, 'i');
console.log("match =", match );
Which does absolutely nothing. Any help is very much appreciated.
If you can use the DOM (you've said you want regex, but...)
var i;
var links = document.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
// use `links[i].innerHTML` here
}
You've said in a comment that you're trying to do this with regex because you're receiving the link HTML (presumably mixed with a bunch of other stuff) via ajax. You can use the browser to parse it and then look for the links in the parsed result, without adding the HTML to your document, using a disconnected element:
var div, links, i;
// Create an element; note we don't append it anywhere
div = document.createElement('div');
// Fill it in with the HTML
div.innerHTML = text;
// Find relevant links (same as the earlier example)
links = div.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
// use `links[i].innerHTML` here
}
Live Example, using this text returned via ajax:
John Dow
Don't pick me
Jane Bloggs
The only real "gotcha" here is that if the HTML contains image tags, the browser will start downloading those images (even though they won't be shown anywhere). This is true even if you use a document fragment, which is part of why I didn't bother above. (script tags in the text aren't a problem, they aren't executed when you use innerHTML but beware they are executed by things like jQuery's html function.)
Or if the data is coming back to you in some other form (like JSON), with the HTML in it, parse the JSON (or whatever) and then run each HTML fragment through the div one at a time:
function handleLinks(data) {
var div, links, htmlIndex, linkIndex;
div = document.createElement('div');
for (htmlIndex = 0; htmlIndex < data.htmlList.length; ++htmlIndex) {
div.innerHTML = data.htmlList[htmlIndex];
links = div.querySelectorAll("a.r_lapi");
for (linkIndex = 0; linkIndex < links.length; ++linkIndex) {
// Use `links[linkIndex].innerHTML` here
}
}
}
Live Example, using this JSON returned via ajax:
{
"htmlList": [
"blah blah John Dow blah blah",
"Don't pick me",
"Two in this one Jane Bloggs and Trevor Bloggs"
]
}
If you really need to use regex:
Beware that you cannot do this reliably with regular expressions in JavaScript; you need a parser.
You can get close with a couple of assumptions.
var link_regex = /<a(?:>|\s[^>]*>)(.*?)<\/a>/i;
var match = content.match(link_regex);
if (match) {
// Use match[1], which contains it
}
That looks for this:
1. The literal text <a
2. Either a > immediately following, or at least one whitespace character followed by any number of characters that aren't a >, followed by a >
3. Any number of characters, minimal-match
4. The literal text </a>
The "minimal match" in Step 3 is so we don't get more than we want if we have <a>first</a><a>second</a>.
I haven't tried to limit the regex by the class, I'll leave that as an exercise for the reader. :-)
Again, though, this is a bad idea. Instead, use the DOM (if you're doing this outside a browser, there are plenty of DOM implementations you can use).
One of the primary assumptions made above is that there is never a > character within an attribute value in the anchor. It's perfectly valid to have a > inside an attribute value, so that assumption does not hold in general.
If you're in a browser, you really should be using the native DOM.
If you're not, and assuming the href does not contain weird characters like > or ", you could use the following regex:
var matches = link.match(/^<a\s+[^>]*href="([^"]+)"[^>]*>([^<]*)<\/a>$/);
matches[1] == "someplace/topics/us/john.htm";
matches[2] == "John Dow";
Please note that this will fail on certain links like
test
John <b>Dow</b>
For a complete solution, use an HTML parser.
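If the input really is that simple (no > or " inside attribute values, nothing but plain text inside the link), the same kind of pattern can be applied globally and filtered by class. A rough sketch under exactly those assumptions; collectLinks is an illustrative name, and the caveats above still apply.

```javascript
// Collects { href, text } for every <a> whose class list contains "r_lapi".
function collectLinks(content) {
  var anchorRe = /<a\s+([^>]*)>([^<]*)<\/a>/gi;
  var results = [];
  var m;
  while ((m = anchorRe.exec(content)) !== null) {
    var attrs = m[1];
    var href = /\bhref="([^"]*)"/i.exec(attrs);
    var cls = /\bclass="([^"]*)"/i.exec(attrs);
    if (href && cls && cls[1].split(/\s+/).indexOf('r_lapi') !== -1) {
      results.push({ href: href[1], text: m[2] });
    }
  }
  return results;
}
```

Each entry in the returned array has the href value and the link text, so the two pieces the question asks for are available as result.href and result.text.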
What if I want to add an ASCII symbol from JS to a node somewhere?
I tried it as a TextNode, but it wasn't parsed as a character code:
var dropdownTriggerText = document.createTextNode('blabla ∧');
You can't create nodes with HTML entities. Your alternatives would be to use unicode values
var dropdownTriggerText = document.createTextNode('blabla \u0026');
or set innerHTML of the element. You can of course directly input &...
createTextNode is supposed to take any text input and insert it into the DOM exactly like it is. This makes it impossible to insert for example HTML elements, and HTML entities. It’s actually a feature, so you don’t need to escape these first. Instead you just operate on the DOM to insert text nodes.
So, you can actually just use the & symbol directly:
var dropdownTriggerText = document.createTextNode('blabla &');
I couldn't find an automated way to do this. So I made a function.
// render HTML as text for inserting into text nodes
function renderHTML(txt) {
var tmpDiv = document.createElement("div"); tmpDiv.innerHTML = txt;
return tmpDiv.innerText || tmpDiv.textContent || txt;
}
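If only numeric character references (such as &#8743; or &#x26;) need handling, a DOM-free alternative is to convert them to characters before calling createTextNode. This is a sketch, not a full entity decoder: decodeNumericEntities is a made-up name, and named entities like &amp; are deliberately left untouched.

```javascript
// Replaces decimal (&#8743;) and hex (&#x2227;) character references
// with the characters they stand for.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-f]+|\d+);/gi, function (match, code) {
    var isHex = code.charAt(0).toLowerCase() === 'x';
    var point = parseInt(isHex ? code.slice(1) : code, isHex ? 16 : 10);
    // Leave out-of-range references as they were.
    return point <= 0x10FFFF ? String.fromCodePoint(point) : match;
  });
}
```

The result can then be passed straight to document.createTextNode, since it contains the actual characters rather than markup.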
I am using Struts. I am getting HTML text from a database, storing it in a string, and passing it to a JSP. In the JSP I have to extract the plain text from that HTML string and display it in a textarea using JavaScript.
Please suggest some solutions, I am not allowed to use jquery.
You could try something like a mini-parser.
Like this function:
function HTMLtoBB(html) {
var search = new Array( /\<b\>(.*?)\<\/b\>/g,
/\<i\>(.*?)\<\/i\>/g,
/\<u\>(.*?)\<\/u\>/g,
/\<font size=\'(.*?)\'\>(.*?)\<\/font\>/g,
/\<font color=\'(.*?)\'\>(.*?)\<\/font\>/g,
/\<img src=\'(.*?)\'\>/g,
/\<a href=\'(.*?)\'\>(.*?)\<\/a\>/g,
/\<blockquote\>(.*?)\<\/blockquote\>/g,
/\<center\>(.*?)\<\/center\>/g
);
var replace = new Array("[b]$1[/b]",
"[i]$1[/i]",
"[u]$1[/u]",
"[size=$1]$2[/size]",
"[color=$1]$2[/color]",
"[img=$1]",
"[url=$1]$2[/url]",
"[quote]$1[/quote]",
"[center]$1[/center]"
);
for (var i = 0; i < search.length; i++) {
html = html.replace(search[i], replace[i]);
}
return html;
}
This will convert the HTML tags to BB codes. Or you can replace the BB codes with something else.
You could attach the loaded HTML to the dom, and then use element.innerText to strip away all the HTML, leaving just the plain text (if this is what you want to do - I don't think it is completely clear from your question)