I need to collect all links out of text in javascript with regex, separating the actual content of href and the text of the link. So if the link is
John Dow
I want to collect the content of href and "John Dow".
The links have class="r_lapi" in them that would identify the links I'm looking for.
What I have right now is:
var link_regex = new RegExp("/<a[^]*</a>/");
var match = content.match(link_regex, 'i');
console.log("match =", match );
Which does absolutely nothing. Any help is very much appreciated.
If you can use the DOM (you've said you want regex, but...)
var i;
var links = document.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
// use `links[i].innerHTML` here
}
You've said in a comment that you're trying to do this with regex because you're receiving the link HTML (presumably mixed with a bunch of other stuff) via ajax. You can use the browser to parse it and then look for the links in the parsed result, without adding the HTML to your document, using a disconnected element:
var div, links, i;
// Create an element; note we don't append it anywhere
div = document.createElement('div');
// Fill it in with the HTML
div.innerHTML = text;
// Find relevant links (same as the earlier example)
links = div.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
// use `links[i].innerHTML` here
}
Live Example, using this text returned via ajax:
John Dow
Don't pick me
Jane Bloggs
The only real "gotcha" here is that if the HTML contains image tags, the browser will start downloading those images (even though they won't be shown anywhere). This is true even if you use a document fragment, which is part of why I didn't bother above. (script tags in the text aren't a problem, they aren't executed when you use innerHTML but beware they are executed by things like jQuery's html function.)
Or if the data is coming back to you in some other form (like JSON), with the HTML in it, parse the JSON (or whatever) and then run each HTML fragment through the div one at a time:
function handleLinks(data) {
var div, links, htmlIndex, linkIndex;
div = document.createElement('div');
for (htmlIndex = 0; htmlIndex < data.htmlList.length; ++htmlIndex) {
div.innerHTML = data.htmlList[htmlIndex];
links = div.querySelectorAll("a.r_lapi");
for (linkIndex = 0; linkIndex < links.length; ++linkIndex) {
// Use `links[linkIndex].innerHTML` here
}
}
}
Live Example, using this JSON returned via ajax:
{
"htmlList": [
"blah blah John Dow blah blah",
"Don't pick me",
"Two in this one Jane Bloggs and Trevor Bloggs"
]
}
If you really need to use regex:
Beware that you cannot do this reliably with regular expressions in JavaScript; you need a parser.
You can get close with a couple of assumptions.
var link_regex = /<a(?:>|\s[^>]*>)(.*?)<\/a>/i;
var match = content.match(link_regex);
if (match) {
// Use match[1], which contains the content between <a ...> and </a>
}
That looks for this:
1. The literal text <a
2. Either a > immediately following, or at least one whitespace character followed by any number of characters that aren't a >, followed by a >
3. Any number of characters, minimal-match
4. The literal text </a>
The "minimal match" in Step 3 is so we don't get more than we want if we have <a>first</a><a>second</a>.
I haven't tried to limit the regex by the class, I'll leave that as an exercise for the reader. :-)
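As one possible take on that exercise, here is a minimal sketch that loops over every anchor, keeps the ones whose class attribute contains r_lapi, and collects both the href and the link content. The helper name collectLinks is made up for illustration, and the sketch assumes double-quoted attribute values with no > inside them, so the caveats in this answer still apply.

```javascript
// Hypothetical helper: collect { href, text } for anchors whose class list contains "r_lapi".
// Assumes double-quoted attributes and no ">" inside attribute values.
function collectLinks(html) {
    var anchorRe = /<a\s([^>]*)>(.*?)<\/a>/gi;
    var results = [];
    var m;
    while ((m = anchorRe.exec(html)) !== null) {
        var attrs = m[1];
        var classMatch = /(?:^|\s)class="([^"]*)"/i.exec(attrs);
        var hrefMatch = /(?:^|\s)href="([^"]*)"/i.exec(attrs);
        // Only keep anchors that have an href and list r_lapi among their classes
        if (classMatch && hrefMatch &&
            classMatch[1].split(/\s+/).indexOf('r_lapi') !== -1) {
            results.push({ href: hrefMatch[1], text: m[2] });
        }
    }
    return results;
}
```

Each entry's text is whatever sits between the tags, so nested markup inside the link comes back as markup rather than plain text.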
Again, though, this is a bad idea. Instead, use the DOM (if you're doing this outside a browser, there are plenty of DOM implementations you can use).
One of the primary assumptions made above is that there is never a > character within an attribute value in the anchor (e.g., John Dow>). It's perfectly valid to have a > inside an attribute value, so that assumption is invalid.
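To make that failure concrete, here is a small sketch that applies the same pattern to an anchor whose title attribute contains a >. The sample markup is invented for the demonstration.

```javascript
// Same pattern as in the answer, applied to an anchor whose title contains ">".
var linkRegex = /<a(?:>|\s[^>]*>)(.*?)<\/a>/i;
var tricky = '<a title="a>b" href="x.htm">John Dow</a>';
var trickyMatch = tricky.match(linkRegex);
// The opening tag is cut short at the ">" inside the title,
// so the capture starts in the middle of the attributes.
console.log(trickyMatch[1]); // b" href="x.htm">John Dow
```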
If you're in a browser, you really should be using the native DOM.
If you're not, assuming the href does not contain weird characters like > or ", you could use the following regex:
var matches = link.match(/^<a\s+[^>]*href="([^"]+)"[^>]*>([^<]*)<\/a>$/);
matches[1] == "someplace/topics/us/john.htm";
matches[2] == "John Dow";
Please note that this will fail on certain links like
test
John <b>Dow</b>
For a complete solution, use an HTML parser.
I have an issue related to finding a regex for the link with some conditions. Here is the scenario:
I have created utils.ts, a TypeScript file. Basically, it takes an API response as input and returns formatted, HTML-supported text such as bold text, emails, images, and links.
So let's take one scenario which I am facing.
As the return value of the utils.ts file, I am getting this:
https://www.google.com Click here
(Note: normal links and 'a' tag links can occur in any order.)
As you can see from the above text, the part Click here is already formatted as HTML.
So I will get the following output on GUI
https://www.google.com Click here
So from this point, I want a regex which can format https://www.google.com but must not manipulate Click here, as it is already formatted.
Here I also want to format https://www.google.com as follows:
Google
The main problem I am facing is that when I replace the strings starting with 'https://..' with tags, it also replaces the links inside 'href', like this:
Google Google">Click me</a>
Which is what I don't want.
Please share your thoughts on this.
Thank you
Not yet formatted links can be found using alternations. The idea is - if a link is formatted it's not captured to a group (don't be confused that the regex still finds something - you should only look at Group 1). Otherwise, the link is captured to a group.
The regex below is really simple, just to explain the idea. You might want to update it with a better URL search pattern.
(?:href="https?\S+")|(https?\S+)
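Here is a rough sketch of how that alternation could be used with a replace callback. The helper name linkifyBareUrls is made up, the URL pattern is a slightly tightened variation of the one above, and the sketch assumes href values are double-quoted. It uses the URL itself as the link text.

```javascript
// Hypothetical helper: wrap bare URLs in anchors, leaving existing href="..." values alone.
function linkifyBareUrls(text) {
    var re = /href="https?[^"]*"|(https?:\/\/[^\s<"]+)/g;
    return text.replace(re, function (whole, bareUrl) {
        // Group 1 is undefined when the href alternative matched: return that text unchanged.
        return bareUrl ? '<a href="' + bareUrl + '">' + bareUrl + '</a>' : whole;
    });
}

console.log(linkifyBareUrls('https://www.google.com <a href="https://www.google.com">Click here</a>'));
```

A URL used as the visible text of an existing link would still be wrapped, so this only covers the case described in the question.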
If I understood correctly, you want to extract from the text those web addresses that appear in the text and are not links. If so check out the following javascript:
// the data:
var txt1 = 'https://www.google.com Click here http://other.domain.com';
// strip HTML tags (a plain function, to avoid extending String.prototype)
function stripHTML(str) {
    var reTag = /<[^>]*>/g;
    return str.replace(reTag, " ");
}
var txt2 = stripHTML(txt1);
//console.log(txt2);
// split into tokens on runs of whitespace
var tokens = txt2.split(/\s+/);
//console.log(tokens);
// build an address table from the tokens that look like URLs
var isUrl = /^https?:\/\/.*/;
var addresses = [];
for (var i = 0; i < tokens.length; i++) {
    if (isUrl.test(tokens[i])) {
        addresses.push(tokens[i]);
    }
}
console.log(addresses);
I'm writing a Firefox extension. I want to go through the entire plaintext, so not Javascript or image sources, and replace certain strings. I currently have this:
var text = document.documentElement.innerHTML;
var anyRemaining = true;
do {
var index = text.indexOf("search");
if (index != -1) {
// This does not just replace the string with something else,
// there's complicated processing going on here. I can't use
// string.replace().
} else {
anyRemaining = false;
}
} while (anyRemaining);
This works, but it will also go through non-text elements and HTML such as Javascript, and I only want it to do the visible text. How can I do this?
I'm currently thinking of detecting an open bracket and continuing at the next closing bracket, but there might be better ways to do this.
You can use xpath to get all the text nodes on the page and then do your search/replace on those nodes:
function replace(search,replacement){
var xpathResult = document.evaluate(
"//*/text()",
document,
null,
XPathResult.ORDERED_NODE_ITERATOR_TYPE,
null
);
var results = [];
var res; // declared here so the loop below does not create an implicit global
// We store the result in an array because if the DOM mutates
// during iteration, the iteration becomes invalid.
while(res = xpathResult.iterateNext()) {
results.push(res);
}
results.forEach(function(res){
res.textContent = res.textContent.replace(search,replacement);
})
}
replace(/Hello/g,'Goodbye');
<div class="Hello">Hello world!</div>
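One thing to watch for: the XPath expression //*/text() also selects the text nodes inside script and style elements, which the question wants left alone. Below is a minimal sketch of a plain recursive walk that skips those elements. The function name collectTextNodes is made up for illustration, and it relies only on the standard nodeType, nodeName and childNodes properties.

```javascript
// Hypothetical helper: gather text nodes under `node`, skipping <script> and <style>.
function collectTextNodes(node, out) {
    out = out || [];
    if (node.nodeType === 3) { // text node
        out.push(node);
    } else if (node.nodeType === 1 && !/^(script|style)$/i.test(node.nodeName)) {
        for (var i = 0; i < node.childNodes.length; i++) {
            collectTextNodes(node.childNodes[i], out);
        }
    }
    return out;
}

// In a page, usage could look like this:
// collectTextNodes(document.body).forEach(function (n) {
//     n.nodeValue = n.nodeValue.replace(/Hello/g, 'Goodbye');
// });
```

Because the nodes are collected into an array first, changing them afterwards does not disturb the traversal.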
You can either use a regex to strip the HTML tags or, probably easier, use a JavaScript function that returns the text without the HTML. See this for more details:
How can get the text of a div tag using only javascript (no jQuery)
I realize that there are several similar questions here but none of the answers solve my case.
I need to be able to take the innerHTML of an element and truncate it to a given character length with the text contents of any inner HTML element taken into account and all HTML tags preserved.
I have found several answers that cover this portion of the question fine as well as several plugins which all do exactly this.
However, in all cases the solution will truncate directly in the middle of any inner elements and then close the tag.
In my case I need the contents of all inner tags to remain intact, essentially allowing any "would be" truncated inner tags to exceed the given character limit.
Any help would be greatly appreciated.
EDIT:
For example:
This is an example of a link inside another element
The above is 51 characters long including spaces. If I wanted to truncate this to 23 characters, we would have to shorten the text inside the <a> tag, which is exactly what most solutions out there do.
This would give me the following:
This is an example of a
However, for my use case I need to keep any remaining visible tags completely intact and not truncated in any way.
So given the above example, the final output I would like, when attempting to truncate to 23 characters is the following:
This is an example of a link
So essentially we are checking where the truncation takes place. If it is outside of an element we can split the HTML string to exactly that length. If on the other hand it is inside an element, we move to the closing tag of that element, repeating for any parent elements until we get back to the root string and split it there instead.
It sounds like you'd like to be able to truncate the length of your HTML string as a text string, for example consider the following HTML:
'<b>foo</b> bar'
In this case the HTML is 14 characters in length and the text is 7. You would like to be able to truncate it to X text characters (for example 2) so that the new HTML is now:
'<b>fo</b>'
Disclosure: My answer uses a library I developed.
You could use the HTMLString library - Docs : GitHub.
The library makes this task pretty simple. To truncate the HTML as we've outlined above (e.g to 2 text characters) using HTMLString you'd use the following code:
var myString = new HTMLString.String('<b>foo</b> bar');
var truncatedString = myString.slice(0, 2);
console.log(truncatedString.html());
EDIT: After additional information from the OP.
The following truncate function truncates to the last full tag and caters for nested tags.
function truncate(str, len) {
// Convert the string to a HTMLString
var htmlStr = new HTMLString.String(str);
// Check the string needs truncating
if (htmlStr.length() <= len) {
return str;
}
// Find the closing tag for the character we are truncating to
var tags = htmlStr.characters[len - 1].tags();
var closingTag = tags[tags.length - 1];
// Find the last character to contain this tag
for (var index = len; index < htmlStr.length(); index++) {
if (!htmlStr.characters[index].hasTags(closingTag)) {
break;
}
}
return htmlStr.slice(0, index);
}
var myString = 'This is an <b>example ' +
'of a link ' +
'inside</b> another element';
console.log(truncate(myString, 23).html());
console.log(truncate(myString, 18).html());
This will output:
This is an <b>example of a link</b>
This is an <b>example of a link inside</b>
Although HTML is notorious for being terribly formed and has edge cases which are impervious to regex, here is a super light way you could hackily handle HTML with nested tags in vanilla JS.
(function(s, approxNumChars) {
    var taggish = /<[^>]+>/g;
    s = s.slice(0, approxNumChars); // ignores tag lengths for solution brevity
    s = s.replace(/<[^>]*$/, ''); // rm any trailing partial tags
    var tags = s.match(taggish) || []; // match() returns null when there are no tags
    // find out which tags are unmatched
    var openTagsSeen = [];
    for (var tag_i = 0; tag_i < tags.length; tag_i++) {
        var tag = tags[tag_i];
        if (tag.charAt(1) !== '/') { // an opening tag
            openTagsSeen.push(tag);
        }
        else {
            // quick version that assumes your HTML is correctly formatted (alas) -- else we would have to check the content inside for matches and loop through the opentags
            openTagsSeen.pop();
        }
    }
    // reverse and close unmatched tags
    openTagsSeen.reverse();
    for (tag_i = 0; tag_i < openTagsSeen.length; tag_i++) {
        s += ('</' + openTagsSeen[tag_i].match(/\w+/)[0] + '>');
    }
    return s + '...';
})
In a nutshell: truncate it (ignores that some chars will be invisible), regex match the tags, push open tags onto a stack, and pop off the stack as you encounter closing tags (again, assumes well-formed); then close any still-open tags at the end.
(If you want to actually get a certain number of visible characters, you can keep a running counter of how many non-tag chars you've seen so far, and stop the truncation when you fill your quota.)
DISCLAIMER: You shouldn't use this as a production solution, but if you want a super light, personal, hacky solution, this will get basic well-formed HTML.
Since it's blind and lexical, this solution misses a lot of edge cases, including tags that should not be closed, like <img>, but you can hardcode those edge cases or, you know, include a lib for a real HTML parser if you want. Fortunately, since HTML is poorly formed, you won't see it ;)
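The running-counter idea mentioned above could look roughly like the sketch below. The function name indexAfterVisibleChars is made up, and it shares the same lexical assumptions: every < starts a tag, every > ends one, and an entity such as &amp; counts as several characters.

```javascript
// Hypothetical helper: return the index in `html` at which `maxVisible`
// non-tag characters have been seen, so html.slice(0, index) keeps that many.
function indexAfterVisibleChars(html, maxVisible) {
    var visible = 0;
    var inTag = false;
    for (var i = 0; i < html.length; i++) {
        var ch = html.charAt(i);
        if (inTag) {
            if (ch === '>') { inTag = false; }
            continue;
        }
        if (ch === '<') { inTag = true; continue; }
        // A visible character: stop before it once the quota is filled
        if (visible === maxVisible) { return i; }
        visible++;
    }
    return html.length;
}

var sample = '<b>foo</b> bar';
console.log(sample.slice(0, indexAfterVisibleChars(sample, 2))); // <b>fo
```

The slice can still end inside an open element, so the tag-closing step from the function above is needed afterwards.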
You've tagged your question regex, but you cannot reliably do this with regular expressions. Obligatory link. So innerHTML is out.
If you're really talking characters, I don't see a way to do it other than to loop through the nodes within the element, recursing into descendant elements, totalling up the lengths of the text nodes you find as you go. When you find the point where you need to truncate, you truncate that text node and then remove all following ones — or probably better, you split that text node into two parts (using splitText) and move the second half into a display: none span (using insertBefore), and then move all subsequent text nodes into display: none spans. (This makes it much easier to undo it.)
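A minimal sketch of the counting part of that approach: walk the nodes in document order, total the text lengths, and report the text node and offset where the limit falls. The function name findTruncationPoint is made up for illustration. It only reads nodeType, nodeValue and childNodes, and what to do with the result (splitText, hiding the rest) is left to the caller.

```javascript
// Hypothetical helper: find the text node and offset where `max` characters are reached.
// Returns null when the element holds fewer than `max` characters of text.
function findTruncationPoint(root, max) {
    var remaining = max;
    function walk(n) {
        if (n.nodeType === 3) { // text node
            if (n.nodeValue.length >= remaining) {
                return { node: n, offset: remaining };
            }
            remaining -= n.nodeValue.length;
            return null;
        }
        var kids = n.childNodes || [];
        for (var i = 0; i < kids.length; i++) {
            var hit = walk(kids[i]);
            if (hit) { return hit; }
        }
        return null;
    }
    return walk(root);
}

// In a page, the result could then be used along these lines:
// var point = findTruncationPoint(element, 23);
// if (point) { point.node.splitText(point.offset); }
```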
Thanks to T.J. Crowder I soon came to the realization that the only way to do this with any kind of efficiency is to use the native DOM methods and iterate through the elements.
I've knocked up a quick, reasonably elegant function which does the trick.
function truncate(rootNode, max){
//Use textContent: unlike innerText, it is also defined on text nodes
var text = 'textContent';
//If total length of characters is less than the limit, short circuit
if(rootNode[text].length <= max){ return; }
var cloneNode = rootNode.cloneNode(true),
currentNode = cloneNode,
//Create DOM iterator to loop only through text nodes
ni = document.createNodeIterator(currentNode, NodeFilter.SHOW_TEXT),
frag = document.createDocumentFragment(),
len = 0;
//loop through text nodes
while (currentNode = ni.nextNode()) {
//if nodes parent is the rootNode, then we are okay to truncate
if (currentNode.parentNode === cloneNode) {
//if we are in the root textNode and the character length exceeds the maximum, truncate the text, add to the fragment and break out of the loop
if (len + currentNode[text].length > max){
currentNode[text] = currentNode[text].substring(0, max - len);
frag.appendChild(currentNode);
break;
}
else{
frag.appendChild(currentNode);
}
}
//If not, simply add the node to the fragment
else{
frag.appendChild(currentNode.parentNode);
}
//Track current character length
len += currentNode[text].length;
}
rootNode.innerHTML = '';
rootNode.appendChild(frag);
}
This could probably be improved, but from my initial testing it is very quick, probably due to using the native DOM methods and it appears to do the job perfectly for me. I hope this helps anyone else with similar requirements.
DISCLAIMER: The above code only handles HTML tags one level deep; it does not handle tags nested inside other tags. It could easily be modified to do so by keeping track of each node's parent and appending the nodes to the correct place in the fragment. As it stands, this is fine for my requirements but may not be useful to others.
I am looking for a way to use javascript for splitting a sentence with HTML into words, and leaving the inline HTML tags with the text content intact. Punctuation can be regarded as a part of the word it is closest to. I'd like to use regex, and probably preg_split() for splitting the sentences. Here follows an example:
A word, <a href='#' title=''>words within tags should remain intact</a>, so here's
<b>even more</b> <u>words</u>
Preferably, I would like to end up with the following:
[0] => A
[1] => word,
[2] => <a href='#' title=''>words within tags should remain intact</a>,
[3] => so
[4] => here's
[5] => <b>even more</b>
[6] => <u>words</u>
I know about the discussion on parsing HTML with Regex (I enjoyed reading Bobince' answer :-P ), but I need to split the words of a sentence without harming html-tags with attributes. I don't see how I can do this with JS in a different way than Regex. Of course, if there are alternatives, I'd be more than happy to adapt them, to achieve a similar result. :-)
Edit:
I searched for similar questions on Stackoverflow about this, but these don't tick the boxes for me. To put it a little into perspective:
splitting-up-html-code-tags-and-content: targets to split up the inline HTML, which is what I want to leave intact.
php-regex-to-match-outside-of-html-tags: targets all text nodes in a HTML snippet, even within HTML tags. But as a matter of fact, I want to target only the spaces outside of HTML elements (so even excluding the spaces within the text nodes being wrapped with HTML tags).
This is possible, but there will be some drawbacks to using a pure regex solution. The easiest to call out is nested HTML. The solution I'm about to show uses some back referencing to try get around this, but if you get some complicated nested HTML it will probably start failing in weird ways.
/(?:<(\w+)[^>]*>(?:[\w+]+(?:(?!<).*?)<\/\1>?)[^\s\w]?|[^\s]+)/g
The regex uses a back reference and a negative lookahead to do the work. You could potentially remove the back reference depending on your requirements. The back reference helps with supporting nested tags.
JSFiddle Example - Check your console output for the example.
Here's the output from JSFiddle (I formatted the output a bit)
[
"A",
"word,",
"<a href='#' title=''>words within tags should remain intact</a>,",
"so",
"here's",
"<b>even more</b>",
"<u>words</u>"
]
Depending on your use case you'll need to modify it to work for you. I considered a word anything that wasn't a space, but you may have different criteria.
One negative to this method is if the start HTML tag is at the end of a word, it won't be picked up properly. ie. test<span>something else</span>.
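For reference, a short sketch of applying the pattern with String.prototype.match, using the sentence from the question. The variable names are only illustrative.

```javascript
// The pattern from this answer; with the g flag, match() returns every full match.
var wordOrTagRe = /(?:<(\w+)[^>]*>(?:[\w+]+(?:(?!<).*?)<\/\1>?)[^\s\w]?|[^\s]+)/g;
var sentence = "A word, <a href='#' title=''>words within tags should remain intact</a>, so here's\n" +
    "<b>even more</b> <u>words</u>";
var pieces = sentence.match(wordOrTagRe);
console.log(pieces);
```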
You can use the following snippet:
function splitIntoWords(div) {
function removeEmptyStrings(k) {
return k !== '';
}
var rWordBoundary = /[\s\n\t]+/; // Includes space, newline, tab
var output = [];
for (var i = 0; i < div.childNodes.length; ++i) { // Iterate through all nodes
var node = div.childNodes[i];
if (node.nodeType === Node.TEXT_NODE) { // The child is a text node
var words = node.nodeValue.split(rWordBoundary).filter(removeEmptyStrings);
if (words.length) {
output.push.apply(output, words);
}
} else if (node.nodeType === Node.COMMENT_NODE) {
// What to do here? You can do what you want
} else {
output.push(node.outerHTML);
}
}
return output;
}
window.onload = function() {
var div = document.querySelector("div");
document.querySelector("pre").innerText = 'Output: ' + JSON.stringify(splitIntoWords(div));
}
<!-- Note you have to surround your html with a div element -->
<div>A word, <a href='#' title=''>words within tags should remain intact</a>, so here's
<b>even more</b> <u>words</u>
</div>
<pre></pre>
It iterates through all child nodes, takes the text nodes and splits them into words (you can do this safely since text nodes can't contain children).
This takes care of most issues. With this, HTML such as text<span>Test</span> will come out ["text", "<span>Test</span>"] unlike the answer above.
This may fail with <span>There are</span>: 4 words which results in ["<span>There are</span>", ":" /* Extra colon */, "4", "words"] (which it's supposed to do, but not sure if it is desirable).
I would think this is very safe with nested elements.
I'm trying to use javascript/jQuery to wrap any abbreviations in a paragraph in a <abbr title=""> tag.
For example, in a sentence like, The WHO eLENA clarifies guidance on life-saving nutrition interventions, and assists in scaling up action against malnutrition, WHO and eLENA would both be wrapped in an <abbr> tag. I'd like the title attribute to display the extended version of the abbreviation; i.e. WHO = World Health Organization.
What's the best way of accomplishing this? I'm a bit new to javascript/jQuery so I'm fiddling in the dark here. So far I've created a variable that contains all the abbreviations as key/value pairs, and I can replace a specific instance of an abbreviation, but not much else.
1. First you must decide exactly what criteria you will use for selecting a replacement -- I would suggest doing it on a word boundary, such that "I work with WHO" will wrap "WHO" in an abbr, but "WHOEVER TOUCHED MY BIKE WILL REGRET IT" won't abbreviate "WHO". You should also decide if you are going to be case sensitive (probably you want to be, so that "The guy who just came in" doesn't abbreviate "who".)
2. Use jQuery to recurse over all of the text in the document. This can be done using the .contents() method (unlike .children(), it also returns text nodes), stepping through elements and reading all the text.
3. For each text node, split the text into words.
4. For each word, look it up in your key value store to see if it matches a key. If so, get the value, and construct a new element <abbr title="value">key</abbr>.
5. Break up the text node into a) the text before the abbreviation (a text node), b) the abbreviation itself (an element), and c) the text after the abbreviation (a text node). Insert all three as child nodes of the original text node's parent, replacing the original text node.
Each of these steps will require a bit of work and looking up some API docs, but that is the basic process.
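The splitting and lookup steps can be sketched on a plain string, leaving the DOM work aside. The function name segmentText and the shape of the returned objects are made up for illustration. Matching is case sensitive and happens on word boundaries, as suggested above.

```javascript
// Hypothetical helper: split text into plain runs and abbreviation hits.
// Entries with a `title` are the ones that would become <abbr> elements.
function segmentText(text, abbrs) {
    var parts = text.split(/\b/); // words and the separators between them
    var segments = [];
    var plain = '';
    for (var i = 0; i < parts.length; i++) {
        var part = parts[i];
        if (Object.prototype.hasOwnProperty.call(abbrs, part)) {
            if (plain) { segments.push({ text: plain }); plain = ''; }
            segments.push({ text: part, title: abbrs[part] });
        } else {
            plain += part;
        }
    }
    if (plain) { segments.push({ text: plain }); }
    return segments;
}

console.log(segmentText('I work with WHO', { WHO: 'World Health Organization' }));
```

Each returned entry corresponds to one node to create: a text node for plain runs, an abbr element for entries that carry a title.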
Firstly, this should really be done on the server, doing it on the client is very inefficient and much more prone to error. But having said that...
You can try processing the innerHTML of the element, but javascript and regular expressions are really bad at that.
The best way is to use DOM methods and parse the text of each element. When a matching word is found, replace it with an abbr element. This requires that where a match is found in a text node, the entire node is replaced because what was one text node will now be two text nodes (or more) either side of an abbr element.
Here is a simple function that goes close, but it likely has foibles that you need to address. It works on simple text strings, but you'll need to test it thoroughly on more complex strings. Naturally it should only ever be run once on a particular node or abbreviations will be doubly wrapped.
var addAbbrHelp = (function() {
var abbrs = {
'WHO': 'World Health Organisation',
'NATO': 'North Atlantic Treaty Organisation'
};
return function(el) {
var node, nodes = el.childNodes;
var word, words;
var adding, text, frag;
var abbr, oAbbr = document.createElement('abbr');
var oFrag = document.createDocumentFragment();
for (var i=0, iLen=nodes.length; i<iLen; i++) {
node = nodes[i];
if (node.nodeType == 3) { // if text node
words = node.data.split(/\b/);
adding = false;
text = '';
frag = oFrag.cloneNode(false);
for (var j=0, jLen=words.length; j<jLen; j++) {
word = words[j];
if (Object.prototype.hasOwnProperty.call(abbrs, word)) { // own keys only, so words like "constructor" don't match
adding = true;
// Add the text gathered so far
frag.appendChild(document.createTextNode(text));
text = '';
// Add the wrapped word
abbr = oAbbr.cloneNode(false);
abbr.title = abbrs[word];
abbr.appendChild(document.createTextNode(word));
frag.appendChild(abbr);
// Otherwise collect the words processed so far
} else {
text += word;
}
}
// If found some abbrs, replace the text
// Otherwise, do nothing
if (adding) {
frag.appendChild(document.createTextNode(text));
node.parentNode.replaceChild(frag, node);
}
// If found another element, add abbreviation help
// to its content too
} else if (node.nodeType == 1) {
addAbbrHelp(node);
}
}
}
}());
For the markup:
<div id="d0">
<p>This is the WHO and NATO string.</p>
<p>Some non-NATO forces were involved.</p>
</div>
and calling:
addAbbrHelp(document.getElementById('d0'));
results in (my formatting):
<div id="d0">
<p>This is the <abbr title="World Health Organisation">WHO</abbr>
and <abbr title="North Atlantic Treaty Organisation">NATO</abbr>
string.</p>
<p>Some non-<abbr title="North Atlantic Treaty Organisation">NATO</abbr> forces were involved.</p>
</div>
Using the word break pattern to split words is interesting because in strings like "with non–NATO forces", the word NATO will still get wrapped but not the "non–" part. However, if the abbreviation is split across a text node or by a hyphen, it will not be recognised unless the same pattern is included as a property name in the abbrs object.
Check out the javascript replace method.
I'd use jQuery to pull out all the text in the paragraph
var text = $('p#paragraphId').html();
Use a for loop to loop through the list of abbreviations you have and then use the replace() method mentioned above to swap out the abbreviation for the tag you need.
Finally use jQuery to set the html of the paragraph back to your newly updated string.
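A rough sketch of the replace step, assuming the abbreviations are stored as key/value pairs and the keys contain only letters and digits. The function name wrapAbbreviations is made up. Since it works on the HTML string, it would also touch matches inside attribute values or inside titles inserted earlier, which is the usual risk of string replacement on markup.

```javascript
// Hypothetical helper: wrap each known abbreviation in an <abbr> tag.
// Keys are assumed to contain only letters and digits (no regex special characters).
function wrapAbbreviations(html, abbrs) {
    Object.keys(abbrs).forEach(function (key) {
        var re = new RegExp('\\b' + key + '\\b', 'g');
        // A function replacement avoids special handling of "$" in the title
        html = html.replace(re, function () {
            return '<abbr title="' + abbrs[key] + '">' + key + '</abbr>';
        });
    });
    return html;
}

// With jQuery, usage could look like this:
// var p = $('p#paragraphId');
// p.html(wrapAbbreviations(p.html(), { WHO: 'World Health Organization' }));
```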