Remove and extract text in JavaScript

I'm wanting to do the following in JavaScript as efficiently as possible:
Remove <ul></ul> tags from a string and everything in between.
For what remains, every string that is encased within <li> and </li> I want dumped in an array, without any newline characters lurking at the end.
I'm thinking regexes are the answer, but I've never used them before. I could probably figure out a way, but it would likely not be the most efficient one.

As others have said, you do have to be careful parsing HTML with regexes. If the HTML is controlled, does not have nested ul or li tags in it, and doesn't have embedded strings that contain valid HTML tags or < or > chars (e.g. the HTML is coming from a known source in a known format), it can work fine. Here's one way to do what I think you were asking for:
function parseList(str) {
    var output = [], matches;
    var re = /<\s*li[^>]*>(.*?)<\/li>/gi;
    // remove newlines
    str = str.replace(/\n|\r/g, "");
    // get text between ul tags
    matches = str.match(/<\s*ul[^>]*>(.*?)<\/ul\s*>/);
    if (matches) {
        str = matches[1];
        // get text between each li tag
        while (matches = re.exec(str)) {
            output.push(matches[1]);
        }
    }
    return output;
}
It is more foolproof to use an actual HTML parser that understands the finer points of the format (like nested tags, tag values in embedded strings, etc...), but if you have none of that, a simpler parser like this can be used.
You can see it work here: http://jsfiddle.net/jfriend00/c9ZLT/
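For readers who want to try the idea outside a fiddle, here is a minimal self-contained sketch of the same approach. The name extractListItems is made up for illustration, and it assumes flat, well-formed markup with no nested lists. It uses [\s\S] so the input does not have to be stripped of newlines first, and trims each item instead.

```javascript
// Hypothetical helper, not the answer's original code.
// Assumes a single, non-nested <ul> with simple <li> children.
function extractListItems(html) {
  // [\s\S]*? matches across newlines, so no pre-flattening is needed.
  const listMatch = /<ul\b[^>]*>([\s\S]*?)<\/ul\s*>/i.exec(html);
  if (!listMatch) return [];

  const items = [];
  // \b after "li" keeps tags such as <link> from matching.
  for (const m of listMatch[1].matchAll(/<li\b[^>]*>([\s\S]*?)<\/li\s*>/gi)) {
    items.push(m[1].trim()); // drop stray newlines around the item text
  }
  return items;
}

const sample = '<ul>\n<li>one</li>\n<li class="x">two\n</li>\n</ul>';
console.log(extractListItems(sample)); // → [ 'one', 'two' ]
```

The same caveats apply as for the function above: attribute values containing > or nested lists will confuse it.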

Related

Tokenize HTML string in JavaScript [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 5 years ago.
I would like to split a string that looks like:
This is <strong>a</strong> test link and <br /> line. break
into the following with JavaScript:
[
'This',
'is',
'<strong>a</strong>',
'test',
'link',
'<br />',
'line.',
]
I tried splitting on spaces, and < >, but that obviously doesn't work for tags like strong and a. I'm not sure how to write a regex that doesn't split within HTML tags. I also tried to use jQuery children(), but it doesn't extract plain text, just the html tags. Any help would be great.
If the code is executing in a browser, using the browser's parser to separate the string into text and tag components may provide an alternative workaround:
var text = 'This is <strong>a</strong> link and <br /> line. break';
function splitHTML( text) {
    var parts = [];
    var div = document.createElement('DIV');
    div.innerHTML = text;
    div.normalize();
    for( var node = div.firstChild; node; node=node.nextSibling) {
        if( node.nodeType == Node.TEXT_NODE) {
            parts.push.apply( parts, node.textContent.split(" "));
        }
        else if( node.nodeType == Node.ELEMENT_NODE) {
            parts.push( node.outerHTML);
        }
    }
    return parts;
}
console.log( splitHTML( text));
Note the line that adds text nodes split by spaces to the result
parts.push.apply( parts, node.textContent.split(" "));
is for demonstration and needs further work to prevent zero-length strings in the output for spaces between text and HTML-tagged elements. Also, the HTML tags are reconstructed from the DOM element and may not exactly match the input: in this case the XHTML tags <br /> are returned as <br> HTML tags (which don't take a closing tag).
The general idea is to side step parsing html using a regex by parsing it with the browser. Understandably this may or may not fit with the target environment and a full set of requirements.
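The zero-length strings mentioned above come from splitting on a single space when a text node starts or ends with one. One possible fix, sketched here on a plain string so it runs without a DOM (splitWords is a hypothetical helper name), is to split on runs of whitespace and drop empty entries:

```javascript
// Hypothetical helper: split text-node content into words without empty strings.
function splitWords(text) {
  // /\s+/ collapses runs of whitespace; filter(Boolean) removes "" entries
  // produced by leading or trailing whitespace.
  return text.split(/\s+/).filter(Boolean);
}

// A text node sitting between two elements typically looks like this:
const between = " link and ";
console.log(between.split(" "));  // → [ '', 'link', 'and', '' ]
console.log(splitWords(between)); // → [ 'link', 'and' ]
```

In splitHTML, the text-node branch could then push splitWords(node.textContent) instead of the raw split result.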
To achieve what you want, you need to consider this:
Rule 1) if no "<" occurred yet, simply split at " ".
Rule 2) if "<" occurred, look for "/>" or "/"..">" and split after it, then start at rule 1 again.
Apply those rules while looping through a string and you are golden.
Making this recursive, i.e. handling nested tags like
<div>
<p>Hi</p>
<p>Bye</p>
</div>
is harder. As mentioned above, actually parsing an HTML tree is very complex.
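The two rules can be sketched as a simple loop. This is only an illustration under stated assumptions: the function name tokenize is made up, tags are not nested, and an element without a trailing slash always has a closing tag (so a bare <br> would be swallowed together with everything up to the next closing tag).

```javascript
// Hypothetical non-recursive tokenizer following Rule 1 and Rule 2.
function tokenize(input) {
  const tokens = [];
  let i = 0;
  while (i < input.length) {
    if (input[i] === " ") { // separators between tokens
      i++;
      continue;
    }
    if (input[i] === "<") {
      // Rule 2: find the end of the opening tag first.
      const openEnd = input.indexOf(">", i);
      if (openEnd === -1) { // malformed tail: keep it as one token
        tokens.push(input.slice(i));
        break;
      }
      let end = openEnd + 1;
      if (input[openEnd - 1] !== "/") {
        // Not self-closing: extend the token to the end of the next "</...>".
        const closeStart = input.indexOf("</", end);
        const closeEnd = closeStart === -1 ? -1 : input.indexOf(">", closeStart);
        if (closeEnd !== -1) end = closeEnd + 1;
      }
      tokens.push(input.slice(i, end));
      i = end;
    } else {
      // Rule 1: plain text runs until the next space or the next "<".
      let j = i;
      while (j < input.length && input[j] !== " " && input[j] !== "<") j++;
      tokens.push(input.slice(i, j));
      i = j;
    }
  }
  return tokens;
}

console.log(tokenize("This is <strong>a</strong> test and <br /> line. break"));
// → [ 'This', 'is', '<strong>a</strong>', 'test', 'and', '<br />', 'line.', 'break' ]
```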
Try this:
#(?:(?!<)[^<>]+(?!>))|(?:<(?=[^/>]+\/>).*\/>)|(?:<([^\s]+).*>.*(?=<\/\1>)<\/\1>)#g
It should work in simple cases, which is all that I can think of right now.
Use the captured group to find out the tag name, then execute it recursively for block elements such as div.

Replace with RegExp only outside tags in the string

I have strings in which some HTML tags may be present, like
this is a nice day for bowling <b>bbbb</b>
How can I use a RegExp to replace all b characters with, say, :blablabla:, but ONLY outside HTML tags?
So in that case the resulting string should become
this is a nice day for :blablabla:owling <b>bbbb</b>
EDIT: I would like to be more specific, based on the answers I have received. So first of all I have just a string, not DOM element, or anything else. The string may or may not contain tags (opening and closing). The main idea is to be able to replace anywhere in the text except inside tags. For example if I have a string like
not feeling well today :/ check out this link http://example.com
the regexp should replace only the first :/ with a real smiley image, but should not replace the second and third, because they are inside (and part of) a tag. Here's an example snippet using the regexp from one of the answers.
var s = 'not feeling well today :/ check out this link http://example.com';
var replaced = s.replace(/(?:<[^\/]*?.*?<\/.*?>)|(:\/)/g, "smiley_image_here");
document.querySelector("pre").textContent = replaced;
<pre></pre>
Strangely, the DEMO shows that it captures the correct group, but the same regexp does not seem to work in the replace function.
The regex itself to replace all bs with :blablabla: is not that hard:
.replace(/b/g, ":blablabla:")
It is a bit tricky to get the text nodes where we need to perform search and replace.
Here is a DOM-based example:
function replaceTextOutsideTags(input) {
    var doc = document.createDocumentFragment();
    var wrapper = document.createElement('myelt');
    wrapper.innerHTML = input;
    doc.appendChild( wrapper );
    return textNodesUnder(doc);
}
function textNodesUnder(el){
    var n, walk=document.createTreeWalker(el,NodeFilter.SHOW_TEXT,null,false);
    while(n=walk.nextNode())
    {
        if (n.parentNode.nodeName.toLowerCase() === 'myelt')
            n.nodeValue = n.nodeValue.replace(/:\/(?!\/)/g, "smiley_here");
    }
    return el.firstChild.innerHTML;
}
var s = 'not feeling well today :/ check out this link http://example.com';
console.log(replaceTextOutsideTags(s));
Here, we only modify the text nodes that are direct children of the custom-created element named myelt.
Result:
not feeling well today smiley_here check out this link http://example.com
var input = "this is a nice day for bowling <b>bbbb</b>";
var result = input.replace(/(^|>)([^<]*)(<|$)/g, function(_,a,b,c){
return a
+ b.replace(/b/g, ':blablabla:')
+ c;
});
document.querySelector("pre").textContent = result;
<pre></pre>
You can do this:
var result = input.replace(/(^|>)([^<]*)(<|$)/g, function(_,a,b,c){
return a
+ b.replace(/b/g, ':blablabla:') // you may do something else here
+ c;
});
Note that in most (not all, but most) real, complex use cases, it's much more convenient to manipulate a parsed DOM rather than just a string. If you're starting with an HTML page, you might use a library (some, like my one, accept regexes to do so).
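To see exactly what the callback approach does and does not protect, here is a small self-contained sketch. The wrapper name replaceOutsideMarkup is hypothetical. It shows that tag names and attributes are left alone, while text between an opening and a closing tag is still treated as text and gets replaced. It also assumes attribute values never contain < or >.

```javascript
// Hypothetical wrapper around the same regex: applies `pattern` only to the
// runs of characters that sit between a ">" (or the start) and a "<" (or the end).
function replaceOutsideMarkup(input, pattern, replacement) {
  return input.replace(/(^|>)([^<]*)(<|$)/g, (_, before, text, after) =>
    before + text.replace(pattern, replacement) + after);
}

const out = replaceOutsideMarkup('bowling <b class="big">bbbb</b>', /b/g, ":x:");
console.log(out);
// → :x:owling <b class="big">:x::x::x::x:</b>
// The "b" in the tag name and in class="big" is untouched,
// but the element's text content is still replaced.
```

If the content of particular elements must stay untouched as well, the DOM-based approach is the safer route.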
I think you can use a regex like this : (Just for a simple data not a nested one)
/<[^\/]*?b.*?<\/.*?>|(b)/ig
If you want to use a regex, I suggest applying the regex below repeatedly until all tags are removed:
/<[^\/][^<]*>[^<]*<\/.*?>/g
then use a replace for finding any b.

Is there any way for me to work with this 100,000 item new-line separated string of words?

I've got a 100,000+ long list of English words in plain text. I want to use split() to convert the list into an array, which I can then convert to an associative array, giving each list item a key equal to its own name, so I can very efficiently check whether or not a string is an English word.
Here's the problem:
The list is new-line separated.
aa
aah
aahed
aahing
aahs
aal
aalii
aaliis
aals
This means that var list = ' <copy/paste list> ' isn't going to work, because JavaScript quotes don't work multi-line.
Is there any way for me to work with this 100,000 item new-line separated string?
Replace the newlines with commas in any text editor before copying to your JS file.
One workaround would be to paste the list into Notepad++, then select all and use Edit > Line Operations > Join Lines.
This removes new lines and replaces them with spaces.
If you're doing this client side, you can use jQuery's get function to get the words from a text file and do the processing there:
jQuery.get('wordlist.txt', function(results){
//Do your processing on results here
});
If you're doing this in Node.js, follow the guide here to see how to read a file into memory.
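For the Node.js route, a minimal sketch could look like the following. The file path and the loadWordSet helper are assumptions made for illustration; a tiny stand-in word list is written to a temp file first so the example runs as-is. A Set is used in place of the associative array from the question, since membership checks are its main purpose.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical location; replace with the path to the real word list.
const wordFile = path.join(os.tmpdir(), "wordlist-demo.txt");
// Stand-in data (mixed \n and \r\n line endings) so the sketch is runnable.
fs.writeFileSync(wordFile, "aa\naah\r\naahed\n");

// Read the whole file, split on either line-ending style, drop empty lines.
function loadWordSet(file) {
  const text = fs.readFileSync(file, "utf8");
  return new Set(text.split(/\r?\n/).filter(Boolean));
}

const words = loadWordSet(wordFile);
console.log(words.has("aah")); // → true
console.log(words.has("ahh")); // → false
```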
You can use notepad++ or any semi-advanced text editor.
Go to notepad++ and push Ctrl+H to bring up the Replace dialog.
Towards the bottom, select the "Extended" Search Mode
You want to find "\r\n" and replace it with ", "
This will remove the newlines and replace them with commas.
jsfiddle Demo
Addressing this purely from having a string and trying to work with it in JavaScript through copy paste. Specifically the issues regarding, "This means that var list = ' ' isn't going to work, because JavaScript quotes don't work multi-line.", and "Is there any way for me to work with this 100,000 item new-line separated string?".
You can treat the string like a string in a comment in JavaScript. Although counter-intuitive, this is an interesting approach. Here is the main function:
function convertComment(c) {
    return c.toString().
        replace(/^[^\/]+\/\*!?/, '').
        replace(/\*\/[^\/]+$/, '');
}
It can be used in your situation as follows:
var s = convertComment(function() {
/*
aa
aah
aahed
aahing
aahs
aal
aalii
aaliis
aals
*/
});
At which point you may do whatever you like with s. The demo simply places it into a div for displaying.
jsFiddle Demo
Further, here is an example of taking the list of words, getting them into an array, and then referencing a single word in the array.
//previously shown code
var all = s.match(/[^\r\n]+/g);
var rand = Math.floor(Math.random() * all.length); // Math.floor rather than parseInt, which stringifies its argument first
document.getElementById("random").innerHTML = "Random index #"+rand+": "+all[rand];
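In environments that support ES2015 template literals, a multi-line paste also works directly, without the comment trick, because backtick-delimited strings may span lines. A minimal sketch, with the variable names chosen here purely for illustration:

```javascript
// Backtick strings can contain raw newlines, so the list can be pasted as-is.
const list = `
aa
aah
aahed
aahing
`;

// Split on either line-ending style and drop the empty first/last entries.
const englishWords = new Set(list.split(/\r?\n/).filter(Boolean));

console.log(englishWords.has("aah")); // → true
console.log(englishWords.has("ahh")); // → false
```

This only holds as long as the pasted text contains no backticks or ${ sequences, which a plain word list should not.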
If the words are in a separate file, you can load them directly into the page and go from there. I've used a script element with a MIME type that should mean browsers ignore the content (provided it's in the head):
<script type="text/plain" id="wordlist">
aa
aah
aahed
aahing
aahs
aal
aalii
aaliis
aals
</script>
<script>
var words = (function() {
var words = '\n' + document.getElementById('wordlist').textContent + '\n';
return {
checkWord: function (word) {
return words.indexOf('\n' + word + '\n') != -1;
}
}
}());
console.log(words.checkWord('aaliis')); // true
console.log(words.checkWord('ahh')); // false
</script>
The result is an object with one method, checkWord, that has access to the word list in a closure. You could add more methods like addWord or addVariant, whatever.
Note that textContent may not be supported in all browsers, you may need to feature detect and use innerText or an alternative for some.
For variety, another solution is to put the unaltered content into
A data attribute - HTML attributes can contain newlines
or a "non-script" script - eg. <SCRIPT TYPE="text/x-wordlist">
or an HTML comment node
or another hidden element that allows content
Then the content could be read and split/parsed. Since this would be done outside of JavaScript's string literal parsing it doesn't have the issue regarding embedded newlines.

regex replace characters within tags

I'm already using a html parser, but I need to create a regex that will select the < and > symbols within the first instance of <code> tags - in this case, the one with the class "html".
<code class="html">
<b>test</b><script>lol</script>
<code>test</code> <b> test </b>
<lol>
</lol>
<test>
</code>
So every < or > within the indented area starting from <b> to the start of the last </code> should be replaced, leaving the outer <code> tags alone.
I'm using JavaScript's .replace method and would like all < and > symbols within the code area to turn into the entities &lt; and &gt;.
I imagine it's best to use a look-ahead/look-behind regex using $1 etc., but I can't figure out where to begin, so any help would be much appreciated.
How about something like this? In this example I'm creating a variable and populating the variable with html, just to get things started
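var doc = document.createElement( 'div' );
doc.innerHTML = inputHtml; // inputHtml: your input HTML string goes here
Here I'm pulling the first code tag (getElementsByTagName returns a collection, so it has to be indexed)
var code = doc.getElementsByTagName( 'code' )[0];
var string = code.innerHTML;
Once you have the string, simply replace the desired brackets with their entities (the g flag replaces every occurrence, not just the first)
string = string.replace(/</g, "&lt;");
string = string.replace(/>/g, "&gt;");
then just reinsert the replaced value back into your source HTML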
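The replacement step on its own can be sketched as a small pure function, which makes it easy to try without a DOM. The names ENTITIES and escapeAngleBrackets are made up for this example, and only < and > are handled, as in the question.

```javascript
// Map each bracket to its HTML entity.
const ENTITIES = { "<": "&lt;", ">": "&gt;" };

// Hypothetical helper: one pass over the string, replacing every < and >.
function escapeAngleBrackets(source) {
  return source.replace(/[<>]/g, (ch) => ENTITIES[ch]);
}

console.log(escapeAngleBrackets("<b>test</b><script>lol</script>"));
// → &lt;b&gt;test&lt;/b&gt;&lt;script&gt;lol&lt;/script&gt;
```

The function would be applied to the inner content of the outer code element only, so the surrounding code tags themselves stay intact.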
The easy way:
var elem = $('.html');
elem.text(elem.html());
This will not necessarily use literally &lt; for escaping; if you're fine with a different escape, it's much simpler than anything else you can do, though.
If you have multiple elements like that, you might need to wrap the second line in an elem.each(); otherwise the html() method will probably just concatenate the content from all elements or something similarly pointless.

regex to get contents between <b> tag

I have used following regex to get only contents between <b> and </b> tags.
var bonly = defaultVal.match("<b>(.*?)</b>");
but it did not work. I'm not getting the proper result. Sample strings I'm using the regex on:
<b>Item1</b>: This is item 1 description.
<b>Item1</b>: This is item 1 description.<b>Item2</b>: This is item 2 description.
<b>Item1</b>: <b>Item2</b>: This is item 2 description. <b>Item3</b>: This is item 3 description.<b>Item4</b>:
<b>Item1</b>: This is item 1 description.<b>Item2</b>: This is item 2 description. <b>Item3</b>: This is item 3 description.<b>Item4</b>:
Here item name is compulsory but it may have description or may not have description.
Why don't you skip regex and try...
var div = document.createElement('div');
div.innerHTML = str;
var b = div.getElementsByTagName('b');
for (var i = 0, length = b.length; i < length; i++) {
    console.log(b[i].textContent || b[i].innerText);
}
jsFiddle.
There are a zillion questions/answers here on SO about using regex to match HTML tags. You can probably learn a lot with some appropriate searching.
You may want to start by turning your regular expression into a regular expression:
var defaultVal = "<b>Item1</b>: This is item 1 description.";
var bonly = defaultVal.match(/<b>(.*?)<\/b>/);
if (bonly && (bonly.length > 1)) {
alert(bonly[1]); // alerts "Item1"
}
You should also note that regular expressions are not well suited for HTML matching, because HTML tags can carry arbitrary strings as attributes, and those can contain characters that break the regex match. Further, line breaks can be an issue in some regex engines, capitalization can trip you up, and so can an extra space here or there. Some of this can be accounted for with a more complicated regex, but it still may not be the best tool.
Depending upon the context of what you're trying to do, it may be easier to create actual HTML objects with this HTML (letting the browser do all the complex parsing) and then use DOM access methods to fetch the info you want.
It works here: http://jsfiddle.net/jfriend00/Man2J/.
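The sample strings contain several items each, and a non-global match only returns the first. A possible way to collect all of them is matchAll with the g flag. The helper name boldContents is hypothetical, and the sketch assumes plain <b> tags without attributes or nesting:

```javascript
// Hypothetical helper: return the text of every <b>...</b> pair in the string.
function boldContents(html) {
  // matchAll requires the g flag; each match's group 1 is the inner text.
  return Array.from(html.matchAll(/<b>(.*?)<\/b>/gi), (m) => m[1]);
}

const line = "<b>Item1</b>: <b>Item2</b>: This is item 2 description. " +
             "<b>Item3</b>: This is item 3 description.<b>Item4</b>:";
console.log(boldContents(line));
// → [ 'Item1', 'Item2', 'Item3', 'Item4' ]
```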
try this regexp
var bonly = defaultVal.match(/<([A-Za-z0-9]+)\b[^>]*>(.*?)<\/\1>/); // [A-z] would also match [, \, ], ^, _ and `
