Read characters and words in HTML text using JavaScript

Suppose I receive a string of HTML and want to count the words and characters in it.
For example, given:
var HTML = '<p>BODY Article By Archie(Used By Story Tool)</p>';
I want the number of words and characters in the rendered text, which looks like:
BODY Article By Archie(Used By Story Tool)
IMPORTANT
HTML tags must be excluded when counting words or characters.
avoid keywords like ** ** etc..
For the current example, the words and characters should be counted from:
BODY Article By Archie(Used By Story Tool)
Please help.
Thank you.

To give an example for adamantium's suggestion:
var e = document.createElement("span");
e.innerHTML = '<p>BODY Article By Archie(Used By Story Tool)</p>';
var text = e.textContent || e.innerText;
var characterCount = text.length;
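// filter(Boolean) drops the empty string split() leaves when the text starts or ends with a separator
var wordCount = text.split(/[\s.(),]+/).filter(Boolean).length;
Update: Added other word-stop characters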

Use a hidden HTML element that can render text, such as a span or p.
Assign the string to the innerHTML of the hidden element.
Count the characters using the length property of innerText/textContent.
To get the word count you can:
Split the innerText/textContent on whitespace.
Get the length of the returned array.

Algorithm:
Sweep through the entire HTML string.
Perform regex replacements:
replace <[^>]*> (anything that sits within <>) with nothing; a greedy <.*> would swallow everything from the first < to the last >
replace &nbsp; with a space, so that adjacent words do not merge
Tip: this can be done with the replace function in JavaScript; look it up on w3schools.com.
Now the clutter is out!
Then perform a simple word/character count.
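The regex-based steps can be sketched roughly as follows. This is only a minimal illustration, assuming the input is simple, trusted markup with no < or > inside attribute values; the helper name countText is hypothetical, and the word separators are borrowed from the split pattern used in the DOM-based answer.

```javascript
// Hypothetical helper (not from the original answers): strips tags and
// &nbsp; entities with regex replaces, then counts characters and words.
function countText(html) {
  const text = html
    .replace(/<[^>]*>/g, ' ')   // drop anything between < and >
    .replace(/&nbsp;/g, ' ')    // treat non-breaking spaces as plain spaces
    .replace(/\s+/g, ' ')       // collapse runs of whitespace
    .trim();
  // Split on whitespace and a few word-stop characters; drop empty pieces.
  const words = text.split(/[\s.(),]+/).filter(Boolean);
  return { text: text, characters: text.length, words: words.length };
}

const result = countText('<p>BODY Article By Archie(Used By Story Tool)</p>');
console.log(result.text, result.characters, result.words);
```

Tags are replaced with a space rather than nothing so that text from adjacent elements does not run together; whether that is the right choice depends on how the markup should be read.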

Related

Find and replace words between characters

I am working on a project. Below is a simplified replica of a string I'm working with; it is only an example, so it doesn't make much sense. My goal is to find the word between <ebm> and </ebm> and replace it accordingly.
var string = "“You know you're in love when <ebm>img-1</ebm> you can't fall asleep because reality <ebm>img-2</ebm>is finally better than your dreams.” <ebm>img-3</ebm>"
For example, if the word between <ebm> and </ebm> is
"img-1" then replace it with "Strong" (remove the <ebm> tags)
"img-2" then replace it with "Weak" (remove the <ebm> tags)
"img-3" then replace it with "Nice" (remove the <ebm> tags)
I cannot simply use string.replace() with fixed strings, because I have hundreds of these words that have to be replaced accordingly. I need to know which word is between the tags so that I can use it to look up the appropriate value in my array.
Do a regex replacement with a callback function:
var terms = {};
terms['img-1'] = 'Strong';
terms['img-2'] = 'Weak';
terms['img-3'] = 'Nice';
var text = "“You know you're in love when <ebm>img-1</ebm> you can't fall asleep because reality <ebm>img-2</ebm>is finally better than your dreams.” <ebm>img-3</ebm>";
text = text.replace(/<ebm>(.+?)<\/ebm>/g, function(match, contents, offset, input_string) {
    return terms[contents];
});
console.log(text);
The idea here is to match every <ebm>...</ebm> tag, passing each match to a callback function. We then take the text captured in between the tags, and do a lookup in an associative array, which, for example, maps img-1 to Strong.
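One thing the callback approach leaves open is what happens when the word between the tags has no entry in the lookup: a missing key yields undefined, which replace() inserts as the text "undefined". A possible variant, assuming unknown words should be left untouched, is sketched below; the function name replaceEbm and the use of a Map are illustrative choices, not part of the original answer.

```javascript
// Lookup table mapping the word inside <ebm>...</ebm> to its replacement.
const terms = new Map([
  ['img-1', 'Strong'],
  ['img-2', 'Weak'],
  ['img-3', 'Nice'],
]);

// Hypothetical helper: replaces each <ebm>word</ebm> with its mapped value.
// Unknown words keep their original tagged form (an assumption).
function replaceEbm(input, lookup) {
  return input.replace(/<ebm>(.+?)<\/ebm>/g, function (match, contents) {
    return lookup.has(contents) ? lookup.get(contents) : match;
  });
}

const replaced = replaceEbm('a <ebm>img-1</ebm> b <ebm>img-9</ebm> c <ebm>img-3</ebm>', terms);
console.log(replaced);
```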

Finding a html tag by its content using regex (JS)

What I want to do is find the tag that has the string "test string" even when that tag is nested inside other tags.
HTML example:
<section class="test-class1"><div><p class="test-class2">something else....test string</p></div></section>
Regex :
/.*<([a-zA-Z]*).*>.*?test string/g
Output:
p
I'm using https://regex101.com/#javascript, for the testing;
This regex works well when the html is small, but when the size of the HTML increases, it times out.
Is there a way to improve the performance of the regex ?
< *(\w+)[^<>]*>[^<]*(?:<[^>]*)*test string
matches p in the first capturing group ($1). It is not possible to speed the regex up by much; you would be better off using plain JS functions.
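As a rough idea of what a plain-JS alternative might look like: locate the search text with indexOf, then walk the tags that precede it while keeping a stack of the ones still open. This is only a sketch; findEnclosingTag is a hypothetical name, and it assumes well-formed markup in which every non-self-closing tag is explicitly closed (so unclosed void elements such as <br> would confuse it).

```javascript
// Hypothetical helper: returns the name of the innermost tag that is still
// open at the position where `needle` first occurs, or null if none.
function findEnclosingTag(html, needle) {
  const end = html.indexOf(needle);
  if (end === -1) return null;

  // Matches opening, closing and self-closing tags.
  const tagRe = /<(\/?)([a-zA-Z][\w-]*)[^>]*?(\/?)>/g;
  const stack = [];
  let m;
  while ((m = tagRe.exec(html)) !== null && m.index < end) {
    if (m[1]) {
      stack.pop();          // closing tag: its element is no longer open
    } else if (!m[3]) {
      stack.push(m[2]);     // opening tag that is not self-closing
    }
  }
  return stack.length ? stack[stack.length - 1] : null;
}

const data = '<section class="test-class1"><div><p class="test-class2">something else....test string</p></div></section>';
console.log(findEnclosingTag(data, 'test string'));
```

Because each tag is matched once while scanning forward, this avoids the heavy backtracking that the leading .* in the original pattern can cause on large inputs.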
Try this <(\w+)[^>]+>[^>]+test string
var data = '<section class="test-class1"><div><p class="test-class2">something else....test string</p></div></section>';
var regex = /<(\w+)[^>]+>[^>]+test string/
var output = regex.exec(data);
alert(output[1]);

regex replace characters within tags

I'm already using a html parser, but I need to create a regex that will select the < and > symbols within the first instance of <code> tags - in this case, the one with the class "html".
<code class="html">
<b>test</b><script>lol</script>
<code>test</code> <b> test </b>
<lol>
</lol>
<test>
</code>
So every < or > within the indented area starting from <b> to the start of the last </code> should be replaced, leaving the outer <code> tags alone.
I'm using JavaScript's .replace method and would like all < and > symbols within the code area to be turned into the entities &lt; and &gt;.
I imagine it's best to use a lookahead/lookbehind regex with $1 etc., but I can't figure out where to begin, so any help would be much appreciated.
How about something like this? In this example I create an element and populate it with the HTML, just to get things started:
var doc = document.createElement('div');
doc.innerHTML = ---your input html here
Here I pull the first code tag (getElementsByTagName returns a collection, so it must be indexed):
var string = doc.getElementsByTagName('code')[0].innerHTML;
Once you have the string, replace the brackets with their entities (the g flag makes replace cover every occurrence, not just the first):
string = string.replace(/</g, "&lt;");
string = string.replace(/>/g, "&gt;");
Then reinsert the replaced value into your source HTML.
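For the case where the input is only available as a string, a DOM-free variant could look like the sketch below. It assumes, as the question describes, that the region to escape runs from the end of the first <code ...> tag to the start of the last </code> in the string; the function name escapeInsideCode is hypothetical.

```javascript
// Hypothetical helper: escapes < and > between the first <code ...> opening
// tag and the last </code> closing tag, leaving those outer tags intact.
function escapeInsideCode(html) {
  const open = html.match(/<code\b[^>]*>/);
  const closeAt = html.lastIndexOf('</code>');
  if (!open) return html;

  const start = open.index + open[0].length;
  if (closeAt < start) return html;   // no closing tag after the opening one

  const inner = html
    .slice(start, closeAt)
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
  return html.slice(0, start) + inner + html.slice(closeAt);
}

const escaped = escapeInsideCode('<code class="html"><b>t</b><code>x</code></code>');
console.log(escaped);
```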
The easy way:
var elem = $('.html');
elem.text(elem.html());
This will not necessarily use a literal &lt; for escaping; if you're fine with a different escape, though, it's much simpler than anything else you can do.
If you have multiple elements like that, you might need to wrap the second line in an elem.each(); otherwise the html() method will probably just concatenate the content from all elements or something similarly pointless.

Remove and extract text in javascript

I'm wanting to do the following in JavaScript as efficiently as possible:
Remove <ul></ul> tags from a string and everything in between.
For what remains, every string that is encased within <li> and </li> I want dumped in an array, without any newline characters lurking at the end.
I'm thinking regexes are the answer but I've never used them before. Guess I could figure out a way but eventually it would probably not be the most efficient.
As others have said, you do have to be careful parsing HTML with regexes. It can work fine if the HTML is controlled (e.g. it comes from a known source in a known format), has no nested ul or li tags, and has no embedded strings that contain valid HTML tags or < or > chars. Here's one way to do what I think you were asking for:
function parseList(str) {
    var output = [], matches;
    var re = /<\s*li[^>]*>(.*?)<\/li>/gi;
    // remove newlines
    str = str.replace(/\n|\r/igm, "");
    // get text between ul tags
    matches = str.match(/<\s*ul[^>]*>(.*?)<\/ul\s*>/);
    if (matches) {
        str = matches[1];
        // get text between each li tag
        while (matches = re.exec(str)) {
            output.push(matches[1]);
        }
    }
    return output;
}
It is more foolproof to use an actual HTML parser that understands the finer points of the format (like nested tags, tag values in embedded strings, etc...), but if you have none of that, a simpler parser like this can be used.
You can see it work here: http://jsfiddle.net/jfriend00/c9ZLT/
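In environments that support String.prototype.matchAll, the same idea can be written more compactly. The sketch below keeps the same assumptions (a single, non-nested ul with simple li items); the name listItems is hypothetical, and it trims each item rather than stripping newlines up front, which is a small change in behaviour from the function above.

```javascript
// Hypothetical helper: returns the trimmed contents of each <li> inside the
// first <ul>...</ul> found in the string, or an empty array if there is none.
function listItems(str) {
  const ul = str.match(/<\s*ul\b[^>]*>([\s\S]*?)<\/ul\s*>/i);
  if (!ul) return [];
  const liRe = /<\s*li\b[^>]*>([\s\S]*?)<\/li\s*>/gi;
  // [\s\S] matches across line breaks, so newlines need no special handling.
  return Array.from(ul[1].matchAll(liRe), function (m) {
    return m[1].trim();
  });
}

const items = listItems('<ul>\n<li>one\n</li>\n<li class="x">two</li>\n</ul>');
console.log(items);
```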

regex to get contents between <b> tag

I have used following regex to get only contents between <b> and </b> tags.
var bonly = defaultVal.match("<b>(.*?)</b>");
but it did not work; I'm not getting the proper result. Sample strings I'm using the regex on:
<b>Item1</b>: This is item 1 description.
<b>Item1</b>: This is item 1 description.<b>Item2</b>: This is item 2 description.
<b>Item1</b>: <b>Item2</b>: This is item 2 description. <b>Item3</b>: This is item 3 description.<b>Item4</b>:
<b>Item1</b>: This is item 1 description.<b>Item2</b>: This is item 2 description. <b>Item3</b>: This is item 3 description.<b>Item4</b>:
Here item name is compulsory but it may have description or may not have description.
Why don't you skip regex and try...
var div = document.createElement('div');
div.innerHTML = str;
var b = div.getElementsByTagName('b');
for (var i = 0, length = b.length; i < length; i++) {
console.log(b[i].textContent || b[i].innerText);
}
There are a zillion questions/answers here on SO about using regex to match HTML tags. You can probably learn a lot with some appropriate searching.
You may want to start by turning your string pattern into an actual regular expression literal:
var defaultVal = "<b>Item1</b>: This is item 1 description.";
var bonly = defaultVal.match(/<b>(.*?)<\/b>/);
if (bonly && (bonly.length > 1)) {
alert(bonly[1]); // alerts "Item1"
}
You may also need to note that regular expressions are not well suited for HTML matching, because HTML tags can carry arbitrary strings as attributes, and those can contain characters that really mess up the regex match. Line breaks can be an issue in some regex engines, and capitalization or an extra space here or there can trip you up as well. Some of this can be accounted for with a more complicated regex, but it still may not be the best tool.
Depending upon the context of what you're trying to do, it may be easier to create actual HTML objects with this HTML (letting the browser do all the complex parsing) and then use DOM access methods to fetch the info you want.
It works here: http://jsfiddle.net/jfriend00/Man2J/.
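Try this regexp (the tag name is captured in group 1 and the contents in group 2; note that the range A-z would also match the punctuation between Z and a, so A-Za-z is used instead):
var bonly = defaultVal.match(/<([A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/);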
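Since the sample strings contain several <b> elements each, a single match() without the g flag only returns the first one. A minimal sketch of collecting every item name is shown below, assuming plain <b> tags with no attributes and no nesting; boldContents is a hypothetical name.

```javascript
// Hypothetical helper: collects the text between every <b> and </b> pair.
function boldContents(str) {
  return Array.from(str.matchAll(/<b>(.*?)<\/b>/g), function (m) {
    return m[1];
  });
}

const sample = '<b>Item1</b>: This is item 1 description.<b>Item2</b>: This is item 2 description.';
const names = boldContents(sample);
console.log(names);
```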
