End-of-string regex match too slow - javascript

Demo here. The regex:
([^>]+)$
I want to match text at the end of an HTML snippet that is not contained in a tag (i.e., a trailing text node). The regex above seems like the simplest match, but the execution time seems to scale linearly with the length of the match text (and has caused hangs in the wild when used in my browser extension). It's also equally slow for matching and non-matching text.
Why is this seemingly simple regex so bad?
(I also tried RegexBuddy but can't seem to get an explanation from it.)
Edit: Here's a snippet for testing the various regexes (click "Run" in the console area).
Edit 2: And a no-match test.

Consider an input like this
abc<def>xyz
With your original expression, ([^>]+)$, the engine starts from a, fails on >, backtracks, restarts from b, then from c etc. So yes, the time grows with size of the input. If, however, you force the engine to consume everything up to the rightmost > first, as in:
.+>([^>]+)$
the backtracking will be limited by the length of the last segment, no matter how much input is before it.
The second expression is not equivalent to the first one, but since you're using grouping, it doesn't matter much, just pick matches[1].
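To make the difference concrete, here is a minimal sketch comparing the two patterns. The helper names trailingTextSlow and trailingTextFast are made up for illustration; only the two regexes come from the discussion above. Note that the second pattern needs at least one > somewhere in the input, so it reports no match for a string that contains no tags at all.

```javascript
// Hypothetical helpers; only the two regex patterns come from the discussion above.
function trailingTextSlow(html) {
  // The engine retries from every start position until one run reaches the end.
  const m = /([^>]+)$/.exec(html);
  return m ? m[1] : null;
}

function trailingTextFast(html) {
  // .+> first consumes everything up to the rightmost ">",
  // so backtracking is bounded by the length of the last segment.
  const m = /.+>([^>]+)$/.exec(html);
  return m ? m[1] : null;
}

console.log(trailingTextSlow("abc<def>xyz")); // "xyz"
console.log(trailingTextFast("abc<def>xyz")); // "xyz"
console.log(trailingTextFast("abc<def>"));    // null (no trailing text)
```

If the input may contain no tags, you would need to fall back to the whole string yourself.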
Hint: even when you target javascript, switch to the pcre mode, which gives you access to the step info and debugger:
(look at the green bars!)

Instead of a regex, which is time-consuming here, you could use the actual DOM:
var html = "<div><span>blabla</span></div><div>bla</div>Here I am !";
var temp = document.createElement('div');
temp.innerHTML = html;
var lastNode = temp.lastChild;
// nodeType 3 is a text node
if (lastNode && lastNode.nodeType === 3) {
    alert(lastNode.nodeValue);
}

Related

Slice text in two without breaking tags in jQuery

I have the following code that I managed to put up by combining different resources. What this does is that it takes html of a content and breaks it into two halves (for a read more application). Following code is such that it doesn't break a word (waits until the end of word).
var minCharCount = 600;
var divcontent = $('#myDiv').html();
var firstHalf = divcontent.substr(0, minCharCount);
firstHalf = firstHalf.substr(0, Math.min(firstHalf.length, firstHalf.lastIndexOf(" ")));
var secondHalf = divcontent.substr(firstHalf.length, divcontent.length);
However, one last issue with this is that it can break html tags resulting in bad code. Is there a way to make sure that the code breaks them in two after any potential tag ends?
Edit: maybe it was a little difficult to understand. What I want is:
long text comes here with tags like <b>bold</b> or even <i>italic</i>.
^1 ^2 ^3
So my point is if we break at 1 its fine, but if we break at 2 and append the two parts somewhere, we get problems. So before breaking at 2 we need to check if it is in the middle of a tag. If it is then wait until the tag ends and then break: i.e. at 3.
WORKING DEMO
instead of
$('#myDiv').html();
run the function on the string returned from
$('#myDiv').text();
This way you don't get any html tags in the input string.
http://api.jquery.com/text/
UPDATE:
(in response to comment)
Since you want the HTML tags, you can loop through the children() of the target, measure the length of their .text(), and add them to the output until you reach the minCharCount amount. Then do the same for the last child you reached, until you get as close as possible to the target character count.
Note that children() excludes text nodes, so you have to use contents().
However, this approach is cumbersome. I think your best bet is to create a range object
see here: https://developer.mozilla.org/en-US/docs/Web/API/Document.createRange
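If you want to stay with plain strings, the check described in the question (is the break point in the middle of a tag?) can be sketched roughly as follows. This is only an illustration under simplifying assumptions: safeBreakIndex and splitHtml are hypothetical names, the markup is assumed not to contain > inside attribute values, and the sketch only avoids cutting inside <...>; it does not re-balance elements that are still open at the break point.

```javascript
// Hypothetical helper: returns an index >= wanted that does not fall inside "<...>".
function safeBreakIndex(html, wanted) {
  if (wanted <= 0) return 0;
  if (wanted >= html.length) return html.length;
  // Find the nearest "<" and ">" before the wanted break point.
  const lastOpen = html.lastIndexOf("<", wanted - 1);
  const lastClose = html.lastIndexOf(">", wanted - 1);
  if (lastOpen > lastClose) {
    // A tag was opened but not closed yet: move the break past its ">".
    const end = html.indexOf(">", wanted);
    return end === -1 ? html.length : end + 1;
  }
  return wanted;
}

// Hypothetical wrapper that returns both halves.
function splitHtml(html, wanted) {
  const i = safeBreakIndex(html, wanted);
  return [html.slice(0, i), html.slice(i)];
}

const sample = "text with <b>bold</b> end";
console.log(splitHtml(sample, 11)); // [ 'text with <b>', 'bold</b> end' ]
```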

JavaScript + RegEx Complications- Searching Strings Not Containing SubString

I am trying to use a RegEx to search through a long string, and I am having trouble coming up with an expression. I am trying to search through some HTML for a set of tags beginning with a tag containing a certain value and ending with a different tag containing another value. The code I am currently using to attempt this is as follows:
matcher = new RegExp(".*(<[^>]+" + startText + "((?!" + endText + ").)*" + endText + ")", 'g');
data.replace(matcher, "$1");
The strangeness around the middle ( ((?!endText).)* ) is borrowed from another thread, found here, that seems to describe my problem. The issue I am facing is that the expression matches the beginning tag, but it does not find the ending tag and instead includes the remainder of the data. Also, the lookaround in the middle slowed the expression down a lot. Any suggestions as to how I can get this working?
EDIT: I understand that parsing HTML in RegEx isn't the best option (makes me feel dirty), but I'm in a time-crunch and any other alternative I can think of will take too long. It's hard to say what exactly the markup I will be parsing will look like, as I am creating it on the fly. The best I can do is to say that I am looking at a large table of data that is collected for a range of items on a range of dates. Both of these ranges can vary, and I am trying to select a certain range of dates from a single row. The approximate value of startText and endText are ##ASSET_ID##_<YYYY_MM_DD>. The idea is to find the code that corresponds to this range of cells. (This edit could quite possibly have made this even more confusing, but I'm not sure how much more information I could really give without explaining the entire application).
EDIT: Well, this was a stupid question. Apparently, I just forgot to add .* after the last paren. Can't believe I spent so long on this! Thanks to those of you that tried to help!
First of all, why is there a .* Dot Asterisk in the beginning? If you have text like the following:
This is my Text
And you want "my Text" pulled out, you do my\sText. You don't have to do the .*.
That being said, since all you'll be matching now is what you need, you don't need the main Capture Group around "Everything". This: .*(xxx) is a huge no-no, and can almost always be replaced with this: xxx. In other words, your regex could be replaced with:
<[^>]+xxx((?!zzz).)*zzz
From there I examine what it's doing.
You are looking for an HTML opening delimiter <. You consume it.
You consume at least one character that is NOT a closing HTML delimiter, but can consume many. This is important, because if your tag is <table border=2>, then you have, at minimum, so far consumed <t, if not more.
You are now looking for a StartText. If that StartText is table, you'll never find it, because you have consumed the t. So replace that + with a *.
The regex still succeeds if what follows is NOT the closing text, but it starts from the VERY END of the document, because the asterisk is greedy. I suggest making it lazy by adding a ?.
When the backtracking fails, it will look for the closing text and gather it successfully.
The result of that logic:
<[^>]*xxx((?!zzz).)*?zzz
If you're going to use a dot anyway, which is okay for new Regex writers, but not suggested for seasoned, I'd go with this:
<[^>]*xxx.*?zzz
So for Javascript, your code would say:
matcher = new RegExp("<[^>]*" + startText + ".*?" + endText, 'gi');
I put the IgnoreCase "i" in there for good measure, but you may or may not want that.
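As a rough illustration of why the lazy quantifier matters, the sketch below runs the greedy and lazy variants against a made-up row of cells. The sample markup, the IDs, and the buildMatcher helper are all hypothetical, and startText and endText are assumed to contain no regex metacharacters.

```javascript
// Hypothetical sample markup; the IDs are invented for illustration.
const data =
  '<td id="A_2020_01_01">1</td>' +
  '<td id="A_2020_01_02">2</td>' +
  '<td id="A_2020_01_03">3</td>';

// Hypothetical helper: builds the pattern with a lazy (.*?) or greedy (.*) middle.
function buildMatcher(startText, endText, lazy) {
  const middle = lazy ? ".*?" : ".*";
  return new RegExp("<[^>]*" + startText + middle + endText, "i");
}

const lazyMatch = buildMatcher("A_2020_01_02", "</td>", true).exec(data)[0];
const greedyMatch = buildMatcher("A_2020_01_02", "</td>", false).exec(data)[0];

console.log(lazyMatch);   // <td id="A_2020_01_02">2</td>
console.log(greedyMatch); // <td id="A_2020_01_02">2</td><td id="A_2020_01_03">3</td>
```

The lazy version stops at the first occurrence of the end text, while the greedy one runs on to the last occurrence. Also keep in mind that . does not match newlines, so multi-line markup would need a different middle part.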

Javascript wildcard regex search gives inconsistent results

I am making an interface for searching a fairly limited dictionary of words (2700 or so entries). The words are stored in an XML file thus:
<root>
<w>aunt</w>
<w>active volcano</w>
<w>Xamoschi</w>
</root>
It is fairly basic - the user enters a string, and any matches are spit back out. The problem came when I wanted to include a wildcard character. If a user enters a string with asterisks, each asterisk is replaced by a regex to match zero or more characters, which can be anything.
So, when the user hits search, the script cycles through the XML tags and matches each nodeValue against the pattern srch:
var wildcardified = userinput.replace(/\*/g, ".*?");
var srch = new RegExp(wildcardified, "gi");
// the for loop cycles through the XML and tests each entry with this:
if (srch.test(tag[i].firstChild.nodeValue)) {
    // it's a match!
}
For the most part, it works as I'd hoped. But I'm getting some inconsistent results that I can't explain. For the values in the XML tags above, this is what happens with various inputs:
1. a* matches all three
2. a*n matches aunt and active volcano
3. a*t only matches aunt
4. a*ti only matches active volcano
Should #3 not also match the act in active volcano?
I see the same kind of results with other similar sets of words. I've tried to isolate the specific issue, but I can't for the life of me figure out what it is.
The Question: Can someone explain why #3 is not returning "active volcano", and what I can do to fix such behaviour?
Incidentally, I want it to be non-greedy, but just in case that was the issue, I tested both with and without the ?. Both returned the same inconsistent results above.
It's the g modifier in new RegExp(wildcardified, "gi"); that is causing you trouble. For an explanation and a workaround see Why does the "g" modifier give different results when test() is called twice?
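In short, a regex created with the g flag remembers where its last match ended (in its lastIndex property), and the next test() call starts searching from that position, even if it is given a different string. A failed test() resets lastIndex to 0. The following sketch reproduces the a*t case from the question with the three sample words; the variable names are only illustrative.

```javascript
const words = ["aunt", "active volcano", "Xamoschi"];
const wildcardified = "a*t".replace(/\*/g, ".*?"); // "a.*?t"

// With "g": after matching "aunt", lastIndex is 4, so the search in
// "active volcano" starts at index 4 and misses the "act" at the start.
const withG = new RegExp(wildcardified, "gi");
const resultsWithG = words.map(function (w) { return withG.test(w); });

// Without "g": every test() starts at index 0.
const withoutG = new RegExp(wildcardified, "i");
const resultsWithoutG = words.map(function (w) { return withoutG.test(w); });

console.log(resultsWithG);    // [ true, false, false ]
console.log(resultsWithoutG); // [ true, true, false ]
```

Dropping the g flag is the simplest fix here, since test() only needs to know whether there is a match at all. Resetting srch.lastIndex = 0 before each test() would also work.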

Line length for lines with strings

Being obsessed with neatness in Javascript lately, I was curious about whether there is some type of common practice about how to deal with lines that span over 80 cols due to string length. With innerHTML I can mark line breaks with a backslash and indentation spaces won't show up in the content of the element, but that doesn't seem to go for e.g. console.log().
Are there any conventions for this or should I just learn to live with lines longer than 80 cols? :)
There's no universal convention. With modern high-res monitors you can easily fit 160 columns and still have room for IDE toolbars without needing to scroll, so I wouldn't be concerned about sticking to 80 columns.
Some people go out of their way to never have any line of code go past n columns, where n might be 80, or 160, or some other arbitrary number based on what fits for their preferred font and screen resolution. Some people I work with don't care and have lines that go way off to the right regardless of whether it is due to a long string or a function with lots of parameters or whatever.
I try to avoid any horizontal scrolling but I don't obsess about it so if I have a string constant that is particularly long I will probably put it all on one line. If I have a string that is built up by concatenating constants and variables I will split it over several lines, because that statement will already have several + operators that are a natural place to add line breaks. If I have a function with lots of parameters, more than would fit without scrolling, I will put each parameter on a newline. For an if statement with a lot of conditions I'd probably break that over several lines.
Regarding what you mentioned about innerHTML versus console.log(): if you break a string constant across lines in your source code by including a backslash and newline then any indenting spaces you put on the second line will become part of the string:
var myString1 = "This has been broken \
into two lines.";
// equivalent to
var myString2 = "This has been broken into two lines.";
// NOT equivalent to
var myString3 = "This has been broken \
    into two lines.";
If you use that string for innerHTML the spaces will be treated the same as spaces in your HTML source, i.e., the browser will display it with multiple spaces compressed down to a single space. But for any other uses of the string in your code including console.log() the space characters will all be included.
If horizontal scrolling really bothers you and you have a long string the following method lets you have indenting without extra spaces in the string:
var myString4 = "Hi there, this has been broken"
    + " into several lines by concatenating"
    + " a number of shorter strings.";

Recommendations on Trimming Large Amounts of Text from a DOM Object

I'm doing some in browser editing, and I have some content that's on the order of around 20k characters long in a <textarea>.
So it looks something like:
<textarea>
Text 1
Text 2
Text 3
Text 4
[...]
Text 20,000
</textarea>
I'd like to use jquery to trim it down when someone hits a button to chop, but I'm having trouble doing it without overloading the browser. Assume I know that the character numbers are at 16,510 - 17,888, and what I'd like to do is trim it.
I was using:
jQuery('#textsection').html(jQuery('textarea').html().substr(range.start));
But browsers seem to enjoy crashing when I do this. Alternatives?
EDIT
Solution from the comments:
var removeTextNode = document.getElementById('textarea').firstChild;
removeTextNode.splitText(indexOfCharacterToRemoveEverythingBefore);
removeTextNode.parentNode.removeChild(removeTextNode);
Not sure about jQuery, but with plain vanilla Javascript, this can be done by using the splitText() method of the textNode object. Your <pre> has a textNode child which contains all the text inside of it. (You can get it from the childNodes collection.) Split it at the desired index, then use removeChild() to delete the part you don't need.
What browser are you testing on?
substr takes the starting index, and an optional length. If the length is omitted, then it extracts up to the end of the string. substring takes the starting and ending index of the string to extract, which I think might be a better option since you already have those available.
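A quick sketch of the difference, using a short made-up string instead of the real 20k-character content:

```javascript
const text = "0123456789";
const start = 2;
const end = 5;

// substr: start index plus a length
const viaSubstr = text.substr(start, end - start);
// substring: start index plus an end index (exclusive)
const viaSubstring = text.substring(start, end);
// substr without a length runs to the end of the string
const toEnd = text.substr(start);

console.log(viaSubstr);    // "234"
console.log(viaSubstring); // "234"
console.log(toEnd);        // "23456789"
```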
I've created a small example at fiddle using the book Alice's Adventures in Wonderland, by Lewis Carroll. The book is about 160,000 characters in length, and you can try with various starting/ending indexes and see if it crashes the browser. Seems to work fine on my Chrome, Firefox, and Safari. Unfortunately I don't have access to IE. Here's the function that's used:
function chop(start, end) {
var trimmedText = $('#preId').html().substring(start, end);
$('textarea').val(trimmedText);
}
