Performance issue using regex to replace/clear substring - javascript

I have a string containing things like this:
<a#{style}#{class}#{data} id="#{attr:id}">#{child:content} #{child:whatever}</a>
The goal is simply to remove every #{xxx} placeholder, except the sub-strings starting with #{child: .
I used str.match() to collect all "#{*}" sub-strings into an array, then looped over it, keeping the #{child: ones and removing the rest:
var matches = str.match(new RegExp("#\{(.*?)\}",'g'));
if (matches && matches.length){
    for(var i=0; i<matches.length; i++){
        if (matches[i].search("#{child:") == -1) str = str.replace(matches[i],'');
    }
}
It runs fine, but it becomes too slow as the string grows (~2 seconds for 1000+ nodes like the one above).
Is there an alternative, perhaps a rule (if one exists) that excludes #{child: directly in the regex, to improve performance?

If I understand your question correctly, you don't want to remove the #{child:...} sub-strings, but everything else of the format #{...} should go. In that case you could change the regular expression so that it refuses to match child: when you perform the replace:
var str = '<a#{style}#{class}#{data} id="#{attr:id}">#{child:content} #{child:whatever}</a>';
str = str.replace(/#\{((?!child:)[\s\S])+?\}/g, '');
This seems pretty fast.
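A slightly simpler variant of the same idea, offered as a sketch rather than a benchmarked result: check the negative lookahead once, directly after #{, and then consume everything up to the closing brace with a negated character class. It assumes a placeholder never contains a } before its closing brace; the helper name stripPlaceholders is made up for illustration.

```javascript
// Hypothetical helper: removes every #{...} placeholder except #{child:...}.
// Assumption: a placeholder never contains '}' before its closing brace.
function stripPlaceholders(str) {
  // (?!child:) is tested once, right after "#{"; [^}]* then runs to the brace.
  return str.replace(/#\{(?!child:)[^}]*\}/g, '');
}

var input = '<a#{style}#{class}#{data} id="#{attr:id}">#{child:content} #{child:whatever}</a>';
console.log(stripPlaceholders(input));
// <a id="">#{child:content} #{child:whatever}</a>
```

Either way, the string is processed in a single replace pass, whereas the loop in the question calls str.replace() once per placeholder and rescans the string from the start each time.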


Finding punctuation marks in text with string-methods

How can I find out when a punctuation mark (?!;.) or a "<" character occurs in a string? I don't want to use an array or compare every letter, but would like to solve it with string methods. Something like this:
var text = corpus.substr(0, corpus.indexOf("."));
If I explicitly specify a character such as a period, it works fine. The problem is that when parsing a long text in a loop, I no longer know how a sentence ends, whether with a question mark or an exclamation point. I tried the following, but it doesn't work:
var text = corpus.substr(0, corpus.indexOf(corpus.search(".")));
I want to loop through a long string and treat every punctuation mark found as an end-of-sentence character.
Do you know how I can solve my problem?
You can start with a RegExp and weigh it against going character by character and comparing character codes. Split is another way (as in the other answer).
RegExp solution
function getTextUpToPunc( text ) {
    const regExp = /^.+(\!|\?|\.)/mg;
    for (let match; (match = regExp.exec( text )) !== null;) {
        console.log(match);
    }
}
getTextUpToPunc(
    "what a chunky funky monkey! this is really someting else"
)
The key advantage here is that you control the iteration yourself through regExp.exec( text ), so you can stop as soon as you have what you need instead of processing the entire string.
The split solution would also work, but split always walks the entire string. Typically that is not an issue, but if your strings are thousands upon thousands of characters long and you do this operation a lot, then it makes sense to think about performance.
And if this function will be run many, many times, a small performance improvement would be to memoize the RegExp creation:
const regExp = /^.+(\!|\?|\.)/mg;
Into something like this
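function getTextUpToPunc( text ) {
    // Cache the RegExp on the function itself; `this` is undefined in strict mode.
    if( !getTextUpToPunc._regExp ) getTextUpToPunc._regExp = /^.+(\!|\?|\.)/mg;
    const regExp = getTextUpToPunc._regExp;
    for (let match; (match = regExp.exec( text )) !== null;) {
        console.log(match);
    }
}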
Use a regular expression:
var text = corpus.split(/[?!;.<]/);
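Since split discards the delimiter, the character that ended each sentence is lost. A minimal sketch that keeps it, assuming the terminators are exactly . ! ? ; and < and that any trailing text without a terminator can be ignored (the helper name splitSentences is hypothetical):

```javascript
// Hypothetical helper: returns each sentence together with the character that ended it.
// Assumption: sentence terminators are exactly . ! ? ; and <
function splitSentences(corpus) {
  var re = /([^.!?;<]+)([.!?;<])/g;
  var sentences = [];
  var m;
  // exec() with the g flag resumes after the previous match on each call.
  while ((m = re.exec(corpus)) !== null) {
    sentences.push({ text: m[1].trim(), end: m[2] });
  }
  return sentences;
}

var sample = splitSentences('Is it late? Yes! Go home.');
console.log(sample.length);                  // 3
console.log(sample[0].text, sample[0].end);  // Is it late ?
```

Each entry carries the sentence text in text and its terminator in end, so the caller can tell questions from exclamations.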

Get javascript node raw content

I have a javascript node in a variable, and if I log that variable to the console, I get this:
"&#8203;asekuhfas eo"
Just some random string in a javascript node. I want to get that literally to be a string. But the problem is, when I use textContent on it, I get this:
​asekuhfas eo
The special character is converted. I need to get the string to appear literally like this:
&#8203;asekuhfas eo
This way, I can deal with the special character (recognize when it exists in the string).
How can I get that node object to be a string LITERALLY as it appears?
As VisionN has pointed out, the entity notation cannot be recovered directly: once the HTML has been parsed, the character reference is already decoded into the character itself.
However, by using charCodeAt() you can probably still achieve your goal.
Say you have your textContent. By iterating through each character, retrieving its char code, prepending "&#" and appending ";", you can get your desired result. The obvious downside of this method is that every character ends up in this notation, even those that do not require it. By introducing some kind of threshold you can restrict this to only the exotic characters.
A very naive approach would be something like this:
var a = div.textContent;
var result = "";
var threshold = 1000; // char codes above this value are written as entities
for (var i = 0; i < a.length; i++) {
    if (a.charCodeAt(i) > threshold)
        result += "&#" + a.charCodeAt(i) + ";";
    else
        result += a[i];
}
textContent returns everything correctly, as &#8203; is the Unicode Character 'ZERO WIDTH SPACE' (U+200B), which is:
commonly abbreviated ZWSP
this character is intended for invisible word separation and for line break control; it has no width, but its presence between two characters does not prevent increased letter spacing in justification
It can be easily proven with:
var div = document.createElement('div');
div.innerHTML = '&#8203;xXx';
console.log( div.textContent ); // "​xXx"
console.log( div.textContent.length ); // 4
console.log( div.textContent[0].charCodeAt(0) ); // 8203
As Eugen Timm mentioned in his answer it is a bit tricky to convert UTF characters back to HTML entities, and his solution is completely valid for non standard characters with char code higher than 1000. As an alternative I may propose a shorter RegExp solution which will give the same result:
var result = div.textContent.replace(/./g, function(x) {
    var code = x.charCodeAt(0);
    return code > 1e3 ? '&#' + code + ';' : x;
});
console.log( result ); // "&#8203;xXx"
For a better solution you may have a look at this answer which can handle all HTML special characters.
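Both snippets above work on UTF-16 code units, so a character outside the Basic Multilingual Plane would come out as two separate entities. A hedged sketch that works on whole code points instead, assuming that everything above the ASCII range (rather than above 1000) should be encoded; the helper name toNumericEntities is made up for this example:

```javascript
// Hypothetical helper: writes every non-ASCII character as a decimal
// numeric character reference. The ASCII cut-off is an assumption.
function toNumericEntities(text) {
  // The "u" flag makes the character class match whole code points,
  // so surrogate pairs are passed to the callback as one character.
  return text.replace(/[^\u0000-\u007f]/gu, function (ch) {
    return '&#' + ch.codePointAt(0) + ';';
  });
}

console.log(toNumericEntities('\u200BxXx')); // &#8203;xXx
```

This does not touch HTML-significant ASCII characters such as < or &; those would need separate escaping.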

change regex to match some words instead of all words containing PRP

This regex matches all characters between whitespace if the word contains PRP.
How can I get it to match all words (runs of characters between whitespace) that contain PRP, but not those that contain "me" in either case?
So match all words containing PRP, but not containing ME or me.
Here is the regex to match words containing PRP: \S*PRP\S*
You can use negative lookahead for this:
(?:^|\s)((?!\S*?(?:ME|me))\S*?PRP\S*)
Working Demo
PS: Use group #1 for your matched word.
Code:
var re = /(?:^|\s)((?!\S*?(?:ME|me))\S*?PRP\S*)/;
var s = 'word abcPRP def';
var m = s.match(re);
if (m) console.log(m[1]); //=> abcPRP
Instead of using complicated regular expressions which would be confusing for almost anyone who's reading it, why don't you break up your code into two sections, separating the words into an array and filtering out the results with stuff you don't want?
function prpnotme(w) {
    var r = w.match(/\S+/g);
    if(r == null)
        return [];
    var i=0;
    while(i<r.length) {
        // String.prototype.includes is the standard method (contains() does not exist)
        if(!r[i].includes('PRP') || r[i].toLowerCase().includes('me'))
            r.splice(i,1);
        else
            i++;
    }
    return r;
}
console.log(prpnotme('whattttttt ok')); // []
console.log(prpnotme('MELOLPRP PRPRP PRPthemeok PRPmhm')); // ['PRPRP', 'PRPmhm']
For a very good reason why this is important, imagine if you ever wanted to add more logic. You're much more likely to make a mistake when modifying a complicated regex to make it even more complicated, whereas this way it's done with simple logic that makes perfect sense when reading each predicate, no matter how much you add on.
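The same two predicates can also be expressed with split and filter, which avoids mutating the array while iterating over it. This is only a sketch of that style; the name wordsWithPrpNotMe is hypothetical, and like the function above it rejects "me" in any letter case:

```javascript
// Hypothetical helper: keeps words that contain "PRP" and do not contain
// "me" in any letter case.
function wordsWithPrpNotMe(text) {
  return text.split(/\s+/).filter(function (word) {
    return word.includes('PRP') && !word.toLowerCase().includes('me');
  });
}

console.log(wordsWithPrpNotMe('MELOLPRP PRPRP PRPthemeok PRPmhm')); // ['PRPRP', 'PRPmhm']
```

Adding another rule then means adding one more condition to the filter callback.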

Matching hashes using regex, but not when they are part of an url

I am struggling with a regex in javascript that needs to match the text after # up to the first word boundary, but not match it if it is part of a URL. So
#test - should match test
sometext#test2 - should match test2
xx moretext#test3 - should match test3
http://test.com#tab1 - should not match tab1
I am replacing the text after the hash with a link (but not the hash character itself). There can be more than one hash in the text, and it should match them all (I guess I should use /g for that).
Matching the part after the hash is quite easy: /#\b(.+?)\b/g, but not matching it if the string itself starts with "http" is something I cannot solve. I should probably use a negative look-around, but I am having problems getting my head around that.
Any help is greatly appreciated!
Try this regex using a negative lookahead instead, since JavaScript engines predating ES2018 don't support lookbehinds:
/^(?!http:\/\/).*#\b(.+?)\b/
You may want to check for www too, depending on your conditions.
Edit: Then you can do this:
str = str.replace(re.exec(str)[1], 'replaced!');
http://jsfiddle.net/j7c79/2/
Edit 2: Sometimes a regex alone is not the way to go if it gets too complicated. Try a different approach:
var txt = "asdfgh http://asdf#test1 #test2 woot#test3";
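function replaceHashWords(str, rep) {
    var isUrl = /^http/.test(str), result = [];
    !isUrl && str.replace(/#\b(.+?)\b/g, function(a,b){ result.push(b); });
    // Nothing collected (e.g. the string is a URL): return it unchanged,
    // otherwise the empty pattern "()" would match at every position.
    if (!result.length) return str;
    return str.replace((new RegExp('('+ result.join('|') +')','g')), rep);
}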
alert(replaceHashWords(txt, 'replaced!'));
// asdfgh http://asdf#replaced! #replaced! woot#replaced!
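Note that the output above also rewrites the hash inside http://asdf#test1, because the URL check only looks at the start of the whole string. A possible per-token variant is sketched below, assuming URLs start with http:// or https:// and contain no whitespace; the name replaceHashWordsByToken is made up for illustration.

```javascript
// Hypothetical helper: replaces the word after each '#', skipping tokens that are URLs.
// Assumption: a URL is a whitespace-delimited token starting with http:// or https://
function replaceHashWordsByToken(str, rep) {
  // Splitting on a capturing group keeps the whitespace, so join('') restores it.
  return str.split(/(\s+)/).map(function (token) {
    if (/^https?:\/\//.test(token)) return token; // leave URLs untouched
    // A callback avoids special handling of '$' sequences inside rep.
    return token.replace(/#\b\w+/g, function () { return '#' + rep; });
  }).join('');
}

console.log(replaceHashWordsByToken('asdfgh http://asdf#test1 #test2 woot#test3', 'replaced!'));
// asdfgh http://asdf#test1 #replaced! woot#replaced!
```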
As regex is often (if not always) quite expensive to use, I'd suggest using basic string and array methods to determine whether a given set of characters represents a URL (though I'm assuming that all URLs will contain the string http):
$('ul li').each(
    function() {
        var t = $(this).text(),
            words = t.split(/\s+/),
            foundHashes = [],
            word = '';
        for (var i = 0, len = words.length; i < len; i++) {
            word = words[i];
            if (word.indexOf('http') == -1 && word.indexOf('#') !== -1) {
                var match = word.substring(word.indexOf('#') + 1);
                foundHashes.push(match);
            }
        }
        // the following just shows what, if anything, was found
        // and can definitely be safely omitted
        if (foundHashes.length) {
            var newSpan = $('<span />', {
                'class': 'matchedWords'
            }).text(foundHashes.join(', ')).appendTo($(this));
        }
    });
JS Fiddle demo (with some timing information printed to the console).
References:
jQuery:
appendTo().
each().
text().
'Vanilla' JavaScript
Array.join().
String.indexOf().
String.split().
String.substring().
This would require a lookbehind, something JavaScript lacked until ES2018 added lookbehind assertions.
However, if your subject string is some HTML and those URLs are in href attributes, you can create a document out of it and search for text nodes, only replacing their nodeValues instead of the whole HTML string.
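On the lookbehind point: in engines that implement ES2018 lookbehind assertions, the condition can be written directly. The following is a sketch under the assumption that a URL is any run of non-whitespace characters beginning with http:// or https://; it will throw a SyntaxError in engines without lookbehind support.

```javascript
// Negative lookbehind: the '#' must not be preceded by "http://" or "https://"
// followed only by non-whitespace characters (i.e. it is not inside a URL).
var hashRe = /(?<!https?:\/\/\S*)#\b(\w+)/g;

var text = '#test sometext#test2 xx moretext#test3 http://test.com#tab1';
var found = Array.from(text.matchAll(hashRe), function (m) { return m[1]; });
console.log(found); // ['test', 'test2', 'test3']
```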

Splitting Nucleotide Sequences in JS with Regexp

I'm trying to split up a nucleotide sequence into amino acid strings using a regular expression. I have to start a new string at each occurrence of the string "ATG", but I don't want to actually stop the first match at the "ATG". Valid input is any ordering of a string of As, Cs, Gs, and Ts.
For example, given the input string: ATGAACATAGGACATGAGGAGTCA
I should get two strings: ATGAACATAGGACATGAGGAGTCA (the whole thing) and ATGAGGAGTCA (the first match of "ATG" onward). A string that contains "ATG" n times should result in n results.
I thought the expression /(?:[ACGT]*)(ATG)[ACGT]*/g would work, but it doesn't. If this can't be done with a regexp it's easy enough to just write out the code for, but I always prefer an elegant solution if one is available.
If you really want to use regular expressions, try this:
var str = "ATGAACATAGGACATGAGGAGTCA",
    re = /ATG.*/g, match, matches=[];
while ((match = re.exec(str)) !== null) {
    matches.push(match[0]); // store the matched string, not the whole match array
    re.lastIndex = match.index + 3;
}
But be careful with exec and changing the index. You can easily make it an infinite loop.
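One way to avoid adjusting lastIndex by hand, sketched here as an alternative rather than as the method above: put the whole pattern inside a lookahead with a capture group, so each match is zero-length and overlapping suffixes can all be captured. It assumes an environment with String.prototype.matchAll and upper-case input; the name startCodonSuffixes is hypothetical.

```javascript
// Hypothetical helper: returns the suffix starting at every "ATG" occurrence.
// Assumption: the input is upper-case and String.prototype.matchAll is available.
function startCodonSuffixes(seq) {
  // The lookahead consumes nothing, so matches may overlap; group 1 holds the suffix.
  var re = /(?=(ATG[ACGT]*))/g;
  return Array.from(seq.matchAll(re), function (m) { return m[1]; });
}

console.log(startCodonSuffixes('ATGAACATAGGACATGAGGAGTCA'));
// ['ATGAACATAGGACATGAGGAGTCA', 'ATGAGGAGTCA']
```

matchAll advances past zero-length matches on its own, which removes the infinite-loop risk mentioned above.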
Otherwise you could use indexOf to find the indices and substr to get the substrings:
var str = "ATGAACATAGGACATGAGGAGTCA",
    offset=0, matches=[];
// Always search the original string so that offset stays valid.
while ((offset = str.indexOf("ATG", offset)) > -1) {
    matches.push(str.substr(offset));
    offset += 3;
}
I think what you want is
var subStrings = inputString.split('ATG');
KISS :)
Splitting a string before each occurrence of ATG is simple, just use
result = subject.split(/(?=ATG)/i);
(?=ATG) is a positive lookahead assertion, meaning "Assert that you can match ATG starting at the current position in the string".
This will split GGGATGTTTATGGGGATGCCC into GGG, ATGTTT, ATGGGG and ATGCCC.
So now you have an array of (in this case four) strings. I would now go and take those, discard the first one if it does not start with ATG (as here, where it is the text before the first ATG), and then join the strings no. 2 + ... + n, then 3 + ... + n etc. until you have exhausted the list.
Of course, this regex doesn't do any validation as to whether the string only contains ACGT characters as it only matches positions between characters, so that should be done before, i. e. that the input string matches /^[ACGT]*$/i.
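The split, discard and join steps described above could be sketched as follows. The function name suffixesFromSplit is made up for illustration, and throwing on invalid input is just one possible way to handle the validation.

```javascript
// Hypothetical helper implementing the split-then-join idea.
function suffixesFromSplit(subject) {
  // Validate first, since the split regex only matches positions.
  if (!/^[ACGT]*$/i.test(subject)) throw new Error('invalid sequence');
  var parts = subject.split(/(?=ATG)/i);
  // Drop a leading piece that does not itself start with ATG.
  if (parts.length && !/^ATG/i.test(parts[0])) parts.shift();
  // Suffix i is pieces i..n joined back together.
  return parts.map(function (_, i) { return parts.slice(i).join(''); });
}

console.log(suffixesFromSplit('GGGATGTTTATGGGGATGCCC'));
// ['ATGTTTATGGGGATGCCC', 'ATGGGGATGCCC', 'ATGCCC']
```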
Since you want to capture from every "ATG" to the end split isn't right for you. You can, however, use replace, and abuse the callback function:
var matches = [];
seq.replace(/atg/gi, function(m, pos){ matches.push(seq.substr(pos)); });
This isn't with regex, and I don't know if this is what you consider "elegant," but...
var sequence = 'ATGAACATAGGACATGAGGAGTCA';
var matches = [];
var pos;
// Stops cleanly when no "ATG" is left, and also finds an "ATG" at index 0.
while ((pos = sequence.indexOf('ATG')) !== -1) {
    matches.push('ATG' + (sequence = sequence.slice(pos + 3)));
}
I'm not completely sure if this is what you're looking for. For example, with an input string of ATGabcdefghijATGklmnoATGpqrs, this returns ATGabcdefghijATGklmnoATGpqrs, ATGklmnoATGpqrs, and ATGpqrs.
