I need help with a regular expression.
Using JavaScript, I am going through each line of a text file, and I want to replace any match of [0-9]{6,9} with a '*', but I don't want to replace numbers with the prefix 100. So, a number like 1110022 should be replaced (matched), but 1004567 should not (no match).
I need a single expression that will do the trick (just the matching part). I can’t use ^ or $ because the number can appear in the middle of the line.
I have tried (?!100)[0-9]{6,9}, but it doesn't work.
More examples:
Don't match: 10012345
Match: 1045677
Don't match:
1004567
Don't match: num="10034567" test
Match just the middle number in the line: num="10048876" 1200476, 1008888
Thanks
You need to use a leading word boundary to check if a number starts with some specific digit sequence:
\b(?!100)\d{6,9}
See the regex demo
Here, the 100 is checked right after a word boundary, not inside a number.
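To see why the boundary matters, here is a minimal sketch that runs the pattern from the question and the \b version against one of the question's sample lines. It runs in Node or a browser console; the variable names are only illustrative.

```javascript
// Sample line taken from the question's examples.
var line = 'num="10048876" 1200476, 1008888';

// Without \b the lookahead is simply retried one digit later, inside the
// number, where the next three digits are no longer "100".
var withoutBoundary = /(?!100)[0-9]{6,9}/g;

// With \b a match may only start where the number itself starts.
var withBoundary = /\b(?!100)\d{6,9}/g;

console.log(line.replace(withoutBoundary, '*')); // num="1*" *, 1*
console.log(line.replace(withBoundary, '*'));    // num="10048876" *, 1008888
```

Note that \b treats letters, digits and underscores alike as word characters, so a number glued to a letter (for example x1200476) is not matched at all by the second pattern.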
If you need to replace the matches with just a single asterisk, just use the "*" as a replacement string (see snippet right below).
var re = /\b(?!100)\d{6,9}/g;
var str = 'Don\'t match: 10012345\n\nMatch: 1045677\n\nDon\'t match:\n\n1004567\n\nDon\'t match: num="10034567" test\n\nMatch just the middle number in the line: num="10048876" 1200476, 1008888';
document.getElementById("r").innerHTML = "<pre>" + str.replace(re, '*') + "</pre>";
<div id="r"/>
Or, if you need to replace each digit with *, you need to use a callback function inside a replace:
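// Fallback only for engines without a native String.prototype.repeat (pre-ES2015).
if (!String.prototype.repeat) {
String.prototype.repeat = function (n, d) {
return --n ? this + (d || '') + this.repeat(n, d) : '' + this
};
}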
var re = /\b(?!100)\d{6,9}/g;
var str = '123456789012 \nDon\'t match: 10012345\n\nMatch: 1045677\n\nDon\'t match:\n\n1004567\n\nDon\'t match: num="10034567" test\n\nMatch just the middle number in the line: num="10048876" 1200476, 1008888';
document.getElementById("r").innerHTML = "<pre>" + str.replace(re, function(m) { return "*".repeat(m.length); }) + "</pre>";
<div id="r"/>
The repeat function is borrowed from BitOfUniverse's answer.
Sorry for one more to the tons of regexp questions but I can't find anything similar to my needs. I want to output the string which can contain number or letter 'A' as the first symbol and numbers only on other positions. Input is any string, for example:
---INPUT--- -OUTPUT-
A123asdf456 -> A123456
0qw#$56-398 -> 056398
B12376B6f90 -> 12376690
12A12345BCt -> 1212345
What I tried is replace(/[^A\d]/g, '') (I use JS), which almost does the job except the case when there's A in the middle of the string. I tried to use ^ anchor but then the pattern doesn't match other numbers in the string. Not sure what is easier - extract matching characters or remove unmatching.
I think you can do it like this using a negative lookahead and then replace with an empty string.
In a non-capturing group (?:...), use a negative lookahead (?!...) to assert that what follows is neither an A at the beginning of the string (^A) nor a digit (\d). If the assertion holds, match any character (.).
(?:(?!^A|\d).)+
var pattern = /(?:(?!^A|\d).)+/g;
var strings = [
"A123asdf456",
"0qw#$56-398",
"B12376B6f90",
"12A12345BCt"
];
for (var i = 0; i < strings.length; i++) {
console.log(strings[i] + " ==> " + strings[i].replace(pattern, ""));
}
You can match and capture desired and undesired characters within two different sides of an alternation, then replace those undesired with nothing:
^(A)|\D
JS code:
var inputStrings = [
"A-123asdf456",
"A123asdf456",
"0qw#$56-398",
"B12376B6f90",
"12A12345BCt"
];
console.log(
inputStrings.map(v => v.replace(/^(A)|\D/g, "$1"))
);
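The "$1" replacement works because group 1 only holds something when the ^(A) branch matched; for every \D match it is empty, so the character disappears. A minimal sketch that spells this out with a callback (keepLeadingA is a hypothetical helper name, not part of the answer):

```javascript
// Keep a leading "A" and every digit; drop everything else.
function keepLeadingA(input) {
  return input.replace(/^(A)|\D/g, function (match, leadingA) {
    // leadingA is "A" when the ^(A) branch matched,
    // and undefined when the \D branch matched.
    return leadingA || '';
  });
}

console.log(keepLeadingA('A123asdf456')); // A123456
console.log(keepLeadingA('12A12345BCt')); // 1212345
```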
You can match only the characters you want to keep with the regex /(^A|\d)/g and join the matches:
var arr = ['A123asdf456','0qw#$56-398','B12376B6f90','12A12345BCt', 'A-123asdf456'],
result = arr.map(s => s.match(/(^A|\d)/g).join(''));
console.log(result);
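One thing to watch with the match-and-join approach: with the g flag, match() returns null when nothing matches, so calling .join() on the result throws for an input such as 'xyz'. A small guarded sketch (extractAllowed is a hypothetical name):

```javascript
// Collect a leading "A" and all digits; return "" when nothing matches.
function extractAllowed(s) {
  var found = s.match(/^A|\d/g); // null when there is no match at all
  return found ? found.join('') : '';
}

console.log(extractAllowed('A-123asdf456')); // A123456
console.log(extractAllowed('xyz'));          // (empty string)
```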
I have a text that has sentences that may not have space after a dot like:
See also vadding.Constructions on this term abound.
How can I add a space after a dot that is not before the domain name? The text may have URLs like:
See also vadding.Constructions on this term abound. http://example.com/foo/bar
Match and capture an URL and just match all other dots to replace with a dot+space:
var re = /((?:https?|ftps?):\/\/\S+)|\.(?!\s)/g;
var str = 'See also vadding.Constructions on this term abound.\nSee also vadding.Constructions on this term abound. http://example.com/foo/bar';
var result = str.replace(re, function(m, g1) {
return g1 ? g1 : ". ";
});
document.body.innerHTML = "<pre>" + result + "</pre>";
The URL regex - (?:https?|ftps?):\/\/\S+ - matches http or https or ftp, ftps, then :// and 1+ non-whitespaces (\S+). It is one of the basic ones, you can use a more complex one that you can easily find on SO. E.g. see What is a good regular expression to match a URL?.
The approach in more detail:
The ((?:https?|ftps?):\/\/\S+)|\.(?!\s) regex has 2 alternatives: the URL matching part (described above), or (|) the dot matching part (\.(?!\s)).
NOTE that (?!\s) is a negative lookahead that allows matching a dot that is NOT followed with a whitespace.
When we run string.replace() we can specify an anonymous callback function as the second argument and pass the match and group arguments to it. So, here, we have 1 match value (m) and 1 capture group value g1 (the URL). If the URL was matched, g1 is not null. return g1 ? g1 : ". "; means we do not modify the group 1 if it was matched, and if it was not, we matched a standalone dot, thus, we replace with with . .
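The same idea as a standalone function that does not need a DOM, so it can be tried in Node. This is only a sketch: addSpaceAfterDots is a hypothetical name, and the |$ inside the lookahead is an added assumption (leave a dot at the very end of the text alone) that is not part of the regex above.

```javascript
// URLs are captured in group 1 and returned unchanged;
// any other dot not followed by whitespace (or end of text) gets a space.
function addSpaceAfterDots(text) {
  var re = /((?:https?|ftps?):\/\/\S+)|\.(?!\s|$)/g;
  return text.replace(re, function (match, url) {
    return url ? url : '. ';
  });
}

console.log(addSpaceAfterDots(
  'See also vadding.Constructions on this term abound. http://example.com/foo/bar'
));
// See also vadding. Constructions on this term abound. http://example.com/foo/bar
```

Dots inside things that are not URLs, such as 3.14 or a bare domain like example.com, are still split by this pattern, so it may need extending for real text.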
You can try using the RegExp /(\.)(?![a-z]{2}\/|[a-z]{3}\/|\s+|$)/g to match a . character that is not followed by two or three lowercase letters and a slash, by whitespace, or by the end of the string:
"See also vadding.Constructions on this term abound. http://example.com/foo/bar"
.replace(/(\.)(?![a-z]{2}\/|[a-z]{3}\/|\s+|$)/g, "$1 ")
Using idea from #MarcelKohls
var text = "See also vadding.Constructions on this term abound. http://example.com/foo/bar";
var url_re = /(\bhttps?:\/\/(?:(?:(?!&[^;]+;)|(?=&))[^\s"'<>\]\[)])+\b)/gi;
text = text.split(url_re).map(function(text) {
if (text.match(url_re)) {
return text;
} else {
return text.replace(/\.([^ ])/g, '. $1');
}
}).join('');
document.body.innerHTML = '<pre>' + text + '</pre>';
Use this pattern:
/\.(?! )((?:ftp|http)[^ ]+)?/g
Online Demo
I wrote regex for finding urls in text:
/(http[^\s]+)/g
But now I need the same thing, except that the matched URL must not contain a certain substring. For instance, I want all the URLs that do not contain the word google.
How can I do that?
Here is a way to achieve that:
http:\/\/(?!\S*google)\S+
See demo
JS:
var re = /http:\/\/(?!\S*google)\S+/g;
var str = 'http://ya.ru http://yahoo.com http://google.com';
var m;
while ((m = re.exec(str)) !== null) {
document.getElementById("r").innerHTML += m[0] + "<br/>";
}
<div id="r"/>
Regex breakdown:
http:\/\/ - a literal sequence of http://
(?!\S*google) - a negative look-ahead that performs a forward check from the current position (i.e. right after http://); if it finds zero or more non-whitespace characters followed by google, the match is cancelled.
\S+ - 1 or more non-whitespace symbols (this is necessary since the lookahead above does not really consume the characters it matches).
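The breakdown above can be tried without a page, for example in Node. In this sketch the sample text is made up, and https? is an assumption added so that https links are covered too:

```javascript
// Match URLs whose non-whitespace run does not contain "google".
var re = /https?:\/\/(?!\S*google)\S+/g;
var text = 'http://ya.ru https://mail.google.com/x http://yahoo.com';

// match() returns null when nothing matches, hence the fallback.
var found = text.match(re) || [];
console.log(found); // [ 'http://ya.ru', 'http://yahoo.com' ]
```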
Note that if you have any punctuation after the URL, you may add \b right at the end of the pattern:
var re1 = /http:\/\/(?!\S*google)\S+/g;
var re2 = /http:\/\/(?!\S*google)\S+\b/g;
document.write(
JSON.stringify(
'http://ya.ru, http://yahoo.com, http://google.com'.match(re1)
) + "<br/>"
);
document.write(
JSON.stringify(
'http://ya.ru, http://yahoo.com, http://google.com'.match(re2)
)
);
We would like to split a string on instances of the pipe character |, but not if that character is preceded by an escape character, e.g. \|.
For example, we would like to see the following string split into the following components:
1|2|3\|4|5
1
2
3\|4
5
I'm expecting to be able to use the following javascript function, split, which takes a regular expression. What regex would I pass to split? We are cross platform and would like to support current and previous versions (1 version back) of IE, FF, and Chrome if possible.
Instead of a split, do a global match (the same way a lexical analyzer would):
match anything other than \\ or |
or match any escaped char
Something like this:
var str = "1|2|3\\|4|5";
var matches = str.match(/([^\\|]|\\.)+/g);
A quick explanation: ([^\\|]|\\.) matches either any character except '\' and '|' (pattern: [^\\|]) or (pattern: |) any escaped character (pattern: \\.). The + after it tells it to match the preceding group one or more times, so ([^\\|]|\\.) is matched once or more. The g at the end of the regex literal tells the JavaScript regex engine to match the pattern globally instead of matching it just once.
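Wrapped into a function, the match-based approach might look like the sketch below (splitUnescaped is a hypothetical name; the group is made non-capturing, which does not change what is matched). One limitation to be aware of: because each token needs at least one character, empty fields are dropped, so '1||2' yields two items rather than three.

```javascript
// Tokenize on "|" unless it is escaped with a backslash.
function splitUnescaped(str) {
  // Either a character that is not "\" or "|", or a backslash plus any character.
  return str.match(/(?:[^\\|]|\\.)+/g) || [];
}

console.log(splitUnescaped('1|2|3\\|4|5')); // [ '1', '2', '3\\|4', '5' ]
console.log(splitUnescaped('1||2'));        // [ '1', '2' ]
```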
What you're looking for is a "negative look-behind matching regular expression".
This isn't pretty, but it should split the list for you:
var output = input.replace(/(\\)?\|/g, function($0, $1) { return $1 ? $0 : '\n'; });
This takes your input string and replaces every '|' character that is NOT immediately preceded by a '\' character with a '\n' character; escaped pipes ('\|') are left as they are.
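In engines that implement ES2018 look-behind assertions, the negative look-behind can be written directly and passed to split. This is a sketch under that assumption; older browsers, including the Internet Explorer versions mentioned in the question, reject this syntax.

```javascript
// Split on "|" only when it is not immediately preceded by a backslash.
// Requires look-behind support (ES2018).
var parts = '1|2|3\\|4|5'.split(/(?<!\\)\|/);
console.log(parts); // [ '1', '2', '3\\|4', '5' ]
```

Unlike the match-based approach, split keeps empty fields, so '1||2' gives three items. Neither variant treats an escaped backslash followed by a pipe specially.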
A regex solution was posted as I was looking into this. So I just went ahead and wrote one without it. I did some simple benchmarks and it is -slightly- faster (I expected it to be slower...).
Without using Regex, if I understood what you desire, this should do the job:
function doSplit(input) {
var output = [];
var currPos = -1, // start at -1 so that a '|' at index 0 is also found
prevPos = -1;
while ((currPos = input.indexOf('|', currPos + 1)) != -1) {
if (input[currPos-1] == "\\") continue;
var recollect = input.substr(prevPos + 1, currPos - prevPos - 1);
prevPos = currPos;
output.push(recollect);
}
var recollect = input.substr(prevPos + 1);
output.push(recollect);
return output;
}
doSplit('1|2|3\\|4|5'); //returns [ '1', '2', '3\\|4', '5' ]
text = '#container a.filter(.top).filter(.bottom).filter(.middle)';
regex = /(.*?)\.filter\((.*?)\)/;
matches = text.match(regex);
log(matches);
// matches[1] is '#container a'
// matches[2] is '.top'
I expect to capture
matches[1] is '#container a'
matches[2] is '.top'
matches[3] is '.bottom'
matches[4] is '.middle'
One solution would be to split the string into #container a and rest. Then take rest and execute recursive exec to get item inside ().
Update: I am posting a solution that does work. However I am looking for a better solution. Don't really like the idea of splitting the string and then processing
Here is a solution that works.
matches = [];
var text = '#container a.filter(.top).filter(.bottom).filter(.middle)';
var regex = /(.*?)\.filter\((.*?)\)/;
var match = regex.exec(text);
firstPart = text.substring(match.index, match.index + match[1].length);
rest = text.substring(match.index + match[1].length); // everything after the first part
matches.push(firstPart);
regex = /\.filter\((.*?)\)/g;
while ((match = regex.exec(rest)) != null) {
matches.push(match[1]);
}
log(matches);
Looking for a better solution.
This will match the single example you posted:
<html>
<body>
<script type="text/javascript">
text = '#container a.filter(.top).filter(.bottom).filter(.middle)';
matches = text.match(/^[^.]*|\.[^.)]*(?=\))/g);
document.write(matches);
</script>
</body>
</html>
which produces:
#container a,.top,.bottom,.middle
EDIT
Here's a short explanation:
^         # match the beginning of the input
[^.]*     # match any character other than '.' and repeat it zero or more times
          #
|         # OR
          #
\.        # match the character '.'
[^.)]*    # match any character other than '.' and ')' and repeat it zero or more times
(?=       # start positive look ahead
  \)      #   match the character ')'
)         # end positive look ahead
EDIT part II
The regex looks for two types of character sequences:
zero or more characters from the start of the string up to the first ., the regex: ^[^.]*
or it matches a character sequence starting with a . followed by zero or more characters other than . and ), \.[^.)]*, but must have a ) ahead of it: (?=\)). This last requirement causes .filter not to match.
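A DOM-free sketch of the same regex, runnable in Node, together with one input that shows where it stops working: if the part before the filters itself contains a dot, ^[^.]* cuts it short. The second input is made up for illustration.

```javascript
var re = /^[^.]*|\.[^.)]*(?=\))/g;

var ok = '#container a.filter(.top).filter(.bottom).filter(.middle)'.match(re);
console.log(ok); // [ '#container a', '.top', '.bottom', '.middle' ]

// Hypothetical input with a class selector in the head:
// the head is truncated at its first dot.
var truncated = 'div.item.filter(.top)'.match(re);
console.log(truncated); // [ 'div', '.top' ]
```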
You have to iterate, I think.
var head, filters = [];
text.replace(/^([^.]*)(\..*)$/, function(_, h, rem) {
head = h;
rem.replace(/\.filter\(([^)]*)\)/g, function(_, f) {
filters.push(f);
});
});
console.log("head: " + head + " filters: " + filters);
The ability to use functions as the second argument to String.replace is one of my favorite things about Javascript :-)
You need to do several matches repeatedly, starting where the last match ends (see while example at https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp/exec):
If your regular expression uses the "g" flag, you can use the exec method multiple times to find successive matches in the same string. When you do so, the search starts at the substring of str specified by the regular expression's lastIndex property. For example, assume you have this script:
var myRe = /ab*/g;
var str = "abbcdefabh";
var myArray;
while ((myArray = myRe.exec(str)) != null)
{
var msg = "Found " + myArray[0] + ". ";
msg += "Next match starts at " + myRe.lastIndex;
print(msg);
}
This script displays the following text:
Found abb. Next match starts at 3
Found ab. Next match starts at 9
However, this case would be better solved using a custom-built parser. Regular expressions are not an effective solution to this problem, if you ask me.
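Applied to the selector string from the question, the repeated-exec loop could look like the sketch below. parseSelector is a hypothetical name, and the sketch assumes the head ends at the first '.filter(' and that filter arguments contain no ')'.

```javascript
// Return [head, filter1, filter2, ...] for a string like
// '#container a.filter(.top).filter(.bottom)'.
function parseSelector(text) {
  var filterRe = /\.filter\(([^)]*)\)/g;
  var firstFilter = text.indexOf('.filter(');
  var head = firstFilter === -1 ? text : text.slice(0, firstFilter);
  var result = [head];
  var m;
  // With the g flag, each exec call resumes at filterRe.lastIndex.
  while ((m = filterRe.exec(text)) !== null) {
    result.push(m[1]);
  }
  return result;
}

console.log(parseSelector('#container a.filter(.top).filter(.bottom).filter(.middle)'));
// [ '#container a', '.top', '.bottom', '.middle' ]
```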
var text = '#container a.filter(.top).filter(.bottom).filter(.middle)';
var result = text.split('.filter');
console.log(result[0]);
console.log(result[1]);
console.log(result[2]);
console.log(result[3]);
text.split() with regex does the trick.
var text = '#container a.filter(.top).filter(.bottom).filter(.middle)';
var parts = text.split(/(\.[^.()]+)/);
var matches = [parts[0]];
for (var i = 3; i < parts.length; i += 4) {
matches.push(parts[i]);
}
console.log(matches);