Regular Expression confused by use of double and single quotes - javascript

I have this JavaScript (running in Chrome 48.0.2564.103 m):
var s1 = 'label1="abc" label2=\'def\' ';
var s2 = 'label1="abc" label2=\'def\' label3="ghi"';
var re = /\b(\w+)\b=(['"]).*?def.*?\2/;
re.exec(s1); // --> ["label2='def'", "label2", "'"]
re.exec(s2); // --> ["label1="abc" label2='def' label3="", "label1", """]
The first exec() matches label2, as I intended. However, the second gets confused by the double quote after 'label3=' and matches label1 instead.
I had expected the use of .*? to tell the regular expression to make the match as tightly as possible, but clearly it doesn't always. Is there a way to tighten up my regular expression?

Just exclude what was seen as a quote
/\b(\w+)\b=(['"])(?:.(?!\2))*def(?:.(?!\2))*.?\2/
So the change was replacing your .*? with (?:.(?!\2))*.
Break down:
(?!) is negative look ahead, non-capturing
(?:) is non-capturing group.
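(?:.(?!\2))* stops one character short of the closing quote, because that last character is followed by \2. When the value does not end in def, the extra .? is needed to consume that final character.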
This allows you to combine other rules when you want to allow a='\'' or a="\"" or further a="\\\"":
/\b(\w+)\b=(['"])(?:\\\\|\\\2|.(?!\2))*def(?:\\\\|\\\2|.(?!\2))*.?\2/
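As a quick check, the first pattern above can be run against both inputs from the question. This is only a sketch: the names tight, m1 and m2 are made up for illustration, and it exercises just the two sample strings.

```javascript
// Inputs copied from the question.
const s1 = 'label1="abc" label2=\'def\' ';
const s2 = 'label1="abc" label2=\'def\' label3="ghi"';

// (?:.(?!\2))* only consumes a character when the one after it is not the
// opening quote, so the scan cannot run past the closing quote of label1.
const tight = /\b(\w+)\b=(['"])(?:.(?!\2))*def(?:.(?!\2))*.?\2/;

const m1 = tight.exec(s1);
const m2 = tight.exec(s2);

console.log(m1[1]); // label2
console.log(m2[0]); // label2='def'
```

Both strings now report label2, because the attempt that starts at label1 fails before it can reach def.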

The reason s2 gives a different result is that you add a " on the right side of the "def" after label2, which allows the pattern to correctly match everything between the first and last double quote in the string.
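I can only guess that the reason the lazy quantifier (?) doesn't have any effect is that at that point the regex engine has already decided to match " rather than '. Regex does its thing left-to-right after all: the engine returns the first match it can complete from the leftmost starting position, and laziness only limits how far a match extends, not where it starts.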
The "simplest" way of solving this is to match only non-quotes, rather than using ., between the quotes:
var re = /\b(\w+)\b=(['"])[^'"]*def[^'"]*\2/;
re.exec(s1); // --> ["label2='def'", "label2", "'"]
re.exec(s2); // --> ["label2='def'", "label2", "'"]
The problem with this is that now you can't put any kind of quotes in the value, even if they are perfectly legal:
// This won't match because of the " after def
var s2 = 'label1="abc" label2=\'def"\' label3="ghi"'
// This won't match because there's an escaped single quote in the value
var s2 = 'label1="abc" label2=\'def\\\'\' label3="ghi"'
But basically, regex isn't made for parsing HTML, so if these limitations are a problem you should look into proper parsing.
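One way to get a little closer to "proper parsing" without a full parser is to scan every label=value pair in order and only then filter on the value. The sketch below assumes values are delimited by matching quotes with no backslash escapes; findLabels is a hypothetical helper name, not something from the question.

```javascript
// Hypothetical helper: walk all label=value pairs, then keep the labels
// whose value contains the search text.
function findLabels(input, needle) {
  // Each alternative stops only at its own kind of quote, so a " inside
  // a '...' value (or the reverse) is treated as ordinary text.
  const pair = /\b(\w+)=(?:"([^"]*)"|'([^']*)')/g;
  const hits = [];
  for (const m of input.matchAll(pair)) {
    const value = m[2] !== undefined ? m[2] : m[3];
    if (value.includes(needle)) hits.push(m[1]);
  }
  return hits;
}

console.log(findLabels('label1="abc" label2=\'def\' label3="ghi"', 'def'));
console.log(findLabels('label1="abc" label2=\'def"\' label3="ghi"', 'def'));
```

Because the pairs are consumed left to right, a quote character inside one value can never be mistaken for the start of the next one. Escaped quotes would still need extra handling.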

Related

Extract content of code which start with a curly bracket and ends with a curly bracket followed by closing parenthesis

I'm completely lost with regular expressions right now (lack of practice).
I'm writing a node script, which goes through a bunch of js files, each file calls a function, with one of the arguments being a json. The aim is to get all those json arguments and place them in one file. The problem I'm facing at the moment is the extraction of the argument part of the code, here is the function call part of that string:
$translateProvider.translations('de', {
WASTE_MANAGEMENT: 'Abfallmanagement',
WASTE_TYPE_LIST: 'Abfallarten',
WASTE_ENTRY_LIST: 'Abfalleinträge',
WASTE_TYPE: 'Abfallart',
TREATMENT_TYPE: 'Behandlungsart',
TREATMENT_TYPE_STATUS: 'Status Behandlungsart',
DUPLICATED_TREATMENT_TYPE: 'Doppelte Behandlungsart',
TREATMENT_TYPE_LIST: 'Behandlungsarten',
TREATMENT_TARGET_LIST: 'Ziele Behandlungsarten',
TREATMENT_TARGET_ADD: 'Ziel Behandlungsart hinzufügen',
SITE_TARGET: 'Gebäudeziel',
WASTE_TREATMENT_TYPES: 'Abfallbehandlungsarten',
WASTE_TREATMENT_TARGETS: '{{Abfallbehandlungsziele}}',
WASTE_TREATMENT_TYPES_LIST: '{{Abfallbehandlungsarten}}',
WASTE_TYPE_ADD: 'Abfallart hinzufügen',
UNIT_ADD: 'Einheit hinzufügen'
})
So I'm trying to write a regular expression which matches the segment of the js code that starts with "'de', {" and ends with "})", while allowing any characters in between (single/double curly brackets included).
I tried something like this \'de'\s*,\s*{([^}]*)})\ , but that doesn't work. The furthest I got was with this \'de'\s*,\s*{([^})]*)}\ , but this ends at the first closing curly bracket within the json, which is not what I want.
It seems that I have now completely forgotten even the concepts of regular expressions I understood before.
Any help is much appreciated.
You did not state the desired output. Here is a solution that parses the text, and creates an array of arrays. You can easily transform that to a desired output.
const input = `$translateProvider.translations('de', {
WASTE_MANAGEMENT: 'Abfallmanagement',
WASTE_TYPE_LIST: 'Abfallarten',
WASTE_ENTRY_LIST: 'Abfalleinträge',
WASTE_TYPE: 'Abfallart',
TREATMENT_TYPE: 'Behandlungsart',
TREATMENT_TYPE_STATUS: 'Status Behandlungsart',
DUPLICATED_TREATMENT_TYPE: 'Doppelte Behandlungsart',
TREATMENT_TYPE_LIST: 'Behandlungsarten',
TREATMENT_TARGET_LIST: 'Ziele Behandlungsarten',
TREATMENT_TARGET_ADD: 'Ziel Behandlungsart hinzufügen',
SITE_TARGET: 'Gebäudeziel',
WASTE_TREATMENT_TYPES: 'Abfallbehandlungsarten',
WASTE_TREATMENT_TARGETS: '{{Abfallbehandlungsziele}}',
WASTE_TREATMENT_TYPES_LIST: '{{Abfallbehandlungsarten}}',
WASTE_TYPE_ADD: 'Abfallart hinzufügen',
UNIT_ADD: 'Einheit hinzufügen'
})`;
const regex1 = /\.translations\([^{]*\{\s+(.*?)\s*\}\)/s;
const regex2 = /',[\r\n]+\s*/;
const regex3 = /: +'/;
let result = [];
let m = input.match(regex1);
if(m) {
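// Drop the closing quote left on the last value, then split into [key, value] pairs.
result = m[1].replace(/'$/, '').split(regex2).map(line => line.split(regex3));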
}
console.log(result);
Explanation of regex1:
\.translations\( -- literal .translations(
[^{]* -- anything not {
\{\s+ -- { and all whitespace
(.*?) -- capture group 1 with non-greedy scan up to:
\s*\}\) -- whitespace, followed by })
s flag to make . match newlines
Explanation of regex2:
',[\r\n]+\s* -- ', followed by one or more newlines and any leading whitespace (to split lines)
Explanation of regex3:
: +' -- a colon, one or more spaces, then ' (to split key/value)
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
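Turning the array of pairs into a single object could look roughly like this. It is a sketch on a shortened, made-up sample (sample, pairs and translations are illustrative names), and it assumes every value is a single-quoted string with no embedded "'," sequence.

```javascript
// Shortened stand-in for one of the js files.
const sample = `$translateProvider.translations('de', {
  WASTE_TYPE: 'Abfallart',
  UNIT_ADD: 'Einheit hinzufügen'
})`;

// Same three regexes as in the answer above.
const body = sample.match(/\.translations\([^{]*\{\s+(.*?)\s*\}\)/s);
const pairs = body
  ? body[1]
      .replace(/'$/, '')           // drop the closing quote of the last value
      .split(/',[\r\n]+\s*/)       // one entry per line
      .map(line => line.split(/: +'/))
  : [];

// Object.fromEntries turns [[key, value], ...] into { key: value, ... }.
const translations = Object.fromEntries(pairs);
console.log(JSON.stringify(translations, null, 2));
```

The resulting object can then be merged with the objects from the other files and written out once.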
This can be done with lookahead, lookbehind, and boundary-type assertions:
/(?<=^\$translateProvider\.translations\('de', {)[\s\S]*(?=}\)$)/
(?<=^\$translateProvider\.translations\('de', {) is a lookbehind assertion that checks for '$translateProvider.translations('de', {' at the beginning of the string.
(?=}\)$) is a lookahead assertion that checks for '})' at the end of the string.
[\s\S]* is a character class that matches any sequence of space and non-space characters between the two assertions.
Hope this helps.
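For reference, this is roughly how the lookaround pattern behaves on a small made-up input. The variable names are illustrative, and lookbehind needs an ES2018-capable engine (current Node.js and Chrome qualify).

```javascript
// Made-up input: the call must be the entire string for ^ and $ to hold.
const call = "$translateProvider.translations('de', {\n  A: '{{x}}'\n})";

const bodyPattern =
  /(?<=^\$translateProvider\.translations\('de', {)[\s\S]*(?=}\)$)/;

const inner = call.match(bodyPattern);
// inner[0] is everything between the opening "{" and the final "})",
// including the inner {{ }} pairs.
console.log(inner ? inner[0].trim() : "no match");
```

Since the pattern is anchored with ^ and $, it only works when each function call has already been isolated into its own string.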

JS conditional RegEx that removes different parts of a string between two delimiters

I have a string of text with HTML line breaks. Some of the <br> immediately follow a number between two delimiters «...» and some do not.
Here's the string:
var str = ("«1»<br>«2»some text<br>«3»<br>«4»more text<br>«5»<br>«6»even more text<br>");
I’m looking for a conditional regex that’ll remove the number and delimiters (ex. «1») as well as the line break itself without removing all of the line breaks in the string.
So for instance, at the beginning of my example string, when the script encounters »<br> it’ll remove everything between and including the first « to the left, to »<br> (ex. «1»<br>). However it would not remove «2»some text<br>.
I’ve had some help removing the entire number/delimiters (ex. «1») using the following:
var regex = new RegExp(UsedKeys.join('|'), 'g');
var nextStr = str.replace(/«[^»]*»/g, " ");
I sure hope that makes sense.
Just to be super clear, when the string is rendered in a browser, I’d like to go from this…
«1»
«2»some text
«3»
«4»more text
«5»
«6»even more text
To this…
«2»some text
«4»more text
«6»even more text
Many thanks!
Maybe I'm missing a subtlety here, if so I apologize. But it seems that you can just replace with the regex: /«\d+»<br>/g. This will replace all occurrences of a number between « & » followed by <br>
var str = "«1»<br>«2»some text<br>«3»<br>«4»more text<br>«5»<br>«6»even more text<br>"
var newStr = str.replace(/«\d+»<br>/g, '')
console.log(newStr)
To match letters and digits you can use \w instead of \d (note that \w also matches the underscore)
var str = "«a»<br>«b»some text<br>«hel»<br>«4»more text<br>«5»<br>«6»even more text<br>"
var newStr = str.replace(/«\w+?»<br>/g, '')
console.log(newStr)
This snippet assumes that the input within the brackets will always be a number but I think it solves the problem you're trying to solve.
const str = "«1»<br>«2»some text<br>«3»<br>«4»more text<br>«5»<br>«6»even more text<br>";
console.log(str.replace(/(«(\d+)»<br>)/g, ""));
/(«(\d+)»<br>)/g
«(\d+)» Will match any brackets containing 1 or more digits in a row
If you would prefer to match alphanumeric you could use «(\w+)» or for any characters including symbols you could use «([^»]+)»
<br> Will match a line break
The g flag matches globally so that it can find every instance of the substring
Basically we are only removing the bracketed numbers if they are immediately followed by a line break.
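The «([^»]+)» variant mentioned above can be sketched the same way. The sample string here is made up to include a non-numeric key; html and cleaned are illustrative names.

```javascript
// Made-up sample with a key that is neither purely numeric nor purely \w.
const html = "«1»<br>«2»some text<br>«k-3»<br>«4»more text<br>";

// [^»]+ accepts anything up to the closing », so the key format does not matter.
// The <br> must follow the » immediately for the pair to be removed.
const cleaned = html.replace(/«[^»]+»<br>/g, "");

console.log(cleaned); // «2»some text<br>«4»more text<br>
```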

How to check if a string contains specific words in different languages [duplicate]

I have a simple regex which finds a word in text:
var patern = new RegExp("\\bsomething\\b", "gi");
This matches the word when it has spaces or punctuation around it.
So it matches:
I have something.
But doesn't match:
I havesomething.
which is fine and exactly what I need.
But I have an issue with, for example, Arabic. If I have the regex:
var patern = new RegExp("\\bرياضة\\b", "gi");
and text:
رياضة أنا أحب رياضتي وأنا سعيد حقا هنا لها حبي
The keyword which I am looking for is at the end of the text.
But this doesn't work, it just doesn't find it.
It works if I remove \b from regex:
var patern = new RegExp("رياضة", "gi");
But that is not what I want, because I don't want to find it if it's part of another word, as in the English example above:
I havesomething.
I really have little knowledge of regex, so can anyone help me make this work with English and with languages like Arabic?
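First we have to understand what \b means:
\b is an anchor that matches at a position that is called a "word boundary".
JavaScript defines that boundary in terms of \w, which covers only [A-Za-z0-9_], so Arabic letters never count as word characters and \b does not fire around them. In your case, the boundary you are looking for is a position with no other Arabic letter next to the word.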
To match only Arabic letters in Regex, we use unicode:
[\u0621-\u064A]+
Or we can simply use Arabic letters directly
[ء-ي]+
The code above will match any run of Arabic letters. To make a word boundary out of it, we can simply negate it on both sides:
[^ء-ي]ARABIC TEXT[^ء-ي]
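The code above means: require a non-Arabic character on either side of the Arabic word, which will work in your case. Note that a character must actually be present on both sides, so a word at the very start or end of the string needs padding (the example below wraps the text in spaces).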
Consider this example that you gave us which I modified a little bit:
أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا
If we are trying to match only رياض, a plain search for this word would also match inside رياضة, رياضيات, and رياضتي. However, if we add the code above, the match will successfully be on رياض only.
var x = " أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا ";
x = x.replace(/([^ء-ي]رياض[^ء-ي])/g, '<span style="color:red">$1</span>');
document.write (x);
If you would like to account for أآإا with one code, you could use something like this [\u0622\u0623\u0625\u0627] or simply list them all between square brackets [أآإا]. Here is a complete code
var x = "أنا هنا وانا هناك .. آنا هنا وإنا هناك";
x = x.replace(/([أآإا]نا)/g, '<span style="color:red">$1</span>');
document.write (x);
Note: If you want to match every possible Arabic characters in Regex including all Arabic letters أ ب ت ث ج, all diacritics َ ً ُ ٌ ِ ٍ ّ, and all Arabic numbers ١٢٣٤٥٦٧٨٩٠, use this regex: [،-٩]+
Useful link about the ranking of Arabic characters in Unicode: https://en.wikipedia.org/wiki/Arabic_script_in_Unicode
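A different technique from the negated character class above is to build the boundary from lookarounds and Unicode property escapes, which also works at the start and end of the string and for scripts other than Arabic. This is only a sketch: wholeWordRegex is a hypothetical helper, and it assumes an engine with ES2018 support for lookbehind and \p{...} under the u flag.

```javascript
// Hypothetical helper: whole-word search that is not limited to ASCII.
function wholeWordRegex(word) {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Letters, combining marks (diacritics), digits and underscore count as
  // "word" characters; anything else, or the string edge, is a boundary.
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  return new RegExp(`(?<!${wordChar})${escaped}(?!${wordChar})`, "giu");
}

const text = "رياضة أنا أحب رياضتي وأنا سعيد";
console.log((text.match(wholeWordRegex("رياضة")) || []).length); // 1
console.log((text.match(wholeWordRegex("رياض")) || []).length);  // 0
```

Unlike [^ء-ي]...[^ء-ي], the lookarounds do not consume the neighbouring characters, so a replacement only touches the word itself.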
This doesn't work because \b in the regex engine does not treat Arabic letters as word characters.
You could search for the Unicode characters in the text instead (Unicode ranges).
Or you could convert the text into Unicode escapes and then build the regex from those (I have never tried this, but it should work).
I used this ء-ي٠-٩ and it works for me
If you don't need a complicated RegEx (for instance, because you're looking for a particular word or a short list of words), then I've found that it's actually easier to tokenize the search text and find it that way:
>>> text = 'رياضة أنا أحب رياضتي وأنا سعيد حقا هنا لها حبي '
>>> tokens = text.split()
>>> print(tokens)
['رياضة', 'أنا', 'أحب', 'رياضتي', 'وأنا', 'سعيد', 'حقا', 'هنا', 'لها', 'حبي']
>>> search_words = ['رياضة', 'رياضت']
>>> found = [w for w in tokens if w in search_words]
>>> print(found)
['رياضة'] # returns only full-word match
I'm sure that this is slower than RegEx, but not enough that I've ever noticed.
If your text had punctuation, you could do a more sophisticated tokenization (so it would find things like 'رياضة؟') using NLTK.
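Since the question is about JavaScript, the same tokenizing idea might look like this there. It is a rough sketch, not a port of NLTK: findWholeWords is a made-up name, tokens are split on whitespace only, and punctuation is trimmed with the \p{P} property escape (which needs the u flag).

```javascript
// Hypothetical helper: split on whitespace, strip punctuation from both
// ends of each token, and keep tokens that exactly equal a search word.
function findWholeWords(text, searchWords) {
  const wanted = new Set(searchWords);
  return text
    .split(/\s+/u)
    .map(token => token.replace(/^\p{P}+|\p{P}+$/gu, ""))
    .filter(token => wanted.has(token));
}

const found = findWholeWords("رياضة؟ أنا أحب رياضتي", ["رياضة", "رياضت"]);
console.log(found.length); // 1, only the full-word match survives
```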

Regexp to capture comma separated values

I have a string that can be a comma separated list of \w, such as:
abc123
abc123,def456,ghi789
I am trying to find a JavaScript regexp that will return ['abc123'] (first case) or ['abc123', 'def456', 'ghi789'] (without the comma).
I tried:
^(\w+,?)+$ -- Nope, as only the last repeating pattern will be matched, 789
^(?:(\w+),?)+$ -- Same story. I am using non-capturing bracket. However, the capturing just doesn't seem to happen for the repeated word
Is what I am trying to do even possible with regexp? I tried pretty much every combination of grouping, using capturing and non-capturing brackets, and still not managed to get this happening...
If you want to discard the whole input when there is something wrong, the simplest way is to validate, then split:
if (/^\w+(,\w+)*$/.test(input)) {
var values = input.split(',');
// Process the values here
}
If you want to allow empty value, change \w+ to \w*.
Trying to match and validate at the same time with a single regex requires emulating the \G feature, which asserts the position of the last match. Why is \G required? Because it prevents the engine from retrying the match at the next position and bypassing your validation. Remember that ECMAScript regex did not have look-behind when this was written (it was added in ES2018), so you can't differentiate between the position of an invalid character and the character(s) after it:
something,=bad,orisit,cor&rupt
          ^^             ^^
When you can't differentiate between the 2 positions, you can't rely on the engine to do a match-all operation alone. While it is possible to use a while loop with RegExp.exec and assert the position of last match yourself, why would you do so when there is a cleaner option?
If you want to salvage whatever is available, torazaburo's answer is a viable option.
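For completeness, the exec loop mentioned above can be sketched with the sticky flag (y), which modern JavaScript offers as a stand-in for \G: a sticky regex may only match exactly at lastIndex. parseCsvWords is a hypothetical name, and validate-then-split remains the simpler choice.

```javascript
// Hypothetical helper: returns the list of words, or null if any part of
// the input is invalid.
function parseCsvWords(input) {
  // One word, then either a comma that is followed by another word
  // character, or the end of the input.
  const item = /(\w+)(?:,(?=\w)|$)/y;
  const values = [];
  let pos = 0;
  while (pos < input.length) {
    item.lastIndex = pos;         // sticky: the match must start right here
    const m = item.exec(input);
    if (m === null) return null;  // invalid character at this position
    values.push(m[1]);
    pos = item.lastIndex;
  }
  return values.length > 0 ? values : null;
}

console.log(parseCsvWords("abc123,def456,ghi789")); // three values
console.log(parseCsvWords("something,=bad"));       // null
```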
Try this regex :
'/([^,]+)/'
Alternatively, strings in JavaScript have a split method that can split a string based on a delimiter:
s.split(',')
Split on the comma first, then filter out results that do not match:
str.split(',').filter(function(s) { return /^\w+$/.test(s); })
This regex pattern extracts the runs of digits from values that contain special characters such as ., ,, # and so on.
var val = [1234,1213.1212, 1.3, 1.4]
var re = /[0-9]*[0-9]/gi;
var str = "abc123,def456, asda12, 1a2ass, yy8,ghi789";
var re = /[a-z]{3}\d{3}/g;
var list = str.match(re);
document.write("<BR> list.length: " + list.length);
for(var i=0; i < list.length; i++) {
document.write("<BR>list(" + i + "): " + list[i]);
}
This will put only the entries shaped like "abc123" (three letters followed by three digits) in the list and nothing else.
Maybe you can use the split function
var st = "abc123,def456,ghi789";
var res = st.split(',');

Regex to get string between curly braces

Unfortunately, despite having tried to learn regex at least one time a year for as many years as I can remember, I always forget as I use them so infrequently. This year my new year's resolution is to not try and learn regex again - So this year to save me from tears I'll give it to Stack Overflow. (Last Christmas remix).
I want to pass in a string in this format {getThis}, and be returned the string getThis. Could anyone be of assistance in helping to stick to my new year's resolution?
Related questions on Stack Overflow:
How can one turn regular quotes (i.e. ', ") into LaTeX/TeX quotes (i.e. `', ``'')
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Try
/{(.*?)}/
That means, match any character between { and }, but don't be greedy - match the shortest string which ends with } (the ? stops * being greedy). The parentheses let you extract the matched portion.
Another way would be
/{([^}]*)}/
This matches any character except a } char (another way of not being greedy)
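The difference between the two patterns and a plain greedy .* shows up once the input has more than one pair of braces. A small sketch, with a made-up input string and illustrative variable names:

```javascript
const input = "{getThis} and {that}";

const lazy = input.match(/{(.*?)}/);       // stops at the first }
const negated = input.match(/{([^}]*)}/);  // cannot cross a } at all
const greedy = input.match(/{(.*)}/);      // runs on to the last }

console.log(lazy[1]);    // getThis
console.log(negated[1]); // getThis
console.log(greedy[1]);  // getThis} and {that
```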
/\{([^}]+)\}/
/ - delimiter
\{ - opening literal brace escaped because it is a special character used for quantifiers eg {2,3}
( - start capturing
[^}] - character class consisting of
^ - not
} - a closing brace (no escaping necessary because special characters in a character class are different)
+ - one or more of the character class
) - end capturing
\} - the closing literal brace
/ - delimiter
If your string will always be of that format, a regex is overkill:
>>> var g='{getThis}';
>>> g.substring(1,g.length-1)
"getThis"
substring(1 means to start one character in (just past the first {) and ,g.length-1) means to take characters until (but not including) the character at the string length minus one. This works because the position is zero-based, i.e. g.length-1 is the last position.
For readers other than the original poster: If it has to be a regex, use /{([^}]*)}/ if you want to allow empty strings, or /{([^}]+)}/ if you want to only match when there is at least one character between the curly braces. Breakdown:
/: start the regex pattern
{: a literal curly brace
(: start capturing
[: start defining a class of characters to capture
^}: "anything other than }"
]: OK, that's our whole class definition
*: any number of characters matching that class we just defined
): done capturing
}: a literal curly brace must immediately follow what we captured
/: end the regex pattern
Try this:
/[^{\}]+(?=})/g
For example
Welcome to RegExr v2.1 by #{gskinner.com}, #{ssd.sd} hosted by Media Temple!
will return gskinner.com, ssd.sd.
Try this
let path = "/{id}/{name}/{age}";
const paramsPattern = /[^{}]+(?=})/g;
let extractParams = path.match(paramsPattern);
console.log("extractParams", extractParams) // prints all the names between {} = ["id", "name", "age"]
Here's a simple solution using javascript replace
var st = '{getThis}';
st = st.replace(/\{|\}/gi,''); // "getThis"
As the accepted answer above points out the original problem is easily solved with substring, but using replace can solve the more complicated use cases
If you have a string like "randomstring999[fieldname]"
You use a slightly different pattern to get fieldname
var nameAttr = "randomstring999[fieldname]";
var justName = nameAttr.replace(/.*\[|\]/gi,''); // "fieldname"
This one works in Textmate and it matches everything in a CSS file between the curly brackets.
\{(\s*?.*?)*?\}
selector {.
.
matches here
including white space.
.
.}
If you want to further be able to return the content, then wrap it all in one more set of parentheses like so:
\{((\s*?.*?)*?)\}
and you can access the contents via $1.
This also works for functions, but I haven't tested it with nested curly brackets.
You want to use regex lookahead and lookbehind. This will give you only what is inside the curly braces:
(?<=\{)(.*?)(?=\})
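In JavaScript this needs an engine with lookbehind support (ES2018 or later). A short sketch with the g flag, using a made-up template string:

```javascript
const template = "/{id}/{name}/{age}";

// The braces are only asserted by the lookarounds, so each match is just
// the text between them.
const names = template.match(/(?<=\{)(.*?)(?=\})/g);

console.log(names); // [ 'id', 'name', 'age' ]
```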
I have looked at the other answers, and one vital piece of logic seems to be missing from them: select everything between two CONSECUTIVE brackets, but NOT the brackets themselves.
So, here is my answer:
\{([^{}]+)\}
A regex for getting every brace-enclosed string that occurs in the input, rather than just the first occurrence:
/\{([^}]+)\}/gm
var re = /{(.*)}/;
var m = "{helloworld}".match(re);
if (m != null)
console.log(m[0].replace(re, '$1'));
The simpler .replace(/.*{(.*)}.*/, '$1') unfortunately returns the entire string if the regex does not match. The above code snippet can more easily detect a match.
Try this one; according to http://www.regextester.com it works normally for JS.
([^{]*?)(?=\})
This one matches everything even if it finds multiple closing curly braces in the middle:
\{([\s\S]*)\}
Example:
{
"foo": {
"bar": 1,
"baz": 1,
}
}
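You can use this regex to match everything in between, even another {} (like a JSON text). Note that it is not true recursion, which JavaScript regex does not support: it greedily matches from the first { to the last }, as long as no parentheses appear in between: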
\{([^()]|())*\}
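When the braces really are nested and more than one top-level group can occur, counting depth in a plain loop is one alternative to a regex. The sketch below uses a hypothetical helper name (topLevelBraces) and deliberately ignores braces that appear inside string literals or comments.

```javascript
// Hypothetical helper: return the contents of each top-level {...} group.
function topLevelBraces(text) {
  const groups = [];
  let depth = 0;
  let start = -1;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === "{") {
      if (depth === 0) start = i + 1;  // remember where the content begins
      depth++;
    } else if (ch === "}" && depth > 0) {
      depth--;
      if (depth === 0) groups.push(text.slice(start, i));
    }
  }
  return groups;
}

console.log(topLevelBraces("x {a {b} c} y {d}")); // [ 'a {b} c', 'd' ]
```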
This also helped me while trying to solve someone else's problem:
splitting out the contents inside curly braces ({}) that follow a pattern like
{'day': 1, 'count': 100}.
For example (in C++):
#include <iostream>
#include <regex>
#include<string>
using namespace std;
int main()
{
//string to be searched
string s = "{'day': 1, 'count': 100}, {'day': 2, 'count': 100}";
// regex expression for pattern to be searched
regex e ("\\{[a-z':, 0-9]+\\}");
regex_token_iterator<string::iterator> rend;
regex_token_iterator<string::iterator> a ( s.begin(), s.end(), e );
while (a!=rend) cout << " [" << *a++ << "]";
cout << endl;
return 0;
}
Output:
[{'day': 1, 'count': 100}] [{'day': 2, 'count': 100}]
You can use the String.slice() method.
let str = "{something}";
str = str.slice(1,-1) // something
