Javascript RegExp Tokenizing

Given a string, I want to use a regular expression to tokenize it. The pattern is as follows: any character (including new line, etc.), until "<", followed by a space zero or more times, followed by "%".
I tried
var patt = /(.)*<(\s)*%/;
but it does not yield the desired result. I would appreciate an explanation along with the pattern.

Use this:
"some string".split(/.*<\s*%/);

/^[\s\S]*?< *%/
should do what you want.
^ causes it to match at the beginning of the string.
[\s\S] matches any character. Literally, it means any space or non-space character, and works around the fact that . does not match newlines.
*? matches zero or more but the fewest necessary for the rest of the pattern to match.
< matches a literal '<'
" *" (a space followed by an asterisk) matches zero or more spaces. This is more readable if written as [ ]*.
% finally matches that character.
If you want to match the entire string (i.e. the % should be the last character in the string), then you can put a $ before the last /.
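As a rough illustration of the pattern above, the following sketch runs it against a multi-line input and compares it with a dot-based variant. The sample string and the variable names are made up for the example.

```javascript
// Hypothetical sample input: the text before the first "<  %" spans two lines.
const input = "first line\nsecond line <  % rest <% tail";

// Lazy [\s\S]*? stops at the first "<" that is followed by optional spaces and "%".
const pattern = /^[\s\S]*?< *%/;

const match = pattern.exec(input);
const token = match ? match[0] : null;
console.log(JSON.stringify(token));

// For comparison: "." does not cross the newline, so the anchored dot version finds nothing.
const dotMatch = /^.*?< *%/.exec(input);
console.log(dotMatch);
```

Because the quantifier is lazy, the match ends at the first "<  %" rather than at the later "<%".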

Related

How to extract the last word in a string with a JavaScript regex?

What I need is the last match. In the case below, that is the word test, without the $ signs or any other special character:
Test String:
$this$ $is$ $a$ $test$
Regex:
\b(\w+)\b

The $ represents the end of the string, so...
\b(\w+)$
However, your test string seems to have dollar sign delimiters, so if those are always there, then you can use that instead of \b.
\$(\w+)\$$
var s = "$this$ $is$ $a$ $test$";
document.body.textContent = /\$(\w+)\$$/.exec(s)[1];
If there could be trailing spaces, then add \s* before the end.
\$(\w+)\$\s*$
And finally, if there could be other non-word stuff at the end, then use \W* instead.
\b(\w+)\W*$
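The variants above can be tried side by side. This is only a sketch; the second and third inputs are invented here to exercise the trailing-space and trailing-punctuation cases.

```javascript
const s = "$this$ $is$ $a$ $test$";

// Capture group 1 holds the word between the final pair of dollar signs.
const strict = /\$(\w+)\$$/.exec(s);

// Same idea, but tolerating trailing whitespace (hypothetical input).
const withSpaces = /\$(\w+)\$\s*$/.exec(s + "   ");

// Last word followed by any run of non-word characters (hypothetical input).
const general = /\b(\w+)\W*$/.exec("one two three!?");

const lastWords = [strict, withSpaces, general].map(m => (m ? m[1] : null));
console.log(lastWords);
```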

In some cases a word may be followed by non-word characters. For example, take the following sentence:
Marvelous Marvin Hagler was a very talented boxer!
If we want to match the word boxer, a pattern that expects the word to be the very last thing in the string (such as \b(\w+)$) will not suffice, due to the fact that an exclamation mark follows the word. To ensure a successful capture, the following expression will suffice, and it also takes into account extraneous whitespace, newlines and any non-word character.
[a-zA-Z]+?(?=\s*?[^\w]*?$)
https://regex101.com/r/D3bRHW/1
The expression breaks down as follows:
[a-zA-Z] looks for letters only, either uppercase or lowercase.
+? expands only as far as necessary.
(?= ... ) is a positive lookahead.
\s*? inside the lookahead allows optional whitespace.
[^\w] matches any non-word character.
*? repeats that class as often as needed.
$ asserts the end of the line (the end of the string when the m flag is not set).
The benefits here are that we do not need any flags or word boundaries, trailing non-word characters are taken into account, and no negative lookaround is needed.

var input = "$this$ $is$ $a$ $test$";
If you use var result = input.match(/\b(\w+)\b/g), an array of all the matches will be returned. You can then get the last one by calling pop() on the result or by reading result[result.length - 1].
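A small sketch of the match-then-take-last approach, assuming a regex literal with the g flag (a quoted "\b" would be read as a backspace character in a JavaScript string, so a string pattern would not behave as intended):

```javascript
const input = "$this$ $is$ $a$ $test$";

// With the g flag, match() returns every full match; fall back to [] when there is none.
const words = input.match(/\b\w+\b/g) || [];

// The last element sits at index length - 1; indexing at length yields undefined.
const lastWord = words[words.length - 1];
const pastTheEnd = words[words.length];

console.log(words, lastWord, pastTheEnd);
```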

Your regex will find a word, and since regexes operate left to right it will find the first word.
A \w+ matches as many consecutive word characters (letters, digits or the underscore) as it can, but it must match at least 1.
A \b matches the position between a word character and a non-word character. In your case these are the positions next to the '$' characters.
What you need is to anchor your regex to the end of the input which is denoted in a regex by the $ character.
To support an input that may have more than just a '$' character at the end of the line, spaces or a period for instance, you can use \W+ which matches as many non-alphanumeric characters as it can:
\$(\w+)\W+$

Avoid regex - use .split and .pop the result. Use .replace to remove the special characters:
var match = str.split(' ').pop().replace(/[^\w\s]/gi, '');

How to replace all occurrences of "\\" string in JavaScript

This seems a very simple question but I haven't been able to get this to work.
How do I convert the following string:
var origin_str = "abc/!/!"; // Original string
var modified_str = "abc!!"; // replaced string
I tried this:
console.log(origin_str.replace(/\\/,''));
This only removes the first occurrence of backslash. I want to replaceAll. I followed this instruction in SO: How to replace all occurrences of a string in JavaScript?
origin_str.replace(new RegExp('\\', 'g'), '');
This code throws an error: SyntaxError: Invalid regular expression: /\/: \ at end of pattern. What's the regex for removing backslashes in JavaScript?

A quick basic overview of regular expressions in JavaScript
When using regular expressions you can define the expression in two ways.
Either directly in the function or variable by using /regular expression/
Or by using the RegExp constructor: new RegExp('regular expression').
Please note the difference between the two ways of defining. In the first, the search pattern is enclosed in forward slashes, while in the second the search pattern is passed as a string.
Remember that regular expressions are in fact a search language with its own syntax. Some characters have a special meaning: \, ^, $, . (dot), |, ?, *, +, (, ), [, {. These characters are called metacharacters and need to be escaped if you want them to be part of the search pattern. The forward slash / must also be escaped inside a regex literal because it delimits the pattern, and quotes (' and ") must be escaped when the pattern is written as a string. If not, they will be treated as operators or generate script errors. Escaping is done by using the backslash. E.g. in \\ the first backslash escapes the second, and the search pattern will now search for backslashes.
There are a multitude of options you can add to your search pattern:
Examples
Adding \d will make the pattern search for a single digit, equivalent to [0-9] (the underscore is matched by \w, not by \d). Simple regular expressions are parsed from left to right.
/javascript/
Searches for the word javascript in a string.
/[a-z]/
When a pattern is put between square brackets, the search pattern matches a single character equal to any one of the values inside the brackets. This will find d in 229302d34330
You can build a regular expression with multiple blocks.
/(java|ecma)script/
Finds javascript or ecmascript in a string. The | is the or operator. Note that the alternation must sit inside the group: /(java)|(ecma)script/ would match either java or ecmascript.
/a/ vs. /a+/
The first matches the first a in aaabbb, the second matches a repetition of a until another character is found. So the second matches: aaa.
The plus sign + means find a one or more times. You can also use * which means zero or more times.
/^\d+$/
We've seen the \d earlier and also the plus sign. Together they mean: find one or more numeric characters. The ^ (caret) and $ (dollar sign) are new. The ^ anchors the match at the beginning of the string, while the $ anchors it at the end of the string. This expression will match: 574545485 but not d43849343, 549854fff or 4348d8788.
Flags
Flags are modifiers and are declared after the regular expression: /regular expression/flags
The three most commonly used flags in JavaScript are listed here (newer engines support additional ones, such as s, u and y):
g (global) Searches multiple times for the pattern.
i (ignore case) Ignores case in pattern.
m (multiline) treat beginning and end characters (^ and $) as working over multiple lines (i.e., match the beginning or end of each line (delimited by \n or \r), not only the very beginning or end of the whole input string)
So a regular expression like this:
/d[0-9]+/ig
matches D094938 and D344783 in 98498D094938A37834D344783.
The i makes the search case-insensitive. Matching a D because of the d in the pattern. If D is followed by one or more numbers then the pattern is matched. The g flag commands the expression to look for the pattern globally or simply said: multiple times.
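The effect of the flags described above can be checked with a short sketch. The first input is the one from the text; the multi-line string is an invented example for the m flag.

```javascript
const text = "98498D094938A37834D344783";

// g + i: every match, regardless of case.
const withFlags = text.match(/d[0-9]+/gi);

// No flags: only a lowercase d would match, and this input has none.
const noFlags = text.match(/d[0-9]+/);

console.log(withFlags, noFlags);

// m: ^ and $ also match at line breaks (hypothetical three-line input).
const lines = "12\nab\n34";
const numericLines = lines.match(/^\d+$/gm);
console.log(numericLines);
```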
In your case @Qwerty provided the correct regex:
origin_str.replace(/\//g, "")
Here the search pattern is a single forward slash /, escaped by the backslash to prevent script errors. The g flag tells the replace function to search for all occurrences of the forward slash in the string and replace them with an empty string "".
For a comprehensive tutorial and reference : http://www.regular-expressions.info/tutorial.html
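The SyntaxError in the question comes from double escaping: in a string literal, '\\' is a single backslash, so the RegExp constructor receives a lone backslash as its pattern. The sketch below contrasts the literal and constructor forms. The backslash input is an assumption added for illustration, since the string in the question actually contains forward slashes.

```javascript
// Hypothetical input containing real backslashes (each written as \\ in source).
const withBackslashes = "abc\\!\\!";
// The string from the question, which contains forward slashes.
const withSlashes = "abc/!/!";

// Regex literal: \\ matches one backslash.
const a = withBackslashes.replace(/\\/g, "");

// Constructor: the string "\\\\" holds two backslashes, i.e. the regex \\.
const b = withBackslashes.replace(new RegExp("\\\\", "g"), "");

// Forward slashes: escape the delimiter inside a literal.
const c = withSlashes.replace(/\//g, "");

// new RegExp("\\") hands the engine a lone backslash, which is a syntax error.
let constructorError = null;
try {
  new RegExp("\\", "g");
} catch (e) {
  constructorError = e;
}

console.log(a, b, c, constructorError instanceof SyntaxError);
```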

Looking for this?
origin_str.replace(/\//g, "")
The syntax for replace is
.replace(/pattern/flags, replacement)
So in my case the pattern is \/ - an escaped slash
and g is global flag.

regex pattern for name is not working in adobe cq5

I need a regex pattern for a name field that allows the characters a-z, A-Z, ', - and whitespace. I am currently using this code, but it gives an error.
final String regexpattern = "/[a-zA-Z\s-']*/";

You have to add another \ before \s, and add ^ and $ to match the beginning and end of the string.
final String regexpattern = "/^[a-zA-Z\\s'-]*$/";

Your regex (when it is no longer malformed - see below how to fix that) will always match because you do not use anchors (^ - start of string, and $ - end of string) and use a * quantifier (match 0 or more symbols matching the preceding subpattern, as many as possible), so even an empty match succeeds.
However, just adding ^ and $ will not fix the pattern, because you are using 1 backslash with \s inside a string literal (i.e. inside "..."). The backslash is treated as an escape symbol and is not taken into account as "\s" is an invalid escape sequence. Inside a character class in regex (i.e. in [...]), a hyphen creates a range. Your s-' creates a range from s (dec. code 115) to ' (dec. code 39). Since in a range inside a character class the codes must go from the lowest to the highest, an error is thrown.
You may just use
final String regexpattern = "/^[a-zA-Z\\s-']*$/";
It is possible because the hyphen after a shorthand class \s does not create a range and is considered a literal. As best practice, you can move it to the end of the character class as xdazz did.
As for your additional question in comment:
i have to allow the all characters excluding ^<>%*()#!?
Just use a negated character class: a pair of square brackets with a starting ^ inside it:
final String regexpattern = "/^[^<>%*()#!^?]*$/";
Here, the first ^ is the start-of-string anchor, and the second caret is the negation of the characters inside the character class. The third caret is a literal symbol ^.
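The negated-class idea can be tried out quickly in JavaScript. This is only an analogue for illustration, since the question itself concerns a Java string; the sample inputs are invented.

```javascript
// Accept only strings that contain none of: ^ < > % * ( ) # ! ?
const allowed = /^[^<>%*()#!^?]*$/;

// Hypothetical sample inputs.
const samples = ["John O'Neil", "50% off", "a^b", ""];
const results = samples.map(s => allowed.test(s));

console.log(results);
```

The empty string passes because * also accepts zero characters; replacing * with + would require at least one character.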

RegExp issues - character limit and whitespace ignoring

I need to validate a string that can have any number of characters, a comma, and then 2 characters. I'm having some issues. Here's what I have:
var str="ab,cdf";
var patt1=new RegExp("[A-z]{2,}[,][A-z]{2}");
if(patt1.test(str)) {
alert("true");
}
else {
alert("false");
}
I would expect this to return false, as I have the {2} limit on characters after the comma and this string has three characters. When I run the fiddle, though, it returns true. My (admittedly limited) understanding of RegExp indicates that {2,} is at least 2, and {2} is exactly two, so I'm not sure why three characters after the comma are still returning true.
I also need to be able to ignore a possible whitespace between the comma and the remaining two characters. (In other words, I want it to return true if they have 2+ characters before the comma and two after it - the two after it not including any whitespace that the user may have entered.)
So all of these should return true:
var str = "ab, cd";
var str = "abc, cd";
var str = "ab,cd";
var str = "abc,dc";
I've tried adding the \S indicator after the comma like this:
var patt1=new RegExp("[A-z]{2,}[,]\S[A-z]{2}");
But then the string returns false all the time, even when I have it set to ab, cd, which should return true.
What am I missing?

{2,} is at least 2, and {2} is exactly two, so I'm not sure why three characters after the comma are still returning true.
That's correct. What you forgot is to anchor your expression to string start and end - otherwise it returns true when it occurs somewhere in the string.
not including any whitespace: I've tried adding the \S indicator after the comma
That's the exact opposite. \s matches whitespace characters, \S matches all non-whitespace characters. Also, you probably want some optional repetition of the whitespace, instead of requiring exactly one.
[A-z]
Notice that this character range also includes the characters between Z and a, namely []^_`. You will probably want [A-Za-z] instead, or use [a-z] and make your regex case-insensitive.
Combined, this is what your regex should look like (using a regular expression literal instead of the RegExp constructor with a string literal):
var patt1 = /^[a-z]{2,},\s*[a-z]{2}$/i;
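A quick check of the anchored pattern against the inputs from the question, plus a few invented negative cases, might look like this:

```javascript
const patt1 = /^[a-z]{2,},\s*[a-z]{2}$/i;

// The first four inputs come from the question; the rest are made-up negative cases.
const cases = ["ab, cd", "abc, cd", "ab,cd", "abc,dc", "ab,cdf", "a,cd", "ab,c_"];
const outcomes = cases.map(str => patt1.test(str));
console.log(outcomes);

// The original [A-z] range also accepts the characters between Z and a, such as "_".
const looseRangeAcceptsUnderscore = /^[A-z]$/.test("_");
console.log(looseRangeAcceptsUnderscore);
```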

You are missing ^ and $. Also, the range should be [a-zA-Z], not [A-z].
Your regex should be
^[a-zA-Z]{2,}[,]\s*[A-Za-z]{2}$
^ anchors the match at the beginning of the string.
$ anchors the match at the end of the string.
Without ^ and $ it would match anywhere inside the string.
\s* matches zero or more whitespace characters.

Why the return of the regex is false?

The code is shown as follows:
alert(/symbol([.\n]+?)symbol/gi.test('symbolbbbbsymbol'));
or
alert(/#([.\n]+?)#/gi.test('#bbbb#'));

Because you are looking for dots with a character class inside of < and >. Remove the character class:
/<(.+?)>/
Clarification after question edit:
First code block should be using this pattern: /symbol(.+?)symbol/
Second code block should be using this pattern: /#(.+?)#/

The regex returns false because a dot loses its special power to match any character (but newlines) when placed within a character class [] - it only matches a simple ".".
To match and capture the substring delimited at either end by the same single character, the most efficient pattern to use is
/#([^#]+)#/
To match and capture the substring delimited at either end by the same sequence of characters, the pattern to use is
/symbol(.+?)symbol/
or, if you want to match across newlines
/symbol([\s\S]+?)symbol/
where [\s\S] matches any space or non-space character, which equates to any character.
The ? is included to make the pattern match lazily, i.e. to make sure the match ends on the first occurrence of "symbol".
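The difference between a dot inside and outside a character class can be seen in a short sketch. The inputs other than 'symbolbbbbsymbol' are invented for the example.

```javascript
// Inside [...], the dot is a literal ".", so letters between the delimiters do not match.
const dotInClass = /symbol([.\n]+?)symbol/.test("symbolbbbbsymbol");
const literalDots = /symbol([.\n]+?)symbol/.test("symbol...symbol");

// Outside a class, the lazy dot stops at the first closing "symbol".
const lazy = /symbol(.+?)symbol/.exec("symbolbbbbsymbol and symbolccsymbol");

// [\s\S] also crosses newlines.
const multiline = /symbol([\s\S]+?)symbol/.exec("symbolbb\nbbsymbol");

// Single-character delimiter: a negated class needs no lazy quantifier.
const negated = /#([^#]+)#/.exec("#bbbb# #cc#");

console.log(dotInClass, literalDots, lazy[1], JSON.stringify(multiline[1]), negated[1]);
```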
