Regex to allow multiple words in a sentence - javascript

I'm trying to parse the following sentences with a regex (JavaScript):
I wish a TV
I want some chocolate
I need fire
Currently I'm trying: I(\b[a-zA-Z]*\b){0,5}(TV|chocolate|fire), but it doesn't work. I also ran some tests with \w, but no luck.
I want to allow any words (5 at most) between "I" and the last word, which is predefined.

To account for non-word chars in-between words, you may use
/I(?:\W+\w+){0,5}\W+(?:TV|chocolate|fire)/
See the regex demo
The point is that you added word boundaries, but did not account for spaces, punctuation, etc. (all the other non-word chars) between "words".
Pattern details:
I - matches the left delimiter
(?:\W+\w+){0,5}\W+ - matches 0 to 5 sequences (due to the limiting quantifier {n,m}) of 1+ non-word chars (\W+) and 1+ word chars after them (\w+), and a \W+ at the end matches 1 or more non-word chars that must be present to separate the last matched word chars from the...
(?:TV|chocolate|fire) - matches the trailing delimiter
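As a quick check of the pattern above, the following sketch runs it against the three sentences from the question plus one sentence with six words in between. The variable names and the fourth test sentence are made up for illustration.

```javascript
// Pattern from the answer: "I", up to 5 intermediate words, then one of the keywords.
const wishPattern = /I(?:\W+\w+){0,5}\W+(?:TV|chocolate|fire)/;

const wishSamples = [
  "I wish a TV",
  "I want some chocolate",
  "I need fire",
  "I a b c d e f TV", // hypothetical input: six words in between, one too many
];

// true for the first three sentences, false for the last one
const wishResults = wishSamples.map((s) => wishPattern.test(s));
console.log(wishResults);
```

Note that the pattern is not anchored, so it looks for the sequence anywhere in the input; adding ^ and $ would be one way to require that the whole string is a single sentence.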

You need to allow whitespace after the I. Otherwise it wouldn't capture the whole sentence.
I(\b[a-zA-Z ]*\b){0,5}(TV|chocolate|fire)
A great site for testing regular expressions is regexr

If you don't care about the spaces, use:
/I(\s[a-zA-Z]*\s?){0,5}(TV|chocolate|fire)/

Try
/I\s+(?:\w+\s+){0,5}(TV|chocolate|fire)/
(Test here)
Based on Stefan Kert's version, but it relies on the whitespace to the right of each extra word instead of word boundaries.
It also accepts words of any length made of any valid "word" (\w) characters, separated by any whitespace characters (repetitions are allowed).
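To see what the capturing group in this version returns, here is a minimal sketch. The helper name extractKeyword and the sample sentences are assumptions for illustration, not part of the answer.

```javascript
// Pattern from the answer; group 1 captures the predefined last word.
const keywordPattern = /I\s+(?:\w+\s+){0,5}(TV|chocolate|fire)/;

// Hypothetical helper: returns the captured keyword, or null when nothing matches.
function extractKeyword(sentence) {
  const m = keywordPattern.exec(sentence);
  return m ? m[1] : null;
}

console.log(extractKeyword("I want some chocolate")); // chocolate
console.log(extractKeyword("I   need    fire"));      // fire (repeated spaces are accepted)
console.log(extractKeyword("I want, some chocolate")); // null (the comma is not whitespace)
```

The last call shows the trade-off against the \W+ based pattern: relying on \s+ alone means punctuation between words prevents a match.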

Related

Finding all words ending in "ion" with regex in JavaScript [duplicate]

I need help putting together a regex that will match word that ends with "Id" with case sensitive match.
Try this regular expression:
\w*Id\b
\w* allows word characters in front of Id and the \b ensures that Id is at the end of the word (\b is word boundary assertion).
Gumbo gets my vote, however, the OP doesn't specify whether just "Id" is an allowable word, which means I'd make a minor modification:
\w+Id\b
1 or more word characters followed by "Id" and a word boundary. The [a-zA-Z] variants don't take non-English alphabetic characters into account. I might also use \s instead of \b to require a whitespace character rather than a word boundary. It would depend on whether you need to match across multiple lines.
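The difference between \w*Id\b and \w+Id\b can be checked with a short sketch. It is written in JavaScript, where \w and \b behave the same way for these ASCII inputs; the sample string is made up.

```javascript
const idSample = "userId Id IdCard orderId";

// \w* also accepts the bare word "Id"; \w+ requires at least one character before it.
// "IdCard" is skipped by both because "Id" is not at the end of that word.
const withStar = idSample.match(/\w*Id\b/g);
const withPlus = idSample.match(/\w+Id\b/g);

console.log(withStar); // [ 'userId', 'Id', 'orderId' ]
console.log(withPlus); // [ 'userId', 'orderId' ]
```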
This may do the trick:
\b\p{L}*Id\b
Where \p{L} matches any (Unicode) letter and \b matches a word boundary.
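If this pattern is tried in JavaScript, two details matter: \p{L} is only recognised with the u flag, and \b remains ASCII-based even then. The sketch below illustrates this with a made-up input; the lookaround variant is an assumed alternative, not something the answer proposes.

```javascript
const unicodeSample = "\u00FCberId"; // "überId"

// \b does not treat "ü" as a word character, so the first boundary
// is found after it and the leading letter is lost.
const withBoundary = unicodeSample.match(/\b\p{L}*Id\b/u);

// Assumed alternative: lookarounds that treat any Unicode letter as part of the word.
const withLookarounds = unicodeSample.match(/(?<!\p{L})\p{L}*Id(?!\p{L})/u);

console.log(withBoundary && withBoundary[0]);       // berId
console.log(withLookarounds && withLookarounds[0]); // überId
```

In engines where \b is Unicode-aware (.NET, for example), the original pattern does not need this workaround.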
How about \A[a-z]*Id\z? [This makes characters before Id optional. Use \A[a-z]+Id\z if there needs to be one or more characters preceding Id.]
I would use
\b[A-Za-z]*Id\b
The \b matches at the beginning and end of a word, i.e. next to a space, tab or newline, or at the beginning or end of the string.
The [A-Za-z] will match any letter, and the * means that 0+ get matched. Finally there is the Id.
Note that this will match words that have capital letters in the middle such as 'teStId'.
I use http://www.regular-expressions.info/ for regex reference
Regex ids = new Regex(@"\w*Id\b", RegexOptions.None);
\b means "word break" and \w means any word character. So \w*Id\b means "{stuff}Id". By not including RegexOptions.IgnoreCase, it will be case sensitive.

How do I write a RegEx that starts reading from behind?

I have a series of words I try to capture.
I have the following problem:
The string ends with a fixed set of words
It is not clearly defined how many words the string consists of. However, it should capture all words that start with an upper case letter (German language). Therefore, the left anchor should be the first word starting with lower case.
Example (bold is what I try to capture):
I like Apple Bananas And Cars.
building houses Might Be Salty + Hard said Jessica.
This is the RegEx I tried so far, it only works, if the "non-capture" string does not include any upper case words:
/(?:[a-zäöü]*)([\p{L} +().&]+[Cars|Hard])/gu
You might start the match with an uppercase character, allowing German uppercase chars as well, and then optionally repeat matching either words that start with an uppercase character or a "special" character.
Then end the match with an alternation matching either Hard or Cars.
(?<!\S)[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*(?:\s+(?:[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*|[+()&]))*\s+(?:Hard|Cars)\b
Explanation
(?<!\S) Assert a whitespace boundary to the left to prevent starting the match after a non whitespace char
[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]* Match a word that starts with an uppercase char
(?: Non capture group to match as a whole part
\s+ Match 1+ whitespace chars
(?: Non capture group
[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]* Match a word that starts with uppercase
| Or
[+()&] Match one of the "special" chars
) Close the non capture group
)* Close the non capture group and optionally repeat it
\s+ Match 1+ whitespace chars
(?:Hard|Cars) Match one of the alternatives
\b A word boundary to prevent a partial word match
See a regex demo.
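A minimal sketch of the pattern above applied to the two example sentences from the question. The g flag and the variable names are additions for illustration; lookbehind requires a reasonably recent JavaScript engine.

```javascript
// Pattern from the answer, with the g flag so every match in a text is returned.
const capitalRun =
  /(?<!\S)[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*(?:\s+(?:[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*|[+()&]))*\s+(?:Hard|Cars)\b/g;

const capitalSamples = [
  "I like Apple Bananas And Cars.",
  "building houses Might Be Salty + Hard said Jessica.",
];

const capitalMatches = capitalSamples.map((s) => s.match(capitalRun));
console.log(capitalMatches);
// [ [ 'Apple Bananas And Cars' ], [ 'Might Be Salty + Hard' ] ]
```

The leading "I" in the first sentence is not part of the match because the word after it starts with a lowercase letter, so the run of capitalised words cannot reach the final keyword from there.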
Use \p{Lu} for uppercase letters:
(?:[\p{Lu}+()&][\p{L}+()&]* )+(?:Cars|Hard)
See live demo (showing matching umlauted letters and ß).
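In JavaScript the \p{Lu} variant only works with the u flag. The following sketch tries it on a made-up German sentence; the sentence and variable names are assumptions.

```javascript
// \p{Lu} and \p{L} require the u flag in JavaScript.
const upperRun = /(?:[\p{Lu}+()&][\p{L}+()&]* )+(?:Cars|Hard)/u;

const germanSample = "wir kaufen Äpfel Und Große Cars."; // hypothetical sentence
const upperMatch = germanSample.match(upperRun);

console.log(upperMatch && upperMatch[0]); // Äpfel Und Große Cars
```

Unlike the longer pattern, this one has no boundary checks on either side, so it may also start in the middle of a token or end inside a longer word such as "Carsharing".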

regex to match question sentences in long text

I have a long text in form of a string.
This text includes a lot of questions that are at the same time the headers of sections.
These headers always start with a number+dot+whitespace character combination and end with a question mark, I am trying to extract these strings.
This is what I've got so far: longString.match(/\d\.\s+[a-zA-Z]+\s\\?/g).
Sure enough this doesn't work.
In your example you use [a-zA-Z]+, but you might extend that to matching 1 or more word characters using \w+
This part at the end of the pattern \s\\? matches an expected whitespace char followed by an optional backslash.
To match multiple words, you can optionally repeat the pattern to match a word preceded by 1 or more whitespace characters.
One option is to use
\d\.\s+\w+(?:\s+\w+)*\s*\?
Explanation
\d\. Match a single digit and a . (for 1 or more digits use \d+)
\s+\w+ Match 1+ whitespace chars and 1+ word chars
(?:\s+\w+)* Optionally repeat 1+ whitespace chars and 1+ word chars
\s*\? Match 0+ whitespace chars and a question mark.
Regex demo
A broader match might be matching at least a single time any char except a question mark or whitespace char after the digit, dot and whitespace:
\d\.\s+[^\s?]+(?:\s+[^\s?]+)*\?
Regex demo
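To compare the two patterns, here is a small sketch using String.prototype.match with the g flag, as in the question. The sample text is invented; the second header contains an apostrophe, which \w does not match.

```javascript
// Hypothetical sample text: numbered question headers followed by body text.
const faqText = [
  "1. What is regex?",
  "Some body text.",
  "2. What's the catch?",
  "More body text.",
].join("\n");

// Word characters only: misses the header with the apostrophe.
const wordHeaders = faqText.match(/\d\.\s+\w+(?:\s+\w+)*\s*\?/g);

// Broader variant: anything except whitespace and "?" counts as part of a word.
const broadHeaders = faqText.match(/\d\.\s+[^\s?]+(?:\s+[^\s?]+)*\?/g);

console.log(wordHeaders);  // [ '1. What is regex?' ]
console.log(broadHeaders); // [ '1. What is regex?', "2. What's the catch?" ]
```

One thing to watch with the broader variant: because it also accepts dots, a numbered line that does not end in a question mark can be merged with the text that follows it, up to the next question mark.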

How to extract the last word in a string with a JavaScript regex?

What I need is the last match. In the case below, that is the word test without the $ signs or any other special character:
Test String:
$this$ $is$ $a$ $test$
Regex:
\b(\w+)\b
The $ represents the end of the string, so...
\b(\w+)$
However, your test string seems to have dollar sign delimiters, so if those are always there, then you can use that instead of \b.
\$(\w+)\$$
var s = "$this$ $is$ $a$ $test$";
document.body.textContent = /\$(\w+)\$$/.exec(s)[1];
If there could be trailing spaces, then add \s* before the end.
\$(\w+)\$\s*$
And finally, if there could be other non-word stuff at the end, then use \W* instead.
\b(\w+)\W*$
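The most general pattern of this series can be wrapped in a small function. This is only a sketch; the name lastWord and the extra inputs are assumptions.

```javascript
// Hypothetical helper: returns the last run of word characters,
// ignoring any non-word characters that follow it.
function lastWord(text) {
  const m = /\b(\w+)\W*$/.exec(text);
  return m ? m[1] : null;
}

console.log(lastWord("$this$ $is$ $a$ $test$")); // test
console.log(lastWord("trailing spaces   "));     // spaces
console.log(lastWord("$$$"));                    // null (no word characters at all)
```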
In some cases a word may be followed by non-word characters. For example, take the following sentence:
Marvelous Marvin Hagler was a very talented boxer!
If we want to match the word boxer, the previous answers that expect the word directly at the end of the string will not suffice, because an exclamation mark follows the word. The following expression ensures a successful match and also takes into account extraneous whitespace, newlines and any other non-word characters.
[a-zA-Z]+?(?=\s*?[^\w]*?$)
https://regex101.com/r/D3bRHW/1
The expression expresses the following:
We are looking for letters only, either uppercase or lowercase ([a-zA-Z]).
We expand only as far as necessary (the lazy +?).
We use a positive lookahead ((?=...)).
Inside it, we allow optional whitespace after the word (\s*?).
We then allow any further non-word characters ([^\w]*?).
We assert the end of the input ($).
The benefits here are that we do not need to set any flags or use word boundaries, non-word characters are taken into account, and we do not need to reach for negation.
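A brief sketch of how this expression behaves on a few inputs. The third input is made up to show a consequence of matching letters only: a trailing number is not treated as a word.

```javascript
const lastLetters = /[a-zA-Z]+?(?=\s*?[^\w]*?$)/;

const boxerMatch = lastLetters.exec("Marvelous Marvin Hagler was a very talented boxer!");
const dollarMatch = lastLetters.exec("$this$ $is$ $a$ $test$");
const digitMatch = lastLetters.exec("room 42"); // hypothetical input ending in digits

console.log(boxerMatch && boxerMatch[0]);   // boxer
console.log(dollarMatch && dollarMatch[0]); // test
console.log(digitMatch);                    // null
```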
var input = "$this$ $is$ $a$ $test$";
If you use var result = input.match(/\b\w+\b/g), an array of all the matches will be returned. You can then get the last one by calling pop() on the result or by reading result[result.length - 1].
Your regex will find a word, and since regexes operate left to right it will find the first word.
A \w+ matches as many consecutive alphanumeric characters as it can, but it must match at least 1.
A \b matches the position between an alphanumeric character and a non-alphanumeric character. In your case these are the positions next to the '$' characters.
What you need is to anchor your regex to the end of the input which is denoted in a regex by the $ character.
To support an input that may have more than just a '$' character at the end of the line, spaces or a period for instance, you can use \W+ which matches as many non-alphanumeric characters as it can:
\$(\w+)\W+$
Avoid regex - use .split and .pop the result. Use .replace to remove the special characters:
var match = str.split(' ').pop().replace(/[^\w\s]/gi, '');
DEMO

Match ":)" smiley followed by word boundary

I am trying to match smileys followed by a word boundary \b.
Let's say I wanna match :p and :) followed by \b.
/(:p)\b/ is working fine but why is /(:\))\b/ behaving the opposite?
You cannot use a word boundary here as ) is a non-word character.
Simply put: \b allows you to perform a whole words only search using
a regular expression in the form of \bword\b. A word character is a
character that can be used to form words. All characters that are not
word characters are non-word characters.
Use (:\)) to match :) and capture it in the first capturing group.
Use /(:\))(?![a-z0-9_])/i in order to avoid matching any :)s with letters after the smiley. It is an equivalent of (:\))\B.
\B is the negated version of \b. \B matches at every position where \b
does not. Effectively, \B matches at any position between two word
characters as well as at any position between two non-word characters.
See demo 1 and demo 2.
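The behaviour described above can be verified with a few calls to test(). This is a minimal sketch; the variable names and inputs are made up.

```javascript
const smileyWithB = /(:\))\b/;    // needs a word character right after ")"
const smileyWithNotB = /(:\))\B/; // needs a non-word character or the end after ")"
const tongueWithB = /(:p)\b/;     // "p" is a word character, so \b works as expected

const smileyChecks = {
  bThenSpace: smileyWithB.test(":) hi"),      // false
  bThenLetter: smileyWithB.test(":)hi"),      // true
  notBThenSpace: smileyWithNotB.test(":) hi"), // true
  notBAtEnd: smileyWithNotB.test(":)"),        // true
  notBThenLetter: smileyWithNotB.test(":)hi"), // false
  tongueThenSpace: tongueWithB.test(":p hi"),  // true
  tongueInWord: tongueWithB.test(":pizza"),    // false
};
console.log(smileyChecks);
```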
In addition to stribizhev's answer, you can use (:\))\B
Examples for when to use what:
\b : string = That man is batman. regex = \bman\b matches only man and not the man in batman, because the position between t and m is not a word boundary (both are word characters).
\B : string = I am bat-man and he is super - man. regex = \B-\B matches the - in super - man, whereas \b-\b matches the - in bat-man, since the positions between t and - and between - and m are word boundaries, while those between (space) and - and between - and (space) are not.
Note: It is easy to understand if you consider \b or \B as a position between two characters and ask whether the transition from one character to the next is word to word, or word to non-word.
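The two example strings above can be checked directly. The sketch uses String.prototype.search, which returns the index of the first match; the variable names are assumptions.

```javascript
const hyphenSample = "I am bat-man and he is super - man.";

// \b-\b: hyphen with word characters on both sides (the one in "bat-man").
const boundaryIndex = hyphenSample.search(/\b-\b/);
// \B-\B: hyphen with non-word characters on both sides (the one in "super - man").
const nonBoundaryIndex = hyphenSample.search(/\B-\B/);

console.log(boundaryIndex);    // 8
console.log(nonBoundaryIndex); // 29

// \bman\b finds the standalone word only, not the "man" inside "batman".
const manMatches = "That man is batman.".match(/\bman\b/g);
console.log(manMatches); // [ 'man' ]
```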
