regex to match question sentences in long text - javascript

I have a long text in form of a string.
This text includes a lot of questions that are at the same time the headers of sections.
These headers always start with a number+dot+whitespace character combination and end with a question mark. I am trying to extract these strings.
This is what I've got so far: longString.match(/\d\.\s+[a-zA-Z]+\s\\?/g).
Sure enough this doesn't work.

In your example you use [a-zA-Z]+, but you might extend that to matching 1 or more word characters using \w+.
The part at the end of the pattern, \s\\?, matches an expected whitespace char followed by an optional backslash.
To match multiple words, you can optionally repeat the pattern to match a word preceded by 1 or more whitespace characters.
One option is to use
\d\.\s+\w+(?:\s+\w+)*\s*\?
Explanation
\d\. Match a single digit and a dot (for 1 or more digits use \d+)
\s+\w+ Match 1+ whitespace chars and 1+ word chars
(?:\s+\w+)* Optionally repeat 1+ whitespace chars and 1+ word chars
\s*\? Match 0+ whitespace chars and a question mark
A broader pattern matches, at least once, any char except a question mark or whitespace char after the digit, dot and whitespace:
\d\.\s+[^\s?]+(?:\s+[^\s?]+)*\?
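To see how the two patterns above differ, here is a minimal sketch. The sample text in longString is made up for illustration (the question did not include one), and the variable names are my own.

```javascript
// Hypothetical sample text; the question did not include one.
const longString = [
  "1. What is a regex?",
  "A pattern that describes a set of strings.",
  "2. Why doesn't my pattern match?",
  "Because an apostrophe is not a word character.",
].join("\n");

// Pattern 1: words made of word characters (\w) only.
const wordsOnly = /\d\.\s+\w+(?:\s+\w+)*\s*\?/g;

// Pattern 2: words made of any char except whitespace and "?".
const broader = /\d\.\s+[^\s?]+(?:\s+[^\s?]+)*\?/g;

// match() returns null when nothing matches, so fall back to [].
const strictHeaders = longString.match(wordsOnly) ?? [];
const broadHeaders = longString.match(broader) ?? [];

console.log(strictHeaders); // [ '1. What is a regex?' ]
console.log(broadHeaders);  // both headers
```

The first pattern skips the second header because of the apostrophe in "doesn't". Note that \s also matches newlines, so both patterns can run across line breaks if a numbered line has no question mark of its own.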

Related

How do I write a RegEx that starts reading from behind?

I have a series of words I try to capture.
I have the following problem:
The string ends with a fixed set of words
It is not clearly defined how many words the string consists of. However, it should capture all words that start with an upper case letter (German language). Therefore, the left anchor should be the first word starting with lower case.
Example (bold is what I try to capture):
I like Apple Bananas And Cars.
building houses Might Be Salty + Hard said Jessica.
This is the RegEx I tried so far; it only works if the "non-capture" string does not include any upper case words:
/(?:[a-zäöü]*)([\p{L} +().&]+[Cars|Hard])/gu
You might start the match with an uppercase character, allowing German uppercase chars as well, and then optionally repeat matching either words that start with an uppercase character or a "special" character.
Then end the match with an alternation matching either Hard or Cars.
(?<!\S)[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*(?:\s+(?:[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*|[+()&]))*\s+(?:Hard|Cars)\b
Explanation
(?<!\S) Assert a whitespace boundary to the left to prevent starting the match after a non whitespace char
[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]* Match a word that starts with an uppercase char
(?: Non capture group to match as a whole part
\s+ Match 1+ whitespace chars
(?: Non capture group
[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]* Match a word that starts with uppercase
| Or
[+()&] Match one of the "special" chars
) Close the non capture group
)* Close the non capture group and optionally repeat it
\s+ Match 1+ whitespace chars
(?:Hard|Cars) Match one of the alternatives
\b A word boundary to prevent a partial word match
Use \p{Lu} for uppercase letters (in JavaScript, \p{...} only works with the u flag):
(?:[\p{Lu}+()&][\p{L}+()&]* )+(?:Cars|Hard)
See live demo (showing matching umlauted letters and ß).
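A quick sketch running both patterns against the two sample sentences from the question. The helper firstMatch and the variable names are hypothetical, added only for this illustration.

```javascript
// Sample sentences from the question.
const sentences = [
  "I like Apple Bananas And Cars.",
  "building houses Might Be Salty + Hard said Jessica.",
];

// Explicit character classes; the lookbehind (?<!\S) needs an ES2018+ engine.
const explicit = /(?<!\S)[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*(?:\s+(?:[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*|[+()&]))*\s+(?:Hard|Cars)\b/;

// Unicode property escapes; \p{...} only works with the u flag.
const unicode = /(?:[\p{Lu}+()&][\p{L}+()&]* )+(?:Cars|Hard)/u;

// Hypothetical helper: return the matched text, or null if there is no match.
const firstMatch = (re, text) => {
  const m = text.match(re);
  return m ? m[0] : null;
};

const explicitResults = sentences.map((s) => firstMatch(explicit, s));
const unicodeResults = sentences.map((s) => firstMatch(unicode, s));

console.log(explicitResults); // [ 'Apple Bananas And Cars', 'Might Be Salty + Hard' ]
console.log(unicodeResults);  // same two strings
```

In both cases the leading "I" is skipped because the word after it ("like") starts with a lower case letter, so the run of capitalized words cannot reach the fixed ending from there.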

JavaScript regex with below rules

Need to create a regex for a string with below criteria
Allowable characters:
uppercase A to Z A-Z
lowercase a to z a-z
hyphen -
apostrophe '
single quote `
space
full stop .
numerals 0 to 9 0-9
Validations:
Must start with an alphabetic character a-zA-Z or apostrophe
Cannot have consecutive non-alpha characters except for a full stop followed by a space.
The regex I have (shown below) is from a previous question on this forum. The business came back and wants to allow strings starting with an apostrophe as well as [a-zA-Z]. This breaks some of the previous validations.
eg: a1rte is valid
'tyer4 is valid
'4rt is invalid
^(?!.*[0-9'`\.\s-]{2})[a-zA-Z][a-zA-Z0-9-`'.\s]+$
Please advise.
You might use
^(?=[a-zA-Z0-9`'. -]+$)(?!.*[0-9'` -]{2})[a-zA-Z'][^\r\n.]*(?:\.[ a-z][^\r\n.]*)*$
Explanation
^ Start of string
(?=[a-zA-Z0-9`'. -]+$) Assert only allowed characters
(?!.*[0-9'` -]{2}) Assert not 2 consecutive listed characters
[a-zA-Z'] Match either a char a-zA-Z or apostrophe
[^\r\n.]* Optionally match any char except a newline or a dot
(?:\.[ a-z][^\r\n.]*)* Optionally repeat matching a dot only followed by a space or char a-z
$ End of string
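As a rough check, the pattern above can be run against the three examples from the question. The last three inputs are hypothetical ones I added to exercise the dot rule, the consecutive-character rule and the first-character rule.

```javascript
const namePattern = /^(?=[a-zA-Z0-9`'. -]+$)(?!.*[0-9'` -]{2})[a-zA-Z'][^\r\n.]*(?:\.[ a-z][^\r\n.]*)*$/;

const cases = [
  "a1rte",     // from the question: valid
  "'tyer4",    // from the question: valid
  "'4rt",      // from the question: invalid (apostrophe directly followed by a digit)
  "Dr. Smith", // hypothetical: full stop followed by a space is allowed
  "a--b",      // hypothetical: two consecutive non-alpha chars
  "1abc",      // hypothetical: starts with a digit
];

// Pair every input with the test result.
const verdicts = cases.map((value) => [value, namePattern.test(value)]);

console.log(verdicts.map(([, ok]) => ok));
// [ true, true, false, true, false, false ]
```

Be aware that [ a-z] after the dot only admits a space or a lower case letter, so a dot directly followed by an upper case letter is rejected by this pattern.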

Regex to disallow certain characters in specific sequence

I have a regular expression that allows only letters, numbers, spaces or hyphens. However, I'd like to disallow the user to do the following:
hello--world Have more than one hyphen sitting next to each other
--hello Have a hyphen in the beginning. It must have a number or letter first
How do I accomplish this? My current regex looks like this:
let alphanumericTest = new RegExp("^\s*([0-9a-zA-Z- ]*)\s*$");
You can try this regular expression: ^\s*[0-9a-zA-Z](?:(?!--)[0-9a-zA-Z- ])*$
You could make your match a bit more efficient by avoiding the negative lookahead for consecutive hyphens and instead using repeating groups, which can optionally start with a hyphen after the first word.
^[ ]*[0-9a-zA-Z]+(?:-[0-9a-zA-Z]+)*-?(?:[ ]+-?(?:[0-9a-zA-Z]+-?)*)*$
(Used [ ] to match a space for clarity)
Explanation
^ Start of string
[ ]* Match 0+ spaces
[0-9a-zA-Z]+ Match 1+ times any of the listed
(?:-[0-9a-zA-Z]+)* Repeat 0+ times matching a hyphen and 1+ what is listed
-? Match optional hyphen
(?: Non capturing group
[ ]+-?(?:[0-9a-zA-Z]+-?)* Match 1+ spaces, optional hyphen, repeat 0+ times what is listed and optional hyphen
)* Close outer non capturing group and repeat 0+ times
$ End of string
Try:
let alphanumericTest = new RegExp("^(?!-)(?!.*--)[0-9a-zA-Z- ]+(?<!-)$");
This checks that the first character is not a -, that there are no consecutive --s anywhere in the string, and (via the lookbehind (?<!-)) that the string does not end with a -.
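For comparison, here is a small sketch that runs the repeating-groups pattern and the lookaround pattern from the answers above over the same inputs. The sample strings and variable names are my own. It also shows what happens to \s when a pattern is passed to new RegExp as a plain string, as in the question's code.

```javascript
// Pattern built from repeating groups (no lookarounds).
const groupsPattern = /^[ ]*[0-9a-zA-Z]+(?:-[0-9a-zA-Z]+)*-?(?:[ ]+-?(?:[0-9a-zA-Z]+-?)*)*$/;

// Pattern built from lookarounds; (?<!-) needs an ES2018+ engine.
const lookaroundPattern = /^(?!-)(?!.*--)[0-9a-zA-Z- ]+(?<!-)$/;

// Hypothetical inputs chosen for illustration.
const samples = ["hello-world", "hello world", "hello--world", "--hello", "hello-"];

const groupsResults = samples.map((s) => groupsPattern.test(s));
const lookaroundResults = samples.map((s) => lookaroundPattern.test(s));

console.log(groupsResults);     // [ true, true, false, false, true ]
console.log(lookaroundResults); // [ true, true, false, false, false ]

// Inside an ordinary string literal "\s" is just "s", so the backslash
// has to be doubled when the pattern goes through new RegExp.
const lostEscape = new RegExp("^\s*$").source;  // ^s*$
const keptEscape = new RegExp("^\\s*$").source; // ^\s*$
console.log(lostEscape, keptEscape);
```

The two patterns disagree only on the trailing hyphen in "hello-": the repeating-groups version accepts it, the lookaround version rejects it. Which behaviour is wanted is not stated in the question.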

How to match any string that contains no consecutively repeating letter

My regular expression should match if there aren't any consecutive letters that are the same.
for example :
"ploplir" should match
"ploppir" should not match
so I use this regular expression:
/([.])\1{1,}/
But It does the exact contrary of what I want. How can I make the match work correctly?
Code
\b(?!\w*(\w)\1)\w+\b
var r = /\b(?!\w*(\w)\1)\w+\b/g;
var s = "ploplir ploppir";
console.log(s.match(r)); // ["ploplir"]
Explanation
\b Assert position as a word boundary
(?!\w*(\w)\1) Negative lookahead ensuring what follows doesn't match
\w* Match any number of word characters
(\w) Capture a word character into capture group 1
\1 Match the same text as most recently matched by the 1st capture group
\w+ Match one or more word characters
\b Assert position as a word boundary
Maybe you could use lookarounds to check if there are no consecutive letters in the string:
^(?!.*(.)(?=\1)).*$
Explanation
From the beginning of the string ^
A negative look ahead (?!
Which asserts that following .* a character (.) is not followed by the same character (?=\1) using the group reference \1
Close the negative lookahead
Match zero or more characters .*
The end of the string
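A short sketch of the lookahead pattern in use, next to a simpler alternative that is not from the answers: search for a repeated character with /(.)\1/ and negate the result. It also shows why the pattern from the question fails, since [.] inside a character class is a literal dot. All variable names are mine.

```javascript
// Lookahead pattern from the answer above.
const noRepeat = /^(?!.*(.)(?=\1)).*$/;

// Alternative (my own suggestion): find a repeated char, then negate.
const hasRepeat = /(.)\1/;

// Pattern from the question: [.] only matches a literal dot.
const original = /([.])\1{1,}/;

const words = ["ploplir", "ploppir"];
const viaLookahead = words.filter((w) => noRepeat.test(w));
const viaNegation = words.filter((w) => !hasRepeat.test(w));

console.log(viaLookahead); // [ 'ploplir' ]
console.log(viaNegation);  // [ 'ploplir' ]

const originalOnWord = original.test("ploppir"); // false: no doubled dot
const originalOnDots = original.test("a..b");    // true: doubled dot
console.log(originalOnWord, originalOnDots);
```

The negated test is often easier to read when the regex is only used for a yes/no check; the lookahead form is useful when the check has to live inside a single pattern.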

Regex allow multiple words in a sentence

I'm trying to parse following sentences with regex (javascript) :
I wish a TV
I want some chocolate
I need fire
Currently I'm trying : I(\b[a-zA-Z]*\b){0,5}(TV|chocolate|fire) but it doesn't work. I also made some tests with \w but no luck.
I want to allow any word (max 5 words) between "I" and the last word, which is predefined.
To account for non-word chars in-between words, you may use
/I(?:\W+\w+){0,5}\W+(?:TV|chocolate|fire)/
The point is that you added word boundaries, but did not account for spaces, punctuation, etc. (all the other non-word chars) between "words".
Pattern details:
I - matches the left delimiter
(?:\W+\w+){0,5}\W+ - matches 0 to 5 sequences (due to the limiting quantifier {n,m}) of 1+ non-word chars (\W+) and 1+ word chars after them (\w+), and a \W+ at the end matches 1 or more non-word chars that must be present to separate the last matched word chars from the...
(?:TV|chocolate|fire) - matches the trailing delimiter
You need to add the whitespace after the I. Otherwise it wouldn't capture the whole sentence.
I(\b[a-zA-Z ]*\b){0,5}(TV|chocolate|fire)
A great site to test regular expressions is RegExr.
If you don't care about the spaces, use:
/I(\s[a-zA-Z]*\s?){0,5}(TV|chocolate|fire)/
Try
/I\s+(?:\w+\s+){0,5}(TV|chocolate|fire)/
Based on Stefan Kert's version, but it relies on the whitespace to the right of each extra word instead of word boundaries.
It also accepts words made of any valid "word" (\w) characters, of any length, separated by any whitespace characters (repetitions allowed).
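Here is a small sketch comparing the \s-based pattern with the \W-based pattern from the earlier answer. The first three sentences are from the question; the last two are hypothetical inputs I added to show the 5-word limit and the effect of punctuation.

```javascript
// Whitespace-separated words, capturing the final keyword.
const spaced = /I\s+(?:\w+\s+){0,5}(TV|chocolate|fire)/;

// Any non-word chars (spaces, commas, ...) may separate the words.
const punctuated = /I(?:\W+\w+){0,5}\W+(?:TV|chocolate|fire)/;

const inputs = [
  "I wish a TV",
  "I want some chocolate",
  "I need fire",
  "I really do want to buy a new TV", // hypothetical: 7 words in between
  "I need, um, fire",                 // hypothetical: commas between words
];

// Captured keyword for the \s version, or null when there is no match.
const wanted = inputs.map((s) => {
  const m = spaced.exec(s);
  return m ? m[1] : null;
});

const punctuatedOk = inputs.map((s) => punctuated.test(s));

console.log(wanted);       // [ 'TV', 'chocolate', 'fire', null, null ]
console.log(punctuatedOk); // [ true, true, true, false, true ]
```

Neither pattern is anchored, so inside a longer text they can match starting at any capital "I"; add ^ and $ or word boundaries if that matters for your input.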
