How to count two words as 1 in same line - javascript

In the text file I've got, each sentence is represented with a specific type such as: contrast.
A contrasting sentence can either be represented with a tag "CONTRAST" or "CONTR" or "WEAKCONTR". For instance:
IMPSENT_CONTRAST_VIS(Studying networks in this way can help to
identify the people from whom an individual learns , where
conflicts_MD:+ in understanding_MD:+ may originate , and which
contextual factors influence learning .)
I count these with the following expression, which works perfectly fine: /(\_(WEAK))|(\_CONTRAST)|(\_CONTR(\_|\())/g
Now the problem is some sentences are expressed with more than one contrast tag such as CONTR & WEAKCONTR together. For instance:
IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS(Studying_MD:+ networks in this way
can help to identify_MD:+ the people from whom an individual learns
, where conflicts_MD:+ in understanding_MD:+ may originate , and
which contextual factors influence learning .)
At this point I have to count these as 1, not 2. Do you have any idea whether this is possible with a RegExp?

You can use lookaheads to assert it, and then count the matches:
(?=\w*_(?:WEAK|CONTRAST|CONTR[_(]))\b\w+\b
Demo here: http://regex101.com/r/xP2yI7/3
Notice the match count.
This will match the whole IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS expression, but only if it matches the part in the lookahead, which filters for the keywords you're looking for. This will match even if you have multiple such sentences on the same line.
Also, I've simplified your regex a bit, retaining the same meaning. Notice you don't have to escape the _.
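A minimal sketch of how the count could be taken in JavaScript with this lookahead pattern. The helper name countContrastSentences and the shortened sample sentences are assumptions made for illustration; the character class is written as [_(] so that it accepts the same "_" or "(" as the question's original expression.

```javascript
// Each tag block (e.g. IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS) is matched once,
// and only if it contains at least one contrast keyword.
const contrastTag = /(?=\w*_(?:WEAK|CONTRAST|CONTR[_(]))\b\w+\b/g;

// Hypothetical helper: number of tag blocks that carry a contrast keyword.
function countContrastSentences(text) {
  const matches = text.match(contrastTag);
  return matches === null ? 0 : matches.length;
}

// Shortened versions of the two sentences from the question.
const sample = [
  "IMPSENT_CONTRAST_VIS(Studying networks in this way can help .)",
  "IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS(Studying_MD:+ networks in this way .)",
].join("\n");

console.log(countContrastSentences(sample)); // 2: the double-tagged sentence counts once
```

Because \w+ consumes the whole tag block in one match, a second keyword inside the same block cannot start another match.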

You really just care if the tag shows up in the line at all, so just grab the whole line, provided it has a tag, like so:
/^([A-Z_]+(WEAK|CONTRAST|CONTR)+[A-Z_]*)/gm
From the start of the line ^ look for a word block with A-Z or _ followed by the tag, optionally followed by more words/underscores.
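As a rough illustration of the line-anchored approach, the sketch below counts the lines whose leading tag block contains one of the keywords. The sample text is made up for the example and assumes that every tag block starts at the beginning of a line.

```javascript
// With the m flag, ^ matches at the start of every line, so each line
// can contribute at most one match.
const lineWithTag = /^([A-Z_]+(WEAK|CONTRAST|CONTR)+[A-Z_]*)/gm;

// Hypothetical input: three sentences, one of which wraps onto a second line.
const text = [
  "IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS(first sentence",
  "continued on a second line .)",
  "IMPSENT_VIS(no contrast here .)",
  "IMPSENT_CONTR(plain contrast .)",
].join("\n");

const tagged = text.match(lineWithTag) || [];
console.log(tagged.length); // 2
console.log(tagged[0]);     // IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS
```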

Can you try adding \w+:
/(\_(WEAK\w+))|(\_CONTRAST\w+)|(\_CONTR(\_\w+|\())/g

Something like this?
(^(\_(WEAK))|(\_CONTRAST)|(\_CONTR(\_|\()))

Related

Regex for bible references

I am working on some code for an online bible. I need to identify when references are written out. I have looked all through Stack Overflow and tried various regex examples, but they all seem to fail with single books (e.g. Jude) because they require a number to precede the book name. Here is my solution so far:
/((?:(I+|1st|2nd|3rd|First|Second|Third|1|2|3))?( )?(Gen|Ge|Gn|Exo|Ex|Exod|Lev|Le|Lv|Num|Nu|Nm|Nb|Deut|Dt|Josh|Jos|Jsh|Judg|Jdg|Jg|Jdgs|Rth|Ru|Sam|Samuel|Kings|Kgs|Kin|Chron|Chronicles|Ezra|Ezr|Neh|Ne|Esth|Es|Job|Job|Jb|Pslm|Ps|Psalms|Psa|Psm|Pss|Prov|Pr|Prv|Eccles|Ec|Song|So|Canticles|Song of Songs|SOS|Isa|Is|Jer|Je|Jr|Lam|La|Ezek|Eze|Ezk|Dan|Da|Dn|Hos|Ho|Joel|Joe|Jl|Amos|Am|Obad|Ob|Jnh|Jon|Micah|Mic|Nah|Na|Hab|Zeph|Zep|Zp|Haggai|Hag|Hg|Zech|Zec|Zc|Mal|Mal|Ml|Matt|Mt|Mrk|Mk|Mr|Luk|Lk|John|Jn|Jhn|Acts|Ac|Rom|Ro|Rm|Co|Cor|Corinthians|Gal|Ga|Ephes|Eph|Phil|Php|Col|Col|Th|Thes|Thess|Thessalonians|Ti|Tim|Timothy|Titus|Tit|Philem|Phm|Hebrews|Heb|James|Jas|Jm|Pe|Pet|Pt|Peter|Jn|Jo|Joh|Jhn|John|Jude|Jud|Rev|The Revelation|Genesis|Exodus|Leviticus|Numbers|Deuteronomy|Joshua|Judges|Ruth|Samuel|Kings|Chronicles|Ezra|Nehemiah|Esther|Job|Psalms|Psalm|Proverbs|Ecclesiastes|Song of Solomon|Isaiah|Jeremiah|Lamentations|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Matthew|Mark|Luke|John|Acts|Romans|Corinthians|Galatians|Ephesians|Philippians|Colossians|Thessalonians|Timothy|Titus|Philemon|Hebrews|James|Peter|John|Jude|Revelation))(([ .)\n|])([^a-zA-Z]))([\d])?([:\d])?([:\d])?/gi;
Here is the regex code with some sample text to match:
https://regexr.com/5pfg3
In the sample above you will notice that Jude is matched if it is followed by two spaces, or if I put a full stop after it. I know the issue is this section:
(([ .)\n|])([^a-zA-Z]))
What I want is to match spaces, brackets, new lines BUT not a letter.
It does not match because (([ .)\n|])([^a-zA-Z])) expects two characters, and the second one cannot be a letter a-zA-Z due to the negated character class, so it cannot match the s in "Jude some".
What you might do is make the character class in the second part optional, if you intend to keep all the capture groups.
You could also add word boundaries \b to make the pattern a bit more performant than it is right now.
(Note that Jude is listed twice in the alternation)
If you only want to use 3 groups, you can write the first part as:
\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?
The second part will be the alternation with the names. In the third part you can match one character from the character class followed by the digit part, and make that optional as a whole (so you don't match a trailing space or other character after the word when no digits follow).
(?:[ .)\n|](\d+(?::\d+){0,2}\b))?
The full pattern will look like
\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?(Gen|Ge|Gn|Exo|Ex|Exod|Lev|Le|Lv|Num|Nu|Nm|Nb|Deut|Dt|Josh|Jos|Jsh|Judg|Jdg|Jg|Jdgs|Rth|Ru|Sam|Samuel|Kings|Kgs|Kin|Chron|Chronicles|Ezra|Ezr|Neh|Ne|Esth|Es|Job|Job|Jb|Pslm|Ps|Psalms|Psa|Psm|Pss|Prov|Pr|Prv|Eccles|Ec|Song|So|Canticles|Song of Songs|SOS|Isa|Is|Jer|Je|Jr|Lam|La|Ezek|Eze|Ezk|Dan|Da|Dn|Hos|Ho|Joel|Joe|Jl|Amos|Am|Obad|Ob|Jnh|Jon|Micah|Mic|Nah|Na|Hab|Zeph|Zep|Zp|Haggai|Hag|Hg|Zech|Zec|Zc|Mal|Mal|Ml|Matt|Mt|Mrk|Mk|Mr|Luk|Lk|John|Jn|Jhn|Acts|Ac|Rom|Ro|Rm|Co|Cor|Corinthians|Gal|Ga|Ephes|Eph|Phil|Php|Col|Col|Th|Thes|Thess|Thessalonians|Ti|Tim|Timothy|Titus|Tit|Philem|Phm|Hebrews|Heb|James|Jas|Jm|Pe|Pet|Pt|Peter|Jn|Jo|Joh|Jhn|John|Jude|Jud|Rev|The Revelation|Genesis|Exodus|Leviticus|Numbers|Deuteronomy|Joshua|Judges|Ruth|Samuel|Kings|Chronicles|Ezra|Nehemiah|Esther|Job|Psalms|Psalm|Proverbs|Ecclesiastes|Song of Solomon|Isaiah|Jeremiah|Lamentations|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Matthew|Mark|Luke|John|Acts|Romans|Corinthians|Galatians|Ephesians|Philippians|Colossians|Thessalonians|Timothy|Titus|Philemon|Hebrews|James|Peter|John|Revelation)\b(?:[ .)\n|](\d+(?::\d+){0,2}\b))?
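The sketch below shows how the three-group pattern behaves on a short sentence. To keep it readable, the book alternation is trimmed to a handful of names; that short list, the variable names and the sample sentence are assumptions for illustration, and the full alternation from the pattern above would be used in practice.

```javascript
// Trimmed, hypothetical book list; the real pattern lists every name and abbreviation.
const books = ["Genesis", "Gen", "Samuel", "Sam", "John", "Jn", "Jude", "Revelation", "Rev"];

const reference = new RegExp(
  "\\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?" + // group 1: optional ordinal
    "(" + books.join("|") + ")\\b" +                     // group 2: book name
    "(?:[ .)\\n|](\\d+(?::\\d+){0,2}\\b))?",             // group 3: optional chapter:verse
  "gi"
);

const text = "See Jude some time, then 1 John 3:16 and Gen 1:1.";
const found = [...text.matchAll(reference)].map((m) => m[0]);

console.log(found); // [ 'Jude', '1 John 3:16', 'Gen 1:1' ]
```

Because the digit part is optional as a whole, "Jude" matches on its own without swallowing the space that follows it.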

Regex: Match number between two strings with line breaks

I want to match only the value of this string that ends with ZW-Summe with RegEx (JavaScript, so no lookbehind - please consider: I must use regex):
[here is a lot more data with line breaks and so on...]
2,550%Zinsen 83,72ZW-Summe U
St 83,72Umsatzs? [more lines...]
Problem: There can be line breaks everywhere; even this could happen:
[data....] 2,550%Zinsen 83,7
2ZW-Summe U
St 83,72Umsatzs? [more lines...]
My goal is to match 83,72 only, without ZW-Summe and of course the value can change. Possible values:
1.000,22
0,22
222,22
100.000,22 and so on.
I have to identify the value with the ZW-Summe String because there can be more occurrences of values.
My first attempt is ((\d{0,3}((\.\d{3})){0,2}),\d{2})ZW-Sum, but this also matches ZW-Sum and I am not able to access groups. The main problem is that it does not ignore possible line breaks.
I hope it is even possible to match something like (-VALUE-ZW-Summe) and then ignore the ZW-Summe in the result?
Thank you for any suggestions.
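This worked for me, and is quite simple: /([\d,.\n]+)(?=ZW-Summe)/g.
It matches a series of digits, commas, periods, and newlines that is immediately followed by ZW-Summe ((?=...) is a positive lookahead, so ZW-Summe itself is not part of the match). Line breaks inside the value are tolerated because \n is part of the character class; the g flag only makes the regex find every occurrence instead of just the first.
After you run it, though, make sure to strip the newlines, i.e. match = match.replace(/\n/g, ''); (with a string as the first argument, replace('\n', '') would remove only the first newline).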
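A minimal sketch of the lookahead approach end to end, using the second sample from the question (the value split across a line break). The variable names are assumptions; the newlines are removed with a global regex so that every line break inside the value is stripped.

```javascript
// Digits, commas, periods and newlines that are directly followed by ZW-Summe.
const valueBeforeLabel = /([\d,.\n]+)(?=ZW-Summe)/g;

const text = "2,550%Zinsen 83,7\n2ZW-Summe U\nSt 83,72Umsatzs";

// Collect every match, then strip the line breaks from each one.
const values = (text.match(valueBeforeLabel) || []).map((v) => v.replace(/\n/g, ""));

console.log(values); // [ '83,72' ]
```

The second 83,72 is not reported because it is followed by Umsatzs rather than ZW-Summe.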

Regular expression handle multiple matches like one, how to fix?

I have a regex and a string that contains several matches for it. My regex treats all of these matches as one big match (of course I don't want such behaviour). Let me show you an example:
My test string (sorry for scribble, but this doesn't matter):
sdfsd -dsf- sdfsdfssdfsfdsfsd -sdfsdf-
my regex in js code:
view.replace(/(\-(.+)\-)/g, '<span style="background-color:yellow">$1</span>');
my result:
sdfsd<span style="background-color:yellow">-dsf- sdfsdfssdfsfdsfsd -sdfsdf-</span>
As you can see, each of the strings between "-" characters must be enclosed in its own span, but there is only one span. How can I fix this? (honestly I don't want to change my (.+) regex part, which I think might be a problem, but if there is no other way to do this, let me know).
In other words, result must be:
sdfsd<span style="background-color:yellow">-dsf-</span> sdfsdfssdfsfdsfsd <span style="background-color:yellow">-sdfsdf-</span>
Feel free to ask me in the comments, and thanks for your help.
honestly I don't want to change my (.+) regex part, which I think might be a problem
Why not? It is actually the source of the problem. You can try the following regex, which would work:
/(\-([^-]+)\-)/g
And if you think that dashes can appear between the delimiting - and - themselves, you can use the less efficient:
/(\-(.+?)\-)/g
+? causes a lazy match. In other words, after the initial - has matched, .+? matches a single character and then hands control to the following -, which tries to match a dash. If it cannot, .+? consumes another character from the input, and so on until the following - is able to match.
You can try:
view.replace(/(-[^-]+-)/g, '<span style="background-color:yellow">$1</span>');
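To make the difference concrete, here is a small sketch that runs the greedy, negated-class and lazy variants on the test string from the question. The variable names are assumptions for illustration.

```javascript
const view = "sdfsd -dsf- sdfsdfssdfsfdsfsd -sdfsdf-";
const wrap = '<span style="background-color:yellow">$1</span>';

// Greedy: .+ runs to the last dash, so everything ends up in one span.
const greedy = view.replace(/(-(.+)-)/g, wrap);

// Negated class: the inner part can never cross a dash.
const negated = view.replace(/(-([^-]+)-)/g, wrap);

// Lazy: .+? stops at the first dash that lets the pattern succeed.
const lazy = view.replace(/(-(.+?)-)/g, wrap);

console.log(greedy.split("<span").length - 1);  // 1
console.log(negated.split("<span").length - 1); // 2
console.log(negated === lazy);                  // true
```

On this input the negated-class and lazy versions produce the same result; they can differ on other inputs, for example when a dash directly follows the opening dash.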

Optional lookahead in javascript

I was trying to build a regex for user input; I'm building a form based on that input. Let's assume that the user sets the CSS class to "icon-[anything]" (a Bootstrap icon). In this case I have to ensure that the hyphen is not repeated ("--"), and also that "icon-white" is the only class assigned beside the other one; this "icon-white" has to be optional as well.
/^icon-[a-z-]+(\ icon-white)?$/ - this regex works fine for the optional icon-white scenario, but I am having trouble avoiding the repetition of '--'.
If you want to match "icon-somevalue" but not "icon-white" try
icon-(?!white).*
If I understand correctly (although I'm not sure I do, sorry...) I think you're saying that the following two scenarios are allowed:
icon-white
icon-[anything] where [anything] can be any lower-case text and include a hyphen, but never two (or more) hyphens directly next to each other like --.
You've not said where this pattern might occur, although your original regex suggests this pattern will occur anchored to the start of your test string, so I'll assume that's the case. In which case, this regex should help:
^icon-white$|^icon-([a-z]+-?)+$
Breaking that down:
1. ^icon-white$ Match the literal string that contains exactly "icon-white"
2. | or
3. ^icon-([a-z]+-?)+$ Match the literal string that starts with exactly "icon-" and then immediately ends with "something" which is ([a-z]+-?)+.
Now, to be clear - I don't get the relationship between icon-white and icon-[something]. That is, as far as I see it there's no reason why the icon-[something] pattern at 3 above can't cover the "icon-white" literal too. ie 1 and 2 above are redundant. But I've included them here so you can maybe piece something more suitable together.
Breaking that "something" down from 3:
( )+ means one or more instances of whatever's inside the parenthesis, which is [a-z]+-?
Breaking that [a-z]+-? down:
[a-z]+ At least one character "a" through "z" (note hyphen is NOT allowed here to avoid additional hyphen immediately after the previous one)
-? An optional hyphen (ie exactly 0 or 1 hyphen)
This matches the following test cases:
icon-white
icon-x
icon-xx
icon-x-
icon-xx-
icon-x-x
icon-xx-x
icon-xx-xx
icon-x-x-x-
icon-x-xx-xx-x-xxxxx-
... and so on
This DOES NOT match the following test cases:
any case where a capital letter is used (you've specified only lower-case)
icon- (because we need one or more characters for "something")
icon--
icon--x
icon-x--
I hope this covers your needs, but I doubt it does (because I didn't really understand your explanation "ensure that "icon-white" should be the only class assigned beside the other one"), but hopefully my breakdown will give you the pieces you need.
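The test cases listed above can be checked with a short script; this is only a sketch, with the array names chosen for illustration and icon-X added as an assumed example of the capital-letter case.

```javascript
const iconClass = /^icon-white$|^icon-([a-z]+-?)+$/;

// A selection of the cases listed above.
const shouldMatch = ["icon-white", "icon-x", "icon-xx-", "icon-x-x", "icon-x-xx-xx-x-xxxxx-"];
const shouldFail = ["icon-", "icon--", "icon--x", "icon-x--", "icon-X"];

console.log(shouldMatch.every((s) => iconClass.test(s))); // true
console.log(shouldFail.some((s) => iconClass.test(s)));   // false
```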
EDIT:
I think maybe you're saying the scenarios allowed are:
icon-[something]
icon-[something] icon-white
icon-white icon-[something]
where [something] is any combination of lower-case text and hyphens, so long as there's never a double hyphen, and so long as it's not "white".
So... this defines "icon-[something]" : icon-(?!white$)([a-z]+-?)+
This means our 3 above scenarios are:
^icon-(?!white$)([a-z]+-?)+$
^icon-(?!white$)([a-z]+-?)+ icon-white$
^icon-white icon-(?!white$)([a-z]+-?)+$
And hence, putting it all together:
^icon-(?!white$)([a-z]+-?)+$|^icon-(?!white$)([a-z]+-?)+ icon-white$|^icon-white icon-(?!white$)([a-z]+-?)+$
I tried doing this with the icon-white section as an optional group, but had trouble with the negative lookahead from the first section capturing it... so... this'll do ;-)
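A sketch of how the combined pattern could be assembled and tried out in JavaScript. Building it from a shared fragment is a design choice made here for readability, and the class names icon-star and the variable names are hypothetical.

```javascript
// The "icon-[something]" fragment from above, reused in all three alternatives.
const icon = "icon-(?!white$)([a-z]+-?)+";

const allowed = new RegExp(
  "^" + icon + "$|^" + icon + " icon-white$|^icon-white " + icon + "$"
);

console.log(allowed.test("icon-star"));            // true
console.log(allowed.test("icon-star icon-white")); // true
console.log(allowed.test("icon-white icon-star")); // true
console.log(allowed.test("icon-white"));           // false
console.log(allowed.test("icon--star"));           // false
console.log(allowed.test("icon-star icon-star"));  // false
```

One thing to be aware of: (?!white$) only rejects "white" when it sits at the very end of the string, so "icon-white icon-white" still passes the second alternative. If that should be rejected too, a lookahead such as (?!white(?: |$)) might be a better fit.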

Match altered version of first match with only one expression?

I'm writing a brush for Alex Gorbatchev's Syntax Highlighter to get highlighting for Smalltalk code. Now, consider the following Smalltalk code:
aCollection do: [ :each | each shout ]
I want to find the block argument ":each" and then match "each" every time it occurs afterwards (for simplicity, let's say every occurrence and not just inside the brackets).
Note that the argument can have any name, e.g. ":myArg".
My attempt to match ":each":
\:([\d\w]+)
This seems to work. The problem is for me to match the occurrences of "each". I thought something like this could work:
\:([\d\w]+)|\1
But the right hand side of the alternation seems to be treated as an independent expression, so backreferencing doesn't work.
Is it even possible to accomplish what I want in a single expression? Or would I have to use the backreference within a second expression (via another function call)?
You could do it in languages that support variable-length lookbehind (AFAIK only the .NET framework languages do, Perl 6 might). There you could highlight a word if it matches (?<=:(\w+)\b.*)\1. But JavaScript didn't support lookbehind at all when this was written (lookbehind assertions were only added in ES2018).
But anyway this regex would be very inefficient (I just checked a simple example in RegexBuddy, and the regex engine needs over 60 steps for nearly every character in the document to decide between match and non-match), so this is not a good idea if you want to use it for code highlighting.
I'd recommend you use the two-step approach you mentioned: First match :(\w+)\b (word boundary inserted for safety, \d is implied in \w), then do a literal search for match result \1.
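The two-step approach could look roughly like this in JavaScript. It is a sketch only: the variable names are made up, and it simply collects the positions of later occurrences rather than plugging into Syntax Highlighter's brush API.

```javascript
const code = "aCollection do: [ :each | each shout ]";

// Step 1: find the block argument declaration, e.g. ":each".
const declaration = /:(\w+)\b/.exec(code);

let laterUses = [];
if (declaration !== null) {
  const name = declaration[1]; // "each"; \w+ contains no regex metacharacters, so no escaping is needed
  const afterDeclaration = declaration.index + declaration[0].length;

  // Step 2: search for the captured name as a whole word and keep
  // only the occurrences that come after the declaration.
  const uses = new RegExp("\\b" + name + "\\b", "g");
  laterUses = [...code.matchAll(uses)]
    .map((m) => m.index)
    .filter((index) => index >= afterDeclaration);
}

console.log(laterUses); // [ 26 ]
```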
I believe the only thing stored by the Regex engine between matches is the position of the last match. Therefore, when looking for the next match, you cannot use a backreference to the match before.
So, no, I do not think that this is possible.
