Regex: Match number between two strings with line breaks - javascript

I want to match only the value of this string that ends with ZW-Summe with RegEx (JavaScript, so no lookbehind - please consider: I must use regex):
[here is a lot more data with line breaks and so on...]
2,550%Zinsen 83,72ZW-Summe U
St 83,72Umsatzs? [more lines...]
Problem: There can be line breaks everyvery, even this could happen:
[data....] 2,550%Zinsen 83,7
2ZW-Summe U
St 83,72Umsatzs? [more lines...]
My goal is to match 83,72 only, without ZW-Summe and of course the value can change. Possible values:
1.000,22
0,22
222,22
100.000,22 and so on.
I have to identify the value with the ZW-Summe String because there can be more occurrences of values.
My first attempt is ((\d{0,3}((\.\d{3})){0,2}),\d{2})ZW-Sum but this does also match ZW-Sum and I am not able to access groups - and the main problem is that I does not ignore possible line breaks.
I hope this is even possible to match something (-VALUE-ZW-Summe) and than ignore the ZW-Summe in the result?
Thank you for any suggestions.

This worked for me, and is quite simple: /([\d,.\n]+)(?=ZW-Summe)/g.
It basically just matches a series of digits, commas, periods, and newlines where after said series is ZW-Summe (?=, positive lookahead). The g flag makes it ignore line breaks.
After you run it, though, make sure to strip the newlines (i.e. match = match.replace('\n', '');).

Related

Replace part of a string and upper case the result

everything good?
I would like some help from you, I have the following scenario:
STRING_ONE:
string_two
STRING_three :
string_four:
stringfive:
I need to identify words from the beginning of the line that end with :
after identifying the words, I need to erase the spaces and convert them to uppercase,
i tried doing some regex but as the words change I can't set the default for the replacement because i need to keep the same word, just removing the spaces and converting to capital letters
The result I'm trying to get is this:
STRING_ONE:
string_two
STRING_three :
STRING_FOUR:
STRINGFIVE:
I can capture the words that match this pattern, with the following regex, but I don't know how to replace it by just erasing the spaces, keeping the rest of the string the same, and doing the upper case
^.*\b:
I tried to replace like this but it didn't work
"$1".toUpperCase()
Can anyone help please?
Thanks!
As you are not really replacing anything but instead you are looking for a pattern I used the pattern in a String.prototype.match() to identify the lines in which .trim() and .toLowerCase() need to be applied. The .split("\n") turns the initial string into an array over which I can then .map() the individual lines. At the end I .join() everything together again.
const str=`STRING_ONE:
string_two
STRING_three :
string_four:
stringfive:`;
console.log(str
.split("\n")
.map(s=>s.match(/^\s*\w+:\s*$/) && s.trim().toUpperCase() || s )
.join("\n")
);
Our regex patterns differ slightly:
while yours (/^.*\b:/) will match any line that has at least one word-end followed by a colon in it
mine (/^\s*\w+:\s*$/) is stricter and demands that there is exactly one word followed by a colon in a line that can optionally be padded by any number of whitespace characters on either side.

Need help to find the right regex pattern to match

my RegEx is not working the way i think, it should.
[^a-zA-Z](\d+-)?OSM\d*(?![a-zA-Z])
I will use this regex in a javascript, to check if a string match with it.
Should match:
12345612-OSM34
12-OSM34
OSM56
7-OSM
OSM
Should not match:
-OSM
a-OSM
rOSMann
rOSMa
asdrOSMa
rOSM89
01-OSMann
OSMond
23OSM
45OSM678
One line, represents a string in my javascript.
https://www.regex101.com/r/xQ0zG1/3
The rules for matching:
match OSM if it stands alone
optional match if line starts with digit/s AND is followed by a -
optional match if line ends with digit/s
match all 3 above combined
no match if line starts with a character/word except OSM
no match if line end with chracter/word except OSM
I Hope someone can help.
You can use the following simplified pattern using anchors:
^(?:\d+-)?OSM\d*$
The flags needed (if matching multi-line paragraph) would be: g for global match and m for multi-line match, so that ^ and $ match the begin/end of each line.
EDIT
Changed the (\d+-) match to (?:\d+-) so that it doesn't group.
[^a-zA-Z](\d+-)?OSM\d*(?![a-zA-Z])
[^a-zA-Z] In regex, you specify what you want, not what you don't want. This piece of code says there must be one character that isn't a letter. I believe what you wanted to say is to match the start of a line. You don't need to specify that there's no letter, you're about to specify what there will be on the line anyway. The start of a regex is represented with ^ (outside of brackets). You'll have to use the m flag to make the regex multi-line.
(\d+-)? means one or more digits followed by a - character. The ? means this whole block isn't required. If you don't want foreign digits, you might want to use [0-9] instead, but it's not as important. This part of the code, you got right. However, if you don't need capture blocks, you could write (?:) instead of ().
\d*(?![a-zA-Z]) uses lookahead, but you almost never need to do that. Again, specifying what you don't want is a bad idea because then I could write OSMé and it would match because you didn't specify that é is forbidden. It's much simpler to specify what is allowed. In your case since you want to match line ends. So instead, you can write \d*$ which means zero or more digits followed by the end of the line.
/^(?:\d+-)?OSM\d*$/gm is the final result.

Regexp: excluding a word but including non-standard punctuation

I want to find strings that contain words in a particular order, allowing non-standard characters in between the words but excluding a particular word or symbol.
I'm using javascript's replace function to find all instances and put into an array.
So, I want select...from, with anything except 'from' in between the words. Or I can separate select...from from select...from (, as long as I exclude nesting. I think the answer is the same for both, i.e. how do I write: find x and not y within the same regexp?
From the internet, I feel this should work: /\bselect\b^(?!from).*\bfrom\b/gi but this finds no matches.
This works to find all select...from: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b/gi but modifying it to exclude the parenthesis "(" at the end prevents any matches: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\(/gi
Can anyone tell me how to exclude words and symbols within this regexp?
Many thanks
Emma
Edit: partial string input:
left outer join [stage].[db].[table14] o on p.Project_id = o.project_id
left outer join
(
select
different_id
,sum(costs) - ( sum(brushes) + sum(carpets) + sum(fabric) + sum(other) + sum(chairs)+ sum(apples) ) as overallNumber
from
(
select ace from [stage].db.[table18] J
Javascript:
sequel = stringInputAsAbove;
var tst = sequel.replace(/\bselect\b[\s\S]*?\bfrom\b/gi, function(a,b) { console.log('match: '+a); selects.push(b); return a; });
console.log(selects);
Console.log(selects) should print an array of numbers, where each number is the starting character of a select...from. This works for the second regexp I gave in my info, printing: [95, 251]. Your \s\S variation does the same, #stribizhev.
The first example ^(?!from).* should do likewise but returns [].
The third example \s*^\( should return 251 only but returns []. However I have just noticed that the positive expression \s*\( does give 95, so some progress! It's the negatives I'm getting wrong.
Your \bselect\b^(?!from).*\bfrom\b regex doesn't work as expected because:
^ means here beginning of a line, not negation of next part, so
the \bselect\b^ means, select word followed by beginning of a
line. After removal of ^ regex start to match something
(DEMO) but it is still invalid.
in multiline text .* without modification will not match new line,
so regex will match only select...from in single lines, but if you
change it for (.|\n)* (as a simple example) it will match
multiline, but still invalid
the * is greede quantifire, so it will match as much a possible,
but if you use reluctant quantifire *?, regex will match to first
occurance of from word, and int will start to return relativly
correct result.
\bselect\b(?!from) means match separate select word which is not
directly followed by separate from word, so it would be
selectfrom somehow composed of separate words (because
select\bfrom) so (?!from) doesn't work and it is redundant
In effect you will get regex very similar to what Stribizhev gave you: \bselect\b(.|\n)*?\bfrom\b
In third expression you meke same mistake: \bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\( using ^ as (I assume) a negation, not beginning of a line. Remove ^ and you will again get relativly valid result (match from select through from to closing parathesis ) ).
Your second regex works similar to \bselect\b(.|\n)*?\bfrom\b or \bselect\b[\s\S]*?\bfrom\b.
I wrote "relativly valid result", as I also think, that parsing SQL with regex could be very camplicated, so I am not sure if it will work in every case.
You can also try to use positive lookahead to match just position in text, like:
(?=\bselect\b(?:.|\n)*?\bfrom\b)
DEMO - the () was added to regex just to return beginning index of match in groups, so it would be easier to check it validity
Negation in regex
We use ^ as negation in character class, for example [^a-z] means match anything but not letter, so it will match number, symbol, whitespace, etc, but not letter from range a to z (Look here). But this negation is on a level of single character. I you use [^from] it will prevent regex from matching characters f,r,o and m (demo). Also the [^from]{4} will avoid matching from but also form, morf, etc.
To exlude the whole word from matching by regex, you need to use negative look ahead, like (?!from), which will fail to match, if there will be chosen word from fallowing given position. To avoid matching whole line containing from you could use ^(?!.*from.*).+$ (demo).
However in your case, you don't need to use this construction, because if you replace greedy quantifire .*\bfrom with .*?\bfrom it will match to first occurance of this word. Whats more it would couse problems. Take a look on this regex, it will not match anything because (?![\s\S]*from[\s\S]*) is not restricted by anything, so it will match only if there is no from after select, but we want to match also from! in effect this regex try to match and exclude from at once, and fail. so the (?!.*word.*) construction works much better to exclude matching line with given word.
So what to do if we don't what to match a word in a fragment of a match? I think select\b([^f]|f(?!rom))*?\bfrom\b is a good solution. With ([^f]|f(?!rom))*? it will match everything between select and from, but will not exclude from.
But if you would like to match only select...from not followed by ( then it is good idea to use (?!\() like. But in your regex (multiline, use of (.|\n)*? or [\s\S]*? it will cause to match up to next select...from part, because reluctant quantifire will chenge a plece where it need to match to make whole regex . In my opinion, good solution would be to use again:
select\b([^f]|f(?!rom))*?\bfrom\b(?!\s*?\()
which will not overlap additional select..from and will not match if there is \( after select...from - check it here

Complex string parsing in Javascript

I am attempting to parse a complex string in JavaScript, and I'm pretty horrible with Regular Expressions, so I haven't had much luck. The data is loaded into a variable formatted as follows:
Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)
I am trying to parse that string following these parameters:
1) Each value must be loaded into their own variable (IE: separate variables for Miami, 2.5 O, (207.5) ect)
2) String must split at pipe character (I have this working with .split(" | ") )
3) I am dealing with city names that include spaces
4) The date at the end must be isolated and removed
I have a feeling regular expressions must be used, but I'm seriously hoping there is a different way to approach this. The example provided is just that, an example from a much larger data set. I can provide the full data set if requested.
More direct version of my question: Given the data above, what concepts / procedures can I use to intelligently parse the string elements into their own variables?
If RegEx must be used, will I need multiple expressions?
Thanks in advance for your help!
EDIT: In an effort to supply multiple pathways to a solution I'll explain the overarching problem as well. This data is the return of a RSS / XML item. The string mentioned above is sports odds, and is all contained in the title node of the feed I'm using. If anyone has a better XML / RSS feed for sports odds, I would be ecstatic for that as well.
EDIT 2: Thanks to the replies, I can run a RegEx that matches the data points needed. I'm now having trouble iterating through the matches and returning them correctly. I have the RegEx loaded into its own function:
function regExExtract (txt){
var exp = /([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g;
var comp_arr = exp.exec(txt);
return comp_arr;
}
And it is being called with:
var title_arr = regExExtract(title);
Title is loaded with the data string listed above. I assume I'm using the global flag correctly to ensure all matches are considered, but I'm not sure I'm loading the matches correctly. I apologize for my ignorance, this is all brand new to me.
As requested below, my expected output is ultimately a table with a row for each city, and its subsequent data. Each cell in each row corresponds to a data point.
I have created a JS Fiddle with what I've done, and what the expected output is:
http://jsfiddle.net/vDkQD/2/
Potential Final Edit: With the assistance of Robin and rewt, I have come up with:
http://jsfiddle.net/hMJx3/
Wouldn't a regex like
/([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g
do the trick? Obviously, this is based on the example string you gave, and if there are other patterns possible this should be updated... But if it is that fixed it's not so complicated.
Afterwards you just have to go through the captured groups for each match, and you'll have your data parsed. Live demo for fun: http://regex101.com/r/kF5zD3
Explanation
[^|\d] evrything but a pipe or a digit. This is to account for strange city name that [a-zA-Z ] might not catch
[-\d.] a digit, a dot or a hyphen
\([^)]+\) opening parenthesis, everything that isn't a closing parenthesis, closing parenthesis.
Quick incomplete pointers on regex
Here, the regex is the part between the /. The g after is a flag, thanks to it the regex won't stop after hitting the first match and will return every match
The match is what the whole expression will find. Here, the match will be everything between two | in your string. The capturing groups are a very useful tool that allows you too extract data from this match: they are delimited by parenthesis, which are a special character in regex. (a)b will match ab, the first captured group of this match will be a
[...] is means every character inside will do. [abc] will match a or b or c.
+ is a quantifier, another special character, meaning "one or more of what precedes me". a+ means "one or more a and will match aaaaa.
\d is a shortcut for [0-9] (yes, - is a special range character inside of [...]. That's why in [-\d.], which is equivalent to [-0-9.], it's directly following the opening bracket)
since parenthesis are special characters, when you actually want to match a parenthesis you need to escape: regex (\(a\))b will match (a)b, the first captured group of this match will be (a) with the parenthesis
? means what precedes is optional (zero or one instances)
^ when put at the beginning of a [...] statement means "everything but what's in the brackets". [^a]+ will match bcd-*ù but not aa
If you really know nothing about regex, as I believe they're the right tool for your case, I suggest your take a quick overview of a tuto, just to get a better idea of what you're dealing with. The way to set flags, loop through matches and their respective captured groups will depend on your language and how you call your regex.
[A-z][a-z]+( [A-z][a-z]+)* -?[0-9]+\.[0-9] [OU] \(-?[0-9]+\.[0-9]\) -?[0-9]+\.[0-9]
This should match a single part of your long string under the following assumptions:
The city consists only of alpha characters, each word starts with an uppercase character and is at least 2 characters long.
Numbers have an optional sign and exactly one digit after the decimal point
the single character is either O or U
Now it is up to you to:
Properly create capturing parentheses
Check whether my assumptions are right
In order to match the date:
\([JFMASOND][a-z]{2} [0-9]?[0-9], [0-9]{4} [0-9]{2}:[0-9]{2} [AP]M\)$

Javascript Regex: how to simulate "match without capture" behavior of positive lookbehind?

I have a relatively simple regex problem - I need to match specific words in a string, if they are entire words or a prefix. With word boundaries, it would look something like this:
\b(word1|word2|prefix1|prefix2)
However, I can't use the word boundary condition because some words may start with odd characters, e.g. .999
My solution was to look for whitespace or starting token for these odd cases.
(\b|^|\s)(word1|word2|prefix1|prefix2)
Now words like .999 will still get matched correctly, BUT it also captures the whitespace preceding the matched words/prefixes. For my purposes, I can't have it capture the whitespace.
Positive lookbehinds seem to solve this, but javascript doesn't support them. Is there some other way I can get the same behavior to solve this problem?
You can use a non-capturing group using (?:):
/(?:\b|^|\s)(word1|word2|prefix1|prefix2)/
UPDATE:
Based on what you want to replace it with (and #AlanMoore's good point about the \b), you probably want to go with this:
var regex = /(^|\s)(word1|word2|prefix1|prefix2)/g;
myString.replace(regex,"$1<span>$2</span>");
Note that I changed the first group back to a capturing one since it'll be part of the match but you want to keep it in the replacement string (right?). Also added the g modifier so that this happens for all occurrences in the string (assuming thats what you wanted).
Let's get the terminology straight first. A regex normally consumes everything it matches. When you do a replace(), everything that was consumed is overwritten. You can also capture parts of the matched text separately and plug them back in using $1, $2, etc.
When you were using the word boundary you didn't have to worry about this, because \b doesn't consume anything. But now you're consuming the leading whitespace character if there is one, so you have to plug it back in. I don't know what you're replacing the match with, so I'll just replace them with nothing for this demonstration.
result = subject.replace(/(^|\s)(word1|word2|prefix1|prefix2)/g, "$1");
Note that the \b isn't needed any more. In fact, you must remove it, or it will match things like .999 in xyz.999, because \b matches between z and .. I'm pretty sure you don't want that.

Categories

Resources