Splitting a text into chunks (Javascript, regex)


I am trying to split a text into several smaller chunks in order to parse it, using JavaScript and regex. I have illustrated my best attempt here, example included:
https://regex101.com/r/jfzTlr/1
I have a set of rules to follow. I would like to receive blocks. Every block starts with an asterisk (*) as the first character (or, if the line is indented, right after the tab), followed by 2-3 uppercase letters, a comma, an optional space and a code that can be A, R, T, RS or RSS. The code is followed by an optional dot and then a line break, after which the text begins. That text ends where the next asterisk appears, following the same pattern as above.
Could someone help me figure out how to split this accordingly? This is my pattern so far:
[^\t](.{2,3}),\s?.{1,3}\.?\n.*
Thanks a lot!

Since you're going with JavaScript, why not do it with a split which gives you the captured string to split on and the separated parts as well? Then bind the headings together in an array that looks like
[[heading1, block1], [heading2, block2], ...]
This way, you immediately have the data in a nice format to process down the line. Just an idea!
const s = `*GW, A
This is my very first line. The asterics defines a new block, followed by the initials (2-3 chars), a comma, a (possible) space and a code that could be A, R, T, RS or RSS. Followed by that is an optional dot. Linebreak afterwards, where the text comes.
*JP, R.
New block here, as the line (kind of) starts with an asterics. Indentations with 4 spaces or a tab means that it is a second level thing only, that does not need to be stripped away necessarily.
But as you can see, a block can be devided into several
lines,
even with multiple lines.
*GML, T.
And so we continue...
Let's just make sure that a line can start with an
*asterics, without breaking the whole thing.
*GW, RS
Yet another block here.
*GW, RSS.
And a very final one.
Spread over several lines.
*TA, RS.
First level all of a sudden again.
*PA, RSX
Just a line to check whether RSX is a separate block.
`;
const splits = s.split(/\*([A-Z]{2,3}),\s?([AT]|RS{0,2})(\.?)\n/).slice(1);
const grouped = [];
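// Each block yields 4 entries: initials, code, optional dot, body text.
for (let i = 0; i < splits.length; i += 4) {
  const group = splits.slice(i, i + 3);
  // Split the body into trimmed lines.
  group[3] = splits[i + 3].trim().split(/\s*[\r\n]+\s*/g);
  grouped.push(group);
}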
console.log(grouped);

You could use
^[ \t]*\*[A-Z]{2,3},\s*(?:[ART]|RSS?)\.?[\n\r](?:(?!^[ \t]*\*[A-Z]{2,3},\s*(?:[ART]|RSS?)\.?)[\s\S])+
See a demo on regex101.com.
Broken into parts:
^[ \t]*\*[A-Z]{2,3}          # start of the line, optional spaces or tabs, an asterisk and 2-3 UPPERCASE letters
,\s*(?:[ART]|RSS?)\.?[\n\r]  # comma, optional whitespace, code, optional dot and newline
(?:                          # non-capturing group
  (?!^[ \t]*\*[A-Z]{2,3},\s*(?:[ART]|RSS?)\.?)
                             # neg. lookahead with the same pattern as above
  [\s\S]                     # \s + \S = effectively matching every character
)+
The technique is called a tempered greedy token.
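As a rough sketch of how this pattern could be applied in JavaScript: the g and m flags are an assumption here (m is needed so that ^ also matches at line starts, including inside the lookahead), and the sample text is a shortened, made-up version of the question's input.

```javascript
// Shortened, made-up sample in the style of the question's input.
const sample = [
  '*GW, A',
  'First block.',
  '*JP, R.',
  'Second block,',
  '*asterisk at the line start, but not a header.',
  '\t*TA, RS.',
  'Indented header.',
].join('\n');

// The pattern from above, with the g and m flags (an assumption; the answer
// does not state which flags it uses).
const blockPattern =
  /^[ \t]*\*[A-Z]{2,3},\s*(?:[ART]|RSS?)\.?[\n\r](?:(?!^[ \t]*\*[A-Z]{2,3},\s*(?:[ART]|RSS?)\.?)[\s\S])+/gm;

// Each match is one block: the header line plus all text up to the next header.
const blocks = (sample.match(blockPattern) || []).map((block) => block.trim());

console.log(blocks);
```

With this sample, the line starting with "*asterisk" stays inside the second block because it does not have 2-3 uppercase letters after the asterisk.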

Hope this is what you wanted. This works.
([\*\t])+(.{2,3}),\s?.[A,R,T,RS,RSS]{1,3}\.?\n.*


Regex to follow pattern except between braces

I am having a tough time figuring out a clean Regex (in a Javascript implementation) that will capture as much of a line as it can following a pattern, but anything inside braces doesn't need to follow the pattern. I'm not sure the best way to explain that except by example:
For example:
Let's say the pattern is, the line must start with 0, end with a 0 anywhere, but only allow sequence of 1, 2 or 3 in between, so I use ^(0[123]+0). This should match the first part of the strings:
0213123123130
012312312312303123123
01231230123123031230
etc.
But I want to be able to insert {gibberish} between braces into the line and have the Regex allow it to disrupt the pattern. i.e., ignore the pattern of the curly braces and everything inside, but still capture the full string including the {gibberish}. So this would capture everything in bold:
01232231{whatever 3 gArBaGe? I want.}121{foo}2310312{bar}3120123
and a 0 inside the braces does not end the capture prematurely, even if the pattern is correct.
01213123123123{21310030123012301}31231230123
EDIT: Now, I know I could just do something like ^0[123]*?(?:{.*})*?[123]*?0 maybe? But that only works if there is a single set of braces, and now I have to duplicate my [123] pattern. As that [123] pattern gets more complex, having it appear more than once in the Regex starts getting really incomprehensible. Something like the best regex trick seemed promising but I couldn't figure out how to apply it here. Using crazy lookarounds seems like the only way now but I would hope there's a cleaner way.

Since you've specified that you want the whole match including the garbage, you can use ^0([123]+(?:{[^}]*}[123]*)*)0 and use $1 (match[1] in JavaScript) to get the part between the 0s, or $& (match[0]) to get everything that matched.
https://regex101.com/r/iFSabs/3
Here's the rundown on how the regex works:
^ anchors the match to start at the beginning of the line
0 matches a literal zero character
([123]+(?:{[^}]*}[123]*)*) is a capturing group that captures everything inside of it.
[123]+ matches one or more instances of 1, 2, or 3
(?:{[^}]*}[123]*)* is a non-capturing group. I.e. it'll be part of the match, but won't have a $# for use in replacement or the match.
{[^}]*} matches a literal { followed by any number of non } characters followed by }
[123]* matches zero or more instances of 1, 2, or 3
Then this whole non-capturing group can be matched 0 or more times.
The process behind this regex is known as unrolling the loop. http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop gives a good description of it. (with a few typo fixes)
The unrolling the loop technique is based on the hypothesis that in
most case, you [know] in a [repeated] alternation, which case should be
the most usual and which one is exceptional. We will called the first
one, the normal case and the second one, the special case. The general
syntax of the unrolling the loop technique could then be written as:
normal* ( special normal* )*
Which could means something like, match the normal case, if you find a
special case, matched it than match the normal case again. [You'll] notice
that part of this syntax could [potentially] lead to a super-linear
match.
Example using RegExp#test and RegExp#exec:
const strings = [
  '0213123123130',
  '012312312312303123123',
  '01231230123123031230',
  '01213123123123{21310030123012301}31231230123',
  '01212121{hello 0}121312',
  '012321212211231{whatever 3 gArBaGe? I want.}1212313123120123',
  '012321212211231{whatever 3 gArBaGe? I want.}121231{extra garbage}3123120123',
];
const regex = /^0([123]+(?:{[^}]*}[123]*)*)0/;
console.log('tests');
console.log(strings.map((string) => `'${string}': ${regex.test(string)}`));
console.log('matches');
let matches = strings
  .map((string) => regex.exec(string))
  .map((match) => (match ? match[1] : undefined));
console.log(matches);
Robo Robok's answer is the one I'd go with if you only want to keep the non-braced part, although I'd use a slightly different regex ({[^}]*}) for a bit more performance.

How about the other way around? Checking the string with curly tags removed:
const string = '012321212211231{whatever 3 gArBaGe? I want.}1212313123120123{foo}123';
const stringWithoutTags = string.replace(/\{.*?\}/g, '');
const result = /^(0[123]+0)/.test(stringWithoutTags);

You say you need to capture everything, including the gibberish, so I think a simple pattern like this should work:
^(0(?:[123]|{.+?})+0)
That allows a string starting with 0, and then any of your pattern characters (1, 2, or 3), or one of the { gibberish } sections, and allows that to repeat to handle multiple gibberish sections, and finally it must end with a 0.
https://regex101.com/r/K4teGY/2

You might use
^0[123]*(?:{[^{}]*}[123]*)*0
^ Start of string
0 Match a zero
[123]* Match 0+ times either 1, 2 or 3
(?: Non capture group
{[^{}]*}[123]* Match from an opening { to a closing }, followed by 0+ times either 1, 2 or 3
)* Close group and repeat 0+ times
0 Match a zero
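One possible way to avoid writing the [123] part out twice by hand is to assemble this pattern from strings. The following is only a sketch: the names normal, special and pattern are made up for illustration, and the braces are escaped as a precaution.

```javascript
// Hypothetical building blocks for: 0 normal* (special normal*)* 0
const normal = '[123]';           // the regular pattern between the zeros
const special = '\\{[^{}]*\\}';   // a {...} section that may contain anything but braces

const pattern = new RegExp(`^0${normal}*(?:${special}${normal}*)*0`);

const inputs = [
  '01232231{whatever 3 gArBaGe? I want.}121{foo}2310312{bar}3120123',
  '01213123123123{21310030123012301}31231230123',
  '0123{unclosed 0',
];

// match[0] is the whole match, including the braced sections.
const results = inputs.map((input) => {
  const match = pattern.exec(input);
  return match ? match[0] : null;
});

console.log(results);
```

If the "normal" part grows more complex, only the normal string would need to change.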

JS RegEx for finding number of lines in a page, separated by form feed \f

I have a use case that requires the lines of a plain-text file to consist of at most 38 characters, and 'pages' to consist of at most 28 lines. To enforce this, I'm using regular expressions. I was able to enforce the line length without any problems, but the page length is proving to be much trickier.
After several iterations, I came to the following regular expression that I feel should work, but it doesn't.
let expression = /(([^\f]*)(\r\n)){29,}\f/;
It simply results in no matches.
If anyone could provide some feedback, I'd greatly appreciate it! - Jacob
Edit 1 - removed code block around second expression, it was probably making my question confusing.
Edit 2 - removed following text, it's not pertinent:
As a comparison, the following expression results in a single match, the entire document. I'm assuming it's matching all lines up until the final
let expression = /(.*(\r\n)){29,}
Edit 3 - So after some thinking, I realized that my issue is that the initial section of the regex, which matches any characters before a newline, also matches newlines. Therefore, I believe I need to match any characters before a newline EXCEPT \f, \r and \n. However, I'm now having trouble implementing this. I tried the following:
let expression = /([^\f^\r^\n]*(\r\n)){29,}\f/;
But it's also not matching. I'm assuming that my negations are wrong...
Edit 4 - I have the following regex that matches each line: let expression = /([^\f\r\n]{0,}(\r\n))/;
This is pretty close to what I want. All I need now is to match any instances of 29 or more lines followed by \f

Thanks to everyone who commented for the help. A friend ended up helping me get the final regex:
let expression = /([^\f\r\n]*?\r??\n){29,}?\f/;

Edit:
Since you have clarified your problem and provided your updated regex:
/([^\f^\r^\n]*(\r\n)){29,}\f/;
Your negations are not right here, use [^\f\r\n] instead of [^\f^\r^\n]. This will negate all of \f, \r, and \n.
So, your regex becomes:
/([^\f\r\n]*(\r\n)){29,}\f/;
This will match 29 or more lines of characters (that can be anything but \f, \r or \n), the whole thing followed by a single \f.
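To illustrate, the corrected expression can be tried against generated pages. This is a small sketch; makePage is a made-up helper that builds a page of CRLF-terminated lines followed by a form feed.

```javascript
// Hypothetical helper: a page of `lineCount` CRLF-terminated lines, closed by \f.
const makePage = (lineCount) => 'some text\r\n'.repeat(lineCount) + '\f';

// 29 or more lines followed by a form feed, i.e. a page that is too long.
const tooLongPage = /([^\f\r\n]*(\r\n)){29,}\f/;

const okDocument = makePage(28) + makePage(28);
const badDocument = makePage(28) + makePage(29);

console.log(tooLongPage.test(okDocument));  // false
console.log(tooLongPage.test(badDocument)); // true
```

Because \f is excluded from the line contents, a run of lines cannot continue across a page boundary, so two 28-line pages are not mistaken for one 56-line page.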
Original answer:
Your current regular expression:
let expression = /(([^\f]*)(\r\n)){29,}\f/;
Matches strings that consist of 29 or more lines (separated by \r\n), the whole thing followed by one single \f.
As far as I understood, you want each of your lines to end with \f. Did you mean to include the \f inside?
let expression = /(([^\f]*)(\r\n\f)){29,}/;

JavaScript Regular Expression with both dotall and global flags

I have a string like this:
#a
b
#c
d
I would like to break it up into sections beginning with #:
#a
b
and
#c
d
I have attempted this with a regular expression, but I find that I can’t get it working.
I thought that the following would work:
var test='#a\nb\n#c\nd';
var re=/#.*?/gs;
var match=test.match(re);
alert(match.length);
alert(match);
That is, the s modifier matches through line breaks, and the g modifier picks up multiple instances. The ? lazy quantifier should stop the * from going too far.
However, I find that when I use just s, it only goes to the end of the line.
Clearly there’s something I’m not getting about either the regular expression or the match() method.
By the way, I know that s is only a recent addition to JavaScript, but I’m working in Electron, where it is readily available.

Regex is too much for this job. Use built-in string functions.
var str = `#a
b

#c
d`;
var chunks = str.split("\n\n");
console.log(chunks);

I think the answer by I wrested with a bear once assumes that you want to split on the basis of line breaks. The answer by Wiktor Stribiżew is very good, but it fails in some cases (at least in my opinion).
For example, if we use Wiktor's regex /#.*(?:\r?\n(?!\r?\n).*)*/g on the string
#Section 1
This is one section
And this is also part of first sections
#Section 2
This is part of section two.
Then it will ignore the line "This is also part of second section." in its match. The reason is simply that his regex breaks on a double \r?\n, and hence it will just ignore that line.
I am assuming you want something similar to what happens in Markdown, where # is used to automatically detect sections and headings.
If that is the case, then use the following regex: /#.*(?:\r?\n(?!#).*)*/g. It's a minor modification of Wiktor's great answer, and it matches the lines as (I hope) we wanted.
It matches the whole section and uses a negative lookahead so that it doesn't include anything beyond the next section, i.e., the next # symbol at the beginning of a line.
Link: https://regex101.com/r/ai15fP/2
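The difference between the two patterns can be seen with a small sketch. The input below is made up, under the assumption that the second section contains a blank line before its last line.

```javascript
// Made-up input: the second section contains a blank line.
const doc = [
  '#Section 1',
  'This is one section',
  'And this is also part of first sections',
  '',
  '#Section 2',
  'This is part of section two.',
  '',
  'This is also part of second section.',
].join('\n');

// Stops a section at the first blank line.
const stopAtBlankLine = /#.*(?:\r?\n(?!\r?\n).*)*/g;
// Stops a section only before a line that starts with #.
const stopAtNextHash = /#.*(?:\r?\n(?!#).*)*/g;

const byBlankLine = doc.match(stopAtBlankLine);
const byHash = doc.match(stopAtNextHash);

console.log(byBlankLine);
console.log(byHash);
```

Note that with the second pattern a section keeps any trailing blank lines, so the results may need a trim() afterwards.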

EDIT: If the only goal is to split into sections at lines starting with # you may just use
test.split(/^(?=#)/m)
See the JS demo:
var test="#a\nb\n\n#c\nd";
console.log(test.split(/^(?=#)/m))
The .*? at the end of the pattern never matches any characters: a lazy quantifier matches as few characters as possible, and since nothing follows it in the pattern, the match is already complete after zero characters.
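A short sketch of both points, using the same kind of input as the demo (the variable names are made up):

```javascript
const input = '#a\nb\n\n#c\nd';

// Lazy .*? at the very end of the pattern settles for zero characters,
// so every match is just the # itself.
const lazyMatches = input.match(/#.*?/gs);

// Splitting before each line that starts with # keeps the sections whole.
const sections = input.split(/^(?=#)/m);

console.log(lazyMatches); // [ '#', '#' ]
console.log(sections);    // [ '#a\nb\n\n', '#c\nd' ]
```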
Use
s.match(/#.*(?:\r?\n(?!\r?\n).*)*/g)
Details
# - a # char
.* - any 0+ chars other than line break chars
(?:\r?\n(?!\r?\n).*)* - 0 or more repetitions of
\r?\n(?!\r?\n) - an optional CR and then LF that are not followed with an optional CR and then LF
.* - any 0+ chars other than line break chars
Or, use split with /(?:\r?\n){2,}/ that matches 2 or more line break sequences.
JS demo:
var test="#a\nb\n\n#c\nd";
console.log(test.match(/#.*(?:\r?\n(?!\r?\n).*)*/g));
console.log(test.split(/(?:\r?\n){2,}/));

Regex for given pattern

I have the test cases below
hello how are you // 1. Allow
hello how are you // 2. Not Allow
hello < // 3. Not Allow
for the following rules:
Allow spaces at start and end
Do not allow more than one space between words
Do not allow angle brackets < >
I am trying the following:
^([^<> ]+ )+[^<> ]+$|^[^<> ]+$
This works, but it does not allow spaces at the start or end.

I assume that you use your regex to find matches in the whole
text string (all 3 lines together).
I also see that both your alternatives contain a starting ^ and an ending $,
so you want to limit the match to a single line
and probably use the m regex option.
Note that a [^...]+ expression matches a sequence of characters other than
those placed between the brackets, but it does not exclude \n chars,
which is probably not what you want.
So, maybe you should add \n to each [^...]+ expression.
The regex extended in this way is:
^([^<> \n]+ )+[^<> \n]+$|^[^<> \n]+$
and it matches lines 1 and 2.
But note that the first alternative alone (^([^<> \n]+ )+[^<> \n]+$)
also does the job.
If you really want line 2 not to match, please specify why.
Edit
To allow any number of spaces at the beginning / end of each line,
add a space followed by * after the initial ^ and before the final $, so that the
regex (first alternative only) becomes:
^ *([^<> \n]+ )+[^<> \n]+ *$
But it still matches line 2.
Or maybe the dots in your test string are actually spaces, and you wrote
the string using dots to show the number of spaces?
You should have mentioned it in your question.
Edit 2
Yet another possibility, allowing dots in place of spaces:
^[ .]*((?:[^<> .\n]+[ .])+[^<> .\n]+|[^<> .\n]+)[ .]*$
Details:
^[ .]* - Start of a line + a sequence of spaces or dots, which may be empty.
( - Start of the capturing group - a container for both alternatives of the
stripped content.
(?:[^<> .\n]+[ .])+ - Alternative 1: A sequence of "allowed" chars (a "word") +
a single space or dot (before the next "word"), repeated one or more times.
This group does not need to be a capturing one, so I put ?:.
[^<> .\n]+ - A sequence of "allowed" chars - the last "word".
| - Or.
[^<> .\n]+ - Alternative 2: A single "word".
) - End of the capturing group.
[ .]*$ - The final (possibly empty) sequence of spaces / dots + the
end of the line.
Of course, with the m option.
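The rules from the question can be tried out with a small sketch like the one below. The pattern here is a variant of the one above that also accepts a single word; isAllowed and the test strings are made up for illustration, and plain spaces are assumed rather than dots.

```javascript
// Optional spaces, words separated by exactly one space, optional spaces.
// A "word" is any run of characters other than <, >, space and newline.
const allowed = /^ *[^<> \n]+(?: [^<> \n]+)* *$/;

// Hypothetical helper wrapping the test.
const isAllowed = (line) => allowed.test(line);

console.log(isAllowed('  hello how are you  ')); // true
console.log(isAllowed('hello  how are you'));    // false (two spaces between words)
console.log(isAllowed('hello <'));               // false (angle bracket)
```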

Metacharacters and parenthesis in regular expressions

Can anyone elaborate/translate this regular expression into English?
Thank you.
var g = "123456".match(/(.)(.)/);
I have noticed that the output looks like this:
12,1,2
and I know that dot means any character except new line. But what does this actually do?

A pair of parentheses (without a ? as the first character, which indicates other behaviour) will capture its contents into a group.
In your example, the first item in the array is the entire match, and subsequent items are any group matches.
It might be clearer if your code was something like:
var g = "123456".match(/.(.).(.)./);
This will match five characters, placing the second and fourth into groups 1 and 2 respectively, so outputting 12345,2,4
If you want pure grouping without capturing the content, use (?:...) syntax, the ?: part indicating a non-capturing group. (There are various assorted group things, like lookaheads and other fun stuff.)
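The three cases described above can be compared side by side in a short sketch (the variable names are made up):

```javascript
// Index 0 is the whole match; the following entries are the capture groups.
const twoGroups = Array.from('123456'.match(/(.)(.)/));
const spacedGroups = Array.from('123456'.match(/.(.).(.)./));
// (?:.) groups without capturing, so only one group is reported.
const oneGroup = Array.from('123456'.match(/(?:.)(.)/));

console.log(twoGroups);    // [ '12', '1', '2' ]
console.log(spacedGroups); // [ '12345', '2', '4' ]
console.log(oneGroup);     // [ '12', '2' ]
```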
Let me know if that is clear, or would further explanation help?

It looks for two characters - any characters because of the dots - and 'captures' them so that you can look for the whole string that was matched, and for each of the substrings (captures) as well.
