Javascript - parse formatted text and extract values in order? - javascript

I have a field with wiki style rendering on it that I'd like to bust up in Javascript.
The text I'm trying to parse looks like this:
{color:#47B}_name1_{color}
{color:#555}description1{color}
---
{color:#47B}_name2_{color}
{color:#555}description2{color}
---
{color:#47B}_name3_{color}
{color:#555}description3{color}
---
etc
Where name1 and description1 belong together, name2 and description2 belong together, and so forth. The values for name and description are user supplied values, with description potentially spanning multiple lines.
My end goal is to be able to extract the values of each name and each description from the text (and be able to reliably associated name1 with description1, etc).
My question is: If I used a regex to match all the names into an array and all the descriptions into an array, can I be ensured that the items in the array are in the correct order? That is, will names[0] always be the first name in the parsed text (assuming I did a javascript regex match into the names array)? Also- is this bad practice/should I do this another way?
The regular expression I'm trying to use to match names is:
/^(\{color\:#47B\})(_)(\s*?)(.*?)(\s*?)(_)(\{color\})$/
And the regular expression I'm using to match descriptions is:
/(\{color\:#555\})(.*?)(\{color\})/

A regex search will always return matches in source order (i.e. in the order in which they occur in the source text.)
I assume you are asking this question because you're hoping to do two regex matches (one for name, one for description) and then get two result arrays, and guarantee that namesmatch[i] always goes with descriptionmatch[i]. However, this will only be true if your source text is always exactly perfect.
In this case it may be better or safer either to use a single regex that matches both at once, or split your source up by those -- delimiters and then match within each block. The reason why it may be safer is that your source text may contain errors, and at least in this case you can detect that and have as much good data as possible.
A note about your regexes. The . does not match newlines, so if the text between your {color} braces might have a newline you need to include newlines explicitly. [\s\S] and [^] are common idioms for this. Alternatively, if all . in a regex should match newlines, set the dotAll flag (s).

Related

[Nearley]: how to parse matching opening and closing tag

I'm trying to parse a very simple language with nearley: you can put a string between matching opening and closing tags, and you can chain some tags. It looks like a kind of XML, but with[ instead of < , with tag always 2 chars long, and without nesting.
[aa]My text[/aa][ab]Another Text[/ab]
But I don't seem to be able to parse correctly this, as I get the grammar should be unambiguous as soon as I have more than one tag.
The grammar that I have right now:
#builtin "string.ne"
#builtin "whitespace.ne"
openAndCloseTag[X] -> "[" $X "]" string "[/" $X "]"
languages -> openAndCloseTag[[a-zA-Z] [a-zA-Z]] (_ openAndCloseTag[[a-zA-Z] [a-zA-Z]]):*
string -> sstrchar:* {% (d) => d[0].join("") %}
And related, Ideally I would like the tags to be case insensitive (eg. [bc]TESt[/BC] would be valid)
Has anyone any idea how we can do that? I wasn't able to find a nearley XML parser example .
Your language is almost too simple to need a parser generator. And at the same time, it is not context free, which makes it difficult to use a parser generator. So it is quite possible that the Nearly parser is not the best tool for you, although it is probably possible to make it work with a bit of hackery.
First things first. You have not actually provided an unambiguous definition of your language, which is why your parser reports an ambiguity. To see the ambiguity, consider the input
[aa]My text[/ab][ab]Another Text[/aa]
That's very similar to your test input; all I did was swap a pair of letters. Now, here's the question: Is that a valid input consisting of a single aa tag? Or is it a syntax error? (That's a serious question. Some definitions of tagging systems like this consider a tag to only be closed by a matching close tag, so that things which look like different tags are considered to be plain text. Such systems would accept the input as a single tagged value.)
The problem is that you define string as sstrchar:*, and if we look at the definition of sstrchar in string.ne, we see (leaving out the postprocessing actions, which are irrelevant):
sstrchar -> [^\\'\n]
| "\\" strescape
| "\\'"
Now, the first possibility is "any character other than a backslash, a single quote or a newline", and it's easy to see that all of the characters in [/ab] are in sstrchar. (It's not clear to me why you chose sstrchar; single quotes don't appear to be special in your language. Or perhaps you just didn't mention their significance.) So a string could extend up to the end of the input. Of course, the syntax requires a closing tag, and the Nearley parser is determined to find a match if there is one. But, in fact, there are two of them. So the parser declares an ambiguity, since it doesn't have any criterion to choose between the two close tags.
And here's where we come up against the issue that your language is not context-free. (Actually, it is context-free in some technical sense, because there are "only" 676 two-letter case-insensitive tags, and it would theoretically be possible to list all 676 possibilities. But I'm guessing you don't want to do that.)
A context-free grammar cannot express a language that insists that two non-terminals expand to the same string. That's the very definition of context-free: if one non-terminal can only match the same input as a previous non-terminal, then
the second non-terminals match is dependent on the context, specifically on the match produced by the first non-terminal. In a context-free grammar, a non-terminal expands to the same thing, regardless of the rest of the text. The context in which the non-terminal appears is not allowed to influence the expansion.
Now, you quite possibly expected that your macro definition:
openAndCloseTag[X] -> "[" $X "]" string "[/" $X "]"
is expressing a context-sensitive match by repeating the $X macro parameter. But it is not by accident that the Nearley documentation describes this construct as a macro. X here refers exactly to the string used in the macro invocation. So when you say:
openAndCloseTag[[a-zA-Z] [a-zA-Z]]
Nearly macro expands that to
"[" [a-zA-Z] [a-zA-Z] "]" string "[/" [a-zA-Z] [a-zA-Z] "]"
and that's what it will use as the grammar production. Observe that the two $X macro parameters were expanded to the same argument, but that doesn't mean that will match the same input text. Each of those subpatterns will independently match any two alphabetic characters. Context-freely.
As I alluded to earlier, you could use this macro to write out the 676 possible tag patterns:
tag -> openAndCloseTag["aa"i]
| openAndCloseTag["ab"i]
| openAndCloseTag["ac"i]
| ...
| openAndCloseTag["zz"i]
If you did that (and you managed to correctly list all of the possibilities) then the parser would not complain about ambiguity as long as you never use the same tag twice in the same input. So it would be ok with both your original input and my altered input (as long as you accept the interpretation that my input is a single tagged object). But it would still report the following as ambiguous:
[aa]My text[/aa][aa]Another Text[/aa]
That's ambiguous because the grammar allows it to be either a single aa tagged string (whose text includes characters which look like close and open tags) or as two consecutive aa tagged strings.
To eliminate the ambiguity you would have to write the string pattern in a way which does not permit internal tags, in the same way that sstrchar doesn't allow internal single quotes. Except, of course, it is not nearly so simple to match a string which doesn't contain a pattern, than to match a string which doesn't contain a single character. It could be done using Nearley, but I really don't think that it's what you want.
Probably your best bet is to use native Javascript regular expressions to match tagged strings. This will prove simpler because Javascript regular expressions are much more powerful than mathematical regular expressions, even allowing the possibility of matching (certain) context-sensitive constructions. You could, for example, use Javascript regular expressions with the Moo lexer, which integrates well into Nearley. Or you could just use the regular expressions directly, since once you match the tagged text, there isn't much else you need to do.
To get you started, here's a simple Javascript regular expression which matches tagged strings with matching case-insensitive labels (the i flag at the end):
/\[([a-zA-Z]{2})\].*?\[\/\1\]/gmi
You can play with it online using Regex 101

Match between simple delimiters, but not delimiters themselves

I was looking at JSON data that was just in a text file. I don't want to do anything aside from just use regex to get the values in between quotes. I'm just using this as a way to help practice regex and got to this point that seems like it should be simple, but it turns out it's not (at least to me and a few other people at the office). I've matched complicated urls with ease in regex so I'm not completely new to regex. This just seems like a weird case for me.
I've tried:
/(?:")(.*?)(?:")/
/"(.*?)"/
and several others but these got me the closest.
Basically we can forget that it's JSON and just say I want to match the words value and stuff out of "value" and "stuff". Everything I try includes the quotes, so I'd have to clean the strings afterwards of the delimiters or else the string is literally "value" with the quotes.
Any help would be much appreciated, whether this is simple or complicated, I'd love to know! Thanks
Update: Alright so I think I'll go with (?<=")(.*?)(?=") and read things by line without the global setting on so I just get the first match on each line. In my code I was just plopping in a huge string into a var in the code instead of actually opening a file with ajax/filereader or having a form setup to input data. I think I'll mark this as solved, much appreciated!
You have two choices to solve this problem:
Use capturing groups
You can match the delimiters and use capturing groups to get the text within. In this case your two regexes will work, but you need to use access capturing group 1 to get the results (demo). See How do you access the matched groups in a JavaScript regular expression? for how to do that.
Use zero-width assertions
You can use zero-width assertions to match only the text within, require delimiters around them without actually matching them (demo):
(?<=")(.*?)(?=")
but now since I'm not consuming the quotes it'll find instances between each quote, not just between pairs of quotes: e.g., a"b"c" would find b and c.
As for getting just the first match, I think that'll happen by default in JavaScript. You'd have to ask for repeated matching before you see the subsequent ones. So if you process your file one line at a time, you should get what you want.
get the values in between quotes
One thing to keep in mind is that valid JSON accepts escaped quotes inside the quoted values. Therefore, the RegEx should take this into account when capturing the groups which is done with the “unrolling-the-loop” pattern.
var pattern = /"[^"\\]*(?:\\.[^"\\]*)*"/g;
var data = {
"value": "This is \"stuff\".",
"empty": "",
"null": null,
"number": 50
};
var dataString = JSON.stringify(data);
console.log(dataString);
var matched = dataString.match(pattern);
matched.map(item => console.log(JSON.parse(item)));

JavaScript split string by specific character string

I have a text box with a bunch of comments, all separated by a specific character string as a means of splitting them to display each comment individually.
The string in question is | but I can change this to accommodate whatever will work. My only requirement is that it is not likely to be a string of characters someone will type in an everyday sentence.
I believe I need to use the split method and possibly some regex but all the other questions I've seen only seem to mention splitting by one character or a number of different characters, not a specific set of characters in a row.
Can anyone point me in the right direction?
.split() should work for that purpose:
var comments = "this is a comment|and here is another comment|and yet another one";
var parsedComments = comments.split('|');
This will give you all comments in an array which you can then loop over or do whatever you have to do.
Keep in mind you could also change | to something like <--NEWCOMMENT--> and it will still work fine inside the split('<--NEWCOMMENT-->') method.
Remember that split() removes the character it's splitting on, so your resulting array won't contain any instances of <--NEWCOMMENT-->

regex replace on JSON is removing an Object from Array

I'm trying to improve my understanding of Regex, but this one has me quite mystified.
I started with some text defined as:
var txt = "{\"columns\":[{\"text\":\"A\",\"value\":80},{\"text\":\"B\",\"renderer\":\"gbpFormat\",\"value\":80},{\"text\":\"C\",\"value\":80}]}";
and do a replace as follows:
txt.replace(/\"renderer\"\:(.*)(?:,)/g,"\"renderer\"\:gbpFormat\,");
which results in:
"{"columns":[{"text":"A","value":80},{"text":"B","renderer":gbpFormat,"value":80}]}"
What I expected was for the renderer attribute value to have it's quotes removed; which has happened, but also the C column is completely missing! I'd really love for someone to explain how my Regex has removed column C?
As an extra bonus, if you could explain how to remove the quotes around any value for renderer (i.e. so I don't have to hard-code the value gbpFormat in the regex) that'd be fantastic.
You are using a greedy operator while you need a lazy one. Change this:
"renderer":(.*)(?:,)
^---- add here the '?' to make it lazy
To
"renderer":(.*?)(?:,)
Working demo
Your code should be:
txt.replace(/\"renderer\"\:(.*?)(?:,)/g,"\"renderer\"\:gbpFormat\,");
If you are learning regex, take a look at this documentation to know more about greedyness. A nice extract to understand this is:
Watch Out for The Greediness!
Suppose you want to use a regex to match an HTML tag. You know that
the input will be a valid HTML file, so the regular expression does
not need to exclude any invalid use of sharp brackets. If it sits
between sharp brackets, it is an HTML tag.
Most people new to regular expressions will attempt to use <.+>. They
will be surprised when they test it on a string like This is a
first test. You might expect the regex to match and when
continuing after that match, .
But it does not. The regex will match first. Obviously not
what we wanted. The reason is that the plus is greedy. That is, the
plus causes the regex engine to repeat the preceding token as often as
possible. Only if that causes the entire regex to fail, will the regex
engine backtrack. That is, it will go back to the plus, make it give
up the last iteration, and proceed with the remainder of the regex.
Like the plus, the star and the repetition using curly braces are
greedy.
Try like this:
txt = txt.replace(/"renderer":"(.*?)"/g,'"renderer":$1');
The issue in the expression you were using was this part:
(.*)(?:,)
By default, the * quantifier is greedy by default, which means that it gobbles up as much as it can, so it will run up to the last comma in your string. The easiest solution would be to turn that in to a non-greedy quantifier, by adding a question mark after the asterisk and change that part of your expression to look like this
(.*?)(?:,)
For the solution I proposed at the top of this answer, I also removed the part matching the comma, because I think it's easier just to match everything between quotes. As for your bonus question, to replace the matched value instead of having to hardcode gbpFormat, I used a backreference ($1), which will insert the first matched group into the replacement string.
Don't manipulate JSON with regexp. It's too likely that you will break it, as you have found, and more importantly there's no need to.
In addition, once you have changed
'{"columns": [..."renderer": "gbpFormat", ...]}'
into
'{"columns": [..."renderer": gbpFormat, ...]}' // remove quotes from gbpFormat
then this is no longer valid JSON. (JSON requires that property values be numbers, quoted strings, objects, or arrays.) So you will not be able to parse it, or send it anywhere and have it interpreted correctly.
Therefore you should parse it to start with, then manipulate the resulting actual JS object:
var object = JSON.parse(txt);
object.columns.forEach(function(column) {
column.renderer = ghpFormat;
});
If you want to replace any quoted value of the renderer property with the value itself, then you could try
column.renderer = window[column.renderer];
Assuming that the value is available in the global namespace.
This question falls into the category of "I need a regexp, or I wrote one and it's not working, and I'm not really sure why it has to be a regexp, but I heard they can do all kinds of things, so that's just what I imagined I must need." People use regexps to try to do far too many complex matching, splitting, scanning, replacement, and validation tasks, including on complex languages such as HTML, or in this case JSON. There is almost always a better way.
The only time I can imagine wanting to manipulate JSON with regexps is if the JSON is broken somehow, perhaps due to a bug in server code, and it needs to be fixed up in order to be parseable.

Javascript Regex Match between pound signs

Quick. I have a string: #user 9#I'm alive! and I want to be about to pull out "user 9".
So far Im doing:
if(variable.match(/\#/g)){
console.log(variable):
}
But the output is still the full line.
Use .split(), in order to pull out your desired item.
var variable = variable.split("#");
console.log(variable[1]);
.split() turns your string into an array, with the first variable as a separator.
Of course, if you just want regex alone, you could do:
console.log(variable.match(/([^#]+)/g));
This will again give you an array of the items, but a smaller one as it doesn't use the empty value before the hash as an item. Further, as stated by #Stephen P, you will need to use a capture group(()) to capture the items you want.
Try something more along these lines...
var thetext="#user 9#I'm alive!";
thetext.match(/\#([^#]+)\#/g);
You want to introduce a capturing group (the part in the parentheses) to collect the text in between the pound signs.
You may want to use .exec() instead of .match() depending on what you're doing.
Also see this answer to the SO question "How do you access the matched groups in a javascript regex?"

Categories

Resources