Extracting JSON generic string from text file

I have a file the contents of which are formatted as follows:
{
"title": "This is a test }",
"date": "2017-11-16T20:47:16+00:00"
}
This is a test }
I'd like to extract just the JSON heading via regex (which I think is the most sensible approach here). However, the JSON might change in the future (e.g. extra fields might be added, or the current fields could change), so I'd like to keep the regex flexible. I have tried the solution suggested here, but it seems too simplistic for my use case: the } inside the title string above "tricks" the regex, as shown in this regex101.com example.
Since my knowledge of regex is not that advanced, I'd like to know whether there's a regex approach that is able to cover my use case.
Thank you!

You can check for the first index of \n} to get the sub-string:
s = `{
"title": "This is a test }",
"date": "2017-11-16T20:47:16+00:00"
}
This is a test }
}`
i = s.indexOf('\n}')
if (i > 0) {
o = JSON.parse(s = s.slice(0, i + 2))
console.log(s); console.log(o)
}
or a bit shorter with RegEx:
s = `{
"title": "This is a test }",
"date": "2017-11-16T20:47:16+00:00"
}
This is a test }
}`
s.replace(/.*?\n}/s, function(m) {
o = JSON.parse(m)
console.log(m); console.log(o)
})

If the JSON always starts with { at the left margin and ends with } alone on its own line, with everything else indented as you show, you can use the regular expression
/^{.*?^}$/ms
The m modifier makes ^ and $ match the beginning and end of lines, not the whole string. The s modifier allows . to match newlines.
var str = `{
"title": "This is a test }",
"date": "2017-11-16T20:47:16+00:00"
}
This is a test }
`;
var match = str.match(/^{.*?^}$/ms);
if (match) {
var data = JSON.parse(match[0]);
}
console.log(data);
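If the layout assumption (a closing } alone on its own line) cannot be relied on, a small scanner that tracks brace depth and skips over string literals is an alternative to a regex. The following is only a sketch: extractLeadingJson is a hypothetical helper name, and it assumes the header is a single JSON object with standard double-quoted strings.

```javascript
// Hypothetical helper: returns the first balanced {...} block in `text`,
// ignoring braces that appear inside double-quoted JSON strings.
function extractLeadingJson(text) {
  const start = text.indexOf('{');
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (ch === '\\') i++;                 // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === '{') {
      depth++;
    } else if (ch === '}') {
      depth--;
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return null; // braces never balanced
}

const sample = `{
"title": "This is a test }",
"date": "2017-11-16T20:47:16+00:00"
}
This is a test }`;

const header = JSON.parse(extractLeadingJson(sample));
console.log(header.title); // This is a test }
```

Because the scanner does not depend on line breaks, it should keep working if fields are added or the header is later written on a single line.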

Related

Regex matching the entire file

I have a big SRT (subtitle) file that I'm trying to convert to JSON, but my regex doesn't seem to be working correctly.
My expression:
^(\d+)\r?\n(\d{1,2}:\d{1,2}:\d{1,2}([.,]\d{1,3})?)\s*\-\-\>\s*(\d{1,2}:\d{1,2}:\d{1,2}([.,]\d{1,3})?)\r?\n([\s\S]*)(\r?\n)*$
Here is a sample of my srt file, each subtitle follows the same scheme.
1
00:00:11,636 --> 00:00:13,221
Josh communicated but

2
00:00:13,221 --> 00:00:16,850
it's also the belief that
we never knew the severity
my javascript file
const fs = require('fs');
function parse(content, options) {
var captions = [];
var parts = content.split(/\r?\n\s+\r?\n/g);
for (var i = 0; i < parts.length; i++) {
var regex = /^(\d+)\r?\n(\d{1,2}:\d{1,2}:\d{1,2}([.,]\d{1,3})?)\s*\-\-\>\s*(\d{1,2}:\d{1,2}:\d{1,2}([.,]\d{1,3})?)\r?\n([\s\S]*)(\r?\n)*$/gi;
var match = regex.exec(parts[i]);
if (match) {
var caption = {};
var eol = "\n";
caption.id = parseInt(match[1]);
caption.start = match[2];
caption.end = match[4];
var lines = match[6].split('/\r?\n/');
caption.content = lines.join(eol);
captions.push(caption);
continue;
}
}
return captions;
};
var content = fs.readFileSync('./English-SRT-CC.srt', 'utf8');
var captions = parse(content);
var json = JSON.stringify(captions, " ", 2);
console.log(json);
fs.writeFile("output.json", json, 'utf8', function (err) {
if (err) {
return console.log(err);
}
console.log("JSON file has been saved.");
});
And finally, here's my output:
{
"id": 1,
"start": "00:00:11,636",
"end": "00:00:13,221",
"content": "Josh communicated but\n\n2\n00:00:13,221 --> 00:00:16,850\n
// cut for shortness, it just continues the rest of the file inside "content"
My desired output?
{
"id": 1,
"start": "00:00:11,636",
"end": "00:00:13,221",
"content": "Josh communicated but"
},
{
"id": 2,
"start": "00:00:13,221",
"end": "00:00:16,850",
"content": "it's also the belief that\n we never knew the severity"
}
Thanks!
Edit: regex101

Use this regex to match your text:
/\d+\n[0-9\:\,\-\>\s]{29}\n(.+|(\n[^\n]))+/g
I'll break it down in parts:
Part 1: \d+\n
This part matches any digits followed by exactly one newline character.
Part 2: [0-9\:\,\-\>\s]{29}\n
This part matches the included characters, with exact length of 29, which is the fixed format of, for example, 00:00:11,636 --> 00:00:13,221, then followed by one newline character.
Part 3: (.+|(\n[^\n]))+
Now this part is important. I'll break it into sub-parts:
.+ is to match any character, except newline characters.
(\n[^\n]) is to match exactly one newline character that is NOT followed by another newline character. This is important to make multi-line subtitle matching possible. Without this, you can't match multi-line subtitles (because of the file structure, not because of regex limitation).
Wrapping them up with bracket (...)+ is to let it match with multiple lines. This is how you can match multi-line subtitles.
Part 4: g
The global flag makes the regex return every match instead of stopping at the first one.
Working code
Building on this regexp, I have also parsed the matches into your desired output in a different way, which is a lot easier and less complicated than your current approach.
You can see how you can utilise it:
const text = `
1
00:00:11,636 --> 00:00:13,221
Josh communicated but

2
00:00:13,221 --> 00:00:16,850
it's also the belief that
we never knew the severity
`;
const regex = /\d+\n+[0-9\:\,\-\>\s]{29}\n(.+|(\n[^\n]))+/g;
const rawResult = text.match(regex);
console.log(rawResult);
const parsedResult = rawResult.map(chunk => {
const [id, time, ...lines] = chunk.split(/\n/g);
const [start, end] = time.split(/\s\-\-\>\s/);
const content = lines.join('\n');
return { id: Number(id), start, end, content }; // numeric id, as in the desired output
});
console.log(parsedResult);
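For comparison, the split-based approach from the question can also be made to work. This is a sketch under the assumption that cues are separated by at least one blank line; parseSrt is a hypothetical name. The separator uses \s* rather than \s+ so that a plain empty line matches, and the inner split uses a regex literal rather than the string '/\r?\n/'.

```javascript
// Hypothetical helper: split the file into cues on blank lines,
// then split each cue into id, time range, and text lines.
function parseSrt(content) {
  return content
    .trim()
    .split(/\r?\n\s*\r?\n/)               // blank line(s) separate the cues
    .map(block => {
      const [id, time, ...lines] = block.split(/\r?\n/);
      const [start, end] = time.split(/\s*-->\s*/);
      return { id: Number(id), start, end, content: lines.join('\n') };
    });
}

const sample = [
  '1',
  '00:00:11,636 --> 00:00:13,221',
  'Josh communicated but',
  '',
  '2',
  '00:00:13,221 --> 00:00:16,850',
  "it's also the belief that",
  'we never knew the severity',
].join('\n');

const captions = parseSrt(sample);
console.log(JSON.stringify(captions, null, 2));
```

Malformed cues (for example a block without a time line) are not handled here and would need validation before use on real files.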

Regex replace if matching group not preceded by `\` except if preceded by `\\`

My Goal
What I want to do is something similar to this:
let json_obj = {
hello: {
to: 'world'
},
last_name: {
john: 'smith'
},
example: 'a ${type}', // ${type} -> json_obj.type
type: 'test'
}
// ${hello.to} -> json_obj.hello.to -> "world"
let sample_text = 'Hello ${hello.to}!\n' +
// ${last_name.john} -> json_obj.last_name.john -> "smith"
'My name is John ${last_name.john}.\n' +
// ${example} -> json_obj.example -> "a test"
'This is just ${example}!';
function replacer(text) {
return text.replace(/\${([^}]+)}/g, (m, gr) => {
gr = gr.split('.');
let obj = json_obj;
while(gr.length > 0)
obj = obj[gr.shift()];
/* I know there is no validation but it
is just to show what I'm trying to do. */
return replacer(obj);
});
}
console.log(replacer(sample_text));
Until now this is pretty easy to do.
But if $ is preceded by a backslash (\), I don't want to replace the thing between the braces. For example, \${hello.to} would not be replaced.
The problem gets harder when I also want to be able to escape the backslashes themselves. What I mean by escaping the backslashes is, for example:
\${hello.to} would become: ${hello.to}
\\${hello.to} would become: \world
\\\${hello.to} would become: \${hello.to}
\\\\${hello.to} would become: \\${hello.to}
etc.
What I've tried?
I haven't tried many things so far because I have absolutely no idea how to achieve this: as far as I know, there is no lookbehind pattern in JavaScript regular expressions.
I hope the way I explained it is clear enough to be understood, and I hope someone has a solution.

I recommend you to solve this problem in separate steps :)
1) First step:
Simplify the backslashes in your text by replacing all occurrences of "\\" with "". This eliminates the redundant pairs and makes the token replacement part easier.
text = text.replace(/\\\\/g, '');
2) Second step:
To replace the tokens in the text, use this regex: /(?<!\\)\${([^}]+)}/g. It skips tokens that have a \ directly before them, e.g. \${hello.to}. The (?<!\\) lookbehind needs an ES2018 engine, and the g flag is required so that every token is replaced rather than only the first.
Here is your code with the new expression:
function replacer(text) {
return text.replace(/(?<!\\)\${([^}]+)}/g, (m, gr) => {
gr = gr.split('.');
let obj = json_obj;
while(gr.length > 0)
obj = obj[gr.shift()];
/* I know there is no validation but it
is just to show what I'm trying to do. */
return replacer(obj);
});
}
If you still have any problems, let me know :)
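A single-pass alternative is to capture the run of backslashes in front of each token and decide by its length. The sketch below assumes the usual escaping convention: every pair of backslashes collapses to one, and a leftover odd backslash escapes the token. Under that convention four backslashes give \\world, which differs from the last example listed in the question. The resolve helper and the trimmed json_obj are illustrative only, and, like the original, there is no validation of missing keys.

```javascript
// Sample data from the question (trimmed); `resolve` is a hypothetical helper.
const json_obj = {
  hello: { to: 'world' },
  example: 'a ${type}',
  type: 'test',
};

// Walk a dotted path such as "hello.to" through json_obj.
function resolve(path) {
  return path.split('.').reduce((obj, key) => obj[key], json_obj);
}

function replacer(text) {
  // Group 1: the backslashes right before the token; group 2: the dotted path.
  return text.replace(/(\\*)\$\{([^}]+)\}/g, (match, slashes, path) => {
    // Each pair of backslashes collapses to a single literal backslash.
    const kept = '\\'.repeat(Math.floor(slashes.length / 2));
    if (slashes.length % 2 === 1) {
      // Odd count: the last backslash escapes the token, so keep it verbatim.
      return kept + '${' + path + '}';
    }
    // Even count: substitute the value, resolving nested tokens recursively.
    return kept + replacer(String(resolve(path)));
  });
}

console.log(replacer('Hello ${hello.to}!'));       // Hello world!
console.log(replacer('\\${hello.to}'));            // ${hello.to}
console.log(replacer('\\\\${hello.to}'));          // \world
console.log(replacer('This is just ${example}!')); // This is just a test!
```

Backslashes that are not directly in front of a token are left untouched by this sketch.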

JSON Remove trailing comma from last object

This JSON data is being dynamically inserted into a template I'm working on. I'm trying to remove the trailing comma from the list of objects.
The CMS I'm working in uses Velocity, which I'm not too familiar with yet. So I was looking to write a snippet of JavaScript that detects that trailing comma on the last object (ITEM2) and removes it. Is there a REGEX I can use to detect any comma before that closing bracket?
[
{
"ITEM1":{
"names":[
"nameA"
]
}
},
{
"ITEM2":{
"names":[
"nameB",
"nameC"
]
}
}, // need to remove this comma!
]

You need to find every , that is not followed by a new attribute, object or array.
New attribute could start either with quotes (" or ') or with any word-character (\w).
New object could start only with character {.
New array could start only with character [.
New attribute, object or array could be placed after a bunch of space-like symbols (\s).
So, the regex will be like this:
const regex = /\,(?!\s*?[\{\[\"\'\w])/g;
Use it like this:
// javascript
const json = input.replace(regex, ''); // remove all trailing commas (`input` variable holds the erroneous JSON)
const data = JSON.parse(json); // build a new JSON object based on correct string
Try the first regex.
Another approach is to find every , that is followed by a closing bracket.
Closing brackets in this case are } and ].
Again, closing brackets might be placed after a bunch of space-like symbols (\s).
Hence the regexp:
const regex = /\,(?=\s*?[\}\]])/g;
Usage is the same.
Try the second regex.
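As a quick check, the second regex can be applied to a condensed version of the data from the question. This is a sketch only: the input string below is a shortened stand-in for the template output, and the approach assumes no string value in the data contains a comma directly followed by a closing bracket, since the regex does not know about string boundaries.

```javascript
// Condensed stand-in for the template output, with the trailing comma.
const input = `[
  { "ITEM1": { "names": ["nameA"] } },
  { "ITEM2": { "names": ["nameB", "nameC"] } },
]`;

// A comma followed only by whitespace and a closing } or ] is a trailing comma.
const cleaned = input.replace(/,(?=\s*[}\]])/g, '');
const data = JSON.parse(cleaned);

console.log(data.length);         // 2
console.log(data[1].ITEM2.names); // [ 'nameB', 'nameC' ]
```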

For your specific example, you can do a simple search/replace like this:
,\n]$
Replacement string:
\n]
Working demo
Code
var re = /,\n]$/;
var str = '[ \n { \n "ITEM1":{ \n "names":[ \n "nameA"\n ]\n }\n },\n { \n "ITEM2":{ \n "names":[ \n "nameB",\n "nameC"\n ]\n }\n },\n]';
var subst = '\n]';
var result = str.replace(re, subst);

Consider the Json input = [{"ITEM1":{"names":["nameA"]}},{"ITEM2":{"names":["nameB","nameC"]}},] without whitespaces.
I suggest a simple way using substring.
input = input.substring(0, input.length-2);
input = input + "]";

I developed a simple but useful piece of logic for this purpose; you can try this.
Integer Cnt = 5;
String StrInput = "[";
for(int i=1; i<Cnt; i++){
StrInput +=" {"
+ " \"ITEM"+i+"\":{ "
+ " \"names\":["
+ " \"nameA\""
+ "]"
+"}";
if(i ==(Cnt-1)) {
StrInput += "}";
} else {
StrInput += "},";
}
}
StrInput +="]";
System.out.println(StrInput);

Javascript regex find variables in a math equation

I want to find in a math expression elements that are not wrapped between { and }
Examples:
Input: abc+1*def
Matches: ["abc", "1", "def"]
Input: {abc}+1+def
Matches: ["1", "def"]
Input: abc+(1+def)
Matches: ["abc", "1", "def"]
Input: abc+(1+{def})
Matches: ["abc", "1"]
Input: abc def+(1.1+{ghi})
Matches: ["abc def", "1.1"]
Input: 1.1-{abc def}
Matches: ["1.1"]
Rules
The expression is well-formed. (So there won't be start parenthesis without closing parenthesis or starting { without })
The math symbols allowed in the expression are + - / * and ( )
Numbers could be decimals.
Variables could contains spaces.
Only one level of { } (no nested brackets)
So far, I ended with: http://regex101.com/r/gU0dO4
(^[^/*+({})-]+|(?:[/*+({})-])[^/*+({})-]+(?:[/*+({})-])|[^/*+({})-]+$)
I split the task into 3:
match elements at the beginning of the string
match elements that are between two { and }
match elements at the end of the string
But it doesn't work as expected.
Any idea ?

Matching {}s, especially nested ones is hard (read impossible) for a standard regular expression, since it requires counting the number of {s you encountered so you know which } terminated it.
Instead, a simple string manipulation method could work, this is a very basic parser that just reads the string left to right and consumes it when outside of parentheses.
var input = "abc def+(1.1+{ghi})"; // I assume well formed, as well as no precedence
var inParens = false;
var output = [], buffer = "", parenCount = 0;
for(var i = 0; i < input.length; i++){
if(!inParens){
if(input[i] === "{"){
inParens = true;
parenCount++;
} else if (["+","-","(",")","/","*"].some(function(x){
return x === input[i];
})){ // got symbol
if(buffer!==""){ // buffer has stuff to add to output
output.push(buffer); // add the element collected so far
buffer = "";
}
} else { // letter or number
buffer += input[i]; // push to buffer
}
} else { // inParens is true
if(input[i] === "{") parenCount++;
if(input[i] === "}") parenCount--;
if(parenCount === 0) inParens = false; // consume again
}
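}
if(buffer!=="") output.push(buffer); // flush the element left after the last symbol
console.log(output); // ["abc def", "1.1"]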

This might be an interesting regexp challenge, but in the real world you'd be much better off simply finding all [^+/*()-]+ groups and removing those enclosed in {}'s
"abc def+(1.1+{ghi})".match(/[^+/*()-]+/g).filter(
function(x) { return !/^{.+?}$/.test(x) })
// ["abc def", "1.1"]
That being said, regexes are not the right tool for parsing math expressions. For serious parsing, consider using formal grammars and parsers. There are plenty of parser generators for JavaScript; for example, in PEG.js you can write a grammar like
expr
= left:multiplicative "+" expr
/ multiplicative
multiplicative
= left:primary "*" right:multiplicative
/ primary
primary
= atom
/ "{" expr "}"
/ "(" expr ")"
atom = number / word
number = n:[0-9.]+ { return parseFloat(n.join("")) }
word = w:[a-zA-Z ]+ { return w.join("") }
and generate a parser which will be able to turn
abc def+(1.1+{ghi})
into
[
"abc def",
"+",
[
"(",
[
1.1,
"+",
[
"{",
"ghi",
"}"
]
],
")"
]
]
Then you can iterate this array just normally and fetch the parts you're interested in.

The variable names you mentioned can be matched by \b[\w.]+\b since they are strictly bounded by word separators.
Since you have well formed formulas, the names you don't want to capture are strictly followed by }, therefore you can use a lookahead expression to exclude these :
(\b[\w.]+ \b)(?!})
Will match the required elements (http://regexr.com/38rch).
Edit:
For more complex uses like correctly matching :
abc {def{}}
abc def+(1.1+{g{h}i})
We need to change the lookahead term to (?!({|}))
To also match 1.2-{abc def} we need to change the \b [1]. That would call for a lookbehind expression, which JavaScript did not support at the time, so we have to work around it.
(?:^|[^a-zA-Z0-9. ])([a-zA-Z0-9. ]+(?=[^0-9A-Za-z. ]))(?!({|}))
Seems to be a good one for our examples (http://regex101.com/r/oH7dO1).
[1] \b is the boundary between a \w and a \W, \z or \a. Since \w does not include the space and \W does, it is incompatible with the definition of our variable names.

Going forward with user2864740's comment, you can replace all things between {} with empty and then match the remaining.
var matches = "string here".replace(/{.+?}/g,"").match(/\b[\w. ]+\b/g);
Since you know that expressions are valid, just select \w+
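The strip-then-match idea can be wrapped in a small function and run against the inputs from the question. This is a sketch; findElements is a hypothetical name, and it relies on the stated rules (well-formed input, a single level of braces).

```javascript
// Hypothetical helper: drop every {...} group, then collect the remaining
// runs of word characters, dots and spaces.
function findElements(expression) {
  const withoutBraces = expression.replace(/\{[^}]*\}/g, '');
  return withoutBraces.match(/\b[\w. ]+\b/g) || [];
}

console.log(findElements('abc+1*def'));           // [ 'abc', '1', 'def' ]
console.log(findElements('{abc}+1+def'));         // [ '1', 'def' ]
console.log(findElements('abc def+(1.1+{ghi})')); // [ 'abc def', '1.1' ]
console.log(findElements('1.1-{abc def}'));       // [ '1.1' ]
```

The `|| []` fallback covers expressions where nothing is left after removing the braces, since match returns null in that case.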

Regex remove everything thats outside { }

Regex to remove everything outside the { }
for example:
before:
|loader|1|2|3|4|5|6|7|8|9|{"data" : "some data" }
after:
{"data" : "some data" }
with #Marcelo's regex this works but not if there are others {} inside the {} like here:
"|loader|1|2|3|4|5|6|7|8|9|
{'data':
[
{'data':'some data'}
],
}"

This seems to work - What language are you using - Obviously Regex... but what server side - then I can put it into a statement for you
{(.*)}

You want to do:
Regex.Replace("|loader|1|2|3|4|5|6|7|8|9|{\"data\" : \"some data\" }", ".*?({.*?}).*?", "$1");
(C# syntax, regex should be fine in most languages afaik)

in javascript you can try
s = '|loader|1|2|3|4|5|6|7|8|9|{"data" : "some data" }';
s = s.replace(/[^{]*({[^}]*})/g,'$1');
alert(s);
of course this will not work if "some data" has curly braces so the solution highly depends on your input data.
I hope this will help you
Jerome Wagner

You can do something like this in Java:
String[] tests = {
"{ in in in } out", // "{ in in in }"
"out { in in in }", // "{ in in in }"
" { in } ", // "{ in }"
"pre { in1 } between { in2 } post", // "{ in1 }{ in2 }"
};
for (String test : tests) {
System.out.println(test.replaceAll("(?<=^|\\})[^{]+", ""));
}
The regex is:
(?<=^|\})[^{]+
Basically we match any string that is "outside", defined as something that follows a literal } or starts at the beginning of the string (^), up to the next literal {; that is, we match [^{]+. We replace each matched "outside" string with an empty string.
See also
regular-expressions.info/Lookarounds
(?<=...) is a positive lookbehind
A non-regex Javascript solution, for nestable but single top-level {...}
Depending on the problem specification (it isn't exactly clear), you can also do something like this:
var s = "pre { { { } } } post";
s = s.substring(s.indexOf("{"), s.lastIndexOf("}") + 1);
This does exactly what it says: given an arbitrary string s, it takes its substring starting from the first { to the last } (inclusive).
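The same idea as a reusable function, applied to the nested example from the question. This is a sketch: keepBraced is a hypothetical name, and returning an empty string when no braces are found is an assumption about the desired behavior.

```javascript
// Hypothetical helper: keep everything from the first { to the last }.
function keepBraced(s) {
  const first = s.indexOf('{');
  const last = s.lastIndexOf('}');
  if (first === -1 || last < first) return ''; // no braced part found
  return s.slice(first, last + 1);
}

const nested = "|loader|1|2|3|4|5|6|7|8|9|\n{'data':\n[\n{'data':'some data'}\n],\n}";
console.log(keepBraced(nested));
console.log(keepBraced('|loader|1|2|3|{"data" : "some data" }')); // {"data" : "some data" }
```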

For those who are searching for a PHP version, only this one worked for me:
preg_replace("/.*({.*}).*/","$1",$input);
