Regex matching the entire file - javascript

I have a big SRT (subtitle) file that I'm trying to convert to JSON, but my regex doesn't seem to be working correctly.
My expression:
^(\d+)\r?\n(\d{1,2}:\d{1,2}:\d{1,2}([.,]\d{1,3})?)\s*\-\-\>\s*(\d{1,2}:\d{1,2}:\d{1,2}([.,]\d{1,3})?)\r?\n([\s\S]*)(\r?\n)*$
Here is a sample of my SRT file; each subtitle follows the same scheme.
1
00:00:11,636 --> 00:00:13,221
Josh communicated but

2
00:00:13,221 --> 00:00:16,850
it's also the belief that
we never knew the severity
My JavaScript file:
const fs = require('fs');
function parse(content, options) {
var captions = [];
var parts = content.split(/\r?\n\s+\r?\n/g);
for (var i = 0; i < parts.length; i++) {
var regex = /^(\d+)\r?\n(\d{1,2}:\d{1,2}:\d{1,2}([.,]\d{1,3})?)\s*\-\-\>\s*(\d{1,2}:\d{1,2}:\d{1,2}([.,]\d{1,3})?)\r?\n([\s\S]*)(\r?\n)*$/gi;
var match = regex.exec(parts[i]);
if (match) {
var caption = {};
var eol = "\n";
caption.id = parseInt(match[1]);
caption.start = match[2];
caption.end = match[4];
var lines = match[6].split('/\r?\n/');
caption.content = lines.join(eol);
captions.push(caption);
continue;
}
}
return captions;
};
var content = fs.readFileSync('./English-SRT-CC.srt', 'utf8');
var captions = parse(content);
var json = JSON.stringify(captions, " ", 2);
console.log(json);
fs.writeFile("output.json", json, 'utf8', function (err) {
if (err) {
return console.log(err);
}
console.log("JSON file has been saved.");
});
And finally, here's my output:
{
"id": 1,
"start": "00:00:11,636",
"end": "00:00:13,221",
"content": "Josh communicated but\n\n2\n00:00:13,221 --> 00:00:16,850\n
// cut for shortness, it just continues the rest of the file inside "content"
My desired output:
{
"id": 1,
"start": "00:00:11,636",
"end": "00:00:13,221",
"content": "Josh communicated but"
},
{
"id": 2,
"start": "00:00:13,221",
"end": "00:00:16,850",
"content": "it's also the belief that\n we never knew the severity"
}
Thanks!
Edit: regex101

Use this regex to match your text:
/\d+\n[0-9\:\,\-\>\s]{29}\n(.+|(\n[^\n]))+/g
I'll break it down in parts:
Part 1: \d+\n
This part matches any digits followed by exactly one newline character.
Part 2: [0-9\:\,\-\>\s]{29}\n
This part matches the included characters, with exact length of 29, which is the fixed format of, for example, 00:00:11,636 --> 00:00:13,221, then followed by one newline character.
Part 3: (.+|(\n[^\n]))+
Now this part is important. I'll break it into sub-parts:
.+ is to match any character, except newline characters.
(\n[^\n]) is to match exactly one newline character that is NOT followed by another newline character. This is important to make multi-line subtitle matching possible. Without this, you can't match multi-line subtitles (because of the file structure, not because of regex limitation).
Wrapping them up with bracket (...)+ is to let it match with multiple lines. This is how you can match multi-line subtitles.
Part 4: g
The g (global) flag makes the regex return every match instead of only the first.
Working code
Based on this regexp, I have also used another way to parse the matches into your desired output, which is a lot easier and less complicated than your current approach.
You can see how you can utilise it:
const text = `
1
00:00:11,636 --> 00:00:13,221
Josh communicated but

2
00:00:13,221 --> 00:00:16,850
it's also the belief that
we never knew the severity
`;
const regex = /\d+\n+[0-9\:\,\-\>\s]{29}\n(.+|(\n[^\n]))+/g;
const rawResult = text.match(regex);
console.log(rawResult);
const parsedResult = rawResult.map(chunk => {
const [id, time, ...lines] = chunk.split(/\n/g);
const [start, end] = time.split(/\s\-\-\>\s/);
const content = lines.join('\n');
return { id, start, end, content };
});
console.log(parsedResult);
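The desired output in the question has a numeric id and one object per subtitle block. A minimal sketch along those lines, assuming blocks are separated by at least one blank line and each block consists of an index line, a timing line, then the text. `parseSrt` is a hypothetical helper name, not part of any library:

```javascript
// Hypothetical helper: parse SRT text into caption objects.
// Assumes well-formed blocks separated by blank lines (LF or CRLF endings).
function parseSrt(content) {
  return content
    .replace(/\r\n/g, '\n')   // normalise CRLF to LF
    .trim()
    .split(/\n\s*\n/)         // one chunk per subtitle block
    .map(block => {
      const [id, time, ...lines] = block.split('\n');
      const [start, end] = time.split(/\s*-->\s*/);
      return { id: parseInt(id, 10), start, end, content: lines.join('\n') };
    });
}

const sample =
  '1\n00:00:11,636 --> 00:00:13,221\nJosh communicated but\n\n' +
  "2\n00:00:13,221 --> 00:00:16,850\nit's also the belief that\nwe never knew the severity\n";

const captions = parseSrt(sample);
console.log(JSON.stringify(captions, null, 2));
```

Splitting on blank lines first keeps the per-block logic trivial, at the cost of throwing on a malformed block (for example one with no timing line), so real input may need a validation step.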

Related

How to replace all values in an object with strings using JS?

I'm trying to get an array of JSON objects. To do that, I'm trying to make the input I have parsable, then parse it and push it to that array using a for loop. The inputs I have to work with look like this:
firstname: Chris, lastname: Cheshire, email: chris#cmdcheshire.com, viewerlink: audiencematic.com/viewer?v\u003dTESTSHOW\u0026push\u003d8A043B5A, tempid: 8A043B5A, permaid: F8tGYNx, showid: TESTSHOW
I've gotten it to the point where each loop produces something like this:
{ "firstname": First Name, "lastname": Last Name, "email": sample#gmail.com, "viewerlink": audiencematic.com/viewer?v=TESTSHOW&push=715B3074, "tempid": 715B3074, "permaid": F8tGYNx, "showid": TESTSHOW }
But got stuck on the last bit, making the values strings. I want it to look like this, so I can use JSON.parse():
{ "firstname": "First Name", "lastname": "Last Name", "email": "sample#gmail.com", "viewerlink": "audiencematic.com/viewer?v=TESTSHOW&push=715B3074", "tempid": "715B3074", "permaid": "F8tGYNx", "showid": "TESTSHOW" }
I tried a couple of different methods I found on here, but one of the values is a URL and the period is screwing with the replace expressions. I tried using the replace function like this:
var jsonStr2 = jsonStr.replace(/(: +\w)|(:+\w)/g, function(matchedStr) {
return ':"' + matchedStr.substring(2, matchedStr.length) + '"';
});
But it just becomes this:
{ "firstname":""irst Name, "lastname":""ast Name, "email":""ample#gmail.com, "viewerlink":""udiencematic.com/viewer?v=TESTSHOW&push=715B3074, "tempid":""15B3074, "permaid":""8tGYNx, "showid":""ESTSHOW }
How should I change my replace function?
(I tried that code because I'm using
var jsonStr = string.replace(/(\w+:)|(\w+ :)/g, function(matchedStr) {
return '"' + matchedStr.substring(0, matchedStr.length - 1) + '":';
});
to put quotation marks around the keys and that seems to work.)
FIGURED IT OUT!! SEE MY ANSWER BELOW.
One option might be to try using a deserialized version of the string, alter the values associated with the properties of the object, and then convert back to a string.
// JSON.parse requires property names in double quotes
var person = '{"fname":"John", "lname":"Doe", "age":25}';
var obj = JSON.parse(person);
for (var x in obj) {
obj[x] = "";
}
var result = JSON.stringify(obj);
It's a little longer than doing a string replacement, but I find it a little easier to follow.
I figured it out! I just had to mess around in regexr to figure out what conditions I needed. Here's the working for loop code:
for (i = 0; i < audiencelistdirty.feed.openSearch$totalResults.$t; i++) {
var string = '{ ' + audiencelistdirty.feed.entry[i].content.$t + ' }';
var jsonStr = string.replace(/(\w+:)|(\w+ :)/g, function(matchedStr) {
return '"' + matchedStr.substring(0, matchedStr.length - 1) + '":';
});
var jsonStr1 = jsonStr.replace(/(:(.*?),)|(:\s(.*?)\s)/g, function(matchedStr) {
return ':"' + matchedStr.substring(2, matchedStr.length - 1) + '",';
});
var jsonStr2 = jsonStr1.replace(/(",})/g, function(matchedStr) {
return '" }';
});
var newObj = JSON.parse(jsonStr2);
audiencelist.push(newObj);
};
It's pretty ugly but it works.
EDIT: Sorry, I completely misread the question. To replace the values with quoted strings use this regex replace function:
const str =
'firstname: Chris, lastname: Cheshire, email: chris#cmdcheshire.com, viewerlink: audiencematic.com/viewer?v\u003dTESTSHOW\u0026push\u003d8A043B5A, tempid: 8A043B5A, permaid: F8tGYNx, showid: TESTSHOW'
const json = (() => {
const result = str
.replace(/\w+:\s(.*?)(?:,|$)/g, function (match, subStr) {
return match.replace(subStr, `"${subStr}"`)
})
.replace(/(\w+):/g, function (match, subStr) {
return match.replace(subStr, `"${subStr}"`)
})
return '{' + result + '}'
})()
Wrap the input string in commas, then use a regex to identify the keys (between , and :) and their associated values (between : and ,) and construct the object directly, as in the example below:
const input = ' firstname : Chris , lastname: Cheshire, email: chris#cmdcheshire.com, viewerlink: audiencematic.com/viewer?v\u003dTESTSHOW\u0026push\u003d8A043B5A, tempid: 8A043B5A, permaid: F8tGYNx, showid: TESTSHOW ';
const wrapped = `,${input},`;
const re = /,\s*([^:\s]*)\s*:\s*(.*?)\s*(?=,)/g;
const obj = {}
Array.from(wrapped.matchAll(re)).forEach((match) => obj[match[1]] = match[2]);
console.log(obj)
String.prototype.matchAll() was added in ES2020, so older JavaScript engines may not implement it. If you need to support such an engine, you can use the old-school way:
const input = ' firstname : Chris , lastname: Cheshire, email: chris#cmdcheshire.com, viewerlink: audiencematic.com/viewer?v\u003dTESTSHOW\u0026push\u003d8A043B5A, tempid: 8A043B5A, permaid: F8tGYNx, showid: TESTSHOW ';
const wrapped = `,${input},`;
const re = /,\s*([^:\s]*)\s*:\s*(.*?)\s*(?=,)/g;
const obj = {}
let match = re.exec(wrapped);
while (match) {
obj[match[1]] = match[2];
match = re.exec(wrapped);
}
console.log(obj);
The anatomy of the regex used above
The regular expression piece by piece:
/ # regex delimiter; not part of the regex but JavaScript syntax
, # match a comma
\s # match a white space character (space, tab, new line)
* # the previous symbol zero or more times
( # start the first capturing group; does not match anything
[ # start a character class...
^ # ... that matches any character not listed inside the class
: # ... i.e. any character but a colon...
\s # ... and white space character
] # end of the character class; the entire class matches only one character
* # the previous symbol zero or more times
) # end of the first capturing group; does not match anything
\s*:\s* # zero or more spaces before and after the colon
( # start of the second capturing group
.* # any character, any number of times; this is greedy by default
? # make it not greedy
) # end of the second capturing group
\s* # zero or more spaces
(?= # positive lookahead assertion; checks for the following substring but does not consume it
, # matches a comma
) # end of the assertion
/ # regex delimiter; not part of the regex but JavaScript
g # regex flag; 'g' for 'global' is needed to find all matches
Read about the syntax of regular expressions in JavaScript. For a more comprehensive description of the regex patterns I recommend reading the PHP documentation of PCRE (Perl-Compatible Regular Expressions).
You can see the regex in action and play with it on regex101.com.
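Since the question's goal is an array of objects, the regex above can be wrapped in a small function and mapped over several input lines. This is only a sketch: `parsePairs` and the shortened sample rows are made up for illustration, and it assumes keys contain no colons or whitespace and values contain no commas:

```javascript
// Hypothetical helper built on the regex explained above.
// Assumes keys have no colons/whitespace and values have no commas.
function parsePairs(input) {
  const re = /,\s*([^:\s]*)\s*:\s*(.*?)\s*(?=,)/g;
  const wrapped = `,${input},`;   // so every pair is preceded and followed by a comma
  const obj = {};
  let match;
  while ((match = re.exec(wrapped)) !== null) {
    obj[match[1]] = match[2];
  }
  return obj;
}

// Shortened, made-up rows in the same "key: value, key: value" shape
const rows = [
  'firstname: Chris, lastname: Cheshire, tempid: 8A043B5A',
  'firstname: First Name, lastname: Last Name, tempid: 715B3074'
];

const audience = rows.map(parsePairs);
console.log(JSON.stringify(audience, null, 2));
```

Building the object directly means the values never have to be quoted by hand; JSON.stringify adds the quotes if a JSON string is needed.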

Regex to fetch all spaces as long as they are not enclosed in brackets

Regex to fetch all spaces as long as they are not enclosed in braces
This is for a javascript mention system
ex: "Speak #::{Joseph Empyre}{b0268efc-0002-485b-b3b0-174fad6b87fc}, all right?"
Need to get:
[ "Speak ", "#::{Joseph Empyre}{b0268efc-0002-485b-b3b0-174fad6b87fc}", ",", "all ", "right?" ]
[Edit]
Solved in: https://codesandbox.io/s/rough-http-8sgk2
Sorry for my bad english
I interpreted your question as you stated it, to fetch all spaces as long as they are not enclosed in braces, although your result example isn't what I would expect. Your example result contains a space after Speak, as well as a separate match for the , after the {} groups. My output below shows what I would expect for what I think you are asking for: a list of strings split on just the spaces outside of braces.
const str =
"Speak #::{Joseph Empyre}{b0268efc-0002-485b-b3b0-174fad6b87fc}, all right?";
// This regex matches both pairs of {} with things inside and spaces
// It will not properly handle nested {{}}
// It does this such that instead of capturing the spaces inside the {},
// it instead captures the whole of the {} group, spaces and all,
// so we can discard those later
var re = /(?:\{[^}]*?\})|( )/g;
var match;
var matches = [];
while ((match = re.exec(str)) != null) {
matches.push(match);
}
var cutString = str;
var splitPieces = [];
for (var len=matches.length, i=len - 1; i>=0; i--) {
match = matches[i];
// Since we have matched both groups of {} and spaces, ignore the {} matches
// just look at the matches that are exactly a space
if(match[0] == ' ') {
// Note that if there is a trailing space at the end of the string,
// we will still treat it as delimiter and give an empty string
// after it as a split element
// If this is undesirable, check if match.index + 1 >= cutString.length first
splitPieces.unshift(cutString.slice(match.index + 1));
cutString = cutString.slice(0, match.index);
}
}
splitPieces.unshift(cutString);
console.log(splitPieces)
Console:
["Speak", "#::{Joseph Empyre}{b0268efc-0002-485b-b3b0-174fad6b87fc},", "all", "right?"]
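A shorter variant of the same idea is to match the tokens instead of locating the spaces. This is a different technique from the loop above, sketched under the same assumption that braces are never nested:

```javascript
const str =
  'Speak #::{Joseph Empyre}{b0268efc-0002-485b-b3b0-174fad6b87fc}, all right?';

// A token is a run of whole {...} groups and/or single characters
// that are neither whitespace nor an opening brace.
const tokens = str.match(/(?:\{[^}]*\}|[^\s{])+/g) || [];

console.log(tokens);
```

Unlike the split-based loop, this never produces empty strings for consecutive or trailing spaces, and an opening brace with no closing brace is silently skipped.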

Split string with regex with numbers problems

Having a list of strings like:
Client Potential XSS2Medium
Client HTML5 Insecure Storage41Medium
Client Potential DOM Open Redirect12Low
I would like to split every string into three strings like:
["Client Potential XSS", "2", "Medium"]
I use this regular expression:
/[a-zA-Z ]+|[0-9]+/g
But it obviously doesn't work with strings that contain other numbers. For example, with:
Client HTML5 Insecure Storage41Medium
the result is:
["Client HTML", "5", " Insecure Storage", "41", "Medium"]
I can't find the regex that produces:
["Client HTML5 Insecure Storage", "41", "Medium"]
This regex works on regex101.com:
(.+[ \t][A-z]+)+([0-9]+)+([A-z]+)
Using it in my code:
data.substring(startIndex, endIndex)
.split("\r\n") // Split the vulnerabilities
.filter(item => !item.match(/(-+)Page \([0-9]+\) Break(-+)/g) // Remove page break
&& !item.match(/PAGE [0-9]+ OF [0-9]+/g) // Remove pagination
&& item !== '') // Remove blank strings
.map(v => v.match(/(.+[ \t][A-z]+)+([0-9]+)+([A-z]+)/g));
doesn't work.
Any help would be greatly appreciated!
EDIT:
All strings end with High, Medium and Low.
The problem is with your g global flag.
Remove that flag from this line: .map(v => v.match(/(.+[ \t][A-z]+)+([0-9]+)+([A-z]+)/g)); to make it:
.map(v => v.match(/(.+[ \t][A-z]+)+([0-9]+)+([A-z]+)/));
Also, you could make the regex much simpler, as shown by @bhmahler:
.map(v => v.match(/(.*?)(\d+)(low|medium|high)/i));
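To see why the g flag matters here, this short sketch runs the simplified regex both ways on one of the sample strings (the variable names are only for illustration):

```javascript
const line = 'Client HTML5 Insecure Storage41Medium';

// With g, match() returns only the full matches; capture groups are dropped.
const withG = line.match(/(.*?)(\d+)(low|medium|high)/gi);

// Without g, match() returns the full match followed by the capture groups.
const withoutG = line.match(/(.*?)(\d+)(low|medium|high)/i);

console.log(withG);             // one element: the whole line
console.log(withoutG.slice(1)); // the three captured parts
```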
The following regex should give you what you are looking for.
/(.*?)(\d+)(low|medium|high)/gi
Here is an example https://regex101.com/r/AS9mvf/1
Here is an example of it working with map
var entries = [
'Client Potential XSS2Medium',
'Client HTML5 Insecure Storage41Medium',
'Client Potential DOM Open Redirect12Low'
];
var matches = entries.map(v => {
var result = /(.*?)(\d+)(low|medium|high)/gi.exec(v);
return [
result[1],
result[2],
result[3]
];
});
console.log(matches);
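You could use a workaround: match known strings such as HTML5 without capturing them, capture the remaining numbers, and wrap only the captured numbers in a delimiter to split on: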
let strings = ['Client Potential XSS2Medium', 'Client HTML5 Insecure Storage41Medium', 'Client Potential DOM Open Redirect12Low', 'Client HTML5 Insecure Storage41Medium'];
let regex = /(?:HTML5|or_other_string)|(\d+)/g;
strings.forEach(function(string) {
string = string.replace(regex, function(match, g1) {
if (typeof(g1) != "undefined") {
return "###" + g1 + "###";
}
return match;
});
string = string.split("###");
console.log(string);
});
See an additional demo on regex101.com.
let arr = ["Client Potential XSS2Medium",
"Client HTML5 Insecure Storage41Medium",
"Client Potential DOM Open Redirect12Low"];
let re = /^.+[a-zA-Z](?=\d+)|\d+(?=[A-Z])|[^\d]+\w+$/g;
arr.forEach(str => console.log(str.match(re)))
^.+[a-zA-Z](?=\d+) Match from the beginning of the string up to a letter that is followed by one or more digits
\d+(?=[A-Z]) Match one or more digits that are followed by an uppercase letter
[^\d]+\w+$ Match one or more non-digit characters followed by word characters up to the end of the string
Here is one solution that wraps the number before the words High, Low or Medium in a custom token using String.replace() and then splits the resulting string on this token:
const inputs = [
"Client Potential XSS2High",
"Client HTML5 Insecure Storage41Medium",
"Client Potential DOM Open Redirect12Low"
];
let token = "-#-";
let regexp = /(\d+)(High|Low|Medium)$/;
let res = inputs.map(
x => x.replace(regexp, `${token}$1${token}$2`).split(token)
);
console.log(res);
Another solution is to use this regexp: /^(.*?)(\d+)(High|Low|Medium)$/i
const inputs = [
"Client Potential XSS2High",
"Client HTML5 Insecure Storage41Medium",
"Client Potential DOM Open Redirect12Low"
];
let regexp = /^(.*?)(\d+)(High|Low|Medium)$/i;
let res = inputs.map(
x => x.match(regexp).slice(1)
);
console.log(res);
const text = `Client Potential XSS2Medium
Client HTML5 Insecure Storage41Medium
Client Potential DOM Open Redirect12Low`
const res = text.split("\n").map(el => el.replace(/\d+/g, a => ' ' + a + ' ') );
console.log(res)

Extracting JSON generic string from text file

I have a file the contents of which are formatted as follows:
{
"title": "This is a test }",
"date": "2017-11-16T20:47:16+00:00"
}
This is a test }
I'd like to extract just the JSON heading via regex (which I think is the most sensible approach here). However, in the future the JSON might change, (e.g. it might have extra fields added, or the current fields could change), so I'd like to keep the regex flexible. I have tried with the solution suggested here, however, that seems a bit too simplistic for my use case: in fact, the above "tricks" the regex, as shown in this regex101.com example.
Since my knowledge of regex is not that advanced, I'd like to know whether there's a regex approach that is able to cover my use case.
Thank you!
You can check for the first index of \n} to get the sub-string:
s = `{
"title": "This is a test }",
"date": "2017-11-16T20:47:16+00:00"
}
This is a test }
}`
i = s.indexOf('\n}')
if (i > 0) {
o = JSON.parse(s = s.slice(0, i + 2))
console.log(s); console.log(o)
}
or a bit shorter with RegEx:
s = `{
"title": "This is a test }",
"date": "2017-11-16T20:47:16+00:00"
}
This is a test }
}`
s.replace(/.*?\n}/s, function(m) {
o = JSON.parse(m)
console.log(m); console.log(o)
})
If the JSON always starts with { at the left margin and ends with } at the right margin, with everything else indented as you show, you can use the regular expression
/^{.*?^}$/ms
The m modifier makes ^ and $ match the beginning and end of lines, not the whole string. The s modifier allows . to match newlines.
var str = `{
"title": "This is a test }",
"date": "2017-11-16T20:47:16+00:00"
}
This is a test }
`;
var match = str.match(/^{.*?^}$/ms);
if (match) {
var data = JSON.parse(match[0]);
}
console.log(data);
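If the indentation cannot be relied on, an alternative to a regex is a small scanner that tracks brace depth and skips over string contents. This is a sketch of that different technique, not part of the answers above; `extractLeadingJson` is a hypothetical name, and it assumes the first `{` in the text opens the JSON header:

```javascript
// Hypothetical helper: return the first balanced {...} object in `text`
// as a string, or null if none is found. Braces inside JSON strings are ignored.
function extractLeadingJson(text) {
  const start = text.indexOf('{');
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (ch === '\\') i++;                 // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === '{') {
      depth++;
    } else if (ch === '}') {
      depth--;
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return null;
}

const file =
  '{\n"title": "This is a test }",\n"date": "2017-11-16T20:47:16+00:00"\n}\nThis is a test }\n';
const header = extractLeadingJson(file);
const parsed = JSON.parse(header);
console.log(parsed.title);
```

Because it only counts braces outside of strings, this keeps working if fields are added or the JSON is nested or minified onto one line.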

A simpler regular expression to parse quoted strings

The question is simple. I have a string that contains multiple elements which are embedded in single-quotation marks:
var str = "'alice' 'anna marie' 'benjamin' 'christin' 'david' 'muhammad ali'"
And I want to parse it so that I have all those names in an array:
result = [
'alice',
'anna marie',
'benjamin',
'christin',
'david',
'muhammad ali'
]
Currently I'm using this code to do the job:
var result = str.match(/\s*'(.*?)'\s*'(.*?)'\s*'(.*?)'\s*'(.*?)'/);
But this regular expression is too long and it's not flexible, so if I have more elements in the str string, I have to edit the regular expression.
What is the fastest and most efficient way to do this parsing? Performance and flexibility are important in our web application.
I have looked at the following questions, but they do not answer mine:
Regular Expression For Quoted String
Regular Expression - How To Find Words and Quoted Phrases
Define the pattern once and use the global g flag.
var matches = str.match(/'[^']*'/g);
If you want the tokens without the single quotes around them, the normal approach would be to use capture groups; however, String.prototype.match() returns only the full matches, not the captured groups, when the g flag is used. The simplest (though not necessarily most efficient) way around this is to remove the quotes afterwards, iteratively:
if (matches)
for (var i=0, len=matches.length; i<len; i++)
matches[i] = matches[i].replace(/'/g, '');
[EDIT] - as the other answers say, you could use split() instead, but only if you can rely on there always being a space (or some common delimiter) between each token in your string.
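In engines that support String.prototype.matchAll() (ES2020), the capture groups are available even with the g flag, so the quotes can be dropped without a second pass. A minimal sketch, using the string from the question:

```javascript
const str = "'alice' 'anna marie' 'benjamin' 'christin' 'david' 'muhammad ali'";

// matchAll yields one match object per hit; index 1 is the text between the quotes.
const names = Array.from(str.matchAll(/'([^']*)'/g), m => m[1]);

console.log(names);
```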
A different approach
I came here needing to parse a string into quoted and non-quoted parts, preserve their order, and then wrap each part in specific tags for React or React Native. I wasn't sure how to adapt the answers here to that need, so I did this instead:
function parseQuotes(str) {
var openQuote = false;
var parsed = [];
var quote = '';
var text = '';
for (var i = 0; i < str.length; i++) {
var item = str[i];
if (item === '"' && !openQuote) {
openQuote = true;
parsed.push({ type: 'text', value: text });
text = '';
}
else if (item === '"' && openQuote) {
openQuote = false;
parsed.push({ type: 'quote', value: quote });
quote = '';
}
else if (openQuote) quote += item;
else text += item;
}
if (openQuote) parsed.push({ type: 'text', value: '"' + quote });
else parsed.push({ type: 'text', value: text });
return parsed;
}
That when given this:
'Testing this "shhhh" if it "works!" " hahahah!'
produces that:
[
{
"type": "text",
"value": "Testing this "
},
{
"type": "quote",
"value": "shhhh"
},
{
"type": "text",
"value": " if it "
},
{
"type": "quote",
"value": "works!"
},
{
"type": "text",
"value": " "
},
{
"type": "text",
"value": "\" hahahah!"
}
]
which allows you to easily wrap tags around it depending on what it is.
https://jsfiddle.net/o6seau4e/4/
When a regex object has the global flag set, you can execute it multiple times against a string to find all matches. It works by starting the next search after the last character matched in the last run:
var buf = "'abc' 'def' 'ghi'";
var exp = /'(.*?)'/g;
for(var match=exp.exec(buf); match!=null; match=exp.exec(buf)) {
alert(match[0]);
}
Personally, I find it a really good way to parse strings.
EDIT: the expression /'(.*?)'/g matches any content between single quotes ('); the quantifier *? is non-greedy, which greatly simplifies the pattern.
One way;
var str = "'alice' 'benjamin' 'christin' 'david'";
var result = {};
str.replace(/'([^']*)'/g, function(m, p1) {
result[p1] = "";
});
for (var k in result) {
alert(k);
}
If someone gets here needing more complex string parsing, with both single and double quotes and support for escaped quotes, this is the regex. Tested in JS and Ruby.
r = /(['"])((?:\\\1|(?!\1).)*)(\1)/g
str = "'alice' ddd vvv-12 'an\"na m\\'arie' \"hello ' world\" \"hello \\\" world\" 'david' 'muhammad ali'"
console.log(str.match(r).join("\n"))
'alice'
'an"na m\'arie'
"hello ' world"
"hello \" world"
'david'
'muhammad ali'
Note that non-quoted strings were not found. If the goal is to also find non-quoted words, then a small fix will do:
r = /(['"])((?:\\\1|(?!\1).)*)(\1)|([^'" ]+)/g
console.log(str.match(r).join("\n"))
'alice'
ddd
vvv-12
'an"na m\'arie'
"hello ' world"
"hello \" world"
'david'
'muhammad ali'
