I am working with a string that is a JSON string in Python, but when it is used in JavaScript it contains single quotes (') instead of double quotes. I would like to turn it into real JSON (by using JSON.parse()), but after replacing every ' with a double quote there are now quotation marks in the middle of some values.
Example: '{"author": "Jonah D"Almeida", ... }'
(I want to replace the one in between D and Almeida)
As the whole value is already wrapped in quotation marks, JavaScript throws an error because it can't parse it as JSON. To solve that, I want to replace the quotation mark in the middle of the value with a ' (single quote), but only if it has letters both preceding and following it.
My thought: myString.replace('letter before ... " ... letter after', "'")
Any idea how I can get the right expression for this? I just want to know the regex that checks whether the " quote has letters before and after it and, if so, changes it to a single quote (').
The OP ... "Basically I am working with a string that is a json string"
The above example is not what one would call a JSON string. The OP's example data string is already invalid JSON.
Thus the first step should be to fix the process which generates such data.
Because ...
"parsing valid JSON data will return a perfectly valid object, and in case of the OP's use case a correctly escaped string value as well. "
... proof ...
const testSample_A = { author: "Jonah D'Almeida" };
const testSample_B = { author: 'Jonah D"Almeida' };
const testSample_C = { author: 'Jonah D\'Almeida' };
const testSample_D = { author: "Jonah D\"Almeida" };
console.log({
testSample_A,
testSample_B,
testSample_C,
testSample_D,
});
console.log('JSON.stringify(...) ... ', {
testSample_A: JSON.stringify(testSample_A),
testSample_B: JSON.stringify(testSample_B),
testSample_C: JSON.stringify(testSample_C),
testSample_D: JSON.stringify(testSample_D),
});
console.log('JSON.parse(JSON.stringify(...)) ... ', {
testSample_A: JSON.parse(JSON.stringify(testSample_A)),
testSample_B: JSON.parse(JSON.stringify(testSample_B)),
testSample_C: JSON.parse(JSON.stringify(testSample_C)),
testSample_D: JSON.parse(JSON.stringify(testSample_D)),
});
Edit
A sanitizing task which exactly follows the OP's requirements can nevertheless be achieved with a regex that features both a positive lookbehind and a positive lookahead ... either for ASCII word characters only, /(?<=\w)"(?=\w)/gm (note that \w also matches digits and the underscore), or more international with Unicode property escapes ... /(?<=\p{L})"(?=\p{L})/gmu
console.log('Letter unicode escapes ...', `
{"author": "Jonah D"Almeida", ... }
{"author": "Jon"ah D"Almeida", ... }
{"author": "Jon"ah D"Alme"ida", ... }`
.replace(/(?<=\p{L})"(?=\p{L})/gmu, '\\"')
);
console.log('Basic Latin support ...', `
{"author": "Jonah D"Almeida", ... }
{"author": "Jon"ah D"Almeida", ... }
{"author": "Jon"ah D"Alme"ida", ... }`
.replace(/(?<=\w)"(?=\w)/gm, '\\"')
);
console.log(
'sanitized and parsed string data ...',
JSON.parse(`[
{ "author": "Jonah D"Almeida" },
{ "author": "Jon"ah D"Almeida" },
{ "author": "Jon"ah D"Alme"ida" }
]`.replace(/(?<=\p{L})"(?=\p{L})/gmu, '\\"'))
);
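For reuse, the same lookaround replacement could be wrapped in a small helper. The following is a minimal sketch, not part of the original snippets: the name `sanitizeInnerQuotes` and the sample strings are made up for illustration, and the default replacement is the single quote the OP asked for. It also shows the limit of the approach: a stray quote that is not surrounded by letters on both sides is left untouched.

```javascript
// Hypothetical helper: replaces a double quote only when it sits directly
// between two letters (Unicode-aware because of \p{L} and the `u` flag).
function sanitizeInnerQuotes(jsonLike, replacement = "'") {
  return jsonLike.replace(/(?<=\p{L})"(?=\p{L})/gu, replacement);
}

const broken = '{"author": "Jonah D"Almeida", "title": "L"Étranger"}';
const repaired = sanitizeInnerQuotes(broken);
const parsed = JSON.parse(repaired);
console.log(parsed.author);
console.log(parsed.title);

// A quote next to a digit or a space does not match the pattern,
// so this input stays invalid and JSON.parse still throws.
const stillBroken = sanitizeInnerQuotes('{"size": "5" tall"}');
let parseFailed = false;
try {
  JSON.parse(stillBroken);
} catch (err) {
  parseFailed = true;
}
console.log(parseFailed);
```

This is why fixing the process that generates the data remains the more reliable option.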
Given the string:
This is a test, and all test-alike words should be testtest and checked.
I would like to write a regex that matches either test or test-alike any number of times, but not testtest.
I'm no regex expert, but I have come up with the following so far.
\s*(\btest-alike\b|\btest\b) matches well, but something like test-test also matches and it shouldn't.
(^|[^\w])(\btest\b|\btest-alike\b)($|[^\w]) matches correctly using capture groups, but only every other occurrence: match, no match, match, etc.
I would like to know whether, for the first regex, there is a way to specify the condition to not match when words are split by chars like ' ' '' '"' etc.
Maybe this one might help ... /\b(?:\w+-)*test(?:-\w+)*\b/gi.
It tries matching the searched word together with optional leading and trailing valid character sequences, each sequence built from at least one word character and a connecting - ((?:\w+-)* or, the other way around, (?:-\w+)*) ...
const sampleText = `This is a Test, and all alike-alike-test-alike-alike words should be testtest and checked.
This is a test-Test, and all alike-alike-Test alike-alike words should be testtest and checked.`;
const regX = (/\b(?:\w+-)*test(?:-\w+)*\b/gi);
const search = 'test';
console.log(
sampleText.match(regX)
);
console.log(
sampleText.match(
RegExp(`\\b(?:\\w+-)*${ search }(?:-\\w+)*\\b`, 'gi')
)
);
Edit
Regarding the subject that was furthermore discussed within several comment blocks, a search and replace/highlight approach might look like this ...
function toRegExpSearch(str) {
return String(str)
.replace((/[$^*+?!:=.|(){}[\]\\]/g), (([match]) => `\\${ match }`))
.replace((/\s+/g), '\\s+');
}
function highlightSearch(text, search) {
const regX = RegExp(`\\b((?:\\w+-)*${ toRegExpSearch(search) }(?:-\\w+)*)\\b`, 'gi');
const matchList = text.match(regX) || [];
return text
.split(regX)
.reduce((str, partial) => {
if (partial === matchList[0]) {
matchList.shift();
str = `${ str }<mark>${ partial }</mark>`;
} else {
str = `${ str }${ partial }`;
}
return str;
}, '');
}
const sampleText = `This is a Test, and all alike-alike-test-alike-alike words should be testtest and checked.
This is a test-Test, and all alike-alike-Test alike-alike words should be testtest and checked.`;
console.log(
highlightSearch(sampleText, 'TEST')
);
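A shorter variant of the same idea lets String.prototype.replace do the stitching instead of split plus reduce. This is only a sketch: the names `escapeForRegExp` and `highlightWithReplace` are hypothetical, and the escaping helper is a commonly used pattern rather than the `toRegExpSearch` function above (it does not turn whitespace into \s+).

```javascript
// Escape characters that are special inside a regular expression.
function escapeForRegExp(str) {
  return String(str).replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every match (the search word plus hyphen-connected neighbours)
// in <mark> tags via a replacer callback.
function highlightWithReplace(text, search) {
  const regX = new RegExp(
    `\\b(?:\\w+-)*${escapeForRegExp(search)}(?:-\\w+)*\\b`,
    "gi"
  );
  return text.replace(regX, (match) => `<mark>${match}</mark>`);
}

const sample =
  "This is a Test, and all alike-alike-test-alike-alike words should be testtest and checked.";
const highlighted = highlightWithReplace(sample, "TEST");
console.log(highlighted);
```

Because the callback receives the whole match, no capture group and no bookkeeping list of matches are needed.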
/\btest(-[a-zA-Z]*)?\b/gi
Here \btest matches the word test, and (-[a-zA-Z]*)?\b optionally matches a hyphen followed by letters.
As per Wiktor's answer above, the solution to this problem is:
\b(?:test-alike|test's)\b
Please note that words should be sorted. Detailed info can be found in the answer here.
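To illustrate why the order of the alternatives matters, here is a small sketch. It assumes that "sorted" means longest alternative first, which the answer does not spell out; the sample text and the helper name `buildAlternation` are made up for the example.

```javascript
const text = "a test-alike word and a plain test";

// Shorter alternative first: "test" wins inside "test-alike",
// because a word boundary exists between "t" and "-".
const shortFirst = /\b(?:test|test-alike)\b/g;
// Longer alternative first: the full "test-alike" is matched.
const longFirst = /\b(?:test-alike|test)\b/g;

const shortFirstMatches = text.match(shortFirst);
const longFirstMatches = text.match(longFirst);
console.log(shortFirstMatches);
console.log(longFirstMatches);

// Hypothetical helper: build the alternation from a word list,
// longest word first, escaping regex metacharacters.
function buildAlternation(words) {
  const escaped = [...words]
    .sort((a, b) => b.length - a.length)
    .map((word) => word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  return new RegExp(`\\b(?:${escaped.join("|")})\\b`, "g");
}

const built = buildAlternation(["test", "test-alike"]);
console.log(built.source);
```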
I'm trying to extract the PROCEDURE sections out of CLAIM, EOB and COB from a text file
and create an object like so:
claim : [{PROCEDURE1}, {PROCEDURE2}, {PROCEDURE3}],
eob : [{PROCEDURE1}, {PROCEDURE2}, {PROCEDURE3}],
cob: [{PROCEDURE1}, {PROCEDURE2}, {PROCEDURE3}]
let data = ` SEND CLAIM {
PREFIX="9403 "
PROCEDURE { /* #1 */
PROCEDURE_LINE="1"
PROCEDURE_CODE="01201"
}
PROCEDURE { /* #2 */
PROCEDURE_LINE="2"
PROCEDURE_CODE="02102"
}
PROCEDURE { /* #3 */
PROCEDURE_LINE="3"
PROCEDURE_CODE="21222"
}
}
SEND EOB {
PREFIX="9403 "
OFFICE_SEQUENCE="000721"
PROCEDURE { /* #1 */
PROCEDURE_LINE="1"
ELIGIBLE="002750"
}
PROCEDURE { /* #2 */
PROCEDURE_LINE="2"
ELIGIBLE="008725"
}
PROCEDURE { /* #3 */
PROCEDURE_LINE="3"
ELIGIBLE="010200"
}
}
SEND COB {
PREFIX="TEST4 "
OFFICE_SEQUENCE="000721"
PROCEDURE { /* #1 */
PROCEDURE_LINE="1"
PROCEDURE_CODE="01201"
}
PROCEDURE { /* #2 */
PROCEDURE_LINE="2"
PROCEDURE_CODE="02102"
}
PROCEDURE { /* #3 */
PROCEDURE_LINE="3"
PROCEDURE_CODE="21222"
DATE="19990104"
}
PRIME_EOB=SEND EOB {
PREFIX="9403 "
OFFICE_SEQUENCE="000721"
PROCEDURE { /* #1 */
PROCEDURE_LINE="1"
ELIGIBLE="002750"
}
PROCEDURE { /* #2 */
PROCEDURE_LINE="2"
ELIGIBLE="008725"
}
PROCEDURE { /* #3 */
PROCEDURE_LINE="3"
ELIGIBLE="010200"
}
}
}`
let re = /(^\s+PROCEDURE\s\{)([\S\s]*?)(?:})/gm
console.log(data.match(re));
Here is what I have tried so far: (^\s+PROCEDURE\s\{)([\S\s]*?)(?:}), but I can't figure out how to match only the PROCEDUREs that follow the key CLAIM or EOB.
For "claim", you could match the following regular expression.
/(?<=^ *SEND CLAIM +\{\r?\n(?:^(?! *SEND EOB *\{)(?! *SEND COB *\{).*\r?\n)*^ *PROCEDURE *)\{[^\}]*\}/
This matches the following strings, which I assume can be easily saved to an array with a sprinkling of Javascript code.
{ /* CLAIM #1 */
PROCEDURE_LINE="1"
PROCEDURE_CODE="01201"
}
{ /* CLAIM #2 */
PROCEDURE_LINE="2"
PROCEDURE_CODE="02102"
}
{ /* CLAIM #3 */
PROCEDURE_LINE="3"
PROCEDURE_CODE="21222"
}
Javascript's regex engine performs the following operations.
(?<= : begin positive lookbehind
^ : match beginning of line
\ *SEND CLAIM\ + : match 'SEND CLAIM' preceded by 0+ spaces and followed by 1+ spaces
\{\r?\n : match '{' then line terminators
(?: : begin non-capture group
^ : match beginning of line
(?! : begin negative lookahead
\ *SEND EOB\ * : match 'SEND EOB' surrounded by 0+ spaces
\{ : match '{'
) : end negative lookahead
(?! : begin negative lookahead
\ *SEND COB\ * : match 'SEND COB' surrounded by 0+ spaces
\{ : match '{'
) : end negative lookahead
.*\r?\n : match line including terminators
) : end non-capture group
* : execute non-capture group 0+ times
^ : match beginning of line
\ *PROCEDURE\ * : match 'PROCEDURE' surrounded by 0+ spaces
) : end positive lookbehind
\{[^\}]*\} : match '{', 0+ characters other than '}', '}'
I've escaped space characters above to improve readability.
For "eob", use the slightly-modified regex:
/(?<=^ *SEND EOB +\{\r?\n(?:^(?! *SEND CLAIM *\{)(?! *SEND COB *\{).*\r?\n)*^ *PROCEDURE *)\{[^\}]*\}/
I've made no attempt to do the same for "cob" as that part has a different structure than "claim" and "eob" and it is not clear to me how it is to be treated.
A final note, should it not be obvious: it would be far easier to extract the strings of interest using conventional code with loops and, possibly, simple regular expressions, but I hope some readers may find my answer instructive about some elements of regular expressions.
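The "conventional code with loops" mentioned above could look roughly like the following line-by-line scan. This is only a sketch under stated assumptions: the function name `extractProcedures` and the shortened sample are invented for illustration, every block opens and closes on its own line as in the question's data, and PROCEDURE blocks inside a nested block such as PRIME_EOB are deliberately skipped, since the question does not say how they should be treated.

```javascript
// Hypothetical helper: collects PROCEDURE blocks per top-level SEND section.
function extractProcedures(text) {
  const result = {};
  const stack = [];   // names of the blocks that are currently open
  let current = null; // the procedure object being filled, if any

  for (const line of text.split(/\r?\n/)) {
    let m;
    if ((m = line.match(/^\s*(?:\w+=)?SEND\s+(\w+)\s*\{/))) {
      stack.push(m[1].toLowerCase());
      if (stack.length === 1 && !result[stack[0]]) result[stack[0]] = [];
    } else if (/^\s*PROCEDURE\s*\{/.test(line)) {
      stack.push("procedure");
      // Only direct children of a top-level section are collected.
      current = stack.length === 2 ? {} : null;
    } else if ((m = line.match(/^\s*(\w+)="(.*)"\s*$/))) {
      if (current) current[m[1]] = m[2];
    } else if (/^\s*\}/.test(line)) {
      const closed = stack.pop();
      if (closed === "procedure" && current) {
        result[stack[0]].push(current);
        current = null;
      }
    }
  }
  return result;
}

const sample = `SEND CLAIM {
  PREFIX="9403 "
  PROCEDURE { /* #1 */
    PROCEDURE_LINE="1"
    PROCEDURE_CODE="01201"
  }
}
SEND COB {
  PROCEDURE { /* #1 */
    PROCEDURE_LINE="1"
  }
  PRIME_EOB=SEND EOB {
    PROCEDURE { /* #1 */
      ELIGIBLE="002750"
    }
  }
}`;

const procedures = extractProcedures(sample);
console.log(JSON.stringify(procedures, null, 2));
```

Tracking the open blocks in a stack is what makes the nested PRIME_EOB case explicit, which is exactly the part that is awkward to express in a single regular expression.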
Will CLAIM, EOB and COB always be in the same order? If so, you can split the text before using the regex you already have:
const procRegex = /(^\s+PROCEDURE\s\{)([\S\s]*?)(?:})/gm;
let claimData = data.split("EOB")[0];
let claimProcedures = claimData.match(procRegex);
let eobData = data.split("COB")[0].split("EOB")[1];
let eobProcedures = eobData.match(procRegex);
let cobData = data.split("COB")[1];
let cobProcedures = cobData.match(procRegex);
// If you want to leave out the PRIME_EOB, you can split COB again
cobData = cobData.split("EOB")[0];
cobProcedures = cobData.match(procRegex);
console.log(claimProcedures);
Output:
[
' PROCEDURE { /* #1 */\n' +
' PROCEDURE_LINE="1"\n' +
' PROCEDURE_CODE="01201"\n' +
' \n' +
' }',
' PROCEDURE { /* #2 */\n' +
' PROCEDURE_LINE="2"\n' +
' PROCEDURE_CODE="02102"\n' +
' \n' +
' }',
' PROCEDURE { /* #3 */\n' +
' PROCEDURE_LINE="3"\n' +
' PROCEDURE_CODE="21222"\n' +
' \n' +
' }'
]
As an alternate method, your data is not terribly far away from valid JSON, so you could run with that. The code below translates the data into JSON, then parses it into a Javascript object that you can use however you want.
/* data cannot have Javascript comments in it for this to work, or you need
another regex to remove them */
data = data.replace(/=/g, ":") // replace = with :
.replace(/\s?{/g, ": {") // replace { with : {
.replace(/SEND/g, "") // remove "SEND"
.replace(/\"\s*$(?!\s*\})/gm, "\",") // add commas after object properties
.replace(/}(?=\s*\w)/g, "},") // add commas after objects
.replace(/(?<!\}),\s*PROCEDURE: /g, ",\nPROCEDURES: [") // start procedures list
.replace(/(PROCEDURE:[\S\s]*?\})\s*(?!,\s*PROCEDURE)/g, "$1]\n") // end list
.replace(/PROCEDURE: /g, "") // remove "PROCEDURE"
.replace("PRIME_EOB: EOB:", "PRIME_EOB:") // replace double key with single key. Is this the behavior you want?
.replace(/(\S*):/g, "\"$1\":") // put quotes around object key names
let dataObj = JSON.parse("{" + data + "}");
console.log(dataObj.CLAIM.PROCEDURES);
Output:
[ { PROCEDURE_LINE: '1', PROCEDURE_CODE: '01201' },
{ PROCEDURE_LINE: '2', PROCEDURE_CODE: '02102' },
{ PROCEDURE_LINE: '3', PROCEDURE_CODE: '21222' } ]
What you are trying to do is to write a parser for the syntax used in your text file.
Looking at the syntax, it is very close to JSON.
I would recommend transforming the syntax into valid JSON with regexes and parsing the result with the JavaScript JSON parser. The parser is able to handle nesting. At the end you will have a JavaScript object from which you can remove, or to which you can add, whatever you need. In addition, the hierarchy of the source will be preserved.
This code does the job for the provided example:
let data = ` SEND CLAIM {
// your text file contents
}`;
// handle PRIME_EOB=SEND EOB {
var regex = /(\w+)=\w+.*{/gm;
var replace = data.replace(regex, "$1 {");
// append double quotes in lines like PROCEDURE_LINE="1"
var regex = /(\w+)=/g;
var replace = replace.replace(regex, "\"$1\": ");
// append double quotes in lines like PROCEDURE {
var regex = /(\w+.*)\s{/g;
var replace = replace.replace(regex, "\"$1\": {");
// remove comments: /* */
var regex = /\/\**.*\*\//g;
var replace = replace.replace(regex, "");
// append commas to lines i.e. "PROCEDURE_LINE": "2"
var regex = /(\".*\":\s*\".*\")/gm;
var replace = replace.replace(regex, "$1,");
// append commas to '}'
var regex = /^.*}.*$/gm;
var replace = replace.replace(regex, "},");
// remove trailing commas
var regex = /\,(?!\s*?[\{\[\"\'\w])/g;
var replace = replace.replace(regex, "");
// surround with {}
replace = "{" + replace + "}";
console.log(replace);
var obj = JSON.parse(replace);
console.log(obj);
The JSON looks like this snippet:
{ "SEND CLAIM": {
"PREFIX": "9403 ",
"PROCEDURE": {
"PROCEDURE_LINE": "1",
"PROCEDURE_CODE": "01201"
},
"PROCEDURE": {
"PROCEDURE_LINE": "2",
"PROCEDURE_CODE": "02102"
The final object can then be inspected in the debugger.
It is not completely clear to me what your final array or object should look like. But from here I expect only little effort to produce what you desire.
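One caveat worth checking before building on this: the generated JSON repeats the key "PROCEDURE" inside each block, and JSON.parse keeps only the last value for a duplicated key. The sketch below demonstrates that behaviour on a tiny hand-written string and shows one possible workaround, numbering the keys before parsing. The `PROCEDURE_n` naming scheme is an assumption made for this example, not something the original code does.

```javascript
const json =
  '{"PROCEDURE": {"PROCEDURE_LINE": "1"}, "PROCEDURE": {"PROCEDURE_LINE": "2"}}';

// Duplicate keys collapse: only the last "PROCEDURE" survives.
const collapsed = JSON.parse(json);
console.log(collapsed);

// Possible workaround: make the keys unique before parsing.
// "PROCEDURE_LINE" is not affected, because the pattern requires
// the closing quote directly after PROCEDURE.
let counter = 0;
const numbered = json.replace(/"PROCEDURE":/g, () => `"PROCEDURE_${++counter}":`);
const kept = JSON.parse(numbered);
console.log(Object.keys(kept));
```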
I want to find in a math expression elements that are not wrapped between { and }
Examples:
Input: abc+1*def
Matches: ["abc", "1", "def"]
Input: {abc}+1+def
Matches: ["1", "def"]
Input: abc+(1+def)
Matches: ["abc", "1", "def"]
Input: abc+(1+{def})
Matches: ["abc", "1"]
Input: abc def+(1.1+{ghi})
Matches: ["abc def", "1.1"]
Input: 1.1-{abc def}
Matches: ["1.1"]
Rules
The expression is well-formed. (So there won't be start parenthesis without closing parenthesis or starting { without })
The math symbols allowed in the expression are + - / * and ( )
Numbers could be decimals.
Variables can contain spaces.
Only one level of { } (no nested brackets)
So far, I ended with: http://regex101.com/r/gU0dO4
(^[^/*+({})-]+|(?:[/*+({})-])[^/*+({})-]+(?:[/*+({})-])|[^/*+({})-]+$)
I split the task into 3:
match elements at the beginning of the string
match elements that are between two { and }
match elements at the end of the string
But it doesn't work as expected.
Any idea ?
Matching {}s, especially nested ones is hard (read impossible) for a standard regular expression, since it requires counting the number of {s you encountered so you know which } terminated it.
Instead, a simple string manipulation method could work, this is a very basic parser that just reads the string left to right and consumes it when outside of parentheses.
var input = "abc def+(1.1+{ghi})"; // I assume well formed, as well as no precedence
var inParens = false;
var output = [], buffer = "", parenCount = 0;
for(var i = 0; i < input.length; i++){
if(!inParens){
if(input[i] === "{"){
inParens = true;
parenCount++;
} else if (["+","-","(",")","/","*"].some(function(x){
return x === input[i];
})){ // got symbol
if(buffer!==""){ // buffer holds a finished token
output.push(buffer); // add it to the output
buffer = "";
}
} else { // letter or number
buffer += input[i]; // push to buffer
}
} else { // inParens is true
if(input[i] === "{") parenCount++;
if(input[i] === "}") parenCount--;
if(parenCount === 0) inParens = false; // consume again
}
}
if(buffer !== "") output.push(buffer); // flush the last token, e.g. "def" in "abc+1*def"
This might be an interesting regexp challenge, but in the real world you'd be much better off simply finding all [^+/*()-]+ groups and removing those enclosed in {}'s
"abc def+(1.1+{ghi})".match(/[^+/*()-]+/g).filter(
function(x) { return !/^{.+?}$/.test(x) })
// ["abc def", "1.1"]
That being said, regexes are not the right tool for parsing math expressions. For serious parsing, consider using formal grammars and parsers. There are plenty of parser generators for JavaScript; for example, in PEG.js you can write a grammar like
expr
= left:multiplicative "+" expr
/ multiplicative
multiplicative
= left:primary "*" right:multiplicative
/ primary
primary
= atom
/ "{" expr "}"
/ "(" expr ")"
atom = number / word
number = n:[0-9.]+ { return parseFloat(n.join("")) }
word = w:[a-zA-Z ]+ { return w.join("") }
and generate a parser which will be able to turn
abc def+(1.1+{ghi})
into
[
"abc def",
"+",
[
"(",
[
1.1,
"+",
[
"{",
"ghi",
"}"
]
],
")"
]
]
Then you can iterate this array just normally and fetch the parts you're interested in.
The variable names you mentioned can be matched by \b[\w.]+\b since they are strictly bounded by word separators.
Since you have well formed formulas, the names you don't want to capture are strictly followed by }, therefore you can use a lookahead expression to exclude these :
(\b[\w.]+ \b)(?!})
Will match the required elements (http://regexr.com/38rch).
Edit:
For more complex uses like correctly matching :
abc {def{}}
abc def+(1.1+{g{h}i})
We need to change the lookahead term to (?!({|}))
To include the match of 1.2-{abc def} we need to change the \b [1]. That would call for a lookbehind expression, which JavaScript did not support before ES2018, so we have to work around it.
(?:^|[^a-zA-Z0-9. ])([a-zA-Z0-9. ]+(?=[^0-9A-Za-z. ]))(?!({|}))
Seems to be a good one for our examples (http://regex101.com/r/oH7dO1).
[1] \b is the separation between a \w and a \W, \z or \a. Since \w does not include the space and \W does, it is incompatible with the definition of our variable names.
Going forward with user2864740's comment, you can replace all things between {} with empty and then match the remaining.
var matches = "string here".replace(/{.+?}/g,"").match(/\b[\w. ]+\b/g);
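As a quick sanity check, the strip-then-match idea can be run against the examples from the question. This is a sketch with a hypothetical helper name (`outsideBraces`); it relies on the question's rules that braces are never nested and that the expression is well-formed, and it uses [^}]* instead of .+? so that the braces are escaped explicitly.

```javascript
// Hypothetical helper: drop every {...} group, then collect the remaining
// runs of word characters, dots and spaces.
function outsideBraces(expr) {
  return expr.replace(/\{[^}]*\}/g, "").match(/\b[\w. ]+\b/g) || [];
}

// Inputs and expected matches taken from the question.
const cases = [
  ["abc+1*def", ["abc", "1", "def"]],
  ["{abc}+1+def", ["1", "def"]],
  ["abc+(1+def)", ["abc", "1", "def"]],
  ["abc+(1+{def})", ["abc", "1"]],
  ["abc def+(1.1+{ghi})", ["abc def", "1.1"]],
  ["1.1-{abc def}", ["1.1"]],
];

const results = cases.map(([input]) => outsideBraces(input));
for (let i = 0; i < cases.length; i++) {
  console.log(cases[i][0], "=>", results[i]);
}
```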
Since you know that expressions are valid, just select \w+
Regex to remove everything outside the { }
for example:
before:
|loader|1|2|3|4|5|6|7|8|9|{"data" : "some data" }
after:
{"data" : "some data" }
With @Marcelo's regex this works, but not if there are other {} inside the {}, like here:
"|loader|1|2|3|4|5|6|7|8|9|
{'data':
[
{'data':'some data'}
],
}"
This seems to work. Which language are you using on the server side? Then I can put it into a statement for you.
{(.*)}
You want to do:
Regex.Replace("|loader|1|2|3|4|5|6|7|8|9|{\"data\" : \"some data\" }", ".*?({.*?}).*?", "$1");
(C# syntax, regex should be fine in most languages afaik)
in javascript you can try
s = '|loader|1|2|3|4|5|6|7|8|9|{"data" : "some data" }';
s = s.replace(/[^{]*({[^}]*})/g,'$1');
alert(s);
of course this will not work if "some data" has curly braces so the solution highly depends on your input data.
I hope this will help you
Jerome Wagner
You can do something like this in Java:
String[] tests = {
"{ in in in } out", // "{ in in in }"
"out { in in in }", // "{ in in in }"
" { in } ", // "{ in }"
"pre { in1 } between { in2 } post", // "{ in1 }{ in2 }"
};
for (String test : tests) {
System.out.println(test.replaceAll("(?<=^|\\})[^{]+", ""));
}
The regex is:
(?<=^|\})[^{]+
Basically we match any string that is "outside", defined as something that follows a literal } or starts at the beginning of the string (^) and runs until it reaches a literal {, i.e. we match [^{]+. We replace these matched "outside" strings with an empty string.
See also
regular-expressions.info/Lookarounds
(?<=...) is a positive lookbehind
A non-regex Javascript solution, for nestable but single top-level {...}
Depending on the problem specification (it isn't exactly clear), you can also do something like this:
var s = "pre { { { } } } post";
s = s.substring(s.indexOf("{"), s.lastIndexOf("}") + 1);
This does exactly what it says: given an arbitrary string s, it takes its substring starting from the first { to the last } (inclusive).
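When there may be several top-level {...} groups that can themselves contain nested braces, a small depth counter is one way to go. The following is a sketch only: the function name `extractTopLevelBraces` is hypothetical, and it assumes that braces inside quoted string values do not need special treatment.

```javascript
// Hypothetical helper: returns every balanced top-level {...} group.
function extractTopLevelBraces(input) {
  const groups = [];
  let depth = 0;
  let start = -1;
  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (ch === "{") {
      if (depth === 0) start = i; // remember where the group begins
      depth++;
    } else if (ch === "}" && depth > 0) {
      depth--;
      if (depth === 0) groups.push(input.slice(start, i + 1));
    }
  }
  return groups;
}

const flat = extractTopLevelBraces("pre { in1 } between { a { b } } post");
const nested = extractTopLevelBraces(
  "|loader|1|2|3|4|5|6|7|8|9|{'data':[{'data':'some data'}],}"
);
console.log(flat);
console.log(nested);
```

Unlike the indexOf/lastIndexOf approach, this keeps separate top-level groups apart instead of returning everything between the first { and the last }.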
For those searching for a PHP solution, only this one worked for me:
preg_replace("/.*({.*}).*/","$1",$input);
How can I disallow these characters:
\ / " ' [ ] { } | ~ ` ^ &
using a JavaScript regular expression pattern?
Check whether a string contains one of these characters:
if(str.match(/[\\\/"'\[\]{}|~`^&]/)){
alert('not valid');
}
Validate a whole string, start to end:
if(str.match(/^[^\\\/"'\[\]{}|~`^&]*$/)){
alert('it is ok.');
}
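If the user should also be told which characters were rejected, the same character class can be reused with the g flag. This is a small sketch; the names `FORBIDDEN` and `findForbidden` are made up for the example.

```javascript
// The same character class as above, with the g flag to find every occurrence.
const FORBIDDEN = /[\\\/"'\[\]{}|~`^&]/g;

// Hypothetical helper: returns the distinct forbidden characters found,
// in order of first appearance (empty array means the string is fine).
function findForbidden(str) {
  return [...new Set(str.match(FORBIDDEN) || [])];
}

const offending = findForbidden("a/b{c}/d");
const clean = findForbidden("hello world");
console.log(offending);
console.log(clean);
```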
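To specifically exclude just those characters (prefix each with a backslash inside the character class), test for any of them and negate the result:
const isNotSpecial = !/[\\\/\"\'\[\]\{\}\|\~\`\^\&]/.test(myvalue);
To generally exclude all special characters, require that the whole string consists of word characters (note that this also rejects spaces):
const isNotSpecial = /^\w*$/.test(myvalue);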
Another solution is encoding these special characters into the regular expression format.
Using package regexp-coder
const { RegExpCoder } = require('regexp-coder');
console.log(RegExpCoder.encodeRegExp('a^\\.()[]?+*|$z'));
// Output: 'a\^\\\.\(\)\[\]\?\+\*\|\$z'