Parse incorrect JSON - javascript

I use JSON to send data over a WebSocket. Sometimes the WebSocket receives several messages as one, and event.data looks like:
{"message1":"message1"}{"message2":"message2"}
so I can't parse it with JSON.parse. How can I handle this problem?

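Here's an example of an auto-recovering JSON parser, which you can use to parse concatenated JSON values. It relies on the "position N" text in the SyntaxError message, which V8 (Chrome, Node.js) provides but other engines may not:
function *multiJson(str) {
  while (str) {
    try {
      yield JSON.parse(str);
      str = '';
    } catch(e) {
      // V8 reports where parsing failed, e.g. "... at position 23"
      var m = String(e).match(/position\s+(\d+)/);
      // No usable position: the input is genuinely malformed, so give up
      if (!m || +m[1] === 0) throw e;
      yield JSON.parse(str.slice(0, +m[1]));
      str = str.slice(+m[1]);
    }
  }
}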
//
let test = '{"message1":"message1"}{"message2":{"nested":"hi}{there"}}"third"[4,5,6]';
for (let x of multiJson(test))
console.log(x)
Basically, if there's a syntax error at position n, it tries to parse out everything before n and what's after it.

If you cannot fix it on the sending side and it always looks like this, then you might try to fix it by replacing every '}{' with '}\n{' and splitting on newlines to get an array of JSON strings. Note that replace() with a string pattern only replaces the first occurrence, so a global regex is needed:
var array = input.replace(/\}\{/g, '}\n{').split('\n');
Note that if your input contains newlines then you have to use another character or string:
var array = input.replace(/\}\{/g, '}==XXX=={').split('==XXX==');
But this relies on the fact that '}{' does not appear anywhere else in the input (for example inside a string value), which may not be true.
A more correct way, but harder, would be to count { and } that are not inside of strings, and when you get the same number of } as the number of { then split the string there.
What you would have to do is go character by character and keep track of whether you are inside quotes or not, make every { increment a counter, } decrement a counter and split your input whenever your counter hits zero.
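The counting approach described above could be sketched roughly as follows. The function name splitConcatenatedJson is made up for this example, and the sketch assumes every top-level value is an object or an array and that the input is otherwise well-formed; square brackets are counted as well so that top-level arrays work.

```javascript
// Split back-to-back JSON objects/arrays into separate JSON texts.
// Assumption: every top-level value is an object or an array.
function splitConcatenatedJson(input) {
  const chunks = [];
  let depth = 0;        // current nesting level of {} and []
  let inString = false; // true while between unescaped double quotes
  let escaped = false;  // true right after a backslash inside a string
  let start = 0;        // index where the current chunk begins

  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue; // braces inside strings are never counted
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === '{' || ch === '[') {
      if (depth === 0) start = i;
      depth++;
    } else if (ch === '}' || ch === ']') {
      depth--;
      // Back at the top level: one complete value ends here.
      if (depth === 0) chunks.push(input.slice(start, i + 1));
    }
  }
  return chunks;
}

const messages = splitConcatenatedJson('{"a":"x}{y"}{"b":{"c":[1,2]}}')
  .map(function (chunk) { return JSON.parse(chunk); });
console.log(messages.length); // 2
```

Because the scanner tracks string state and backslash escapes, a '}{' inside a string value does not cause a split.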
Another hacky way would be to try to split the string on every possible } and try to parse the substring as JSON and if it's valid then use it and remove from the input.

If you have any control over the API then I would strongly recommend that you have it fixed there. However, if you don't, then please read on.
I assume that looking for "}" is not really an option since you could have nested objects and the } character might be inside a string and so on.
A quick and easy way would be to try to parse the string starting with 1 character and adding characters one by one until the JSON parser no longer fails. That is when you will have your first chunk of data parsed.
Move the offset to the end of the successfully parsed data and repeat.
It might not be an elegant or very efficient solution, but then again you have a non-standard data format.
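A minimal sketch of that grow-the-prefix idea, under the assumption that every top-level value is an object or an array (so a candidate prefix only needs to be tried when it ends in } or ]). The name parseByGrowingPrefix is hypothetical, and the cost is quadratic in the worst case.

```javascript
// Repeatedly find the shortest prefix that parses as JSON, then move on.
// Assumption: top-level values are objects or arrays only.
function parseByGrowingPrefix(input) {
  input = input.trim();
  const values = [];
  let offset = 0;
  while (offset < input.length) {
    let found = false;
    for (let end = offset + 1; end <= input.length; end++) {
      const last = input[end - 1];
      // Only a closing brace or bracket can end an object or array.
      if (last !== '}' && last !== ']') continue;
      try {
        values.push(JSON.parse(input.slice(offset, end)));
        offset = end; // continue right after the parsed chunk
        found = true;
        break;
      } catch (e) {
        // Not a complete JSON text yet; keep extending the prefix.
      }
    }
    if (!found) {
      throw new SyntaxError('No complete JSON value at offset ' + offset);
    }
  }
  return values;
}

console.log(parseByGrowingPrefix('{"a":{"b":"}"}}[1,[2]]').length); // 2
```

Restricting attempts to positions that end in a closing brace or bracket avoids most of the failed JSON.parse calls without changing the result.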

Related

can String.prototype.split() differentiate between two instances of a character

I am working with rather large csv files (not mine, I cannot change the formatting of the files).
My script reads the files into a string, and then turns it into an array by using the .split() method, first to split the rows using "\n".
The delimiter for the values within a row is the comma (",").
The problem is that the csv file is written to include the commas inside some of the values like so:
Type,Class,Result\n
AA,SG26,27%\n
AC,DC747,17%\n
"FF,RF",R$%,89%\n
HE,RT,56%\n
My function treats them as separate values, since it depends on split() with the delimiter "," so it splits values like csv[2][Type] in this example into two.
I have tried using the replace function before splitting the string like so:
String.prototype.processCSV = function(delimiter = ","){
  var str;
  if(this.includes('"')){
    str = this.replace(/"\s,\s"/g, "");
  }
  //rest of the function
}
But I do not see any results of doing that.
Is there any way to differentiate between the commas in the values and the separating commas, or any better way to read csv into arrays (please note that the array is then mapped so I can access the values by keys)?
Thank you in advance.
Edit: I should add that the project is a static page that first loads the csv files into strings with an Ajax XMLHttpRequest, not in Node; due to project requirements I cannot set up a Node backend.
Don't just split on ,. That's not the correct way to handle CSV files. Use a real CSV parser.
There are lots of CSV parsers on npm. Here's an example using Papaparse (npm or official home page):
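var results = Papa.parse(csv, {
  header: true // use the first row as keys for each record
});
// The parsed rows are in results.data; results.errors and results.meta hold diagnostics
console.log(results.data[0].Type); // prints AA
console.log(results.data[0].Class); // prints SG26
console.log(results.data[0].Result); // prints 27%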
The only real way to "distinguish commas at different positions" is to parse the string, and process the characters between " differently.
const input = `"1,1","2,2"`;
let pos = 0;
while (pos < input.length) {
  switch (input[pos]) {
    case `,`:
      // handle comma
      break;
    case `\n`:
      // handle newline
      break;
    case `"`: {
      let end = input.indexOf(`"`, pos + 1);
      // an unterminated quote runs to the end of the input
      if (end === -1) end = input.length;
      // handle string: input.slice(pos + 1, end)
      // skip processing the string
      pos = end;
      break;
    }
  }
  pos += 1;
}
But instead of writing your own parser (which is a fun exercise though) it is probably a good idea to use an existing implementation instead.
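For readers who do want to try the exercise, here is a minimal sketch of a character-by-character CSV parser. The function names parseCsv and csvToObjects are made up for this example; it handles quoted fields, delimiters and newlines inside quotes, and doubled quotes as an escaped quote, but it is not a full replacement for a tested library.

```javascript
// Parse CSV text into an array of rows (each row is an array of strings).
function parseCsv(text, delimiter = ',') {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"'; // "" inside quotes is an escaped quote
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        field += ch; // delimiters and newlines are literal here
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      row.push(field);
      field = '';
    } else if (ch === '\n') {
      row.push(field);
      rows.push(row);
      row = [];
      field = '';
    } else if (ch !== '\r') {
      field += ch;
    }
  }
  // Flush the last row when the text does not end with a newline.
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Map each data row to an object keyed by the header row.
function csvToObjects(text) {
  const [header, ...rows] = parseCsv(text);
  return rows.map(function (r) {
    return Object.fromEntries(header.map(function (key, i) { return [key, r[i]]; }));
  });
}

const records = csvToObjects('Type,Class,Result\nAA,SG26,27%\n"FF,RF",RT,89%\n');
console.log(records[1].Type); // FF,RF
```

The header mapping mirrors the question's requirement of accessing values by key.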

How to compare two Strings and get Different part

I have two strings:
var str1 = "A10B1C101D11";
var str2 = "A1B22C101D110E1";
What I intend to do is to tell the difference between them, the result will look like
A10B1C101D11
A10 B22 C101 D110E1
Both follow the same pattern: one character followed by a number. If a character doesn't exist in the other string, or its number is different, I consider that part different and want to highlight it. Can a regular expression do this, or is there another good solution? Thanks in advance!
Let me start by stating that regexp might not be the best tool for this. As the strings have a simple format that you are aware of it will be faster and safer to parse the strings into tokens and then compare the tokens.
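The token-based approach could look something like the sketch below. The names tokenize and diffTokens are made up for this example, and it assumes each token is a single letter followed by digits and that a letter appears at most once per string.

```javascript
// Split "A10B1C101" into a Map of letter -> digit string, e.g. "A" -> "10".
// Assumption: one letter followed by one or more digits, each letter used once.
function tokenize(str) {
  const tokens = new Map();
  for (const [, letter, digits] of str.matchAll(/([A-Za-z])(\d+)/g)) {
    tokens.set(letter, digits);
  }
  return tokens;
}

// Return the tokens of `a` that are missing from `b` or carry a different number.
function diffTokens(a, b) {
  const tokensB = tokenize(b);
  const different = [];
  for (const [letter, digits] of tokenize(a)) {
    if (tokensB.get(letter) !== digits) different.push(letter + digits);
  }
  return different;
}

const str1 = 'A10B1C101D11';
const str2 = 'A1B22C101D110E1';
console.log(diffTokens(str2, str1)); // [ 'A1', 'B22', 'D110', 'E1' ]
console.log(diffTokens(str1, str2)); // [ 'A10', 'B1', 'D11' ]
```

Comparing whole tokens avoids the partial-match pitfalls of substring tests, such as D11 being a prefix of D110.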
However, you can do this with a regexp, although JavaScript engines that predate ES2018 lack lookbehind.
The way to do this is to use negative lookahead to prevent matches that are included in the other string. Without lookbehind the lookahead can only inspect text that comes after the match, so you need to run the search once in each direction.
We do this by concatenating the strings, with a delimiter that we can test for.
If using '|' as a delimiter the regexp becomes;
/(\D\d*)(?=(?:\||\D.*\|))(?!.*\|(.*\d)?\1(\D|$))/g
To find the tokens in the second string that are not present in the first, you do:
var bothstring=str2.concat("|",str1);
var re=/(\D\d*)(?=(?:\||\D.*\|))(?!.*\|(.*\d)?\1(\D|$))/g;
var match=re.exec(bothstring);
Subsequent calls to re.exec will return later matches, so you can iterate over them as in the following example:
while (match!=null){
  // match[0] is the matched token; match.index is where it starts
  alert("\""+match[0]+"\" At position "+match.index);
  match=re.exec(bothstring);
}
As stated this gives tokens in str2 that are different in str1. To get the tokens in str1 that are different use the same code but change the order of str1 and str2 when you concatenate the strings.
The above code might not be safe when dealing with potentially dirty input. In particular it might misbehave if fed a string like "A100|A100": the first A100 will not be reported as missing because the regexp is not aware that the source is supposed to be two different strings. If this is a potential issue, then search the inputs for occurrences of the delimiting character first.
You can break each string into an array
var aStr1 = str1.split('');
var aStr2 = str2.split('');
Then check which one has more characters, and save the smaller number
var totalCharacters;
if(aStr1.length > aStr2.length) {
totalCharacters = aStr2.length
} else {
totalCharacters = aStr1.length
}
And loop comparing both
var diff = [];
for(var i = 0; i<totalCharacters; i++) {
if(aStr1[i] != aStr2[i]) {
diff.push(aStr1[i]); // or something else
}
}
At the very end you can append the remaining characters from the longer string (since they obviously have no counterpart in the other one).
Does that help?

Efficient string parsing in JS: How to create a substring which does not allocate a new string

I have large messages coming over the websocket that I'd like to parse with a regex (for simplicity).
The regex recognizes the format of the header, and upon reading the length field, we then know where the next segment lies, and I can run the regex on that portion.
However, since my entire message might be huge (say... 10MB) and consisting of many many segments (say... 1000, where the average segment is a little under 1K in length), then naively slicing the main message to pass it back to re.exec() at the next location seems like it will result in a ton of GC thrashing, if not an allocation of gigabytes just for the raw string content.
I wonder if there are any regex related functions which allow me to specify the index to start running the regex at? exec and search don't let me do this.
ES6 defines a "sticky" flag (y) on RegExps, which lets you check whether a string matches the regexp at a specific position:
var position = 3;
var string = "la-la-la";
var re = /\d+/y;
re.lastIndex = position;
var match = re.exec(string);
//... do something with match
There is a discussion about this:
http://esdiscuss.org/topic/proposal-for-exact-matching-and-matching-at-a-position-in-regexp
I forgot that RegExp.prototype.exec handles this for you, so you just keep passing the original string in and it will only start searching starting at the place it last stopped.
This is not exactly ideal for me since it does a whole bunch of extra parsing than I need it to (it will parse the entire segments' contents), though. I think I can just "reach in" and push the lastIndex forward.
Yes, there is a way, but not through arguments of a function. Instead, you can utilize the .lastIndex property of the RegExp object (which needs to have the global or sticky flag set). The exec and test methods respect this value; String.prototype.match and replace reset it to 0 when the regex is global.
Your code might therefore look like this:
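var re = /header:…length:(\d+)/g;
// `message` stands for the full string received over the websocket
for (var m; m = re.exec(message); ) {
  var len = parseInt(m[1], 10);
  re.lastIndex += len; // skip the segment body so it is not scanned
  …
}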
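Combining the sticky flag with a manually advanced lastIndex might look like the sketch below. The wire format used here ("#<length>:" followed by that many characters of payload) and the name readSegments are purely hypothetical stand-ins, since the real header format is not shown. The function records offsets instead of slicing, so no substrings are allocated while walking the message.

```javascript
// Hypothetical format, for illustration only:
// each segment is "#<length>:" followed by <length> characters of payload.
function readSegments(message) {
  const header = /#(\d+):/y; // sticky: must match exactly at lastIndex
  const segments = [];
  let pos = 0;
  while (pos < message.length) {
    header.lastIndex = pos;
    const m = header.exec(message);
    if (m === null) {
      throw new SyntaxError('Bad header at offset ' + pos);
    }
    const len = parseInt(m[1], 10);
    const bodyStart = header.lastIndex; // exec moved it just past the header
    segments.push({ start: bodyStart, length: len });
    pos = bodyStart + len; // jump over the payload without scanning it
  }
  return segments;
}

const segments = readSegments('#5:hello#3:abc');
console.log(segments); // [ { start: 3, length: 5 }, { start: 11, length: 3 } ]
```

Unlike the global flag, the sticky flag makes a malformed header fail immediately at the expected position instead of silently matching somewhere later in the message.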

JS / RegEx to remove characters grouped within square braces

I hope I can explain myself clearly here and that this is not too much of a specific issue.
I am working on some javascript that needs to take a string, find instances of chars between square brackets, store any returned results and then remove them from the original string.
My code so far is as follows:
parseLine : function(raw)
{
  var results = [];
  var regex = /\[(.*?)]/g;
  var arr;
  while((arr = regex.exec(raw)) !== null)
  {
    console.log(" ", arr);
    results.push(arr[1]);
    raw = raw.replace(/\[(.*?)]/, "");
    console.log(" ", raw);
  }
  return {results:results, text:raw};
}
This seems to work in most cases. If I pass in the string [id1]It [someChar]found [a#]an [id2]excellent [aa]match then it returns all the chars from within the square brackets and the original string with the bracketed groups removed.
The problem arises when I use the string [id1]It [someChar]found [a#]a [aa]match.
It seems to fail when only a single letter (and space?) follows a bracketed group and starts missing groups as you can see in the log if you try it out. It also freaks out if i use groups back to back like [a][b] which I will need to do.
I'm guessing this is my RegEx - begged and borrowed from various posts here as I know nothing about it really - but I've had no luck fixing it and could use some help if anyone has any to offer. A fix would be great but more than that an explanation of what is actually going on behind the scenes would be awesome.
Thanks in advance all.
You could use the replace method with a function to simplify the code and run the regexp only once:
function parseLine(raw) {
var results = [];
var parsed = raw.replace(/\[(.*?)\]/g, function(match,capture) {
results.push(capture);
return '';
});
return { results : results, text : parsed };
}
The problem is due to the lastIndex property of the regex /\[(.*?)]/g; not resetting, since the regex is declared as global. When the regex has global flag g on, lastIndex property of RegExp is used to mark the position to start the next attempt to search for a match, and it is expected that the same string is fed to the RegExp.exec() function (explicitly, or implicitly via RegExp.test() for example) until no more match can be found. Either that, or you reset the lastIndex to 0 before feeding in a new input.
Since your code is reassigning the variable raw on every loop, you are using the wrong lastIndex to attempt the next match.
The problem will be solved when you remove g flag from your regex. Or you could use the solution proposed by Tibos where you supply a function to String.replace() function to do replacement and extract the capturing group at the same time.
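If you prefer to keep an exec loop, one way to avoid the stale lastIndex is to leave the string untouched while matching and remove the groups afterwards. This is only a sketch of that idea, written here as a standalone function rather than the object method from the question:

```javascript
function parseLine(raw) {
  const regex = /\[(.*?)\]/g;
  const results = [];
  let match;
  // Scan the unmodified string so lastIndex stays in step with it.
  while ((match = regex.exec(raw)) !== null) {
    results.push(match[1]);
  }
  // Remove all bracketed groups in one pass once matching is finished.
  return { results: results, text: raw.replace(regex, '') };
}

console.log(parseLine('[id1]It [someChar]found [a#]a [aa]match'));
// { results: [ 'id1', 'someChar', 'a#', 'aa' ], text: 'It found a match' }
```

Back-to-back groups such as [a][b] work as well, because each exec call resumes exactly where the previous match ended.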
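You can also escape the last bracket for clarity: \[(.*?)\]. This is required when the regex uses the u flag, but without that flag an unescaped ] outside a character class is treated as a literal character.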

Splitting string in javascript

How can I split the following string?
var str = "test":"abc","test1":"hello,hi","test2":"hello,hi,there";
If I use str.split(",") then I won't be able to get strings which contain commas.
What's the best way to split the above string?
I assume it's actually:
var str = '"test":"abc","test1":"hello,hi","test2":"hello,hi,there"';
because otherwise it wouldn't even be valid JavaScript.
If I had a string like this I would parse it as an incomplete JSON which it seems to be:
var obj = JSON.parse('{'+str+'}');
and then use it as a plain object:
alert(obj.test1); // says: hello,hi
Update 1: Looking at other answers I wonder whether it's only me who sees it as invalid JavaScript?
Update 2: Also, is it only me who sees it as a JSON without curly braces?
The input is not entirely clear, but here is what I can suggest:
str.split('","');
and then add the stripped double quotes back to each string.
str.split('","'); It is difficult to say given the formatting.
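If Zed is right, though, you can do this (assuming the string includes the opening and closing braces). Note that eval executes arbitrary code, so JSON.parse is the safer choice for untrusted input:
var obj = eval('(' + str + ')'); // the parentheses make eval treat the braces as an object literal
var test = obj.test; // returns abc
var test1 = obj.test1; // returns hello,hi
//etc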
That's a general problem in all languages: if the items you need contain the delimiter, it gets complicated.
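The simplest way would be to make sure the delimiter is unique. If you can't do that, you will probably have to iterate over the quoted strings manually, something like this:
var arr = [];
var result = text.match(/"([^"]*)"/g) || []; // match() returns null when nothing matches
for (var i = 0; i < result.length; i++) {
  arr.push(result[i]); // each entry still includes its surrounding quotes
}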
Iterate once over the string and replace each comma (,) that directly follows a double quote (") and is directly followed by a double quote (") with a percent sign (%), or some other character that is unlikely to occur in your strings. Then split on that character.
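A minimal sketch of that replace-then-split idea. The function name splitOnQuotedCommas and the default '%' separator are assumptions for this example; it only works when neither the separator nor the sequence "," occurs inside a value.

```javascript
// Replace the delimiting commas (those between a closing and an opening quote)
// with a separator that is assumed not to occur in the data, then split on it.
function splitOnQuotedCommas(str, separator = '%') {
  return str.replace(/","/g, '"' + separator + '"').split(separator);
}

const str = '"test":"abc","test1":"hello,hi","test2":"hello,hi,there"';
console.log(splitOnQuotedCommas(str));
// [ '"test":"abc"', '"test1":"hello,hi"', '"test2":"hello,hi,there"' ]
```

Commas inside the values are left alone because they are not surrounded by double quotes on both sides.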
