Parsing inconsistent data - javascript

Here's what the data's supposed to look like:
Some junk data
More junk data
1. fairly long key, all on one line
value: some other text with spaces and stuff
2. hey look! another long key. still on one line
value: a different value with some different information
There are several of these per file, usually between twenty and thirty. The total number of key-value pairs exceeds 20,000, so manually correcting each file is not an option. The number prefacing each key is supposed to increment properly. There is supposed to be a newline between a value and the following key. Each value should be prefaced with the string "value: ".
Right now, I go line by line and classify each line as either key, value, or junk. I then parse the number out of the key and store the number, key, and value in an object.
Issues arise when the data is improperly formatted. Here are a few issues I've encountered thus far:
no newline between the key and value.
an unexpected newline in the middle of the key or value, which results in the program viewing a portion of each key or value as junk data.
the word "value" being spelled wrong.
I handle the third scenario by computing the Levenshtein distance between the first six characters of each line and a master string "value:". How can I fix the other two issues?
If it matters, the parsing is happening on a node.js server, but I'm open to other languages if they can work with this inconsistent data more easily.

Take a look at this:
RegEx: ^(\d+)\. ?(.+?)(?:value|vlaue|balue|valie): ?(.+?)[\n\r]{2,}
Explained demo here: http://regex101.com/r/gG0wH8
If you have your 'misspelled value' issue fixed you can simplify it to:
^(\d+)\. ?(.+?)value: ?(.+?)[\n\r]{2,} otherwise add as many misspellings with a | in that RegEx part.
For this to work I hooked on:
line must start with digit(s) and a dot with an optional space
key is everything after the id and before the value
value ends after at least 2 line breaks
You should also remove the correct entries and then reexamine the file to check if anything else is missing.
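A minimal sketch of how this pattern could be applied in Node.js. It assumes the whole file has been read into one string, that the m and s flags are set (so ^ matches at each line start and . can cross a stray newline), and that line endings are plain \n (with \r\n, a single line break would already satisfy [\n\r]{2,}). The names ENTRY, collapse, and parseEntries are made up for illustration:

```javascript
// The pattern from above; g = all matches, m = ^ per line, s = dot matches newlines.
const ENTRY = /^(\d+)\. ?(.+?)(?:value|vlaue|balue|valie): ?(.+?)[\n\r]{2,}/gms;

// Fold stray newlines and repeated spaces inside a key or value into single spaces.
function collapse(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: returns one object per entry found in the file text.
function parseEntries(text) {
  const entries = [];
  for (const match of text.matchAll(ENTRY)) {
    entries.push({
      number: Number(match[1]),
      key: collapse(match[2]),
      value: collapse(match[3]),
    });
  }
  return entries;
}

// Illustrative input: a key broken across lines, a key and value on the
// same line, a misspelled "value", and a value broken across lines.
const sample =
  'Some junk data\n' +
  '1. fairly long key,\nall on one line value: some other text\n\n' +
  '2. hey look! another key\nvlaue: a different\nvalue here\n\n';

console.log(parseEntries(sample));
```

Comparing the parsed numbers against the expected 1, 2, 3... sequence afterwards is one way to spot entries the pattern skipped.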


Javascript: String comparison returns true although the strings are different (intermittent issue)

I'm processing the data in chunks using WritableStream. The decoded data is a JSON string, and if it starts with , I need to remove the comma.
But here's the problem: after the chunk is decoded to a string, I check the first character
const startsWithComma = chunk.at(0) === ','
and SOMETIMES it returns true although the chunk doesn't start with , and causes the JSON.parse to fail later on. See the attached image.
Things I tried:
used .at() alternatives like .charAt(), .startsWith(), chunk[0]
The issue is intermittent: sometimes the entire data set is processed, and sometimes it fails midway through.
So, expanding on my comment:
From your image, is it possible that the debugger is showing the state after the comma was already taken out? Is it also possible that the chunk begins with more than one comma, so that debugging at that exact spot would still show a comma?
The solution would be to take off the commas using a while loop such as
while( chunk.at(0)===',' ){
chunk = chunk.slice(1).trim();
}
I do not know why this is done only if isFirstChunk, so I'd leave that alone, but the loop above should solve your startsWithComma issue :D
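For what it's worth, here is the same loop wrapped in a function so it can be tried in isolation. This is only a sketch: stripLeadingCommas is a made-up name, and it uses startsWith instead of .at() purely so that it also runs on older Node versions.

```javascript
// Hypothetical helper: removes every leading comma (and surrounding whitespace)
// from a decoded chunk before it is handed to JSON.parse.
function stripLeadingCommas(chunk) {
  let result = chunk;
  while (result.startsWith(',')) {
    result = result.slice(1).trim();
  }
  return result;
}

// Illustrative chunks, not taken from the real stream.
console.log(JSON.parse(stripLeadingCommas(',,{"a":1}')));
console.log(JSON.parse(stripLeadingCommas('{"b":2}')));
```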

String from api response cant display break line, but hard code string can display break line

Good day,
I am using Angular 8 as my frontend, and java api service as my backend.
I need to retrieve a String from the backend, and the String will have \n in it, for example:
"Instructions:\n1. Key in 122<16 digit \nreload pin># to reload.\n2. Press SEND/CALL from
In my .ts file, I am setting this String value as follow:
this.str = this.apiRes.responseMsg1;
console.log("this.str : " + this.str);
This gives me Instructions:\n1. Key in 122<16 digit \nreload pin># to reload.\n2. Press SEND/CALL, so when I display it in HTML, it appears as a single line.
If I hard code this String to a String variable, for example:
this.str = "Instructions:\n1. Key in 122<16 digit \nreload pin># to reload.\n2. Press SEND/CALL from";
console.log("this.str : " + this.str);
It will give me :
Instructions:
1. Key in 122<16 digit
reload pin># to reload.
2. Press SEND/CALL from
Which is what I want.
I am not really familiar with Angular. I have tried searching Google for an answer, but can't find any related results.
May I know why this happens? And is there any way I can display the API message with its line breaks?
HTML ignores \n line breaks. You basically have two options to fix this:
Replace the line breaks with <br> elements.
Add a line of css white-space: pre-wrap; to your HTML element which is displaying your string.
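A rough sketch of the first option in plain JavaScript. It rests on an assumption drawn from the console output in the question: that the API string contains the two characters \ and n rather than a real line feed, so those are converted first. The function names are hypothetical, and the HTML escaping is there because the sample message contains < and >:

```javascript
// Assumption: the backend sends a literal backslash followed by "n" (or "r").
// Turn those two-character sequences into real line feeds.
function normalizeNewlines(text) {
  return text.replace(/\\r\\n|\\n|\\r/g, '\n');
}

// Escape characters that HTML would otherwise interpret as markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Option 1 from above: replace line breaks with <br> elements.
function toHtmlWithBreaks(text) {
  return escapeHtml(normalizeNewlines(text)).replace(/\n/g, '<br>');
}

// '\\n' in this source literal is a backslash plus "n", mimicking the API response.
const fromApi = 'Instructions:\\n1. Key in 122<16 digit';
console.log(toHtmlWithBreaks(fromApi));
```

For the second option, normalizeNewlines alone would be enough, since white-space: pre-wrap makes the browser honour real line feeds.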

Applescript with do Javascript and passed Applescript Variable

I have written a script that automates the creation of products on a website I administer. In my process I upload JPEG images of the products and pull the keywords that are tagged in the JPEG to add into the product information. In this process I use AppleScript to activate Safari and run a line of JavaScript. The line of code includes a variable that is derived from an AppleScript shell script call.
Code below
tell application "Finder"
set sourceFolder to folder POSIX file "/Users/<username>/Desktop/Upload/Temp/HighRes/"
set theFiles to files of sourceFolder
set inputPath to "/Users/<username>/Desktop/Upload/Temp/"
end tell
repeat with afile in theFiles
set filename to name of afile
set fname to text 1 thru ((offset of "." in filename) - 1) of filename
--INPUT CODE TO BE LOOPED OVER BELOW--
--Add Image Keywords from Metadata--
try
set pathVAR1 to "/Users/<username>/Desktop/Upload/Temp/HighRes/"
set pathVAR2 to pathVAR1 & filename
set myvar to do shell script "mdls -name kMDItemKeywords " & quoted form of pathVAR2
set var1 to ((offset of "(" in myvar) + 1)
set var2 to ((length of myvar) - 1)
set myKeywords to ((characters var1 thru var2 of myvar) as string)
--Inputs the Keywords from the Image Metadata--
tell application "Safari"
activate
do JavaScript "document.getElementById('ctl00_cphMainContent_txtKeyWords').value = \"" & myKeywords & "\";" in current tab of window 1
end tell
end try
--END OF CODE TO BE LOOPED OVER--
end repeat
==End Code==
Problem:
The code below is not passing the variable myKeywords to Safari, but if I run a dialog it will appear in the dialog.
do JavaScript "document.getElementById('ctl00_cphMainContent_txtKeyWords').value = \"" & myKeywords & "\";" in current tab of window 1
I don't have a specific solution that will definitely solve your problem, but I do have a number of observations about your script with recommendations on how it can be changed to improve its speed, robustness and adherence to principles of best practice.
Get rid of that try block. You have no idea what's happening in your script when things go wrong if you're masking the errors with unnecessary error-catching. The only line that needs to be enclosed in try...end try is do shell script, but only put it in once you know your code is working. In general, try blocks should only be used:
when your script has the potential to throw an error that is entirely predictable and explainable, and you understand the reasons why and under what conditions the error occurs, allowing you to implement an effective error-handling method;
around the fewest possible number of lines of code within which the error arises, leaving all lines of code whose existence doesn't depend on the result of the error-prone statement(s);
after your script has been written, tested, and debugged, where placing the try block(s) no longer serves to force a script to continue executing in the wake of an inconvenient error of unknown origin, but has a clear and well-defined function to perform in harmony with your code, and not against it.
As a general rule in AppleScript, don't use Finder to perform file system operations if you can avoid it: it's slow, and blocks while it's performing the operations, meaning you can't interact with the GUI during this time. Use System Events instead. It's a faceless application that won't stop other things operating when it's performing a task; it's fast, in the context of AppleScript and Finder in particular, and isn't prone to timing out quite so much as Finder does; it handles posix paths natively (including expansion of tildes), without any coercion necessary using POSIX file; it returns alias objects, which are the universal class of file object that every other scriptable application understands.
There are a couple of instances where Finder is still necessary. System Events cannot reveal a file; nor can it get you the currently selected files in Finder. But it's simple enough to have Finder retrieve the selection as an alias list, then switch to System Events to do the actual file handling on this list.
This is curious:
set filename to name of afile
set fname to text 1 thru ((offset of "." in filename) - 1) of filename
Am I right in thinking that fname is intending to hold just the base file name portion of the file name, and this operation is designed to strip off the extension ? It's a pretty good first attempt, and well done for using text here to itemise the components of the string rather than characters. But, it would, of course, end up chopping off a lot more than just the file extension if the file name had more than one "." in it, which isn't uncommon.
One way to safely castrate the end of the file name is to use text item delimiters:
set filename to the name of afile
set fname to the filename
set my text item delimiters to "."
if "." is in the filename then set fname to text items 1 thru -2 of the filename as text
You should then be mindful of resetting the text item delimiters afterwards or there'll be consequences later on when you try to concatenate strings together.
Another way of chopping off the extension without utilising text item delimiters is string scanning, which is where you iterate through the characters of a string, performing operations or tests as you go, until you achieve the desired outcome. It's speedier than it sounds and a powerful technique for very complex string searching and manipulations:
set filename to the name of afile
set fname to the filename
repeat while the last character of fname ≠ "."
set fname to text 1 thru -2 of fname
end
set fname to text 1 thru -2 of fname
You could also retrieve the name extension property of the file, get its length, and remove (1 + that) many characters from the end of the file's name. There are myriad ways to achieve the same outcome.
This is wrong in this particular instance:
set myKeywords to ((characters var1 thru var2 of myvar) as string)
characters produces a list, which you then have to concatenate back into a string, and this is unsafe if you aren't sure what the text item delimiters are set to. As you haven't made a reference to them in your script, they should be set to an empty string, which would result in the joining of the characters back into words producing the expected result. However, this could easily not be the case if, say, you performed the first technique of file extension castration and neglected to set the text item delimiters back; then the resulting string would have a period between every single letter.
As a policy in AppleScript (which you can personally choose to adhere to or ignore), it's considered by some as poor form if you perform list to string coercion operations without first setting the text item delimiters to a definitive value.
But you needn't do so here, because rather than using characters, use text:
set myKeywords to text var1 thru var2 of myvar
You're performing a shell command that looks like this: mdls -name kMDItemKeywords <file>, and then the two lines of AppleScript that follow awkwardly try and trim off the leading and trailing parentheses around the text representation of a bash array. Instead, you can turn on the -raw flag for mdls, which simplifies the output by stripping off the name of the key for you. This then places the parentheses as the very first and very last characters; however, since there's a load of dead whitespace in the output as well, you might as well get bash to perform all the clean up for you:
mdls -raw -name kMDItemContentTypeTree <file> | grep -E -io '[^()",[:blank:]]+'
This disregards parentheses, double quotes, commas, and whitespace, so all you get back is a list of keywords, one per line, and without any extra baggage. If you needed to itemise them, you can set a variable to the paragraphs of the output from the do shell script command, which splits the text into lines placing each keyword into a list. But it seems here that you need text and don't mind it being multilinear.
When I started to write this answer, I didn't have an inkling as to what was causing the specific issue that brought you here. Having gone through the details of how mdls formats its output, I now see the issue is that the myKeywords string will contain a bunch of double quotes, and you've surrounded the placement of the myKeywords entity in your JavaScript expression with double quotes. All of these quotes are escaped once in the AppleScript environment but not in the JavaScript environment, which results in each neighbouring double quote acting as an open-close pair. I ran a similar command in bash to obtain an array of values (kMDItemContentTypeTree), and then processed the text in the way AppleScript does, before opening the JavaScript console in my browser and pasting it to illustrate what's going on:
Anything in red is contained inside a string; everything else is therefore taken as a JavaScript identifier or object (or it would be if the messed up quotes didn't also mess up the syntax, and then result in an unterminated string that's still expecting one last quote to pair with).
I think the solution is to use a continuation character "\" for backward compatibility with older browsers: so you would need to have each line (except the last one) appended with a backslash, and you need to change the pair of double quotes surrounding the myKeywords value in your JavaScript expression to a pair of single quotes. In newer browsers, you can forgo the headache of appending continuation marks to each line and instead replace the pair of outside double quotes with a pair of backticks (`) instead:
❌'This line throws
an EOF error in
JavaScript';
✅'This line is \
processed successfully \
in JavaScript';
✅`This line is also
processed successfully
in JavaScript`;
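The quoting problem can be reproduced outside Safari. The sketch below only checks whether a generated statement parses, using new Function (which compiles the source without running it). The keyword text is a made-up sample shaped like the mdls output, and the element id kw is hypothetical. JSON.stringify is shown as one way to produce a safely escaped literal in JavaScript; AppleScript has no direct equivalent, so this illustrates the idea rather than offering a drop-in fix:

```javascript
// Made-up keyword text with embedded double quotes and real line breaks.
const myKeywords = 'Heart,\n"Heart Tree",\nall';

// Returns true if the source text compiles as a JavaScript function body.
function parses(source) {
  try {
    new Function(source); // compiled only, never called
    return true;
  } catch (err) {
    return false;
  }
}

// What the AppleScript concatenation effectively sends to the browser.
const naive = 'document.getElementById("kw").value = "' + myKeywords + '";';

// Backticks tolerate line breaks and double quotes (but not backticks or ${).
const withBackticks = 'document.getElementById("kw").value = `' + myKeywords + '`;';

// JSON.stringify escapes the quotes and line breaks inside the literal.
const escaped = 'document.getElementById("kw").value = ' + JSON.stringify(myKeywords) + ';';

console.log(parses(naive));         // false
console.log(parses(withBackticks)); // true
console.log(parses(escaped));       // true
```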
I had tried the backticks ( ` ) suggested by CJK but that did not work for me. The main issue being raised was that the kMDItemKeywords returned escaped characters.
Heart,
Studio,
Red,
\"RF126-10.tif\",
Tree,
\"Heart Tree\",
occasion,
Farm,
birds,
\"Red Farm Studio\",
\"all occasion\",
all
I was able to get rid of the escaped characters using the following:
NEW CODE
set myKeywords to do shell script "echo " & quoted form of myKeywords & " | tr -d '[:cntrl:]'| tr '[:upper:]' '[:lower:]' | tr -d '\"'"
UPDATED CODE FOR JAVASCRIPT
--Inputs the Keywords from the Image Metadata--
tell application "Safari"
activate
do JavaScript "document.getElementById('ctl00_cphMainContent_txtKeyWords').value = '" & myKeywords & "';" in current tab of window 1
end tell
RESULT
--> " heart, studio, red, rf126-10.tif, tree, heart tree, occasion, farm, birds, red farm studio, all occasion, all"

Weird symbol in string breaks JSON.parse, but seems to be undetectable?

The description field is a text area field, somehow a user ended up with some strange little symbol in it. (see image)
When I grab this from the server, I assemble my data from the objects I grab, which includes the description on this object, and turn it into JSON string, and send it to my javascript.
From javascript, I JSON.parse it. But that weird little symbol causes the parse to fail. But, when you look at it, there is no character there or anything, yet it throws an undefined character in JSON.parse.
My response from the server has the description like this:
"blahblahtesttext\r\nslkdjf",
There is nothing but the expected \r\n......
But it has an unexpected token where that symbol is.
{"value":"blah blah test text//Symbol should be here, but there is nothing and it forces it to the next line
\r\nslkdjf","fieldType":"TEXTAREA","field":"Description"}
Where that symbol forces the string to the next line, which causes the issue.
Because I can't see what the actual character is... I do not know how to handle this.
Is there something that can strip out invalid characters in a JSON string so the parse works? I don't want to just try/catch this as it would toss out everything, I just want that weird invalid symbol to be stripped out.
Or is there a way to see what the actual character is that JSON.parse does not like?

 <-- here is that symbol for copy pasting into a string if you want to try parsing it.
EDIT:
I found that it was doing this in Notepad++
Where you can see that where the line separator was, it is placing actual carriage return and line feed there, breaking the string. It already has \r\n\r\n for the two returns that were placed in the actual text area after that line separator character.
But still unsure of how to deal with this, as that carriage return and line feed do not appear in the string as '\n\r', there is no character representation of them, but instead it actually puts a return there and breaks the string.
NEW EDIT:
Finally found something to get this working. I couldn't do a replace on that line separator character. When I pulled it from my database, it came through as a hidden carriage return. When you manually pressed 'Enter' in the text area, the string I got from the database would actually put a '\r\n' there. But the line separator did not.
So, I added these three lines before parsing to ensure I was escaping any invalid new lines/carriage returns.
result = result.replace(/\r\n/g, '\\r\\n');
result = result.replace(/\r/g, '\\r');
result = result.replace(/\n/g, '\\n');
The '\r\n' that were actually in the string would correctly be escaped already, which tripped me up because I didn't have to worry about escaping those until someone tried introducing this line separator....
As Xufox says, that appears to be U+2028. JSON.parse shouldn't fail on it since U+2028 doesn't require escaping in JSON; Chrome's doesn't, but that's probably because it's implementing this stage 4 proposal Xufox pointed out:
const o = {prop: "testing\u2028one two three"};
console.log(JSON.parse(JSON.stringify(o)));
If you need to work around a JSON.parse implementation that doesn't handle it, you could do this:
str = str.replace(/\u2028/g, "\\u2028");
...before running JSON.parse on str.
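On the question of how to see what the actual character is: one option is to scan the string and print the code point of anything that is a control character or a Unicode line or paragraph separator. A small sketch (findSuspectCharacters is a made-up name, and the list of "suspect" characters is my own choice, not a JSON rule):

```javascript
// Reports characters that are invisible in most editors: C0 control
// characters (below U+0020) plus U+2028 and U+2029.
function findSuspectCharacters(str) {
  const found = [];
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code < 0x20 || code === 0x2028 || code === 0x2029) {
      found.push({
        index: i,
        codePoint: 'U+' + code.toString(16).toUpperCase().padStart(4, '0'),
      });
    }
  }
  return found;
}

// Illustrative string: a line separator followed later by a real CR LF pair.
console.log(findSuspectCharacters('ab\u2028c\r\n'));
```

Running this on the raw response text before JSON.parse shows the position and identity of each hidden character, which makes it easier to decide what to escape.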

Is AngularJS parsing the value incorrectly?

I have an extremely simple example here: http://jsfiddle.net/daylight/hqHSr/
To try it, just go to the fiddle and click the [Add Item] button to add some rows.
Next, click the check box next to any item and you'll see something similar to the following:
Problem: Only Displays Numeric Part
The problem is that the value should display the entire string shown in the row. In the example that means it should display: 86884-LLMUL.
Notice that it only displays the numeric part of the value.
If you look at the control you'll see that I'm using an input of type="text".
Also, if you look at the model (simpleItem) object you'll see that the one property it has is a string.
The JavaScript for the model class looks like:
function simpleItem(id) {
this.id = id;
}
My Attempt To Force to String Type
When I generate each of the simpleItems I even go so far as to set them to a character when I call the constructor (just to force the id to be set to a string type).
Here's the code that initializes each of the simpleItem ids:
currentItem.id = getRandom(100000).toString() + "-" + getRandomLetters(5).toUpperCase();
You can see the rest of the code in the fiddle, but the thing is I generate a random value and concatenate the value together with a hyphen and 5 letters. It's just a silly little piece of code for this sample.
But now, here is the part where it gets really odd.
If I simply change the hyphen - to another character like an uppercase X I get an error each time I click on the checkbox.
Here's the changed code and the new output, which you can see at the revised fiddle: fiddle version 2
currentItem.id = getRandom(100000).toString() + "X" + getRandomLetters(5).toUpperCase();
Also, now if you open Dev Tools in your browser you'll see in the console that Angular is now reporting an error each time you click the [Add Item] button. It looks like:
Adding Single-quotes "Fixes" It
If you go up to the HTML and alter the following line from this:
ng-init="itemId ={{l.id.toString()}}"
to this
ng-init="itemId ='{{l.id.toString()}}'"
Now when you run it, the error will go away and it will work as you can see at the updated fiddle here: fiddle Version 3
Angular : Converts Hyphen to Minus Sign?
You see, Angular seems to be converting it to a numeric, attempting to do math on it (parsing the hyphen as a minus sign) and then truncating the string portion. That all seems to work when I use a hyphen, but when I use a X then Angular chokes.
Here's what it looks like when you add the single-quotes - of course the angular errors in Dev Tools console go away too.
Angular Forces to Numeric Type?
Why would this occur in Angular? It's as if it is attempting to force my string value to a numeric even though the INPUT element is type text and the JavaScript var is type string.
Anyone have ideas about this?
What About the Asterisk (multiplication symbol)?
Right as I was completing this I wondered what would happen if I changed the - to a * and ran it again. This time I saw the error below, which is indicative that something is attempting to convert to numeric.
This is the expected behavior. Angular is merely interpolating the text you have in your scope into the ng-init expression and then executing that expression using scope.$eval. This has very little to do with the type of the input box or the rest of the surrounding context.
It is definitely not desirable that Angular should wrap any interpolation it does in quotes, it'll break its use in all other places such as class="my-class {{dynamic-class}}".
Replace your ng-init with
ng-init="itemId =l.id.toString()"
In keeping with the docs, you should only use ngInit in special circumstances anyway; rely on your controller for this. http://docs.angularjs.org/api/ng.directive:ngInit
I think we're just getting confused with Angular's weirdness. Basically, you're giving angular a string which it's turning into a javascript expression because it's in a {{}}. It's already, explicitly, a string (between the double-quotes):
ng-init="itemId ={{l.id.toString()}}"
It's apparently ignoring the fact that you're saying "hey, no really, this is a real string" with your l.id.toString(). It doesn't care. It's already a string and is going to evaluate it.
Just use the single quotes?
If you ng-init itemId={{undefined===undefined}}, what would you expect to happen? (it prints "true" in the alert).
Same with this: (undefined === undefined is in quotes) ng-init itemId={{'undefined===undefined'}}; prints true in the alert.
ng-init expects an angular expression. You don't have to use curly brackets there. You can simply write it like this:
ng-init="itemId=l.id" ng-click="checkBoxClicked(itemId)"
