I need to make a JSON file from a whitespace-delimited txt file.
However:
1. the whitespaces are inconsistent in length and
2. some of the data of each "column" is missing.
A single row looks like this in the txt file:
5653 Phrakhtaes Phrakhtaes 34.56717 33.02724 L LCTY GB 05 0 32 Asia/Nicosia 2014-09
Ultimately, this data will go into Redis. But without some means of creating keys for each "column", I don't see how I can work with this data.
Please, I could really use the help!
Thanks in advance!
Simply split wherever there are one or more spaces between your data values:
var line = "5653 Phrakhtaes Phrakhtaes 34.56717 33.02724 L LCTY GB 05 0 32 Asia/Nicosia 2014-09";
console.log(line.split(/ +/));
As far as missing data goes, I'd recommend you just check the length of the array: if it's shorter than the number of expected columns, simply discard that row. The only other option is to loop through and judge which field may be missing (based on the string type, whether it's an integer, uppercase, etc.) if there are a variable number of spaces between data points.
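If you then need named keys for Redis, you can zip the split values against an array of column names. A minimal sketch, using made-up field names (swap in whatever your columns actually represent):

var fields = ['id', 'name', 'asciiName', 'lat', 'lon', 'featureClass', 'featureCode',
              'country', 'admin1', 'admin2', 'elevation', 'timezone', 'modified']; // placeholder names

var line = "5653 Phrakhtaes Phrakhtaes 34.56717 33.02724 L LCTY GB 05 0 32 Asia/Nicosia 2014-09";
var values = line.split(/ +/);

var record = {};
fields.forEach(function (key, i) {
  record[key] = values[i]; // undefined if that column is missing in this row
});

console.log(JSON.stringify(record));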
I've got a CSV file with 3 million+ rows.
The format is supposed to be like so:
date, name, num1, num2
e.g.
"2019-05-07, New york, 10, 3
2019-05-08, New york, 15, 5,
2019-05-09, New york, 12, 6"
and so on...
The problem is every 5,000 rows or so, the "Name" column will have commas in its value.
e.g.
2019-05-09, Denver, Colorado, 10, 9
My script then sees five columns instead of four and fails.
Some values in the name column even have 3 commas.
Note the Name column values are not enclosed in quotes, so that's why it's giving me the error.
Is there a way to detect these extra commas? I don't think there is, so I'm beginning to think this 3M+ row file is useless to even try to parse.
To parse, you can split into an array, then use shift and pop for the peripheral fields. Finally, you can just join on what's left:
let line = '2019-05-09, Denver, Colorado, 10, 9';
let entries = line.split(',');
let parsed = {
  date: entries.shift().trim(),   // first field
  num2: entries.pop().trim(),     // last field
  num1: entries.pop().trim(),     // second-to-last field
  name: entries.join(',').trim()  // whatever is left, embedded commas and all
};
console.log(parsed); // { date: '2019-05-09', num2: '9', num1: '10', name: 'Denver, Colorado' }
So, to answer your question: No, your csv file is not unreadable, FOR NOW. If columns can be appended in the future, and such columns suffer the same issue as "name", you're in trouble. It's probably wiser to push back on the developer of the file and get them to properly quote it. You would not be out of line.
Well, nothing is impossible per se... you can, for example, work backwards and look for the first column (delimited by the first comma), the last two columns (by looking for the last 2 commas) and treat everything in between as the name. But you'll need to implement your own parsing function as I doubt a library would deal with invalid CSV like the one you have.
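For example, a single regular expression can express that "first field, last two fields, everything in between is the name" idea, assuming the date and the two numbers never contain commas themselves (just a sketch, not battle-tested against 3M rows):

let line = '2019-05-09, Denver, Colorado, 10, 9';
let match = line.match(/^([^,]+),\s*(.+),\s*([^,]+),\s*([^,]+)$/);
let parsed = {
  date: match[1].trim(),
  name: match[2].trim(),  // keeps any embedded commas: "Denver, Colorado"
  num1: match[3].trim(),
  num2: match[4].trim()
};
console.log(parsed);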
It's not very efficient, but if the column in question is always cities and states, you could always do a find/replace for any states in the file before running your script (e.g., find ", Colorado" and replace it with " Colorado").
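If you do go the find/replace route, you could script it rather than doing it by hand. A rough sketch, assuming you know which state names actually appear in the data:

var contents = '2019-05-09, Denver, Colorado, 10, 9'; // imagine the whole file here
var states = ['Colorado', 'Texas', 'Ohio'];           // whichever states occur in your data
var pattern = new RegExp(', (' + states.join('|') + '),', 'g');
console.log(contents.replace(pattern, ' $1,'));       // "2019-05-09, Denver Colorado, 10, 9"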
I would like to get the data depicted on the sentiment value line chart:
http://sentdex.com/financial-analysis/?i=TWTR&tf=7d
Looking for answers, I went through:
Web scraping data from an interactive chart, which seems to be very similar to my case.
Also went through:
Scraping graph data from a website using Python
This is my last attempt:
import re
svg_string = "M 364.5 53 L 364.5 171.35000000000002 M 364.5 184.5 L 364.5 302.85 M 364.5 184.5 L 364.5 302.85"
print repr(svg_string)
data = [map(float, xy.split(',')) for xy in re.split('[ML]', svg_string)[1:]]
print data
I am facing at least 3 issues:
The first one is that the data for svg_string represents coordinates vs. real values so I am not sure how to access the interesting data.
The second is that even when I play with this code I am getting
ValueError: invalid literal for float(): 364.5 53
And last, the string for svg_string does not even represent the graph properly (I cannot find the right code).
How do I extract the values?
Thank you in advance.
It's hard to know exactly what you're after overall, but the ValueError you are getting is because your data is not exactly the same as the other question you referenced. You have spaces in your data where the other question had commas.
To alleviate the ValueError change:
data = [map(float, xy.split(',')) for xy in re.split('[ML]', svg_string)[1:]]
to:
data = [map(float, xy.split()) for xy in re.split('[ML]', svg_string)[1:]]
Hopefully this gets you onto the next step.
Edit:
Ok so I looked at the page again, and the data is literally just in a js variable that you can grab from the response. The variable name is 'series' so you either need to do some parsing yourself to grab the data or find a library to work with (e.g. BeautifulSoup, etc.).
So I know how to format a string or integer like 2000 to 2K, but how do I reverse it?
I want to do something like:
var string = "$2K".replace("/* K with 000 and remove $ symbol in front of 2 */");
How do I start? I am not very good with regular expressions, but I have been taking some more time out to learn them. If you can help, I certainly appreciate it. Is it possible to do the same thing for M for millions (adding 000000 at the end) or B for billions (adding 000000000 at the end)?
var string = "$2K".replace(/\$(\d+)K/, "$1000");
will give output as
2000
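If you also need M and B, one option is to capture the suffix and look it up in a multiplier map. A small sketch, assuming the input always looks like $<digits><suffix>:

var multipliers = { K: 1e3, M: 1e6, B: 1e9 };

function expand(str) {
  var match = str.match(/^\$(\d+(?:\.\d+)?)([KMB])$/);
  if (!match) return null; // not in the expected format
  return parseFloat(match[1]) * multipliers[match[2]];
}

console.log(expand('$2K')); // 2000
console.log(expand('$3M')); // 3000000
console.log(expand('$5B')); // 5000000000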
I'm going to take a different approach to this, as the best way to handle it is to change your app so it doesn't lose the original numeric information in the first place. I recognize that this isn't always possible (for example, if you're scraping formatted values...), but it could be a useful way to think about it for other users with a similar question.
Instead of just storing the numeric values or the display values (and then trying to convert back to the numeric values later on), try to update your app to store both in the same object:
var value = {numeric: 2000, display: '2K'}
console.log(value.numeric); // 2000
console.log(value.display); // 2K
The example here is a bit simplified, but if you pass around your values like this, you don't need to convert back in the first place. It also allows you to have your formatted values change based on locale, currency, or rounding, and you don't lose the precision of your original values.
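As a side note, if you only store the numeric value, modern browsers can produce the 2K-style display string for you on demand via Intl.NumberFormat's compact notation, so there is nothing to convert back:

var compact = new Intl.NumberFormat('en', { notation: 'compact' });
console.log(compact.format(2000));    // "2K"
console.log(compact.format(3500000)); // "3.5M"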
Since there is currently no universal way to read live data from an audio track in JavaScript, I'm using a small library/API to read volume data from a text file that I converted from an MP3 offline.
The string looks like this
!!!!!!!!!!!!!!!!!!!!!!!!!!###"~{~||ysvgfiw`gXg}i}|mbnTaac[Wb~v|xqsfSeYiV`R
][\Z^RdZ\XX`Ihb\O`3Z1W*I'D'H&J&J'O&M&O%O&I&M&S&R&R%U&W&T&V&m%\%n%[%Y%I&O'P'G
'L(V'X&I'F(O&a&h'[&W'P&C'](I&R&Y'\)\'Y'G(O'X'b'f&N&S&U'N&P&J'N)O'R)K'T(f|`|d
//etc...
and the idea is basically that at a given point in the song the Unicode number of the character at the corresponding point in the text file yields a nominal value to represent volume.
The library translates the data (in this case, a stereo track) with the following (simplified here):
getVolume = function(sampleIndex, o) {
  // even characters hold the left channel, odd characters the right
  o.left = Math.min(1, (this.data.charCodeAt(sampleIndex*2|0) - 33) / 93);
  o.right = Math.min(1, (this.data.charCodeAt(sampleIndex*2+1|0) - 33) / 93);
}
I'd like some insight into how the file was encoded in the first place, and how I'm making use of it here.
What is the significance of 93 and 33?
What is the purpose of the bitwise |?
Is this a common means of porting information (ie, does it have a name), or is there a better way to do it?
It looks like the range of the characters in that file is from ! to ~. ! has an ASCII code of 33 and ~ has an ASCII code of 126; 126 - 33 = 93.
So 33 and 93 are used for normalizing values between ! and ~ into the 0..1 range.
var data = '!';
Math.min(1,(data.charCodeAt(0*2)-33)/93); // will yield 0
var data = '~';
Math.min(1,(data.charCodeAt(0*2)-33)/93); // will yield 1
var data = '"';
Math.min(1,(data.charCodeAt(0*2)-33)/93); // will yield 0.010752688172043012
var data = '#';
Math.min(1,(data.charCodeAt(0*2)-33)/93); // will yield 0.021505376344086023
// ... and so on
The |0 is there because sampleIndex*2 or sampleIndex*2+1 will yield a non-integer value when passed a non-integer sampleIndex. |0 truncates the decimal part, just in case someone sends in an incorrectly formatted (i.e. non-integer) sampleIndex.
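To illustrate the encoding side (this is an assumption about how the file was produced, not the library's actual code), each 0..1 volume sample would simply be mapped to one printable character in that range:

// assumed encoder: scale a 0..1 volume onto the 93 steps from '!' (33) to '~' (126)
function encodeSample(volume) {
  var clamped = Math.min(1, Math.max(0, volume));
  return String.fromCharCode(33 + Math.round(clamped * 93));
}

console.log(encodeSample(0));   // "!"
console.log(encodeSample(1));   // "~"
console.log(encodeSample(0.5)); // "P"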
Doing a bitwise OR with zero will truncate the number on the left-hand side to an integer. Not sure about the rest of your question though, sorry.
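A quick illustration of that truncation:

console.log(7.9 | 0);  // 7
console.log(-7.9 | 0); // -7 (truncates toward zero, unlike Math.floor)
console.log(12 | 0);   // 12 (integers pass through unchanged)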
33 is the ASCII code (not Unicode) for the character "!", and 93 is the size of the printable range from "!" (33) up to "~" (126). Hope that helps a bit.
This will help you forever:
http://www.asciitable.com/
ASCII codes for everything.
Enjoy!
First and foremost: JSON and XML are not an option in this specific case, please don't suggest them. If this makes it easier to accept that fact, imagine that I intend to reinvent the wheel for self-education.
Back to the point:
I need to design a binary-safe data format to encode some datagrams I send to a particular dumb server that I'm writing (in C, if that matters).
To simplify the question, let's say that I'm sending only numbers, strings and arrays.
Important fact: the server does not (and should not) know anything about Unicode and the like. It treats all strings as binary blobs (and never looks inside them).
The format that I originally devised is as follows:
Datagram: <Number:size>\n<Value1>...<ValueN>
Value:
    Number: N\n<Value>\n
    String: S\n<Number:size-in-bytes>\n<bytes>\n
    Array:  A\n<Number:size>\n<Value0>...<ValueN>
Example:
[ 1, "foo", [] ]
Serializes as follows:
1 ; number of items in datagram
A ; -- array --
3 ; number of items in array
N ; -- number --
1 ; number value
S ; -- string --
3 ; string size in bytes
foo ; string bytes
A ; -- array --
0 ; number of items in array
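A rough JavaScript serializer for this format might look something like the sketch below. Note that the string case naively uses .length as the byte count, which is exactly where the problem described next comes in:

// sketch only: handles numbers, ASCII strings and arrays
function serializeValue(value) {
  if (typeof value === 'number') {
    return 'N\n' + value + '\n';
  }
  if (typeof value === 'string') {
    return 'S\n' + value.length + '\n' + value + '\n'; // .length is NOT the byte size for non-ASCII
  }
  if (Array.isArray(value)) {
    return 'A\n' + value.length + '\n' + value.map(serializeValue).join('');
  }
  throw new Error('unsupported type');
}

function serializeDatagram(values) {
  return values.length + '\n' + values.map(serializeValue).join('');
}

console.log(serializeDatagram([[1, 'foo', []]]));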
The problem is that I cannot reliably get a string's size in bytes in JavaScript.
So, the question is: how to change the format, so a string can be both saved in JS and loaded in C neatly.
I do not want to add Unicode support to the server.
And I do not quite want to decode strings on the server (say, from base64, or simply unescaping \xNN sequences), as this would require working with dynamic string buffers, which, given how dumb the server is, is not so desirable...
Any clues?
It seems that reading UTF-8 in plain C is not that scary after all. So I'm extending the protocol to handle UTF-8 strings natively. (But will appreciate an answer to this question as it stands.)
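For what it's worth, there is a reliable way to get the UTF-8 bytes (and byte length) of a string in JavaScript: TextEncoder, available in modern browsers and Node. A minimal example:

var encoder = new TextEncoder();      // always encodes to UTF-8
var bytes = encoder.encode('héllo');  // Uint8Array of the UTF-8 bytes
console.log(bytes.length);            // 6 -- 'é' takes two bytes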