Merging millions of records using Node.js - JavaScript

I need some help/tips.
I have a huge amount of JSON data that needs to be merged, sorted, and filtered. Right now it's split across different folders, almost 2 GB of JSON files in total.
What I'm doing right now is:
1. reading all files inside each folder
2. appending the parsed JSON data to an array variable inside my script
3. sorting the array
4. filtering it
5. saving it to one file
I'm rethinking whether, instead of appending the parsed data to a variable, I should store it in a file instead. What do you think?
Which approach is better when dealing with this kind of situation?
By the way, I'm running into a "JavaScript heap out of memory" error.

You could use some kind of database, e.g. MySQL with the MEMORY table engine, so the data is held in RAM only, is blazingly quick, and is erased after a reboot (you should truncate the table after the operation anyway, since it's all temporary). Once the data is in a table, it's easy to filter/sort the required bits and fetch the data incrementally, say 1000 rows at a time, parsing it as needed. You won't have to hold 2 GB of data inside JS.
2 GB of data will probably block your JS thread during loops and freeze your app anyway.
If you use a file for the temporary data to avoid a database, I recommend using a temporary disk mounted in RAM, which gives much better I/O speed.
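A rough sketch of that idea in Node.js, assuming the mysql2 driver, that each JSON file holds an array of records, that records carry a ts field to sort and filter on, and hypothetical folder names:

```js
// Sketch only: stream parsed records into a MySQL MEMORY table, then read
// them back sorted/filtered in 1000-row pages. Folder names, the `ts` field
// and the filter value are assumptions.
const fs = require('fs');
const path = require('path');
const mysql = require('mysql2/promise');

async function mergeViaMemoryTable(folders) {
  const conn = await mysql.createConnection({ host: 'localhost', user: 'root', database: 'tmpdb' });
  await conn.query(`CREATE TABLE IF NOT EXISTS records (
    ts BIGINT, payload VARCHAR(2048), KEY ix_ts (ts)
  ) ENGINE=MEMORY`);

  // Insert file by file so the JS heap never holds all 2 GB at once.
  for (const dir of folders) {
    for (const file of fs.readdirSync(dir)) {
      const rows = JSON.parse(fs.readFileSync(path.join(dir, file), 'utf8')); // assumes an array per file
      for (const r of rows) {
        await conn.query('INSERT INTO records (ts, payload) VALUES (?, ?)', [r.ts, JSON.stringify(r)]);
      }
    }
  }

  // Pull the sorted, filtered result back in chunks of 1000 rows.
  const out = fs.createWriteStream('merged.ndjson');
  for (let offset = 0; ; offset += 1000) {
    const [rows] = await conn.query(
      'SELECT payload FROM records WHERE ts >= ? ORDER BY ts LIMIT 1000 OFFSET ?', [0, offset]);
    if (rows.length === 0) break;
    for (const r of rows) out.write(r.payload + '\n');
  }
  out.end();

  await conn.query('TRUNCATE TABLE records');  // it's all temporary anyway
  await conn.end();
}

mergeViaMemoryTable(['folderA', 'folderB']).catch(console.error);
```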

Related

Why is serialisation necessary for sending data?

I've been reading up on JSON and serialization. From what I understand, JSON is a format often used for transferring data over a network, e.g. from/to a web server, or for storing data on disk.
The data could be strings, numbers, objects, etc. I haven't found a clear explanation of why serialization is needed. For instance, when sending a string to a web server or saving it to disk, isn't the string already stored as a series of bits and bytes by the computer? Isn't that the most basic form of the data? So why can't these be sent/stored as they are?
Why does it need to be stringified into JSON, i.e. serialised, which turns it into a string?
To be clear, I'm asking why it's needed, and for a simple, clear explanation of that.
Thanks
Broadly speaking, serialization does two important, mostly independent jobs:
1. it collects all the information into a single "chunk" (stream) of data that is self-contained, and
2. it turns all the information into an agreed-on format (usually optimized for either compactness or ease of parsing).
#1 is important because a single object with many properties and sub-objects can be spread all over the memory of a running program.
For example, a JavaScript runtime could have a dedicated memory pool for string constants. An object that uses some constant as a key would then just reference into that pool from its data structure. That means the object is no longer a single self-contained block in memory: it is spread out over multiple areas. This kind of spreading-out is actually the norm: objects don't usually contain complex data directly, and depending on the language, even "primitive" values such as numbers can be stored as references to another place in memory.
#2 is important mostly because the format used to quickly access data in memory might not be suitable for transfer (it might contain unnecessary redundancy, or memory pointers that don't make any sense when transferred to another computer, which partially ties into reason #1).
An example of that would be a map (or dictionary): the in-memory representation will usually involve multiple buckets that hold hashed values and some kind of collision-handling structure inside those buckets (a linked list or a tree, for example). That structure helps with efficient access to individual keys, but transferring it directly over the wire is pointless: it's very easy to rebuild, and there's no guarantee that the receiving end uses the exact same way to represent a map. So instead we just send each key and the associated value and let the receiving end reconstruct whatever data structures it needs for efficient access.
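To make the map example concrete, here is a tiny JavaScript illustration (not from the original answer): the hash buckets of a Map are never sent, only the key/value pairs.

```js
// A Map's internal buckets aren't serializable; JSON.stringify(m) yields "{}"
// because a Map has no enumerable own properties. So we send the entries:
const m = new Map([['a', 1], ['b', 2]]);
const wire = JSON.stringify([...m.entries()]);   // '[["a",1],["b",2]]'

// The receiver rebuilds whatever structure it wants for fast access:
const rebuilt = new Map(JSON.parse(wire));
console.log(rebuilt.get('b'));                   // 2
```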
The simple reason is that data can be stored differently in memory on different computers, or even by programs on the same computer written in different programming languages.
Serialization formats like JSON provide a defined way for exchanging data between computers or programs.
Not everyone knows how to parse or interpret those series of bits. Sometimes you need some general structure, some format, that can be passed around so that other people understand what it is you're trying to tell them.

Use web worker to stringify

So I have an app that needs to JSON.stringify its data to put into localStorage, but as the data gets larger, this operation gets outrageously expensive.
So, I tried moving this onto a web worker so it's off the main thread, but I'm now learning that posting an object to a web worker is even more expensive than stringifying it.
So I guess I'm asking, is there any way whatsoever to get JSON.stringify off the main thread, or at least make it less expensive?
I'm familiar with fast-json-stringify, but I don't think I can feasibly provide a complete schema every time...
You have correctly observed that passing an object to a web worker costs as much as serializing it. This is because web workers also need to receive serialized data, not native JS objects: object instances are bound to the JS thread they were created in.
The generic solution applies to many programming problems: choose the right data structures when working with large datasets. When data gets larger, it's better to sacrifice simplicity of access for performance. So do either of the following:
Store data in indexedDB
If your large object contains lists of the same kind of entry, use IndexedDB for reading and writing and you won't need to worry about serialization at all. This will require refactoring your code, but it is the correct solution for large datasets.
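A minimal sketch of that option, assuming the raw IndexedDB API; the database name ('appDB'), store name ('entries') and keyPath ('id') are made up for illustration:

```js
// Each entry becomes its own row; the browser structured-clones small values
// one at a time, so no single giant JSON.stringify is ever needed.
function openDB() {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open('appDB', 1);
    req.onupgradeneeded = () => req.result.createObjectStore('entries', { keyPath: 'id' });
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function saveEntries(entries) {
  const db = await openDB();
  const tx = db.transaction('entries', 'readwrite');
  const store = tx.objectStore('entries');
  for (const e of entries) store.put(e);
  return new Promise((resolve, reject) => {
    tx.oncomplete = resolve;
    tx.onerror = () => reject(tx.error);
  });
}

// saveEntries([{ id: 1, name: 'a' }, { id: 2, name: 'b' }]);
```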
Store data in ArrayBuffer
If your data is mostly fixed-size values, use an ArrayBuffer. An ArrayBuffer can be copied or moved to a web worker pretty much instantly, and if your entries are all the same size, serialization can be done in parallel. For access, you can write simple wrapper classes that translate your binary data into something more readable.
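A sketch of the ArrayBuffer route, assuming a fixed two-number layout per entry and a hypothetical worker.js file; the second argument to postMessage marks the buffer as transferable, so ownership moves to the worker without a copy:

```js
// Main thread: pack fixed-size entries into a typed array and transfer it.
const ENTRY_SIZE = 2;                          // assumed layout: [id, value]
const entries = [[1, 42], [2, 7], [3, 99]];    // hypothetical data

const packed = new Float64Array(entries.length * ENTRY_SIZE);
entries.forEach(([id, value], i) => {
  packed[i * ENTRY_SIZE] = id;
  packed[i * ENTRY_SIZE + 1] = value;
});

const worker = new Worker('worker.js');        // assumed worker file
worker.postMessage(packed.buffer, [packed.buffer]);   // zero-copy transfer

// worker.js would rebuild a view over the same bytes and do the heavy
// serialization off the main thread:
// self.onmessage = (e) => {
//   const view = new Float64Array(e.data);
//   const json = JSON.stringify(Array.from(view));
//   // ...write to storage from here...
// };
```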

How to order a huge (GB-sized) CSV file?

Background
I have a huge CSV file that has several million rows. Each row has a timestamp I can use to order it.
Naive approach
So, my first approach was obviously to read the entire thing into memory and then sort it. As you may guess, that didn't work very well....
Naive approach v2
My second try was to follow a bit the idea behind MapReduce.
So, I would slice this huge file into several parts and sort each one. Then I would combine all the parts into the final file.
The issue here is that part B may have a record that belongs in part A. So in the end, even though each part is sorted, I cannot guarantee the order of the final file....
Objective
My objective is to create a function that when given this huge unordered CSV file, can create an ordered CSV file with the same information.
Question
What are the popular solutions/algorithms for ordering data sets this big?
Since you've already concluded that the data is too large to sort/manipulate in the memory you have available, the popular solution is a database which will build disk-based structures for managing and sorting more data than can be in memory.
You can either build your own disk-based scheme or you can grab one that is already fully developed, optimized and maintained (e.g. a popular database). The "popular" solution that you asked about would be to use a database for managing/sorting large data sets. That's exactly what they're built for.
Database
You could set up a table indexed by your sort key, insert all the records into the database, then create a cursor sorted by that key and iterate the cursor, writing the now-sorted records to your new file one at a time. Then delete the database when you're done.
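A small sketch of that approach, assuming the better-sqlite3 package, a numeric timestamp as the sort key, and a couple of literal rows standing in for the streamed CSV:

```js
const fs = require('fs');
const Database = require('better-sqlite3');

// Disk-backed temporary database indexed by the sort key.
const db = new Database('sort.tmp.db');
db.exec('CREATE TABLE rows (ts INTEGER, line TEXT); CREATE INDEX ix_ts ON rows (ts)');

const insert = db.prepare('INSERT INTO rows (ts, line) VALUES (?, ?)');
insert.run(1526860800, '2018-05-21T00:00:00,b,2');   // in practice, stream the
insert.run(1526774400, '2018-05-20T00:00:00,a,1');   // CSV and insert row by row

// Iterate in key order (cursor-style, one row at a time) and write the output.
const out = fs.createWriteStream('sorted.csv');
for (const row of db.prepare('SELECT line FROM rows ORDER BY ts').iterate()) {
  out.write(row.line + '\n');
}
out.end();

db.close();
fs.unlinkSync('sort.tmp.db');                        // delete the database when done
```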
Chunked Memory Sort, Manual Merge
Alternatively, you could do a chunked sort: break the data into smaller pieces that fit in memory, sort each piece, and write each sorted block to disk. Then merge all the blocks: read the next record from each block into memory, find the lowest one across all the blocks, write it to your final output file, read the next record from that block, and repeat. Using this scheme, the merge only ever needs N records in memory at a time, where N is the number of sorted chunks you have (probably less memory than the chunked sorting phase itself needs).
As juvian mentioned, here's an overview of how an "external sort" like this could work: https://en.wikipedia.org/wiki/External_sorting.
One key aspect of the chunked memory sort is deciding how big to make the chunks. There are a number of strategies. The simplest may be to just decide how many records you can reliably fit and sort in memory, based on a few simple tests or even a guess you're sure is safe (picking a smaller number to process at a time just means you will split the data across more files). Then read that many records into memory, sort them, and write them out to a known filename. Repeat that process until you have read all the records and they all sit in temp files with known filenames on disk.
Then, open each file and read the first record from each one. Find the lowest of the records you just read, write it to your final file, read the next record from that file, and repeat the process. When you reach the end of a file, just remove it from the list of files you're comparing, since it's now done. When there is no more data, you're finished.
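Here is a hedged Node.js sketch of just this merge phase, assuming each chunk file is already sorted with one record per line and a key that compares correctly as a string:

```js
const fs = require('fs');
const readline = require('readline');

// K-way merge: keep only one "head" record per chunk in memory at a time.
async function mergeSortedChunks(chunkFiles, outFile) {
  const out = fs.createWriteStream(outFile);
  const readers = chunkFiles.map(f =>
    readline.createInterface({ input: fs.createReadStream(f) })[Symbol.asyncIterator]()
  );
  const heads = await Promise.all(readers.map(r => r.next()));

  while (heads.some(h => !h.done)) {
    // Pick the chunk whose current record has the lowest key.
    let min = -1;
    for (let i = 0; i < heads.length; i++) {
      if (heads[i].done) continue;
      if (min === -1 || heads[i].value < heads[min].value) min = i;
    }
    out.write(heads[min].value + '\n');
    heads[min] = await readers[min].next();   // advance only that chunk
  }
  out.end();
}

// mergeSortedChunks(['chunk-0.txt', 'chunk-1.txt'], 'sorted.txt');
```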
Sort Keys only in Memory
If all the sort keys themselves would fit in memory, but not the associated data, then you could make and sort your own index. There are many different ways to do that, but here's one scheme.
Read through the entire original data set, capturing two things in memory for every record: the sort key and the file offset in the original file where that record is stored. Then, once you have all the sort keys in memory, sort them. Finally, iterate through the sorted keys one by one, seeking to the right spot in the file, reading that record, writing it to the output file, and advancing to the next key, until the data for every key has been written in order.
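A sketch of that index-only scheme in Node.js, assuming Unix newline endings and a keyOf callback that extracts the sort key from a line; only the (key, offset, length) triples are held in memory:

```js
const fs = require('fs');
const readline = require('readline');

// Pass 1: record the byte offset and length of every line plus its sort key.
async function buildIndex(inFile, keyOf) {
  const rl = readline.createInterface({ input: fs.createReadStream(inFile) });
  const index = [];
  let offset = 0;
  for await (const line of rl) {
    const length = Buffer.byteLength(line, 'utf8');
    index.push({ key: keyOf(line), offset, length });
    offset += length + 1;                    // assumes '\n' line endings
  }
  return index.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// Pass 2: seek to each record in key order and copy it to the output.
function writeSorted(inFile, outFile, index) {
  const fd = fs.openSync(inFile, 'r');
  const out = fs.createWriteStream(outFile);
  for (const { offset, length } of index) {
    const buf = Buffer.alloc(length);
    fs.readSync(fd, buf, 0, length, offset);
    out.write(buf);
    out.write('\n');
  }
  out.end();
  fs.closeSync(fd);
}

// buildIndex('huge.csv', line => Number(line.split(',')[0]))
//   .then(index => writeSorted('huge.csv', 'sorted.csv', index));
```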
BTree Key Sort
If all the sort keys won't fit in memory, then you can get a disk-based BTree library that will let you sort things larger than can be in memory. You'd use the same scheme as above, but you'd be putting the sort key and file offset into a BTree.
Of course, it's only one step further to put the actual data itself from the file into the BTree and then you have a database.
I would read the entire file row by row and write each line into a temporary folder, grouping lines into files by a reasonable time interval (whether the interval should be a year, a day, an hour, etc. is for you to decide based on your data). The temporary folder would then contain one file per interval (for example, for a daily split: 2018-05-20.tmp, 2018-05-21.tmp, 2018-05-22.tmp, etc.). Now we can read the files in order, sort each one in memory, and append it to the target sorted file.
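A rough sketch of that bucketing idea, assuming an ISO timestamp in the first CSV column and a daily split; real code would batch the appends instead of calling appendFileSync per line:

```js
const fs = require('fs');
const readline = require('readline');

// Pass 1: route every line into a per-day temp file.
async function bucketByDay(inFile, tmpDir) {
  fs.mkdirSync(tmpDir, { recursive: true });
  const rl = readline.createInterface({ input: fs.createReadStream(inFile) });
  for await (const line of rl) {
    const day = line.split(',')[0].slice(0, 10);    // e.g. "2018-05-20"
    fs.appendFileSync(`${tmpDir}/${day}.tmp`, line + '\n');
  }
}

// Pass 2: each bucket fits in memory, so sort it there and append to the output.
function writeSortedBuckets(tmpDir, outFile) {
  const out = fs.createWriteStream(outFile);
  for (const f of fs.readdirSync(tmpDir).sort()) {  // days come out in order
    const lines = fs.readFileSync(`${tmpDir}/${f}`, 'utf8').trim().split('\n');
    lines.sort();                                   // ISO timestamps sort lexically
    out.write(lines.join('\n') + '\n');
  }
  out.end();
}

// bucketByDay('huge.csv', './buckets').then(() => writeSortedBuckets('./buckets', 'sorted.csv'));
```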

Best practice for using localStorage to store a large number of objects

Currently I'm experimenting with localStorage to store a large number of objects of the same type, and I'm getting a bit confused.
One way of thinking is to store all the objects in an array. But then, for each read/write of a single object, I need to deserialise/serialise the whole array.
The other way is to store each object directly under its own key in localStorage. This makes accessing each object much easier, but I'm worried about the number of objects that will be stored (tens of thousands). Also, getting all the objects will require iterating over the whole of localStorage!
I'm wondering which way is better in your experience? Also, would it be worthwhile to try a more sophisticated client-side database like PouchDB?
If you want something simple for storing a large amount of key/values, and you don't want to have to worry about the types, then I recommend LocalForage. You can store strings, numbers, arrays, objects, Blobs, whatever you want. It uses IndexedDB and WebSQL where available, so the storage limits are much higher than LocalStorage.
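For illustration, a tiny LocalForage sketch (the key names are made up); values round-trip as real objects, with no manual JSON.stringify:

```js
import localforage from 'localforage';

async function demo() {
  // One key per object: no need to rewrite a whole array for a single change.
  await localforage.setItem('item:42', { id: 42, name: 'example', tags: ['a', 'b'] });

  const item = await localforage.getItem('item:42');
  console.log(item.name);                       // "example"

  // Walk everything without building one giant array first.
  await localforage.iterate((value, key) => {
    console.log(key, value.id);
  });
}

demo();
```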
PouchDB works too, but the API is more complex, and it's better-suited for when you want to sync data with CouchDB on the server.
If you do not want to have a lot of keys, you can:
concatenate row JSONs with \n and store them under a single key
build and update one or more indexes, stored under separate keys, each mapping some key to a particular row number.
In this case, parsing rows is just a .split('\n'), which is roughly two orders of magnitude faster than JSON.parse.
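A tiny sketch of that layout (the key names and the id index are made up):

```js
// Store all rows as one '\n'-joined string plus a small index key.
const data = [{ id: 'a', v: 1 }, { id: 'b', v: 2 }];
const index = Object.fromEntries(data.map((r, i) => [r.id, i]));

localStorage.setItem('rows', data.map(r => JSON.stringify(r)).join('\n'));
localStorage.setItem('index:byId', JSON.stringify(index));

// Reading one row: split is cheap, and only the needed line gets JSON.parsed.
const lines = localStorage.getItem('rows').split('\n');
const rowNum = JSON.parse(localStorage.getItem('index:byId'))['b'];
const row = JSON.parse(lines[rowNum]);   // { id: 'b', v: 2 }
```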
Please note that you may need to make a special effort to synchronize simultaneously open tabs. This can be a challenge in complex cases.
localStorage has both good and bad parts.
Good parts:
synchronous;
extremely fast; both read and write are essentially a memcpy, giving 100+ MB/s throughput even on weak devices (for example, JSON.stringify is generally 5-20 times slower than localStorage.setItem);
thoroughly tested and reliable.
Bad parts:
no transactions, so you need some engineering effort to keep tabs in sync;
assume you have no more than 2 MB (because some systems impose that limit);
2 MB of storage actually means about 1M characters you can save, since strings are stored as UTF-16 (two bytes per character).
These points mark the limits of localStorage's applicability as a DB. LS is good for tasks where you need synchronicity and speed, and where you can trim your DB to fit into the quota.
So localStorage is good for caches and logs, but not much more.
I haven't personally used localStorage to manage that many elements.
However, the pattern I usually use to manage data is to load the complete database into a JavaScript object, work on it in memory during the process, and save it back to localStorage when the process is finished.
Of course, this pattern may not be a good fit for your needs, depending on your project's requirements.
If you need to save data constantly, data access could become a problem, so using some kind of small database is probably a better option.
If your data volume is exceptionally high, managing it in memory could also be a problem; however, depending on the data model, you may be able to organize it into efficient structures that let you load and save data only when it's needed.

Processing a large (12K+ rows) array in JavaScript

The project requirements are odd for this one, but I'm looking to get some insight...
I have a CSV file with about 12,000 rows of data and approximately 12-15 columns. I'm converting that to a JSON array and loading it via JSONP (it has to run client-side). It takes many seconds to do any kind of querying on the data set to return a smaller, filtered data set. I'm currently using JLINQ to do the filtering, but I'm essentially just looping through the array and returning a smaller set based on conditions.
Would webdb or indexeddb allow me to do this filtering significantly faster? Any tutorials/articles out there that you know of that tackles this particular type of issue?
http://square.github.com/crossfilter/ (no longer maintained, see https://github.com/crossfilter/crossfilter for a newer fork.)
Crossfilter is a JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records...
This reminds me of an article John Resig wrote about dictionary lookups (a real dictionary, not a programming construct).
http://ejohn.org/blog/dictionary-lookups-in-javascript/
He starts with server side implementations, and then works on a client side solution. It should give you some ideas for ways to improve what you are doing right now:
Caching
Local Storage
Memory Considerations
If you require loading an entire data object into memory before applying some transform to it, I would leave IndexedDB and WebSQL out of the mix, as they typically both add complexity and reduce the performance of apps.
For this type of filtering, a library like Crossfilter will go a long way.
Where IndexedDB and WebSQL can come into play in terms of filtering is when you don't need to load, or don't want to load, an entire dataset into memory. These databases are best utilized for their ability to index rows (WebSQL) and attributes (IndexedDB).
With in-browser databases, you can stream data into a database one record at a time and then cursor through it, one record at a time. The benefit for filtering is that this means you can leave your data on "disk" (a .leveldb in Chrome and a .sqlite database in Firefox) and filter out unnecessary records either as a pre-filtering step or as the filter itself.
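A small sketch of that cursor-based filtering with IndexedDB; the database handle, store name ('records') and predicate are assumptions:

```js
// Walk the store one record at a time, keeping only the rows that match,
// so the full dataset never has to sit in memory at once.
function filterRecords(db, predicate) {
  return new Promise((resolve, reject) => {
    const matches = [];
    const tx = db.transaction('records', 'readonly');
    const store = tx.objectStore('records');
    store.openCursor().onsuccess = (e) => {
      const cursor = e.target.result;
      if (!cursor) return resolve(matches);   // reached the end of the store
      if (predicate(cursor.value)) matches.push(cursor.value);
      cursor.continue();
    };
    tx.onerror = () => reject(tx.error);
  });
}

// filterRecords(db, r => r.amount > 100).then(rows => console.log(rows.length));
```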
