Background
I have a huge CSV file that has several million rows. Each row has a timestamp I can use to order it.
Naive approach
So, my first approach was obviously to read the entire file into memory and then sort it. As you may guess, that didn't work out well...
Naive approach v2
My second try was to follow a bit the idea behind MapReduce.
So, I would slice this huge file into several parts and sort each one. Then I would combine all the parts into the final file.
The issue here is that part B may contain a row that really belongs in part A. So in the end, even though each part is ordered, I cannot guarantee the order of the final file...
Objective
My objective is to create a function that, given this huge unordered CSV file, produces an ordered CSV file with the same information.
Question
What are the popular solutions/algorithms for ordering data sets this big?
Since you've already concluded that the data is too large to sort/manipulate in the memory you have available, the popular solution is a database, which builds disk-based structures for managing and sorting more data than can fit in memory.
You can either build your own disk-based scheme or you can grab one that is already fully developed, optimized and maintained (e.g. a popular database). The "popular" solution that you asked about would be to use a database for managing/sorting large data sets. That's exactly what they're built for.
Database
You could set up a table that is indexed by your sort key, insert all the records into the database, then create a cursor sorted by your key and iterate the cursor, writing the now-sorted records to your new file one at a time. Then, delete the database when done.
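As a minimal sketch of that idea in Node.js using the better-sqlite3 package (the column names, batch size, and file paths are assumptions for illustration, not anything from the question):

```javascript
// Sketch: sort a huge CSV through a temporary SQLite table indexed on the key.
// Assumes the first CSV column is a timestamp that sorts correctly as text
// (e.g. ISO 8601) and that there is no header row; names here are made up.
const fs = require('fs');
const readline = require('readline');
const Database = require('better-sqlite3');

async function sortCsvViaSqlite(inputPath, outputPath) {
  const db = new Database('sort-temp.db');
  db.exec('CREATE TABLE rows (ts TEXT, line TEXT)');
  db.exec('CREATE INDEX idx_ts ON rows (ts)');

  const insert = db.prepare('INSERT INTO rows (ts, line) VALUES (?, ?)');
  const insertMany = db.transaction((lines) => {
    for (const line of lines) insert.run(line.split(',')[0], line);
  });

  // Insert in batches so only one small batch is ever held in memory.
  const rl = readline.createInterface({ input: fs.createReadStream(inputPath) });
  let batch = [];
  for await (const line of rl) {
    batch.push(line);
    if (batch.length === 10000) { insertMany(batch); batch = []; }
  }
  if (batch.length) insertMany(batch);

  // "Cursor" through the rows in key order and write the sorted file.
  const out = fs.createWriteStream(outputPath);
  for (const row of db.prepare('SELECT line FROM rows ORDER BY ts').iterate()) {
    out.write(row.line + '\n');
  }
  out.end();
  db.close();
  fs.unlinkSync('sort-temp.db'); // delete the database when done
}
```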
Chunked Memory Sort, Manual Merge
Alternatively, you could do your chunked sort, where you break the data into smaller pieces that can fit in memory, sort each piece, and write each sorted block to disk. Then do a merge of all the blocks: read the next record from each block into memory, find the lowest one across all the blocks, write it out to your final output file, read the next record from that block, and repeat. Using this scheme, the merge only ever has to hold N records in memory at a time, where N is the number of sorted chunks (probably less memory than the chunked sort phase itself needed).
As juvian mentioned, here's an overview of how an "external sort" like this could work: https://en.wikipedia.org/wiki/External_sorting.
One key aspect of the chunked memory sort is determining how big to make the chunks. There are a number of strategies. The simplest may be to just decide how many records you can reliably fit and sort in memory, based on a few simple tests or even just a guess you're sure is safe (picking a smaller number per chunk just means you will split the data across more files). Then read that many records into memory, sort them, and write them out to a known filename. Repeat that process until all the records have been read and are now sitting in temp files with known filenames on disk.
Then, open each file, read the first record from each one, find the lowest record of each that you read in, write it out to your final file, read the next record from that file and repeat the process. When you get to the end of a file, just remove it from the list of data you're comparing since it's now done. When there is no more data, you're done.
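To make the two passes concrete, here is a rough Node.js sketch of this kind of external merge sort. It assumes the sort key is the first CSV column (e.g. an ISO timestamp), that there is no header row, and that the number of chunks is modest enough for a simple linear scan during the merge; the chunk size and file names are illustrative.

```javascript
// Sketch of the chunked sort + merge ("external merge sort") in Node.js.
const fs = require('fs');
const readline = require('readline');

const CHUNK_SIZE = 100000; // records per in-memory chunk; pick a size you know is safe
const key = (line) => line.split(',')[0];
const compare = (a, b) => (key(a) < key(b) ? -1 : key(a) > key(b) ? 1 : 0);

// Pass 1: read CHUNK_SIZE records at a time, sort them, write each sorted chunk out.
async function writeSortedChunks(inputPath) {
  const rl = readline.createInterface({ input: fs.createReadStream(inputPath) });
  const chunkFiles = [];
  let buffer = [];
  const flush = () => {
    buffer.sort(compare);
    const name = `chunk-${chunkFiles.length}.tmp`;
    fs.writeFileSync(name, buffer.join('\n') + '\n');
    chunkFiles.push(name);
    buffer = [];
  };
  for await (const line of rl) {
    buffer.push(line);
    if (buffer.length >= CHUNK_SIZE) flush();
  }
  if (buffer.length) flush();
  return chunkFiles;
}

// Pass 2: k-way merge. Only one "current" record per chunk is held in memory.
async function mergeChunks(chunkFiles, outputPath) {
  const iterators = chunkFiles.map((name) =>
    readline.createInterface({ input: fs.createReadStream(name) })[Symbol.asyncIterator]()
  );
  const heads = await Promise.all(iterators.map((it) => it.next()));
  const out = fs.createWriteStream(outputPath);

  while (heads.some((h) => !h.done)) {
    let min = -1; // linear scan for the lowest current record (fine for a modest chunk count)
    for (let i = 0; i < heads.length; i++) {
      if (heads[i].done) continue;
      if (min === -1 || key(heads[i].value) < key(heads[min].value)) min = i;
    }
    out.write(heads[min].value + '\n');
    heads[min] = await iterators[min].next(); // advance only the chunk we just consumed
  }
  out.end();
  chunkFiles.forEach((name) => fs.unlinkSync(name));
}

async function externalSort(inputPath, outputPath) {
  const chunkFiles = await writeSortedChunks(inputPath);
  await mergeChunks(chunkFiles, outputPath);
}
```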
Sort Keys only in Memory
If all the sort keys themselves would fit in memory, but not the associated data, then you could make and sort your own index. There are many different ways to do that, but here's one scheme.
Read through the entire original data, capturing two things in memory for every record: the sort key and the file offset in the original file where that record is stored. Then, once you have all the sort keys in memory, sort them. Then iterate through the sorted keys one by one, seeking to the right spot in the file, reading that record, writing it to the output file, advancing to the next key, and repeating until the data for every key has been written in order.
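A rough Node.js sketch of that key/offset index approach might look like this; it assumes '\n' line endings, UTF-8 data, and that the first CSV column is the sort key.

```javascript
// Sketch: keep only (sort key, byte offset, byte length) per record in memory,
// sort that index, then copy records out in key order by seeking in the file.
const fs = require('fs');
const readline = require('readline');

async function sortViaKeyIndex(inputPath, outputPath) {
  const index = [];
  let offset = 0;
  const rl = readline.createInterface({ input: fs.createReadStream(inputPath) });
  for await (const line of rl) {
    const length = Buffer.byteLength(line, 'utf8');
    index.push({ key: line.split(',')[0], offset, length });
    offset += length + 1; // +1 for the '\n' that readline stripped
  }

  index.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));

  const fd = fs.openSync(inputPath, 'r');
  const out = fs.createWriteStream(outputPath);
  for (const entry of index) {
    const buf = Buffer.alloc(entry.length);
    fs.readSync(fd, buf, 0, entry.length, entry.offset); // seek to the right spot
    out.write(buf.toString('utf8') + '\n');
  }
  out.end();
  fs.closeSync(fd);
}
```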
BTree Key Sort
If all the sort keys won't fit in memory, then you can get a disk-based BTree library that will let you sort things larger than can be in memory. You'd use the same scheme as above, but you'd be putting the sort key and file offset into a BTree.
Of course, it's only one step further to put the actual data itself from the file into the BTree and then you have a database.
I would read the entire file row by row and output each line into a temporary folder, grouping lines into files by a reasonable time interval (whether the interval should be a year, a day, an hour, etc. is for you to decide based on your data). The temporary folder would then contain one file per interval (for example, with a daily split: 2018-05-20.tmp, 2018-05-21.tmp, 2018-05-22.tmp, etc.). Now we can read the files in order, sort each one in memory, and append it to the target sorted file.
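A minimal Node.js sketch of that bucketing idea, assuming an ISO timestamp in the first CSV column (folder and file names are illustrative; a real version would keep one write stream per bucket open instead of appending line by line):

```javascript
// Sketch: bucket rows into one temp file per day, then sort each bucket in
// memory and append it to the final output.
const fs = require('fs');
const path = require('path');
const readline = require('readline');

async function sortByDayBuckets(inputPath, outputPath, tmpDir = 'tmp-buckets') {
  fs.mkdirSync(tmpDir, { recursive: true });

  // Pass 1: route every line into its day's temp file.
  const rl = readline.createInterface({ input: fs.createReadStream(inputPath) });
  for await (const line of rl) {
    const day = line.split(',')[0].slice(0, 10); // e.g. "2018-05-20"
    fs.appendFileSync(path.join(tmpDir, `${day}.tmp`), line + '\n');
  }

  // Pass 2: walk the buckets in chronological order, sorting each in memory.
  const out = fs.createWriteStream(outputPath);
  for (const name of fs.readdirSync(tmpDir).sort()) {
    const lines = fs.readFileSync(path.join(tmpDir, name), 'utf8')
      .split('\n')
      .filter(Boolean)
      .sort((a, b) => a.split(',')[0].localeCompare(b.split(',')[0]));
    out.write(lines.join('\n') + '\n');
    fs.unlinkSync(path.join(tmpDir, name));
  }
  out.end();
  fs.rmdirSync(tmpDir);
}
```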
Related
I have an object store that has an inline key path and two indexes. The first index identifies a portfolio, such as '2'. The second index identifies the module under a portfolio, such as '2.3' for the 3rd module of portfolio '2'. And the key path identifies the data object within the module, such as '2.3.5'. As the user builds his or her modules, the individual objects are written to the database.
Suppose a user decides to delete an entire portfolio of large size. I'd like to understand which of two ways would be the best in terms of efficiency and memory usage for deleting all data from the object store for that specific portfolio.
One method could open a cursor on the desired portfolio index within a single transaction to delete each item in the object store having that portfolio index value.
A second method, in my particular case, is to use a pointer. I have a pointer array that keeps track of every data object's key path in the object store. This is because a user could choose to insert a new item at position 2 within a module of 100 items. Rather than stepping through the store for the particular module and changing all the key paths for each item 2.3.2 through 2.3.100 to be 2.3.3 through 2.3.101, for example, I add the inserted item as 2.3.101 and place it at position 2 in the pointer, since it's much easier to update the pointer than to move large pieces of data in the database through copying them, deleting them, and writing them again with a new key path.
So, the second method would be to step through the pointer, deleting each data object by the key path stored in the pointer for that portfolio, performing each deletion in a separate transaction.
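For reference, a minimal sketch of the first method (one readwrite transaction cursoring over the portfolio index and deleting each match) could look like this; the store and index names are placeholders, not the actual schema.

```javascript
// Sketch of method one: a single readwrite transaction, one cursor over the
// portfolio index, deleting each matching record. Names are placeholders.
function deletePortfolio(db, portfolioKey) {
  return new Promise((resolve, reject) => {
    const tx = db.transaction('modules', 'readwrite');
    const index = tx.objectStore('modules').index('portfolio');
    const request = index.openCursor(IDBKeyRange.only(portfolioKey));

    request.onsuccess = (event) => {
      const cursor = event.target.result;
      if (cursor) {
        cursor.delete();   // delete the record the cursor points at
        cursor.continue(); // move on to the next match
      }
    };
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}
```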
The questions are:
Is it accurate that a large transaction, such as an open cursor across many data objects, requires the browser to store large amounts of data in memory in order to be able to roll back that transaction if any step fails along the way?
If so, is it better to employ the second approach since each transaction works on one data object only? Or, is it inefficient to repeatedly open small transactions on the same object store and search for each object by specific key path as opposed to just stepping through the store by an ordered index?
If the multiple-transaction approach is taken, would the browser release the memory quickly enough that there'd be a reduction in the total memory used during the entire process, or would it hold on to it until the entire process completes, such that memory usage would accumulate to the same point anyway?
In this case, please assume the expected size of a large portfolio to be 50 modules of 100 objects each, such that the comparison is between a single transaction working on 5,000 data objects through a cursor versus performing 5,000 individual transactions on a single data object at a time through the known key path.
Perhaps, I am overthinking all of this because I am new to it and attempting to learn a bunch of things in a hurry. Thank you for any guidance you may be able to provide.
I need help/tips.
I have a huge amount of JSON data that needs to be merged, sorted, and filtered. Right now, it's separated into different folders: almost 2 GB of JSON files.
What I'm doing right now is:
Reading all files inside each folder
Appending the parsed JSON data to an array variable inside my script
Sorting the array variable
Filtering
Saving it to one file
I'm rethinking this: instead of appending the parsed data to a variable, maybe I should store it in a file? What do you guys think?
Which approach is better when dealing with this kind of situation?
By the way, I'm experiencing a "JavaScript heap out of memory" error.
You could use some kind of database, e.g. MySQL with the MEMORY table engine, so the data would live in RAM only; it would be blazing fast and would be erased after a reboot, though you should truncate the table anyway once the operation is done since it's all temporary. Once the data is in the table, it's easy to filter/sort the required bits and grab the data incrementally, say 1,000 rows at a time, and parse it as needed. You won't have to hold 2 GB of data inside JS.
Looping over 2 GB of data in JS will probably block your JS thread anyway, and you'll end up with a frozen app.
If you use a file for the temporary data to avoid a database, I recommend using a temporary disk mounted in RAM (e.g. tmpfs), which gives much better I/O speed.
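A rough sketch of the MySQL route with the mysql2 package could look like the following; the table and column names, the assumption that each file holds an array of records, and the `ts` field on each record are mine, not from the question.

```javascript
// Sketch: stage the parsed JSON in a MEMORY table, then read it back sorted
// and filtered in pages of 1,000 rows. MEMORY tables can't hold TEXT/BLOB/JSON
// columns, so the payload is stored as a VARCHAR here.
const fs = require('fs');
const path = require('path');
const mysql = require('mysql2/promise');

async function mergeSortFilter(folders) {
  const conn = await mysql.createConnection({ host: 'localhost', user: 'root', database: 'tmpdb' });
  await conn.query('CREATE TABLE IF NOT EXISTS staging (ts DATETIME, payload VARCHAR(4096)) ENGINE=MEMORY');

  // Load folder by folder so the JS heap never holds the whole 2 GB at once.
  for (const folder of folders) {
    for (const file of fs.readdirSync(folder)) {
      const records = JSON.parse(fs.readFileSync(path.join(folder, file), 'utf8')); // assumes an array per file
      for (const r of records) {
        await conn.query('INSERT INTO staging (ts, payload) VALUES (?, ?)', [r.ts, JSON.stringify(r)]);
      }
    }
  }

  // Read back sorted and filtered, 1,000 rows at a time.
  const out = fs.createWriteStream('merged.jsonl');
  for (let offset = 0; ; offset += 1000) {
    const [rows] = await conn.query(
      'SELECT payload FROM staging WHERE ts >= ? ORDER BY ts LIMIT 1000 OFFSET ?',
      ['2018-01-01', offset] // example filter; replace with whatever you actually need
    );
    if (rows.length === 0) break;
    rows.forEach((row) => out.write(row.payload + '\n'));
  }
  out.end();
  await conn.query('TRUNCATE TABLE staging'); // clean up the temporary data
  await conn.end();
}
```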
In my database I have a table with data of cities. It includes the city name, translation of the name (it's a bi-lingual website), and latitude/longitude. This data will not change and every user will need to load this data. There are about 300 rows.
I'm trying to keep the pressure put on the server as low as possible (at least to a reasonable extent), but I'd also prefer to keep these in the database. Would it be best to have this data inside a class that is loaded in my main app.js file? It should be kept in memory and global to all users, correct? Or would it be better on the server to keep it in the database and select the data whenever a user needs it? The sign in screen has the listing of cities, so it would be loaded often.
I've just seen that, unlike with PHP hosting, many Node.js servers don't have tons of memory, even the ones that aren't exactly cheap, so I'm worried about putting unnecessary things into memory.
I decided to give this a try. Using an example data set of 300 rows (each containing a 24-character string, two doubles, and the property names), a small Node.js script indicated additional memory usage of 80 to 100 KB.
You should ask yourself:
How often will the data be used? How much of the data does a user need?
If the whole dataset will be used on a regular basis (let's say multiple times a second), you should consider keeping the data in memory. If, however, your users will only need part of the data, and only from time to time, you might consider loading the data from the database on demand.
Can I guarantee efficient loading from the database?
An important fact is that loading parts of the data from a database might even require more memory, because the V8 garbage collector might delay the collection of the loaded data, so the same data (or multiple parts of the data) might be in memory at the same time. There is also a guaranteed overhead due to database / connection data and so on.
Is my approach sustainable?
Finally, consider the possibility of data growth. If you expect the dataset to grow by a non-trivial amount, think about the above points again and decide whether a growth is likely enough to justify database usage.
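If the in-memory route wins out for a dataset this small, a minimal sketch is to load it once at startup and hand out the cached copy afterwards; the db helper and the query below are placeholders for whatever data layer you already use.

```javascript
// Sketch: load the ~300 city rows once at startup and serve them from memory.
const db = require('./db'); // hypothetical helper whose query() returns a Promise

let cachedCities = null;

async function getCities() {
  if (!cachedCities) {
    // ~300 small rows: on the order of 100 KB resident, loaded a single time.
    cachedCities = await db.query('SELECT name, name_translated, lat, lng FROM cities');
  }
  return cachedCities;
}

module.exports = { getCities };
```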
I'm building (with a partner) a small web app. It is an information app and pulls the info from a JSON file; we'll end up with about 150 items in the JSON, each with 130 properties.
We build the buttons in the app by querying the JSON, which is stored in localStorage, and getting things like item[i].name and item[i].cssClass to construct buttons.
Question is - would it be more efficient and worth it to create 2 arrays in localStorage that hold the name and cssClass for the purposes of constructing these buttons, or is this a waste of time and should we just pull the name and cssClass straight from the JSON?
I should clarify: we need the items to be sortable by name, cssClass, etc.; users can sort the data into lists, and the buttons can be constructed either as alphabetic lists (which take you to the details of the items) or as category buttons which take you to lists of items in that category.
The issue is - does sorting the JSON carry a significant overhead compared to just sorting the array of names?
Is this a waste of time and should we just pull the name and cssClass straight from the JSON?
Yes, it will be fine.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." - http://c2.com/cgi/wiki?PrematureOptimization
Retrieving the i-th element of an array, like getting a property of an object, has time complexity O(1). JavaScript is quite a high-level language; as long as you don't work with millions of items, you shouldn't care about the implementation details of the interpreter.
I'd guess that the web browser will spend far more time working with the DOM and rendering the page than on any operations performed on your data structures.
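In other words, parse once and sort the resulting array of objects directly. A tiny sketch, using the name and cssClass properties mentioned in the question (the localStorage key and the button markup are invented):

```javascript
// Sketch: parse the stored JSON once and sort the array of objects directly.
// Sorting ~150 objects by a string property is effectively instantaneous.
const items = JSON.parse(localStorage.getItem('items')); // key name is assumed

const byName = [...items].sort((a, b) => a.name.localeCompare(b.name));
const byCategory = [...items].sort((a, b) => a.cssClass.localeCompare(b.cssClass));

// Build the buttons straight from the sorted objects; separate name/cssClass
// arrays buy you nothing, since each property access is O(1) anyway.
byName.forEach((item) => {
  const button = document.createElement('button');
  button.className = item.cssClass;
  button.textContent = item.name;
  document.body.appendChild(button);
});
```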
The project requirements are odd for this one, but I'm looking to get some insight...
I have a CSV file with about 12,000 rows of data, approximately 12-15 columns. I'm converting that to a JSON array and loading it via JSONP (it has to run client-side). It takes many seconds to do any kind of querying on the data set to return a smaller, filtered data set. I'm currently using JLINQ to do the filtering, but I'm essentially just looping through the array and returning a smaller set based on conditions.
Would webdb or indexeddb allow me to do this filtering significantly faster? Any tutorials/articles out there that you know of that tackles this particular type of issue?
http://square.github.com/crossfilter/ (no longer maintained, see https://github.com/crossfilter/crossfilter for a newer fork.)
Crossfilter is a JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records...
This reminds me of an article John Resig wrote about dictionary lookups (a real dictionary, not a programming construct).
http://ejohn.org/blog/dictionary-lookups-in-javascript/
He starts with server-side implementations and then works toward a client-side solution. It should give you some ideas for ways to improve what you are doing right now:
Caching
Local Storage
Memory Considerations
If you require loading an entire data object into memory before you apply some transform to it, I would leave IndexedDB and WebSQL out of the mix, as they typically both add complexity and reduce the performance of apps.
For this type of filtering, a library like Crossfilter will go a long way.
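As a rough illustration of that approach (the field names, the filter values, and the use of the crossfilter2 npm package for the maintained fork are assumptions, not from the question):

```javascript
// Sketch: filter ~12,000 rows client-side with Crossfilter.
// Field names and filter values are invented for illustration.
const crossfilter = require('crossfilter2'); // maintained fork of crossfilter

const rows = /* your parsed ~12,000-row JSON array */ [];

const cf = crossfilter(rows);
const byDate = cf.dimension((d) => d.date);     // one dimension per filterable column
const byAmount = cf.dimension((d) => d.amount);

byDate.filter(['2012-01-01', '2012-06-30']);    // range filter on one dimension
byAmount.filter((v) => v > 100);                // predicate filter on another

const filtered = byAmount.top(Infinity);        // every row passing all current filters
```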
Where IndexedDB and WebSQL can come into play in terms of filtering is when you don't need to load, or don't want to load, an entire dataset into memory. These databases are best utilized for their ability to index rows (WebSQL) and attributes (IndexedDB).
With in-browser databases, you can stream data into a database one record at a time and then cursor through it, one record at a time. The benefit here for filtering is that this means you can leave your data on "disk" (a .leveldb in Chrome and a .sqlite database for FF) and filter out unnecessary records either as a pre-filter step or as the filter itself.
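A minimal sketch of that cursor-based filtering (store and index names invented; the predicate is whatever condition you would otherwise hand to JLINQ):

```javascript
// Sketch: stream records out of IndexedDB one at a time via a cursor and keep
// only the ones that pass a predicate. Store/index names are invented.
function filterFromIndexedDB(db, predicate) {
  return new Promise((resolve, reject) => {
    const results = [];
    const store = db.transaction('records', 'readonly').objectStore('records');
    const request = store.index('date').openCursor(); // walk rows in index order

    request.onsuccess = (event) => {
      const cursor = event.target.result;
      if (cursor) {
        if (predicate(cursor.value)) results.push(cursor.value);
        cursor.continue();
      } else {
        resolve(results); // cursor exhausted: all matching rows collected
      }
    };
    request.onerror = () => reject(request.error);
  });
}
```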