I'm developing a CSV parser that should be able to deal with huge datasets (read 10 million rows) in the browser.
Basically the parser works as follows (a rough sketch of the flow is below):
1. The main thread reads a 20MB chunk (reading the whole file at once would quickly crash the browser) and sends the chunk to one of the workers.
2. The worker receives the data, discards the columns I don't want and keeps the ones I do want. Normally I only want 4-5 columns out of 20-30.
3. The worker sends the processed data back to the main thread.
4. The main thread receives the data and saves it in the data array.
5. Repeat steps 1-4 until the file is done.
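Roughly, the main-thread side of that flow looks like this. It is only a simplified sketch with a single worker; the worker file name is a placeholder, and it ignores the fact that a 20MB slice can cut a row in half:

const CHUNK_SIZE = 20 * 1024 * 1024; // 20MB per read
const data = [];                     // one entry per processed chunk

function parseFile(file) {
    const worker = new Worker('csvWorker.js'); // placeholder worker script
    let offset = 0;

    function readNextChunk() {
        if (offset >= file.size) {
            worker.terminate();      // whole file has been read
            return;
        }
        const reader = new FileReader();
        reader.onload = e => worker.postMessage(e.target.result); // step 1
        reader.readAsText(file.slice(offset, offset + CHUNK_SIZE));
        offset += CHUNK_SIZE;
    }

    // steps 3-4: the worker answers with the rows reduced to the wanted columns
    worker.onmessage = e => {
        data.push(e.data);
        readNextChunk();
    };

    readNextChunk();
}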
At the end, with the dataset (Crimes, City of Chicago), I end up with an array that contains 71 other arrays, and each of those arrays holds roughly 90K elements. Each of these 90K elements contains 5 strings (the columns that were taken from the file), namely latitude, longitude, year, block and IUCR.
To summarize: 71 is the number of 20MB chunks in the dataset, 90K is the number of rows in each 20MB chunk, and 5 is the number of columns that were extracted.
I noticed that the browser (Chrome) was using too much memory, so I tried in 4 different browsers (Chrome, Opera, Vivaldi and Firefox), and recorded the memory used by the tab.
Chrome - 1.76GB
Opera - 1.76GB
Firefox - 1.3GB
Vivaldi - 1GB
If I try to recreate the same array but with mock data, it only uses approx. 350MB of memory:
var data = [];
for (let i = 0; i < 71; i++) {
    let rows = [];
    for (let j = 0; j < 90 * 1000; j++) {
        rows.push(["029XX W MADISON ST", "2027", "-87.698850575", "2001", "41.880939487"]);
    }
    data.push(rows);
}
I understand that if the array is static, as in the code above, it's easier for the engine to optimize than in the dynamic case. But I wasn't expecting it to use 5 times more memory for the same quantity of data.
Is there anything I can do to make the parser use less memory?
Basically, there are a few techniques you can use to reduce memory.
First, columns of the CSV that contain numbers should be converted to numbers and stored as such: a number in JavaScript takes 8 bytes, while the same number kept as a string can take much more space (2 bytes per character).
Another thing is to terminate all workers when the job is done.
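A minimal sketch of both ideas, assuming the worker receives the raw lines of one chunk; the column indices and names are made up for illustration, not taken from your code. The numeric columns go into typed arrays, the textual ones stay strings, and the workers are terminated once everything has come back:

function processChunk(lines) {
    const n = lines.length;
    const latitude  = new Float64Array(n); // 8 bytes per value instead of a string
    const longitude = new Float64Array(n);
    const year      = new Uint16Array(n);  // 2 bytes per value
    const block     = new Array(n);        // genuinely textual columns stay strings
    const iucr      = new Array(n);

    for (let i = 0; i < n; i++) {
        const cols = lines[i].split(',');
        latitude[i]  = parseFloat(cols[19]);
        longitude[i] = parseFloat(cols[20]);
        year[i]      = parseInt(cols[17], 10);
        block[i]     = cols[3];
        iucr[i]      = cols[4];
    }
    return { latitude, longitude, year, block, iucr };
}

// once every chunk has been processed:
// workers.forEach(worker => worker.terminate());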
Related
I am working on Urdu (a language spoken in Pakistan, India and Bangladesh) voice recognition to translate Urdu speech into Urdu words. So far I have done nothing except find the Meyda JavaScript library for extracting MFCCs from data frames. Some documents say that ASR needs the first 12 or 13 MFCCs out of 26. For testing, I have 46 separate phonemes (/b/, /g/, /d/, ...) in a folder as wav files. After running the Meyda process on one of the phonemes, it creates 4 to 5 frames per phoneme, where each frame contains the first 12 MFCC values. Because I have less than 10 reputation, posting images is disabled, but you can see the image at the following link. The image contains 7 frames of the phoneme /b/; each frame includes 13 MFCCs. The value of the long red vertical line is 438, the others are 48, 38, etc.
http://realnfo.com/images/b.png
My question is: should I save these frames (MFCCs) in the database as the predefined pattern for /b/, do the same for all the other phonemes, and then, once the microphone is hooked up, have Meyda extract the MFCCs per frame and match each extracted frame's MFCCs against the predefined frames' MFCCs using Dynamic Time Warping, picking the phoneme with the smallest distance at the end?
The professional way to go after MFCCs is HMMs and GMMs, but I don't know how to deal with them. I have studied many documents about HMMs and GMMs, but to no avail.
Co-author of Meyda here.
That seems like a pretty difficult use case. If you already know how to split the buffers up into phonemes, you can run the MFCC extraction on those buffers, and use k Nearest Neighbour (or some better classification algorithm) for what I would imagine would be reasonable success rate.
A rough sketch:
const Meyda = require('meyda');

// I can't find a real KNN library because npm is down.
// I'm just using this as a placeholder for a real one.
const knn = require('knn');

// dataset should be a collection of labelled mfcc sets
const nearestPhoneme = knn(dataset);

const buffer = [...]; // a buffer containing a phoneme

let nearestPhonemes = []; // an array to store your phoneme matches
for (let i = 0; i < buffer.length; i += Meyda.bufferSize) {
    // extract the MFCCs of one frame of the buffer and classify that frame
    nearestPhonemes.push(nearestPhoneme(Meyda.extract('mfcc', buffer.slice(i, i + Meyda.bufferSize))));
}
After this for loop, nearestPhonemes contains an array of the best guesses for phonemes for each frame of the audio. You could then pick the most commonly occurring phoneme in that array (the mode). I would also imagine that averaging the MFCCs across the whole frame may yield a more robust result. It's certainly something you'll have to play around and experiment with to find the optimal solution.
Hope that helps! If you open source your code, I would love to see it.
Given is a big (but not huge) array of strings (1,000-5,000 single strings). I want to perform some calculations and other stuff on these strings. Because it always stopped working when dealing with that one big array, I rewrote my function to recursively fetch smaller chunks (currently 50 elements). I did this using splice because I thought it would be a nice idea to reduce the size of the big array step by step.
After implementing the "chunk"-version, I'm now able to calculate up to about 2000 string-elements (above that my laptop is becoming extremely slow and crashing after a while).
The question: why is it still crashing, even though I'm not processing that huge array but just small chunks successively?
Thanks in advance.
var file = {some-array}; // the array of lines
var portionSize = 50;    // the size of the chunks

// this function is called recursively
function convertStart(i) {
    var size = file.length;
    chunk = file.splice(0, portionSize);
    portionConvert(chunk, i);
}

// this function is used for calculating things with the strings
function portionConvert(chunk, istart) {
    for (var i = 0; i < portionSize; i++) {
        // doing some string calculation with the smaller chunk here
    }
    istart += 1;
    convertStart(istart); // recall the function with the next chunk
}
From my experience, the amount of recursion you're doing can exceed the call stack unless you narrow down the input, which is why you were able to do more with less. Keep in mind that for every new function call, the state of the function at the call site is saved on the stack, in your RAM. If you have a computer with little RAM, it's going to get clogged up.
If you're having a processing problem, you should switch to a loop version. Loops don't progressively save the state of the function, just the values. Typically, I would leave recursion for smaller jobs such as processing tree-like object structures or parsing expressions; situations where the processing needs to "intuitively go deeper" into something. In the case where you just have one long array, I would process each of the elements with forEach, which is a for loop in a handy wrapper:
file.forEach(function(arrayElement) {
    // doing some string calculation with the chunk (arrayElement) here
});
Take a look at forEach here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/forEach
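If you do want to keep working in portions (to report progress, for example), the same thing can be written with a plain loop instead of recursion. A rough sketch, reusing the file and portionSize variables from your question:

var portionSize = 50;

for (var start = 0; start < file.length; start += portionSize) {
    var chunk = file.slice(start, start + portionSize); // slice leaves file untouched
    for (var i = 0; i < chunk.length; i++) {
        // doing some string calculation with chunk[i] here
    }
}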
I would like to work with a fairly large amount of data about the elements and the periodic table.
To start, the program should return the atomic weight of a given element. How would you do that?
By manually creating a table with 118 elements and searching for the given element in tab1[element][], then returning tab1[][atomic_weight], iterating up to 118 times?
Or maybe, instead of creating the table in the program, create a file with the data? The languages are C++ and JS (in browser JS you can't deal with local files, only with server-side ones using e.g. AJAX, right?).
Later it will have to perform more advanced calculations. Of course databases would be helpful, but can this be done without one?
Here are your steps to make this happen:
Decide the targets you want your application to run on (The web, local machine...)
Learn C++ OR Javascript depending on #1 (Buy a book)
Come back to this question on Stack Overflow
Realize this is not a good question for Stack Overflow
Tips when you get to a point where you can answer your own question:
Use a single-dimension array with objects you have designed (see the sketch after this list). This is one reason why object-oriented programming is so great: it can be expanded easily later.
Why the single-dimension array?
118 elements is chump change for a computer, even if you go through every element. You have to touch each one anyway, so why make it more complex than a single-dimension array?
You KNOW how large the data structure will be, and it won't change
You could access elements anywhere on the table in O(1) time based on its atomic number
Groups and periods can be deduced by simple math, and therefore also deduced in constant time.
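For instance, a JavaScript sketch of that single-dimension array of objects; the element data and property names here are just placeholders:

// Index = atomic number - 1, so a lookup by atomic number is O(1).
var elements = [
    { symbol: "H",  name: "Hydrogen", weight: 1.008,  period: 1, group: 1 },
    { symbol: "He", name: "Helium",   weight: 4.0026, period: 1, group: 18 }
    // ... remaining elements
];

function atomicWeight(symbol) {
    // a linear scan over at most 118 entries is nothing for a computer
    for (var i = 0; i < elements.length; i++) {
        if (elements[i].symbol === symbol) {
            return elements[i].weight;
        }
    }
    return undefined;
}

atomicWeight("He"); // 4.0026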
The gist:
You aren't fooling me. You have a long way to go. Learn to program first.
I recommend placing all the element information into a structure.
For example:
struct Element_Attributes
{
    const char * symbol;
    unsigned int weight;
};
The data structure to contain the elements varies, depending on how you want to access them.
Since there are columns and rows on the Periodic Table of the Elements, a matrix would be appropriate. There are blank areas in the matrix, so your program would have to handle the wasted blank space in the matrix.
Another idea is to have row and column links. The row links would point to the next element in the row. The column link would point to the next element in the column. This would have slower access time than a matrix, but would not have empty slots (links).
Also, don't worry about your program's performance unless somebody (a user) says it is too slow. Other entities are usually slower than executing loops. Some of those entities are file I/O and User I/O.
Edit 1 - Implementation
There are 33 columns to the table and 7 rows:
#define MAX_ROWS 7
#define MAX_COLUMNS 33
Element_Attributes Periodic_Table[MAX_ROWS][MAX_COLUMNS];
Manually, an element can be created and added to the table:
Element_Attributes hydrogen = {"H", 1};
Periodic_Table[0][0] = hydrogen;
The table can also be defined statically (when it is declared). This is left as an exercise for the reader.
Searching:
bool element_found = false;
for (unsigned int row = 0; row < MAX_ROWS; ++row)
{
    for (unsigned int column = 0; column < MAX_COLUMNS; ++column)
    {
        const std::string element_symbol = Periodic_Table[row][column].symbol;
        if (element_symbol == "K") // Search for Potassium
        {
            element_found = true;
            break;
        }
    }
    if (element_found)
    {
        break;
    }
}
I've seen little utility routines in various languages that, for a desired array capacity, will compute an "ideal size" for the array. These routines are typically used when it's okay for the allocated array to be larger than the capacity. They usually work by computing an array length such that the allocated block size (in bytes) plus a memory allocation overhead is the smallest exact power of 2 needed for a given capacity. Depending on the memory management scheme, this can significantly reduce memory fragmentation as memory blocks are allocated and then freed.
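For concreteness, the routines I've seen look roughly like this; the per-element size and allocation overhead below are just assumed values:

// Pick a length so that (length * bytesPerElement + overhead)
// lands exactly on the smallest power of 2 that fits the desired capacity.
function idealLength(minCapacity, bytesPerElement, overhead) {
    var bytesNeeded = minCapacity * bytesPerElement + overhead;
    var blockSize = 1;
    while (blockSize < bytesNeeded) {
        blockSize *= 2;
    }
    return Math.floor((blockSize - overhead) / bytesPerElement);
}

idealLength(1000, 8, 16); // 1022 -- an 8192-byte block minus 16 bytes of overhead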
JavaScript allows one to construct arrays with predefined length. So does the concept of "ideal size" apply? I can think of four arguments against it (in no particular order):
JS memory management systems work in a way that would not benefit from such a strategy
JS engines already implement such a sizing strategy internally
JS engines don't really keep arrays as contiguous memory blocks, so the whole idea is moot (except for typed arrays)
The idea applies, but memory management is so engine-dependent that no single "ideal size" strategy would be workable
On the other hand, perhaps all of those arguments are wrong and a little utility routine would actually be effective (as in: make a measurable difference in script performance).
So: Can one write an effective "ideal size" routine for JavaScript arrays?
Arrays in JavaScript are, at their core, objects. They merely act like arrays through an API. Initializing an array with an argument merely sets the length property to that value.
If the only argument passed to the Array constructor is an integer between 0 and 2^32 - 1 (inclusive), this returns a new JavaScript array with length set to that number. (Array, MDN)
Also, there is no separate array "type". An array is of the Object type; it is an Array object (ECMA 5.1).
As a result, there will be no difference in memory usage between using
var one = new Array();
var two = new Array(1000);
aside from the length property. When tested in a loop using Chrome's memory timeline, this checks out as well: creating 1,000 of each of those results in roughly 2.2MB of allocation on my machine.
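The test loop can be as simple as the following (a reconstruction of that kind of test, not the exact code used), with the allocations compared in the memory timeline:

// create 1,000 of each and compare the resulting allocations
var plain = [];
var sized = [];
for (var i = 0; i < 1000; i++) {
    plain.push(new Array());
    sized.push(new Array(1000));
}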
You'll have to measure performance because there are too many moving parts: the VM, the engine and the browser; then the virtual memory (the platform, Windows/Linux, the physical memory available and the mass storage devices, HD/SSD); and, obviously, the current load (the presence of other web pages or, server-side, other applications).
I see little use in such an effort. Any ideal size for performance may just not be ideal anymore when another tab loads in the browser or the page is loaded on another machine.
The best thing I see to improve here is development time: write less and be quicker at deploying your website.
I know this question and its answers are about memory usage. But although there might be no difference in allocated memory between calling the two constructors (with and without the size parameter), there is a difference in performance when filling the array. The Chrome engine obviously performs some pre-allocation, as suggested by this code run in the Chrome profiler:
<html>
<body>
<script>
function preAlloc() {
    var a = new Array(100000);
    for (var i = 0; i < a.length; i++) {
        a[i] = i;
    }
}

function noAlloc() {
    var a = [];
    var length = 100000;
    for (var i = 0; i < length; i++) {
        a[i] = i;
    }
}

function repeat(func, count) {
    var i = 0;
    while (i++ < count) {
        func();
    }
}
</script>
Array performance test
<script>
// 2413 ms scripting
repeat(noAlloc, 10000);
repeat(preAlloc, 10000);
</script>
</body>
</html>
The profiler shows that the function without the size parameter took 28 s to allocate and fill a 100,000-item array 1,000 times, while the function with the size parameter in the Array constructor took under 7 seconds.
I am trying to build a large Array (22,000 elements) of Associative Array elements in JavaScript. Do I need to worry about the length of the indices with regards to memory usage?
In other words, which of the following options saves memory? or are they the same in memory consumption?
Option 1:
var student = new Array();
for (i = 0; i < 22000; i++)
    student[i] = {
        "studentName": token[0],
        "studentMarks": token[1],
        "studentDOB": token[2]
    };
Option 2:
var student = new Array();
for (i = 0; i < 22000; i++)
    student[i] = {
        "n": token[0],
        "m": token[1],
        "d": token[2]
    };
I tried to test this in Google Chrome DevTools, but the numbers are too inconsistent to base a decision on. My best guess is that because the property names repeat, the browser can optimize memory usage by not repeating them for each student[i], but that is just a guess.
Edit:
To clarify, the problem is the following: a large array containing many small associative arrays. Does it matter whether I use long or short keys when it comes to memory requirements?
Edit 2:
The 3N array approach that was suggested in the comments, and that @Joseph Myers is referring to, is creating one array, var student = [], with a size of 3*22000, and then using student[0] for the name, student[1] for the marks, student[2] for the DOB, and so on.
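In code, that approach would look roughly like this (interleaving the three fields, with token standing in for the parsed row as in the options above):

var student = new Array(3 * 22000);
for (var i = 0; i < 22000; i++) {
    student[3 * i]     = token[0]; // name
    student[3 * i + 1] = token[1]; // marks
    student[3 * i + 2] = token[2]; // DOB
}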
Thanks.
The difference is insignificant, so the answer is no. This sort of thing would barely even fall under micro-optimization. You should always opt for the most readable solution when facing such dilemmas: the cost of maintaining code from your second option outweighs any performance gain (if any) you could get from it.
What you should do though is use the literal for creating an array.
[] instead of new Array(). (just a side note)
A better approach to solve your problem would probably be to find a way to load the data in parts, implementing some kind of pagination (I assume you're not doing heavy computations on the client).
The main analysis of associative arrays' computational cost has to do with performance degradation as the number of elements stored increases, but there are some results available about performance loss as the key length increases.
In Algorithms in C by Sedgewick, it is noted that for some key-based storage systems the search cost does not grow with the key length, and for others it does. All of the comparison-based search methods depend on key length--if two keys differ only in their rightmost bit, then comparing them requires time proportional to their length. Hash-based methods always require time proportional to the key length (in order to compute the hash function).
Of course, the key takes up storage space within the original code and/or at least temporarily in the execution of the script.
The kind of storage used for JavaScript may vary between browsers, but in a resource-constrained environment using smaller keys would have an advantage, though likely still too small an advantage to notice; surely, though, there are some cases where the advantage would be worthwhile.
P.S. My library just got in two new books that I ordered in December about the latest computational algorithms, and I can check them tomorrow to see if there are any new results about key length impacting the performance of associative arrays / JS objects.
Update: Keys like studentName take 2% longer on a Nexus 7 and 4% longer on an iPhone 5. This is negligible to me. I averaged 500 runs of creating a 30,000-element array with each element containing an object { a: i, b: 6, c: 'seven' } vs. 500 runs using an object { studentName: i, studentMarks: 6, studentDOB: 'seven' }. On a desktop computer, the program still runs so fast that the processor's frequency / number of interrupts, etc., produce varying results and the entire program finishes almost instantly. Once every few runs, the big key size actually goes faster (because other variations in the testing environment affect the result more than 2-4%, since the JavaScript timer is based on clock time rather than CPU time.) You can try it yourself here: http://dropoff.us/private/1372219707-1-test-small-objects-key-size.html
Your 3N array approach (using array[0], array[1], and array[2] for the contents of the first object; and array[3], array[4], and array[5] for the second object, etc.) works much faster than any object method. It's five times faster than the small object method and five times faster plus 2-4% than the big object method on a desktop, and it is 11 times faster on a Nexus 7.