A section of my Node.js application involves receiving a string as input from the user and storing it in a JSON file. JSON itself obviously has no limit on this, but is there any upper bound on the amount of text that Node can process into JSON?
Note that I am not using MongoDB or any other technology for the actual insertion - this is native stringification and saving to a .json file using fs.
V8 (the JavaScript engine node is built upon) until very recently had a hard limit on heap size of about 1.9 GB.
Node v0.10 is stuck on an older version of V8 (3.14) due to breaking V8 API changes around native addons. Node 0.12 will update to the newest V8 (3.26), which will break many native modules, but opens the door for the 1.9 GB heap limit to be raised.
So as it stands, a single node process can keep no more than 1.9 GB of JavaScript code, objects, strings, etc combined. That means the maximum length of a string is under 1.9 GB.
You can get around this by using Buffers, which store data outside of the V8 heap (but still in your process's heap). A 64-bit build of node can pretty much fill all your RAM as long as you never have more than 1.9 GB of data in JavaScript variables.
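To see this split for yourself, here is a rough illustration (the 512 MB figure is arbitrary, and Buffer.alloc is the modern API; in 0.x-era Node the equivalent was new Buffer(size)). The Buffer's memory shows up in rss, not in V8's heapUsed.
const before = process.memoryUsage();

const buf = Buffer.alloc(512 * 1024 * 1024); // 512 MB allocated outside the V8 heap

const after = process.memoryUsage();
const mb = (n) => Math.round(n / 1048576) + ' MB';
console.log('heapUsed delta:', mb(after.heapUsed - before.heapUsed)); // small
console.log('rss delta:     ', mb(after.rss - before.rss));           // roughly 512 MB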
All that said, you should never come anywhere near this limit. When dealing with this much data, you must deal with it as a stream. You should never have more than a few megabytes (at most) in memory at one time. The good news is node is especially well-suited to dealing with streaming data.
You should ask yourself some questions:
What kind of data are you actually receiving from the user?
Why do you want to store it in JSON format?
Is it really a good idea to stuff gigabytes into JSON? (The answer is no.)
What will happen with the data later, after it is stored? Will your code read it? Something else?
The question you've posted is actually quite vague in regard to what you're actually trying to accomplish. For more specific advice, update your question with more information.
If you expect the data to never be all that big, just throw a reasonable limit of 10 MB or something on the input, buffer it all, and use JSON.stringify.
If you expect to deal with anything larger, you need to stream the input straight to disk. Look into transform streams if you need to process/modify the data before it goes to disk. For example, there are modules that deal with streaming JSON.
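As a rough sketch of the streaming approach (the port and file name are arbitrary; a real server would also enforce the size cap mentioned above, e.g. by counting bytes in a 'data' listener):
const http = require('http');
const fs = require('fs');

http.createServer((req, res) => {
  // Pipe the request body straight to disk; only small chunks are ever in memory.
  const out = fs.createWriteStream('upload.json');
  req.pipe(out);
  out.on('finish', () => res.end('saved\n'));
  out.on('error', () => { res.statusCode = 500; res.end('write failed\n'); });
}).listen(3000);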
The maximum string size in "vanilla" Node.js (v0.10.28) is in the ballpark of 1 GB.
If you are in a hurry, you can test the maximum supported string size with a self-doubling string. The system tested has 8 GB of RAM, mostly unused.
var x = 'x';
while (true) {
    x = '' + x + x; // string context
    console.log(x.length);
}
2
4
8
16
32
64
128
256
512
1024
2048
4096
8192
16384
32768
65536
131072
262144
524288
1048576
2097152
4194304
8388608
16777216
33554432
67108864
134217728
268435456
536870912
FATAL ERROR: JS Allocation failed - process out of memory
Aborted (core dumped)
In another test I got to 1,000,000,000 by appending one character at a time in a for loop.
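That variant looks roughly like this (a minimal sketch; the progress interval is arbitrary):
var s = '';
for (var i = 1; ; i++) {
  s += 'x';                                // grow by a single character
  if (i % 100000000 === 0) console.log(i); // log every 100M characters
}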
Now a critic might say, "Wait, what about JSON? The question is about JSON!" and I would shout: THERE ARE NO JSON OBJECTS IN JAVASCRIPT. The JS types are Object, Array, String, Number, etc., and since JSON is a string representation, this question boils down to what the longest allowed string is. But just to double-check, let's add a JSON.stringify call to address the JSON conversion.
Code
var x = 'x';
while (true) {
    x = '' + x + x; // string context
    console.log(JSON.stringify({ a: x }).length);
}
Expectations: the size of the JSON string will start greater than 2, because the first object stringifies to '{"a":"xx"}', which is 10 chars. It won't start to double until the x string in property a gets bigger. It will probably fail around 256M, since stringification probably makes a second copy of the string. Recall that the stringified result is a new string, independent of the original object.
Result:
10
12
16
24
40
72
136
264
520
1032
2056
4104
8200
16392
32776
65544
131080
262152
524296
1048584
2097160
4194312
8388616
16777224
33554440
67108872
134217736
268435464
Pretty much as expected....
Now these limits are probably related to the C/C++ code that implements JavaScript in the Node.js project, which at this time I believe is the same V8 engine used in Chrome browsers.
There is evidence from blog posts of people recompiling Node.js to get around memory limits in older versions. There are also a number of Node.js command-line switches (e.g. --max-old-space-size). I have not tested the effect of any of these.
The maximum length of a string in node.js is defined by the underlying JavaScript engine, V8. In V8 the maximum length is independent of the heap size; the size of a string is actually constrained by the limits of the optimized object layout. See https://chromium-review.googlesource.com/c/v8/v8/+/2030916, a recent (Feb 2020) change to the maximum length of a string in V8. The commit message explains the different lengths over time: the limit has gone from about 256 MB to 1 GB, then back to 512 MB (on 64-bit V8 platforms).
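On current Node versions you can probe this limit without crashing the process: exceeding it throws a RangeError instead of aborting (behaviour observed on recent V8; treat the exact message as indicative):
let s = 'x';
try {
  while (true) s += s;        // keep doubling until V8 refuses
} catch (err) {
  console.log(err.message);   // e.g. "Invalid string length"
  console.log('largest length reached:', s.length);
}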
This is a good question, but I think the upper limit you need to be worried about doesn't involve the max JSON string size.
In my opinion, the limit you need to worry about is how long you are willing to block the request thread while it processes the user's request.
Any string over 1 MB will take the user a few seconds to upload, and tens of megabytes could take minutes. After receiving the request, the server will take a few hundred milliseconds to several seconds to parse it into a data structure, leading to a very poor user experience (parsing JSON is very expensive).
The bandwidth and server processing times will overshadow any limit JSON may have on string size.
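A quick, unscientific way to see that parse cost on your own hardware (the roughly 50 MB payload here is synthetic; real numbers will vary):
const big = JSON.stringify({ text: 'x'.repeat(50 * 1024 * 1024) }); // ~50 MB of JSON

console.time('JSON.parse');
const parsed = JSON.parse(big);
console.timeEnd('JSON.parse');   // typically hundreds of milliseconds at this size
console.log(parsed.text.length);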
Related
I'm looking at a node.js library (gen-readlines) that reads large flat files via a generator, i.e. the file is read in 'chunks' of 65,536 bytes at a time.
Not having a CS background, I didn't think much about this until someone mentioned that a disk reads 65,536 bytes of data at a time.
Questions:
Is this true of all disks (both metallic and SSD)?
8 bytes == 64 bits. What is the relationship between a 64-bit processor and a disk read size of 64 * 1024 bytes?
i.e. what is the significance of 64KB in terms of Disc IO?
Considering how high-level JavaScript is, can I really instruct a generator to yield bytes after exactly one disc read? Or is the number specified as a buffer size in the library I've linked to completely arbitrary when thinking in terms of JavaScript...
Is this true of all disks (both metallic and SSD)?
No, it depends on how the disk is formatted, the cluster size IIRC. It is a fairly common value in today's world, but smaller cluster sizes aren't uncommon. They are typically multiples of 4k (in the last decade or more). When I was young and the world was new, 512 bytes was normal. :-) 64k is likely to be big enough for even a disk formatted with a large cluster size.
But there's a lot more to it than the basic unit of disk allocation. For one thing, there's very likely multiple levels of caching — in the disk drive's built-in controller, in the disk controller on the motherboard, in the OS... Today's disks (or even yesterday's, or the day before's) are not stupid platters we have to try to micro-manage with code.
8 bytes == 64 bits. What is the relationship between a 64-bit processor and a disk read size of 64 * 1024 bytes?
Other than that they're both powers of 2, I don't think there is one.
Considering how high-level JavaScript is, can I really instruct a generator to yield bytes after exactly one disc read?
That's not really the key question. The key question is whether the code in the generator function (or any function) can read exactly 64k at a time.
The answer is yes, and that code does:
let bytesRead = fs.readSync(fd, readChunk, 0, bufferSize, position);
...where bufferSize is 64k. readSync is a low-level call.
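This is not gen-readlines itself, just a sketch of the same idea: a generator that yields fixed-size chunks read with fs.readSync (the 64 KiB default mirrors the library's buffer size):
const fs = require('fs');

function* readChunks(path, bufferSize = 64 * 1024) {
  const fd = fs.openSync(path, 'r');
  const buf = Buffer.alloc(bufferSize);
  let position = 0;
  try {
    while (true) {
      const bytesRead = fs.readSync(fd, buf, 0, bufferSize, position);
      if (bytesRead === 0) break;                    // end of file
      position += bytesRead;
      yield Buffer.from(buf.subarray(0, bytesRead)); // copy, since buf is reused
    }
  } finally {
    fs.closeSync(fd);
  }
}

// Usage: only one chunk is ever held in memory at a time.
// for (const chunk of readChunks('big.log')) { /* process chunk */ }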
In summary: 64k is likely to be large enough to hold even the largest minimum allocation unit of a disk; and if it's too big, no problem, it's still not outrageous and multiple allocation units can be read into it. But I'd want to see well-crafted benchmarks before I believed it made a significant difference. I can see the logic, but with the layers between even Node's C++ code inside readSync and the actual physical reading of the disk...
While disk reads may be aligned, the OS makes this transparent for the most part; since you mentioned that you're reading sequentially, it doesn't matter much what buffer size you use. There is no relationship between 64-bit processors and 64 KB alignment (I have only heard of 4K alignment anyway).
You may want to create a buffer whose size is a power of 2, just to align better with the memory allocator. JavaScript abstracts most memory allocation, so a 64K buffer doesn't necessarily perform better than a 4K one; in general it just needs to be big enough to reduce syscall overhead.
Do the I/O in your favorite style, as long as it is buffered. The buffer size doesn't matter too much, whether it's 4K or 64K (though a buffer that is too small is almost as bad as no buffering at all), but whether the I/O is buffered or not matters very much.
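If you do want to control the chunk size yourself, Node's streams expose it directly via the highWaterMark option (the value below is an example, not a recommendation):
const fs = require('fs');

const stream = fs.createReadStream('big.dat', { highWaterMark: 64 * 1024 });
stream.on('data', (chunk) => {
  // chunk.length will be at most 64 KiB
});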
1- No, it depends on the firmware of the storage device, on the drive controller, and on the operating system. Newer HDDs use 4 KiB sectors, so such a disk reads at least 4 KiB at a time.
2- There is no relation between the processor's register or bus size and the disk I/O chunks.
3- Data rates depend on both data size and I/O latency overhead (overhead due to I/O processing, for instance system call processing). Bigger data chunks mean fewer I/O operations for the same amount of data, and therefore less I/O overhead.
4- From the point of view of the high-level JavaScript layer, you do not need to worry about these low-level behaviours. Everything will work correctly, since there are many caches at several levels.
I have a 25 MB JSON file that I "require" when my app starts up. Initially it seems that the node.js process takes up almost 200 MB of memory.
But if I leave it running and come back to it, Activity monitor reports that it is using only 9MB which makes no sense at all! At the very least, it should be a few MB more, since even a simple node.js app that does almost nothing (acting like a server), uses 9MB.
The app seems to work fine - it is a server that provides search suggestions from a word list of 220,000 words.
Is Activity Monitor wrong ?
Why is it using only 9MB, but initially used ~200MB when the application started up ?
Since it's JavaScript, things that are no longer being used are removed by the garbage collector (GC), freeing memory. Everything (or many things) may have been loaded into memory at the start, and then items that were no longer needed were removed by the GC. Producing a result can often use more memory while in progress and release some afterwards; for example, temporary data structures may be needed during processing but are no longer needed once the work is done.
It's also possible that items in memory were swapped out and written to disk temporarily (to be retrieved later); this swapping is done by your OS and tends to be used more on programs that reserve a lot of memory.
How much memory it takes to load the file depends on a number of factors.
What text encoding is being used to store the file? JavaScript uses UTF-16 internally, so if that's not what's being used on disk, the size may be different. If the file is in UTF-32, for example, then the in-memory UTF-16 version will be smaller unless it's full of astrals (characters outside the Basic Multilingual Plane). If the file is in UTF-8, then things are reversed: the in-memory version will be larger unless it's full of astrals. But for now, let's just assume that they're about the same size, either because they use the same encoding or the pattern of astrals just happens to make the file sizes more or less the same.
You're right that it takes at least 25MB to load the file (assuming that encodings don't interfere). The semantics of the JSON API being what they are, you need to have the whole file in memory as a string, so the app will take up at least that much memory at that time. That doesn't count whatever the parser needs to run, so you need at least 34MB: 25 for the file, 9 for Node, and then whatever your particular app uses for itself.
But your app doesn't need all of that memory all the time. Depending on how you've written the app, you're probably destroying your references to the file at some point.
Because of the semantics of JSON, there's no way to avoid loading the whole file into memory, which takes 25MB because that's the size of the file. There's also no way to avoid taking up whatever memory the JSON parser needs to do its work and build the object.
But depending on how you've written the app, there probably comes a point when you no longer need that data. Either you exit the function that you used to load the file, or you assign that variable to something else, or any of a number of other possibilities. However it happens, JavaScript reclaims memory that's not being used anymore. This is called garbage collection, and it's popular among so-called "scripting languages" (though other programming languages can use it too).
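A rough way to watch that happen (assumptions: you run with node --expose-gc so global.gc() is available, and './words.json' stands in for the 25 MB file; fs plus JSON.parse is used instead of require, because require would keep the parsed object alive in its module cache):
const fs = require('fs');
const used = () => Math.round(process.memoryUsage().heapUsed / 1048576) + ' MB';

let words = JSON.parse(fs.readFileSync('./words.json', 'utf8')); // big parse: heap jumps
console.log('after load:', used());

// Keep only what the app actually needs (a small derived value here)...
const count = Array.isArray(words) ? words.length : Object.keys(words).length;
words = null;                      // ...then drop the reference to the big object

global.gc();                       // force a collection (only works with --expose-gc)
console.log('after gc:', used(), '| entries:', count);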
There's also the question of text representation versus in-memory representation. Strings require about the same amount of space in memory versus on-disk, unless you change the encoding, but Numbers and Booleans are another matter entirely. In JavaScript, all Numbers are 64-bit floating-point numbers, so if most of your numbers on disk are more than four characters long, then the in-memory representation will be smaller, possibly by quite a bit. Note that I said characters, not digits: it's true that digits are characters, but +, -, e, and . are characters too, so -1e0 takes up twice as much space as -1 when written as text, even though they represent the same value in memory. As another example, 3.14 takes up as much space as 1000 as text (and they happen to take up the same amount of space in memory: 64 bits each). But -0.00000001 and 100000000 take up much less space in memory than on disk, because the in-memory representation is smaller. Booleans can be even smaller: different engines store them in different ways, but you could theoretically do it in as little as one bit. That's a far cry from the 8 bytes it takes to store "true", or 10 to store "false".
So if your data is mostly about Numbers and Booleans, then the in-memory representation stands to get a lot smaller. If it's mostly Strings, then not so much.
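A rough illustration of that difference (sizes are approximate and, for Booleans, engine-dependent):
const texts = ['3.14', '1000', '-0.00000001', '100000000', 'true', 'false'];
for (const t of texts) {
  const inMemory = (t === 'true' || t === 'false')
    ? 'often 1 byte or less (engine-dependent)'
    : '8 bytes (64-bit float)';
  console.log(t.padEnd(12), '| text bytes:', Buffer.byteLength(t), '| in memory:', inMemory);
}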
I wanted to try IndexedDB, to see if it is fit for my purpose.
Doing some testing, I noticed that its growth rate seems to be exponential with every insert.
(Only tested in Google Chrome version 31.0.1650.63 (Official Build 238485) m / Windows so far)
My Code in full: http://pastebin.com/15WK96FY
Basically, I save a string with 2.6 million characters.
Checking window.webkitStorageInfo.queryUsageAndQuota I see that it consumes ~7.8MB, meaning ~3 bytes per character used.
If I save the string 10 times however, I get a usage of ~167MB, meaning ~6.4 bytes per character used.
By saving it 50 times I'm high up in the gigabytes and my computer starts to freeze.
Am I doing something wrong or is there a way around this behaviour?
Your test is wrong. Field test2 should not be indexed.
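A sketch of the fix (the database, store, and field names are made up here, since the original test code is only linked): index only small fields and keep the big string as a plain, un-indexed property.
const bigString = 'x'.repeat(2600000); // stand-in for the ~2.6-million-character string
const open = indexedDB.open('testdb', 1);

open.onupgradeneeded = (e) => {
  const db = e.target.result;
  const store = db.createObjectStore('entries', { keyPath: 'id' });
  store.createIndex('byLabel', 'label');  // small field: fine to index
  // deliberately NO createIndex() on the large payload field
};

open.onsuccess = (e) => {
  const db = e.target.result;
  const tx = db.transaction('entries', 'readwrite');
  tx.objectStore('entries').put({ id: 1, label: 'demo', payload: bigString });
};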
I'm starting to write a Chess program in JavaScript and possibly some Node.JS if I find the need to involve the server in the Chess AI logic, which is still plausible at least in my possibly ignorant opinion. My question is simple enough: Is the client-side FileSystem API for JavaScript a reasonable way to cache off minimax results for future reference, or is the resulting data just way too much to store in any one place? My idea was that it could be used as a way to allow the AI to adapt to the user and "learn" by being able to access previous decisions rather than manually re-determining them every time. Is this a reasonable plan or am I underestimating the memory usage this would need? If your answer is that this is plausible, some tips on the most efficient method for storing the data in this manner would be nice too.
I have written chess engines before in C++, but not in JavaScript.
What you describe is usually solved by a transposition table. You calculate a hash key that identifies the position and store additional data with it.
See:
https://www.chessprogramming.org/Transposition_Table
https://www.chessprogramming.org/Zobrist_Hashing
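For reference, a minimal Zobrist hashing sketch (not taken from those pages; 32-bit keys are used for brevity, where a real engine would use 64-bit keys):
const PIECES = 12;   // 6 piece types x 2 colours
const SQUARES = 64;
const rand32 = () => (Math.random() * 0x100000000) >>> 0;

// One random number per (piece, square) pair, generated once at startup.
const zobrist = Array.from({ length: PIECES }, () =>
  Array.from({ length: SQUARES }, rand32)
);

// position: array of { piece: 0..11, square: 0..63 }
function hashPosition(position) {
  let h = 0;
  for (const { piece, square } of position) {
    h = (h ^ zobrist[piece][square]) >>> 0; // XOR in each piece placement
  }
  return h;
}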
Web storage provides per origin:
2.5 MB for Google Chrome
5 MB for Mozilla Firefox
10 MB for Internet Explorer
Each entry usually holds:
Zobrist hash key: 8 bytes
Best move: 2 bytes
Depth: 1 byte
Score: 2 bytes
Type of score (exact, upper bound, lower bound): 1 byte
= 16 bytes (14 bytes of data, typically padded to 16)
So e.g. Google Chrome can hold about 160k entries. Usually, for a chess position analysis, you use over 1 GB of memory for the transposition table. Anyway, for a JavaScript engine I think the 2.5 MB is a good compromise.
To make sure that the JavaScript engine uses the storage optimally, I advise you to convert the data to some sort of binary representation. Then I would index the localStorage by Zobrist hash key and store all the other information associated with it.
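One possible way to do that packing (the field layout follows the list above; base64 keeps the sketch simple, though it is not the densest encoding given that localStorage stores UTF-16 strings):
function packEntry(hashHi, hashLo, move, depth, score, type) {
  const buf = new ArrayBuffer(16);
  const view = new DataView(buf);
  view.setUint32(0, hashHi);  // 8-byte Zobrist key, stored as two 32-bit halves
  view.setUint32(4, hashLo);
  view.setUint16(8, move);    // best move
  view.setUint8(10, depth);   // search depth
  view.setInt16(11, score);   // score
  view.setUint8(13, type);    // 0 = exact, 1 = upper bound, 2 = lower bound
  return btoa(String.fromCharCode(...new Uint8Array(buf)));
}

// Keyed by the Zobrist hash, as suggested above.
function storeEntry(hashHi, hashLo, move, depth, score, type) {
  const key = 'tt:' + hashHi.toString(16) + hashLo.toString(16);
  localStorage.setItem(key, packEntry(hashHi, hashLo, move, depth, score, type));
}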