I have a use case where I need to offer a 500 MB file as a download to users in the browser. The file itself is stored in an S3 bucket on the backend. As an alternative to fetching the file with a single S3 GET call, I am exploring an approach that makes multiple ranged GET calls to S3, such that the data fetched by those ranged GETs can be concatenated to recreate the original file.
Because the file is large, I won't issue all of the ranged GET calls at once, so the entire file's data [from the GET responses] is never loaded into browser memory.
Instead, I plan to execute the ranged GET calls in batches, so that the browser memory consumed by the download at any time is only the size of the data fetched by the current batch.
As a next step, I am looking for a way to write the data from each batch of GET calls directly to the file system [instead of holding it as a blob in browser memory]. The writes should concatenate the data from all batches into a single file on the user's file system.
Are there any known ways to write data to the file system in this manner?
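For context, here is a rough sketch of the direction I have in mind. It assumes the File System Access API (window.showSaveFilePicker, currently Chromium-only) and a CORS-accessible URL for the object that honours HTTP Range requests (e.g. a presigned S3 URL); the part size and batch size are arbitrary choices on my part:

```js
// Sketch only: assumes window.showSaveFilePicker is available and `fileUrl`
// is a CORS-enabled, Range-capable URL (e.g. a presigned S3 URL).
async function downloadInBatches(fileUrl, totalSize, partSize = 8 * 1024 * 1024, batchSize = 4) {
  // Ask the user where to save the file; returns a FileSystemFileHandle.
  const handle = await window.showSaveFilePicker({ suggestedName: 'download.bin' });
  const writable = await handle.createWritable(); // FileSystemWritableFileStream

  for (let offset = 0; offset < totalSize; offset += partSize * batchSize) {
    // Issue one batch of ranged GETs in parallel.
    const requests = [];
    for (let i = 0; i < batchSize; i++) {
      const start = offset + i * partSize;
      if (start >= totalSize) break;
      const end = Math.min(start + partSize, totalSize) - 1; // Range header is inclusive
      requests.push(
        fetch(fileUrl, { headers: { Range: `bytes=${start}-${end}` } })
          .then((res) => res.arrayBuffer())
      );
    }
    // Only this batch's data is held in memory; write the parts out in order.
    const parts = await Promise.all(requests);
    for (const part of parts) {
      await writable.write(part);
    }
  }
  await writable.close(); // flushes and finalises the single file on disk
}
```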
Related
I have a middleware that reads multipart/form-data and returns any submitted files attached to the request body. I use busboy to process the input and return the file contents as a buffer. I've read that buffers can consume a lot of memory and that ReadableStreams are therefore a better choice. I don't understand why this is so: even a stream has to store all the underlying data somewhere in memory, right? So how can that be better than directly accessing the buffer itself?
Yeah. A stream does store some data, but only a small fraction of it. Say you want to upload a file somewhere. With a buffer, you would potentially have to read a 1 GB file into memory and then upload the whole thing at once. If you don't have that 1 GB of memory available, or if you're uploading multiple files at the same time, you will simply run out of memory. With streams, you can process data as you read it: load a small chunk, upload it, free the memory, load the next chunk, upload that, and so on.
If you're building a server application, the idea is the same. You start receiving a file on your server and can start processing whatever piece of the data you have before the client manages to upload the whole thing. TCP is built in such a way that the client simply cannot upload data faster than you can process it, so at no point will you have the entire file in memory.
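To make that concrete, here is a rough sketch of the difference in a busboy middleware (hedged: busboy's 'file' event signature differs between 0.x and 1.x; this assumes 1.x, and the /tmp destination is just for illustration):

```js
const busboy = require('busboy');
const fs = require('fs');

function handleUpload(req, res) {
  const bb = busboy({ headers: req.headers });

  bb.on('file', (fieldname, fileStream, info) => {
    // Buffered approach (memory-hungry): collect every chunk, then use it.
    // const chunks = [];
    // fileStream.on('data', (c) => chunks.push(c));
    // fileStream.on('end', () => { const whole = Buffer.concat(chunks); /* ... */ });

    // Streaming approach: each chunk is written (or uploaded) as it arrives,
    // so only a small part of the file is ever in memory at once.
    fileStream.pipe(fs.createWriteStream(`/tmp/${info.filename}`));
  });

  bb.on('close', () => res.end('done'));
  req.pipe(bb);
}
```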
I have an application that allows users to upload relatively large video files, which are stored on S3. We previously used flow-php-server to chunk uploads over multiple requests that are then assembled and stored on S3. Unfortunately, this method no longer works: we recently moved to a load-balanced environment, and chunks of the same file are being split across the multiple servers behind our load balancer.
What is the solution to this problem? I am under the impression that if we split a file upload over multiple requests, we can make no guarantees about which server each request hits, so uploads will fail. Does this mean I'll have to settle for single-request uploads and deal with browser single-request file size limits? Is there another way around this?
I'm not sure if the solution requires configuring the server/load balancer to somehow direct uploads for the same file to the same server, or if there's a different method I can implement on the front/back ends to accommodate this.
There are several solutions for saving a data blob to a file in a browser environment. I normally use FileSaver.js.
In this particular case, though, my data comes in from a websocket stream, and the resulting file might be so large (hundreds of MB) that concatenating it in memory really kills performance.
Is it at all possible to open a stream that data can be written to as it comes in?
Edit:
The websocket is an XMPP connection. The file transfer is negotiated via Stream Initiation (XEP-0095) using the SI File Transfer (XEP-0096) protocol; the data is then transferred via an In Band Bytestream (XEP-0047).
Basically, the web application gets the data in base64-encoded chunks of 4096 bytes, and I'd like to know if there is any way to let the user save the file before all the data is received.
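For what it's worth, here is the direction I've been sketching: decode each base64 chunk as it arrives and append it to a writable file stream obtained up front via the File System Access API (StreamSaver.js exposes a similar writable-stream interface as a fallback). The onIbbData hook below is hypothetical stand-in plumbing for the IBB handler, and I'm assuming chunks arrive strictly in order:

```js
// Sketch: write incoming base64-encoded chunks to disk as they arrive.
// `onIbbData(handler)` is a hypothetical hook that fires once per IBB chunk.
async function saveIncomingFile(suggestedName, onIbbData) {
  const handle = await window.showSaveFilePicker({ suggestedName });
  const writable = await handle.createWritable();

  onIbbData(async (base64Chunk, isLast) => {
    // Decode the 4096-byte base64 payload into raw bytes.
    const binary = atob(base64Chunk);
    const bytes = new Uint8Array(binary.length);
    for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);

    await writable.write(bytes);     // appended to the file, not kept in memory
    if (isLast) await writable.close();
  });
}
```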
When uploading a large file to a web server for processing, the upload itself can take a long time, and unless the file can be processed sequentially, the server has to wait until the entire file has been received before processing can begin.
Is it possible to make some JavaScript "glue" that lets the web server request specific ranges of data from a file on the client's computer as necessary? What I'm asking for is, in a way, the same as HTTP Range capability, only the server would be generating the requests and JavaScript on the client's computer would be serving the data.
By doing this, the web server could process files as they're being uploaded, for example video files with important information in the footer, or compressed archives. The web server could unpack parts of a zip file without uploading the whole archive, and could even check that the zip file structure is correct before a large file is uploaded. In fact, any generic processing of a large file that can't be read sequentially could be done at the same time as the file is being transferred, instead of first uploading the file and then processing it later. Is it possible to do something like this from the browser without deploying something like Java or Flash?
If it's possible to read bytes as necessary from the file, it's also conceivable that the web server wouldn't need to have space to store the entire file on a local drive, but simply access the file directly from the client's drive and process it in memory. This would probably be inefficient for some use cases, but the possibility is interesting.
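The client-side half at least seems feasible today with the File API: Blob.slice() gives random access to a user-selected file without reading all of it, so a page could answer server-initiated range requests over a WebSocket. A rough sketch (the WebSocket endpoint and the message format are made up for illustration, not a real protocol):

```js
// Client-side sketch: serve byte ranges of a local file to the server on demand.
function serveFileRanges(file) {
  const ws = new WebSocket('wss://example.com/ranges'); // illustrative endpoint

  ws.onmessage = async (event) => {
    // Server asks for a range, e.g. {"start": 1048576, "end": 2097152}
    const { start, end } = JSON.parse(event.data);

    // Blob.slice() does not read the file; it only creates a view,
    // so only the requested range is ever pulled into memory.
    const rangeBlob = file.slice(start, end);
    const bytes = await rangeBlob.arrayBuffer();

    ws.send(bytes); // binary frame containing just the requested bytes
  };
}

// Usage: wire it to a file chosen by the user.
// document.querySelector('input[type=file]')
//   .addEventListener('change', (e) => serveFileRanges(e.target.files[0]));
```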
What is the best practice for coordinating access to files in node.js?
I'm trying to write an HTTP-based file uploader for very large files (tens of GB) that is resumable. I'm trying to figure out the best approach to handle two people trying to upload the same file at the same time... I'm also trying to think ahead to the possibility that more than one copy of the node.js HTTP server is running behind a load balancer, which means catching duplicate uploads can't rely on the code alone.
In Python, for example, you can create a file by passing the correct flags to the open() call to force an atomic create. I'm not sure whether node.js's default way of opening a new file is atomic.
Another option I thought of, but don't really want to pursue, is using a database with an async driver that supports atomic transactions to track this state...
In order to know whether multiple users are uploading the same file, you will have to identify the files somehow. Hashing is best for this. First, hash the entire file on the client side to identify it and tell the server the hash; if there is already a file on the server with the same hash, then the file has already been uploaded or is currently being uploaded.
Since this is an HTTP file server, you will likely want users to upload files from a browser. You can get the contents of a file in the browser using the FileReader API. Unfortunately, as of now this isn't widely supported, so you might have to use something like Flash to get it to work in other browsers.
As you stream the file into memory with the FileReader, you will want to break it into chunks and hash each chunk. Then send the server all of the file's chunk hashes. It's important that you break the file into chunks and hash those individual chunks instead of hashing the contents of the entire file, because otherwise the client could send one hash and upload an entirely different file.
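As a sketch of that client-side step, using the modern File and Web Crypto APIs rather than FileReader/Flash (the 4 MB chunk size is an arbitrary choice):

```js
// Browser sketch: hash a file in fixed-size chunks so the server can
// compare per-chunk hashes against files it already has.
async function hashFileChunks(file, chunkSize = 4 * 1024 * 1024) {
  const hashes = [];
  for (let offset = 0; offset < file.size; offset += chunkSize) {
    const chunk = file.slice(offset, offset + chunkSize);
    const digest = await crypto.subtle.digest('SHA-256', await chunk.arrayBuffer());
    // Convert the ArrayBuffer digest to a hex string for transport.
    hashes.push(
      Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, '0'))
        .join('')
    );
  }
  return hashes; // e.g. send this list to the server to check for duplicates
}
```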
Once the hashes are received and compared with other files' hashes, and it turns out someone else is currently uploading the same file, the server decides which user gets to upload which chunks of the file. It then tells the uploading clients which chunks it wants from them, and the clients upload their corresponding chunks.
As each chunk is finished uploading, it is rehashed on the server and compared with the original array of hashes to verify that the user is uploading the correct file.
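On the server, that per-chunk verification could look roughly like this in Node.js (SHA-256 to match the client sketch above; storing the verified chunk is left out):

```js
const crypto = require('crypto');

// Verify one uploaded chunk against the hash the client announced earlier.
// `expectedHashes` is the array of hex digests received before the upload began.
function verifyChunk(chunkBuffer, chunkIndex, expectedHashes) {
  const actual = crypto.createHash('sha256').update(chunkBuffer).digest('hex');
  if (actual !== expectedHashes[chunkIndex]) {
    throw new Error(`Chunk ${chunkIndex} failed hash verification`);
  }
  // Safe to append/store the chunk now.
  return true;
}
```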
I found this on HackerNews under a response to someone complaining about some of the same things in node.js. I'll put it here for completeness. This allows me to at least lock some file writes in node.js like I wanted to.
IsaacSchlueter wrote:
You can open a file with O_EXCL if you pass in the open flags as a number. (You can find them on require("constants"), and they need to be binary-OR'ed together.) This isn't documented. It should be. It should probably also be exposed in a cleaner way. Most of the rest of what you describe is APIs that need to be polished and refined a bit. The boundaries are well defined at this point, though. We probably won't add another builtin module at this point, or dramatically expand what any of them can do. (I don't consider seek() dramatic, it's just tricky to get right given JavaScript's annoying Number problems.)
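For reference, in current Node.js you don't need the raw constants: the string flag 'wx' (exclusive create for writing) fails with EEXIST if the file already exists. A small sketch of using that as a per-file upload lock:

```js
const fs = require('fs');

// Atomically create a lock/placeholder file for an upload.
// 'wx' fails with EEXIST if the file is already there, so only one
// process on this machine can win the race.
function claimUpload(path, callback) {
  fs.open(path, 'wx', (err, fd) => {
    if (err) {
      if (err.code === 'EEXIST') return callback(null, false); // someone else has it
      return callback(err);
    }
    fs.close(fd, (closeErr) => callback(closeErr, true));
  });
}

// Note: this only coordinates processes sharing the same filesystem; behind a
// load balancer with separate disks you would still need shared state elsewhere.
```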