Custom Stream Write Length Using Node.js - javascript

This may already be answered somewhere on the site, but if it is I couldn't find it. I also couldn't find an exact answer to my question (or at least couldn't make sense of how to implement a solution) based on any of the official Node.js documentation.
Question: Is it possible to customize the length (in bytes) of each disk write that occurs while piping the input of a readable stream into a file?
I will be uploading large files (~50 GB), and it's possible that many clients could be doing so at the same time. To accomplish this I'll be slicing files on the client side and then uploading one chunk at a time. Ideally I want physical writes to disk on the server side to occur in 1 MB portions - but is this possible? And if it is, how can it be implemented?

You will probably use a WriteStream. While it is not documented in the fs API, any Writable takes a highWaterMark option that determines when it flushes its buffer. See also the details on buffering.
So it's just
var fs = require('fs');
// Buffer up to 1 MB internally before flushing to disk.
var writeToDisk = fs.createWriteStream(path, {highWaterMark: 1024 * 1024});
req.pipe(writeToDisk);
Disclaimer: I would not put faith in cargo-cult "server-friendly" chunk sizes. I'd go with the default (which is 16 KB), and when performance becomes a problem, test other sizes to find the optimal value for the current setup.
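For context, here is a minimal sketch of how that might sit inside an upload endpoint; the plain http server, the route handling, the uploads directory, and the file naming are all assumptions, not part of the original answer:

var http = require('http');
var fs = require('fs');
var path = require('path');

http.createServer(function (req, res) {
  if (req.method === 'POST') {
    // Pipe the incoming chunk straight to disk; the stream buffers up to
    // 1 MB internally before flushing (actual physical writes are still
    // up to the OS). Assumes an ./uploads directory already exists.
    var dest = path.join(__dirname, 'uploads', 'chunk-' + Date.now());
    var writeToDisk = fs.createWriteStream(dest, { highWaterMark: 1024 * 1024 });
    req.pipe(writeToDisk);
    writeToDisk.on('finish', function () {
      res.end('chunk stored');
    });
  } else {
    res.writeHead(405);
    res.end();
  }
}).listen(8080);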

Related

Reading JPEG file to retrieve orientation information

I've been researching ways to retrieve orientation information from a JPEG file in pure JavaScript.
An excellent way to get this information is outlined in this SO answer. Essentially one reads the entire file using readAsArrayBuffer and then processes it for the required information.
However, is it really necessary to read the whole file to retrieve EXIF information? Is there an optimization whereby one can read a subset of bytes when doing this?
For instance, this SO answer seems to suggest the first 20 bytes are good enough for the job. However, the writer of the former answer asserts that he removed the slice statement because the tag sometimes came in after the limit (he had originally set it to 64 KB, i.e. reader.readAsArrayBuffer(file.slice(0, 64 * 1024));).
So what's a rule of thumb one can use when programming this sort of thing? Or does one not exist at all? My goal is to write code whose performance isn't heavily affected by the size (in bytes) of the file a user uploads.
Note: I've tried Googling for this information as well, but haven't found anything meaningful.
Until a more seasoned expert chimes in, I've settled for reader.readAsArrayBuffer(file.slice(0, 128 * 1024));.
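For reference, here is a hedged sketch of that approach, adapted from the kind of code in the linked answer: it reads only the first 128 KB, verifies the JPEG markers, and walks the APP1 (EXIF) segment looking for the Orientation tag (0x0112). The callback codes (-2 for "not a JPEG", -1 for "not found") are just a convention assumed here:

function getOrientation(file, callback) {
  var reader = new FileReader();
  reader.onload = function (e) {
    var view = new DataView(e.target.result);
    if (view.getUint16(0) !== 0xFFD8) return callback(-2); // missing JPEG SOI marker
    var offset = 2;
    while (offset < view.byteLength - 2) {
      var marker = view.getUint16(offset);
      offset += 2;
      if (marker === 0xFFE1) {                       // APP1 segment, where EXIF lives
        offset += 2;                                 // skip the segment length field
        if (view.getUint32(offset) !== 0x45786966) return callback(-1); // "Exif"
        offset += 6;                                 // skip "Exif\0\0" to the TIFF header
        var little = view.getUint16(offset) === 0x4949;        // "II" = little-endian
        var ifd = offset + view.getUint32(offset + 4, little); // first IFD
        var tags = view.getUint16(ifd, little);
        for (var i = 0; i < tags; i++) {
          var entry = ifd + 2 + i * 12;              // each IFD entry is 12 bytes
          if (view.getUint16(entry, little) === 0x0112) {
            return callback(view.getUint16(entry + 8, little)); // orientation 1-8
          }
        }
        return callback(-1);
      } else if ((marker & 0xFF00) !== 0xFF00) {
        break;                                       // not a valid marker; stop
      } else {
        offset += view.getUint16(offset);            // skip over this segment
      }
    }
    callback(-1); // orientation not found within the first 128 KB
  };
  reader.readAsArrayBuffer(file.slice(0, 128 * 1024));
}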

What is the most efficient way in JavaScript to parse huge amounts of data from a file

What is the most efficient way in JavaScript to parse huge amounts of data from a file?
Currently I use JSON.parse to deserialize an uncompressed 250 MB file, which is really slow. Is there a simple and fast way to read a lot of data from a file in JavaScript without looping through every character? The data stored in the file is just a few floating-point arrays.
UPDATE:
The file contains a 3D mesh with 6 buffers (vertices, UVs, etc.). The buffers also need to be presented as typed arrays. Streaming is not an option because the file has to be fully loaded before the graphics engine can continue. Maybe a better question is how to transfer huge typed arrays from a file to JavaScript in the most efficient way.
I would recommend a SAX-based or streaming parser for this kind of data in JavaScript.
DOM parsing would load the whole thing into memory, and that is not the way to go for files as large as you mentioned.
For JavaScript-based SAX parsing (of XML) you might refer to
https://code.google.com/p/jssaxparser/
and
for JSON you might write your own; the following link demonstrates how to write a basic SAX-based parser in JavaScript:
http://ajaxian.com/archives/javascript-sax-based-parser
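To make the event-driven idea concrete, here is a hedged sketch of what SAX-style JSON parsing looks like in Node.js, assuming the clarinet npm package (its event API is recalled from memory, so treat the callback names as an assumption):

var clarinet = require('clarinet'); // npm install clarinet
var fs = require('fs');

var parser = clarinet.parser();
var floats = [];

// React to each value as it streams past instead of building the whole object graph.
parser.onvalue = function (v) {
  if (typeof v === 'number') floats.push(v);
};
parser.onend = function () {
  console.log('collected', floats.length, 'numbers without a full in-memory tree');
};

fs.createReadStream('mesh.json', { encoding: 'utf8' })
  .on('data', function (chunk) { parser.write(chunk); })
  .on('end', function () { parser.close(); });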
Have you tried encoding it to a binary and transferring it as a blob?
https://developer.mozilla.org/en-US/docs/DOM/XMLHttpRequest/Sending_and_Receiving_Binary_Data
http://www.htmlgoodies.com/html5/tutorials/working-with-binary-files-using-the-javascript-filereader-.html#fbid=LLhCrL0KEb6
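As a hedged sketch of the binary route, assuming the server can serve the six buffers concatenated as raw little-endian 32-bit floats from a hypothetical /mesh.bin endpoint, the client-side "parsing" disappears entirely:

var xhr = new XMLHttpRequest();
xhr.open('GET', '/mesh.bin', true);  // hypothetical endpoint serving raw floats
xhr.responseType = 'arraybuffer';
xhr.onload = function () {
  // No character-by-character parsing: a typed array is just a view on the bytes.
  var floats = new Float32Array(xhr.response);
  // If the buffer layout is agreed with the server (an assumption), each buffer
  // is a zero-copy view, e.g. vertex positions first, then UVs:
  var vertFloats = 10000 * 3; // hypothetical vertex count * xyz
  var verts = floats.subarray(0, vertFloats);
  var uvs = floats.subarray(vertFloats, vertFloats + 10000 * 2);
  console.log('received', floats.length, 'floats');
};
xhr.send();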
There isn't a really good way of doing that, because the whole file is going to be loaded into memory, and we all know that all of the browsers have big memory leaks. Can you not instead add some paging for viewing the contents of that file?
Check if there are any plugins that allow you to read the file as a stream, that will improve this greatly.
UPDATE
http://www.html5rocks.com/en/tutorials/file/dndfiles/
You might want to read about the new HTML5 APIs for reading local files. You will still have the issue of downloading 250 MB of data, though.
I can think of one solution and one hack.
SOLUTION:
Extending the "split the data in chunks" idea: it boils down to the HTTP protocol. REST rests on the notion that HTTP has enough "language" for most client-server scenarios.
You can set a Content-len request header on the client to establish how much data you need per request.
Then on the backend you have some options (http://httpstatus.es):
Reply with a 413 if the server is simply unable to get that much data from the db
417 if the server is able to reply, but not within the requested size (Content-len)
206 with the provided chunk, letting the client know "there is more from where that came from"
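A client-side sketch of that solution might look like the following; note that the Content-len request header and the status-code handling are this answer's own (non-standard) convention, and the URL is an assumption:

function fetchChunk(offset, bytesWanted, done) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', '/data?offset=' + offset, true);
  xhr.setRequestHeader('Content-len', String(bytesWanted)); // the proposed custom header
  xhr.onload = function () {
    if (xhr.status === 206) {
      done(null, xhr.responseText, true);   // partial content: more chunks remain
    } else if (xhr.status === 200) {
      done(null, xhr.responseText, false);  // final chunk
    } else {
      done(new Error('server replied ' + xhr.status)); // e.g. 413 or 417
    }
  };
  xhr.send();
}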
HACK:
Use a WebSocket and get the binary file. Then use the HTML5 File API to load it into memory.
This is likely to fail, though, because it's not the download causing the problem, but the parsing of an almost-endless JS object.
You're out of luck in the browser. Not only do you have to download the file, but you'll have to parse the JSON regardless. Parse it on the server, break it into smaller chunks, store that data in the db, and query for what you need.

Avoid duplicate content on Node.js server

I run a small image-hosting service and I've realized there is a lot of duplicate content. I want to eliminate this problem in the future by using a checksum or hash code: a newly uploaded file will be hashed and compared against the existing image-hash database, deleted if it already exists, and the user will be presented with the existing image link. All in one instance.
My setup is barebones Node.js + jQuery File Upload + 2 directories (one for forum uploads, another for direct web uploads).
What is the best (fast and reliable) hash and database setup for me to do this, given that there might be thousands or millions of files in each directory? I think MD5 or SHA-1 is overkill and might take a lot of resources. I would like to know if there is any simpler solution.
Statistics:
~1,000 images uploaded every day
~400 KB average image size
~35,000 images on the server
~30% duplicated content (tested using MD5)
MD5 is actually quite fast, more than fast enough for your use case. One anecdotal benchmark puts it at roughly 400 MB per second on a single CPU (source). It wouldn't be the bottleneck in your server processing, and it is a reliable way to check for duplicate files. MD5 is vulnerable to collision attacks, but those must be painstakingly prepared; chance collisions are statistically impossible. It sounds like collisions wouldn't be too great a problem in your application (but make sure you handle them anyway).
If you truly just want speed to the exclusion of reliability, you could go with CRC. It's not intended to be a true hash, just to detect errors in a byte stream. It has a relatively high collision rate of about 1 in a million. However, it's blazing fast; it's meant to be implemented in hardware on routers.
How about the following approach:
When a user uploads an image, create its MD5 sum
The image is then stored using that MD5 sum as a filename
The original image name is stored on the FS as well, but as a symlink pointing to the MD5 name.
If a user uploads an image that is a duplicate, then you can check whether the MD5 name already exists and just create the symlink.
For converting the existing images into that structure, I'm sure a fairly simple shell script using md5sum, mv and ln -s would do the trick.
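A hedged Node.js sketch of that flow; the uploads directory, the temp-file handling, and the callback shape are all assumptions:

var crypto = require('crypto');
var fs = require('fs');
var path = require('path');

function storeDeduplicated(tmpPath, originalName, done) {
  var hash = crypto.createHash('md5');
  fs.createReadStream(tmpPath)
    .on('data', function (chunk) { hash.update(chunk); })
    .on('end', function () {
      var md5Name = path.join('uploads', hash.digest('hex'));
      fs.exists(md5Name, function (exists) {
        var finish = function () {
          // Keep the original name as a symlink to the content-addressed file.
          fs.symlink(path.basename(md5Name), path.join('uploads', originalName),
            function (err) { done(err, md5Name, exists); }); // exists === true means duplicate
        };
        if (exists) fs.unlink(tmpPath, finish);      // duplicate: discard the new copy
        else fs.rename(tmpPath, md5Name, finish);    // first copy: move into place
      });
    });
}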
One other possibility is to use something like MongoDB to store the images in a DB, which may well be easier to cluster.

Choosing an appropriate compression scheme for data transfer over JSON

After some comments by David, I've decided to revise my question. The original question can be found below as well as the newly revised question. I'm leaving the original question simply to have a history as to why this question was started.
Original Question (Setting LZMA properties for jslzma)
I've got some large JSON files I need to transfer with AJAX. I'm currently using jQuery and $.getJSON(). I'd like to use the jslzma library to decompress the files upon receiving them. Currently, I'm using Django with the pylzma library to compress the files.
The only problem is that there's a lack of documentation for the jslzma library. There is some, but not enough. So I have two questions about how to use the library.
It gives this as an example:
LZMA.decompress(properties, inStream, outStream, outSize);
I know how to set the inStream and outStream variables, but not the properties or the outSize. So can anyone give an example of how to set the properties variable (i.e. what's expected) and how to calculate the outSize...
Thanks.
Edit #1 (Revised Question)
I'm looking for a compression scheme that lends itself to highly repeatable data using python (django) and javascript.
The data being transferred contains elevation measurements. Each file has 1200x1200 data points, which equates to about 2.75 MB in its raw, uncompressed binary form. JSON balloons it to between 5 and 6 MB. I've also looked into base64 (just to cover all the bases), which would reduce the size, but I haven't had any success reading it in JS. I think the data lends itself to easy compression because of the highly repeatable data values. For example, one file has only 83 unique elevation values describing 1,440,000 data points.
I just haven't had much luck, mainly because I'm just starting to learn JavaScript.
So can anyone suggest a compression scheme for this type of data? The goal is to minimize the transfer time by reducing the size for the data.
Thanks.
For what it's worth, LZMA is typically very slow to compress as well as decompress; it is therefore more common to use somewhat faster compression schemes. Standard GZIP (deflate) has a reasonably good balance: its compression ratio is acceptable, and its compression speed is MUCH better than that of LZMA or bzip2.
Also: most web servers and clients support automatic handling of gzip compression, which makes it even more convenient to use.
Decompression on the client side with JavaScript can take significantly longer and depends heavily on the available bandwidth of the client's box. Why not implement a lesser but faster and easier-to-write decompression scheme like RLE, delta, or Golomb coding? Or maybe you want to look into compressed JSON?
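To illustrate how simple the RLE option can be, here is a hedged sketch for the elevation grid; the [value, runLength, ...] wire format is an assumption, not an established scheme:

// Decode [value, runLength, value, runLength, ...] pairs into a typed array.
function decodeRle(pairs, totalPoints) {
  var out = new Float32Array(totalPoints);
  var pos = 0;
  for (var i = 0; i < pairs.length; i += 2) {
    var value = pairs[i], run = pairs[i + 1];
    for (var j = 0; j < run; j++) out[pos++] = value;
  }
  return out;
}

// A toy run: three runs covering seven points.
var demo = decodeRle([12.5, 3, 13.0, 2, 12.5, 2], 7);
// With only 83 unique elevations over 1,440,000 points, runs should be long and
// the encoded pairs far smaller than the raw JSON.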

Is it possible to find stretches of silence in audio files with Javascript?

I've been working on a tool to transcribe recordings of speech with Javascript. Basically I'm hooking up key events to play, pause, and loop a file read in with the audio tag.
There are a number of advanced existing desktop apps for doing this sort of thing (such as Transcriber -- here's a screenshot). Most transcription tools have a built-in waveform that can be used to jump around the audio file, which is very helpful because the transcriber can learn to visually find and repeat or loop phrases.
I'm wondering if it's possible to emulate a subset of this functionality in the browser, with Javascript. I don't know much about signal processing, perhaps it's not even feasible.
But what I envision is JavaScript reading the sound stream from the file and periodically sampling the amplitude. If the amplitude stays very low for longer than a certain threshold of time, that stretch would be labeled as a phrase break.
Such labeling, I think, would be very useful for transcription. I could then set up key commands to jump to the previous period of silence. So hypothetically (imagining a jQuery-based API):
var audio = $('audio#someid');
var silences = silenceFindingVoodoo(audio);
silences will then contain a list of times, so I can hook up some way to let the user jump around through the various silences, and then set the currentTime to a chosen value, and play it.
Is it even conceivable to do this sort of thing with Javascript?
Yes, it's possible with the Web Audio API; to be more precise, you will need an AnalyserNode. As a short proof of concept, you can take this example and add the following code to drawTimeDomain():
var threshold = 1000;
var sum = 0;
// amplitudeArray comes from analyser.getByteTimeDomainData(); values center on 128 at silence.
for (var i = 0; i < amplitudeArray.length; i++) {
  sum += Math.abs(128 - amplitudeArray[i]);
}
var test = (sum < threshold) ? 'silent' : 'sound';
console.log('silent info', test);
You will just need some additional logic to filter silence by duration (e.g. any silence lasting more than 500 ms should be treated as real silence).
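A hedged sketch of that duration filter, assuming drawTimeDomain() runs repeatedly (e.g. via requestAnimationFrame) and reusing the sum and threshold from the snippet above; the 500 ms cutoff and function name are assumptions:

var silenceStart = null;   // timestamp when the current quiet run began
var MIN_SILENCE_MS = 500;  // runs shorter than this are ignored

function checkSilence(sum, threshold) {
  var now = performance.now();
  if (sum < threshold) {
    if (silenceStart === null) silenceStart = now; // a quiet run just started
    if (now - silenceStart >= MIN_SILENCE_MS) {
      console.log('real silence, running for', Math.round(now - silenceStart), 'ms');
    }
  } else {
    silenceStart = null;   // sound resumed; reset the run
  }
}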
I think this is possible using javascript (although maybe not advisable, of course). This article:
https://developer.mozilla.org/En/Using_XMLHttpRequest#Handling_binary_data
... discusses how to access files as binary data, and once you have the audio file as binary data you could do whatever you like with it (I guess, anyway - I'm not really strong with JavaScript). With audio files in WAV format, this would be a trivial exercise, since the data is already organized as time-domain samples. With audio files in a compressed format (like MP3), transforming the compressed data back into time-domain samples would be so insanely difficult to do in JavaScript that I would found a religion around you if you managed to do it successfully.
Update: after reading your question again, I realized that it might actually be possible to do what you're discussing in javascript, even if the files are in MP3 format and not WAV format. As I understand your question, you're actually just looking to locate points of silence within the audio stream, as opposed to actually stripping out the silent stretches.
To locate the silent stretches, you wouldn't necessarily need to convert the frequency-domain data of an MP3 file back into the time-domain of a WAV file. In fact, identifying quiet stretches in audio can actually be done more reliably in the frequency domain than in the time domain. Quiet stretches tend to have a distinctively flat frequency response graph, whereas in the time domain the peak amplitudes of audible speech are sometimes not much higher than the peaks of background noise, especially if auto-leveling is occurring.
Analyzing an MP3 file in javascript would be significantly easier if the file were CBR (constant bit rate) instead of VBR (variable bit rate).
As far as I know, JavaScript is not powerful enough to do this.
You'll have to resort to Flash, or some sort of server-side processing.
With the HTML5 audio/video tags, you might be able to trick the page into doing something like this. You could (hypothetically) identify silences server-side and send the timestamps of those silences to the client as meta data in the page (hidden fields or something) and then use that to allow JavaScript to identify those spots in the audio file.
If you use Web Worker threads you may be able to do this in JavaScript, but that would require spinning up extra threads in the browser. You could break the problem up across multiple threads and process it, but it would be all but impossible to synchronize that with the playback. So JavaScript can determine the silent periods by doing some audio processing, but since you can't link that to the playback well, it would not be the best choice.
But if you wanted to show the waveforms to the user, then JavaScript and canvas can be used for this; see the next paragraph about streaming, though.
Your best bet would be to have the server stream the audio and it can do the processing and find all the silences. Each of these should then be saved in a separate file, so that you can easily jump between the silences, and by streaming, your server app can determine when to load up the new file, so there isn't a break.
I don't think JavaScript is the tool you want to use for processing those audio files - that's asking for trouble. However, JavaScript could easily read a corresponding XML file which describes where those silences occur in the audio file, and adjust the user interface appropriately. Then the question is what you use to generate those XML files:
You can do it manually if you need to demo the capability right away. (Use Audacity to see where those audio envelopes occur.)
Check out this CodeProject article, which creates a wav processing library in C#. The author has created a function to extract silence from the input file. Probably a good place to start hacking.
Just two of my initial thoughts... There are a LOT of audio-processing APIs out there, but they are written for particular frameworks and application programming languages. Definitely make use of them before trying to write something from scratch... unless you happen to really love Fourier transforms.
