Avoid duplicate content on Node.js server - javascript

I have a small image host and I realized there is a lot of duplicate content. I want to eliminate this problem in the future by using a checksum or hash: a newly uploaded file will be hashed, compared against the existing image-hash database, deleted if it already exists, and the user will be presented with the existing image link. All in one instance
My setup is barebones: Node.js + jQuery File Upload + 2 directories (one for forum uploads, another for direct web uploads).
What is the best (fast and reliable) hash and database setup for this, given that there might be thousands or millions of files in each directory? I think MD5 or SHA-1 is overkill and might take a lot of resources. I would like to know if there is any simpler solution.
Statistics:
~1,000 images uploaded every day
~400 KB average image size
~35,000 images on the server
~30% duplicated content (tested using MD5)

MD5 is actually quite fast, more than fast enough for your use case. One anecdotal benchmark puts it at about 400 MB per second on a single CPU (source). It wouldn't be the bottleneck in your server's processing, and it is a reliable way to check for duplicate files. MD5 is vulnerable to collision attacks, but those must be painstakingly prepared; chance collisions are statistically impossible. It sounds like collisions wouldn't be too great a problem in your application (but make sure you handle them anyway).
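For reference, here's a minimal sketch of computing the MD5 of an upload with Node's built-in crypto module, streaming the file so large uploads never have to fit in memory:

const crypto = require('crypto');
const fs = require('fs');

// Resolve with the hex MD5 digest of the file at filePath.
function md5OfFile(filePath) {
    return new Promise((resolve, reject) => {
        const hash = crypto.createHash('md5');
        fs.createReadStream(filePath)
            .on('data', (chunk) => hash.update(chunk))
            .on('end', () => resolve(hash.digest('hex')))
            .on('error', reject);
    });
}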
If you truly just want speed to the exclusion of reliability, you could go with CRC. It's not intended to be a true hash, just to detect errors in a byte stream, and with only 2^32 possible values the birthday paradox makes chance collisions a real possibility across tens of thousands of files. However, it's blazing fast; it's designed to be implemented in hardware on routers.

How about the following approach:
When the user uploads an image, create its MD5 sum
The image is then stored using that MD5 sum as its filename
The original image name is stored on the FS as well, but as a symlink pointing to the MD5-named file
If a user uploads a duplicate image, you can check whether the MD5 name already exists and just create the symlink
For converting the existing images into that structure, I'm sure a fairly simple shell script using md5sum, mv and ln -s would do the trick.
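A hedged Node.js sketch of that upload flow (the directory layout and the md5OfFile helper from the earlier sketch are illustrative, and the existence checks aren't race-safe):

const fs = require('fs');
const path = require('path');

async function storeUpload(tmpPath, originalName, uploadDir) {
    const hash = await md5OfFile(tmpPath);            // e.g. 'd41d8cd98f00...'
    const canonical = path.join(uploadDir, hash);     // content-addressed name
    if (!fs.existsSync(canonical)) {
        fs.renameSync(tmpPath, canonical);            // first copy wins
    } else {
        fs.unlinkSync(tmpPath);                       // duplicate: discard the upload
    }
    const alias = path.join(uploadDir, originalName); // human-readable name
    if (!fs.existsSync(alias)) fs.symlinkSync(canonical, alias);
    return canonical;                                 // link to hand back to the user
}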
One other possibility is to use something like MongoDB to store the images in a DB, which may well be easier to cluster.

Related

Scan/Track many files and skip already processed files with node.js

I want to build a library of tens of thousands of files with node.js, stored in a database (SQLite or something), similar to how Plex does it for videos. The files will be locally available to the node.js server or through a NAS or something. After a file is processed, information about the file (and its location) is stored in a database. I want to make a scan feature that can scan a certain directory (and its subdirectories) for files, skipping files that were already processed before. What is the best way to keep track of which files are already processed? It needs to work for several tens of thousands of files. A couple of ideas I have:
Use a file watcher like fs.watch or chokidar. Downside is that the watcher always needs to be running to detect new files, and it won't catch files added while the server was down.
Cron job to go over files and move them to a new directory when they are processed (I'd prefer a solution where I do not need to move the files)
Based on content hash: hash and store the content of the processed files and check whether the hash of a new file is already in the DB (would require a DB call for each file, and the content has to be read and hashed for each file, making performance bad)
Based on just filenames: get all processed filenames from the DB, loop over all files, and check whether each is in the list of already-processed filenames. Performance would probably be bad when there are a lot of files (both going over that many files and storing all processed filenames from the DB in an object, making memory the bottleneck).
All of the above scenarios have performance issues and probably won't work when there are many files to check. The only performant solution I can think of is grabbing 10 or so files at a time from a needs-processing directory and moving them to a processed directory, but I would like a performant solution where I don't have to move the files. I want a single folder where I can upload all the files; the server either periodically checks for new files or I trigger a rescan of the library.
Store the files directly in the database instead of just their location. Using a Filestream is an option. Then you just add some sort of flag that indicates whether each file has been processed. Then you can loop over all the files and know whether they have been processed or not. Just make sure to update the table for processed files. Depending on the processing, you could also limit processing to times that are convenient.
E.g., if there is a chance a file will not be used but it needs to be processed before use, then you can just process the file right before the call and avoid checking constantly or periodically.
Performance-wise this could even be faster than the filesystem in terms of read/write.
From the SQLite website:
... many developers are surprised to learn that SQLite can read and write smaller BLOBs (less than about 100KB in size) from its database faster than those same blobs can be read or written as separate files from the filesystem. (See 35% Faster Than The Filesystem and Internal Versus External BLOBs for further information.) There is overhead associated with operating a relational database engine, however one should not assume that direct file I/O is faster than SQLite database I/O, as often it is not.
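A hedged sketch of that layout, assuming the better-sqlite3 npm package (any SQLite driver would work):

const Database = require('better-sqlite3');
const fs = require('fs');

const db = new Database('library.db'); // filename is illustrative
db.exec('CREATE TABLE IF NOT EXISTS files (' +
    'name TEXT PRIMARY KEY, data BLOB, processed INTEGER DEFAULT 0)');

// Store the file body itself as a BLOB; re-adding a known file is a no-op.
function addFile(filePath) {
    db.prepare('INSERT OR IGNORE INTO files (name, data) VALUES (?, ?)')
        .run(filePath, fs.readFileSync(filePath));
}

// Everything with processed = 0 still needs work.
function unprocessed() {
    return db.prepare('SELECT name FROM files WHERE processed = 0').all();
}

function markProcessed(name) {
    db.prepare('UPDATE files SET processed = 1 WHERE name = ?').run(name);
}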
Since you are storing file-processing info in the DB, get the last processing time from the DB in a single query and process all files created after that timestamp.
For filtering files by timestamp, see: How to read a file from directory sorting date modified in Node JS
And if you can control the directory structure, then partition your files by datetime and other primary/secondary keys.
How about option 5: based on time? If you know the last time you processed the directory was at timestamp x, then the next go around you can skip all files older than x just by looking at the file stats. Then from this smaller subset you can use hashes to look for clashes.
Edit: it seems arpit and I were typing the same general idea at the same time. Note, though, that the sorting method in the link he included will iterate over all 10k files 3 times. You don't need to sort anything; you just need to iterate through once and process the ones that fit the bill.
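A minimal sketch of that timestamp filter: one pass over the directory, no sorting (lastScanTime would come from wherever you persist it):

const fs = require('fs');
const path = require('path');

// Return only the files modified after the previous scan.
function findNewFiles(dir, lastScanTime) {
    return fs.readdirSync(dir)
        .map((name) => path.join(dir, name))
        .filter((file) => fs.statSync(file).mtimeMs > lastScanTime);
}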

Reading JPEG file to retrieve orientation information

I've been researching ways to retrieve orientation information from a JPEG file in pure JavaScript.
An excellent way to get this information is outlined in this SO answer. Essentially one reads the entire file using readAsArrayBuffer and then processes it for the required information.
However, is it really necessary to read the whole file to retrieve EXIF information? Is there an optimization whereby one can read a subset of bytes when doing this?
For instance, this SO answer seems to suggest the first 20 bytes are good enough for the job. However, the writer of the former answer asserts that he removed the slice statement because sometimes the tag came in after the limit (he had originally set it to 64 KB, i.e. reader.readAsArrayBuffer(file.slice(0, 64 * 1024))).
So what's a rule of thumb one can use when programming this sort of thing? Or does one not exist at all? I want to write code where performance isn't heavily affected by the size (in bytes) of the file uploaded by a user. That is my goal.
Note: I've tried Googling this information as well, but haven't found anything meaningful.
Till a more seasoned expert chimes in, I've settled for reader.readAsArrayBuffer(file.slice(0, 128 * 1024));.
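For what it's worth, a rough sketch of the bounded read: slice off the first 128 KB, then walk the JPEG segment markers looking for APP1 (0xFFE1), which is where EXIF lives. The marker walk is standard JPEG structure; the 128 KB cap is just the guess above.

function findExifSegment(buffer) {
    var view = new DataView(buffer);
    if (view.getUint16(0) !== 0xFFD8) return -1;       // no SOI marker: not a JPEG
    var offset = 2;
    while (offset + 4 <= view.byteLength) {
        var marker = view.getUint16(offset);
        if (marker === 0xFFE1) return offset;          // APP1 segment found
        if ((marker & 0xFF00) !== 0xFF00) return -1;   // lost sync; give up
        offset += 2 + view.getUint16(offset + 2);      // 2 marker bytes + segment length
    }
    return -1;                                         // APP1 not inside the slice
}

var reader = new FileReader();
reader.onload = function (e) {
    console.log('EXIF segment at byte', findExifSegment(e.target.result));
};
reader.readAsArrayBuffer(file.slice(0, 128 * 1024));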

Custom Stream Write Length Using Nodejs

This may already be answered somewhere on the site, but if it is I couldn't find it. I also couldn't find an exact answer to my question (or at least couldn't make sense of how to implement a solution) based on any of the official Node.js documentation.
Question: Is it possible to customize the length (in bytes) of each disk write that occurs while piping the input of a readable stream into a file?
I will be uploading large files (~50 GB) and it's possible that many clients could be doing so at the same time. To accomplish this I'll be slicing files on the client side and then uploading one chunk at a time. Ideally I want physical writes to disk on the server side to occur in 1 MB portions - but is this possible? And if it is, how can it be implemented?
You will probably use a WriteStream. While it is not documented in the fs API, any Writable takes a highWaterMark option that controls when its buffer is flushed. See also the details on buffering.
So it's just
var writeToDisk = fs.createWriteStream(path, {highWaterMark: 1024*1024});
req.pipe(writeToDisk);
Disclaimer: I would not put faith in cargo-cult "server-friendly" chunk sizes. I'd go with the default (which is 16 KB) and, when performance becomes a problem, test other sizes to find the optimal value for the current setup.
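For context, a slightly fuller sketch (the path and port are made up) showing the same option inside a plain HTTP upload handler, with basic error handling:

const http = require('http');
const fs = require('fs');

http.createServer((req, res) => {
    // Buffer up to 1 MB before flushing; note this tunes buffering,
    // not the exact size of each physical disk write.
    const out = fs.createWriteStream('/tmp/upload.bin', { highWaterMark: 1024 * 1024 });
    req.pipe(out);
    out.on('finish', () => res.end('done'));
    out.on('error', () => { res.statusCode = 500; res.end('write failed'); });
}).listen(8080);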

Is there some performance benefit using base64 encoded images?

For a small image, what (if any) is the benefit in loading time of using a base64-encoded image in a JavaScript file (or in a plain HTML file)?
$(document).ready(function () {
    var imgsrc = "../images/icon.png";
    var img64 = "P/iaVYUy94mcZxqpf9cfCwtPdXVmBfD49NHxwMraWV/iJErLmNwAGT3//w3NB";
    $('img.icon').attr('src', imgsrc); // Type 1
    $('img.icon').attr('src', 'data:image/png;base64,' + img64); // Type 2 base64
});
The benefit is that you have to make one less HTTP request, since the image is "included" in a file you have made a request for anyway. Quantifying that depends on a whole lot of parameters such as caching, image size, network speed, and latency, so the only way is to measure (and the actual measurement would certainly not apply to everyone everywhere).
I should mention that another common approach to minimizing the number of HTTP requests is using CSS sprites to put many images into one file. This is arguably an even more efficient approach, since it also results in fewer bytes being transferred (base64 bloats the byte size by a factor of about 1.33).
Of course, you do end up paying a price for this: decreased convenience of modifying your graphics assets.
You need to make multiple server requests. Let's say you download a contrived bit of HTML such as:
<img src="bar.jpg" />
You already needed to make a request to get that. A TCP/IP socket was created, the connection negotiated, the HTML downloaded, and the socket closed. This happens for every file you download.
So off your browser goes to create a new connection and download that jpg, P/iaVYUy94mcZxqpf9cfCwtPdXVmBfD49NHxwMraWV/iJErLmNwAGT3//w3NB
The time to transfer that tiny bit of text was massive, not because of the file download, but simply because of the negotiation to get to the download part.
That's a lot of work for one image, so you can inline the image with base64 encoding. This doesn't work with legacy browsers, mind you, only modern ones.
The same idea behind base64 inline data is why we've done things like the Closure Compiler (optimizes speed of download against execution time) and CSS sprites (get as much data from one request as we can, without being too slow).
There's other uses for base64 inline data, but your question was about performance.
Be careful not to think that the HTTP overhead is so massive that you should only make one request; that's just silly. You don't want to go overboard and inline everything, just the really trivial bits. It's not something you should be using in a lot of places. Separation of concerns is good; don't start abusing this because you think your pages will be faster (they'll actually be slower, because the download for the single file becomes massive and your page won't start pre-rendering until it's done).
It saves you a request to the server.
When you reference an image through the src attribute, the browser loads the page and then makes an additional request to fetch the image.
When you use the base64-encoded image, you save that extra round trip.
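If you want to generate the base64 string yourself rather than hand-encoding it, a quick sketch in Node.js (the file path is illustrative):

var fs = require('fs');
var img64 = fs.readFileSync('images/icon.png').toString('base64');
var dataUri = 'data:image/png;base64,' + img64; // usable directly as a src value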

Is it possible to find stretches of silence in audio files with Javascript?

I've been working on a tool to transcribe recordings of speech with Javascript. Basically I'm hooking up key events to play, pause, and loop a file read in with the audio tag.
There are a number of advanced existing desktop apps for doing this sort of thing (such as Transcriber -- here's a screenshot). Most transcription tools have a built-in waveform that can be used to jump around the audio file, which is very helpful because the transcriber can learn to visually find and repeat or loop phrases.
I'm wondering if it's possible to emulate a subset of this functionality in the browser, with Javascript. I don't know much about signal processing, perhaps it's not even feasible.
But what I envision is JavaScript reading the sound stream from the file and periodically sampling the amplitude. If the amplitude stays very low for longer than a certain threshold of time, that stretch would be labeled as a phrase break.
Such labeling, I think, would be very useful for transcription. I could then set up key commands to jump to the previous period of silence. So hypothetically (imagining a jQuery-based API):
var audio = $('audio#someid');
var silences = silenceFindingVoodoo(audio);
silences will then contain a list of times, so I can hook up some way to let the user jump around through the various silences, and then set the currentTime to a chosen value, and play it.
Is it even conceivable to do this sort of thing with Javascript?
Yes, it's possible with the Web Audio API; to be more precise, you will need an AnalyserNode. For a short proof of concept, take this example and add the following code to drawTimeDomain():
var threshold = 1000;
var sum = 0;
for (var i = 0; i < amplitudeArray.length; i++) {
    sum += Math.abs(128 - amplitudeArray[i]); // byte samples are centered at 128
}
var test = (sum < threshold) ? 'silent' : 'sound';
console.log('silent info', test);
You will just need some additional logic to filter silence by duration (e.g. any silence lasting more than 500 ms should be treated as a real pause).
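A rough sketch of that duration filter, assuming it's called once per animation frame with the per-frame verdict from the snippet above:

var silentSince = null; // when the current quiet stretch began, or null

function checkFrame(isSilent, now) {
    if (isSilent) {
        if (silentSince === null) silentSince = now;
        if (now - silentSince > 500) {
            console.log('real silence, started at', silentSince);
        }
    } else {
        silentSince = null; // sound resumed; reset
    }
}

// e.g. in the draw loop: checkFrame(sum < threshold, performance.now());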
I think this is possible using javascript (although maybe not advisable, of course). This article:
https://developer.mozilla.org/En/Using_XMLHttpRequest#Handling_binary_data
... discusses how to access files as binary data, and once you have the audio file as binary data you can do whatever you like with it (I guess, anyway; I'm not real strong with JavaScript). With audio files in WAV format this would be a trivial exercise, since the data is already organized as time-domain samples. With audio files in a compressed format (like MP3), transforming the compressed data back into time-domain samples would be so insanely difficult to do in JavaScript that I would found a religion around you if you managed to do it successfully.
Update: after reading your question again, I realized that it might actually be possible to do what you're discussing in javascript, even if the files are in MP3 format and not WAV format. As I understand your question, you're actually just looking to locate points of silence within the audio stream, as opposed to actually stripping out the silent stretches.
To locate the silent stretches, you wouldn't necessarily need to convert the frequency-domain data of an MP3 file back into the time-domain of a WAV file. In fact, identifying quiet stretches in audio can actually be done more reliably in the frequency domain than in the time domain. Quiet stretches tend to have a distinctively flat frequency response graph, whereas in the time domain the peak amplitudes of audible speech are sometimes not much higher than the peaks of background noise, especially if auto-leveling is occurring.
Analyzing an MP3 file in javascript would be significantly easier if the file were CBR (constant bit rate) instead of VBR (variable bit rate).
As far as I know, JavaScript is not powerful enough to do this.
You'll have to resort to flash, or some sort of server side processing to do this.
With the HTML5 audio/video tags, you might be able to trick the page into doing something like this. You could (hypothetically) identify silences server-side and send the timestamps of those silences to the client as meta data in the page (hidden fields or something) and then use that to allow JavaScript to identify those spots in the audio file.
If you use WebWorker threads you may be able to do this in JavaScript, but that would require using more threads in the browser. You could break the problem up across multiple workers and process it, but it would be all but impossible to synchronize the analysis with playback. So JavaScript can determine the silent periods by doing some audio processing, but since you can't link that to the playback well, it would not be the best choice.
But if you wanted to show the waveforms to the user, JavaScript and canvas can be used for that; see the next paragraph for the streaming, though.
Your best bet would be to have the server stream the audio, doing the processing and finding all the silences itself. Each stretch between silences should then be saved as a separate file, so that you can easily jump between the silences, and by streaming, your server app can determine when to load up the next file so there isn't a break.
I don't think JavaScript is the tool you want to use for processing those audio files; that's asking for trouble. However, JavaScript could easily read a corresponding XML file that describes where those silences occur in the audio file, adjusting the user interface appropriately. Then the question is what you use to generate those XML files:
You can do it manually if you need to demo the capability right away. (Use Audacity to see where those audio envelopes occur.)
Check out this CodeProject article, which creates a wav processing library in C#. The author has created a function to extract silence from the input file. Probably a good place to start hacking.
Just two of my initial thoughts... There are a LOT of audio processing APIs out there, but they are written for particular frameworks and application programming languages. Definitely make use of them before trying to write something from scratch... unless you happen to really love Fourier transforms.
