Scan/Track many files and skip already processed files with node.js - javascript

I want to build a library of tens of thousands of files with Node.js, with information about the files stored in a database (SQLite or something similar), much like Plex does for videos. The files will be locally available to the Node.js server or reachable through a NAS or similar. After a file is processed, information about the file (and its location) is stored in a database. I want a scan feature that can scan a certain directory (and its subdirectories) for files, skipping the files that have already been processed. What is the best way to keep track of which files are already processed? It needs to work for several tens of thousands of files. A couple of ideas I have:
Use a file watcher like fs.watch or chokidar. Downside is that the watcher always needs to be running to detect new files, and it will not catch files added while the server is down.
A cron job that goes over the files and moves them to a new directory once they are processed (I would prefer a solution where I do not need to move the files).
Based on a content hash: hash and store the content of processed files and check whether the hash of a new file is already in the DB (this would require a DB call for each file, and each file also has to be read and hashed, which hurts performance).
Based on just filenames: get all processed filenames from the DB and loop over all files, checking whether each one is in the list of already-processed filenames. Performance would probably be bad when there are a lot of files (both iterating over that many files and holding all processed filenames from the DB in an object, making memory the bottleneck).
All of the above scenarios have performance issues and probably won't work when there are many files to check. The only performant solution I can think of is grabbing 10 or so files at a time from a needs-processing directory and moving them to a processed directory, but I would like a performant solution where I don't have to move the files. I want a single folder where I can upload all the files, and when I upload new files it either periodically checks for new files or I trigger a library rescan.

Store the files directly in the database instead of just their location. Using a file stream is an option. Then you just add a flag that indicates whether each file has been processed. You can then loop over all the files and know whether they have been processed or not; just make sure to update the table when a file is processed. Depending on the processing, you could also limit processing to times that are convenient.
For example, if there is a chance a file will never be used but it needs to be processed before use, you can process it right before the call and avoid checking constantly or periodically.
Performance-wise this could even be faster than the filesystem in terms of reads and writes.
From the SQLite website:
... many developers are surprised to learn that SQLite can read and write smaller BLOBs (less than about 100KB in size) from its database faster than those same blobs can be read or written as separate files from the filesystem. (See 35% Faster Than The Filesystem and Internal Versus External BLOBs for further information.) There is overhead associated with operating a relational database engine, however one should not assume that direct file I/O is faster than SQLite database I/O, as often it is not.
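A minimal sketch of this approach, assuming the better-sqlite3 driver (the answer does not name one); the table and column names are illustrative:

const fs = require('fs');
const Database = require('better-sqlite3');

const db = new Database('library.db');
db.exec(`CREATE TABLE IF NOT EXISTS files (
  path      TEXT PRIMARY KEY,
  content   BLOB,
  processed INTEGER NOT NULL DEFAULT 0
)`);

// Ingest a file: store its bytes and leave it flagged as unprocessed.
function ingest(filePath) {
  const content = fs.readFileSync(filePath);
  db.prepare('INSERT OR IGNORE INTO files (path, content) VALUES (?, ?)').run(filePath, content);
}

// Loop over everything that still needs work, then flag each file as done.
function processPending(processFn) {
  const paths = db.prepare('SELECT path FROM files WHERE processed = 0').all().map(r => r.path);
  const getOne = db.prepare('SELECT content FROM files WHERE path = ?');
  const markDone = db.prepare('UPDATE files SET processed = 1 WHERE path = ?');
  for (const p of paths) {
    const { content } = getOne.get(p);
    processFn(p, content); // whatever "processing" means for your library
    markDone.run(p);
  }
}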

As you are storing file processing info in the DB, get the last processing time from the DB in a single query and process all files created after that timestamp.
For filtering files by timestamp, see: How to read a file from directory sorting date modified in Node JS
And if you can control the directory structure, then partition your files by datetime and other primary/secondary keys.
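A minimal sketch of that idea, assuming better-sqlite3 and a scans table that records when each scan finished; the schema is illustrative, and mtime is used as a stand-in for creation time:

const fs = require('fs/promises');
const path = require('path');
const Database = require('better-sqlite3');

const db = new Database('library.db');
db.exec('CREATE TABLE IF NOT EXISTS scans (finished_at INTEGER)');

// Recursively walk a directory tree using only built-in modules.
async function* walk(dir) {
  for (const entry of await fs.readdir(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) yield* walk(full);
    else yield full;
  }
}

async function scan(rootDir) {
  const row = db.prepare('SELECT MAX(finished_at) AS last FROM scans').get();
  const lastScan = row && row.last ? row.last : 0; // ms since epoch, 0 = never scanned

  for await (const file of walk(rootDir)) {
    const { mtimeMs } = await fs.stat(file);
    if (mtimeMs <= lastScan) continue; // unchanged since the last scan: skip it
    // ...process the file and store its info in the DB here...
  }
  db.prepare('INSERT INTO scans (finished_at) VALUES (?)').run(Date.now());
}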

How about option 5: based on time? If you know the last time you processed the directory was at timestamp x, then the next go around you can skip all files older than x just by looking at the file stats. Then from this smaller subset you can use hashes to look for clashes.
Edit: Seems arpit and I were typing the same general idea at the same time. Note though that the sorting method in the link he included will iterate over all 10k files 3 times. You don't need to sort anything, you just need to iterate through once and process the ones that fit the bill.
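A minimal sketch of the hash check on that smaller subset, using the built-in crypto module; knownHashes would come from the DB (for example, a single SELECT of previously stored hashes):

const crypto = require('crypto');
const fs = require('fs');

// Stream each file through a hash so large files never sit in memory whole.
function hashFile(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha1');
    fs.createReadStream(filePath)
      .on('data', chunk => hash.update(chunk))
      .on('error', reject)
      .on('end', () => resolve(hash.digest('hex')));
  });
}

// Hash only the files that passed the timestamp filter and keep the unseen ones.
async function filterUnseen(candidateFiles, knownHashes /* Set<string> */) {
  const unseen = [];
  for (const file of candidateFiles) {
    const digest = await hashFile(file);
    if (!knownHashes.has(digest)) unseen.push({ file, digest });
  }
  return unseen; // only these need processing; store their hashes afterwards
}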

Related

Split initial chunks into specific file names

I'm creating a Firefox addon that uses chunking to get individual file sizes down below the limit for JS files within an addon. This works great except for the 'initial' files, which I understand to be the entry point files I've specified. I understand why this is done, but I want to be able to somehow define how these entry point files get split so that I can control what they're called. I can then add references to them wherever needed elsewhere in my extension.
How can I control chunking of the initial files, ideally specifying how many files to split them into and what the names are, or at least getting predictable names?
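For reference, a hedged sketch of the kind of webpack configuration this usually comes down to (assuming webpack 5; the entry names, size limit, and output patterns are illustrative, not taken from the question):

// webpack.config.js
module.exports = {
  entry: {
    background: './src/background.js', // one initial chunk per entry point
    content: './src/content.js',
  },
  output: {
    filename: '[name].js',           // names of the initial entry chunks
    chunkFilename: '[name].part.js', // predictable names for split-off chunks
  },
  optimization: {
    splitChunks: {
      chunks: 'all',            // consider initial chunks for splitting, not only async ones
      maxSize: 4 * 1024 * 1024, // hint webpack to keep emitted files under roughly 4 MB
    },
  },
};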

Merging millions of records using Node.js

I need help/tips.
I have a huge amount of JSON data that needs to be merged, sorted, and filtered. Right now it is separated into different folders, almost 2 GB of JSON files in total.
What I'm doing right now is:
reading all files inside each folder
appending the parsed JSON data to an array variable inside my script
sorting the array variable
filtering it
saving it to one file
I'm rethinking that: instead of appending parsed data to a variable, maybe I should store it inside a file? What do you guys think?
What approach is better when dealing with this kind of situation?
By the way, I'm running into a "JavaScript heap out of memory" error.
You could use some kind of database, e.g. MySQL with the MEMORY table engine, so the data is kept in RAM only; it would be blazing fast and erased after a reboot, though you should truncate the table after the operation anyway since it is all temporary. Once the data is in the table, it is easy to filter/sort the required bits and grab the data incrementally, say 1000 rows at a time, parsing it as needed. You will not have to hold 2 GB of data inside JS.
2 GB of data will probably block your JS thread during the loops and you will end up with a frozen app anyway.
If you use a file for the temporary data to avoid a database, I recommend a temporary disk mounted in RAM, so you get much better I/O speed.
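A minimal sketch of the incremental approach, using better-sqlite3 with an in-memory database as a stand-in for the MySQL MEMORY engine mentioned above; the table layout and the sortKey field are illustrative, and each JSON file is assumed to contain an array of records:

const fs = require('fs');
const path = require('path');
const Database = require('better-sqlite3');

const db = new Database(':memory:'); // rows live in SQLite's memory, not the V8 heap
db.exec('CREATE TABLE records (sort_key TEXT, payload TEXT)');

// Insert the parsed contents of every JSON file, one file at a time.
function loadFolder(folder) {
  const insert = db.prepare('INSERT INTO records (sort_key, payload) VALUES (?, ?)');
  const insertMany = db.transaction(rows => {
    for (const row of rows) insert.run(row.sortKey, JSON.stringify(row));
  });
  for (const name of fs.readdirSync(folder)) {
    insertMany(JSON.parse(fs.readFileSync(path.join(folder, name), 'utf8')));
  }
}

// Read the merged, sorted, filtered result back 1000 rows at a time.
function writeResult(outPath) {
  const out = fs.createWriteStream(outPath);
  const page = db.prepare(
    'SELECT payload FROM records WHERE sort_key IS NOT NULL ORDER BY sort_key LIMIT 1000 OFFSET ?'
  );
  for (let offset = 0; ; offset += 1000) {
    const rows = page.all(offset);
    if (rows.length === 0) break;
    for (const row of rows) out.write(row.payload + '\n');
  }
  out.end();
}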

How can I save a very large in-memory object to file?

I have a very large array with thousands of items
I tried this solution: Create a file in memory for user to download, not through server (creating an anchor that points to a text file).
JSON.stringify on the array caused the tab to freeze. Correction: trying to log the result caused the tab to freeze; stringify by itself works fine.
The data was originally in string form, but creating an anchor with that data resulted in a no-op. I assume this is also because the data was too big, since using dummy data successfully triggered a file download.
How can I get this item onto my filesystem?
edit/clarification:
There is a very large array that I can only access via the browser inspector/console. I can't access it via any other language.
JavaScript in the browser does not let you freely read or write files on the local filesystem; this is for security reasons. The closest thing is cookies, and the amount of data you are using far exceeds their size limit.
However, languages such as PHP, Python, and Ruby allow reading and writing files. It appears you are using binary data, so use binary file modes and write functions.
As to the choice of language: if you already know one, use that, or whichever you can get help with. Writing a file is a very basic operation and all three languages are equally good. If you don't know any of these languages, you can literally copy and paste the code from their websites.

Custom Stream Write Length Using Nodejs

This may already be answered somewhere on the site, but if it is I couldn't find it. I also couldn't find an exact answer to my question (or at least couldn't make sense of how to implement a solution) based on any of the official Node.js documentation.
Question: Is it possible to customize the length (in bytes) of each disk write that occurs while piping the input of a readable stream into a file?
I will be uploading large files (~50 GB) and it's possible that there could be many clients doing so at the same time. To accomplish this I'll be slicing files on the client side and then uploading a chunk at a time. Ideally I want physical writes to disk on the server side to occur in 1 MB portions - but is this possible? And if it is possible, how can it be implemented?
You will probably use a WriteStream. While it is not documented in the fs API, any Writable does take a highWaterMark option that controls how much data it buffers before applying backpressure. See also the details on buffering.
So it's just
const fs = require('fs');
const writeToDisk = fs.createWriteStream(path, { highWaterMark: 1024 * 1024 }); // path = destination file on disk
req.pipe(writeToDisk); // req = the incoming request (a Readable stream)
Disclaimer: I would not believe in cargo-cult "server-friendly" chunk sizes. I'd go with the default (which is 16 KB), and when performance becomes a problem, test other sizes to find the optimal value for your setup.

Avoid duplicate content on Node.js server

I have a small image hosting service and I realized there is a lot of duplicate content. I want to eliminate this problem in the future by using a checksum or hash: a newly uploaded file will be hashed, compared with the existing image hash database, deleted if it already exists, and the user will be presented with the existing image's link. All in one instance.
My setup is barebones Node.js + jQuery File Upload + 2 directories (one for forum uploads, another for direct web uploads).
What is the best (fast and reliable) hash and database setup for me to do this, given that there might be thousands or millions of files in each directory? I think MD5 or SHA1 is overkill and might take a lot of resources. I would like to know if there is any simpler solution.
Statistics:
~1,000 image uploaded everyday
~400 kb average image size
~35,000 image in the server
~30% duplicated content (tested using MD5)
MD5 is actually quite fast, more than fast enough for your use case. One anecdotal benchmark has it at roughly 400 MB per second on a single CPU (source). It wouldn't be the bottleneck in your server processing, and it is a reliable way to check for duplicate files. MD5 is vulnerable to collision attacks, but they must be painstakingly prepared; chance collisions are statistically impossible. It sounds like collisions wouldn't be too great a problem in your application (but make sure you handle them anyway).
If you truly just want speed to the exclusion of reliability, you could go with CRC. It's not intended to be a true hash, just to detect errors in a byte stream. It has a relatively high collision rate of about 1 in a million. However, it's blazing fast; it's meant to be implemented in hardware on routers.
How about the following approach:
When the user uploads an image, create its MD5 sum
The image is then stored using that MD5 sum as a filename
The original image name is stored on the FS as well, but as a symlink pointing to the MD5 name.
If a user uploads an image that is a duplicate, then you can check whether the MD5 name already exists and just create the symlink.
For converting the existing images into that structure, I'm sure a fairly simple shell script using md5sum, mv and ln -s would do the trick.
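A minimal sketch of that approach in Node.js, using only built-in modules; the storeDir location and the handling of the temporary upload path are illustrative:

const crypto = require('crypto');
const fs = require('fs/promises');
const path = require('path');

async function storeUpload(tempPath, originalName, storeDir) {
  const data = await fs.readFile(tempPath);
  const md5 = crypto.createHash('md5').update(data).digest('hex');
  const hashedPath = path.join(storeDir, md5 + path.extname(originalName));

  let isDuplicate = true;
  try {
    await fs.access(hashedPath);
  } catch {
    isDuplicate = false;
  }
  if (isDuplicate) {
    await fs.unlink(tempPath);             // content already stored: drop the new upload
  } else {
    await fs.rename(tempPath, hashedPath); // first time we see this content: move it into place
  }

  // Keep the original name as a symlink pointing at the content-addressed file.
  const namedPath = path.join(storeDir, originalName);
  await fs.symlink(path.basename(hashedPath), namedPath).catch(() => {});
  return hashedPath; // hand this (or the symlink) back to the user
}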
One other possibility is to use something like MongoDB to store the images in a DB, which may well be easier to cluster.
