I'm creating a Firefox add-on that uses chunking to get individual file sizes below the limit for JS files within an add-on. This works great except for the 'initial' files, which I understand to be the entry point files I've specified. I understand why this is done, but I want to be able to define how these entry point files get split so that I can control what they're called. I can then add references to them wherever needed elsewhere in my extension.
How can I control chunking of the initial files, ideally specifying how many files to split them into and what their names are, or at least getting predictable names?
I want to make a library of tens of thousands of files with node.js, stored in a database (sqlite or something), similar to how Plex does it for videos. The files will be locally available to the node.js server or through a NAS or something. After a file is processed, information about the file (and its location) is stored in a database. I want to make a scan feature that can scan a certain directory (and its subdirectories) for files. I want to skip the files that have already been processed. What is the best way to keep track of which files are already processed? It needs to work for several tens of thousands of files. A couple of ideas I have:
Use a file watcher like fs.watch or chokidar. The downside is that the watcher always needs to be running in order to detect new files, and it cannot catch up on files added while the server was down.
A cron job that goes over the files and moves them to a new directory once they are processed (I would prefer a solution where I do not need to move the files).
Based on content hash: hash and store the content of the processed files and check whether the hash of a new file is already in the DB (this would require a DB call for each file, and the content of every file has to be read and hashed, making performance bad).
Based on just filenames: get all processed filenames from the DB, loop over all files, and check whether each is in the list of filenames already processed. Performance would probably be bad when there are a lot of files (both from going over that many files and from holding all processed filenames from the DB in an object, making memory the bottleneck).
All of the above scenarios have performance issues and probably won't work when there are many files to check. The only performant solution I can think of is grabbing 10 or so files at a time from a needs-processing directory and moving them to a processed directory, but I would like a performant solution where I don't have to move the files. I want a single folder where I can upload all the files, and when I upload a new file, it either periodically checks for new files or I have to trigger a library rescan to check for new files.
Store the files directly in the database as opposed to their location. Using Filestream is an option. Then you just add some sort of flag that indicates whether a file has been processed. Then you can loop over all the files and know whether they have been processed or not. Just make sure to update the table for processed files. Depending on the processing, you could also limit processing to times that are convenient.
For example, if there is a chance a file will never be used but it must be processed before use, you can process it just before the call and avoid checking constantly or periodically.
Performance-wise this could even be faster than the filesystem in terms of read-write.
From the SQLite website:
... many developers are surprised to learn that SQLite can read and write smaller BLOBs (less than about 100KB in size) from its database faster than those same blobs can be read or written as separate files from the filesystem. (See 35% Faster Than The Filesystem and Internal Versus External BLOBs for further information.) There is overhead associated with operating a relational database engine, however one should not assume that direct file I/O is faster than SQLite database I/O, as often it is not.
Since you are storing file-processing info in the DB, get the last processing time from the DB in a single query and process all the files that were created after that timestamp.
For filtering files by timestamp, see "How to read a file from directory sorting date modified in Node JS".
And if you can control the directory structure, then partition your files by datetime and other primary/secondary keys.
How about option 5: based on time? If you know the last time you processed the directory was at timestamp x, then the next go around you can skip all files older than x just by looking at the file stats. Then from this smaller subset you can use hashes to look for clashes.
Edit: Seems arpit and I were typing the same general idea at the same time. Note though that the sorting method in the link he included will iterate over all 10k files 3 times. You don't need to sort anything, you just need to iterate through once and process the ones that fit the bill.
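The single-pass idea above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `findFilesModifiedSince` and the `lastScanMs` parameter are hypothetical, and the last-scan timestamp is assumed to come from your own database.

```javascript
const fs = require('fs');
const path = require('path');

// Recursively collect files under `dir` whose modification time is newer
// than `lastScanMs` (milliseconds since the epoch). One pass, no sorting.
// `lastScanMs` is assumed to be loaded from the database by the caller.
function findFilesModifiedSince(dir, lastScanMs) {
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // Descend into subdirectories and merge their results.
      found.push(...findFilesModifiedSince(fullPath, lastScanMs));
    } else if (entry.isFile() && fs.statSync(fullPath).mtimeMs > lastScanMs) {
      found.push(fullPath);
    }
  }
  return found;
}
```

One caveat worth checking for your setup: a file copied or moved into the folder can keep an old modification time, in which case a pure mtime filter would skip it.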
Given a path, I want to record the folder’s unique characteristics, so that next time I check these characteristics, I can tell if the folder has been changed or not.
The solution I can think of is to recursively record every file’s path and checksum in a string, and then calculate the checksum of that string.
I wonder if there’s a more efficient way to do this.
You could save some calculations by looking at the mtime of the files inside the folder instead of calculating their hash.
You'd still need to do it for all the files inside the folder, as altering a file does not update the mtime of the folder containing it.
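A rough sketch of the mtime-based approach, assuming Node.js: it walks the folder once and hashes each entry's relative path, size, and mtime instead of the file contents. The name `folderFingerprint` and the choice of SHA-256 are illustrative assumptions, not something the answer prescribes.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Compute a fingerprint of a folder from file metadata only
// (relative path, size, mtime), without reading file contents.
function folderFingerprint(root) {
  const hash = crypto.createHash('sha256');
  const walk = (dir) => {
    // Sort entries so the fingerprint does not depend on readdir order.
    const entries = fs.readdirSync(dir, { withFileTypes: true })
      .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
    for (const entry of entries) {
      const fullPath = path.join(dir, entry.name);
      const relPath = path.relative(root, fullPath);
      if (entry.isDirectory()) {
        hash.update(`d:${relPath}\n`);
        walk(fullPath);
      } else {
        const { size, mtimeMs } = fs.statSync(fullPath);
        hash.update(`f:${relPath}:${size}:${mtimeMs}\n`);
      }
    }
  };
  walk(root);
  return hash.digest('hex');
}
```

Store the returned hex string and compare it on the next check. A change that preserves both size and mtime would go unnoticed, which is the trade-off for skipping content hashes.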
Currently, I'm looking at trying to remove part of what is basically a proprietary archive format; in order to support the ability to remove a file, I'm trying to figure out how to remove a segment of the file (given an offset and a length). I see there's plenty of append logic when it comes to the fs module of node, but nothing that seems to "splice" parts of a file.
Is this going to be even possible? Will I have to resort to the less preferred option of writing to an entirely new file instead?
The operating system handles appending to a file very quickly; there is no need to rewrite the whole file when you open it for appending.
But if you want to cut a segment out of the middle of a file, it doesn't matter which programming language you use: you have to read the file and write it out again without that segment.
What you can do is create a new file and write two slices of the input buffer to it.
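const fs = require('fs');
// Read the whole input file into memory.
const buffer = fs.readFileSync('input_file');
// Write bytes 0-19, then append bytes 50-99; bytes 20-49 are dropped.
fs.writeFileSync('output', buffer.subarray(0, 20));
fs.appendFileSync('output', buffer.subarray(50, 100));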
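For large archives, reading the whole file into one buffer may be impractical. The following is a hedged sketch of the same two-slice idea that copies in fixed-size chunks, given an offset and a length as in the question. The names `removeSegment` and `copyRange` and the 64 KiB chunk size are assumptions made for illustration.

```javascript
const fs = require('fs');

// Copy `count` bytes from `inFd`, starting at byte `start`, to the
// current write position of `outFd`, reusing `chunk` as scratch space.
function copyRange(inFd, outFd, start, count, chunk) {
  let position = start;
  let remaining = count;
  while (remaining > 0) {
    const toRead = Math.min(chunk.length, remaining);
    const bytesRead = fs.readSync(inFd, chunk, 0, toRead, position);
    if (bytesRead === 0) break; // reached end of file early
    fs.writeSync(outFd, chunk, 0, bytesRead);
    position += bytesRead;
    remaining -= bytesRead;
  }
}

// Write a copy of `inputPath` to `outputPath` with the bytes
// [offset, offset + length) left out.
function removeSegment(inputPath, outputPath, offset, length) {
  const inFd = fs.openSync(inputPath, 'r');
  const outFd = fs.openSync(outputPath, 'w');
  try {
    const size = fs.fstatSync(inFd).size;
    const chunk = Buffer.alloc(64 * 1024);
    // Part before the removed segment.
    copyRange(inFd, outFd, 0, Math.min(offset, size), chunk);
    // Part after the removed segment, if any.
    const tailStart = offset + length;
    if (tailStart < size) {
      copyRange(inFd, outFd, tailStart, size - tailStart, chunk);
    }
  } finally {
    fs.closeSync(inFd);
    fs.closeSync(outFd);
  }
}
```

Once the new file is written, something like `fs.renameSync(outputPath, inputPath)` can replace the original.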
I have a very large array with thousands of items.
I tried the solution from "Create a file in memory for user to download, not through server" of creating an anchor to a text file.
~~JSON.stringify on the array caused the tab to freeze~~ Correction: Trying to log out the result caused the tab to freeze, stringify by itself works fine
The data was originally in string form but creating an anchor with that data resulted in a no-op, I'm assuming also because the data was too big, because using dummy data successfully resulted in a file download being triggered
How can I get this item onto my filesystem?
edit/clarification:
There is a very large array that I can only access via the browser inspector/console. I can't access it via any other language.
Javascript does not allow you to read or write files, except for cookies, and I think the amount of data you are using exceeds the size limit for cookies. This is for security reasons.
However, languages such as PHP, Python and Ruby allow the reading and writing of files. It appears you are using binary data, so use binary files and write functions.
As to the choice of language: if you already know one, use that, or whichever you can get help with. Writing a file is a very basic operation and all three languages are equally good. If you don't know any of these languages you can literally copy and paste the code from their websites.
Let me explain: I am building a node.js project that needs to check if dates match or fall within a range. If there is a match, I need to store a reference to the path of a file on the server. There are about 80 of these files. They are configurations.
I can write a giant condition in a function that runs through and checks the dates. It will be fast, I'm sure. The real question is: is it smarter to let each config file store its own date (the date is a calculation based on a date that will have to be passed in to the config file), then loop through the files, requiring each one, finding the property holding the date, checking it, and then either storing the file's path or not?
The requiring approach will be much less code, and it will be cleaner, but I'm wondering if I will take a huge performance hit. Is it better to just write a giant list of conditions?
Sorry if this is not clear. Let me know if I need to include anything to help clarify the question.
the date is a calculation based on a date that will have to be passed in to the config file
Then don't put the calculation result in the config file as well, but store it in memory.
On startup, run through all 80 files (in parallel?), collect the dates and do the respective calculations. Store the results in an array or so.
On each request, write a loop (not a giant hand-written condition!) to find the date, and use the file path associated to it.
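The two steps above might look something like this. It is only a sketch under stated assumptions: each config is assumed to expose a hypothetical `getRange(baseDate)` function returning `{ start, end }` as `Date` objects, and the `configs` array is assumed to be built at startup by requiring each of the 80 files. The names `buildIndex` and `findMatchingPaths` are illustrative too.

```javascript
// Run once at startup: compute each config's date range and keep it in memory.
// `configs` is an array of { path, config }, where `config` is the required
// module and `getRange` is a hypothetical export, not a known API.
function buildIndex(configs, baseDate) {
  return configs.map(({ path, config }) => {
    const { start, end } = config.getRange(baseDate);
    return { path, start: start.getTime(), end: end.getTime() };
  });
}

// Run on each request: a plain loop over the in-memory index that
// returns the paths whose range contains `date` (inclusive bounds).
function findMatchingPaths(index, date) {
  const t = date.getTime();
  const matches = [];
  for (const entry of index) {
    if (t >= entry.start && t <= entry.end) {
      matches.push(entry.path);
    }
  }
  return matches;
}
```

With about 80 entries, a linear scan per request should cost very little, and no config file is touched after startup.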