Given a path, I want to record the folder’s unique characteristics, so that next time I check these characteristics, I can tell if the folder has been changed or not.
The solution I can think of is to recursively record every file's path and checksum in a string, and then calculate the checksum of that string.
I wonder if there's a more efficient way to do this.
You could save some calculations by looking at the mtime of the files inside the folder instead of calculating their hash.
You'd still need to do it for all the files inside the folder, as altering a file does not update the mtime of the folder containing it.
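A minimal sketch of the mtime-based approach, assuming Python and a hypothetical helper name `folder_fingerprint`. It hashes each file's relative path, size and mtime rather than its contents, so it is cheap, but it will miss an edit that preserves both size and mtime, and it ignores empty directories.

```python
import hashlib
import os


def folder_fingerprint(root: str) -> str:
    """Hash every file's relative path, size and mtime under root."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the traversal order deterministic
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            rel_path = os.path.relpath(full_path, root)
            record = f"{rel_path}\0{info.st_size}\0{info.st_mtime_ns}\n"
            digest.update(record.encode("utf-8", "surrogateescape"))
    return digest.hexdigest()
```

Store the returned string and compare it with a fresh call later; a different value means some file was added, removed, renamed, resized or touched.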
I'm creating a Firefox addon that uses chunking to get individual file sizes down to below the limit of js files within an addon. This works great except for the 'initial' files, which I understand to be the entry point files I've specified. I understand why this is done, but want to be able to somehow define how these entry point files get split such that I can control what they're called. I can then add references to these wherever needed elsewhere in my extension.
How can I control chunking of the initial files, ideally specifying how many files to split them into and what the names are, or at least having predictable names?
I want to make a library of tens of thousands of files with node.js, stored in a database (sqlite or something) (similar to how Plex does it for videos). The files will be locally available to the node.js server or through a NAS or something. After a file is processed, information about the file (and its location) is stored in a database. I want to make a scan feature that can scan a certain directory (and subdirectories of that directory) for files. I want to skip the files that have already been processed. What is the best way to keep track of which files are already processed? It needs to work for several tens of thousands of files. A couple of ideas I have:
Use a file watcher like fs.watch or chokidar. Downside is that this watcher always needs to run in order to detect new files and will not work backwards when server is down.
Cron job to go over files and move the files to a new directory when they are processed (prefer a solution where I do not need to move the files)
Based on content hash: hash and store the content of the processed files and check if the hash of a new file is already in the DB (would require a DB call for each file, and also the content has to be checked and hashed for each file, making performance bad)
Based on just filenames: Get all processed filenames from the DB and loop over all files and check if they are in the list of filenames already processed. Performance would probably be bad when there are a lot of files (both going over that many files and storing all processed filenames from the DB in an object, making the memory the bottleneck).
All of the above scenarios have performance issues and probably won't work when there are many files to check. The only performant solution I can think of is grabbing 10 or so files at a time from a needs-processing directory and moving them to a processed directory, but I would like a performant solution where I don't have to move the files. I want a single folder where I can upload all the files, and when I upload new files it either periodically checks for them or I have to trigger a library rescan to find them.
Store the files directly in the database as opposed to their location. Using Filestream is an option. Then you just add some sort of flag that indicates whether a file has been processed. Then you can just loop over all the files and know if they have been processed or not. Just make sure to update the table for processed files. Depending on the processing you could also limit processing to times that are convenient.
For example, if there is a chance a file will not be used, but it needs to be processed before use, then you can just process the file right before the call and avoid checking constantly or periodically.
Performance-wise this could even be faster than the filesystem in terms of read-write.
From the SQLite website:
... many developers are surprised to learn that SQLite can read and write smaller BLOBs (less than about 100KB in size) from its database faster than those same blobs can be read or written as separate files from the filesystem. (See 35% Faster Than The Filesystem and Internal Versus External BLOBs for further information.) There is overhead associated with operating a relational database engine, however one should not assume that direct file I/O is faster than SQLite database I/O, as often it is not.
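A rough sketch of the "store the file plus a processed flag" idea, using Python's built-in `sqlite3` module. The table name `files`, its columns, and the helper names are all assumptions made for illustration; for files much larger than the roughly 100KB mentioned in the quote, storing only the path may still be the better choice.

```python
import sqlite3


def open_library(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) files table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " name TEXT PRIMARY KEY,"
        " content BLOB NOT NULL,"
        " processed INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def add_file(conn: sqlite3.Connection, name: str, content: bytes) -> None:
    # INSERT OR IGNORE leaves an existing row (and its processed flag) untouched.
    conn.execute(
        "INSERT OR IGNORE INTO files (name, content) VALUES (?, ?)",
        (name, content),
    )
    conn.commit()


def process_pending(conn: sqlite3.Connection, handler) -> int:
    """Run handler on every unprocessed file and flag it; return the count."""
    # fetchall() loads all pending rows at once; fine for a sketch,
    # but a real library might fetch in batches.
    rows = conn.execute(
        "SELECT name, content FROM files WHERE processed = 0"
    ).fetchall()
    for name, content in rows:
        handler(name, content)
        conn.execute("UPDATE files SET processed = 1 WHERE name = ?", (name,))
    conn.commit()
    return len(rows)
```

Because the flag lives next to the data, finding the unprocessed files is a single indexed query rather than a directory walk.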
Since you are storing the files' processing info in the DB, get the last processing time from the DB in a single query and process all the files that were created after that timestamp.
For filtering files by timestamp, see "How to read a file from directory sorting date modified in Node JS".
And if you can control the directory structure, then partition your files by datetime and other primary/secondary keys.
How about option 5: based on time? If you know the last time you processed the directory was at timestamp x, then the next go around you can skip all files older than x just by looking at the file stats. Then from this smaller subset you can use hashes to look for clashes.
Edit: Seems arpit and I were typing the same general idea at the same time. Note though that the sorting method in the link he included will iterate over all 10k files 3 times. You don't need to sort anything, you just need to iterate through once and process the ones that fit the bill.
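The single pass described above could be sketched like this in Python (the question is about Node.js, so treat this purely as an illustration of the idea; `files_modified_since` is a made-up name). One caveat that is an assumption about your setup: files copied in with their original timestamps preserved can carry an mtime older than the last scan and would be skipped.

```python
import os
from typing import List


def files_modified_since(root: str, last_scan: float) -> List[str]:
    """Return paths under root whose mtime is newer than last_scan.

    last_scan is a Unix timestamp, e.g. the value stored in the DB
    after the previous scan finished.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # One stat call per file; no sorting and no hashing needed here.
            if os.stat(path).st_mtime > last_scan:
                changed.append(path)
    return changed
```

Only the returned subset then needs the more expensive hash or database check.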
I have a text file which is written to by a python program, and then read in by another program for display on a web browser. Currently it is read in by JavaScript, but I will probably move this functionality to python, and have the results passed into javascript using an ajax Request.
The file is irregularly updated every now and then, sometimes appending one line, sometimes as many as ten. I then need to get an updated copy of the file to javascript for display in the web browser. The file may grow to as large as 100,000 lines. New data is always added to the end of the file.
As it is currently written, javascript checks the length of the file once per second, and if the file is longer than it was last time it was read in, it reads it in again, starting from the beginning. This quickly becomes unwieldy for files of 10,000+ lines, doubly so since the program may sometimes need to update the file every single second.
What is the fastest/most efficient way to get the data displayed to the front end in javascript?
I am thinking I could:
Keep track of how many lines the file was before, and only read in from that point in the file next time.
Have one program pass the data directly to the other without it reading an intermediate file (although the file must still be written to as a permanent log for later access)
Are there specific benefits/problems with each of these approaches? How would I best implement them?
For Approach #1, I would rather not do file.next() 15,000 times in a for loop to get to where I want to start reading the file. Is there a better way?
For Approach #2, Since I need to write to the file no matter what, am I saving much processing time by not reading it too?
Perhaps there are other approaches I have not considered?
Summary: The program needs to display in a web browser data from python that is constantly being updated and may grow as long as 100k lines. Since I am checking for updates every second, it needs to be efficient, just in case it has to do a lot of updates in a row.
The function you seek is seek. From the docs:
f.seek(offset, from_what)
The position is computed from adding offset to a reference point; the reference point is selected by the from_what argument. A from_what value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. from_what can be omitted and defaults to 0, using the beginning of the file as the reference point.
Limitation for Python 3:
In text files (those opened without a b in the mode string), only seeks relative to the beginning of the file are allowed (the exception being seeking to the very file end with seek(0, 2)) and the only valid offset values are those returned from the f.tell(), or zero. Any other offset value produces undefined behaviour.
Note that seeking to a specific line is tricky, since lines can be variable length. Instead, take note of the current position in the file (f.tell()), and seek back to that.
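A small sketch of the tell-and-seek-back idea for a file opened in text mode; the function name `read_new_text` and the UTF-8 default are assumptions. It uses `read()` rather than iterating over lines, because in Python 3 `tell()` is disabled while a text file is being iterated with `next()`.

```python
def read_new_text(path: str, position: int = 0, encoding: str = "utf-8"):
    """Return (text added since position, new position).

    position must be 0 or a value previously returned by this function,
    since text-mode seek only accepts offsets that came from tell().
    """
    with open(path, "r", encoding=encoding) as f:
        f.seek(position)
        text = f.read()          # everything from the saved position to the end
        return text, f.tell()    # remember where we stopped for next time
```

The caller keeps the returned position (in memory, for a long-running process) and passes it back in on the next poll.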
Opening a large file and reading the last part is simple and quick: Open the file, seek to a suitable point near the end, read from there. But you need to know what you want to read. You can easily do it if you know how many bytes you want to read and display, so keeping track of the previous file size will work well without keeping the file open.
If you have recorded the previous size (in bytes), read the new content like this.
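with open("logfile.txt", "rb") as fp:  # the with block closes the file afterwards
    fp.seek(old_size, 0)  # Skip the bytes that were already read last time
    new_content = fp.read()  # Read everything past the current point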
On Python 3, this will read bytes which must be converted to str. If the file's encoding is latin1, it would go like this:
new_content = str(new_content, encoding="latin1")
print(new_content)
You should then update old_size and save the value in persistent storage for the next round. You don't say how you record context, so I won't suggest a way.
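For completeness, here is one possible way to keep old_size between runs. It is only an assumption, since the question does not say how state is kept: the offset is stored in a small JSON file next to the log, and `read_appended` and `state_path` are hypothetical names. The latin1 decoding follows the example above.

```python
import json
import os


def read_appended(log_path: str, state_path: str) -> str:
    """Return the text appended to log_path since the previous call."""
    old_size = 0
    if os.path.exists(state_path):
        with open(state_path, "r", encoding="utf-8") as state_file:
            old_size = json.load(state_file)["old_size"]

    if os.path.getsize(log_path) < old_size:
        old_size = 0  # the log shrank (truncated or replaced): start over

    with open(log_path, "rb") as fp:
        fp.seek(old_size, 0)
        new_bytes = fp.read()

    # Persist the new offset for the next round.
    with open(state_path, "w", encoding="utf-8") as state_file:
        json.dump({"old_size": old_size + len(new_bytes)}, state_file)

    return new_bytes.decode("latin1")
```

Each call costs one seek plus a read of only the new bytes, regardless of how long the file has grown.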
If you can keep the file open continuously in a server process, go ahead and do it the tail -f way, as in the question that @MarcJ linked to.
Let me explain: I am building a node.js project that needs to check if dates match or fall within a range. If there is a match, I need to store a reference to the path of a file on the server. There are about 80 of these files. They are configurations.
I can write a giant condition in a function that can run through and check the dates. It will be fast, I'm sure. The real question is, is it smarter to let each config file store its own date (the date is a calculation based on a date that will have to be passed in to the config file), then loop through the files, requiring each one, finding the property holding the date, checking it, then either storing the file's path or not?
The requiring approach will be much less code, and it will be cleaner, but I'm wondering if I will take a huge performance hit. Is it better to just write a giant list of conditions?
Sorry if this is not clear. Let me know if I need to include anything to help clarify the question.
the date is a calculation based on a date that will have to be passed in to the config file
Then don't put the calculation result in the config file as well, but store it in memory.
On startup, run through all 80 files (in parallel?), collect the dates and do the respective calculations. Store the results in an array or so.
On each request, write a loop (not a giant hand-written condition!) to find the date, and use the file path associated to it.
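The two steps above, sketched in Python rather than Node.js purely to show the shape of the idea. Everything here is hypothetical: each config is represented by a function that turns the passed-in date into a (start, end) range, and `build_index` and `find_config` are made-up names.

```python
from datetime import date, timedelta
from typing import Callable, Dict, List, Optional, Tuple

DateRange = Tuple[date, date, str]  # (start, end, config file path)


def build_index(
    configs: Dict[str, Callable[[date], Tuple[date, date]]], base: date
) -> List[DateRange]:
    """Run once at startup: compute every config's date range and keep it in memory."""
    return [(*compute(base), path) for path, compute in configs.items()]


def find_config(index: List[DateRange], day: date) -> Optional[str]:
    """Run on each request: a plain loop over the precomputed ranges."""
    for start, end, path in index:
        if start <= day <= end:
            return path
    return None


# Usage with two made-up configs whose ranges depend on the base date.
configs = {
    "configs/spring.js": lambda base: (base, base + timedelta(days=30)),
    "configs/summer.js": lambda base: (base + timedelta(days=31), base + timedelta(days=90)),
}
index = build_index(configs, date(2024, 3, 1))
```

With about 80 entries, the per-request loop is negligible, and the files are read only once.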
I’m starting a project that requires assembling files and linking them to around 140 lines in a pdf table of contents (one file per line). The thing is that I have to do this 80 times and it seems very tedious to do it manually each time.
The TOC already exists, and I would like each line item to link to a corresponding pdf. (The link is already there, it's just that the pdf doesn't yet exist in that location). The pdfs already exist elsewhere and have predictable file paths (that change with project number).
If I have a list of files corresponding to each line item and their corresponding path, is it possible to write a script that copies the file in its original location and places the copy in the new location? (If so, how, and what is the best language to use?)
Thanks!
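You can do it in Java; you may need a library such as Apache Commons IO:
import java.io.File;
import org.apache.commons.io.FileUtils;

// source and target are the path strings for one line item
File sourceFile = new File(source);
File targetFile = new File(target);
FileUtils.copyFile(sourceFile, targetFile);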
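The same copy step can also be scripted with nothing but Python's standard library. This is a sketch under the assumption that you already have the list of (source, target) path pairs; `copy_files` is a hypothetical helper name.

```python
import shutil
from pathlib import Path
from typing import Iterable, List, Tuple


def copy_files(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Copy each source to its target; return the sources that were missing."""
    missing = []
    for source, target in pairs:
        source_path, target_path = Path(source), Path(target)
        if not source_path.is_file():
            missing.append(source)  # report it instead of aborting the whole run
            continue
        # Create the destination folder if it does not exist yet.
        target_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_path, target_path)  # copy2 also tries to keep timestamps
    return missing
```

Since the paths are predictable and only the project number changes, the list of pairs could be generated in a loop and the function called once per project.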