Read file from last document read - JSON stream - javascript

I'm receiving a stream from a server that is growing exponentially, and I need to check every minute for new data, process that data, and ask for more the next minute.
The data consists of JSON documents; I receive on average ~600-700 documents per minute.
For performance reasons, I have to avoid re-reading documents that have already been processed.
Is it possible to read only the data received in the last minute?

You can use a circular buffer and push the incoming data into it from a listener.
For example, it could hold the last N documents or chunks; the exact choice depends on your application's code.
This way older data is discarded by design, and you don't have to deal with the stream's internals or with poorly designed workarounds.
It comes down to choosing the right size for the buffer, which looks to me like a far easier problem.
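As a rough sketch in JavaScript (the stream's "data" event and the watchStream/processDocs names are assumptions, not something from the question):

function watchStream(stream, processDocs, maxDocs = 1000) {
  const buffer = new Array(maxDocs); // ring holding the most recent documents
  let head = 0;                      // next write position
  let count = 0;                     // documents not yet processed

  stream.on("data", (doc) => {
    buffer[head] = doc;
    head = (head + 1) % maxDocs;
    if (count < maxDocs) count += 1; // once full, the oldest slot gets overwritten
  });

  // Every minute, hand over only what arrived since the last run.
  setInterval(() => {
    const fresh = [];
    for (let i = count; i > 0; i--) {
      fresh.push(buffer[(head - i + maxDocs) % maxDocs]);
    }
    count = 0;
    processDocs(fresh);
  }, 60 * 1000);
}

With ~600-700 documents per minute, a buffer of 1000 slots leaves some headroom; tune maxDocs to your actual traffic.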

Related

Use web worker to stringify

So I have an app that needs to JSON.stringify its data to put into localStorage, but as the data gets larger, this operation gets outrageously expensive.
So I tried moving this onto a web worker so it's off the main thread, but I'm now learning that posting an object to a web worker is even more expensive than stringifying it.
So I guess I'm asking, is there any way whatsoever to get JSON.stringify off the main thread, or at least make it less expensive?
I'm familiar with fast-json-stringify, but I don't think I can feasibly provide a complete schema every time...
You have correctly observed that passing an object to a web worker costs about as much as serializing it. This is because web workers also need to receive serialized data, not native JS objects: object instances are bound to the JS thread they were created in.
The generic solution applies to many programming problems: choose the right data structures when working with large datasets. When the data gets larger, it's better to sacrifice simplicity of access for performance. So do either of the following:
Store data in IndexedDB
If your large object contains lists of the same kind of entry, use IndexedDB for reading and writing, and you won't need to worry about serialization at all. This will require refactoring your code, but it is the correct solution for large datasets.
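For illustration, a minimal IndexedDB sketch, assuming the entries share an id property (the database and store names here are made up):

function openDb() {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("app-db", 1);
    req.onupgradeneeded = () => req.result.createObjectStore("entries", { keyPath: "id" });
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function saveEntries(entries) {
  const db = await openDb();
  return new Promise((resolve, reject) => {
    const tx = db.transaction("entries", "readwrite");
    const store = tx.objectStore("entries");
    for (const entry of entries) store.put(entry); // structured clone, no JSON.stringify needed
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}

Reads work the same way through a "readonly" transaction, so only the entries you actually need ever leave the store.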
Store data in ArrayBuffer
If your data is mostly fixed-size values, use an ArrayBuffer. An ArrayBuffer can be copied or transferred to a web worker almost instantly, and if your entries are all the same size, serialization can be done in parallel. For access, you can write simple wrapper classes that translate the binary data into something more readable.
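As a rough sketch of the transfer, assuming each entry is a pair of doubles (the record layout and the worker.js file name are placeholders):

const records = [[1.5, 1000], [3.25, 2000]];        // e.g. [value, timestamp] pairs
const packed = new Float64Array(records.length * 2);
records.forEach(([value, ts], i) => {
  packed[i * 2] = value;
  packed[i * 2 + 1] = ts;
});

const worker = new Worker("worker.js");
// Listing the buffer as a transferable moves it instead of copying it.
worker.postMessage(packed.buffer, [packed.buffer]);

After the transfer the buffer is detached on the main thread, which is exactly what keeps the hand-off cheap.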

Repeatedly re-writing JSON and getting undefined due to re-write

I am currently generating some data in Python and saving it to a JSON file. The data grows quite quickly, although not to the point where its size will be an issue, and it is also regenerated fairly fast. I am then reading this data into JavaScript at a very fast rate. The issue I am having is that, most of the time, when I request the JSON data the file returns undefined, because I rewrite the JSON file on the Python side each time, which leaves the data undefined while the write is in progress. The solution I am looking for would leave the file undefined for far less time; I could potentially switch to other methods for transferring data between Python and JavaScript if any spring to mind.

PHP - file_get_contents() reading chunks gets slower with time

I'm running some tests to store large files locally with the IndexedDB API, and I'm using PHP with JSON (and AJAX on the JavaScript side) to receive the file data.
Actually, I'm trying to get some videos and, to do so, I use the following PHP code:
$content_ar['content'] = base64_encode(file_get_contents("../video_src.mp4", false, NULL, $bytes_from, $package_size));
return json_encode($content_ar);
I know base64_encode will deliver about 1/3 more data than the original, but that's not the problem right now, as it's the only way I know to retrieve binary data without losing it along the way.
As you can see, I specify which byte it has to start reading from and how many bytes I want to retrieve. So, on my JS side, I know how much of the file I have already stored, and I ask the script to give me the bytes from actual_size to actual_size + $package_size.
What I'm already seeing is that the script seems to run more slowly as time goes by, and also depending on the file size. I'm trying to understand what happens there.
I've read that file_get_contents() stores the file contents in memory, so with big files it could be a problem (that's why I'm reading it in chunks).
But seeing that it gets slower with big files (and over time), could it be that it's still loading the whole file into memory and then delivering the chunk I ask for? That is, does it load everything and then return the part I request?
Or is it just reading everything up to $bytes_from + $package_size (which would explain why it gets slower over time, as that offset increases)?
If either of the above is the case, is there any way to make it run more efficiently and improve performance? Maybe I have to do some operations before or after the call to free memory?
EDIT:
I've made a screenshot showing the difference (in ms) between the moment I make the call to get the file bytes I need and the exact moment I receive the AJAX response (before I do anything with the received data, so JavaScript has no impact on the measurement). Here it is:
As you can see, it's increasing with every call.
I think the problem is the time it spends getting to the initial byte I need. It does not load the whole file into memory, but it is slow in reaching the first byte to read, so as the starting point increases, it takes more time.
EDIT 2:
Could it have something to do with the fact that I'm JSON-encoding the base64 content? I've been running some performance tests, and I've seen that setting $content_ar['content'] = strlen(base64_encode(file...)) takes much less time (when, theoretically, it's doing the same work).
However, if that's the case, I still can't understand why the slowness increases over time. Encoding the same number of bytes should take the same amount of time, shouldn't it?
Thank you so much for your help!
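For reference, the JavaScript side described in the question might look roughly like this (the get_chunk.php endpoint and its query parameters are placeholders, not the asker's actual code):

const PACKAGE_SIZE = 1024 * 1024; // must match $package_size on the PHP side

async function fetchChunk(bytesFrom) {
  const res = await fetch(`get_chunk.php?from=${bytesFrom}&size=${PACKAGE_SIZE}`);
  const json = await res.json();
  return json.content; // base64-encoded slice of the file
}

async function downloadAll(totalSize) {
  for (let offset = 0; offset < totalSize; offset += PACKAGE_SIZE) {
    const chunk = await fetchChunk(offset);
    // decode the base64 and write it into IndexedDB here
  }
}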

Fastest way to read in very long file starting on arbitrary line X

I have a text file which is written to by a Python program and then read by another program for display in a web browser. Currently it is read in by JavaScript, but I will probably move this functionality to Python and have the results passed into JavaScript using an AJAX request.
The file is updated irregularly every now and then, sometimes appending one line, sometimes as many as ten. I then need to get an updated copy of the file to JavaScript for display in the web browser. The file may grow as large as 100,000 lines. New data is always appended to the end of the file.
As it is currently written, the JavaScript checks the length of the file once per second, and if the file is longer than it was the last time it was read in, it reads it in again, starting from the beginning. This quickly becomes unwieldy for files of 10,000+ lines, doubly so since the program may sometimes need to update the file every single second.
What is the fastest/most efficient way to get the data displayed to the front end in javascript?
I am thinking I could:
Keep track of how many lines the file was before, and only read in from that point in the file next time.
Have one program pass the data directly to the other without it reading an intermediate file (although the file must still be written to as a permanent log for later access)
Are there specific benefits/problems with each of these approaches? How would I best implement them?
For Approach #1, I would rather not call file.next() 15,000 times in a for loop to get to where I want to start reading the file. Is there a better way?
For Approach #2: since I need to write to the file no matter what, am I saving much processing time by not reading it too?
Perhaps there are other approaches I have not considered?
Summary: The program needs to display, in a web browser, data from Python that is constantly being updated and may grow to as many as 100k lines. Since I am checking for updates every second, it needs to be efficient, just in case it has to do a lot of updates in a row.
The function you seek is seek. From the docs:
f.seek(offset, from_what)
The position is computed from adding offset to a reference point; the reference point is selected by the from_what argument. A from_what value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. from_what can be omitted and defaults to 0, using the beginning of the file as the reference point.
Limitation for Python 3:
In text files (those opened without a b in the mode string), only seeks relative to the beginning of the file are allowed (the exception being seeking to the very file end with seek(0, 2)) and the only valid offset values are those returned from the f.tell(), or zero. Any other offset value produces undefined behaviour.
Note that seeking to a specific line is tricky, since lines can be variable length. Instead, take note of the current position in the file (f.tell()), and seek back to that.
Opening a large file and reading the last part is simple and quick: Open the file, seek to a suitable point near the end, read from there. But you need to know what you want to read. You can easily do it if you know how many bytes you want to read and display, so keeping track of the previous file size will work well without keeping the file open.
If you have recorded the previous size (in bytes), read the new content like this.
fp = open("logfile.txt", "rb")
fp.seek(old_size, 0)
new_content = fp.read() # Read everything past the current point
On Python 3, this will read bytes which must be converted to str. If the file's encoding is latin1, it would go like this:
new_content = str(new_content, encoding="latin1")
print(new_content)
You should then update old_size and save the value in persistent storage for the next round. You don't say how you record context, so I won't suggest a way.
If you can keep the file open continuously in a server process, go ahead and do it the tail -f way, as in the question that @MarcJ linked to.

In node.js, should a common set of data be in the database or a class?

In my database I have a table with data for cities. It includes the city name, a translation of the name (it's a bilingual website), and latitude/longitude. This data will not change, and every user will need to load it. There are about 300 rows.
I'm trying to keep the load on the server as low as possible (at least to a reasonable extent), but I'd also prefer to keep this data in the database. Would it be best to have this data inside a class that is loaded in my main app.js file? It should be kept in memory and global to all users, correct? Or would it be better on the server to keep it in the database and select the data whenever a user needs it? The sign-in screen has the listing of cities, so it would be loaded often.
I've also seen that, unlike with PHP, many Node.js servers don't have tons of memory, even the ones that aren't exactly cheap, so I'm worried about putting unnecessary things into memory.
I decided to give this a try. Using an example dataset consisting of 300 rows (each containing 24 string characters, two doubles, and the property names), a small Node.js script indicated an additional memory usage of 80 to 100 KB.
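A sketch of how such a measurement can be done (the field names and sample values are made up to match the description above; this is not the original script):

const before = process.memoryUsage().heapUsed;

const cities = [];
for (let i = 0; i < 300; i++) {
  cities.push({
    name: "x".repeat(12),            // 24 string characters in total per row
    nameTranslated: "y".repeat(12),
    lat: 52.5 + i / 1000,            // two doubles
    lng: 13.4 + i / 1000,
  });
}

if (global.gc) global.gc();          // run with node --expose-gc for a cleaner number
const after = process.memoryUsage().heapUsed;
console.log(`~${Math.round((after - before) / 1024)} KB for ${cities.length} rows`);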
You should ask yourself:
How often will the data be used? How much of the data does a user need?
If the whole dataset will be used on a regular basis (say, multiple times a second), you should consider keeping the data in memory. If, however, your users only need part of the data, and only occasionally, you might consider loading the data from the database.
Can I guarantee efficient loading from the database?
An important point is that loading parts of the data from a database might even require more memory, because the V8 garbage collector may delay collection of the loaded data, so the same data (or several parts of it) might be in memory at the same time. There is also a guaranteed overhead from the database connection and so on.
Is my approach sustainable?
Finally, consider the possibility of data growth. If you expect the dataset to grow by a non-trivial amount, think about the above points again and decide whether that growth is likely enough to justify using the database.
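If you do keep the data in memory, a minimal sketch could look like this (the db.query call and the column names are assumptions about your database layer):

// cities.js - load the ~300 rows once, then serve them from memory.
let citiesCache = null;

async function getCities(db) {
  if (!citiesCache) {
    // Only the first call hits the database; later calls reuse the cached array.
    citiesCache = await db.query(
      "SELECT name, name_translated, lat, lng FROM cities"
    );
  }
  return citiesCache;
}

module.exports = { getCities };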
