Finding changes in MongoDB database - javascript

I'm designing a MongoDB database that works with a script that periodically polls a resource and gets back a response, which is stored in the database. Right now my database has one collection with four fields: id, name, timestamp, and data.
I need to be able to find out which names had changes in the data field between script runs, and which did not.
In pseudocode,
if (data[name][timestamp] == data[name][timestamp + 1]) {
    // data has not changed
    store data in collection 1
} else {
    // data has changed between script runs for this name
    store data in collection 2
}
Is there a query that can do this without iterating and running javascript over each item in the collection? There are millions of documents, so this would be pretty slow.
Should I create a new collection named timestamp for every time the script runs? Would that make it faster/more organized? Is there a better schema that could be used?
The script runs once a day so I won't run into a namespace limitation any time soon.

OK, this is a neat question because the short answer is basically: you will have to iterate and run JavaScript over each item.
The part where this gets "neat" is that it isn't really different from what an SQL solution would have to do. You're basically joining a table to itself, matching each row to the row with the same name and the next timestamp. Even if a relational DB can handle such a beast, it's definitely not going to be fast with millions of entries.
So the truth is, you're doing this the right way. Here are the extra details I would use to make this cleaner.
Ensure that you have an index on Name/Timestamp.
Run a db.mycollection.find().forEach() across the data set.
For each entry you're going to: a) perform the comparison, b) save to the appropriate collection, and c) update a flag indicating that this record has been processed.
On future loads you should be able to add a query to your find: db.mycollection.find({flag:{$exists:false}}).forEach()
Use db.eval() to help with speed.
The reason for the "Name/Timestamp" index is that you're going to be looking up each "successor" by "Name/Timestamp", so you want to be quick here.
The reason for the "processed" flag is that you should never have to re-run the same item. If given timestamp 'n' you find 'n+1', then that's the only 'n+1' you're going to have.
Honestly, if you're only running this once a day, it's quite likely that the speed will be just fine, especially if you're only running on new records. Just assume that it's going to take several minutes.
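For concreteness, here is a rough sketch of that pass in the legacy mongo shell. It assumes the collection is called mycollection with fields name, timestamp, and data, uses the processed flag described above, and writes into two hypothetical destination collections, unchanged and changed; the JSON.stringify comparison is just a crude stand-in for whatever equality check your data actually needs.

db.mycollection.ensureIndex({ name: 1, timestamp: 1 });

db.mycollection.find({ flag: { $exists: false } }).forEach(function (doc) {
    // Find the "successor": the next entry for the same name.
    var cursor = db.mycollection.find({ name: doc.name, timestamp: { $gt: doc.timestamp } })
                                .sort({ timestamp: 1 })
                                .limit(1);
    if (!cursor.hasNext()) return;          // no successor yet; leave it for the next run

    var next = cursor.next();
    if (JSON.stringify(doc.data) === JSON.stringify(next.data)) {
        db.unchanged.insert(next);          // data has not changed between runs
    } else {
        db.changed.insert(next);            // data has changed between runs
    }

    // Mark this record so it is never compared again.
    db.mycollection.update({ _id: doc._id }, { $set: { flag: true } });
});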

Related

Node.js: can I save an array in my backend, and use it when I need it?

I have a small website; when you visit it, it shows you a quote.
Until today, what I was doing was this: when a user visited my website, a random quote was fetched directly from the database (by "directly" I mean a connection was made to the database and a quote was returned from it), but sometimes that took 1 or 2 seconds. Today I did something different: when my Node.js application starts, I grab every quote in the database and store them in an array. When someone visits my website, I randomly choose a quote from the array, and it is so much faster compared to the first way of doing it. I also made some changes so that when I add a new quote to the database, the array is automatically updated.
So here is my question: is it bad to store data inside an array and serve users from it?
There will be a few different answers depending on your intentions. First of all, if the set of quotes is large, I assure you it is a very bad idea; but if you are talking about a few items, it's acceptable. However, if you are building a scalable application, it's not really recommended, because you will keep a copy of the dataset in every node, etc.
If you want very fast quote storage, I would recommend Redis (an in-memory key-value store). It shares state across nodes, which means all your nodes connect to Redis and the quotes are kept there, so you do not need to keep copies and it stays fast. Also, if you enable the disk persistence option, you can use Redis as your primary quote storage. In the end, you won't update these quotes much and they won't be searched with complex queries.
Alternatively, if your database is MySQL, PostgreSQL, or MongoDB, you can enable an in-memory storage option so that you don't need to keep that data in your array but can take it directly from the DB, which is much faster and still queryable.
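If you do try the Redis route, a minimal sketch with the node-redis v4 client might look like this; quotes:all is an arbitrary key name and the seed data is made up.

// Minimal sketch, assuming the node-redis v4 client (promise API) and Redis on localhost.
import { createClient } from 'redis';

const client = createClient();                     // defaults to redis://localhost:6379
await client.connect();

// Seed the set once (or whenever a quote is added); every Node instance shares it.
await client.sAdd('quotes:all', ['First quote', 'Second quote', 'Third quote']);

// On each request, let Redis pick one at random; no copy of the whole
// dataset has to live inside any single Node process.
const quote = await client.sRandMember('quotes:all');
console.log(quote);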
There's the old joke: The two hard things in software engineering are naming things, caching things, and off-by-one errors.
You're caching something: your array of strings. Then you select one at random from the array each time you need one.
What goes right? You get your text string from memory, and eliminate the time delay involved in getting it from the database. It's a good optimization.
What can go wrong?
Somebody can add or remove strings from your database, which makes your cache stale.
You can have so many text strings you blow out your nodejs RAM. This seems unlikely; it's hard to imagine a list of quotes that big. The Hebrew Bible, the New Testament, and the Qur'an together comprise less than a million words. You probably won't have more text in your quotable-quotes than that. 10-20 megabytes of RAM is nothing these days.
So, what about your stale cache in RAM? What to do?
You could ignore the problem. Who cares if the cache is stale?
You could reread the cache every so often.
Your use of RAM for this is a good optimization. But, it adds a cache to your application. A cache adds complexity, and the potential for a bug. Is the optimization worth the trouble? Only you can guess the answer to that question.
And, it's MUCH MUCH better than doing SELECT ... ORDER BY RAND() LIMIT 1; every time you need something random. That is a notorious query-performance antipattern.
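If you stick with the in-memory array, the "reread the cache every so often" option from above is only a few lines. A minimal sketch, where loadQuotes() is a placeholder for however you actually query your database and the 10-minute interval is an arbitrary choice:

// In-process cache that is warmed at startup and periodically refreshed.
let quotes = [];

async function refreshQuotes() {
  try {
    quotes = await loadQuotes();                 // placeholder: e.g. SELECT text FROM quotes
  } catch (err) {
    console.error('quote refresh failed, keeping the stale cache', err);
  }
}

function randomQuote() {
  return quotes[Math.floor(Math.random() * quotes.length)];
}

await refreshQuotes();                           // warm the cache at startup
setInterval(refreshQuotes, 10 * 60 * 1000);      // reread every 10 minutes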

Dynamically creating tables with IndexedDB

On my web app, the user can request different data lines. Each data line has a unique "statusID", say "18133". When the user requests to load the data, it is either loaded from the server or from IndexedDB (that's the part I'm trying to figure out). In order to make it as fast as possible, I want the index to be the timestamp of the data, as I will request ranges which are smaller than the actual data in IndexedDB. However, I am trying to figure out how to create the stores and store the data properly. I tried to dynamically create a store every time data with a new ID is requested, but creating stores is only possible in "onupgradeneeded". I also thought about storing everything in the same store, but I fear the performance would be bad. I do not really know how to approach this.
What I do know: if you index a value, it means that the data is sorted, which is exactly what I want. I don't know if the following is possible, but it would solve my issue too: store everything in the same store, index by "statusID" and index by "timestamp". This way it would be fast too, I guess.
Note that I am talking about many, many data points, possibly in the millions.
You can index on multiple values, which lets you get everything for a statusID while restricting the timestamp to a range. So I'd go with the single object store solution (see the sketch below); performance should not be an issue.
This earlier post may be helpful: Javascript: Searching indexeddb using multiple indexes
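For what it's worth, here is a rough sketch of the single-store approach with a compound index; the names readingsDB, readings, and status_time are made up, and 18133 is just the statusID from the question.

// One object store, with a compound index on [statusID, timestamp].
const request = indexedDB.open('readingsDB', 1);

request.onupgradeneeded = (event) => {
  const db = event.target.result;
  const store = db.createObjectStore('readings', { autoIncrement: true });
  // The compound index lets us ask for a single statusID restricted to a timestamp range.
  store.createIndex('status_time', ['statusID', 'timestamp']);
};

request.onsuccess = (event) => {
  const db = event.target.result;
  const index = db.transaction('readings')
                  .objectStore('readings')
                  .index('status_time');

  // All entries for statusID 18133 with timestamps between t1 and t2.
  const t1 = Date.UTC(2023, 0, 1);
  const t2 = Date.UTC(2023, 1, 1);
  const range = IDBKeyRange.bound([18133, t1], [18133, t2]);
  index.getAll(range).onsuccess = (e) => console.log(e.target.result);
};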

How to do a bulk insert while avoiding duplicates in Postgresql

I'm working in nodejs, hosted at Heroku (free plan so far).
I get the data from elsewhere automatically (this part works fine and I get JSON or CSV), and my goal is to add it to a PostgreSQL DB.
While I'm new to DB management and PostgreSQL, I've done my research before posting this. I'm aware that the COPY command exists, and I know how to INSERT multiple rows without duplicates. But my problem is a mix of both (plus another difficulty).
I hope my question is not breaking the rules.
Short version, I need to:
Add lots of data at once
Never create duplicates
Rename columns between the source data and my table
Long version with details:
The data I collect come from multiple sources (2 for now, but that will grow) and are quite big (>1000 rows).
I also need to remap the column names to one unified system. What might be called "firstDay" in one source is called "dateBegin" in another, and I want them to be called "startDate" in my table.
If I'm using INSERT, I take care of this myself (in JS) while constructing the query. But maybe COPY could do it in a better way. Also, INSERT seems to have a limit on how much data you can push at one time, so I would need to split my query into several parts and maybe use callbacks or promises to avoid drowning the DB.
And finally, I will update this DB regularly and automatically, and there will be a lot of duplicates. Fortunately, every piece of data has a unique id, and I have made the column that stores this id the PRIMARY KEY of the table. I thought that might eliminate any problem with duplicates, but I may be wrong.
My first version was very ugly (a for loop making a new query on every iteration) and didn't work. I was thinking about doing 1000 rows at a time recursively, waiting for a callback before sending another batch. It seems clunky and time-expensive to do it that way. COPY seems perfect if I can select/rename/remap columns and avoid duplicates. I've read the documentation and I don't see a way to do that.
Thank you very much, any help is welcome. I'm still learning so please be kind.
I have done this before using temporary tables to "stage" your data and then do an INSERT SELECT to move the data from staging to your production table.
For populating your staging table you can use bulk INSERTs or COPY.
For example,
BEGIN;

CREATE TEMPORARY TABLE staging_my_table ( /* your columns etc. */ );

-- Now that you have your staging table you can bulk INSERT or COPY
-- into it from your code, e.g.,
INSERT INTO staging_my_table (blah, bloo, firstDay) VALUES (1, 2, 3), (4, 5, 6); -- etc.

-- Now you can do an INSERT into your live table from your staging table, e.g.,
INSERT INTO my_table (blah, bloo, startDate)
SELECT blah, bloo, firstDay
FROM staging_my_table staging
WHERE NOT EXISTS (
    SELECT 1
    FROM my_table
    WHERE staging.bloo = my_table.bloo
);

COMMIT;
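From Node, driving that staging transaction with the node-postgres (pg) client could look roughly like this; the table and column names follow the SQL above, the column types are guesses, and the marked line is where you would remap "firstDay"/"dateBegin" into your unified names.

// Rough sketch with the pg client; connection settings come from the usual PG* env vars.
const { Client } = require('pg');

async function loadBatch(rows) {
  const client = new Client();
  await client.connect();
  try {
    await client.query('BEGIN');
    await client.query(
      'CREATE TEMPORARY TABLE staging_my_table (blah int, bloo int, firstDay date) ON COMMIT DROP'
    );
    // One parameterized INSERT per row keeps the sketch simple;
    // COPY or a multi-row VALUES list would be faster for big batches.
    for (const r of rows) {
      await client.query(
        'INSERT INTO staging_my_table (blah, bloo, firstDay) VALUES ($1, $2, $3)',
        [r.blah, r.bloo, r.firstDay]            // remap your source field names here
      );
    }
    await client.query(
      `INSERT INTO my_table (blah, bloo, startDate)
       SELECT blah, bloo, firstDay
       FROM staging_my_table staging
       WHERE NOT EXISTS (
         SELECT 1 FROM my_table WHERE staging.bloo = my_table.bloo
       )`
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    await client.end();
  }
}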
There are always exceptions, but this might just work for you.
Have a good one

Call SQL "function" (stored procedure?) every time a database column is selected

I am running MySQL 5.6. I have a number of various "name" columns in the database (in various tables). These get imported every year by each customer as a CSV data dump. There are a number of places that these names are displayed throughout this website. The issue is, the names have almost no formatting (and to this point, no sanitization existed upon importation):
Phil Eaton, PHIL EATON, Phil EATON, etc.
Thus, the website sometimes looks like a mess when these names are involved. I can think of a number of ways to handle this, but none that are particularly appealing.
First, I could filter in JavaScript. However, as I said, these names appear in a number of places throughout this (large) site, so I may end up missing a page. The names are not already wrapped in nice "name"-classed divs/spans, etc.
Second, I could filter in PHP (the backend). This seems about as effective as doing it in JavaScript. I could do it in the API, but there still isn't a central method for pulling names from the database, so I could still miss an API call anyway.
Finally, the obvious "best" way is to sanitize the existing data in place for each name column, and at the same time immediately start sanitizing all names that get imported each time we add a customer. The issue with the first part is that there are hundreds of millions of rows of names in the database. Updating these could take a long time and be disruptive to the clients' daily routines.
So, the most appealing way to correct this in the short term is to invoke a function every time a column is selected. In this way I could "decorate" every name column with a formatting function so the names appear uniform on the frontend. So ultimately, my question is: is it possible to invoke a specific function in SQL to format each row every time a specific column is selected? In other words, can I call a stored procedure every time a column is selected? (The point being, I'm trying to keep the formatting in SQL so the fix doesn't have to be propagated to every place the names are used.)
In MySQL you can't trigger something on SELECT, but I have an idea (it's only an idea; I don't have time to try it right now, sorry).
You can probably create a VIEW on this table, with the same structure, but with a stored function applied to the name fields, and select from this view in your PHP.
But it has two drawbacks:
You have to modify all your SELECT statements in your PHP code.
The server will always call that function. Maybe you can store the formatted values and check for them (cache them).
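A rough sketch of that VIEW idea; format_name() is a hypothetical stored function you would have to define once (MySQL has no built-in per-word proper-casing), and customers/id/name are made-up table and column names. It's shown with Node's mysql2 client purely to keep the snippet self-contained; the SQL itself is what matters and works the same from PHP.

// Create a view that applies the (hypothetical) formatting function to the name column.
const mysql = require('mysql2/promise');

async function createFormattedView() {
  const conn = await mysql.createConnection({ host: 'localhost', user: 'app', database: 'crm' });

  await conn.query(`
    CREATE OR REPLACE VIEW customers_formatted AS
    SELECT id, format_name(name) AS name, company, created_at
    FROM customers
  `);

  // The application then selects from the view instead of the raw table.
  const [rows] = await conn.query('SELECT id, name FROM customers_formatted LIMIT 10');
  console.log(rows);
  await conn.end();
}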
On the other hand, I agree with HLGEM: I also suggest formatting the data on import, because it's very bad practice to import data you don't check into a DB (SQL injection?). The batch task is also a good idea for cleaning up the mess.
I presume names are queried frequently, so invoking a sanitization function every time they are selected could severely slow down your system. Further, you can't just flip a simple setting to get this; you would have to change every bit of SQL code that runs and includes names.
Personally, how I would handle it is to fix the imports so they insert a sanitized version of new names. It is a bad idea to put any data directly into a database without some sort of staging and cleanup.
Then I would tackle the old names and fix them in batches in a nightly run scheduled for when the fewest people are using the system (a rough sketch follows this answer). You would have to do some testing on dev to determine how big a batch you can run without interfering with other things the database is doing. The larger the batch, the sooner you will get through all the names; even though this will take time, it is the surest method of getting the data cleaned up, and over time the data will look better to the users. If the design of your database allows you to identify which are the more active names (such as an is_active flag for a customer or an order in the last year), I would prioritize the update by that. Alternatively, you could clean up one client at a time, starting with whichever one has noticed the problem and is driving this change.
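A rough sketch of that nightly batch run, again using the hypothetical format_name() stored function and made-up table/column names; batching by primary-key range keeps each UPDATE short so it doesn't lock things up for long.

// Nightly clean-up of legacy names in primary-key-range batches.
const mysql = require('mysql2/promise');

async function cleanNamesNightly() {
  const conn = await mysql.createConnection({ host: 'localhost', user: 'app', database: 'crm' });
  const batchSize = 10000;                        // tune on dev until it doesn't disturb other work

  const [[{ maxId }]] = await conn.query('SELECT MAX(id) AS maxId FROM customers');
  for (let start = 1; start <= maxId; start += batchSize) {
    await conn.query(
      'UPDATE customers SET name = format_name(name) WHERE id BETWEEN ? AND ?',
      [start, start + batchSize - 1]
    );
  }
  await conn.end();
}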
Other answers already give some possible solutions. But the short answer to the specific option you are asking about is: No. There is no such thing as a "SELECT statement trigger", let alone for a single column. Triggers come close to this kind of expectation, but only for INSERT, UPDATE, and DELETE operations.

Best way to relate a variable and its ID

I am getting results from a database with a simple PHP while loop. One of the pieces of information is a number that links to another table where the value is stored. I can think of plenty of ways to get this information linked and display the text related to the value, but I want to know the fastest way to do it, as I have a huge set of results, so every bit of speed will make a difference. Is an array fastest? JavaScript? Any advice you can give me would be great.
The schema would look something like this
col_table
colID(autonumber) colName(str) colState(int) colDate(date)
state_table
stateID(int) stateType(str)
I want to select the correct state type by matching colState to a stateID and output the stateType, while preserving the stateID so I can edit the field and update the database using the number.
Using MySQL will be faster.
If you have to loop through your results in PHP and make a new MySQL request each time, your script will take longer.
You can increase speed in MySQL by creating the right kind/number of indexes and by choosing wisely what is stored in each field.
The later you parse content, the longer it will take. If you go for JS, you will have to read the DB, loop through it in PHP, do it again in JS, and make even more requests...
A join can be a good solution (a sketch follows below); a view can be even easier to work with. You can also consider caching results.
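A rough sketch of that join over the schema in the question; one round trip returns both colState (kept so you can still edit and update by the number) and the human-readable stateType. It's shown with Node's mysql2 client just to keep the snippet self-contained; the same SQL works from PHP's mysqli or PDO.

const mysql = require('mysql2/promise');

async function loadRows() {
  const conn = await mysql.createConnection({ host: 'localhost', user: 'app', database: 'mydb' });
  const [rows] = await conn.query(`
    SELECT c.colID, c.colName, c.colDate, c.colState, s.stateType
    FROM col_table c
    JOIN state_table s ON s.stateID = c.colState
  `);
  await conn.end();
  return rows;    // each row carries both the numeric state and its display text
}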
Use a timer in PHP and go by trial and error. Use the time returned by the timer to evaluate speed and efficiency.
You should prepare your data on the server side; it is faster.
Whether you do it in your server code or in the database with a fast query depends. If you have complex object graphs, then processing the results from the DB to create the associations would be time-consuming, so an ORM is the way to go; otherwise, as in your case with a simple join, I would simply retrieve all the data from the DB.
If you use PHP for rendering as well, then render it using PHP, not JS.
If you use JS for your UI, then prepare the data on the server side and publish it via a REST web service in JSON (i.e., using PHP's json_encode function), then retrieve it from JS and output it.
