How to do a bulk insert while avoiding duplicates in Postgresql

How to do a bulk insert while avoiding duplicates in Postgresql - javascript

I'm working in nodejs, hosted at Heroku (free plan so far).
I get the data from elsewhere automatically (this part work fine and I get JSON or CVS), and my goal is do add them into a Prostresql DB.
While, I'm new to DB mangement and Postgresql, I've made my research before posting this. I'm aware that the COPY command exist, and how to INSERT multiple data without duplicate. But my problem is a mix of both (plus another difficulty).
I hope my question is not breaking the rules.
Short version, I need to :
Add lots of data a once
Never create duplicate
Rename column name between source data and my table
Long version with details :
The data I collect are from multiples sources (2 for now but will get bigger) and are quite big (>1000).
I also need to remap the column name to one unified system. What could be called "firstDay" on one source is called "dateBegin" in another, and I want them to be called "startDate" in my table.
If I'm using INSERT, I take care of this myself (in JS) while constructing the query. But maybe COPY could do that in a better way. Also, INSERT seem to have a limit of data you can push in one time, and so I will need to divide my query multiple time and maybe use callback or promise to avoid drowning the DB.
And finally, I will update this DB regularly and automatically and they will be a lot of duplicate. Hopefully, every piece of data has an unique id, and I have made a column PRIMARY KEY in the table that store this id. I thought it may eliminate any problem with duplicate, but I may be wrong.
My first version was very ugly (for loop making a new query a every loop) and didn't work. I was thinking about doing 1000 data at a time in a recursive way waiting for callback before sending another batch. It seem clunky and time expensive to do it that way. COPY seem perfect if I can select/rename/remap columns and avoid duplicated. I've read the documentation and I don't see a way to do that.
Thank you very much, any help is welcome. I'm still learning so please be kind.

I have done this before using temporary tables to "stage" your data and then do an INSERT SELECT to move the data from staging to your production table.
For populating your staging table you can use bulk INSERTs or COPY.
For example,
BEGIN;
CREATE TEMPORARY TABLE staging_my_table ( // your columns etc );
// Now that you have your staging table you can bulk INSERT or COPY
// into it from your code, e.g.,
INSERT INTO staging_my_table (blah, bloo, firstDay) VALUES (1,2,3), (4,5,6), etc.
// Now you can do an INSERT into your live table from your staging, e.g.,
INSERT INTO my_table (blah, bloo, startDate)
SELECT cool, bloo, firstDay
FROM staging_my_table staging
WHERE NOT EXISTS (
SELECT 1
FROM mytable
WHERE staging.bloo = mytable.bloo
);
COMMIT;
There are always exceptions, but this might just work for you.
Have a good one

Related

Dynamically creating tables with indexedDb

On my web-app, the user can request different data lines. Those data lines have an unique "statusID" each, say "18133". When the user requests to load the data, it is either loaded from the server or from indexedDB(that part im trying to figure out). In order to make it as fast as possible, I want the index to be the timestamp of the data, as I will request ranges which are smaller than the actual data in the indexedDB. However, I am trying to figure out how to create the stores and store the data properly. I tried to dynamically create stores everytime data with a new Id is requested, but creating stores is only possible in "onupradeneeded". I also thought about storing everything in the same store, but I fear that the performance will make that bad. I do not really now how to approach this thing.
What I do know: If you index a value, it means that the data is sorted, which is exactly what I want. I dont know if the following is possible but this would solve my issue too: store everthing in the same store, index by "statusID" and index by "timestamp". This way, it would be fast too i guess.
Note that I am talking about many many datapoints, possible in the millions.

You can index by multiple values, allowing you to get all by statusID and restricting to a range for your timestamp. So I'd go with the one datastore solution. Performance should not be an issue.
This earlier post may be helpful: Javascript: Searching indexeddb using multiple indexes

Force serial mySQL insertions (or rebuild a table a better way)

I'm new to mySQL transactions and I'm working with some code that I inherited, but the question is more fundamental capability related than code specific.
I have a js array that I want to insert into a mySQL table. The
table size will never exceed 200 entries, most often fewer than 100.
Presently, I'm looping through the js array and sending individual
AJAX requests for each array element to an existing PHP file that
processes them as they arrive and does the sql INSERT call for each
element. async is set to 'false' in the js AJAX call.
Everything is working successfully – no INSERTS fail and no data is
lost so this isn't a code failure question – but when I later get the table
from the database and load it into an array, ordered by updated (added), the sequence of the entries is wrong, and that's an issue.
So my question to the mySQL experts is this: is there a simple handshake function I'm not seeing that will ensure that mySQL insertions are processed in order?
If the answer is no, I will need to add a sequence number to each request and add a column for it in the table. Also, if I'm going to make changes to the inherited code anyway, I'm open to pointers on what to look into for a more efficient solution.
Thanks.

As Paul Spiegel has said it's better to use an auto_increment if all you want is a repeatable sequence of events.
I am assuming that when you say 'order by updated (added)' then that is a DateTime field or some such on the table. If your inserts happen fast enough then there will be multiple rows on the same insert date/time (the field only goes to One Second resolution) and MySQL doesn't guarantee you will always get the rows back in the same order when that happens.

You could insert all your array at once:
INSERT statements that use VALUES syntax can insert multiple rows. To
do this, include multiple lists of comma-separated column values, with
lists enclosed within parentheses and separated by commas. Example:
INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);
Read more here.

Call SQL "function" (stored procedure?) every time a database column is selected

I am running MySQL 5.6. I have a number of various "name" columns in the database (in various tables). These get imported every year by each customer as a CSV data dump. There are a number of places that these names are displayed throughout this website. The issue is, the names have almost no formatting (and to this point, no sanitization existed upon importation):
Phil Eaton, PHIL EATON, Phil EATON, etc.
Thus, the website sometimes look like a mess when these names are involved. There are a number of ways that I can think to do this, but none that are particularly appealing.
First, I can have a filter in Javascript. However, as I said, these names exist in a number of places throughout this (large) site. I may end up missing a page. The names do not exist already within nice "name"-classed divs/spans, etc.
Second, I could filter in PHP (the backend). This seems about as effective as doing it in Javascript. I could do it on the API, but there was still not a central method for pulling names from the database. So I could still miss an API call anyway.
Finally, the obvious "best" way is to sanitize the existing data in place for each name column. Then at the same time, immediately start sanitizing all names that get imported each time we add a customer. The issue with the first part of this is that there are hundreds of millions of rows of names in the database. Updating these could take a long amount of time and be disruptive to the clients' daily routines.
So, the most appealing way to correct this in the short-term is to invoke a function every time a column is selected. In this way I could "decorate" every name column with a formatting function so the names will appear uniform on the frontend. So ultimately, my question is: is it possible to invoke a specific function in SQL to format each row every time a specific column is selected? In other words, maybe can I call a stored procedure every time a column is selected? (Point being, I'm trying to keep the formatting in SQL to avoid the propagation of usage.)

In MySQL you can't trigger something on SELECT, but I have an idea (it's only an idea, now I don't have time to try it, sorry).
You probably can create a VIEW on this table, with the same structure, but with the stored procedure applied to the names fields, and select from this view in your PHP.
But it has two backdraw:
You have to modify all your SELECT statements in your PHPs.
The server will always call that procedure. Maybe you can store the formatted values, then check for it (cache them).
On the other hand I agree with HLGEM, I also suggest to format the data on import, because it's a very bad practice to import something you don't check into a DB (SQL Injections?). The batch tasking is also a good idea to clean up the mess.

I presume names are called frequently so invoking a sanitization function every time they are called could severely slow down your system. Further, you can't just do a simple setting to get this, you would have to change every buit of SQL code that is run that includes names.
Personally how I would handle it is to fix the imports so they put in a sanitized version for new names. It is a bad idea to directly put any data into a database without some sort of staging and clean up.
Then I would tackle the old names and fix them in batches in a nightly run that is scheduled when the fewest people are using the system. You would have to do some testing on dev to determine how big a batch you could run without interfering with other things the database is doing. The alrger the batch the sooner you would get through all the names, but even though this will take time, it is the surest method of getting the data cleaned up and over time the data will appear better to the users. If the design of your datbase allows you to identify which are the more active names (such as an is_active flag for a customer or am order in the last year), I would prioritize the update by that. Alternatively, you could clean up one client at a time starting with whichever one has noticed the problem and is driving this change.

Other answers before give some possible solutions. But, the short answer for the specific option you are asking is : No. There is no such thing called a
"Select Statement Trigger", that too for a single column, although triggers come close for this kind of expectation, but only for Insert, Update and Delete operations.

How to make SELECT query work when the table schema is upgraded with a new column?

I'm working on a JavaScript project which has an existing database (and hence an existing schema).
To add a new feature I must add a new column to the table. This will store a property of objects of a class.
Currently there is a SELECT query in the project which queries a bunch of fields from the database table every time the application starts and then puts these obtained results into the various needed JavaScript objects.
So, now it looks like:
let FIELDS = "field1, field2, field3";
let query = "SELECT " + FIELDS + "FROM FOO_TABLE";
Somewhere else, this query is made whenever needed (usually after app restart).
I thought of changing it to:
let FIELDS = "field1, field2, field3, new_prop";
But it won't work as in the current table, such a table doesn't exist. (Maybe after next restart, things will work, but not the first time).
What is the workaround?
Also, please note that a silent change will be better than one that will show that this property is new to everyone who works on the file.

Well you can't just add the field into the query, that won't work, so you have 2 options.
Use exceptions/errors to work out if something goes wrong. I'm not an sqlite guru (in fact i've never used sqlite directly, always via API's as with ios dev.), but if you try to run a query containing a column name that is non-existent, something will error which you can capture, work out the error code/string and decide to re-run the query minus the bad column.
This is not a good idea, its dirty and expensive. I've used exceptions to save dupe check query's in the past, but even that isn't the correct way to do things... but it does work
The only "proper" way to do what you want is to first check if the column exists. In sqlite, i don't think there's a simply select command like there is in mysql. But you can use the PRAGMA table_info('table-name') statement to get all columns, then check if your column exists prioir to running the query. This is the correct way to do things.
Having said that, if your working in a collaborative/team dev environment, you should have a much cleaner way of upgrading everyone in the group. So i'd be more tempted to address that issue of procedure rather than code my way around it.

Finding changes in MongoDB database

I'm designing a MongoDB database that works with a script that periodically polls a resource and gets back a response which is stored in the database. Right now my database has one collection with four fields , id, name, timestamp and data.
I need to be able to find out which names had changes in the data field between script runs, and which did not.
In pseudocode,
if(data[name][timestamp]==data[name][timestamp+1]) //data has not changed
store data in collection 1
else //data has changed between script runs for this name
store data in collection 2
Is there a query that can do this without iterating and running javascript over each item in the collection? There are millions of documents, so this would be pretty slow.
Should I create a new collection named timestamp for every time the script runs? Would that make it faster/more organized? Is there a better schema that could be used?
The script runs once a day so I won't run into a namespace limitation any time soon.

OK, this is a neat question b/c the short is basically: you will have to iterate and run javascript over each item.
The part where this gets "neat" is that this isn't really different from what an SQL solution would have to do. I mean, you're basically joining a table to itself where x.1=x.1 and y.1=y.2. Even if the relational DB can handle such a beast, it's definitely not going to be fast with millions of entries.
So the truth is, you're doing this right way. Here are the extra details I would use to make this cleaner.
Ensure that you have an index on Name/Timestamp.
Run a db.mycollection.find().foreach() across the data set.
Foreach entry you're going to a) Perform comparison. b) Save appropriately. c) Update a flag indicating that this record has been processed.
On future loads you should be able to add a query to your find. db.mycollection.find({flag:{$exists:false}}).foreach()
Use db.eval() to help with speed.
The reason for the "Name/Timestamp" index is that you're going to be looking up each "successor" by "Name/Timestamp", so you want to be quick here.
The reason for the "processed" flag is that you should never have to re-run the same item. If given timestamp 'n' you find 'n+1', then that's the only 'n+1' you're going to have.
Honestly, if you're only running this once / day, it's quite likely that the speed will be just fine, especially if you only running on new records. Just assume that it's going to take several minutes.

Develop Reference

JavaScript is the programming language of the Web.