Is there any way to insert data in parallel from an external data source? I have multiple APIs/endpoints that provide similar datasets, all of which will be inserted into a database.
For example:
My current code loops through each API and saves its data to the database. My target behavior is the image above, ideally dynamic: I can add multiple endpoints and they are inserted in parallel when I call my insert function.
Yes, you can do this.
Before writing the code, set up a version of the MySQL API for Node that works with async/await (that is, a Promise-based API).
Then use a MySQL connection pool. You can limit the total number of connections in a pool, which is wise because too many connections can overwhelm your MySQL server.
const mysql = require('mysql2/promise')
const pool = mysql.createPool({
  host: 'host',
  user: 'redacted',
  database: 'redacted',
  waitForConnections: true,
  connectionLimit: 6,
  queueLimit: 0
})
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms))
}
Then write each API access operation as an async function with a loop in it. Something like this gets a connection to use, even for multiple sequential queries, for each API operation.
async function apiOne(pool) {
  while (true) {
    const result = await (api_operation)
    // borrow a connection from the pool for this iteration
    const connection = await pool.getConnection()
    const [rows1, fields1] = await connection.execute(whatever)
    const [rows2, fields2] = await connection.execute(whatever_else)
    // hand the connection back so other API loops can use it
    connection.release()
    await sleep(1000) // wait one second
  }
}
Do getConnection() inside the loop, not outside it. pool.getConnection() is very fast because it reuses existing connections. Doing it inside the loop lets your pool limit the number of simultaneous connections.
The sleep() function is optional of course. You can use it to control how fast the loop runs.
Write as many of these functions as you need. This is a good way to handle multiple APIs because the code for each one is isolated in its own function.
Finally, use Promise.all() to run all your async functions concurrently.
const concurrents = []
concurrents.push(apiOne(pool))
concurrents.push(apiTwo(pool))
concurrents.push(apiThree(pool))
Promise.all(concurrents).then() /* run all the ApiXxx functions */
Beware, this sample code is dangerously oversimplified. It lacks any error or exception handling, which you need in long-running code.
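As a minimal sketch of the kind of error handling the warning above refers to, the two helpers below make sure a connection is always released and that one failed iteration does not end the whole loop. The names withConnection and safeStep are hypothetical, not part of mysql2; the only pool calls assumed are getConnection() and release(), as used above.

```javascript
// Hypothetical helper: borrow a connection, run `work` with it,
// and release it even if `work` throws.
async function withConnection(pool, work) {
  const connection = await pool.getConnection()
  try {
    return await work(connection)
  } finally {
    connection.release()
  }
}

// Hypothetical helper: run one iteration of an API loop and log a
// failure instead of letting it reject the surrounding Promise.all().
async function safeStep(name, step) {
  try {
    await step()
    return true
  } catch (err) {
    console.error(`${name} failed:`, err.message)
    return false
  }
}
```

Inside a loop such as apiOne, the body could then become something like await safeStep('apiOne', () => withConnection(pool, async (connection) => { /* queries */ })), followed by the sleep() call.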
I would like to make 10,000 concurrent HTTP requests. I am currently doing it with Promise.all. However, I seem to be rate limited in some way: it takes around 15-30 minutes to complete all 10,000 requests. Is there something in axios or in Node's HTTP requests that is limiting me? How can I raise the limit if there is one?
const axios = require('axios');
function http_request(url) {
  return new Promise(async (resolve) => {
    await axios.get(url);
    // -- DO STUFF
    resolve();
  });
}
async function many_requests(num_requests) {
  let all_promises = [];
  for (let i = 0; i < num_requests; i++) {
    let url = 'https://someurl.com/' + i;
    let promise = http_request(url);
    all_promises.push(promise);
  }
  return Promise.all(all_promises);
}
async function run() {
  await many_requests(10000);
}
run();
In Node.js there are two types of threads: one Event Loop (aka the
main loop, main thread, event thread, etc.), and a pool of k Workers
in a Worker Pool (aka the threadpool).
...
The Worker Pool of Node.js is implemented in libuv (docs), which
exposes a general task submission API.
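The event loop runs in a single thread and pushes tasks to a pool of k Workers, which run in parallel. The default number of workers in the pool is 4, and you can set more. Note that network I/O itself does not go through this pool, but dns.lookup(), which Node uses by default to resolve host names, does.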
source
libuv
The default UV_THREADPOOL_SIZE is 4. You can change it by setting the UV_THREADPOOL_SIZE environment variable, as described in the link below. The upper limit depends on your OS, so check it for your system:
set UV_THREADPOOL_SIZE
I’m working on an application that uses Firebase Functions as a API interface between my web application and Google Cloud SQL (MySQL 5.7).
I have a process for importing records from the client app; basically the client app reads a CSV file then executes a function for every row in the CSV file. The function executes three or four queries during processing of the record (checking to see if the main record exists, creating it and/or other needed records, updating a stats record for this process).
The function’s called sequentially for each row, so there’s never more than one request (row) processed at a time executing 3 or 4 queries before returning data to the client app which then processes the next row (async/await).
The process works great for CSV files with 1 to 100 rows. As soon as it goes above about 900 rows, Firebase Functions starts reporting ERROR Error: ER_CON_COUNT_ERROR: Too many connections
My code, shown below, originally had a connection limit of 10. I bumped it up to 100 connections, but it still fails.
Here’s my code that executes the SQL queries:
import * as functions from "firebase-functions";
import * as mysql from 'mysql';
export async function executeQuery(cmd: string) {
  const mySQLConfig = {
    host: functions.config().sql.prodhost,
    user: functions.config().sql.produser,
    password: functions.config().sql.prodpswd,
    database: functions.config().sql.proddatabase,
    connectionLimit: 100,
  }
  var pool: any;
  if (!pool) {
    pool = mysql.createPool(mySQLConfig);
  }
  return new Promise(function (resolve, reject) {
    // @ts-ignore
    pool.query(cmd, function (error, results) {
      if (error) {
        return reject(error);
      }
      resolve(results);
    });
  });
}
As I understand it, with a pool like the one I think I’ve implemented above, each request gets a connection, up to the maximum number of connections. Each connection automatically returns to the pool once it’s done processing the request. So, even if it takes a while to release the connection, with the connection limit at 100 I should be able to process quite a few rows (20 or so at least) before there’s contention for connections, at which point the process will queue up and wait for free connections before continuing. If that’s right, what’s happening here?
I found an article here: https://cloud.google.com/sql/docs/mysql/manage-connections that describes some additional settings I can use to tweak connection management:
// 'connectTimeout' is the maximum number of milliseconds before a timeout
// occurs during the initial connection to the database.
connectTimeout: 10000,
// 'acquireTimeout' is the maximum number of milliseconds to wait when
// checking out a connection from the pool before a timeout error occurs.
acquireTimeout: 10000,
// 'waitForConnections' determines the pool's action when no connections are
// free. If true, the request will queued and a connection will be presented
// when ready. If false, the pool will call back with an error.
waitForConnections: true, // Default: true
// 'queueLimit' is the maximum number of requests for connections the pool
// will queue at once before returning an error. If 0, there is no limit.
queueLimit: 0, // Default: 0
I’m tempted to try bumping up the timeouts, but I’m not sure whether that’s actually impacting me here.
Since I’m running this in Firebase Functions (Google Cloud Functions under the covers), do these settings even really apply? Isn’t my function’s VM resetting after every execution or at least my function terminating after every execution? Does the pool even exist in this context? If not, then how do I do this type of processing in Functions?
One option is, of course, to push all of my processing to the function, just send up a JSON object for the row array and let the function process them all at once. This, I think, should make proper use of pools, but I’m worried I’ll bump up against execution limits in Functions (5 minutes) which is why I built it like I did.
Stupid developer trick: I was paying such close attention to my pool code that I missed that I was declaring the pool variable in the wrong place. Moving the pool declaration outside of the method fixed my problem. With the code the way it was, I was creating a new pool with every SQL query, which quickly used up all of my connections.
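A minimal sketch of that fix, assuming the same callback-style pool.query(cmd, callback) used in the question. To keep the sketch runnable without a database, the pool factory is passed in; makeExecuteQuery and createPool are hypothetical names, and in the real code createPool would be something like () => mysql.createPool(mySQLConfig).

```javascript
// Module scope: declared once per process, so every call to
// executeQuery() sees the same pool instead of creating a new one.
let pool = null

// `createPool` is a hypothetical stand-in for
// () => mysql.createPool(mySQLConfig).
function makeExecuteQuery(createPool) {
  return function executeQuery(cmd) {
    if (!pool) {
      // runs only on the first query
      pool = createPool()
    }
    return new Promise((resolve, reject) => {
      pool.query(cmd, (error, results) => {
        if (error) {
          return reject(error)
        }
        resolve(results)
      })
    })
  }
}
```

Because the variable lives outside the function, the if (!pool) check now does what it was meant to do. The pool is still rebuilt whenever a fresh function instance starts, so the total number of connections also depends on how many instances run at once.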
I am running a backend server with NodeJS. The backend holds a function that makes requests to an external API. As the external API provider isn't too happy about constant requests, I need to throttle my function that makes the requests to this external API. My current solution is to use the Bottleneck library.
With that library I can define a limit on how often a specific function is called in a certain amount of time (also I can limit the number of concurrent instances that execute a specific function). There is only one downside: I can neither access nor change the queue of "waiting" function calls, meaning that one client can basically make a lot of requests and block the function for other clients.
Is there a way to implement a queue in NodeJS for function calls? If other clients make requests as well, I need to take that into account and somehow mix up the execution order so that it is fair again (and not first in, first out/first come, first served).
This is my current setup with Bottleneck, but as described above, the behaviour is FIFO and therefore other clients are getting "blocked".
const Bottleneck = require("bottleneck");
const limiter = new Bottleneck({
  minTime: 1000,
});
router.post("/", async (req, res) => {
  ...
  const result = await requestHandler(xml, 0);
  async function requestHandler(xml, recursionCounter) {
    ...
    result = await limiter.schedule(() => soapRequest(URL, xml));
    ...
  }
}
async function soapRequest(url, xml) {...}
I have a question regarding SQL connection pools. My team is using the knex.js library in one of our Node applications to make database queries.
The application needs to switch databases from time to time, so my team created an initialization function that returns a knex object configured for the correct database. That object is then used to run the query. To me this seems redundant and likely to hurt performance, because we initialize a knex object every time we need to run a query instead of reusing a single knex object. I could ignore that if knex already does this when you switch databases (and if anyone could shed light on this question as well, that would be FANTASTIC!). Moreover (and this leads to the question in my title), the connection pool properties are redefined each time. Does that mean we are creating new pools every time, or does SQL (SQL Server in this case) reuse the connection pool you already defined? The question might not be Knex specific: if I used a library like knex for C# and called it in a similar way, would SQL Server know not to make more connection pools?
Example code:
/** db.js
 * @param {any} database
 * @returns db: Knex
 */
module.exports = ( database ) => {
  var knex = require('knex')({
    client: 'mssql',
    connection: {
      database: database,
      server: '127.0.0.1',
      user: 'your_database_user',
      password: 'your_database_password'
    },
    pool: {
      min: 0,
      max: 10,
      idleTimeoutMillis: 5000,
      softIdleTimeoutMillis: 2000,
      evictionRunIntervalMillis: 500
    }
  });
  return knex;
};
Index.js
var db = require('./db.js');
/**
 * @returns users:Array
 */
const getUsers = async() => {
  const users = await db('master')
    .select()
    .from('users_table')
    .orderBy('user_id');
  return users;
}
Short answer: The 'singleton' nature of the node require() statement prevents reinitialization of multiple occurrences of knex. So the initially created pool continues to be used for the duration of your process, not recreated, as long as you don't discard the db variable reference.
More discussion...
... my team created an initialization function that returns a knex
object configured to the correct database. Then that object is used to
do said query. To me this seems redundant and can cause bad
performance, because we initiate a knex object every time need to do a
query instead of reusing a single knex object. Which i could ignore if
knex already does this when you switch databases...
var db = require('./db.js');
The node.js require statement creates a singleton object. (You probably already know) this means that the first time the module is loaded by your program using the require statement, the module and its data will be initialized, but successive identical require calls will just reuse the same module reference and will not reinitialize the module.
... the connection pool properties are redefined. So does that mean
we are creating new pools every time, or does the SQL ( SQL Sever
in this case) reuse the connection pool you already defined ?
So since the require()-ed module is not reinitialized, then the originally created pool will not be re-created. Unless you discard the db variable reference (discussed more below).
The question might not be Knex specific, like if i used a library like
knex for C#, and call that library a similar way, would SQL Server
know not to make more connection pools?
Generally speaking, you need to build or acquire some code to properly manage a pool of connections throughout the life of your process. Knex and most other database wrappers do this for us. (Under the covers Knex uses this library before v0.18.3 and this one on/after.)
Properly initializing the pooling code once and then using it throughout the life of your application process accomplishes this. Discarding the pool and recreating it within your process defeats the purpose of having pooling. Often pooling is set up as part of process initialization.
Also, this was probably just a misstatement within your question, but your Node.js module is making the connection pools, not the SQL Server.
... The application from time to time needs to switch databases. my
team created an initialization function that returns a knex object
configured to the correct database.
From that statement, I would expect to see code like the following:
var db = require('./db.js');
var dbOther = require('./dbOther.js');
... which each establishes a different database connection. If you are instead using:
var db = require('./db.js');
// ... do other stuff here in the same module ...
var db = require('./dbOther.js');
... then you are likely throwing away the original reference to your first database, and in that case, YES, you are discarding your DB connection and connection pool as you switch connections.
Or, you could do something like the following:
// initialize the 2 connection pools
const dbFirst = require('./db.js');
const dbOther = require('./dbOther.js');
// set the active connection
var db = dbFirst;
// change the active connection
db = dbOther;
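One caveat is worth checking against the db.js shown in the question: that module exports a function, so what require() caches is the function itself, and each db('master') call runs the knex initializer again and gets a knex instance with its own pool. A minimal sketch of one way to keep a single instance (and therefore a single pool) per database name is to cache the instances. The names makeDbRegistry, getDb, and createClient are hypothetical, and the knex call appears only in a comment so that the sketch runs without a database.

```javascript
// Hypothetical helper: returns a lookup function that creates a client
// the first time a database name is requested and reuses it afterwards.
function makeDbRegistry(createClient) {
  const clients = new Map()
  return function getDb(database) {
    if (!clients.has(database)) {
      clients.set(database, createClient(database))
    }
    return clients.get(database)
  }
}

// Assumed wiring inside db.js (not executed here):
// module.exports = makeDbRegistry((database) =>
//   require('knex')({ client: 'mssql', connection: { database /* ... */ } }))
```

With that wiring, db('master') returns the same knex instance on every call, and db('other') creates a second instance only once.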
In short, I've run into an issue where multiple parallel GET requests to my Node.js server cause the server to get "clogged up" and hang, thus resulting in timeouts for the clients (503, service unavailable).
After a lot of performance analysis, I've realized it's a CPU issue. The specific request (we'll call it GET /foo) queries data from multiple services over HTTP, and then does a lot of computation, and returns the results to the client, like this:
1. Client requests GET /foo
2. /foo controller queries data over HTTP from multiple other services
3. /foo controller then does a bunch of iterations over the data to compile some output for the client
Step 3 takes around 2 seconds to complete. However, if I send 2 requests in parallel to /foo, each client will receive their response in about 4 seconds. When I run the app in a cluster using more cores, the requests run much faster, but not quite what I want.
Seems like I have several options here:
1. pre-compute the response (ideally would like to avoid this for now, since it will require a whole "cache invalidation" scheme), or
2. /foo sends the CPU-blocking computation asynchronously to another process (using Heroku, so that would be another dyno), and then I can use a websocket or something to push the results to the client (again, very complex for my situation), or
3. somehow yield to a child process in the request and return the results to the client
Would love to do something like option 3. Something like this:
get('/foo', function*(request) {
  // I/O, so not blocking the event loop (I think)
  let data = yield getData(request)
  // make this happen in a different process
  let response = yield doSomeHeavyProcessing(data)
  return response
})
I've omitted a lot of implementation details above, but if it's necessary to know, I'm using Koa and Node.js 6.
Ideally, doSomeHeavyProcessing would do the CPU-intensive computation in some separate process, and when it's done, still send the results back in a "synchronous" fashion to the request client.
Been trying to wrap my head around child processes, web workers, fibers, etc., and have been doing some basic "hello worlds" with these to get them to do basically the above, but to no avail. Can post more details if necessary.
Here are some approaches that you can try:
1. Split the blocking computation into small chunks and use setImmediate to place the next chunk of work at the end of the event queue. The computation then no longer blocks, and other requests can be processed in between.
2. Microsoft recently released napajs. As stated in their README:
As it evolves, we find it useful to complement Node.js in CPU-bound tasks, with the capability of executing JavaScript in multiple V8 isolates and communicating between them.
I haven't tried it, but it looks very promising:
var napa = require('napajs');
var zone1 = napa.zone.create('zone1', { workers: 4 });
get('/foo', function*(request) {
  let data = yield getData(request)
  let response = yield zone1.execute(doSomeHeavyProcessing, [data])
  return response
})
3. If none of the above is enough and you need to spread the load across multiple machines, then you probably can't avoid using some sort of message queue to distribute work to different servers. In this case check out ZeroMQ. It is extremely easy to use from node, and you can implement any kind of distributed messaging pattern with it.
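The first approach in the list above (chunking with setImmediate) could be sketched as follows. This is only an illustration: processInChunks, handleItem, and chunkSize are hypothetical names, and the right chunk size depends on how expensive each item is.

```javascript
// Hypothetical helper: applies `handleItem` to every element of `items`,
// but only `chunkSize` elements at a time. Between chunks it yields to
// the event loop with setImmediate so other requests can be served.
function processInChunks(items, handleItem, chunkSize = 1000) {
  return new Promise((resolve, reject) => {
    const results = []
    let index = 0

    function runChunk() {
      try {
        const end = Math.min(index + chunkSize, items.length)
        for (; index < end; index++) {
          results.push(handleItem(items[index]))
        }
      } catch (err) {
        return reject(err)
      }
      if (index < items.length) {
        // queue the next chunk behind any pending I/O callbacks
        setImmediate(runChunk)
      } else {
        resolve(results)
      }
    }

    runChunk()
  })
}
```

In the generator-style handler from the question, this could be used as let response = yield processInChunks(data, compileRow), where compileRow is whatever per-item work the controller does. The total CPU time is unchanged; the work is just interleaved with other requests.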
You could use a child process, with an additional wrapper for convenience.
worker.js - this module will run in a separate process and will do the heavy work
const crypto = require('crypto');
function doHeavyWork(data) {
  return crypto.pbkdf2Sync(data, 'salt', 100000, 64, 'sha512');
}
process.on('message', (message) => {
  const result = doHeavyWork(message.data);
  process.send({ id: message.id, result });
});
client.js - a convenience (but primitive) wrapper for Child process
const cp = require('child_process');
let worker;
const resolves = new Map();
module.exports = {
  init(moduleName, errorCallback) {
    worker = cp.fork(moduleName);
    worker.on('error', errorCallback);
    worker.on('message', (message) => {
      const resolve = resolves.get(message.id);
      resolves.delete(message.id);
      if (!resolve) {
        errorCallback(new Error(`Got response from worker with unknown id: ${message.id}`));
        return;
      }
      resolve(message.result);
    });
    console.log(`Service PID: ${process.pid}, Worker PID: ${worker.pid}`);
  },
  doHeavyWorkRemotly(data) {
    const id = `${Date.now()}${Math.random()}`;
    return new Promise((resolve) => {
      worker.send({ id, data });
      resolves.set(id, resolve);
    });
  }
}
I use fork() to utilize an additional communication channel as it is stated in the docs.
I also keep a record of every request submitted to the worker process (const resolves = new Map();) and resolve each Promise (resolve(message.result);) only when the worker process returns the response for that specific request (const resolve = resolves.get(message.id);).
run.js - a startup module, it utilizes co to 'execute' generators.
const co = require('co');
const client = require('./client');
function errorCallback(error) {
  console.log('Got an unexpected error!');
  console.log(error);
}
client.init('./worker.js', errorCallback);
function* run() {
  while(true) {
    yield client.doHeavyWorkRemotly('mydata');
  }
}
co(run);
To test it simply run node run.js, it will print
Service PID: XXXX, Worker PID: XXXX
Then take a look at CPU utilization: the worker process will probably take around 100% of a CPU, while the service process stays mostly idle.