I would like to make 10,000 concurrent HTTP requests. I am currently doing it with Promise.all. However, I seem to be rate limited in some way: it takes around 15-30 minutes to complete all 10,000 requests. Is there something in axios or in Node's HTTP requests that is limiting me? How can I raise the limit, if there is one?
const axios = require('axios');
function http_request(url) {
return new Promise(async (resolve) => {
await axios.get(url);
// -- DO STUFF
resolve();
});
}
async function many_requests(num_requests) {
let all_promises = [];
for (let i = 0; i < num_requests; i++) {
let url = 'https://someurl.com/' + i;
let promise = http_request(url);
all_promises.push(promise);
}
return Promise.all(all_promises);
}
async function run() {
await many_requests(10000);
}
run();
In Node.js there are two types of threads: one Event Loop (aka the
main loop, main thread, event thread, etc.), and a pool of k Workers
in a Worker Pool (aka the threadpool).
...
The Worker Pool of Node.js is implemented in libuv (docs), which
exposes a general task submission API.
The event loop runs in a single thread and pushes tasks to a pool of k workers, which run in parallel. The default number of workers in the pool is 4, and you can raise it.
source
libuv
The default UV_THREADPOOL_SIZE is 4. You can change UV_THREADPOOL_SIZE as described in the link below. Its upper limit depends on the OS, so check what yours allows:
set UV_THREADPOOL_SIZE
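As a minimal sketch of the setting described above: the variable can be exported in the shell (`UV_THREADPOOL_SIZE=16 node app.js`) or assigned at the very top of the entry script, before anything has touched the pool. In Node, `dns.lookup` is among the calls that run on this pool, which is why the setting can matter for a burst of requests to a hostname. The value 16 and the `hashMany` helper are assumptions for illustration only, and setting the variable in the shell is the safer of the two options.

```javascript
// Assumption: this is the first statement of the entry script, so nothing has
// used the libuv threadpool yet (fs, crypto, zlib, dns.lookup all use it).
// The value 16 is illustrative, not a recommendation.
process.env.UV_THREADPOOL_SIZE = '16';

const crypto = require('crypto');

// Hypothetical helper: start n threadpool-backed hashes at the same time.
function hashMany(n) {
  const jobs = [];
  for (let i = 0; i < n; i++) {
    jobs.push(new Promise((resolve, reject) => {
      crypto.pbkdf2('secret', 'salt', 1000, 32, 'sha512', (err, key) => {
        if (err) reject(err);
        else resolve(key);
      });
    }));
  }
  return Promise.all(jobs);
}

hashMany(8).then((keys) => console.log(`finished ${keys.length} hashes`));
```

Timing `hashMany` with different pool sizes is one way to check whether the pool is actually the bottleneck on a given machine.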
Related
How do I send bulk HTTP GET requests using Axios, for example:
let maxI = 3000;
let i = 0;
do{
i = i + 1 ;
const exampleUrl = await axios.get(`https://hellowWorld.com/${i}`);
} while (i < maxI);
How will I be able to receive the data from all the provided URLs and can this be merged into a single variable? And how can I make sure that this gets executed quickly?
I know about axios.all, but I don't know how to apply it in my case.
Thank you in advance.
You can do something like this, but be careful: servers may reject requests that arrive in bulk in order to prevent DDoS attacks. This approach also doesn't guarantee that every request will succeed or that you will receive all of the data. Here is the snippet for it:
import axios from "axios";
const URL = "https://randomuser.me/api/?results=";
async function getData() {
const requests = [];
for (let i = 1; i < 6; i++) {
requests.push(axios.get(URL + i));
}
const responses = await Promise.allSettled(requests);
console.log(responses);
const result = [];
responses.forEach((item) => {
if (item.status === "rejected") return;
result.push(item.value.data.results);
});
console.log(result.flat());
}
getData();
AFAIK, it is impossible to reduce the time taken to complete your batch of requests unless you implement a batch request handling API on the server, which would reduce both the number of requests handled by the server and the number of requests made by the browser. The solution I gave you just demonstrates how it could be done from the client side, but your approach is not an optimal way to do it.
Each browser also has its own limit on the number of parallel requests that can be made to a single domain, which is why we cannot reduce the time taken to execute the queries.
Please read through these resources for further information; they will be of great help:
Bulk Requests Implementation
Browser batch request ajax
Browser request limits
Limit solution
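Since servers and browsers both cap parallel requests, one common client-side compromise is to keep only a fixed number of requests in flight. The sketch below is one way to do that under stated assumptions: `runWithLimit` is a hypothetical helper (not part of axios), `fakeGet` stands in for `axios.get(URL + i)` so the example runs without a network, and the result objects mimic the shape returned by Promise.allSettled.

```javascript
// Hypothetical helper: run async task functions with at most `limit` in flight.
// Results keep the input order and use the Promise.allSettled result shape.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker pulls the next unstarted task until none are left.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      try {
        results[index] = { status: 'fulfilled', value: await tasks[index]() };
      } catch (reason) {
        results[index] = { status: 'rejected', reason };
      }
    }
  }

  const workers = [];
  for (let i = 0; i < Math.min(limit, tasks.length); i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}

// Stand-in for axios.get(URL + i); replace with the real call in practice.
const fakeGet = (i) =>
  new Promise((resolve) => setTimeout(() => resolve({ data: i }), 10));

const tasks = [1, 2, 3, 4, 5].map((i) => () => fakeGet(i));

runWithLimit(tasks, 2).then((responses) => {
  const data = responses
    .filter((item) => item.status === 'fulfilled')
    .map((item) => item.value.data);
  console.log(data);
});
```

Passing functions rather than promises matters here: a promise starts its request as soon as it is created, so the limit can only be enforced if the helper decides when each call begins.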
Is there any way to insert data in parallel from an external data source? That is, I have multiple APIs/endpoints that provide similar datasets, which will be inserted into a database.
For example:
My current code is looping through each API and saving it to the database. My target behavior is the image above and hopefully dynamic. Meaning I can add multiple Endpoints and can insert in parallel when calling my insert function.
Yes, you can do this.
To prepare to write the code you will be wise to tool up a version of the MySQL API in node that works with async/await (that is, a Promise-based API).
Then tool up to use a mysql connection pool. You can limit the total number of connections in a pool. That is wise because too many connections can overwhelm your MySQL server.
const mysql = require('mysql2/promise')
const pool = mysql.createPool({
host: 'host',
user: 'redacted',
database: 'redacted',
waitForConnections: true,
connectionLimit: 6,
queueLimit: 0
})
function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms))
}
Then write each API access operation as an async function with a loop in it. Something like this gets a connection to use, even for multiple sequential queries, for each API operation.
async function apiOne(pool) {
while (true) {
const result = await (api_operation)
const connection = await pool.getConnection()
const [rows, fields] = await connection.execute(whatever)
const [moreRows, moreFields] = await connection.execute(whatever_else)
connection.release()
await sleep(1000) // wait one second
}
}
Do getConnection() inside the loop, not outside it. Pool.getConnection() is very fast because it re-uses existing connections. Doing it inside the loop allows your pool to limit the number of simultaneous connections.
The sleep() function is optional of course. You can use it to control how fast the loop runs.
Write as many of these functions as you need. This is a good way to handle multiple APIs because the code for each one is isolated in its own function.
Finally, use Promise.all() to run all your async functions concurrently.
const concurrents = []
concurrents.push (apiOne(pool))
concurrents.push (apiTwo(pool))
concurrents.push (apiThree(pool))
Promise.all (concurrents).then() /* run all the ApiXxx functions */
Beware, this sample code is dangerously oversimplified. It lacks any error or exception handling, which you need in long-running code.
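To give a rough idea of the missing error handling, here is a minimal sketch of a single pass of one API loop. It assumes `pool` is a mysql2/promise pool as created above; `insertOnce`, `fetchRows`, and `sql` are hypothetical names introduced only for this illustration. The point is the try/finally: the connection goes back to the pool even when a query throws.

```javascript
// Hypothetical helper: one pass for a single API.
// `pool` is assumed to be a mysql2/promise pool; `fetchRows` is a placeholder
// async function returning an array of parameter arrays; `sql` is a
// placeholder statement with matching `?` markers.
async function insertOnce(pool, fetchRows, sql) {
  const rows = await fetchRows();
  const connection = await pool.getConnection();
  try {
    for (const row of rows) {
      await connection.execute(sql, row);
    }
    return rows.length;
  } finally {
    // Always return the connection, otherwise a failed pass leaks it and
    // the pool eventually runs out of connections.
    connection.release();
  }
}

// Possible use inside a loop like apiOne (names are placeholders):
//   try {
//     await insertOnce(pool, fetchFromApiOne, 'INSERT INTO t (a, b) VALUES (?, ?)')
//   } catch (err) {
//     console.error('apiOne pass failed', err) // log and keep the loop alive
//   }
```

Catching inside each loop also keeps one failing API from rejecting the Promise.all() that runs the others.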
I am running a backend server with NodeJS. The backend holds a function that makes requests to an external API. As the external API provider isn't too happy about constant requests, I need to throttle my function that makes the requests to this external API. My current solution is to use the Bottleneck library.
With that library I can define a limit on how often a specific function is called in a certain amount of time (also I can limit the number of concurrent instances that execute a specific function). There is only one downside: I can neither access nor change the queue of "waiting" function calls, meaning that one client can basically make a lot of requests and block the function for other clients.
Is there a way to implement some sort of queue in NodeJS for function calls? If other clients make requests as well, I need to take that into account and somehow mix up the execution order to be fair again (and not first in, first out / first come, first served).
This is my current setup with Bottleneck, but as described above, the behaviour is FIFO and therefore other clients are getting "blocked".
const Bottleneck = require("bottleneck");
const limiter = new Bottleneck({
minTime: 1000,
});
router.post("/", async (req, res) => {
...
const result = await requestHandler(xml, 0);
async function requestHandler(xml, recursionCounter) {
...
result = await limiter.schedule(() => soapRequest(URL, xml));
...
}
});
async function soapRequest(url, xml) {...}
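One possible shape for such a queue, offered only as a sketch: keep a separate list of pending calls per client and serve the clients in round-robin order, waiting a minimum time between calls. This swaps Bottleneck for a hand-rolled queue rather than configuring Bottleneck itself. `FairQueue`, `clientId`, and `minTimeMs` are hypothetical names, and how a client is identified (API key, session, IP) is left as an assumption.

```javascript
// Hypothetical sketch: per-client queues drained in round-robin order.
class FairQueue {
  constructor(minTimeMs) {
    this.minTimeMs = minTimeMs; // minimum pause after each call
    this.queues = new Map();    // clientId -> array of pending jobs
    this.running = false;
  }

  // Queue `fn` for `clientId`; resolves or rejects with fn's outcome.
  schedule(clientId, fn) {
    return new Promise((resolve, reject) => {
      if (!this.queues.has(clientId)) this.queues.set(clientId, []);
      this.queues.get(clientId).push({ fn, resolve, reject });
      if (!this.running) this.drain();
    });
  }

  async drain() {
    this.running = true;
    while (this.queues.size > 0) {
      // A Map iterates in insertion order: serve the first client, then
      // move it to the back so every waiting client gets a turn.
      const [clientId, jobs] = this.queues.entries().next().value;
      const job = jobs.shift();
      this.queues.delete(clientId);
      if (jobs.length > 0) this.queues.set(clientId, jobs);
      try {
        job.resolve(await job.fn());
      } catch (err) {
        job.reject(err);
      }
      await new Promise((done) => setTimeout(done, this.minTimeMs));
    }
    this.running = false;
  }
}

// Possible use in requestHandler (clientId is an assumption):
//   const fairQueue = new FairQueue(1000);
//   result = await fairQueue.schedule(clientId, () => soapRequest(URL, xml));
```

With this layout a client that queues a hundred calls only delays another client by one call at a time, instead of by its whole backlog.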
Here is how I write a document and its subcollections:
public async setEvent(event: EventInterface): Promise<void[]> {
return new Promise<void[]>(async (resolve, reject) => {
const writePromises: Promise<void>[] = [];
event.setID(event.getID() || this.afs.createId());
event.getActivities()
.forEach((activity) => {
activity.setID(activity.getID() || this.afs.createId());
writePromises.push(this.afs.collection('events').doc(event.getID()).collection('activities').doc(activity.getID()).set(activity.toJSON()));
activity.getAllStreams().forEach((stream) => {
this.logger.info(`Steam ${stream.type} has size of GZIP ${getSize(this.getBlobFromStreamData(stream.data))}`);
writePromises.push(this.afs
.collection('events')
.doc(event.getID())
.collection('activities')
.doc(activity.getID())
.collection('streams')
.doc(stream.type) // #todo check this how it behaves
.set({
type: stream.type,
data: this.getBlobFromStreamData(stream.data),
}))
});
});
try {
await Promise.all(writePromises);
await this.afs.collection('events').doc(event.getID()).set(event.toJSON());
resolve()
} catch (e) {
Raven.captureException(e);
// Try to delete the parent entity and all subdata
await this.deleteEvent(event.getID());
reject('Something went wrong')
}
})
}
However when I look at the network tab:
I see one request firing, which is fine so far: the req_0 data is my activity. But looking further down the same request I can see:
So it adds more data, and that should not happen because:
a) I exceed Firestore's request size limit (1 MB)
b) on a slow connection I exceed the time limit for the write.
Most interesting is that this behavior happens when I have a slow network.
EDIT: Here is the payload of the request example:
Can anyone explain why this happens?
What you are seeing is batching: your write operations do not fire immediately but are aggregated into a single request, because network I/O is expensive in terms of time and battery life.
Minimizing network I/O saves battery life, and that is actually the main concern.
There's "magic" happening under the hood
In short, I've run into an issue where multiple parallel GET requests to my Node.js server cause the server to get "clogged up" and hang, thus resulting in timeouts for the clients (503, service unavailable).
After a lot of performance analysis, I've realized it's a CPU issue. The specific request (we'll call it GET /foo) queries data from multiple services over HTTP, and then does a lot of computation, and returns the results to the client, like this:
1. Client request GET /foo
2. /foo controller queries data over HTTP from multiple other services
3. /foo controller then does a bunch of iterations over the data to compile some output for the client
Step 3 takes around 2 seconds to complete. However, if I send 2 requests in parallel to /foo, each client will receive their response in about 4 seconds. When I run the app in a cluster using more cores, the requests run much faster, but not quite what I want.
Seems like I have several options here:
1. pre-compute the response (ideally would like to avoid this for now, since it will require a whole "cache invalidation" scheme), or
2. /foo sends the CPU-blocking computation asynchronously to another process (using Heroku, so that would be another dyno), and then I can use a websocket or something to push the results to the client (again, very complex for my situation), or
3. somehow yield to a child process in the request and return the results to the client
Would love to do something like option 3. Something like this:
get('/foo', function*(request) {
// I/O, so not blocking the event loop (I think)
let data = yield getData(request)
// make this happen in a different process
let response = yield doSomeHeavyProcessing(data)
return response
})
I've omitted a lot of implementation details above, but if it's necessary to know, I'm using Koa and Node.js 6.
Ideally, doSomeHeavyProcessing would do the CPU-intensive computation in some separate process, and when it's done, still send the results back in a "synchronous" fashion to the request client.
Been trying to wrap my head around child processes, web workers, fibers, etc., and have been doing some basic "hello worlds" with these to get them to do basically the above, but to no avail. Can post more details if necessary.
Here are some approaches that you can try:
1. Split the blocking computation into small chunks and use setImmediate to place the next chunk of work at the end of the event queue. The computation then no longer blocks, and other requests can be processed between chunks.
2. Microsoft recently released napajs. As stated in their README:
As it evolves, we find it useful to complement Node.js in CPU-bound tasks, with the capability of executing JavaScript in multiple V8 isolates and communicating between them.
I haven't tried it, but it looks very promising:
var napa = require('napajs');
var zone1 = napa.zone.create('zone1', { workers: 4 });
get('/foo', function*(request) {
let data = yield getData(request)
let response = yield zone1.execute(doSomeHeavyProcessing, [data])
return response
})
3. If none of the above is enough and you need to spread the load across multiple machines, then you probably can't avoid using some sort of message queue to distribute work to different servers. In this case, check out ZeroMQ. It is extremely easy to use from Node, and you can implement any kind of distributed messaging pattern with it.
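The first approach (chunking with setImmediate) could look roughly like the sketch below. `processInChunks`, `handleItem`, and the default chunk size are assumptions made for illustration; the right chunk size depends on how expensive each item is. The function returns a Promise, so it can be yielded from a handler like the one in the question.

```javascript
// Hypothetical helper: apply `handleItem` to every element of `items`,
// yielding to the event loop after each chunk so other requests can run.
function processInChunks(items, handleItem, chunkSize = 1000) {
  return new Promise((resolve, reject) => {
    const results = [];
    let index = 0;

    function runChunk() {
      try {
        const end = Math.min(index + chunkSize, items.length);
        for (; index < end; index++) {
          results.push(handleItem(items[index]));
        }
      } catch (err) {
        reject(err);
        return;
      }
      if (index < items.length) {
        // Queue the next chunk behind any pending I/O callbacks.
        setImmediate(runChunk);
      } else {
        resolve(results);
      }
    }

    runChunk();
  });
}

// Example: square a large array without blocking for the whole duration.
const numbers = Array.from({ length: 5000 }, (_, i) => i);
processInChunks(numbers, (n) => n * n, 500).then((squares) => {
  console.log(`processed ${squares.length} items`);
});
```

This does not make the computation faster. It only interleaves it with other work, so total CPU time stays the same while individual requests stop starving each other.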
You could use a child process with an additional wrapper for convenience.
worker.js - this module will run in a separate process and will do the heavy work
const crypto = require('crypto');
function doHeavyWork(data) {
return crypto.pbkdf2Sync(data, 'salt', 100000, 64, 'sha512');
}
process.on('message', (message) => {
const result = doHeavyWork(message.data);
process.send({ id: message.id, result });
});
client.js - a convenience (but primitive) wrapper for Child process
const cp = require('child_process');
let worker;
const resolves = new Map();
module.exports = {
init(moduleName, errorCallback) {
worker = cp.fork(moduleName);
worker.on('error', errorCallback);
worker.on('message', (message) => {
const resolve = resolves.get(message.id);
resolves.delete(message.id);
if (!resolve) {
errorCallback(new Error(`Got response from worker with unknown id: ${message.id}`));
return;
}
resolve(message.result);
});
console.log(`Service PID: ${process.pid}, Worker PID: ${worker.pid}`);
},
doHeavyWorkRemotly(data) {
const id = `${Date.now()}${Math.random()}`;
return new Promise((resolve) => {
worker.send({ id, data });
resolves.set(id, resolve);
});
}
}
I use fork() to utilize an additional communication channel as it is stated in the docs.
I also keep a record of all requests submitted to the worker process (const resolves = new Map();) and resolve each Promise (resolve(message.result);) only when the worker process returns a response for that specific request (const resolve = resolves.get(message.id);).
run.js - a startup module, it utilizes co to 'execute' generators.
const co = require('co');
const client = require('./client');
function errorCallback(error) {
console.log('Got an unexpected error!');
console.log(error);
}
client.init('./worker.js', errorCallback);
function* run() {
while(true) {
yield client.doHeavyWorkRemotly('mydata');
}
}
co(run);
To test it simply run node run.js, it will print
Service PID: XXXX, Worker PID: XXXX
Then take a look at CPU utilization: the worker process will probably take around 100% of a CPU while the service stays fairly idle.