Async requests over an API with request rate limiter - javascript

I'm working on a project where I need to make requests to an API. The requests return data about a support ticket, but the problem is that I have about 500 tickets to fetch and each one requires its own request. To speed things up, I tried to build an async routine that fires many requests at the same time. But since the API I'm integrating with has a rate limit of 10 requests per second, some of the requests come back with a "Limit Exceeded" error. If I make the requests sequentially, it takes about 5 minutes.
Does anyone have a tip for this task? I tried some solutions like the rate-limiter package for Node.js, but it just fires 10 requests simultaneously and doesn't provide any kind of error handling or retry if a request fails.
There's no restriction on the language: the project is written in Node.js but already includes some Python code too, so integrating another language wouldn't be a problem.

Something like this isn't too difficult to create yourself, and it'd give you the flexibility you need.
There are fancier approaches, like tracking the start and completion time of each request and checking whether you've already sent 10 in the current second.
The API probably also limits you to 10 active requests at a time (i.e., you can't spin up 100 requests, 10 each second, and let them all process).
If you assume this, I'd say launch 10 all at once, let them complete, then launch the next batch. You could also launch 10, then start 1 additional request each time one finishes. You can think of this like a "thread pool".
You can easily track this with a simple variable counting how many calls are in flight. Then, once a second (to respect the 1-second limit), check how many calls are in flight and, if you have available "threads", fire off that many new requests.
It could look something like this:
const threadLimit = 10;
const rateLimit = 1000; // ms
let activeThreads = 0;
const calls = new Array(100).fill(1).map((_, index) => index); // create an array 0 through 99 just for an example

function run() {
  // finished only when the queue is empty AND nothing is still in flight
  if (calls.length === 0 && activeThreads === 0) {
    console.log('complete');
    return;
  }
  // threadLimit - activeThreads is how many new threads we can start
  const available = threadLimit - activeThreads;
  for (let i = 0; i < available && calls.length > 0; i++) {
    activeThreads++; // add a thread
    call(calls.shift())
      .then(done);
  }
  setTimeout(run, rateLimit); // check again in a second
}

function done(val) {
  console.log(`Done ${val}`);
  activeThreads--; // remove a thread
}

function call(val) {
  console.log(`Starting ${val}`);
  return new Promise(resolve => waitToFinish(resolve, val));
}

// random function to simulate a network call
function waitToFinish(resolve, val) {
  const done = Math.random() < .1; // 10% chance to finish
  if (done) {
    resolve(val);
  } else {
    setTimeout(() => waitToFinish(resolve, val), 10);
  }
  return done;
}

run();
Basically, run() just starts up however many new "threads" it can, based on the limit and how many are currently active. Then it repeats the process every second, adding new ones as capacity frees up.
You might need to play with the threadLimit and rateLimit values, as most rate-limiting systems don't actually let you go right up to the limit and don't free up capacity the instant a request finishes.
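The question also mentions wanting a retry when a request is rejected by the limiter, which the snippet above doesn't handle. As a minimal sketch (assuming the API signals throttling with an HTTP 429 status, and using a hypothetical requestTicket(id) function for the real call), a per-request retry with backoff could look like this:
// Minimal retry sketch: requestTicket(id) is a hypothetical function that
// performs the real API call and returns a fetch-style Response.
async function callWithRetry(id, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await requestTicket(id);
    if (response.status !== 429) {
      return response; // success, or a non-throttling error handled elsewhere
    }
    // throttled: wait a little longer before each new attempt
    const backoff = 1000 * (attempt + 1);
    await new Promise(resolve => setTimeout(resolve, backoff));
  }
  throw new Error(`Ticket ${id} still throttled after ${maxRetries} retries`);
}
You could then use callWithRetry() in place of call() in the pool above.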

Related

How to check what the throttling limit is for your access to an endpoint with JS

I need to implement code to check what my throttling limit is on an endpoint (I know it's x times per minute). I've only been able to find an example of this in Python, which I have never used. It seems like my options are to run a script that sends the request repeatedly until it throttles me or, if possible, query the API to see what the limit is.
Does anyone have a good idea on how to go about this?
Thanks.
This starts concurrency workers (I'm using "workers" as a loose term here; don't @ me). Each one makes as many requests as possible until one of the requests is rate-limited or it runs out of time. It then reports how many of the requests completed successfully inside the given time window.
If you know the rate-limit window (1 minute, based on your question), this will find the rate limit. If you need to discover the window as well, you would want to intentionally exhaust the limit, then slow down the requests and measure how long it takes until they start going through again. The provided code does not do this (a rough sketch of that idea follows after the code).
// Call apiCall() a bunch of times, stopping when an apiCall() resolves
// false or when the "until" time is reached, whichever comes first. For example,
// if your limit is 50 req/min (and you give "until" enough time to
// actually complete 50+ requests) this will call apiCall() 50 times. Each
// call should return a promise resolving to TRUE, so it will be counted as
// a success. On the 51st call you will presumably hit the limit, the API
// will return an error, apiCall() will detect that, and resolve to false.
// This will cause the worker to stop making requests and return 50.
async function workerThread(apiCall, until) {
  let successfulRequests = 0;
  while (true) {
    const success = await apiCall();
    // only count it if the request was successful
    // AND finished within the timeframe
    if (success && Date.now() < until) {
      successfulRequests++;
    } else {
      break;
    }
  }
  return successfulRequests;
}

// This just runs a bunch of workerThreads in parallel, since by doing a
// single request at a time, you might not be able to hit the limit
// depending on how slow the API is to return. It returns the sum of each
// workerThread(), AKA the total number of apiCall()s that resolved to TRUE
// across all threads.
async function testLimit(apiCall, concurrency, time) {
  const endTime = Date.now() + time;
  // launch "concurrency" number of workers
  const workers = [];
  while (workers.length < concurrency) {
    workers.push(workerThread(apiCall, endTime));
  }
  // Sum the number of requests that succeeded from each worker.
  // This implicitly waits for them to finish.
  let total = 0;
  for (const worker of workers) {
    total += await worker;
  }
  return total;
}

// Put in your own code to make a trial API call.
// Return true for success or false if you were throttled.
async function yourAPICall() {
  try {
    // This is a really sloppy example API.
    // The limit is ROUGHLY 5/min, but because of the sloppy server-side
    // implementation you might get 4-6.
    const resp = await fetch("https://9072997.com/demos/rate-limit/");
    return resp.ok;
  } catch {
    return false;
  }
}

// This is a demo of how to use the function.
(async function() {
  // run 2 requests at a time for 5 seconds
  const limit = await testLimit(yourAPICall, 2, 5 * 1000);
  console.log("limit is " + limit + " requests in 5 seconds");
})();
Note that this method measures the quota available to itself. If other clients or previous requests have already depleted the quota, it will affect the result.
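The window-discovery approach described above (intentionally exhaust the limit, then measure how long it takes before requests go through again) could be sketched roughly like this; the probe interval is an assumption you would tune to your API:
// Rough sketch: exhaust the quota, then probe slowly and time how long it
// takes before a request succeeds again. apiCall() is the same kind of
// function used above: it resolves true on success, false when throttled.
async function measureWindow(apiCall, probeIntervalMs = 5000) {
  // keep calling until we get throttled, so we know the quota is used up
  while (await apiCall()) {}
  const throttledAt = Date.now();
  // now probe slowly until a request goes through again
  while (!(await apiCall())) {
    await new Promise(resolve => setTimeout(resolve, probeIntervalMs));
  }
  // the elapsed time approximates the window, to within one probe interval
  return Date.now() - throttledAt;
}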

Interval synced to Time

I want to execute a function on an interval. Yeah, I could use setInterval, but I need the interval to be synced to the timestamp or something.
Like, I want to execute the interval on two different devices and they should run in the exact same second, or even the same millisecond if possible. But depending on when I start the script, these intervals would be offset if I used setInterval.
I've already tried this, but it acts kind of weird:
setInterval(() => {
  if (new Date().getTime() % 1000 * 10 == 0) {
    console.log(new Date().toLocaleTimeString())
  }
}, 1);
Like I want to execute the interval on two different devices and they should run in the exact same second or even ms if possible.
There's no guarantee that you can do this, not least because the JavaScript thread on one of the devices may be busy doing something else at that precise moment (it could even be tied up for several seconds).
Other than that, there's the issue of synchronizing the devices. Options are:
Some kind of synchronization event you send simultaneously to both devices. You'd run your code in response to the synchronization event received from your server. This is naturally subject to network delays, it requires a server to send the event (probably over web sockets), and is subject to the above caveat about the JavaScript thread being busy.
Relying on the devices being synced to exactly the same time source (for instance, perhaps they're both using a NIST time server or similar). If you know their times are synchronized sufficiently for your purposes, you can schedule your timer to fire at a precise moment, like this:
// Fire at exactly 14:30 GMT on 2021-04-21
const target = new Date(Date.UTC(2021, 3, 21, 14, 30)); // month 3 = April, months start at 0 = Jan
const delay = target - Date.now();
if (delay < 0) {
  // It's already later than that
} else {
  setTimeout(() => {
    // Your code here
  }, delay);
}
BUT, again, if the JavaScript thread is busy at that precise moment, the timer callback will run later, when the thread is free.
The code above schedules a single event, but if you need a recurring one, you can apply the same basic logic: determine the date/time you want the next callback to occur, find out how many milliseconds there are between now and then (target - Date.now()), and schedule the callback for that many milliseconds later.
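As a minimal sketch of that recurring pattern, assuming the devices' clocks are already synchronized and you want the callback to fire at the top of every minute:
// Minimal sketch: fire a callback at the top of every minute, re-aligning
// to the wall clock each time so drift doesn't accumulate. Assumes the
// devices' clocks are already synchronized closely enough for your purposes.
const intervalMs = 60 * 1000; // one minute

function scheduleNextTick(callback) {
  const now = Date.now();
  // milliseconds until the next multiple of intervalMs on the wall clock
  const delay = intervalMs - (now % intervalMs);
  setTimeout(() => {
    callback();
    scheduleNextTick(callback); // re-align for the following tick
  }, delay);
}

scheduleNextTick(() => {
  console.log('tick at', new Date().toISOString());
});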

NodeJS: How to generate all previous cached requests automatically on expiration

I'm building a blog API using Node.js, and my data comes from a scraping service that scrapes data from multiple news websites live, so there's no database.
The scraping service takes around 30 seconds to return a response for page 1 of all the sites I scrape. (Imagine what pagination looks like in my app :( )
If you don't know what scraping is, just think of it as multiple APIs that I get data from, then combine all the results into one array.
Because of the long response time, I started using the node-cache package for caching, and it cut my response time from 30 seconds to 6 milliseconds (wow, right?).
The problem is that when my cache expires after x time, I have to wait for a random user to hit my endpoint again to regenerate the cache with new data, and that user will wait the whole 30 seconds for a response.
I need to avoid that as much as I can, so any ideas? I have searched a lot and haven't found anything useful; all the articles talk about how to cache, not about techniques like this.
Update
I have found kind of a solution: the caching package I'm using provides an event in its API documentation, cache.on('expired', cb), which means I can listen for any cache key expiring.
What I have done is kind of an endless loop, making a request to myself every time a cache key expires.
The code
import NodeCache from 'node-cache';

class MyScraperService {
  cache: NodeCache;

  constructor() {
    this.cache = new NodeCache({ stdTTL: 30, checkperiod: 5, useClones: false });
    this.cache.on('expired', (key: string, data: Article[]) => {
      console.log('key: ', key);
      // request all my articles again every time the cache expires
      this.articles(key.charAt(key.length - 1)); // page number
    });
  }

  async articles(page: string): Promise<Article[]> {
    // return the cached page if it is still there
    if (this.cache.get(`articles_page_${page}`)) {
      let all: Article[] = this.cache.get(`articles_page_${page}`); //.sort(() => Math.random() - 0.5);
      return all.sort(() => Math.random() - 0.5);
    }
    // otherwise scrape every source for this page in parallel
    let scraped: any = await Promise.all([
      this.xxScraper(page),
      this.xxScraper(page),
      this.xxScraper(page),
      this.xxScraper(page),
      this.xxScraper(page),
      this.xxScraper(page),
      this.xxScraper(page),
      this.xxScraper(page),
      this.xxScraper(page),
      this.xxScraper(page)
    ]);
    let all: Article[] = [];
    for (let i = 0; i < scraped.length; i++) {
      all.push(...scraped[i]);
    }
    this.cache.set(`articles_page_${page}`, all);
    all = all.sort(() => Math.random() - 0.5);
    return all;
  }
}
You might be able to schedule a cron job (for example with the cron package) to call your scraper every cachingTime - scrapExecutionTime seconds (in this case, 30), so the cache is regenerated before anyone hits it cold.
I would also suggest increasing the cache TTL to more than one minute, which will reduce the number of requests made to the other websites.
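As a minimal sketch of that idea, using a plain setInterval rather than a cron library (the refresh interval and page count are placeholders, and it assumes the cache TTL has been raised above the refresh interval):
// Minimal sketch: refresh the cache on a fixed schedule instead of waiting
// for a user to hit a cold cache. The interval and page count are
// placeholders; the cache TTL (stdTTL above) is assumed to be longer than
// the refresh interval.
const REFRESH_EVERY_MS = 60 * 1000;
const PAGES_TO_KEEP_WARM = 3;

const scraperService = new MyScraperService();

async function refreshAllPages() {
  for (let page = 1; page <= PAGES_TO_KEEP_WARM; page++) {
    // drop the old entry so articles() is forced to re-scrape this page
    scraperService.cache.del(`articles_page_${page}`);
    await scraperService.articles(String(page));
  }
}

refreshAllPages();                              // warm the cache at startup
setInterval(refreshAllPages, REFRESH_EVERY_MS); // keep it warm afterwards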

Designing bots using Node.JS, that run with random timeouts and a common routine

I'd like to use Node.js to design some bots. Here are the requirements for these bots:
There are up to 10 'bots'. I'm not sure how to do this in Node.js, considering it's single-threaded; I'm assuming that if there are 10 asynchronous worker items, they will be representative of 10 'bots'?
Bots perform a basic REST task, like a POST to a remote server. Assume every POST is a success (or assume we don't care if there is a failure). While the remote server is the same and the function is the same (POST), there may be variations in the arguments, and each bot will supply its own variable arguments, like the payload to POST.
Bots should somewhat model human behavior by randomly sleeping for some k seconds before firing off a task, then queueing themselves for another random k seconds before performing the next task. This adds a level of complexity that I'm unable to wrap my head around: if I use a sleep/timeout function like setTimeout or setInterval, will 10 such workers sleep in parallel or serially? If they sleep serially, then I don't have 10 bots; instead I have 10 serial workers queued in order of sleep!
What I have tried so far:
Since I'm new to Node.js, I haven't been able to find the right way to deal with this.
I looked into beanstalkd, which is a work queue, but it appears that the consumer service processes items serially.
I'm currently evaluating async.parallel, but it appears that its parallelism acts as a 'barrier': all parallel jobs proceed to the next iteration only after every job has finished the function being executed in parallel. In my case I'd like, for example, bot 3 to be re-queued and sleeping in its 2nd iteration even though bot 7 is still on a long sleep in its first iteration.
The asynchronous nature of javascript means that when each 'bot' is sleeping it doesn't block and cause the other bots to sleep. For example, in this code:
'use strict';

var start = Date.now();

var printTime = function() {
  console.log(Date.now() - start + 'ms');
};

setTimeout(function() {
  printTime();
}, 500);

setTimeout(function() {
  printTime();
}, 1000);
Should print (approximately):
500ms
1000ms
Rather than:
500ms
1500ms
Something like this should work just fine:
'use strict';

var fetch = require('node-fetch');

// Each bot waits between 5 and 30 seconds
var minDelay = 5000;
var maxDelay = 30000;
var numBots = 10;

var botTask = function() {
  fetch('http://somewhere.com/foo', { method: 'POST', body: 'a=1' });
};

var getDelay = function() {
  return minDelay + Math.random() * (maxDelay - minDelay);
};

var runBot = function() {
  setTimeout(function() {
    botTask();
    runBot();
  }, getDelay());
};

for (var i = 0; i !== numBots; i++) {
  runBot();
}
Here is a very simple framework:
var bot = {
  act: function() {
    // make the POST request here
    var delay = Math.random() * 500; // set a random delay to mimic a human
    setTimeout(this.act.bind(this), delay);
  }
};

var bots = [];
for (var i = 0; i < 10; i++) {
  bots.push(Object.create(bot));
}

bots.forEach(function(bot) {
  bot.act();
});
We have a master bot template, the bot variable. bot.act is a function that will send the POST request, then set a timeout on itself after a delay. The rest is just boilerplate, adding 10 bots to a list, and starting each bot. It's really that simple... no work queues, no async parallel...
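The question also asked for each bot to supply its own arguments, such as the POST payload. As a minimal sketch building on the framework above (the endpoint and payload values are placeholders), each bot instance can simply carry its own state:
// Minimal sketch: each bot carries its own payload as instance state.
// The endpoint and payload values are placeholders.
var fetch = require('node-fetch');

var bot = {
  act: function() {
    fetch('http://somewhere.com/foo', { method: 'POST', body: this.payload });
    var delay = Math.random() * 500; // random delay to mimic a human
    setTimeout(this.act.bind(this), delay);
  }
};

var bots = [];
for (var i = 0; i < 10; i++) {
  var b = Object.create(bot);
  b.payload = 'botId=' + i; // per-bot POST body
  bots.push(b);
}

bots.forEach(function(b) {
  b.act();
});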

Execute function on interval only if server is not under heavy load

I'm performing a routine check on my DB every hour or so by doing
setInterval(function() {
  myCheckFunction();
}, 3600000);
I'm thinking of something like:
setInterval(function() {
  server.getConnections(function(err, count) {
    if (count < X) {
      myCheckFunction();
    }
  });
}, 3600000);
This is to check that there isn't too much work being done right now.
Is it a good idea?
If so, what value could X have?
If not, should I try a different approach, or just run the check regardless of the current load?
I don't expect millions of connections; this is just a proof of concept, and my teacher asked me to take care of that kind of thing.
Thanks!
Edit: Why do I want to avoid heavy load? Because the routine check can take a couple of minutes and requires a lot of work from the server. It has to contact at least 5 DNS servers for each entry in the DB, request the HTML code and hash it, then compare the answers. The check for one entry can take up to 6 seconds, since the DNS servers can time out and the DB is hosted separately from the project.
This is what you are looking for:
toobusy
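A minimal sketch of how that could gate the check, assuming the toobusy-js package (the maintained pure-JS fork of toobusy), whose exported function returns true when the event loop is lagging:
// Minimal sketch, assuming toobusy-js: toobusy() returns true when the
// event loop is lagging, which is a reasonable proxy for heavy load.
var toobusy = require('toobusy-js');

setInterval(function() {
  if (toobusy()) {
    // heavy load right now: skip this run and try again next interval
    return;
  }
  myCheckFunction();
}, 3600000);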
With these particular requirements, I would perform this check on the server side instead:
var checkFunctionCaller = function() {
  // call the async function on the server side and provide a callback
  // to be invoked when the server returns an answer
  server.myCheckFunction(function(result) {
    // if the server was able to run the check, it returns 'done' and everything is fine
    if (result === "done") {
      // do nothing
    } else {
      // otherwise, re-schedule a check in five minutes
      setTimeout(checkFunctionCaller, 5 * 60 * 1000);
    }
  });
};

setInterval(checkFunctionCaller, 3600000);
