I'm building a Blog API using Node.js, and my data comes from a scraping service that scrapes multiple news websites live, so there's no database.
The scraping service takes around 30 seconds to return a response for page 1 of all the sites I scrape. (Imagine with me what pagination will look like in my app :( )
If you don't know what scraping is, just think of it as multiple APIs that I get data from and then combine into one results array.
Because of the long response time, I started using the node-cache package for caching, and it cut my request time from 30 seconds to 6 milliseconds (wow, right?).
The problem is that when my cache expires after x time, I have to wait for a random user to hit my endpoint again to regenerate the cache with the new data, and that user will wait the whole 30 seconds until they get a response.
I need to avoid that as much as I can, so any ideas? I have searched a lot and haven't found any useful results: all the articles talk about how to cache, not about techniques for problems like this.
# Update
I have found kind of a solution: the package I'm using for caching provides, per its API documentation, an event called cache.on('expired', cb), meaning I can listen for any cache key expiring.
What I have done is kind of an endless loop, making the request to myself every time a cache key expires.
The code:
import NodeCache from 'node-cache';

class MyScraperService {
    private cache: NodeCache;

    constructor() {
        this.cache = new NodeCache({ stdTTL: 30, checkperiod: 5, useClones: false });
        this.cache.on('expired', (key: string, data: Article[]) => {
            console.log('key: ', key);
            // re-fetch all my articles again and again whenever the cache expires
            // (note: charAt only works for single-digit page numbers)
            this.articles(key.charAt(key.length - 1)); // page number
        });
    }

    async articles(page: string): Promise<Article[]> {
        const cached = this.cache.get<Article[]>(`articles_page_${page}`);
        if (cached) {
            return cached.sort(() => Math.random() - 0.5);
        }
        // each of these stands in for a different site-specific scraper
        const results: Article[][] = await Promise.all([
            this.xxScraper(page),
            this.xxScraper(page),
            this.xxScraper(page),
            this.xxScraper(page),
            this.xxScraper(page),
            this.xxScraper(page),
            this.xxScraper(page),
            this.xxScraper(page),
            this.xxScraper(page),
            this.xxScraper(page)
        ]);
        // flatten the per-site arrays into one results array
        let all: Article[] = [];
        for (const articles of results) {
            all.push(...articles);
        }
        this.cache.set(`articles_page_${page}`, all);
        return all.sort(() => Math.random() - 0.5);
    }
}
You might be able to schedule a cron job to call your scraper every (cachingTime - scrapeExecutionTime) seconds (in this case, 30) with cron.
I would also suggest increasing the cache TTL to above 1 minute, which will reduce the number of requests to the other websites.
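For example, a minimal sketch of the cron idea with the node-cron package (assuming the MyScraperService instance from the question is available as scraperService, and that only page 1 needs warming; scraperService is a hypothetical name):

const cron = require('node-cron');

// Re-warm the cache every 30 seconds so no real user ever pays
// the 30-second scrape cost. '*/30 * * * * *' is a six-field
// expression; node-cron supports an optional seconds field.
cron.schedule('*/30 * * * * *', async () => {
    try {
        await scraperService.articles('1'); // repopulates articles_page_1
    } catch (err) {
        console.error('cache warm-up failed:', err);
    }
});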
I need to implement code to check what my throttling limit is on an endpoint (I know it's x times per minute). I've only been able to find an example of this in Python, which I have never used. It seems like my options are to run a script that sends requests repeatedly until it throttles me or, if possible, to query the API to see what the limit is.
Does anyone have a good idea on how to go about this?
Thanks.
Note: there's a screenshot of the output at https://i.stack.imgur.com/gAFQQ.png; the blank space in it is just data from the API calls.
This starts concurrency workers (I'm using "workers" as a loose term here; don't @ me). Each one makes as many requests as possible until one of the requests is rate-limited or it runs out of time. It then reports how many of the requests completed successfully inside the given time window.
If you know the rate-limit window (1 minute, based on your question), this will find the rate limit. If you need to discover the window, you would want to intentionally exhaust the limit, then slow down the requests and measure the time until they start going through again. The provided code does not do this, but a rough sketch of that idea follows the example below.
// Call apiCall() a bunch of times, stopping when an apiCall() resolves to
// false or when the "until" time is reached, whichever comes first. For
// example, if your limit is 50 req/min (and you give "until" enough time
// to actually complete 50+ requests), this will call apiCall() 50 times.
// Each call should return a promise resolving to TRUE, so it will be
// counted as a success. On the 51st call you will presumably hit the
// limit, the API will return an error, apiCall() will detect that, and
// resolve to false. This causes the worker to stop making requests and
// return 50.
async function workerThread(apiCall, until) {
    let successfulRequests = 0;
    while (true) {
        const success = await apiCall();
        // only count it if the request was successful
        // AND finished within the timeframe
        if (success && Date.now() < until) {
            successfulRequests++;
        } else {
            break;
        }
    }
    return successfulRequests;
}
// this just runs a bunch of workerThreads in parallel, since by doing a
// single request at a time, you might not be able to hit the limit
// depending on how slow the API is to respond. It returns the sum of each
// workerThread(), AKA the total number of apiCall()s that resolved to TRUE
// across all threads.
async function testLimit(apiCall, concurrency, time) {
    const endTime = Date.now() + time;
    // launch "concurrency" number of workers
    const workers = [];
    while (workers.length < concurrency) {
        workers.push(workerThread(apiCall, endTime));
    }
    // sum the number of requests that succeeded from each worker.
    // this implicitly waits for them to finish.
    let total = 0;
    for (const worker of workers) {
        total += await worker;
    }
    return total;
}
// put in your own code to make a trial API call.
// return true for success or false if you were throttled.
async function yourAPICall() {
    try {
        // this is a really sloppy example API
        // the limit is ROUGHLY 5/min, but because of the sloppy server-side
        // implementation you might get 4-6.
        const resp = await fetch("https://9072997.com/demos/rate-limit/");
        return resp.ok;
    } catch {
        return false;
    }
}
// this is a demo of how to use the function
(async function() {
// run 2 requests at a time for 5 seconds
const limit = await testLimit(yourAPICall, 2, 5*1000);
console.log("limit is " + limit + " requests in 5 seconds");
})();
Note that this method measures the quota available to itself. If other clients or previous requests have already depleted the quota, it will affect the result.
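For the window-discovery case mentioned above (which the code does not handle), here's a rough sketch under the same assumptions: apiCall() resolves to true on success and false when throttled, like yourAPICall() above. The probe interval and function name are my own, hypothetical choices.

// Rough sketch: discover the rate-limit window by exhausting the
// quota, then probing slowly until requests go through again.
async function discoverWindow(apiCall, probeIntervalMs = 1000) {
    // 1. burn through the quota until we get throttled
    while (await apiCall()) { }
    const throttledAt = Date.now();
    // 2. probe once per interval until a request succeeds again
    while (true) {
        await new Promise(resolve => setTimeout(resolve, probeIntervalMs));
        if (await apiCall()) {
            // time from first throttle to first success approximates
            // the window, within one probe interval
            return Date.now() - throttledAt;
        }
    }
}

(async () => {
    console.log('window is roughly ' + await discoverWindow(yourAPICall) + ' ms');
})();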
Similar to this question:
Fetch API leaks memory in Chrome
When using fetch to regularly poll data, Chrome's memory usage continually increases without ever releasing the memory, which eventually causes a crash.
https://jsfiddle.net/abfinz/ahc65y3s/13/
const state = {
    response: {},
    count: 0
}

function poll(){
    fetch('https://upload.wikimedia.org/wikipedia/commons/3/3d/LARGE_elevation.jpg')
        .then(response => {
            // the response is stored, but its body is never read
            state.response = response;
            state.count = state.count + 1;
            if (state.count < 20){
                console.log(state.count);
                setTimeout(poll, 3000); // poll again in 3 seconds
            } else {
                console.log("Will restart in 1 minute");
                state.count = 0;
                setTimeout(poll, 60000);
            }
        });
}
poll();
This JSFiddle demonstrates the issue fairly well. By polling for data every 3 seconds, it seems that something is causing Chrome to continually hold onto the memory. If I let it stop and wait for a few minutes, it usually will release the memory, but if polling continues, it usually holds onto it. Also, if I let it run for a few full cycles, even forcing garbage collection from the performance tab of the dev tools doesn't always release all of the memory.
The memory doesn't show up in the JS Heap. I have to use the Task Manager to see it.
Occasionally, the memory will clear while actively polling, but inevitably builds to extreme levels.
Edge also shows the issue, but seems to be more proactive in clearing the memory. Though it still eventually builds to 1GB+ of extra memory usage.
Am I doing something wrong, or is this a bug? Any ideas on how I can get this kind of polling to work long-term without the memory leak?
I played around with it a bit, and it seems to be a bug in the handling of the response: the allocated memory is not freed if you never call any of the response's body-reading functions.
The Chrome task manager and the Windows task manager both report a constant size of 30 MB when I run the code snippet below with this order of execution. Meanwhile, it runs on JSFiddle too, staying at 30 MB even on request #120.
const state = {
        response: {},
        count: 0
    },
    uri = 'https://upload.wikimedia.org/wikipedia/commons/3/3d/LARGE_elevation.jpg';

!function poll() {
    const controller = new AbortController(),
        signal = controller.signal;

    // using this you can cancel the request and destroy it completely.
    fetch(uri, { signal })
        // this is triggered as soon as the header data is transferred
        .then(response => {
            /**
             * Continuing without doing anything on response
             * fills the memory.
             *
             * Chrome downloads the file in the background and
             * seems to wait for a call to one of the
             * response.fx() methods or an abort signal.
             *
             * This seems to be a bug or a small design mistake
             * if the response is never used.
             *
             * If response.json(), .blob(), .text() or the
             * .body stream is consumed, the memory is freed.
             */
            return response.blob();
        }).then((binary) => {
            // store data to a var
            return state.response = binary;
        }).catch((err) => {
            console.error(err);
        }).finally(() => {
            // and start the next poll
            console.log(++state.count, state.response.type, (state.response.size / 1024 / 1024).toFixed(2) + ' MB');
            requestAnimationFrame(poll);
            // console.dir(state.response);
            // aborting reduces memory a bit more
            controller.abort();
        })
}()
I'm doing a simple fetch as follows:

useLayoutEffect(() => {
    const t0 = performance.now();
    fetch('https://example.com/api/')
        .then(result => {
            const t1 = performance.now();
            console.log(`total time taken for fetch = ${t1 - t0} milliseconds`);
            return result.json();
        })
        .then(data => {
            const t2 = performance.now();
            console.log(`total time taken = ${t2 - t0} milliseconds`);
            console.log(data.data);
        });
}, []);
On the server side, I included the server execution time, which shows 0.00759 seconds to process the entire PHP script, including the query.
I have accessed the API directly via the browser, and it's extremely fast as well, almost instant.
However, when I use React/JavaScript, there is a huge delay, sometimes up to 7-10 seconds.
I wish to know how I can find out where the bottleneck for the lag/delay is.
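One way to break the request down (my suggestion, not from the original post) is the browser's Resource Timing API, which splits a fetch into DNS, connect, waiting, and download phases; the URL passed in must match the fetched URL exactly:

// after the fetch completes, inspect its resource timing entry
const [entry] = performance.getEntriesByName('https://example.com/api/');
if (entry) {
    console.log('DNS:     ', entry.domainLookupEnd - entry.domainLookupStart, 'ms');
    console.log('Connect: ', entry.connectEnd - entry.connectStart, 'ms');
    console.log('Waiting: ', entry.responseStart - entry.requestStart, 'ms'); // network + server time
    console.log('Download:', entry.responseEnd - entry.responseStart, 'ms');
}

Note that for cross-origin requests these fields are zeroed out unless the server sends a Timing-Allow-Origin header.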
I have developed an Actor + PuppeteerCrawler + Proxy based crawler and want to rescrape failed pages. To increase the chance of a successful rescrape, I want to switch to another proxyUrl. The idea is to create a new crawler with a modified launchPuppeteer function and a different proxyUrl, and re-enqueue the failed pages. Please check the sample code below.
But unfortunately it doesn't work, although I reset the request queue by dropping and reopening it. Is it possible to rescrape failed pages using PuppeteerCrawler with a different proxyUrl, and how?
Best regards,
Wolfgang
for (let retryCount = 0; retryCount <= MAX_RETRY_COUNT; retryCount++) {
    if (retryCount) {
        // Try to reset the request queue, so that failed requests will be rescraped
        await requestQueue.drop();
        requestQueue = await Apify.openRequestQueue(); // this is necessary to avoid exceptions
        // Re-enqueue the failed urls from the failedUrls array >>> they are ignored even though the queue was dropped and reopened!!!
        for (let failedUrl of failedUrls) {
            await requestQueue.addRequest({ url: failedUrl });
        }
    }
    crawlerOptions.launchPuppeteerFunction = () => {
        return Apify.launchPuppeteer({
            // generates a new proxy url and passes it to the new launchPuppeteer function
            proxyUrl: createProxyUrl()
        });
    };
    let crawler = new Apify.PuppeteerCrawler(crawlerOptions);
    await crawler.run();
}
I think your approach should work, but on the other hand it should not be necessary. I'm not sure what createProxyUrl does.
You can supply a generic proxy URL with the auto username, which will use all your datacenter proxies at Apify. Or you can provide proxyUrls directly to PuppeteerCrawler.
Just don't forget that you have to switch browsers to get a new IP from the proxy. More in this article - https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler
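As a sketch of that rotation idea using only the pieces already in the question (launchPuppeteerFunction and Apify.launchPuppeteer; the proxyUrls list is a hypothetical stand-in for your own proxies):

// hypothetical list of proxy URLs to rotate through
const proxyUrls = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
];
let nextProxy = 0;

crawlerOptions.launchPuppeteerFunction = () => {
    // each newly launched browser gets the next proxy in the list,
    // so retired browsers come back with a fresh IP
    const proxyUrl = proxyUrls[nextProxy++ % proxyUrls.length];
    return Apify.launchPuppeteer({ proxyUrl });
};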
I'm working on a project where I need to make requests to an API. The requests return data about a support ticket, but the problem is that I have about 500 tickets to get data for, and each one requires its own request. To speed up the requests, I tried to build an async routine that generates many requests at the same time. But since the API I'm integrating with has a rate limit of 10 requests per second, some of the routines get the answer "Limit Exceeded". If I make the requests sequentially, it takes about 5 minutes.
So, does anyone have a tip for this task? I tried some solutions like NodeJS's rate-limiter, but it just generates 10 requests simultaneously and doesn't give any kind of error handling or retry if a request fails.
About the language: there's no restriction. The project is written in NodeJS but has some Python code too, and integrating another language wouldn't be a problem.
Something like this isn't too difficult to create yourself, and it'd give you the flexibility you need.
There are fancy ways, like tracking the start and completion time of each request and checking whether you've sent 10 in the last second.
The system probably also limits you to 10 active requests at a time (i.e., you can't spin up 100 requests, 10 each second, and let them all process).
If you assume this, I'd say launch 10 all at once, then let them complete, then launch the next batch. You could also launch 10, then start 1 additional request each time one finishes. You could think of this like a "thread pool".
You can easily track this with a simple variable counting how many calls are in flight. Then, just check once a second (to respect the 1-second limit) how many calls are going, and if you have available "threads", fire off that many new requests.
It could look something like this:
const threadLimit = 10;
const rateLimit = 1000; // ms

let activeThreads = 0;
const calls = new Array(100).fill(1).map((_, index) => index); // create an array 0 through 99 just for an example

function run() {
    if (calls.length === 0 && activeThreads === 0) {
        console.log('complete');
        return;
    }
    // how many new threads we can start this tick
    const available = threadLimit - activeThreads;
    for (let i = 0; i < available && calls.length > 0; i++) {
        activeThreads++; // add a thread
        call(calls.shift())
            .then(done);
    }
    // schedule the next batch (setTimeout, not setInterval, so timers don't pile up)
    setTimeout(run, rateLimit);
}

function done(val) {
    console.log(`Done ${val}`);
    activeThreads--; // remove a thread
}

function call(val) {
    console.log(`Starting ${val}`);
    return new Promise(resolve => waitToFinish(resolve, val));
}

// random function to simulate a network call
function waitToFinish(resolve, val) {
    const done = Math.random() < .1; // 10% chance to finish each check
    if (done) {
        resolve(val);
    } else {
        // check again shortly (setTimeout, so the timer stops once resolved)
        setTimeout(() => waitToFinish(resolve, val), 10);
    }
    return done;
}

run();
Basically, run() just starts up however many new threads it can, based on the limit and how many are currently running. Then it repeats the process every second, adding new ones as capacity frees up.
You might need to play with the threadLimit and rateLimit values, as most rate-limiting systems don't actually let you go right up to the limit and don't release capacity the instant a request finishes.
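Since the question also mentions wanting error handling and retries (which the sketch above doesn't do), here's one way to bolt that on; retryCall and maxAttempts are my own hypothetical additions, and this assumes call() is changed to reject when the API answers "Limit Exceeded":

const maxAttempts = 3;

// wrap a single call with simple retry-on-failure logic
function retryCall(val, attempt = 1) {
    return call(val).catch(err => {
        if (attempt >= maxAttempts) throw err; // give up after maxAttempts tries
        console.log(`Retrying ${val} (attempt ${attempt + 1})`);
        return retryCall(val, attempt + 1);
    });
}

Then use retryCall(calls.shift()).then(done, done) inside run(), so a permanently failed call still frees up its thread.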