Running tasks in parallel with reliability in cloud functions - javascript

I'm streaming and processing tweets in Firebase Cloud Functions using the Twitter API.
In my stream I am tracking various keywords and Twitter users, so the influx of tweets is very high: a new tweet often arrives before I have finished processing the previous one, which leads to lapses where some tweets never get processed.
This is how my stream looks:
...
const stream = twitter.stream('statuses/filter', {track: [various, keywords, ..., ...], follow: [userId1, userId2, userId3, userId3, ..., ...]});
stream.on('tweet', (tweet) => {
processTweet(tweet); // This takes time: multiple network requests are involved, and sometimes recursive calls depending on the tweet's properties.
})
...
processTweet(tweet) essentially compiles threads from Twitter, which takes time depending on the length of the thread, sometimes a few seconds. I have optimised processTweet(tweet) as much as possible to compile threads reliably.
I want to run processTweet(tweet) in parallel and queue the tweets that arrive while one is being processed, so that the stream runs reliably, as the Twitter docs specify:
Ensure that your client is reading the stream fast enough. Typically you should not do any real processing work as you read the stream. Read the stream and hand the activity to another thread/process/data store to do your processing asynchronously.
Help would be very much appreciated.

The Twitter streaming API will not work with Cloud Functions.
Cloud Functions code can only be invoked in response to incoming events, and each invocation may run for at most 9 minutes (60 seconds by default). After that, the function is forced to shut down. With Cloud Functions, there is no way to continually process a stream of data coming from an API.
In order to use this API, you will need to use some other compute product that allows you to run code indefinitely on a dedicated server instance, such as App Engine or Compute Engine.

Related

Handle Multiple Concurrent Requests for Express Server on Same Endpoint API

This question might be a duplicate, but I am still not getting the answer. I am fairly new to Node.js, so I might need some help. Many have said that Node.js is perfectly capable of handling incoming requests asynchronously, but the code below shows that if multiple requests hit the same endpoint, say /test3, the callback function will:
Print "test3"
Call setTimeout() to prevent blocking of event loop
Wait for 5 seconds and send a response of "test3" to the client
My question is: if client 1 and client 2 call the /test3 endpoint at the same time, and client 1 hits the endpoint first, does client 2 have to wait for client 1 to finish before entering the event loop?
Can anybody tell me whether it is possible for multiple clients to call a single endpoint and run concurrently rather than sequentially, something like a one-thread-per-connection analogy?
Of course, if I call another endpoint such as /test1 or /test2 while the code is still executing on /test3, I still get a response from /test2 ("test2") immediately.
app.get("/test1", (req, res) => {
console.log("test1");
setTimeout(() => res.send("test1"), 5000);
});
app.get("/test2", async (req, res, next) => {
console.log("test2");
res.send("test2");
});
app.get("/test3", (req, res) => {
console.log("test3");
setTimeout(() => res.send("test3"), 5000);
});
For those who have visited: it has nothing to do with blocking of the event loop.
I have found something interesting; the answer to the question can be found here.
When I was using Chrome, requests kept getting blocked after the first request. However, with Safari I was able to hit the endpoint concurrently. For more details, see the link below.
GET requests from Chrome browser are blocking the API to receive further requests in NODEJS
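That the blocking was browser-side, not in Node, can be checked with a minimal simulation (a sketch, not the original Express app): two setTimeout-based handlers started together overlap on the event loop, so the total time is close to one delay, not the sum of both.

```javascript
// Simulate two /test3-style handlers, each "responding" after 1 second.
function handle(name) {
  return new Promise((resolve) => setTimeout(() => resolve(name), 1000));
}

async function main() {
  const start = Date.now();
  // Start both at once, as two concurrent clients would.
  const results = await Promise.all([handle("client1"), handle("client2")]);
  const elapsed = Date.now() - start;
  // If the event loop serialized them, elapsed would be ~2000 ms.
  console.log(results.join(","), elapsed < 1500 ? "concurrent" : "sequential");
}

main();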
Run your application in cluster mode; look up PM2.
This question needs more details to be answered and is clearly opinion-based, but since it rests on a straw-man argument I will answer it.
First we need to define "run concurrently". The phrase is ambiguous: taken literally, in strict theory nothing runs concurrently.
A CPU core can only carry out one instruction at a time.
The speed at which the CPU carries out instructions is called the clock speed, which is controlled by a clock. With every tick of the clock, the CPU fetches and executes one instruction. Clock speed is measured in cycles per second, where 1 c/s is 1 hertz. This means that a CPU with a clock speed of 2 gigahertz (GHz) can carry out two billion (2000 million) cycles per second.
Can a CPU run multiple tasks "concurrently"? Yes, you're right that nowadays computers and even cell phones come with multiple cores, so the number of tasks running at the same time depends on the number of cores. But any expert (such as this Associate Staff Engineer, AKA me) will tell you that you will very, very rarely find a server with more than one core. Why spend 500 USD on a multi-core server when you can spawn a whole bunch of nano instances (or whatever option is available in the free trial) with Kubernetes?
Another thing: why configure Node to be in charge of the routing? Let Apache and/or nginx worry about that.
As you mentioned, there is the event loop, which is a fancy name for a FIFO queue data structure.
So in other words: no, neither Node.js nor any other programming language out there will literally run requests concurrently on a single core.
But it definitely depends on your infrastructure.

Memory Space issue while adding more images for generating word document using docx node library

const imageResponse = await axios.get(url[0], {
responseType: "arraybuffer",
});
const buffer = Buffer.from(imageResponse.data, "utf-8");
const image = Media.addImage(doc, buffer);
I'm using the above code inside a loop that executes 100 times because there are 100 images. Each image is at most 150 KB. I deployed the Cloud Function with 256 MB and I'm getting "Error: memory limit exceeded. Function invocation was interrupted".
Problem statement:
I need to add 250 images in word document. I'm getting memory limit exceeded error.
Q&A
Is there any way to get one image and add to word document, after that clearing the memory used by the image?
How to effectively use this plugin in firebase cloud function with cloud storage for images?
Environment:
Firebase Cloud Function (NodeJs)
Size : 256mb
Word Doc Generating Library : docx (https://docx.js.org/#/)
For the kind of scenario you are describing, as Doug mentions, you should consider increasing the resources allocated to your functions to better handle the requests.
You can set the memory with the --memory flag of the gcloud command used to deploy your functions, for example:
gcloud beta functions deploy my_function --runtime=python37 --trigger-event=providers/cloud.firestore/eventTypes/document.write --trigger-resource=projects/project_id/databases/(default)/documents/messages/{pushId} --memory=AmountOfMemory
I recommend you take a look at the best practices for Cloud Functions document, which explains:
"Local disk storage in the temporary directory is an in-memory
filesystem. Files that you write consume memory available to your
function, and sometimes persist between invocations. Failing to
explicitly delete these files may eventually lead to an out-of-memory
error and a subsequent cold start."
To get a better perspective on how Cloud Functions handles requests, check this document, which mentions:
"Cloud Functions handles incoming requests by assigning them to
instances of your function. Depending on the volume of requests, as
well as the number of existing function instances, Cloud Functions may
assign a request to an existing instance or create a new one.
Each instance of a function handles only one concurrent request at a
time. This means that while your code is processing one request, there
is no possibility of a second request being routed to the same
instance. Thus the original request can use the full amount of
resources (CPU and memory) that you requested."

Java Rest service Jersey multiple connections doesn't work

I'm using Queue.js as a library for loading data from a Java REST service.
https://github.com/mbostock/queue
I used the following code:
queue()
.defer(d3.json, "rest/v1/status/geographicalData")
.defer(d3.json, "rest/v1/status/geographicalFeatures")
.defer(d3.json, "rest/v1/status/classes")
.awaitAll(function(error, results) { console.log("all done!" + results.size)});
On the library website the queue method is described as follows:
queue([parallelism])
Constructs a new queue with the specified parallelism. If parallelism is not specified, the queue has infinite parallelism. Otherwise, parallelism is a positive integer. For example, if parallelism is 1, then all tasks will be run in series. If parallelism is 3, then at most three tasks will be allowed to proceed concurrently; this is useful, for example, when loading resources in a web browser.
The problem is that loading the data takes around 3 minutes. I then tried loading the data synchronously and got the same total time as executing all 3 at once, so I guess they are not loaded in parallel. How can I load the elements in parallel?
Update
Java Jersey Rest Service threads
Thread before
Thread after running the website
I started the REST service in debug mode; after loading the website I saw multiple threads, so the REST service shouldn't be the problem?
After that I looked at the database connections:
DB before
DB after running the website
After loading the website I saw multiple open database connections, so the database should also be OK?
After that I measured the loading time with the Firefox developer tools, but it took much longer with more data. If it were running in parallel, shouldn't it finish at the same time?
Browser before (less datasources)
Browser after (more datascources)

Meteor Shutdown/Restart Interrupting Running Server-Side Functions

I am building out a data analysis tool in Meteor and am running into issues with the Meteor server shutdown and restart. Server-side, I am pinging several external APIs on a setInterval, trimming the responses down to just new data that I haven't already captured, running a batch of calculations on that new data, and finally storing the computed results in Mongo. For every chunk of new data that I receive from the external APIs, there are about 15 different functions/computations that I need to run, and each of the 15 outputs are being stored in Mongo inside separate documents. Client-side, users can subscribe to any one out of the 15 documents, enabling them to view the data in the manner in which they please.
new data is captured from the API as {A} and {A} is stored in Mongo
|
begin chain
|
function1 -> transforms {A} into {B} and stores {B} in Mongo
|
function2 -> transforms {A} into {C} and stores {C} in Mongo
|
...
|
function15 -> transforms {A} into {P} and stores {P} in Mongo
|
end chain
The problem, is that when I shut down Meteor, or deploy new code to the server (which will automatically restart the server), the loop that iterates through these functions is being interrupted. Say, functions 1-7 ran successfully, and then Meteor restarted, causing functions 8-15 to never run (or even worse, for function 8 to be interrupted while 9-15 never ran). This causes my documents to no longer be in sync with {A} data that was stored before the loop began.
How can this risk be mitigated? Is it possible, to tell Meteor to gracefully shutdown / wait until this process has completed? Thanks!
I've never heard of a graceful shutdown, and even if there were one, I'm not sure I'd trust it...
What I'd do is assign a complete flag to {A} that is triggered when function15 is finished. If the functions are async (or just expensive, but I doubt that's your case...), then create a record in {B}:{P} when {A} is created with the _id of {A} and a complete flag.
Regardless, run the functions on a query of batches where complete is false.
There is currently no official way to do a graceful shutdown, so you would have to come up with some other way of making sure your data isn't stored in an inconsistent state.
The easiest way that comes up in my mind would be to disable meteor automatic restarts with meteor --once.
Then add a shutdown mode to your application.
In shutdown mode your application should not pick up new tasks, only finish what it is currently working on. This would be easy to do if you use meteor-synced-cron, which has a stop method that doesn't kill currently running functions.
Make sure that a finishing task never leaves a document in an inconsistent state.
You can have multiple tasks working on a document; just make sure that when task1 finishes, the document is always in a state that task2 can pick up.

Update value in DB every five minutes

I am building a webapp where user have a ranking based on their twitter activity and their activity on my website.
Therefore I'd like to update their rank every five minutes, pulling their latest activity from twitter and update it in my database. I was thinking of using something like this:
var minutes = 5, the_interval = minutes * 60 * 1000;
setInterval(function() {
// my update here
}, the_interval);
However, I have several questions about this code:
where should I save it to make sure it is run?
will it slow my program or is it a problem to pull data out of twitter every five minute? Should I use their streaming API instead?
Note: I am using mongoDB
I'd suggest that you create a scheduled task/cron job/etc. (depending on your host OS) to call a separate Node.js application that performs the specific tasks you want to run periodically and then exits when complete. (Or you could potentially use a ChildProcess as well.)
While Node.js is async, there's no need, given the description you provided, to perform this work within the same application process that serves your web application. In fact, since it sounds like "busy work", it is best handled by a distinct process to avoid directly impacting any of your interactive web users.
The placement shouldn't really matter as Node is asynchronous.
