Cloud Function scheduler with memory allocation management - javascript

I'm trying to create a scheduled function on Firebase Cloud Functions that works with third-party APIs. Because the data collected through the third-party API and passed to this scheduled function is huge, it fails with "Function invocation was interrupted. Error: memory limit exceeded."
I have written the index.js below with some help, but I'm still looking for how the scheduled function should handle output data of this size.
index.js
const firebaseAdmin = require("firebase-admin");
const firebaseFunctions = require("firebase-functions");
firebaseAdmin.initializeApp();
const fireStore = firebaseAdmin.firestore();
const express = require("express");
const axios = require("axios");
const cors = require("cors");
const serviceToken = "SERVICE-TOKEN";
const serviceBaseUrl = "https://api.service.com/";
const app = express();
app.use(cors());
const getAllExamples = async () => {
  var url = `${serviceBaseUrl}/examples?token=${serviceToken}`;
  var config = {
    method: "get",
    url: url,
    headers: {}
  };
  return axios(config).then((res) => {
    console.log("Data saved!");
    return res.data;
  }).catch((err) => {
    console.log("Data not saved: ", err);
    return err;
  });
}

const setExample = async (documentId, dataObject) => {
  return fireStore.collection("examples").doc(documentId).set(dataObject).then(() => {
    console.log("Document written!");
  }).catch((err) => {
    console.log("Document not written: ", err);
  });
}

module.exports.updateExamplesRoutinely = firebaseFunctions.pubsub.schedule("0 0 * * *").timeZone("America/Los_Angeles").onRun(async (context) => {
  const examples = await getAllExamples(); // This returns an object of large size, containing 10,000+ arrays
  const promises = [];
  for (var i = 0; i < examples.length; i++) {
    var example = examples[i];
    var exampleId = example["id"];
    if (exampleId && example) promises.push(setExample(exampleId, example));
  }
  return Promise.all(promises);
});
Firebase's official documentation simply shows how to set the timeout and memory allocation manually, as below, and I'm looking for how to incorporate that into the scheduled function above.
exports.convertLargeFile = functions
  .runWith({
    // Ensure the function has enough memory and time
    // to process large files
    timeoutSeconds: 300,
    memory: "1GB",
  })
  .storage.object()
  .onFinalize((object) => {
    // Do some complicated things that take a lot of memory and time
  });

To apply the same runWith() options to your scheduled function, do as follows:
module.exports.updateExamplesRoutinely = firebaseFunctions
  .runWith({
    timeoutSeconds: 540,
    memory: "8GB",
  })
  .pubsub
  .schedule("0 0 * * *")
  .timeZone("America/Los_Angeles")
  .onRun(async (context) => {...});
However, you may still encounter the same error if you process a huge number of "examples" in your Cloud Function. As you mentioned in the comment to the other answer, it is advisable to cut the work into chunks.
How to do that? It highly depends on your specific case (e.g. do you process 10,000+ examples at each run, or is this only going to happen once, in order to "digest" a backlog?).
You could process only a couple of thousand docs per run of the scheduled function and schedule it to run every xx seconds, or you could distribute the work among several instances of the Cloud Function by using Pub/Sub-triggered versions of it. A chunked variant of the Firestore writes is sketched below.
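As one hedged illustration of the chunking idea (writeExamplesInChunks and CHUNK_SIZE are illustrative names, not part of the original code), the scheduled function could commit the documents with Firestore batched writes of at most 500 operations each (the per-batch limit), one batch at a time, reusing the fireStore handle from the question, instead of creating 10,000+ individual set() promises at once:
const CHUNK_SIZE = 500; // Firestore allows at most 500 operations per batched write

const writeExamplesInChunks = async (examples) => {
  for (let i = 0; i < examples.length; i += CHUNK_SIZE) {
    const chunk = examples.slice(i, i + CHUNK_SIZE);
    const batch = fireStore.batch();
    chunk.forEach((example) => {
      if (example && example.id) {
        batch.set(fireStore.collection("examples").doc(example.id), example);
      }
    });
    await batch.commit(); // one round trip per chunk, committed sequentially to cap memory use
  }
};
Inside onRun this would replace the Promise.all over individual setExample calls with return writeExamplesInChunks(examples);, so only one chunk is in flight at a time.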

Related

cron jobs on firebase cloud functions

I have two cloud functions:
One cloud function sets or updates an existing scheduled job.
The other cancels an existing scheduled job.
I am using import * as schedule from 'node-schedule'; to manage scheduling jobs.
The problem is that the createJob function is triggered and a jobId is returned, but later, when I trigger the cancelJob function, none of the previously scheduled cron jobs exist, because node-schedule keeps them in memory and I can't access the jobs:
This returns an empty object: const allJobs = schedule.scheduledJobs;
Does anyone have a solution for this situation?
UTILS: this is the main logic that is called when some of my cloud functions are triggered.
// sendgrid
import * as sgMail from '@sendgrid/mail';
import * as schedule from 'node-schedule';

sgMail.setApiKey(
  'apikey',
);

import {
  FROM_EMAIL,
  EMAIL_TEMPLATE_ID,
  MESSAGING_SERVICE_SID,
} from './constants';

export async function updateReminderCronJob(data: any) {
  try {
    const {
      to,
      ...
    } = data;
    const message = {
      to,
      from: FROM_EMAIL,
      templateId: EMAIL_TEMPLATE_ID,
    };
    const jobReferences: any[] = [];
    // Stop existing jobs
    if (jobIds && jobIds.length > 0) {
      jobIds.forEach((j: any) => {
        const job = schedule.scheduledJobs[j?.jobId];
        if (job) {
          job.cancel();
        }
      });
    }
    // Create new jobs
    timestamps.forEach((date: number) => {
      const job = schedule.scheduleJob(date, () => {
        if (selectedEmail) {
          sgMail.send(message);
        }
      });
      if (job) {
        jobReferences.push({
          jobId: job.name,
        });
      }
    });
    console.warn('jobReferences', jobReferences);
    return jobReferences;
  } catch (error) {
    console.error('Error updateReminderCronJob', error);
    return null;
  }
}

export async function cancelJobs(jobs: any) {
  const allJobs = schedule.scheduledJobs;
  jobs.forEach((job: any) => {
    if (!allJobs[job?.jobId]) {
      return;
    }
    allJobs[job.jobId].cancel();
  });
}
node-schedule will not work effectively in Cloud Functions because it requires that scheduling and execution all happen on a single machine that stays running without interruption. Cloud Functions does not fully support this behavior, as it dynamically scales the number of machines servicing requests up and down to zero (even if you set min instances to 1, it may still reduce your active instances to 0 in some cases). You will get unpredictable behavior if you try to schedule this way.
The only way to get reliable scheduling with Cloud Functions is with pub/sub functions, as described in the documentation. Firebase scheduled functions make this a bit easier by managing some of the details. You will not be able to dynamically control repeating jobs, so you will need to build some way to periodically run a job and check whether it should run at that moment, as sketched below.
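As a hedged illustration of that "periodically run and check" pattern (the reminders collection and its to / sendAt / sent fields are an assumed schema, not something from the question), the reminder times could be stored in Firestore, and a Firebase scheduled function could run every minute, send whatever is due, and mark it as sent. Cancelling a job then just means deleting or flagging its document instead of relying on node-schedule's in-memory state:
import * as functions from 'firebase-functions';
import * as admin from 'firebase-admin';
import * as sgMail from '@sendgrid/mail';
import { FROM_EMAIL, EMAIL_TEMPLATE_ID } from './constants';

admin.initializeApp();
sgMail.setApiKey('apikey');

export const processDueReminders = functions.pubsub
  .schedule('every 1 minutes')
  .onRun(async () => {
    const now = admin.firestore.Timestamp.now();
    // Reminders whose send time has passed and that have not been sent yet
    const due = await admin.firestore()
      .collection('reminders')
      .where('sent', '==', false)
      .where('sendAt', '<=', now)
      .get();

    await Promise.all(due.docs.map(async (doc) => {
      const { to } = doc.data();
      await sgMail.send({ to, from: FROM_EMAIL, templateId: EMAIL_TEMPLATE_ID });
      // The Firestore document replaces node-schedule's in-memory job list;
      // cancelling a reminder is just deleting or flagging its document.
      await doc.ref.update({ sent: true });
    }));
  });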

MongoDB bulkWrite in NodeJS causes a memory leak

I am currently trying to read a 23 GB CSV file into MongoDB with Node.js, and I am using the bulkWrite function because I also want to perform updates etc.
My problem is that I call bulkWrite in my loop every X CSV entries and see steadily increasing memory consumption, until the script eventually crashes with a memory exception.
I'm using node-mongodb-native in version 4.12.1 and also tried 4.10.
Here is my simplified script to test this problem.
import { MongoClient } from "mongodb";
import fs from "fs";

const uri = "mongodb://root:example@localhost:27017";
const client = new MongoClient(uri);
const database = client.db("test");
const collection = database.collection("data");
const readStream = fs.createReadStream("./files/large-data.csv", "utf-8");

var docs = [];

async function writeData(docs) {
  collection.bulkWrite(
    docs.map((doc) => ({
      insertOne: {
        prop: doc,
      },
    })),
    {
      writeConcern: { w: 0, j: false },
      ordered: false,
    }
  );
}

readStream.on("data", async function (chunk) {
  docs.push(chunk.toString());
  if (docs.length >= 100) {
    var cloneData = JSON.parse(JSON.stringify(docs)); // https://jira.mongodb.org/browse/NODE-608
    docs = [];
    await writeData(cloneData);
  }
});
I have tried a lot of things. Most importantly, if I comment out the await writeData(cloneData); line, the script runs with a stable memory consumption of about 100 MB, but if I use the function, memory consumption grows to multiple GB until it crashes.
I also tried exposing the garbage collector with --expose-gc and placed global.gc(); inside my if statement, but it didn't help.
So it looks to me like collection.bulkWrite stores some information somewhere that I need to clean up, but I can't find any information about it. It would be great if anyone has ideas or experience on what else I can try.
Use event-stream to split the file and process it as a stream:
import fs from "fs"
import es from "event-stream"
fs.createReadStream(filename)
  .pipe(es.split())
  .pipe(es.parse())
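Building on that idea, here is a hedged sketch of how the stream can be throttled so memory stays bounded: es.map() only pulls the next chunk after its callback fires, so awaiting the bulkWrite before calling back gives natural backpressure. The { raw: line } document shape and the batch size are illustrative; the connection details are taken from the question:
import fs from "fs";
import es from "event-stream";
import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://root:example@localhost:27017");
const collection = client.db("test").collection("data");

let batch = [];

fs.createReadStream("./files/large-data.csv", "utf-8")
  .pipe(es.split()) // one line per chunk
  .pipe(es.map((line, callback) => {
    batch.push({ insertOne: { document: { raw: line } } });
    if (batch.length < 100) return callback();
    const docs = batch;
    batch = [];
    collection.bulkWrite(docs, { ordered: false })
      .then(() => callback()) // only resume the stream once the write has finished
      .catch(callback);
  }))
  .on("end", () => console.log("import finished")); // a final partial batch would still need to be flushed here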

Why the following code takes more time locally?

I wrote the following lambda to move messages from queueA to queueB
import {
  SQSClient,
  GetQueueUrlCommand,
  ReceiveMessageCommand,
  SendMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs"; // client setup and imports were not shown in the original snippet

const sqs = new SQSClient({});

async function reprocess_messages(fromQueue, toQueue) {
  try {
    const response1 = await sqs.send(new GetQueueUrlCommand({ QueueName: fromQueue }));
    const response2 = await sqs.send(new GetQueueUrlCommand({ QueueName: toQueue }));
    const fromQueueUrl = response1.QueueUrl;
    const toQueueUrl = response2.QueueUrl;
    let completed = false;
    while (!completed) {
      completed = await moveMessage(toQueueUrl, fromQueueUrl);
      // console.log(status);
    }
    // console.log(completed);
    return completed;
  } catch (err) {
    console.error(err);
  }
}

async function moveMessage(toQueueUrl, fromQueueUrl) {
  try {
    const receiveMessageParams = {
      MaxNumberOfMessages: 10,
      MessageAttributeNames: ["Messsages"],
      QueueUrl: fromQueueUrl,
      VisibilityTimeout: 2,
      WaitTimeSeconds: 0,
    };
    const receiveData = await sqs.send(new ReceiveMessageCommand(receiveMessageParams));
    // console.log(receiveData);
    if (!receiveData.Messages) {
      console.log("finished");
      return true;
    }
    const messages = [];
    receiveData.Messages.forEach(msg => {
      messages.push({ body: msg["Body"], receiptHandle: msg["ReceiptHandle"] });
    });
    const sendMsg = async ({ body, receiptHandle }) => {
      const sendMessageParams = {
        MessageBody: body,
        QueueUrl: toQueueUrl
      };
      await sqs.send(new SendMessageCommand(sendMessageParams));
      // console.log("Success, message sent. MessageID: ", sentData.MessageId);
      return "Success";
    };
    const deleteMsg = async ({ body, receiptHandle }) => {
      const deleteMessageParams = {
        QueueUrl: fromQueueUrl,
        ReceiptHandle: receiptHandle
      };
      await sqs.send(new DeleteMessageCommand(deleteMessageParams));
      // console.log("Message deleted", deleteData);
      return "Deleted";
    };
    const sent = await Promise.all(messages.map(sendMsg));
    // console.log(sent);
    await Promise.all(messages.map(deleteMsg));
    // console.log(deleted);
    console.log(sent.length);
    return false;
  } catch (err) {
    console.log(err);
  }
}

export const handler = async function (event, context) {
  console.log("Invoking lambda");
  const response = await reprocess_messages("queueA", "queueB");
  console.log(response);
};
With lambda config of 256 MB it takes 19691ms and with 512 MB it takes 10171ms to move 1000 messages from queueA to queueB. However, on my local system when I run reprocess_messages(qA, qB) it takes around 2 mins to move messages from queueA to queueB.
Does this mean that if I increase the memory limit to 1024 MB it will take only around 5000 ms? And how can I find the optimal memory limit?
It will most likely always be the case that running code on your local machine to interact with AWS services will be slower than if you were to run this code on an AWS service like Lambda. When your code is running on Lambda and interacting with AWS services in the same region, the latency is often only from the AWS network which can be drastically lower than the latency between your network and the AWS region you're working with.
In terms of finding the optimal performance, this is trial and error. You are trying to find the sweet spot between price and performance. There are tools like compute optimizer that can assist you with this.
A useful note to bear in mind here is that as you increase Lambda memory you also get more vCPU capacity: roughly every additional 1,720 MB gives you another vCPU core. So while your current performance increase comes purely from the memory increase, moving to a size that grants an additional vCPU core may have a greater effect. Unfortunately it is difficult to predict the results, and it is best to simply trial the different scenarios.
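Separately from memory tuning, since each sqs.send() in the loop above is its own network round trip, the SQS batch APIs can reduce the number of calls per direction. This is a hedged sketch, not part of the original post: SendMessageBatchCommand and DeleteMessageBatchCommand accept up to 10 entries, which matches the MaxNumberOfMessages: 10 used above (moveBatch and the entry ids are illustrative names):
import {
  SQSClient,
  SendMessageBatchCommand,
  DeleteMessageBatchCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Forward up to 10 messages in one call, then delete them in one call
async function moveBatch(messages, fromQueueUrl, toQueueUrl) {
  await sqs.send(new SendMessageBatchCommand({
    QueueUrl: toQueueUrl,
    Entries: messages.map((msg, i) => ({ Id: `m${i}`, MessageBody: msg.body })),
  }));
  await sqs.send(new DeleteMessageBatchCommand({
    QueueUrl: fromQueueUrl,
    Entries: messages.map((msg, i) => ({ Id: `m${i}`, ReceiptHandle: msg.receiptHandle })),
  }));
}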

Error: ENOENT: no such file or directory even when file exists in Firebase Cloud Functions

I'm relatively new to Cloud Functions and have been trying to solve this issue for a while. Essentially, the function I'm trying to write is called whenever an upload to Firebase Cloud Storage completes. However, about half the time the function runs, it fails with the following error:
The following error occured: { Error: ENOENT: no such file or directory, open '/tmp/dataprocessing/thisisthefilethatiswritten.zip'
errno: -2,
code: 'ENOENT',
syscall: 'open',
path: '/tmp/dataprocessing/thisisthefilethatiswritten.zip' }
Here's the code:
const functions = require('firebase-functions');
const admin = require('firebase-admin')
const inspect = require('util').inspect
const path = require('path');
const os = require('os');
const fs = require('fs-extra');

const firestore = admin.firestore()
const storage = admin.storage()

const runtimeOpts = {
  timeoutSeconds: 540,
  memory: '2GB'
}

const uploadprocessing = functions.runWith(runtimeOpts).storage.object().onFinalize(async (object) => {
  const filePath = object.name
  const fileBucket = object.bucket
  const bucket_fileName = path.basename(filePath);
  const uid = bucket_fileName.match('.+?(?=_)')
  const original_filename = bucket_fileName.split('_').pop()
  const bucket = storage.bucket(fileBucket);
  const workingDir = path.join(os.tmpdir(), 'dataprocessing/');
  const tempFilePath = path.join(workingDir, original_filename);

  await fs.ensureDir(workingDir)
  await bucket.file(filePath).download({destination: tempFilePath})

  //this particular code block I included because I was worried that the file wasn't
  //being uploaded to the tmp directly, but the results of the function
  //seems to confirm to me that the file does exist.
  await fs.ensureFile(tempFilePath)
  console.log('success!')
  fs.readdirSync(workingDir).forEach(file => {
    console.log('file found: ', file);
  });
  console.log('post success')
  fs.readdirSync('/tmp/dataprocessing').forEach(file => {
    console.log('tmp file found: ', file);
  })

  fs.readFile(tempFilePath, function (err, buffer) {
    if (!err) {
      //data processing comes here. Please note that half the time it never gets into this
      //loop as instead it goes into the else function below and outputs that error.
    }
    else {
      console.log("The following error occured: ", err);
    }
  })

  fs.unlinkSync(tempFilePath);
  return
})

module.exports = uploadprocessing;
I've been trying so many different things, and the weird thing is that when I add code inside the if (!err) branch (which doesn't actually run because of the err), it arbitrarily starts working, sometimes quite consistently, but then stops working again when I add different code. I would have assumed the issue came from the code I added, but the error also shows up when I merely change, add, or remove comments, which should technically have no effect on how the function runs...
Any thoughts? Thank you in advance!!! :)
fs.readFile is asynchronous and returns immediately; your callback function is invoked some time later with the contents of the file. That means fs.unlinkSync is going to delete the file at the same time it's being read, so you effectively have a race condition, and the file may be removed before it's ever read.
Your code should wait until the read is complete before moving on to the delete, for example by using fs.readFileSync or an awaited promise-based read, as sketched below.
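A minimal sketch of that fix, reusing the handler setup from the question (fs-extra's readFile returns a promise when called without a callback), so the delete can only happen after the read and any processing have completed:
const uploadprocessing = functions.runWith(runtimeOpts).storage.object().onFinalize(async (object) => {
  // ...same setup and download to tempFilePath as in the original code...
  const buffer = await fs.readFile(tempFilePath); // resolves once the whole file has been read
  // data processing on buffer goes here
  await fs.unlink(tempFilePath); // the file is only removed after reading/processing has finished
  return null;
});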

Web scraping with Nightmare Cloud function works locally, but not when deployed

I am trying to scrape a JavaScript calendar and return a JSON array of its events from a Google Cloud Function (Blaze plan). The function below works, but only when run locally through the Firebase emulators. It deploys successfully, but every call results in a timeout, and no errors are thrown in the logs. Both the local and deployed functions run on Node.js 10. (EDIT: I found this article mentioning that xvfb is required to use Nightmare without a display, but I'm not sure how I would add that to Firebase, or even install it.)
const functions = require('firebase-functions');
const Nightmare = require('nightmare'); //latest version

const retrieveEventsOpts = { memory: "2GB", timeoutSeconds: 60 };

exports.retrieveEventsArray = functions.runWith(retrieveEventsOpts).https.onRequest(async (request, response) => {
  const nightmare = Nightmare({ show: false })
  try {
    await nightmare
      .goto('https://www.csbcsaints.org/calendar')
      .evaluate(() => document.querySelector('body').innerHTML)
      .then(firstResponse => {
        let responseJSON = parseHTMLForEvents(firstResponse) // Just a function that synchronously parses the HTML string to a JSON array
        return response.status(200).json(responseJSON)
      }).catch(error => {
        return response.status(500).json(error)
      })
  } catch (error) {
    return response.status(500).json(error)
  }
})
