I am currently trying to read a 23 GB CSV file into MongoDB with Node.js, and I am using the bulkWrite function for this because I also want to do updates, etc.
My problem is that I call bulkWrite every X CSV entries in my loop and see steadily increasing memory consumption until the script eventually crashes with an out-of-memory exception.
I'm using node-mongodb-native in version 4.12.1 and also tried 4.10.
Here is my simplified script to test this problem.
import { MongoClient } from "mongodb";
import fs from "fs";
const uri = "mongodb://root:example@localhost:27017";
const client = new MongoClient(uri);
const database = client.db("test");
const collection = database.collection("data");
const readStream = fs.createReadStream("./files/large-data.csv", "utf-8");
var docs = [];
async function writeData(docs) {
  collection.bulkWrite(
    docs.map((doc) => ({
      insertOne: {
        document: { prop: doc },
      },
    })),
    {
      writeConcern: { w: 0, j: false },
      ordered: false,
    }
  );
}
readStream.on("data", async function (chunk) {
  docs.push(chunk.toString());
  if (docs.length >= 100) {
    var cloneData = JSON.parse(JSON.stringify(docs)); // https://jira.mongodb.org/browse/NODE-608
    docs = [];
    await writeData(cloneData);
  }
});
I tried a lot of things. Most importantly, if I comment out the await writeData(cloneData); line, the script runs with a stable memory consumption of about 100 MB, but if I use the function, memory consumption increases to multiple GB until it crashes.
I also tried to expose the garbage collector with --expose-gc and placed global.gc(); inside my if statement, but it didn't help.
So to me it looks like collection.bulkWrite stores some information somewhere that I need to clean up, but I can't find anything about it. It would be great if anyone has ideas or experience with what else I can try.
Use event-stream to split the input and process it line by line as a stream:
import fs from "fs"
import es from "event-stream"
fs.createReadStream(filename)
.pipe(es.split())
.pipe(es.parse())
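For what it's worth, here is a minimal sketch of how the split stream could be batched into bulkWrite while the source is paused during each write, so documents can't pile up in memory. This is untested; it assumes one document per CSV line and reuses the client and collection from the question.
import fs from "fs";
import es from "event-stream";

const stream = fs
  .createReadStream("./files/large-data.csv", "utf-8")
  .pipe(es.split()); // one line per "data" event

let batch = [];

stream.on("data", (line) => {
  if (!line) return; // skip empty trailing lines
  batch.push({ insertOne: { document: { prop: line } } });

  if (batch.length >= 1000) {
    const ops = batch;
    batch = [];
    stream.pause(); // stop reading until this batch is flushed
    collection
      .bulkWrite(ops, { ordered: false })
      .then(() => stream.resume())
      .catch((err) => {
        console.error(err);
        stream.destroy();
      });
  }
});

stream.on("end", async () => {
  if (batch.length > 0) {
    await collection.bulkWrite(batch, { ordered: false });
  }
  await client.close();
});
The pause/resume pair is the key point: it gives the pipeline backpressure, so only one batch is in flight at a time.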
I'm trying to create a scheduled function on Firebase Cloud Functions that uses third-party APIs. Because the data collected through the third-party API and passed to this scheduled function is huge, it fails with Function invocation was interrupted. Error: memory limit exceeded.
I have written the index.js below with some help, but I'm still looking for how to handle such large output data inside the scheduled function.
index.js
const firebaseAdmin = require("firebase-admin");
const firebaseFunctions = require("firebase-functions");
firebaseAdmin.initializeApp();
const fireStore = firebaseAdmin.firestore();
const express = require("express");
const axios = require("axios");
const cors = require("cors");
const serviceToken = "SERVICE-TOKEN";
const serviceBaseUrl = "https://api.service.com/";
const app = express();
app.use(cors());
const getAllExamples = async () => {
  var url = `${serviceBaseUrl}/examples?token=${serviceToken}`;
  var config = {
    method: "get",
    url: url,
    headers: {}
  };
  return axios(config).then((res) => {
    console.log("Data saved!");
    return res.data;
  }).catch((err) => {
    console.log("Data not saved: ", err);
    return err;
  });
}

const setExample = async (documentId, dataObject) => {
  return fireStore.collection("examples").doc(documentId).set(dataObject).then(() => {
    console.log("Document written!");
  }).catch((err) => {
    console.log("Document not written: ", err);
  });
}

module.exports.updateExamplesRoutinely = firebaseFunctions.pubsub.schedule("0 0 * * *").timeZone("America/Los_Angeles").onRun(async (context) => {
  const examples = await getAllExamples(); // This returns an object of large size, containing 10,000+ arrays
  const promises = [];
  for (var i = 0; i < examples.length; i++) {
    var example = examples[i];
    var exampleId = example["id"];
    if (exampleId && example) promises.push(setExample(exampleId, example));
  }
  return Promise.all(promises);
});
Firebase's official documentation only explains how to set the timeout and memory allocation manually, as shown below, and I'm looking for how to incorporate that into the scheduled function above.
exports.convertLargeFile = functions
  .runWith({
    // Ensure the function has enough memory and time
    // to process large files
    timeoutSeconds: 300,
    memory: "1GB",
  })
  .storage.object()
  .onFinalize((object) => {
    // Do some complicated things that take a lot of memory and time
  });
Firebase's official documentation only explains how to set the timeout and memory allocation manually, as shown below, and I'm looking for how to incorporate that into the scheduled function above.
You should do as follows:
module.exports.updateExamplesRoutinely = firebaseFunctions
  .runWith({
    timeoutSeconds: 540,
    memory: "8GB",
  })
  .pubsub
  .schedule("0 0 * * *")
  .timeZone("America/Los_Angeles")
  .onRun(async (context) => {...});
However, you may still encounter the same error if you process a huge number of "examples" in your Cloud Function. As you mentioned in the comment to the other answer, it is advisable to cut the work into chunks.
How to do that highly depends on your specific case (e.g. do you process 10,000+ examples on each run, or is that only going to happen once, in order to "digest" a backlog?).
You could process only a couple of thousand docs in the scheduled function and schedule it to run every xx seconds, or you could distribute the work among several instances by using Pub/Sub-triggered versions of your Cloud Function, as sketched below.
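As an illustration of the fan-out idea, here is a rough sketch only: the topic name "update-examples" and the chunk size are assumptions, not part of the original code, and it reuses getAllExamples and setExample from the question.
const { PubSub } = require("@google-cloud/pubsub");
const pubsub = new PubSub();
const CHUNK_SIZE = 500; // hypothetical batch size

// Scheduled function: only slices the payload and publishes the chunks.
module.exports.updateExamplesRoutinely = firebaseFunctions
  .runWith({ timeoutSeconds: 540, memory: "8GB" })
  .pubsub.schedule("0 0 * * *")
  .timeZone("America/Los_Angeles")
  .onRun(async () => {
    const examples = await getAllExamples();
    const publishes = [];
    for (let i = 0; i < examples.length; i += CHUNK_SIZE) {
      const chunk = examples.slice(i, i + CHUNK_SIZE);
      publishes.push(
        pubsub.topic("update-examples").publishMessage({ json: chunk })
      );
    }
    return Promise.all(publishes);
  });

// Worker function: each instance only writes its own chunk to Firestore.
module.exports.writeExamplesChunk = firebaseFunctions.pubsub
  .topic("update-examples")
  .onPublish(async (message) => {
    const chunk = message.json;
    return Promise.all(
      chunk.filter((e) => e && e.id).map((e) => setExample(e.id, e))
    );
  });
This way the heavy Firestore writes are spread across many small function invocations instead of one large one.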
I am trying to use pino for logging in my Node app server, and since I have some large logs coming in, rotating the files every day would be more practical.
So I used pino.multistream() and require('file-stream-rotator')
My code works, but for performance reasons, I would not like to use the streams in the main thread.
According to the docs, I should use pino.transport():
[pino.multistream()] differs from pino.transport() as all the streams will be executed within the main thread, i.e. the one that created the pino instance.
https://github.com/pinojs/pino/releases?page=2
However, I can't manage to combine pino.transport() and file-stream-rotator.
My code does not work completely: it logs the first entries, but the logger cannot be exported and used elsewhere because it blocks the script with the error
throw new Error('the worker has exited')
Main file
const pino = require('pino')
const transport = pino.transport({
target: './custom-transport.js'
})
const logger = pino(transport)
logger.level = 'info'
logger.info('Pino: Start Service Logging...')
module.exports = {
  logger
}
custom-transport.js file
const { once } = require('events')
const fileStreamRotator = require('file-stream-rotator')

const customTransport = async () => {
  const stream = fileStreamRotator.getStream({ filename: 'myfolder/custom-logger.log', frequency: 'daily' })
  await once(stream, 'open')
  return stream
}

module.exports = customTransport
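For reference, pino's documentation describes building custom transports with pino-abstract-transport rather than returning a raw stream. A minimal sketch of that pattern adapted to the question's file-stream-rotator options (untested, and not a confirmed fix for the worker error) might look like this:
// custom-transport.js - sketch using pino-abstract-transport (assumed to be installed)
const build = require('pino-abstract-transport')
const fileStreamRotator = require('file-stream-rotator')

module.exports = async function (opts) {
  // same rotation options as in the question
  const dest = fileStreamRotator.getStream({
    filename: 'myfolder/custom-logger.log',
    frequency: 'daily'
  })

  return build(async function (source) {
    // forward every log object to the rotating file as a JSON line
    for await (const obj of source) {
      dest.write(JSON.stringify(obj) + '\n')
    }
  }, {
    async close () {
      dest.end()
    }
  })
}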
First and foremost, I'm very new to this. I've been following the tutorials on the Discord.js site, with the goal of making a Discord bot for the play-by-post DnD server I'm in, where everyone wants to gain experience via word count.
I mention that I'm new to this because it's my first hands-on experience with JavaScript, so a lot of the terminology goes over my head.
The problem seems to be where I've broken away from the tutorial. It covers command handlers, which I want to stick with because they seem to be good practice and easier to work with down the line when something inevitably breaks (and I know it will). But the tutorial for databases (Currency/Sequelize) doesn't really touch on command handlers beyond "maintain references".
But that's enough foreword; the problem is getting a command that checks the database for a player's current experience points and level.
I have the seemingly relevant files organized with index.js and dbObjects.js together, a models folder for Users and LevelUp (CurrencyShop in the tutorial), and a separate folder for the commands, like the problematic one, xpcheck.js.
I can get the command to function without breaking by using the following:
const { Client, Collection, Formatters, Intents } = require('discord.js');
const { SlashCommandBuilder } = require('@discordjs/builders');

const experience = new Collection();
const level = new Collection();

Reflect.defineProperty(experience, 'getBalance', {
  /* eslint-disable-next-line func-name-matching */
  value: function getBalance(id) {
    const user = experience.get(id);
    return user ? user.balance : 0;
  },
});

Reflect.defineProperty(level, 'getBalance', {
  /* eslint-disable-next-line func-name-matching */
  value: function getBalance(id) {
    const user = level.get(id);
    return user ? user.balance : 1;
  },
});

module.exports = {
  data: new SlashCommandBuilder()
    .setName('xpcheck')
    .setDescription('Your current Experience and Level'),
  async execute(interaction) {
    const target = interaction.options.getUser('user') ?? interaction.user;
    return interaction.reply(`${target.tag} is level ${level.getBalance(target.id)} and has ${experience.getBalance(target.id)} experience.`);
  },
};
The problem is that the command doesn't reference the database. It returns default values (1st level, 0 exp) every time.
I tried getting the command to reference the database, one of many attempts being this one:
const { Client, Collection, Formatters, Intents } = require('discord.js');
const { SlashCommandBuilder } = require('@discordjs/builders');
const Sequelize = require('sequelize');
const { Users, LevelUp } = require('./DiscordBot/dbObjects.js');

module.exports = {
  data: new SlashCommandBuilder()
    .setName('xpcheck')
    .setDescription('Your current Experience and Level'),
  async execute(interaction) {
    const experience = new Collection();
    const level = new Collection();
    const target = interaction.options.getUser('user') ?? interaction.user;
    return interaction.reply(`${target.tag} is level ${level.getBalance(target.id)} and has ${experience.getBalance(target.id)} experience.`);
  },
};
However, when I run node deploy-commands.js, it throws
Error: Cannot find module './DiscordBot/dbObjects.js'
It does the same thing even if I remove /DiscordBot, or with any other path I've tried for the require. I'm really uncertain what I should do to fix this.
My file structure, for reference, is:
v DiscordBot
    v commands
        xpcheck.js
    v models
        LevelUp.js
        UserItems.js
        Users.js
    dbInit.js
    dbObjects.js
    deploy-commands.js
    index.js
As was pointed out in the comments, the problem was simple, the solution simpler.
Correcting
const { Users, LevelUp } = require('./dbObjects.js');
to
const { Users, LevelUp } = require('../dbObjects.js');
allows it to search the main directory for the requisite file.
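With the path fixed, the command still needs to read its values from the database rather than from freshly created, empty Collections. A rough sketch of what that lookup might look like (the user_id, balance and level column names are assumptions based on the currency tutorial the question follows, not something confirmed by the original code):
const { SlashCommandBuilder } = require('@discordjs/builders');
const { Users } = require('../dbObjects.js');

module.exports = {
  data: new SlashCommandBuilder()
    .setName('xpcheck')
    .setDescription('Your current Experience and Level'),
  async execute(interaction) {
    const target = interaction.options.getUser('user') ?? interaction.user;
    // look the user up in the database instead of an empty in-memory Collection
    const user = await Users.findOne({ where: { user_id: target.id } }); // hypothetical column name
    const experience = user ? user.balance : 0; // hypothetical column names
    const level = user ? user.level : 1;
    return interaction.reply(`${target.tag} is level ${level} and has ${experience} experience.`);
  },
};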
My scenario is to use PouchDB data in Ionic. I successfully added the PouchDB package to Ionic, created a sample, and it worked fine. Now I have the file below, 000003.log, which contains all the data. In Ionic, however, the data is stored in IndexedDB, so how can I take this 000003.log data and copy it into IndexedDB? Is there any way to copy the contents?
Below is my app code
import { Injectable } from '@angular/core';
import PouchDB from 'pouchdb';

@Injectable({
  providedIn: 'root'
})
export class DataService {
  private database: any;
  private myNotes: any;

  constructor() {
    this.database = new PouchDB('my-notes');
  }

  public addNote(theNote: string): Promise<string> {
    const promise = this.database
      .put({
        _id: ('note:' + (new Date()).getTime()),
        note: theNote
      })
      .then((result): string => (result.id));
    return (promise);
  }

  getMyNotes() {
    return new Promise(resolve => {
      let _self = this;
      this.database.allDocs({
        include_docs: true,
        attachments: true
      }).then(function (result) {
        // handle result
        _self.myNotes = result.rows;
        console.log("Results: " + JSON.stringify(_self.myNotes));
        resolve(_self.myNotes);
      }).catch(function (err) {
        console.log(err);
      });
    });
  }
}
How do I export/import the existing database in an Ionic app? Do I have to store it in the file system or in IndexedDB?
By default PouchDB will use IndexedDB, so it's doing it correctly. If you want to change the storage, you need to set up a different adapter.
I don't see where you set up the options for the local adapter, so I think you are missing the local & adapter setup options to support it.
From there, use whichever adapter you want PouchDB to use.
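For example, switching the local storage is just a matter of registering an adapter plugin and naming it in the constructor options. This is a sketch only; pouchdb-adapter-memory is just an illustrative adapter, not one the question asks for.
import PouchDB from 'pouchdb';
import PouchDBAdapterMemory from 'pouchdb-adapter-memory';

// register the adapter plugin, then select it by name per database
PouchDB.plugin(PouchDBAdapterMemory);
const db = new PouchDB('my-notes', { adapter: 'memory' });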
I've created an Ionic 5/Angular repo that demonstrates how to take a local pouchdb as described in the OP and load it as a default canned database in the app.
https://github.com/ramblin-rose/canned-pouch-db
The hurdles were not huge, but I encountered some problems along the way, mainly some wrangling with respect to pouchdb's es modules and module default exports.
Specifically, the documentation for pouchdb-replication-stream is not helpful for incorporating it into Ionic 5/Angular. I assumed the import
import ReplicationStream from 'pouchdb-replication-stream';
would just work, but unfortunately at runtime this dreaded error would pop up:
Type Error: Promise is not a constructor
Ouch! That's a showstopper. However, I came across the pouchdb-replication-stream issue about ES modules, which prompted the solution:
import ReplicationStream from 'pouchdb-replication-stream/dist/pouchdb.replication-stream.min.js';
Anyway the highlights of the repo are 'can-a-pouchdb.js' and 'data.service.ts'.
can-a-pouchdb.js
This script will create a local node pouchdb and then serialize that db to app/assets/db, which is later loaded by the ionic app.
The important bits of code:
// create some trivial docs
const docs = [];
const dt = new Date(2021, 6, 4, 12, 0, 0);
for (let i = 0; i < 10; i++, dt.setMinutes(dt.getMinutes() + i)) {
  docs[i] = {
    _id: "note:" + dt.getTime(),
    note: `Note number ${i}`,
  };
}

// always start clean - remove database dump file
fs.rmdirSync(dbPath, { recursive: true });

PouchDB.plugin(replicationStream.plugin);
PouchDB.adapter(
  "writableStream",
  replicationStream.adapters.writableStream
);

const db = new PouchDB(dbName);
console.log(JSON.stringify(docs));
await db.bulkDocs(docs);

//
// dump db to file.
//
fs.mkdirSync(dumpFileFolder, { recursive: true });
const ws = fs.createWriteStream(dumpFilePath);
await db.dump(ws);
To recreate the canned database, run the following from the command line:
$ node can-a-pouchdb.js
data.service.ts
Here's how the app's pouchdb is hydrated from the canned database. Note that the db uses the memory adapter because, as a demo app, not persisting the db is desirable.
public async init(): Promise<void> {
  if (this.db === undefined) {
    PouchDB.plugin(PouchdbAdapterMemory);
    PouchDB.plugin(ReplicationStream.plugin);
    this.db = new PouchDB(DataService.dbName, { adapter: 'memory' });

    // if the db is empty, hydrate it with the canned db assets/db
    const info = await this.db.info();
    if (info.doc_count === 0) {
      // load the asset into a string
      const cannedDbText = await this.http
        .get('/assets/db/mydb.dump.txt', {
          responseType: 'text',
        })
        .toPromise();
      // hydrate the db
      return (this.db as any).load(
        MemoryStream.createReadStream(cannedDbText)
      );
    }
  }
}
I'm relatively new to Cloud Functions and have been trying to solve this issue for a while. Essentially, the function I'm trying to write is called whenever there is a completed upload to Firebase Cloud Storage. However, about half the time the function runs, it fails with the following error:
The following error occured: { Error: ENOENT: no such file or directory, open '/tmp/dataprocessing/thisisthefilethatiswritten.zip'
errno: -2,
code: 'ENOENT',
syscall: 'open',
path: '/tmp/dataprocessing/thisisthefilethatiswritten.zip' }
Here's the code:
const functions = require('firebase-functions');
const admin = require('firebase-admin')
const inspect = require('util').inspect
const path = require('path');
const os = require('os');
const fs = require('fs-extra');
const firestore = admin.firestore()
const storage = admin.storage()
const runtimeOpts = {
  timeoutSeconds: 540,
  memory: '2GB'
}

const uploadprocessing = functions.runWith(runtimeOpts).storage.object().onFinalize(async (object) => {
  const filePath = object.name
  const fileBucket = object.bucket
  const bucket_fileName = path.basename(filePath);
  const uid = bucket_fileName.match('.+?(?=_)')
  const original_filename = bucket_fileName.split('_').pop()
  const bucket = storage.bucket(fileBucket);
  const workingDir = path.join(os.tmpdir(), 'dataprocessing/');
  const tempFilePath = path.join(workingDir, original_filename);

  await fs.ensureDir(workingDir)
  await bucket.file(filePath).download({ destination: tempFilePath })

  // this particular code block I included because I was worried that the file wasn't
  // being uploaded to the tmp directory, but the results of the function
  // seem to confirm to me that the file does exist.
  await fs.ensureFile(tempFilePath)
  console.log('success!')
  fs.readdirSync(workingDir).forEach(file => {
    console.log('file found: ', file);
  });
  console.log('post success')
  fs.readdirSync('/tmp/dataprocessing').forEach(file => {
    console.log('tmp file found: ', file);
  })

  fs.readFile(tempFilePath, function (err, buffer) {
    if (!err) {
      // data processing comes here. Please note that half the time it never gets into this
      // branch; instead it goes into the else below and outputs that error.
    } else {
      console.log("The following error occured: ", err);
    }
  })

  fs.unlinkSync(tempFilePath);
  return
})

module.exports = uploadprocessing;
I've been trying many different things, and the weird thing is that when I add code inside the if (!err) block (which doesn't actually run because of the error), it sometimes starts working quite consistently, but then stops working again when I add different code. I would have assumed the issue arises from the code I added, but the error also comes up when I merely change, add, or remove comments, which should technically have no effect on how the function runs.
Any thoughts? Thank you in advance!!! :)
fs.readFile is asynchronous and returns immediately; your callback function is invoked some time later with the contents of the buffer. This means fs.unlinkSync may delete the file while it is still being read: you effectively have a race condition, and it's possible the file will be removed before it's ever read.
Your code should wait until the read is complete before moving on to the delete. Perhaps you want to use fs.readFileSync instead.
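A minimal sketch of that synchronous variant, keeping the surrounding handler from the question unchanged (the data-processing part is still elided):
// read the file completely before deleting it, so the unlink cannot race the read
try {
  const buffer = fs.readFileSync(tempFilePath);
  // data processing on `buffer` comes here
} catch (err) {
  console.log("The following error occured: ", err);
} finally {
  fs.unlinkSync(tempFilePath);
}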