I am currently trying to build a data pipeline using Node.js.
It is probably not the best way to build one, but I want to try implementing it anyway before I improve on it.
This is the situation:
I have multiple gzip-compressed CSV files on AWS S3. I get these "objects" using the AWS SDK
as follows and turn each one into a read stream:
const unzip = createGunzip()
const input = s3.getObject(parameterWithBucketandKey)
.createReadStream()
.pipe(unzip)
and using the stream above I create a readline interface:
const targetFile = createWriteStream('path to target file');
const rl = createInterface({
input: input
})
let first = true;
rl.on('line', (line) => {
if(first) {
first = false;
return;
}
targetFile.write(line);
await getstats_and_fetch_filesize();
if(filesize > allowed_size){
changed_file_name = change_the_name_of_file()
compress(changed_file_name)
}
});
and this is wrapped in a promise (promiseFileProcess below).
I have an array of filenames to retrieve from AWS S3, and I map over that array like this:
const arrayOfFileNames = [name1, name2, name3 ... and 5000 more]
const arrayOfPromiseFileProcesses = arrayOfFileNames.map((filename) => promiseFileProcess(filename))
await Promise.all(arrayOfPromiseFileProcesses);
// the result should be multiple gzip files that are compressed again.
Sorry that I wrote this in pseudocode. If more context is needed I will add more, but I thought this would give a general idea of my problem.
My problem is that it writes to a file fine, but when I change the file name, it doesn't create a new file afterwards. I am lost in this synchronous and asynchronous world...
Please give me a hint or a reference to read up on. Thank you.
The line event handler must be an async function, because it uses await:
rl.on('line', async(line) => {
if(first) {
first = false;
return;
}
targetFile.write(line);
await getstats_and_fetch_filesize();
if(filesize > allowed_size){
changed_file_name = change_the_name_of_file()
compress(changed_file_name)
}
});
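One caveat with the async handler: an EventEmitter does not wait for the promise that an async listener returns, so readline keeps emitting 'line' events while an earlier handler is still awaiting. A minimal sketch of an alternative, assuming the goal is to finish the asynchronous work for one line before the next line is handled: iterate the readline interface with for await. The names processLines and handleLine are hypothetical, not from the question.

```javascript
const { createInterface } = require('readline');

// Reads `input` line by line, skips the header row, and awaits
// `handleLine` for every remaining line before moving on to the next.
// Returns the number of lines that were handled.
async function processLines(input, handleLine) {
  const rl = createInterface({ input, crlfDelay: Infinity });
  let first = true;
  let count = 0;
  for await (const line of rl) {
    if (first) {
      first = false; // the first line is the CSV header
      continue;
    }
    await handleLine(line); // the next line is not handled until this settles
    count++;
  }
  return count;
}
```

The `input` stream from the question (the S3 read stream piped through gunzip) could be passed as the first argument, and the write, size check, and rotation logic would go inside handleLine.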
I have run into what people call "callback hell". It was the only way I could get a file from a server to my VPS, edit it, and upload it again. The process is simple:
Download a .json file from the FTP server
Edit the .json file on the PC
Upload the .json file and delete the PC's copy.
However, my problem is this: although it downloads only once, it repeats the upload as many times as I have run the command during one session (command #1 does it once, command #2 does it twice, etc.).
I tried to write it in an imperative style, but that gets nullified. I had to resort to callback hell to get the code to run almost properly. The trigger initializes the command correctly, but the command and the session then misbehave.
(( //declaring my variables as parameters
ftp=new (require('ftp'))(),
fs=require('fs'),
serverFolder='./Path/Of/Server/',
localFolder='./Path/Of/Local/',
file='some.json',
{log}=console
)=>{
//run server if its ready
ftp.on('ready',()=>{
//collect a list of files from the server folder
ftp.list(serverFolder+file,(errList,list)=>
errList|| typeof list === 'object' &&
list.forEach($file=>
//if the individual file matches, resume to download the file
$file.name===file&&(
ftp.get(serverFolder+file,(errGet,stream)=>
errGet||(
log('files matched! carry on with the operation...'),
stream.pipe(fs.createWriteStream(localFolder+file)),
stream.once('close',()=>{
//check if the file has a proper size
fs.stat(localFolder+file,(errStat,stat)=>
errStat || stat.size === 0
//will destroy server connection if bytes = 0
?(ftp.destroy(),log('the file has no value'))
//uploads if the file has a size, edits, and ships
:(editThisFile(),
ftp.put(
fs.createReadStream(localFolder+file),
serverFolder+file,err=>err||(
ftp.end(),log('process is complete!')
))
//editThisFile() is a place-holder editor
//edits by path, and object
)
})
)
)
)
)
);
});
ftp.connect({
host:'localHost',
password:'1Forrest1!',
port:'21',
keepalive:0,
debug: console.log.bind(console)
});
})()
The main problem is: it repeats the command over and over, as if it were carried over from earlier runs, for some reason.
Edit: although my "programming style" differs from the common convention, it all leads to the same issue of callback hell. Any recommendations are welcome.
For readability, I had help editing my code to make it easier to follow (see the "Better Readability version").
The ftp module's API leads to callback hell. It also hasn't been maintained for a while and is buggy. Try a module with promises, like basic-ftp.
With promises the code flow becomes much easier to reason about, and errors don't require specific handling unless you want it.
const ftp = require('basic-ftp')
const fsp = require('fs').promises

async function updateFile(localFile, serverFile){
  const client = new ftp.Client()
  try {
    await client.access({
      host: 'localHost',
      password: '1Forrest1!',
    })
    await client.downloadTo(localFile, serverFile)
    const stat = await fsp.stat(localFile)
    if (stat.size === 0) throw new Error('File has no size')
    // editThisFile() is the placeholder editor from the question
    await editThisFile(localFile)
    await client.uploadFrom(localFile, serverFile)
  } finally {
    // always release the connection, even when a step throws
    client.close()
  }
}

// trailing slashes so that folder + file forms a valid path
const serverFolder = './Path/Of/Server/'
const localFolder = './Path/Of/Local/'
const file = 'some.json'
updateFile(localFolder + file, serverFolder + file).catch(console.error)
I have to fetch about 30 files with ES6; each of them consists of about 100 MB of text lines.
I parse the text line by line, counting some data points. The result is a small array like
[{"2014":34,"2015":34,"2016":34,"2017":34,"2018":12}]
I'm running into memory problems while parsing the files (Chrome simply crashes the debugger), probably because I am parsing them all with map:
return Promise.all(filenamesArray.map( /*fetch each file in filenamesArray */ )).
then(() => { /*parse them all */ })
I'm not posting the full code because I know it's wrong anyway. What I would like to do is:
1. Load a single file with fetch
2. Parse its text into a result array such as the one above
3. Return the result array and store it somewhere until every file has been parsed
4. Give the JS engine / GC enough time to clear the text from step 1 from memory
5. Load the next file (continue with 1, but only after steps 1-4 are finished!).
but I can't seem to find a solution for that. Could anyone show me an example?
I don't care if its promises, callback functions, async/await...as long as each file is parsed completely before the next one is started.
EDIT 2020825
Sorry for my late update, I only came around fixing my problem now.
While I appreciate Josh Lind's answer, I realized that I still had a problem with the async nature of fetch, which I apparently did not describe well enough:
How do I deal with promises to make sure one file is finished and its memory can be released? I implemented Josh's solution with Promise.all, only to discover that this would still load all files first and then start processing them.
Luckily, I found another SO question with almost the same problem:
Resolve promises one after another (i.e. in sequence)?
and so I learned about async functions. In order to use them with fetch, this question helped me:
How to use fetch with async/await?
So my final code looks like this:
//returns a promise resolving with an array of all processed files
loadAndCountFiles(filenamesArray) {
    async function readFiles(filenamesArray) {
        let resultArray = [];
        for (const filename of filenamesArray) {
            const response = await fetch(filename);
            const text = await response.text();
            //process the text and return a much smaller extract
            const yearCountObject = processText(text);
            resultArray.push({
                filename: filename,
                yearCountObject: yearCountObject
            });
            console.log("processed file " + filename);
        }
        return resultArray;
    }
    console.log("starting filecount...");
    //readFiles already returns a promise, so no extra new Promise wrapper is needed
    return readFiles(filenamesArray).then(resultArray => {
        console.log("done: " + resultArray);
        return resultArray;
    });
}
Now every file is fetched and processed before the next.
Global variable:
dictionary = {};
In main:
fileNamesArray.forEach(fname => readFile(fname));
Functions:
const readFile = (fname) => {
/* get file */.then(file => {
/* parse file */
addToDict(year); // year is a string. Call this when you find a year
})
}
const addToDict = (key) => {
if (dictionary[key]) dictionary[key]++;
else dictionary[key] = 1;
}
I'm trying to write into a text file, but not at the end like appendFile() does, and not by replacing the entire content...
I saw that it is possible to choose where to start writing with the start option of fs.createWriteStream() -> https://nodejs.org/api/fs.html#fs_fs_createwritestream_path_options
But there is no option to say where to stop writing, right? So it removes the rest of my file after what I wrote with this function.
const fs = require('fs');
var logger = fs.createWriteStream('result.csv', {
flags: 'r+',
start: 20 // start writing at byte offset 20
})
logger.write('5258,525,98951,0,1\n') //example a new line to write
Is there a way to specify where to stop writing in the file, to get something like:
....
data from beginning
....
5258,525,98951,0,1
...
data till the end
...
I suspect you mean, "Is it possible to insert in the middle of the file." The answer to that is: No, it isn't.
Instead, to insert, you have to:
1. Determine how big what you're inserting is
2. Copy the data at your insertion point to that many bytes later in the file
3. Write your data
Obviously when doing #2 you need to be sure that you're not overwriting data you haven't copied yet (either by reading it all into memory first or by working in blocks, from the end of the file toward the insertion point).
(I've never looked for one, but there may be an npm module out there that does this for you...)
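The steps above can be sketched as follows. This is a minimal sketch, assuming the tail of the file is small enough to hold in memory (the "read it all into memory first" variant) and that the insertion point is a byte offset, not a character index. The function name insertIntoFile is hypothetical.

```javascript
const fs = require('fs');

// Inserts `text` at byte offset `position` without losing the data
// that follows the insertion point.
function insertIntoFile(filePath, position, text) {
  const insert = Buffer.from(text, 'utf8'); // step 1: size of the insertion
  const fd = fs.openSync(filePath, 'r+');
  try {
    const { size } = fs.fstatSync(fd);
    if (position < 0 || position > size) {
      throw new RangeError(`position ${position} is outside the file (size ${size})`);
    }
    // Step 2a: read everything after the insertion point into memory.
    const tail = Buffer.alloc(size - position);
    let read = 0;
    while (read < tail.length) {
      const n = fs.readSync(fd, tail, read, tail.length - read, position + read);
      if (n === 0) break; // unexpected end of file
      read += n;
    }
    // Step 3: write the new data at the insertion point.
    fs.writeSync(fd, insert, 0, insert.length, position);
    // Step 2b: write the saved tail back, shifted by the inserted length.
    if (tail.length > 0) {
      fs.writeSync(fd, tail, 0, tail.length, position + insert.length);
    }
  } finally {
    fs.closeSync(fd);
  }
}
```

For files whose tail does not fit in memory, the same idea would have to work in blocks from the end of the file toward the insertion point, as described above.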
You could read and parse the file first, then apply the modifications and save the new file.
Something like:
const fs = require("fs");
const fileData = fs.readFileSync("result.csv", { encoding: "utf8" });
const fileDataArray = fileData.split("\n");
const newData = "5258,525,98951,0,1";
const index = 2; // row index at which to insert your data
fileDataArray.splice(index, 0, newData); // insert data into the array
const newFileData = fileDataArray.join("\n"); // create the new file
fs.writeFileSync("result.csv", newFileData, { encoding: "utf8" }); // save it
I have complex CPU intensive work I want to do on a large array. Ideally, I'd like to pass this to the child process.
var spawn = require('child_process').spawn;
// dataAsNumbers is a large 2D array
var child = spawn(process.execPath, ['/child_process_scripts/getStatistics', dataAsNumbers]);
child.stdout.on('data', function(data){
console.log('from child: ', data.toString());
});
But when I do, node gives the error:
spawn E2BIG
I came across this article
So piping the data to the child process seems to be the way to go. My code is now:
var spawn = require('child_process').spawn;
console.log('creating child........................');
var options = { stdio: [null, null, null, 'pipe'] };
var args = [ '/getStatistics' ];
var child = spawn(process.execPath, args, options);
var pipe = child.stdio[3];
pipe.write(Buffer.from('awesome'));
child.stdout.on('data', function(data){
console.log('from child: ', data.toString());
});
And then in getStatistics.js:
console.log('im inside child');
process.stdin.on('data', function(data) {
console.log('data is ', data);
process.exit(0);
});
However the callback in process.stdin.on isn't reached. How can I receive a stream in my child script?
EDIT
I had to abandon the buffer approach. Now I'm sending the array as a message:
var cp = require('child_process');
var child = cp.fork('/getStatistics.js');
child.send({
dataAsNumbers: dataAsNumbers
});
But this only works when the length of dataAsNumbers is below about 20,000, otherwise it times out.
With such a massive amount of data, I would look into using shared memory rather than copying the data into the child process (which is what is happening when you use a pipe or pass messages). This will save memory, take less CPU time for the parent process, and be unlikely to bump into some limit.
shm-typed-array is a very simple module that seems suited to your application. Example:
parent.js
"use strict";
const shm = require('shm-typed-array');
const fork = require('child_process').fork;
// Create shared memory
const SIZE = 20000000;
const data = shm.create(SIZE, 'Float64Array');
// Fill with dummy data
Array.prototype.fill.call(data, 1);
// Spawn child, set up communication, and give shared memory
const child = fork("child.js");
child.on('message', sum => {
console.log(`Got answer: ${sum}`);
// Demo only; ideally you'd re-use the same child
child.kill();
});
child.send(data.key);
child.js
"use strict";
const shm = require('shm-typed-array');
process.on('message', key => {
// Get access to shared memory
const data = shm.get(key, 'Float64Array');
// Perform processing
const sum = Array.prototype.reduce.call(data, (a, b) => a + b, 0);
// Return processed data
process.send(sum);
});
Note that we are only sending a small "key" from the parent to the child process through IPC, not the whole data. Thus, we save a ton of memory and time.
Of course, you can change 'Float64Array' (i.e. doubles) to whatever typed array your application requires. Note that this library in particular only handles one-dimensional typed arrays, but that should only be a minor obstacle.
I too was able to reproduce the delay you were experiencing, though maybe not as badly as you. I used the following:
// main.js
const fork = require('child_process').fork
const child = fork('./getStats.js')
const dataAsNumbers = Array(100000).fill(0).map(() =>
Array(100).fill(0).map(() => Math.round(Math.random() * 100)))
child.send({
dataAsNumbers: dataAsNumbers,
})
And
// getStats.js
process.on('message', function (data) {
console.log('data is ', data)
process.exit(0)
})
node main.js 2.72s user 0.45s system 103% cpu 3.045 total
I'm generating 100k elements, each composed of 100 numbers, to mock your data. Make sure you are using the message event on process. Your child processes may be more complex, which might be the reason for the failure; it also depends on the timeout you set on your query.
If you want better results, you could chunk your data into multiple pieces, send them to the child process one by one, and reassemble them there to form the initial array.
Another possibility would be to use a third-party library or protocol, even if it's a bit more work. You could have a look at messenger.js, or even something like an AMQP queue, which would let the two processes communicate through a pool with a guarantee that each message has been acknowledged by the subprocess. There are a few Node implementations of it, like amqp.node, but it would still require a bit of setup and configuration work.
Use an in-memory cache like https://github.com/ptarjan/node-cache and let the parent process store the array contents under some key; the child process would retrieve the contents through that key.
You could consider using OS pipes as the input to your Node child application; you'll find a gist here.
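For the pipe approach, note that in the question's second attempt the parent wrote to child.stdio[3] (file descriptor 3 in the child) while the child listened on process.stdin (file descriptor 0), so the child's callback could never fire. A minimal sketch, assuming the data can be serialized as JSON and the child only needs to return a single number: write to child.stdin instead. The child script is inlined with `node -e` purely to keep the example self-contained, and sumInChild is a hypothetical name.

```javascript
const { spawn } = require('child_process');

// Hypothetical child program: reads JSON from stdin, prints the sum.
const CHILD_SOURCE = `
  let raw = '';
  process.stdin.setEncoding('utf8');
  process.stdin.on('data', (chunk) => { raw += chunk; });
  process.stdin.on('end', () => {
    const rows = JSON.parse(raw);
    let sum = 0;
    for (const row of rows) for (const n of row) sum += n;
    process.stdout.write(String(sum));
  });
`;

// Sends a 2D array to a child process over its stdin and resolves
// with the number the child prints on stdout.
function sumInChild(dataAsNumbers) {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ['-e', CHILD_SOURCE], {
      stdio: ['pipe', 'pipe', 'inherit'],
    });
    let output = '';
    child.stdout.setEncoding('utf8');
    child.stdout.on('data', (chunk) => { output += chunk; });
    child.on('error', reject);
    child.stdin.on('error', reject); // e.g. EPIPE if the child dies early
    child.on('close', (code) => {
      if (code === 0) resolve(Number(output));
      else reject(new Error(`child exited with code ${code}`));
    });
    // The data travels through the pipe, not through argv, so E2BIG cannot occur.
    child.stdin.end(JSON.stringify(dataAsNumbers));
  });
}
```

The data is still copied and parsed in the child, so this avoids the argument-size limit but not the serialization cost.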
I know this is not exactly what you're asking for, but you could use the cluster module (included in Node). This way you can run as many instances as your machine has cores to speed up processing. Also consider using streams if you don't need all the data to be available before you start processing. If the data to be processed is too large, I would store it in a file so you can reinitialize if there is an error during the process.
Here is an example of clustering:
var cluster = require('cluster');
var numCPUs = 4;
if (cluster.isMaster) {
for (var i = 0; i < numCPUs; i++) {
var worker = cluster.fork();
console.log('id', worker.id)
}
} else {
doSomeWork()
}
function doSomeWork(){
for (var i=1; i<10; i++){
console.log(i)
}
}
More info on sending messages across workers: question 8534462.
Why do you want to make a subprocess? Sending the data across subprocesses is likely to cost more, in CPU and real time, than you will save compared with doing the processing within the same process.
Instead, I would suggest that, for efficient code, you consider doing your statistics calculations in a worker thread that runs within the same memory as the Node.js main process.
You can use NAN (Native Abstractions for Node.js) to write C++ code that you post to a worker thread, and then have that worker thread post the result and an event back to your Node.js event loop when done.
The benefit is that you don't need extra time to send the data across to a different process. The downside is that you will have to write a bit of C++ code for the threaded action, but NAN should take care of most of the difficult work for you.
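As a different route to the same idea, recent Node.js versions ship a built-in worker_threads module, which allows a JavaScript worker thread without any C++. This is not the NAN approach described above, only a minimal sketch of in-process threading, assuming the data can be flattened into a Float64Array. The worker source is inlined with `eval: true` to keep the example self-contained, and sumInWorker is a hypothetical name.

```javascript
const { Worker } = require('worker_threads');

// Hypothetical worker program: sums the shared array and posts the result.
const WORKER_SOURCE = `
  const { parentPort, workerData } = require('worker_threads');
  const values = new Float64Array(workerData.shared);
  let sum = 0;
  for (let i = 0; i < values.length; i++) sum += values[i];
  parentPort.postMessage(sum);
`;

// Copies `values` once into a SharedArrayBuffer and lets a worker thread
// read that same memory; only the small result is sent back.
function sumInWorker(values) {
  const shared = new SharedArrayBuffer(values.length * Float64Array.BYTES_PER_ELEMENT);
  new Float64Array(shared).set(values);
  return new Promise((resolve, reject) => {
    const worker = new Worker(WORKER_SOURCE, {
      eval: true,
      workerData: { shared }, // a SharedArrayBuffer is shared, not cloned
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}
```

The main thread's event loop stays free while the worker runs, which is the same benefit the C++ worker approach aims for.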
To address the performance issue when passing large data to the child process, save the data to a .json or .txt file and pass only the filename to the child process. I achieved a 70% performance improvement with this approach.
For long-running tasks you could use something like Gearman and do the heavy work in workers. That way you can decide how many workers you need. For example, I do some file processing this way: if I need to scale, I create more worker instances. I also have different workers for different tasks (processing zip files, generating thumbnails, etc.). The good thing is that the workers can be written in any language (Node.js, Java, Python) and can be integrated into your project with ease.
// worker-unzip.js
const debug = require('debug')('worker:unzip');
const {series, apply} = require('async');
const gearman = require('gearmanode');
const {mkdirpSync} = require('fs-extra');
const extract = require('extract-zip');
module.exports.unzip = unzip;
module.exports.worker = worker;
function unzip(inputPath, outputDirPath, done) {
debug('unzipping', inputPath, 'to', outputDirPath);
mkdirpSync(outputDirPath);
extract(inputPath, {dir: outputDirPath}, done);
}
/**
 * Gearman job handler: unzips the archive described by the job payload.
 * @param {Job} job
 */
function workerUnzip(job) {
const {inputPath, outputDirPath} = JSON.parse(job.payload);
series([
apply(unzip, inputPath, outputDirPath),
(done) => job.workComplete(outputDirPath)
], (err) => {
if (err) {
console.error(err);
job.reportError();
}
});
}
function worker(config) {
const worker = gearman.worker(config);
if (config.id) {
worker.setWorkerId(config.id);
}
worker.addFunction('unzip', workerUnzip, {timeout: 10, toStringEncoding: 'ascii'});
worker.on('error', (err) => console.error(err));
return worker;
}
a simple index.js
const unzip = require('./worker-unzip').worker;
unzip(config); // pass host and port of the Gearman server
I normally run workers with PM2
Integration with your code is very easy. Something like:
//initialize
const gearman = require('gearmanode');
gearman.Client.logger.transports.console.level = 'error';
const client = gearman.client(configGearman); // same host and port
Then just add work to the queue, passing the name of the function:
const taskpayload = {inputPath: '/tmp/sample-file.zip', outputDirPath: '/tmp/unzip/sample-file/'}
const job = client.submitJob('unzip', JSON.stringify(taskpayload));
job.on('complete', jobCompleteCallback);
job.on('error', jobErrorCallback);
Suppose I have a readable stream, e.g. request(URL). And I want to write its response on the disk via fs.createWriteStream() and piping with the request. But at the same time I want to calculate a checksum of the downloading data via crypto.createHash() stream.
readable -+-> calc checksum
|
+-> write to disk
And I want to do it on the fly, without buffering an entire response in memory.
It seems that I can implement it using oldschool on('data') hook. Pseudocode below:
const hashStream = crypto.createHash('sha256');
hashStream.on('error', cleanup);
const dst = fs.createWriteStream('...');
dst.on('error', cleanup);
request(...).on('data', (chunk) => {
hashStream.write(chunk);
dst.write(chunk);
}).on('end', () => {
hashStream.end();
const checksum = hashStream.read();
if (checksum != '...') {
cleanup();
} else {
dst.end();
}
}).on('error', cleanup);
function cleanup() { /* cancel streams, erase file */ };
But this approach looks pretty awkward. I tried to use stream.Transform or stream.Writable to implement something like read | calc + echo | write, but I'm stuck on the implementation.
Node.js readable streams have a .pipe method which works pretty much like the unix pipe-operator, except that you can stream js objects as well as just strings of some type.
Here's a link to the doc on pipe
An example of the use in your case could be something like:
const req = request(...);
req.pipe(dst);
req.pipe(hash);
Note that you still have to handle errors per stream as they're not propagated and the destinations are not closed if the readable errors.
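The read | calc + echo | write shape from the question can also be sketched with a small Transform placed between the source and the file. This is a variant of the two-pipe answer above, not the same technique: it uses pipeline from stream/promises (available in recent Node.js versions) so that an error in any stream destroys the others. The names hashingPassThrough and saveWithChecksum are hypothetical.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const { Transform } = require('stream');
const { pipeline } = require('stream/promises');

// A pass-through stream that feeds every chunk into `hash` and then
// forwards the chunk unchanged.
function hashingPassThrough(hash) {
  return new Transform({
    transform(chunk, encoding, callback) {
      hash.update(chunk);
      callback(null, chunk);
    },
  });
}

// Streams `source` to `destPath` and resolves with the hex digest,
// without buffering the whole payload in memory.
async function saveWithChecksum(source, destPath, algorithm = 'sha256') {
  const hash = crypto.createHash(algorithm);
  try {
    await pipeline(source, hashingPassThrough(hash), fs.createWriteStream(destPath));
  } catch (err) {
    // Remove the partial file; the streams were already destroyed by pipeline.
    await fs.promises.rm(destPath, { force: true });
    throw err;
  }
  return hash.digest('hex');
}
```

The caller can compare the returned digest with the expected checksum and delete the file on a mismatch.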