I have this function:
const list = [];

(async () => {
  await fs.readdir(JSON_DIR, async (err, files) => {
    await files.forEach(async filename => {
      const readStream = fs.createReadStream(path.join("output/scheduled", filename));
      const parseStream = json.createParseStream();
      await parseStream.on('data', async (hostlist: HostInfo[]) => {
        hostlist.forEach(async host => {
          list.push(host);
        });
      });
      readStream.pipe(parseStream);
    })
  });
  // here list.length = 0
  console.log(list.length);
})();
The function reads from a directory of large JSON files; for each file it creates a stream that starts parsing the JSON, and the streams can all be working at the same time.
At the end of the function I need to save each host in the list, but when I check the list at the end, it is empty.
How can I save the content of host to a global variable so it is accessible at the end?
One solution I thought of is to detect when every file has finished reading using an end event.
But to access the list at the end, I would need yet another event that fires once all the other events have finished, which looks complicated.
I have been using the big-json library:
https://www.npmjs.com/package/big-json
You could use a counter to determine when all the streams have finished processing, and readdirSync to list the directory synchronously.
const list: HostInfo[] = [];

(() => {
  const files = fs.readdirSync(JSON_DIR);
  let streamFinished = 0;
  let streamCount = files.length;

  files.forEach((filename) => {
    const readStream = fs.createReadStream(
      path.join('output/scheduled', filename)
    );
    const parseStream = json.createParseStream();

    parseStream.on('error', (err) => {
      // Handle errors
    });

    parseStream.on('data', (hostlist: HostInfo[]) => {
      list.push(...hostlist);
    });

    parseStream.on('end', () => {
      streamFinished++;
      if (streamFinished === streamCount) {
        // End of all streams...
        console.log(list.length);
      }
    });

    readStream.pipe(parseStream);
  });
})();
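If you would rather avoid the manual counter, you could instead wrap each parse stream in a promise and await them all with Promise.all. This is a minimal sketch, assuming the same fs, path, big-json (json), JSON_DIR and HostInfo as above:

const list: HostInfo[] = [];

(async () => {
  const files = fs.readdirSync(JSON_DIR);

  await Promise.all(files.map((filename) => new Promise<void>((resolve, reject) => {
    const readStream = fs.createReadStream(path.join('output/scheduled', filename));
    const parseStream = json.createParseStream();

    parseStream.on('data', (hostlist: HostInfo[]) => list.push(...hostlist));
    parseStream.on('end', resolve);   // one file fully parsed
    parseStream.on('error', reject);  // fail the whole batch on any parse error

    readStream.pipe(parseStream);
  })));

  // All streams have finished here.
  console.log(list.length);
})();

Because Promise.all only resolves after every 'end' event has fired, the final console.log sees the fully populated list.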
I wrote JavaScript code for a web crawler that scrapes data from a list of websites (in a CSV file) using a single browser instance (code below). Now I want to modify the code so that every website in the list is crawled in parallel, in two browser instances at the same time. For example, www.a.com should run in parallel on two browser instances, and the same goes for the rest of the websites. If anyone can help me, please, I would be very thankful.
(async () => {
require("dotenv").config();
if (!process.env.PROXY_SPKI_FINGERPRINT) {
throw new Error("PROXY_SPKI_FINGERPRINT is not defined in environment.");
}
const fs = require("fs");
const fsPromises = fs.promises;
const pptr = require("puppeteer");
const browser = await pptr.launch({
args: [
"--proxy-server=https://127.0.0.1:8000",
"--ignore-certificate-errors-spki-list=" + process.env.PROXY_SPKI_FINGERPRINT,
"--disable-web-security",
],
// headless: false,
});
const sites = (await fsPromises.readFile(process.argv[2])) // sites list in csv file
.toString()
.split("\n")
.map(line => line.split(",")[1])
.filter(s => s);
for (let i in sites) {
const site = sites[i];
console.log(`[${i}] ${site}`);
try {
await fsPromises.appendFile("data.txt", JSON.stringify(await crawl(browser, site)) + "\n");
} catch (e) {
console.error(e);
}
}
await browser.close();
async function crawl(browser, site) {
const page = await browser.newPage();
try {
const grepResult = [];
page.on("request", async request => {
request.continue();
})
page.on("response", async response => {
try {
if (response.request().resourceType() === "script" &&
response.headers()["content-type"] &&
response.headers()["content-type"].includes("javascript")) {
const js = await response.text();
const grepPartResult = grepMagicWords(js);
grepResult.push([response.request().url(), grepPartResult]);
}
} catch (e) {}
});
await page.setRequestInterception(true);
try {
await page.goto("http://" + site, {waitUntil: "load", timeout: 60000});
await new Promise(resolve => { setTimeout(resolve, 10000); });
} catch (e) { console.error(e); }
const [flows, url] = await Promise.race([
page.evaluate(() => [J$.FLOWS, document.URL]),
new Promise((_, reject) => { setTimeout(() => { reject(); }, 5000); })
]);
return {url: url, grepResult: grepResult, flows: flows};
} finally {
await page.close();
}
function grepMagicWords(js) {
var re = /(?:\'|\")(?:g|s)etItem(?:\'|\")/g, match, result = [];
while (match = re.exec(js)) {
result.push(js.substring(match.index - 100, match.index + 100));
}
return result;
}
}
})();
You can launch multiple browsers and run them in parallel. You would have to restructure your app slightly for that: create a wrapper for crawl which launches it with a new browser instance. I created crawlNewInstance, which does that for you. You would also need to run crawlNewInstance() in parallel.
Check out this code:
const sites = (await fsPromises.readFile(process.argv[2])) // sites list in csv file
.toString()
.split("\n")
.map(line => line.split(",")[1])
.filter(s => s);
const crawlerProms = sites.map(async (site, index) => {
try {
console.log(`[${index}] ${site}`);
await fsPromises.appendFile("data.txt", JSON.stringify(await crawlNewInstance(site)) + "\n");
} catch (e) {
console.error(e);
}
})
// await all the crawlers!.
await Promise.all(crawlerProms)
async function crawlNewInstance(site) {
const browser = await pptr.launch({
args: [
"--proxy-server=https://127.0.0.1:8000",
"--ignore-certificate-errors-spki-list=" + process.env.PROXY_SPKI_FINGERPRINT,
"--disable-web-security",
],
// headless: false,
});
const result = await crawl(browser, site)
await browser.close()
return result
}
Optional
The above basically answers the question, but if you want to go further, I was on a roll and had nothing to do :)
If you have plenty of pages which you want to crawl in parallel and you want to, for example, limit the number of parallel requests, you could use a queue:
var { EventEmitter} = require('events')
class AsyncQueue extends EventEmitter {
limit = 2
enqueued = []
running = 0
constructor(limit) {
super()
this.limit = limit
}
isEmpty() {
return this.enqueued.length === 0
}
// make sure to only pass `async` function to this queue!
enqueue(fn) {
// add to queue
this.enqueued.push(fn)
// start a job. If max instances are already running it does nothing.
// otherwise it runs a new job!.
this.next()
}
// if a job is done try starting a new one!.
done() {
this.running--
console.log('job done! remaining:', this.limit - this.running)
this.next()
}
async next() {
// emit if queue is empty once.
if(this.isEmpty()) {
this.emit('empty')
return
}
// if no jobs are available OR limit is reached do nothing
if(this.running >= this.limit) {
console.log('queue full.. waiting!')
return
}
this.running++
console.log('running job! remaining slots:', this.limit - this.running)
// first in, first out! so take first element in array.
const job = this.enqueued.shift()
try {
await job()
} catch(err) {
console.log('Job failed!. ', err)
this.emit('error', err)
}
// job is done!
// Done() will call the next job if there are any available!.
this.done()
}
}
The queue could be utilised with this code:
// create queue
const limit = 3
const queue = new AsyncQueue(limit)
// listen for any errors..
queue.on('error', err => {
console.error('error occured in queue.', err)
})
for(let site of sites) {
// enqueue all crawler jobs.
// pass an async function which does whatever you want. In this case it crawls
// a web page!.
queue.enqueue(async() => {
await fsPromises.appendFile("data.txt", JSON.stringify(await crawlNewInstance(site)) + "\n");
})
}
// helper for waiting for the queue!
const waitForQueue = async () => {
if(queue.isEmpty()) return Promise.resolve()
return new Promise((res, rej) => {
queue.once('empty', res)
})
}
await waitForQueue()
console.log('crawlers done!.')
Even further with BrowserPool
It would also be possible to reuse your browser instances, so it would not be necessary to start a new browser instance for every crawling process. This can be done with the BrowserPool helper class below.
var pptr = require('puppeteer')
async function launchPuppeteer() {
return await pptr.launch({
args: [
"--proxy-server=https://127.0.0.1:8000",
"--ignore-certificate-errors-spki-list=" + process.env.PROXY_SPKI_FINGERPRINT,
"--disable-web-security",
],
// headless: false,
});
}
// manages browser connections.
// creates a pool on startup and allows getting references to
// the browsers! .
class BrowserPool {
browsers = []
async get() {
// return browser if there is one!
if(this.browsers.length > 0) {
return this.browsers.splice(0, 1)[0]
}
// no browser available anymore..
// launch a new one!
return await launchPuppeteer()
}
// used for putting a browser back in pool!.
handback(browser) {
this.browsers.push(browser)
}
// shuts down all browsers!.
async shutDown() {
for(let browser of this.browsers) {
await browser.close()
}
}
}
You can then remove crawlNewInstance() and adjust the code so that it finally looks like this:
const sites = (await fsPromises.readFile(process.argv[2])) // sites list in csv file
.toString()
.split("\n")
.map(line => line.split(",")[1])
.filter(s => s);
// create browserpool
const pool = new BrowserPool()
// create queue
const limit = 3
const queue = new AsyncQueue(limit)
// listen to errors:
queue.on('error', err => {
console.error('error in the queue detected!', err)
})
// enqueue your jobs
for(let site of sites) {
// enqueue an async function which takes a browser from pool
queue.enqueue(async () => {
try {
// get the browser and crawl a page!.
const browser = await pool.get()
const result = await crawl(browser, site)
await fsPromises.appendFile("data.txt", JSON.stringify(result) + "\n");
// return the browser back to pool so other crawlers can use it! .
pool.handback(browser)
} catch(err) {
console.error(err)
}
})
}
// helper for waiting for the queue!
const waitForQueue = async () => {
// maybe jobs fail in a few milliseconds, so check first if it's already empty..
if(queue.isEmpty()) return Promise.resolve()
return new Promise((res, rej) => {
queue.once('empty', res)
})
}
// wait for the queue to finish :)
await waitForQueue()
// in the very end, shut down all browser:
await pool.shutDown()
console.log('done!.')
Have fun and feel free to leave a comment.
I'm trying to extract all files from a folder and all its subdirectories. The content of a directory is fetched from an external API.
export const extractFiles = (filesOrDirectories) => {
const files = [];
const getFiles = (filesOrDirectories) => {
filesOrDirectories.forEach(async fileOrDirectory => {
if (fileOrDirectory.type === 'directory') {
const content = await getDirectoryContent(fileOrDirectory.id);
getFiles(content);
} else {
files.push(fileOrFolder)
}
});
}
// files should be returned here when it's done. But how do I know when there are no more directories
};
It's a recursive function which calls itself when it finds a directory; otherwise it pushes the file to an array.
But how can I know when there are no more directories to extract?
You will know there are no more directories to explore when the function ends.
However, it should be noted that since there is asynchronous code inside your extractFiles function, you will have to await the result of any recursive call.
export const extractFiles = async (filesOrDirectories) => {
  const files = [];

  const getFiles = async (filesOrDirectories) => {
    for (const fileOrDirectory of filesOrDirectories) {
      if (fileOrDirectory.type === 'directory') {
        const content = await getDirectoryContent(fileOrDirectory.id);
        await getFiles(content);
      } else {
        files.push(fileOrDirectory)
      }
    }
  }

  await getFiles(filesOrDirectories)
  return files;
};

const extractedFiles = await extractFiles(filesOrDirectories);
EDIT:
Please note that forEach behaves in unexpected ways when combined with asynchronous code; refactor to use a for...of loop.
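To illustrate the difference, here is a minimal sketch (delay is a hypothetical helper) showing that forEach fires all async callbacks at once, while for...of awaits each iteration before starting the next:

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// forEach ignores the promises returned by the async callback:
[1, 2, 3].forEach(async n => {
  await delay(100);
  console.log('forEach done', n); // all three log at roughly the same time
});

// for...of awaits each iteration in turn:
(async () => {
  for (const n of [1, 2, 3]) {
    await delay(100);
    console.log('for...of done', n); // logs at ~100ms, ~200ms, ~300ms
  }
})();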
I am writing some code that loops over a CSV and creates a JSON file based on the CSV. Included in the JSON is an array named photos, which is to contain the returned urls for the images that are being uploaded to Google Cloud Storage within the function. However, having the promise wait for the uploads to finish has me stumped, since everything is running asynchronously, and finishes off the promise and the JSON compilation prior to finishing the bucket upload and returning the url. How can I make the promise resolve after the urls have been retrieved and added to currentJSON.photos?
const csv=require('csvtojson')
const fs = require('fs');
const {Storage} = require('#google-cloud/storage');
var serviceAccount = require("./my-firebase-storage-spot.json");
const testFolder = './Images/';
var csvFilePath = './Inventory.csv';
var dirArr = ['./Images/Subdirectory-A','./Images/Subdirectory-B','./Images/Subdirectory-C'];
var allData = [];
csv()
.fromFile(csvFilePath)
.subscribe((json)=>{
return new Promise((resolve,reject)=>{
for (var i in dirArr ) {
if (json['Name'] == dirArr[i]) {
var currentJSON = {
"photos" : [],
};
fs.readdir(testFolder+json['Name'], (err, files) => {
files.forEach(file => {
if (file.match(/.(jpg|jpeg|png|gif)$/i)){
var imgName = testFolder + json['Name'] + '/' + file;
bucket.upload(imgName, function (err, file) {
if (err) throw new Error(err);
//returned uploaded img address is found at file.metadata.mediaLink
currentJSON.photos.push(file.metadata.mediaLink);
});
}else {
//do nothing
}
});
});
allData.push(currentJSON);
}
}
resolve();
})
},onError,onComplete);
function onError() {
// console.log(err)
}
function onComplete() {
console.log('finito');
}
I've tried moving the resolve() around, and also tried placing the uploader section into the onComplete() function (which created new promise-based issues).
Indeed, your code is not awaiting the asynchronous invocation of the readdir callback function, nor of the bucket.upload callback function.
Asynchronous coding becomes easier when you use the promise-version of these functions.
bucket.upload will return a promise when omitting the callback function, so that is easy.
For readdir to return a promise, you need to use the fs promises API: then you can use the promise-based readdir method and use promises throughout your code.
So use fs = require('fs').promises instead of fs = require('fs').
With that preparation, your code can be transformed into this:
const testFolder = './Images/';
var csvFilePath = './Inventory.csv';
var dirArr = ['./Images/Subdirectory-A','./Images/Subdirectory-B','./Images/Subdirectory-C'];
(async function () {
let arr = await csv().fromFile(csvFilePath);
arr = arr.filter(obj => dirArr.includes(obj.Name));
let allData = await Promise.all(arr.map(async obj => {
let files = await fs.readdir(testFolder + obj.Name);
files = files.filter(file => file.match(/\.(jpg|jpeg|png|gif)$/i));
let photos = await Promise.all(
files.map(async file => {
var imgName = testFolder + obj.Name + '/' + file;
let result = await bucket.upload(imgName);
return result.metadata.mediaLink;
})
);
return {photos};
}));
console.log('finito', allData);
})().catch(err => { // <-- The above async function runs immediately and returns a promise
console.log(err);
});
Some remarks:
There is a shortcoming in your regular expression. You intended to match a literal dot, but you did not escape it (fixed in above code).
allData will contain an array of { photos: [......] } objects, and I wonder why you would not want all photo elements to be part of one single array. However, I kept your logic, so the above will still produce them in these chunks. Possibly, you intended to have other properties (next to photos) as well, which would make it actually useful to have these separate objects.
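If you do want a single flat array of media links instead, a small follow-up step (using Array.prototype.flatMap, available since Node 11) could be:

// Collapse the per-object photo arrays into one list of media links.
const allPhotos = allData.flatMap(obj => obj.photos);
console.log('finito', allPhotos);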
The problem is that your code is not waiting inside your forEach. I would highly recommend looking at streams and trying to do things in parallel as much as possible. There is one library which is very powerful and does that job for you: etl.
You can read rows from the CSV in parallel and process them in parallel rather than one by one.
I have tried to explain the lines in the code below. Hopefully it makes sense.
const etl = require("etl");
const fs = require("fs");
const csvFilePath = `${__dirname }/Inventory.csv`;
const testFolder = "./Images/";
const dirArr = [
"./Images/Subdirectory-A",
"./Images/Subdirectory-B",
"./Images/Subdirectory-C"
];
fs.createReadStream(csvFilePath)
.pipe(etl.csv()) // parse the csv file
.pipe(etl.collect(10)) // this could be any value depending on how many you want to do in parallel.
.pipe(etl.map(async items => {
return Promise.all(items.map(async item => { // Iterate through 10 items
const finalResult = await Promise.all(dirArr.filter(i => i === item.Name).map(async () => { // filter the matching one and iterate
const files = await fs.promises.readdir(testFolder + item.Name); // read all files
const filteredFiles = files.filter(file => file.match(/\.(jpg|jpeg|png|gif)$/i)); // filter out only images
const result = await Promise.all(filteredFiles.map(async file => {
const imgName = `${testFolder}${item.Name}/${file}`;
const bucketUploadResult = await bucket.upload(imgName); // upload image
return bucketUploadResult.metadata.mediaLink;
}));
return result; // This contains all the media link for matching files
}));
// eslint-disable-next-line no-console
console.log(finalResult); // Return arrays of media links for files
return finalResult;
}));
}))
.promise()
.then(() => console.log("finsihed"))
.catch(err => console.error(err));
Here's a way to do it where we extract some of the functionality into some separate helper methods, and trim down some of the code. I had to infer some of your requirements, but this seems to match up pretty closely with how I understood the intent of your original code:
const csv=require('csvtojson')
const fs = require('fs');
const {Storage} = require('@google-cloud/storage');
var serviceAccount = require("./my-firebase-storage-spot.json");
const testFolder = './Images/';
var csvFilePath = './Inventory.csv';
var dirArr = ['./Images/Subdirectory-A','./Images/Subdirectory-B','./Images/Subdirectory-C'];
var allData = [];
// Using nodejs 'path' module ensures more reliable construction of file paths than string manipulation:
const path = require('path');
// Helper function to convert bucket.upload into a Promise
// From other responses, it looks like if you just omit the callback then it will be a Promise
const bucketUpload_p = fileName => new Promise((resolve, reject) => {
bucket.upload(fileName, function (err, file) {
if (err) reject(err);
resolve(file);
});
});
// Helper function to convert readdir into a Promise
// Again, there are other APIs out there to do this, but this is a really simple solution too:
const readdir_p = dirName => new Promise((resolve, reject) => {
fs.readdir(dirName, function (err, files) {
if (err) reject(err);
resolve(files);
});
});
// Here we're expecting the string that we found in the "Name" property of our JSON from "subscribe".
// It should match one of the strings in `dirArr`, but this function's job ISN'T to check for that,
// we just trust that the code already found the right one.
// An async function already returns a Promise, so we don't need an explicit Promise wrapper
// (and `await` isn't valid inside a plain Promise executor anyway).
const getImageFilesFromJson_p = async jsonName => {
const filePath = path.join(testFolder, jsonName);
const files = await readdir_p(filePath);
return files.filter(fileName => fileName.match(/\.(jpg|jpeg|png|gif)$/i));
};
csv()
.fromFile(csvFilePath)
.subscribe(async json => {
// Here we appear to be validating that the "Name" prop from the received JSON matches one of the paths that
// we're expecting...? If that's the case, this is a slightly more semantic way to do it.
const nameFromJson = dirArr.find(dirName => json['Name'] === dirName);
// If we don't find that it matches one of our expecteds, we'll reject the promise.
if (!nameFromJson) {
// We can do whatever we want though in this case, I think it's maybe not necessarily an error:
// return Promise.resolve([]);
return Promise.reject('Did not receive a matching value in the Name property from \'.subscribe\'');
}
// We can use `await` here since `getImageFilesFromJson_p` returns a Promise
const imageFiles = await getImageFilesFromJson_p(nameFromJson);
// We're getting just the filenames; map them to build the full path
const fullPathArray = imageFiles.map(fileName => path.join(testFolder, nameFromJson, fileName));
// Here we Promise.all, using `.map` to convert the array of strings into an array of Promises;
// if they all resolve, we'll get the array of file objects returned from each invocation of `bucket.upload`
return Promise.all(fullPathArray.map(filePath => bucketUpload_p(filePath)))
.then(fileResults => {
// So, now we've finished our two asynchronous functions; now that that's done let's do all our data
// manipulation and resolve this promise
// Here we just extract the metadata property we want
const fileResultsMediaLinks = fileResults.map(file => file.metadata.mediaLink);
// Before we return anything, we'll add it to the global array in the format from the original code
allData.push({ photos: fileResultsMediaLinks });
// Returning this array, which is the `mediaLink` value from the metadata of each of the uploaded files.
return fileResultsMediaLinks;
})
}, onError, onComplete);
I use a function in my project:
function readStream(file) {
console.log("starte lesen");
const readStream = fs.createReadStream(file);
readStream.setEncoding('utf8');
return new Promise((resolve, reject) => {
let data = "";
readStream.on("data", chunk => data += chunk);
readStream.on("end", () => {resolve(data);});
readStream.on("error", error => reject(error));
});
}
It reads an XML file with around 800 lines. If I add:
readStream.on("end", () => {console.log(data); resolve(data);});
then the XML data is complete and everything is fine. But if I now call readStream from another function:
const dpath = path.resolve(__basedir, 'tests/downloads', 'test.xml');
let xml = await readStream(dpath);
console.log(xml);
then the XML data is cut off. I think 800 lines is nothing big. So what can cause the data to be cut off at this position but not inside the function itself?
I have tried it the following way and it seems to work for me.
For a complete running example, clone node-cheat xml-streamer and run node main.js.
xml-streamer.js:
const fs = require('fs');
module.exports.readStream = function (file) {
console.log("read stream started");
const readStream = fs.createReadStream(file);
readStream.setEncoding('utf8');
return new Promise((resolve, reject) => {
let data = "";
readStream.on("data", chunk => data += chunk);
readStream.on("end", () => {console.log(data); resolve(data);});
readStream.on("error", error => reject(error));
});
}
main.js:
const path = require('path');
const _streamer = require('./xml-streamer');
async function main() {
const xml = await _streamer.readStream( path.resolve(__dirname, 'files', 'test.xml'));
console.log(xml);
}
main();
P.S. In the above-mentioned node-cheat, the test XML file has 1121 lines.
Sometimes sync + async code can get a race condition when it's called in the same tick. Try using setImmediate(resolve, data) on your event handler, which will resolve on the next process tick.
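Applied to the readStream function from the question, that suggestion would look roughly like this sketch:

const fs = require('fs');

function readStream(file) {
  const stream = fs.createReadStream(file);
  stream.setEncoding('utf8');
  return new Promise((resolve, reject) => {
    let data = "";
    stream.on("data", chunk => data += chunk);
    // Defer resolution to the next cycle of the event loop:
    stream.on("end", () => setImmediate(resolve, data));
    stream.on("error", error => reject(error));
  });
}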
Alternatively, if you're targeting node v12 or higher you can use the stream async iterator interface, which will be much cleaner for your code:
async function readStream(file) {
  console.log("start reading");
  const readStream = fs.createReadStream(file);
  readStream.setEncoding('utf8');

  let data = "";
  for await (const chunk of readStream) {
    data += chunk;
  }
  return data;
}
If you happen to use a modern version of Node, there's fs.promises:
const { promises: fs } = require('fs')
;(async function main() {
console.log(await fs.readFile('./input.txt', 'utf-8'));
})()
I know there are other answers that are similar to this question, but I'm in a slightly different situation. Consider this block of code:
fileSelected = (e) => {
const files = e.target.files;
_.map(files, file => {
reader.readAsDataURL(file);
reader.onprogress = () => {...}
reader.onerror = () => {...}
reader.onload = () => {
const resp = await uploadAttachment(file);
// do something
}
}
}
This is iterating asynchronously when I want it done sequentially. I want every new instance of FileReader to finish before moving on to the next file... I know it's not ideal, but I'm maxing out at 10 files at a time.
I created a separate function that returns a new Promise and used the fileSelected function to loop through the files like so:
readFile = (file) => {
return new Promise(() => {
reader.readAsDataURL(file);
reader.onprogress...
reader.onerror...
reader.onload...
...
})
}
fileSelected = async (e) => {
for (const file of files) {
await readFile(file);
}
}
It goes through the first file fine, but it doesn't move on to the next file. What could be the issue here? Why is it returning early?
Use the async keyword in order to use await (if you are not using ES2019):
fileSelected = async (e) => {
for (const file of files) {
await readFile(file);
}
}
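Note also that the Promise created in readFile never resolves, because the executor ignores resolve and reject, so the await can never complete and the loop stalls after the first file. A minimal sketch of a version that does resolve (uploadAttachment comes from the question; the exact FileReader wiring is an assumption):

const readFile = (file) => {
  return new Promise((resolve, reject) => {
    const reader = new FileReader(); // assumed: one reader per file
    reader.onerror = () => reject(reader.error);
    reader.onload = async () => {
      try {
        const resp = await uploadAttachment(file); // from the question
        resolve(resp);
      } catch (err) {
        reject(err);
      }
    };
    reader.readAsDataURL(file);
  });
};

With the promise settling on load or error, the await in fileSelected can move on to the next file.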