Most efficient way to download array of remote file URLs in Node? - javascript

I'm working on a Node project where I have an array of files such as
var urls = ["http://web.site/file1.iso", "https://web.site/file2.pdf", "https://web.site/file3.docx", ...];
I'm looking to download those files locally in the most efficient way possible. There could be as many as several dozen URLs in this array... Is there a good library that would help me abstract this out? I need something that I can call with the array and the desired local directory, and that will follow redirects, work with http & https, intelligently limit simultaneous downloads, etc.

node-fetch is a lovely little library that brings fetch capability to node. Since fetch returns a promise, managing parallel downloads is simple. Here's an example:
const fetch = require('node-fetch')
const fs = require('fs')

// Expand this array with as many URLs as required
const urls = ['http://web.site/file1.iso', 'https://web.site/file2.pdf']

// Here we map the list of urls -> a list of fetch requests
// (use an arrow function so map's extra index/array arguments aren't passed to fetch)
const requests = urls.map(url => fetch(url))

// Now we wait for all the requests to resolve and then save them locally
Promise.all(requests).then(files => {
  files.forEach(file => {
    file.body.pipe(fs.createWriteStream('PATH/FILE_NAME.EXT'))
  })
})
Alternatively, you could write each file as it resolves:
const fetch = require('node-fetch')
const fs = require('fs')

const urls = ['http://web.site/file1.iso', 'https://web.site/file2.pdf']

urls.map(url => {
  fetch(url).then(response => {
    // Use the last path segment of the URL as the file name
    const fileName = url.split('/').pop()
    response.body.pipe(fs.createWriteStream('DIRECTORY_NAME/' + fileName))
  })
})
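The question also asks about limiting simultaneous downloads, which neither snippet above does. Here is a rough sketch of one way to add that, processing the URLs in fixed-size batches (the batch size of 5 and the ./downloads directory are arbitrary assumptions; a library such as p-limit gives finer control):
const fetch = require('node-fetch')
const fs = require('fs')
const path = require('path')

async function downloadAll(urls, dir, batchSize = 5) {
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize)
    // Wait for the whole batch to finish before starting the next one
    await Promise.all(batch.map(async url => {
      const response = await fetch(url) // node-fetch follows redirects by default
      const dest = fs.createWriteStream(path.join(dir, path.basename(url)))
      response.body.pipe(dest)
      // Resolve only once the file has been fully written to disk
      await new Promise((resolve, reject) => {
        dest.on('finish', resolve)
        dest.on('error', reject)
      })
    }))
  }
}

// `./downloads` must already exist
downloadAll(['http://web.site/file1.iso', 'https://web.site/file2.pdf'], './downloads')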


Other ways to get a particular URL in Node.js

I am calling a Reddit REST API and trying to parse the JSON output to get the URL from the response. When I send the request multiple times I get different outputs, since it's a random endpoint, and I am not sure how to handle that.
const https = require("https");

https
  .get("https://www.reddit.com/r/cute/random.json", resp => {
    let data = "";
    resp.on("data", chunk => {
      data += chunk;
    });
    const obj = JSON.parse(data);
    resp.on("end", () => {
      console.log(obj.url);
    });
  })
  .on("error", err => {
    console.log("Error: " + err.message);
  });
This is the code I have. I used Node's built-in https library and it doesn't seem to work. I have never used any third-party Node libraries, so it would be helpful if you could suggest one, and also let me know whether what I have done is right.
I understand that https is a core module of Node.js, but I strongly suggest you use something like node-fetch. Run the following command in your terminal (or cmd) in the directory where your package.json file lives:
$ npm install node-fetch
This will install the node-fetch library, which works much like the browser's fetch API.
const fetch = require("node-fetch");

const main = async () => {
  const json = await fetch("https://www.reddit.com/r/cute/random.json").then(
    res => res.json()
  );
  console.log(
    json
      .map(entry => entry.data.children.map(child => child.data.url))
      .flat()
      .filter(Boolean)
  );
};

main();
The URLs you are looking for live at data.children[0].data.url, so I map over the listing there. I hope this is something that might help you.
You get a different output each time you run the same code because the URL you are using is a random-article endpoint. From their wiki:
/r/random takes you to a random subreddit. You can find a link to /r/random in the header above. Reddit gold members have access to /r/myrandom, which is right next to the random button. Myrandom takes you to any of your subscribed subreddits randomly. /r/randnsfw takes you to a random NSFW (over 18) subreddit.
The output for me is like this:
[ 'https://i.redd.it/pjom447yp8271.jpg' ] // First run
[ 'https://i.redd.it/h9b00p6y4g271.jpg' ] // Second run
[ 'https://v.redd.it/lcejh8z6zp271' ] // Third run
Since it has only one URL, I changed the code to get the first one:
const fetch = require("node-fetch");

const main = async () => {
  const json = await fetch("https://www.reddit.com/r/cute/random.json").then(
    res => res.json()
  );
  console.log(
    json
      .map(entry => entry.data.children.map(child => child.data.url))
      .flat()
      .filter(Boolean)[0]
  );
};

main();
Now it gives me:
'https://i.redd.it/pjom447yp8271.jpg' // First run
'https://i.redd.it/h9b00p6y4g271.jpg' // Second run
'https://v.redd.it/lcejh8z6zp271' // Third run
'https://i.redd.it/b46rf6zben171.jpg' // Fourth run
I hope this helps you. Feel free to ask me if you need more help. Another alternative is axios, which can also be used on the backend.
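For reference, a rough axios equivalent (assuming axios is installed; axios parses the JSON body for you and exposes it as response.data):
const axios = require("axios");

const main = async () => {
  // axios parses the JSON automatically
  const { data: json } = await axios.get("https://www.reddit.com/r/cute/random.json");
  console.log(
    json
      .map(entry => entry.data.children.map(child => child.data.url))
      .flat()
      .filter(Boolean)[0]
  );
};

main();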

How to provide a polyfill to a node module

I'm trying to use the unsplash-js module in a NodeJS/express server. As the docs say: "This library depends on fetch to make requests to the Unsplash API. For environments that don't support fetch, you'll need to provide a polyfill".
I've tried something like the following, which I am quite ashamed of showing, but for the sake of learning I will:
const Unsplash = require('unsplash-js').default;
const fetch = require('node-fetch');

Unsplash.prototype.fetch = fetch;

const unsplash = new Unsplash({
  applicationId: process.env.UNSPLASH_API_KEY,
});
And also this
Node.prototype.fetch = fetch;
But of course, nothing worked.
How do I do that?
You need to set fetch as a global variable, like below:
global.fetch = require('node-fetch')
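Put together with the question's code, a minimal sketch (assuming unsplash-js picks up fetch from the global scope, which is what the polyfill note in its docs implies):
// Polyfill fetch globally *before* requiring unsplash-js
global.fetch = require('node-fetch');

const Unsplash = require('unsplash-js').default;

const unsplash = new Unsplash({
  applicationId: process.env.UNSPLASH_API_KEY,
});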

Have an array from a Puppeteer/Cheerio program. Have an Ionic phone app design. How to move the array into the Angular code?

This is my first phone app. I am using Ionic for the cross-platform work, which uses Angular as I'm sure you know. I have a separate program which scrapes a web page using Puppeteer and Cheerio and creates an array of values from the page. This works.
I'm not sure how to get the array in my web scraping program read by my Ionic/Angular program.
I have a basic Ionic setup and am just trying the most basic activity of being able to see the array from the Ionic/Angular side. After trying to put it in several places, I realized I really didn't know where to import the code that returns the array into Ionic/Angular, whether to put the web scraper code directly into one of the .ts files, or something else entirely.
This is my web scraping program:
const puppeteer = require('puppeteer'); // live webscraping

let scrape = async () => {
  const browser = await puppeteer.launch({
    headless: true
  });
  const page = await browser.newPage();
  await page.goto('--page url here --'); // link to page

  const result = await page.evaluate(() => {
    let data = []; // Create an empty array that will store our data
    let elements = document.querySelectorAll('.list-myinfo-block'); // Select all products
    let photo_elements = document.getElementsByTagName('img');
    var photo_count = 0;
    for (var element of elements) { // Loop through each product, collecting its photo
      let picture_link = photo_elements[photo_count].src;
      let name = element.childNodes[1].innerText;
      let itype = element.childNodes[9].innerText;
      data.push({
        picture_link,
        name,
        itype
      }); // Push an object with the data onto our array
      photo_count = photo_count + 1;
    }
    return data;
  });

  await browser.close();
  return result; // Return the data
};

scrape().then((value) => {
  console.log(value); // Success!
});
When I run the web scraping program I see the array with the correct values in it. The problem is getting it into the Ionic part. Sometimes the Ionic phone page shows up with nothing in it, sometimes it says it cannot find "/"... I've tried so many different places and looked all over the web, so I have quite a combination of errors. I know I'm putting it in the wrong places, or maybe not everywhere I should. Thank you!
You need a server which will run the scraper on demand.
Any scraper that uses a real browser (i.e. Chromium) has to run on an OS that supports it. There is no other way.
Think about this:
Does your mobile support Chromium and Node.js? It does not. There is no Chromium build for mobile that supports automation with Node.js (yet).
Can you run a browser inside another browser? You cannot.
Way 1: Remote wsEndpoint
There are some services which offer a wsEndpoint, but I will not mention them here. I will describe how you can create your own wsEndpoint and use it.
Run browser and Get wsEndpoint
The following code will launch a puppeteer instance whenever you connect to it. You have to run it inside a server.
const http = require('http');
const httpProxy = require('http-proxy');
const puppeteer = require('puppeteer');

const proxy = httpProxy.createProxyServer();

http
  .createServer()
  .on('upgrade', async (req, socket, head) => {
    const browser = await puppeteer.launch();
    const target = browser.wsEndpoint();
    proxy.ws(req, socket, head, { target });
  })
  .listen(8080);
When you run this on the server/terminal, you can use the ip of the server to connect. In my case it's ws://127.0.0.1:8080.
Use puppeteer-web
Now you will need to use puppeteer-web in your mobile/web app. To bundle Puppeteer using Browserify, follow the instructions below.
Clone Puppeteer repository:
git clone https://github.com/GoogleChrome/puppeteer && cd puppeteer
npm install
npm run bundle
This will create ./utils/browser/puppeteer-web.js file that contains Puppeteer bundle.
You can use it later on in your web page to drive another browser instance through its WS Endpoint:
<script src='./puppeteer-web.js'></script>
<script>
  (async () => {
    const puppeteer = require('puppeteer');
    const browser = await puppeteer.connect({
      browserWSEndpoint: '<another-browser-ws-endpoint>'
    });
    // ... drive automation ...
  })();
</script>
Way 2: Use an API
I will use express for a minimal setup. Assume your scrape function is exported from a file called scrape.js and you have the following index.js file:
const express = require('express')
const scrape = require('./scrape')

const app = express()

app.get('/', function (req, res) {
  scrape().then(data => res.send({ data }))
})

app.listen(8080)
This will launch an Express API on port 8080.
Now if you run it with node index.js on a server, you can call it from any mobile/web app.
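On the Ionic/Angular side, the app then only has to call that endpoint and read the array out of the JSON response. A minimal sketch (the http://your-server:8080/ URL is a placeholder for wherever index.js is hosted):
// e.g. inside an Angular service or page .ts file
async function loadScrapedData() {
  const response = await fetch('http://your-server:8080/');
  const { data } = await response.json();
  return data; // the array produced by scrape()
}

loadScrapedData().then(items => console.log(items));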
Helpful Resources
I had some fun with puppeteer and webpack,
playground-react-puppeteer
playground-electron-react-puppeteer-example
To keep the API running, you will need to learn a bit about the backend and how to keep the server alive. See these links for a fuller understanding of creating the server and more:
Official link to puppeteer-web
Puppeteer with docker
Docker with XVFB and Puppeteer
Puppeteer with chrome extension
Puppeteer with local wsEndpoint
Avoid memory leak on server

Stream array of remote files to amazon S3 in Node.js

I have an array of URLs to files that I want to upload to an Amazon S3 bucket. There are 2916 URLs in the array and the files have a combined size of 361MB.
I try to accomplish this using streams to avoid using too much memory. My solution works in the sense that all 2916 files get uploaded, but (at least some of) the uploads seem to be incomplete, as the total size of the uploaded files varies between 200MB and 361MB for each run.
// Relevant code below (part of a larger function)

/* Used dependencies and setup:
   const request = require('request');
   const AWS = require('aws-sdk');
   const stream = require('stream');
   AWS.config.loadFromPath('config.json');
   const s3 = new AWS.S3();
*/

function uploadStream(path, resolve) {
  const pass = new stream.PassThrough();
  const params = { Bucket: 'xxx', Key: path, Body: pass };
  s3.upload(params, (err, data) => resolve());
  return pass;
}

function saveAssets(basePath, assets) {
  const promises = [];
  assets.map(a => {
    const url = a.$.url;
    const key = a.$.path.substr(1);
    const localPromise = new Promise(
      (res, rej) => request.get(url).pipe(uploadStream(key, res))
    );
    promises.push(localPromise);
  });
  return Promise.all(promises);
}

saveAssets(basePath, assets).then(() => console.log("Done!"));
It's a bit messy with the promises, but I need to be able to tell when all files have been uploaded, and this part seems to work well at least (it writes "Done!" after ~25 secs when all promises are resolved).
I am new to streams so feel free to bash me if I approach this the wrong way ;-) Really hope I can get some pointers!
It seems I was trying to complete too many requests at once. Using async.eachLimit I now limit my code to a maximum of 50 concurrent requests, which is the sweet spot for me in terms of the trade-off between execution time, memory consumption and stability (all of the uploads complete every time!).
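For reference, a rough sketch of what that limit can look like with async.eachLimit, reusing the request and uploadStream helpers from the question's code (the limit of 50 mirrors the value mentioned above):
const async = require('async');

function saveAssets(basePath, assets) {
  return new Promise((resolve, reject) => {
    // At most 50 uploads are in flight at any time
    async.eachLimit(assets, 50, (a, done) => {
      const url = a.$.url;
      const key = a.$.path.substr(1);
      // done() fires from uploadStream's s3.upload callback,
      // i.e. only once that upload has actually finished
      request.get(url).pipe(uploadStream(key, done));
    }, err => (err ? reject(err) : resolve()));
  });
}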

Stream response to file using Fetch API and fs.createWriteStream

I'm creating an Electron application and I want to stream an image to a file (so basically download it).
I want to use the native Fetch API because the request module would be a big overhead.
But there is no pipe method on the response, so I can't do something like
fetch('https://imageurl.jpg')
.then(response => response.pipe(fs.createWriteStream('image.jpg')));
So how can I combine fetch and fs.createWriteStream?
I got it working. I made a function which transforms the response into a readable stream.
const { Readable } = require('stream');

const responseToReadable = response => {
  const reader = response.body.getReader();
  const rs = new Readable();
  rs._read = async () => {
    const result = await reader.read();
    if (!result.done) {
      rs.push(Buffer.from(result.value));
    } else {
      rs.push(null);
      return;
    }
  };
  return rs;
};
So with it, I can do
fetch('https://imageurl.jpg')
.then(response => responseToReadable(response).pipe(fs.createWriteStream('image.jpg')));
Fetch is not really able to work with Node.js streams out of the box, because the Stream API in the browser differs from the one Node.js provides, i.e. you cannot pipe a browser stream into a Node.js stream or vice versa.
The electron-fetch module seems to solve that for you. Or you can look at this answer: https://stackoverflow.com/a/32545850/2016129 to have a way of downloading files without the need of nodeIntegration.
There is also needle, a smaller alternative to the bulkier request, which of course supports Streams.
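A quick sketch of the needle variant (assuming npm install needle; needle's get() returns a readable stream, so it can be piped straight into a file):
const needle = require('needle');
const fs = require('fs');

needle
  .get('https://imageurl.jpg')
  .pipe(fs.createWriteStream('image.jpg'))
  .on('finish', () => console.log('Download complete'));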
I guess today the answer is, with Node.js 18+:
node -e 'fetch("https://github.com/stealify").then(response => stream.Readable.fromWeb(response.body).pipe(fs.createWriteStream("./github.com_stealify.html")))'
In the above example we use the -e flag, which tells Node.js to execute our CLI code. We download the page of an interesting project here and save it as ./github.com_stealify.html in the current working directory. The code below shows the same thing inside a Node.js .cjs/.mjs file for convenience.
CLI example using CommonJS
node -e 'fetch("https://github.com/stealify").then(({body:s}) =>
stream.Readable.fromWeb(s).pipe(fs.createWriteStream("./github.com_stealify.html")))'
fetch.cjs
fetch("https://github.com/stealify").then(({body:s}) =>
require("node:stream").Readable.fromWeb(s)
.pipe(require("node:fs").createWriteStream("./github.com_stealify.html")));
CLI example using ESM
node --input-type module -e 'stream.Readable.fromWeb(
(await fetch("https://github.com/stealify")).body)
.pipe(fs.createWriteStream("./github.com_stealify.html"))'
fetch_tla_no_tli.mjs
(await import("node:stream")).Readable.fromWeb(
(await fetch("https://github.com/stealify")).body).pipe(
(await import("node:fs")).createWriteStream("./github.com_stealify.html"));
fetch.mjs
import stream from 'node:stream';
import fs from 'node:fs';
stream.Readable
.fromWeb((await fetch("https://github.com/stealify")).body)
.pipe(fs.createWriteStream("./github.com_stealify.html"));
see: https://nodejs.org/api/stream.html#streamreadablefromwebreadablestream-options
Update: I would not use this method when dealing with files.
This is the more appropriate usage, as fs.promises.writeFile accepts async iterables (which a fetch response body is), just like the stream/consumers API:
node -e 'fetch("https://github.com/stealify").then(({ body }) =>
  fs.promises.writeFile("./github.com_stealify.html", body))'
