Not able to fetch text data from web URL using JavaScript

I need to extract text data from a web URL (http://www.africau.edu/images/default/sample.pdf).
I have tried two node modules.
1) crawler-request
it('Read Pdf Data using crawler', function () {
    const crawler = require('crawler-request');

    function response_text_size(response) {
        response["size"] = response.text.length;
        return response;
    }

    crawler("http://www.africau.edu/images/default/sample.pdf", response_text_size).then(function (response) {
        // handle response
        console.log("Response = " + response.size);
    });
});
When I run this, nothing is printed to the console.
2) pfd2json/pdfparser
it('Read Data from url', function () {
    var request = require('request');
    var pdf = require('pfd2json/pdfparser');
    var fs = require('fs');

    var pdfUrl = "http://www.africau.edu/images/default/sample.pdf";
    let databuffer = fs.readFileSync(pdfUrl);

    pdf(databuffer).then(function (data) {
        var arr:Array<String> = data.text;
        var n = arr.includes('Thursday 02 May');
        console.log("Print Array " + n);
    });
});
Failed: ENOENT: no such file or directory, open 'http://www.africau.edu/images/default/sample.pdf'
I am able to extract the data from a local path, but not from a URL.

The issue here is that you are using the fs (File System) module to read a file on a remote server.
You also mistyped the pdf2json module name, which should itself give you an error.
You did require the request module, though, and that is the module that makes it possible to access the remote file. Here's one way to do it:
it('Read Data from url', function () {
    var request = require('request');
    var PDFParser = require('pdf2json');

    var pdfUrl = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf';
    var pdfParser = new PDFParser(this, 1);

    // executed if the parser fails for any reason
    pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));
    // executed when the parser has finished
    pdfParser.on("pdfParser_dataReady", pdfData => console.log(pdfParser.getRawTextContent()));

    // request the pdf's file content, then call the pdf parser on the retrieved buffer
    request({ url: pdfUrl, encoding: null }, (error, response, body) => pdfParser.parseBuffer(body));
});
This makes it possible to load the remote .pdf file in your program.
I'd recommend looking at the pdf2json documentation if you want to do more. The snippet above simply prints the textual content of the .pdf file once the parser has finished reading the data.
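As a side note, the request package has since been deprecated; the same pattern can be reproduced with Node's built-in http module by collecting the response into a buffer before handing it to the parser. An untested sketch of that idea, reusing the same sample URL:
var http = require('http');
var PDFParser = require('pdf2json');

var pdfUrl = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf';
var pdfParser = new PDFParser(null, 1); // 1 = keep raw text so getRawTextContent() works

pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));
pdfParser.on("pdfParser_dataReady", pdfData => console.log(pdfParser.getRawTextContent()));

http.get(pdfUrl, function (response) {
    var chunks = [];
    response.on('data', chunk => chunks.push(chunk));
    // once the whole file has arrived, parse the concatenated buffer
    response.on('end', () => pdfParser.parseBuffer(Buffer.concat(chunks)));
});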

Related

How to read a large csv as a stream

I am using @aws-sdk/client-s3 to read a JSON file from S3, take the contents, and dump it into DynamoDB. This all currently works fine using:
const data = await (await new S3Client(region).send(new GetObjectCommand(bucketParams)));
And then deserialising the response body, etc.
However, I'm looking to migrate to the jsonlines format (effectively CSV, in the sense that it needs to be streamed in line by line, or in chunks of lines, and processed). I can't seem to find a way of doing this that doesn't load the entire file into memory (e.g. using response.text()).
Ideally, I would like to pipe the response into a createReadStream and go from there.
I found this example with createReadStream() from the fs module in Node.js:
import fs from 'fs';

function read() {
    let data = '';
    const readStream = fs.createReadStream('business_data.csv', 'utf-8');
    readStream.on('error', (error) => console.log(error.message));
    readStream.on('data', (chunk) => data += chunk);
    readStream.on('end', () => console.log('Reading complete'));
}
read();
You can modify it for your use. Hope this helps.
You can connect to your S3 bucket like this:
var s3 = new AWS.S3({apiVersion: '2006-03-01'});
var params = {Bucket: 'myBucket', Key: 'myImageFile.jpg'};
var file = require('fs').createWriteStream('/path/to/file.jpg');
s3.getObject(params).createReadStream().pipe(file);
see here
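If you want to stay on the v3 @aws-sdk/client-s3 client from the question, one possible approach (an untested sketch) is to feed the GetObjectCommand response Body, which is a readable stream in Node.js, into the built-in readline module, so the jsonlines file is processed one line at a time instead of being buffered whole:
const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');
const readline = require('readline');

async function processJsonLines(bucketParams, region) {
    const s3 = new S3Client({ region });
    const { Body } = await s3.send(new GetObjectCommand(bucketParams));

    // Body is a readable stream in Node.js, so readline can consume it line by line
    const rl = readline.createInterface({ input: Body, crlfDelay: Infinity });

    for await (const line of rl) {
        if (!line.trim()) continue;        // skip blank lines
        const record = JSON.parse(line);   // one JSON document per line in jsonlines
        // ... write `record` to DynamoDB here, e.g. batching every N records
    }
}
Because readline pulls from the stream on demand, memory use stays bounded by the line length rather than the file size.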

download pdf from URL into gDrive

I need to download a PDF from a link in the following format
fileURL = "https://docs.google.com/feeds/download/documents/export/Export?id=<...DOCID...>&revision=3970&exportFormat=pdf"
and add it to a gDrive folder.
I have this code, but the generated file just contains "Blob" rather than the actual content:
function dFile(fileName, fileURL) {
  var response = UrlFetchApp.fetch(fileURL);
  var fileBlob = response.getBlob().getAs('application/pdf');
  var folder = DriveApp.getFolderById('..folderID..');
  var result = folder.createFile(fileName, fileBlob, MimeType.PDF);
  Logger.log("file created");
}
How do I download the actual PDF?
Update:
I have updated my code and now I get this as the generated PDF, which makes me think I need to authenticate, but I am not sure how to do it. I have already set up all the auth in the manifest.
function dFile(fileName, fileURL) {
  var response = UrlFetchApp.fetch(fileURL);
  var fileBlob = response.getBlob().getAs('application/pdf');
  var folder = DriveApp.getFolderById('..folderID..');
  var result = folder.createFile(fileBlob).setName(fileName);
  Logger.log("file created");
}
In your script, how about the following modification?
From:
var response = UrlFetchApp.fetch(fileURL);
var fileBlob = response.getBlob().getAs('application/pdf');
To:
var response = UrlFetchApp.fetch(fileURL, { headers: { authorization: "Bearer " + ScriptApp.getOAuthToken() } });
var fileBlob = response.getBlob();
With your endpoint, getBlob() should already return the PDF data, so the extra getAs('application/pdf') conversion is not needed.
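Putting the suggested change into your updated function, the whole thing would look roughly like this (the folder ID is still the placeholder from your script):
function dFile(fileName, fileURL) {
  // Send the script's own OAuth token so the Docs export URL accepts the request.
  var response = UrlFetchApp.fetch(fileURL, {
    headers: { authorization: "Bearer " + ScriptApp.getOAuthToken() }
  });
  var fileBlob = response.getBlob();                    // already PDF for this export URL
  var folder = DriveApp.getFolderById('..folderID..');  // placeholder folder ID
  var result = folder.createFile(fileBlob).setName(fileName);
  Logger.log("file created");
}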
Your script already uses createFile, so the required scope is already included. But if an error related to the Drive API occurs, please enable the Drive API under Advanced Google services.
Note:
If revision=3970 does not exist at your endpoint, an error occurs. Please be careful about this.
Reference:
getOAuthToken()

NodeJs streaming File from request to Filesystem not Memory

I have a strange issue. I'm using request for file downloads in Node.js, and every time I download larger files (>250 MB) they are read into memory instead of being streamed directly to the filesystem. Maybe I'm doing something wrong, but I made a test case and the file is still not being streamed.
var request = require('request');
var fs = require('fs');

var writable = fs.createWriteStream("1GB.zip");

var stream = request.get({
    uri: "http://ipv4.download.thinkbroadband.com/1GB.zip",
    encoding: null
}, function (error, response, body) {
    console.log("code:", response.statusCode);
    if (response.statusCode >= 500) {
        log.err(response.statusCode, " Servererror", file.url);
    }
}).pipe(writable);
In this test case I'm downloading a sample 1 GB file, and if you watch the node process in the task manager it grows to more than 1 GB as the file downloads.
I want my Node application to use no more than 200 MB of RAM.
The issue is that you're passing a callback, which implicitly enables buffering inside request because one of the parameters for the callback is the entire body of the response.
If you want to know when the response is available, just listen for the response event instead:
var request = require('request');
var fs = require('fs');

var writable = fs.createWriteStream("1GB.zip");

var stream = request.get({
    uri: "http://ipv4.download.thinkbroadband.com/1GB.zip",
    encoding: null
}).on('response', function (response) {
    console.log("code:", response.statusCode);
    if (response.statusCode >= 500) {
        log.err(response.statusCode, " Servererror", file.url);
    }
}).pipe(writable);

Express.js proxy pipe translate XML to JSON

For my front-end (angular) app, I need to connect to an external API, which does not support CORS.
So my way around this is to have a simple proxy in Node.js / Express.js to pass the requests. The additional benefit is that I can set my API credentials at the proxy level and don't have to pass them to the front-end, where a user might steal or abuse them.
This is all working perfectly.
Here's the code, for the record:
var request = require('request');
var config = require('./config');
var url = config.api.endpoint;
var uname = config.api.uname;
var pword = config.api.pword;
var headers = {
    "Authorization": 'Basic ' + new Buffer(uname + ':' + pword).toString('base64'),
    "Accept": "application/json"
};

exports.get = function (req, res) {
    var api_url = url + req.url;
    var r = request({url: api_url, headers: headers});
    req.pipe(r).pipe(res);
};
The API endpoint I have to use has XML as its only output format, so I use xml2js on the front-end to convert the XML response to JSON.
This is also working great, but I would like to lighten the load for the client, and do the XML -> JSON parsing step on the server.
I assume I will have to create something like:
req.pipe(r).pipe(<convert_xml_to_json>).pipe(res);
But I don't have any idea how to create something like that.
So basically I'm looking to create an XML to JSON proxy as a layer on top of an already existing API.
There are a lot of questions on SO regarding "how do I make a proxy" and "how do I convert XML to JSON" but I couldn't find any that combine the two.
You need to use a Transform stream, and for the XML to JSON conversion you need a library; I use xml2json.
Then you use it like this (simplified, but it should work with request too):
var http = require('http');
var fs = require('fs');
var parser = require('xml2json');
var Transform = require('stream').Transform;

function xmlParser() {
    var transform = new Transform();

    transform._transform = function (chunk, encoding, done) {
        chunk = parser.toJson(chunk.toString());
        console.log(chunk);
        this.push(chunk);
        done();
    };

    transform.on('error', function (err) {
        console.log(err);
    });

    return transform;
}

var server = http.createServer(function (req, res) {
    var stream = fs.createReadStream(__dirname + '/data.xml');
    stream.pipe(xmlParser()).pipe(res);
});

server.listen(8000);
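Applied to the proxy from the question, the same transform could be dropped into the existing pipe chain. A rough, untested adaptation (one caveat: _transform is called per chunk, so if the upstream XML response arrives in several chunks you would need to buffer the full body before calling parser.toJson):
exports.get = function (req, res) {
    var api_url = url + req.url;
    var r = request({ url: api_url, headers: headers });

    // convert the upstream XML to JSON before it reaches the client
    res.set('Content-Type', 'application/json');
    req.pipe(r).pipe(xmlParser()).pipe(res);
};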

Retrieving accurate filesize data using node.js

I am trying to retrieve accurate file size information for an image URL using Node.js (specifically the http module). Every time I run the code below (with any image URL) I get '4061' bytes as the response. The example below should return about 3000 bytes.
I am open to corrections to my existing method of calculation or an alternative method to handle this in node. Thanks.
var http = require('http');

var options = {host: 'www.subway.com', path: '/menu/Images/Menu/Categories_Main/menu-category-featured-products.jpg'};

var req = http.get(options, function (res) {
    var obj = res.headers;
    var filesize = obj['content-length'];
    console.log(filesize + " bytes");
});

req.end();
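One way to sanity-check the content-length header is to count the bytes that actually arrive and to look at the status code; if the two numbers disagree, or the status is a redirect, the header you are reading describes a different response (for example a redirect or error page) rather than the image itself. A rough, untested sketch:
var http = require('http');

var options = {host: 'www.subway.com', path: '/menu/Images/Menu/Categories_Main/menu-category-featured-products.jpg'};

http.get(options, function (res) {
    console.log("status:", res.statusCode);   // a 301/302 here would explain an unexpected size
    console.log("content-length header:", res.headers['content-length']);

    var received = 0;
    res.on('data', function (chunk) { received += chunk.length; });
    res.on('end', function () {
        console.log("bytes actually received: " + received);
    });
});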
