I am trying to get text content from a web page, for example Google.com.
In the browser console I run:
$('#SIvCob').innerText
and get:
"Google offered in: русский"
This is the text I was looking for. Now I want to save it to a file (.txt).
Two caveats: there is not just one item I am looking for but 7 to 10, and the page refreshes every second, so I am going to write a loop.
I know about the copy() function and about right-clicking in the console and choosing "Save As", but I need code that does this automatically.
Thanks in advance.
The browser has no API for writing to the file system, since that would be a security risk. But you can use Node.js and its File System API to write your text file.
You will also need the HTTP API to fetch the web content, and you will need to parse the HTML. You can do that with fast-html-parser or any other module of your choice (high5, htmlparser, htmlparser2, htmlparser2-dom, hubbub, libxmljs, ms/file, parse5, ...). The example below uses node-html-parser.
var http = require('http');
var fs = require('fs');
var parser = require('node-html-parser');

var options = {
  host: 'www.google.com',
  port: 80,
  path: '/index.html'
};
var file = '/path/to/myFile.txt';

http.get(options, function (res) {
  res.setEncoding('utf8');
  var body = '';
  // collect the response body chunk by chunk
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    var dom = parser.parse(body);
    var node = dom.querySelector('#SIvCob');
    // querySelector returns null when the element is not in the fetched HTML
    var text = node ? node.text : '';
    fs.writeFile(file, text, function (err) {
      if (err) throw err;
      console.log('The file has been saved!');
    });
  });
}).on('error', function (err) {
  console.log('Request failed: ' + err.message);
});
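The question also mentions 7 to 10 items that refresh every second. As a rough sketch of how that could be layered on top of the code above, the helper below appends one timestamped line per item to the file, and a timer calls it repeatedly. The names formatSnapshot, appendSnapshot and startPolling are made up for this illustration, and fetchItems stands for whatever function you write to download and parse the page:

```javascript
const fs = require('fs');

// Turn a list of extracted strings into lines prefixed with an ISO timestamp.
function formatSnapshot(items, date) {
  const stamp = date.toISOString();
  return items.map((item) => stamp + '\t' + item + '\n').join('');
}

// Append one snapshot to the file (the file is created if it does not exist).
function appendSnapshot(file, items, date) {
  fs.appendFileSync(file, formatSnapshot(items, date || new Date()));
}

// Call fetchItems(callback) every intervalMs milliseconds and save each result.
// fetchItems is a hypothetical function that calls back with (err, arrayOfStrings).
function startPolling(fetchItems, file, intervalMs) {
  return setInterval(function () {
    fetchItems(function (err, items) {
      if (err) return console.error('Fetch failed: ' + err.message);
      appendSnapshot(file, items);
    });
  }, intervalMs);
}
```

Calling startPolling(fetchItems, '/path/to/myFile.txt', 1000) would append a snapshot roughly once per second; passing the returned timer to clearInterval stops it.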
I need to extract text data from a PDF at a web URL (http://www.africau.edu/images/default/sample.pdf).
I tried two node modules.
1) crawler-request
it('Read Pdf Data using crawler', function () {
  const crawler = require('crawler-request');
  function response_text_size(response) {
    response["size"] = response.text.length;
    return response;
  }
  crawler("http://www.africau.edu/images/default/sample.pdf", response_text_size).then(function (response) {
    // handle response
    console.log("Response =" + response.size);
  });
});
When I run this, nothing is printed to the console.
2) pfd2json/pdfparser
it('Read Data from url', function () {
  var request = require('request');
  var pdf = require('pfd2json/pdfparser');
  var fs = require('fs');
  var pdfUrl = "http://www.africau.edu/images/default/sample.pdf";
  let databuffer = fs.readFileSync(pdfUrl);
  pdf(databuffer).then(function (data) {
    var arr:Array<String> = data.text;
    var n = arr.includes('Thursday 02 May');
    console.log("Print Array " + n);
  });
});
Failed: ENOENT: no such file or directory, open 'http://www.africau.edu/images/default/sample.pdf'
I am able to read the data from a local path, but I cannot extract it from a URL.
The issue here is that you are using the fs (File System) module to read a file that lives on a remote server; fs can only open local paths.
You also misspelled the pdf2json module name, which should give you an error.
You did require the request module, which makes it possible to access the remote file. Here's one way to do this:
it('Read Data from url', function () {
  var request = require('request');
  var PDFParser = require('pdf2json');
  var pdfUrl = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf';
  var pdfParser = new PDFParser(this, 1);
  // executed if the parser fails for any reason
  pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));
  // executed when the parser finished
  pdfParser.on("pdfParser_dataReady", pdfData => console.log(pdfParser.getRawTextContent()));
  // request the pdf's content as a raw Buffer (encoding: null), then run the parser on it
  request({ url: pdfUrl, encoding: null }, (error, response, body) => {
    if (error) return console.error(error);
    pdfParser.parseBuffer(body);
  });
});
This will make it possible to load the remote .pdf file in your program.
It simply outputs the textual content of the .pdf file once the parser has finished reading the data. I'd recommend looking at the pdf2json documentation if you want to do more.
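Since the request module has since been deprecated, the download step can also be done with Node's built-in http and https modules. The following is a minimal sketch under that assumption; fetchBuffer is a name invented here, and it does not follow redirects:

```javascript
const http = require('http');
const https = require('https');

// Download a URL and resolve with its body as a single Buffer.
function fetchBuffer(url) {
  return new Promise((resolve, reject) => {
    // pick the client module that matches the URL scheme
    const client = url.startsWith('https:') ? https : http;
    client.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body of a failed request
        reject(new Error('Unexpected status code: ' + res.statusCode));
        return;
      }
      const chunks = [];
      // no setEncoding call, so every chunk stays a binary Buffer
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => resolve(Buffer.concat(chunks)));
      res.on('error', reject);
    }).on('error', reject);
  });
}
```

The resulting Buffer could then be handed to the parser in the same way as above, for example fetchBuffer(pdfUrl).then(buf => pdfParser.parseBuffer(buf)).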
I'm certain I'm missing something obvious, but the gist of the problem is I'm receiving a PNG from a Mapbox call with the intent of writing it to the file system and serving it to the client. I've successfully relayed the call, received a response of raw data and written a file. The problem is that my file ends up truncated no matter what path I take, and I've exhausted the answers I've found skirting the subject. I've dumped the raw response to the log, and it's robust, but any file I make tends to be about a chunk's worth of unreadable data.
Here's the code I've got at present for the file making. I tried this buffer move as a last ditch after several failed and comparably fruitless iterations. Any help would be greatly appreciated.
module.exports = function(req, res, cb) {
  var cartography = function() {
    return https.get({
      hostname: 'api.mapbox.com',
      path: '/v4/mapbox.wheatpaste/' + req.body[0] + ',' + req.body[1] + ',6/750x350.png?access_token=' + process.env.MAPBOX_API
    }, function(res) {
      var body = '';
      res.on('data', function(chunk) {
        body += chunk;
      });
      res.on('end', function() {
        var mapPath = 'map' + req.body[0] + req.body[1] + '.png';
        var map = new Buffer(body, 'base64');
        fs.writeFile(__dirname + '/client/images/maps/' + mapPath, map, 'base64', function(err) {
          if (err) throw err;
          cb(mapPath);
        })
      })
    });
  };
  cartography();
};
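Concatenating the chunks into a string (body += chunk) decodes the binary PNG data as text and corrupts it, and the response is not base64-encoded in the first place. It is simpler and more compact to pipe the response straight into a file: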
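const fs = require('fs');
const https = require('https');

// url is the full Mapbox request URL, built as in the question
https.get(url, (response) => {
  if (response.statusCode !== 200) {
    response.resume(); // discard the body of a failed request
    return;
  }
  const imageName = 'image.png'; // for this purpose I usually use crypto
  // pipe the response into a write stream, creating ./public/image.png
  const fileStream = fs.createWriteStream('./public/' + imageName);
  response.pipe(fileStream);
  // the file is complete only once the stream emits 'finish';
  // imageName can then be served as-is if public is the static folder in app.js
  fileStream.on('finish', () => console.log('Saved ' + imageName));
});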
You could take the original name and extension from the URL, but it is safer to generate a new name with crypto and to get the file extension either from the URL, as mentioned, or with the read-chunk and file-type modules.
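If the image is needed in memory rather than streamed straight to disk, the chunks have to stay binary. Here is a small sketch of the difference, using made-up helper names (joinAsBuffers, joinAsString) and a few hand-written bytes in place of a real response:

```javascript
// Chunks arrive from an HTTP response as Buffer objects (when no encoding is set).
// Keeping them as Buffers and joining them once preserves every byte.
function joinAsBuffers(chunks) {
  return Buffer.concat(chunks);
}

// What `body += chunk` does: each chunk is decoded as UTF-8 text,
// so bytes that are not valid UTF-8 are replaced and the data is damaged.
function joinAsString(chunks) {
  let body = '';
  for (const chunk of chunks) {
    body += chunk;
  }
  return Buffer.from(body, 'utf8');
}

// The 8-byte PNG signature, split across two chunks for illustration.
const pngStart = [
  Buffer.from([0x89, 0x50, 0x4e, 0x47]),
  Buffer.from([0x0d, 0x0a, 0x1a, 0x0a])
];

console.log(joinAsBuffers(pngStart).length); // 8
console.log(joinAsString(pngStart).length);  // 10, the 0x89 byte became a 3-byte replacement character
```

In the question's handler this would mean pushing each chunk onto an array, calling Buffer.concat in the 'end' callback, and passing that Buffer to fs.writeFile without an encoding argument.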
I have written some JavaScript that requests a page and returns the data from the site. However, I would like to get specific elements out of this HTML and use functions like document.getElementById. How can I get that sort of functionality here? Currently, console.log(chunk) simply prints the entire HTML body; I want to be able to parse it.
var http = require("http");
var options = {
  host: 'www.google.com',
  port: 80,
  path: '/news'
};
http.get(options, function(res) {
  res.setEncoding('utf8');
  res.on('data', function(chunk){
    console.log(chunk);
  });
}).on('error', function(e) {
  console.log("Got error: " + e.message);
});
There are many npm modules that do this. Here are two:
1. cheerio
2. jsdom
I know this question has been asked but my mind has been blown by my inability to get this working. I am trying to upload multiple images to my server with the following code:
var formidable = require('formidable');
var fs = require('fs');
...
router.post('/add_images/:showcase_id', function(req, res){
  if(!admin(req, res)) return;
  var form = new formidable.IncomingForm(),
      files = [];
  form.uploadDir = global.__project_dirname+"/tmp";
  form.on('file', function(field, file) {
    console.log(file);
    file.image_id = global.s4()+global.s4();
    file.endPath = "/img/"+file.image_id+"."+file.type.replace("image/","");
    files.push({field:field, file:file});
  });
  form.on('end', function() {
    console.log('done');
    console.log(files);
    db.get("SOME SQL", function(err, image_number){
      if(err){
        console.log(err);
      }
      var db_index = 0;
      if(image_number) db_index = image_number.image_order;
      files.forEach(function(file, index){
        try{
          //this line opens the image in my computer (testing)
          require("sys").exec("display " + file.file.path);
          console.log(file.file.path);
          fs.renameSync(file.file.path, file.file.endPath);
        }catch (e){
          console.log(e);
        }
        db.run( "SOME MORE SQL"')", function(err){
          if(index == files.length)
            res.redirect("/admin/gallery"+req.params.showcase_id);
        });
      });
    });
  });
  form.parse(req);
});
The line that opens the image via system calls works just fine, however I continue to get:
Error: ENOENT, no such file or directory '/home/[username]/[project name]/tmp/285ef5276581cb3b8ea950a043c6ed51'
from the rename statement.
The value of file.file.path is:
/home/[username]/[project name]/tmp/285ef5276581cb3b8ea950a043c6ed51
I am so confused and have tried everything. What am I doing wrong?
You probably get this error because the target path does not exist or you don't have write permission for it.
The error you get is misleading due to a bug in Node.js, see:
https://github.com/joyent/node/issues/5287
https://github.com/joyent/node/issues/685
Consider adding:
console.log(file.file.endPath);
before the fs.renameSync call, and check that the target path exists and is writable by your application.
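One thing worth checking in the question's code: endPath starts with "/img/", which is an absolute path at the root of the file system rather than a folder inside the project. Below is a minimal sketch of a safer move, assuming the images should end up under a directory you control (moveUpload and the directory layout are assumptions for illustration):

```javascript
const fs = require('fs');
const path = require('path');

// Move an uploaded temp file to targetDir/fileName, creating targetDir if needed.
function moveUpload(tempPath, targetDir, fileName) {
  // recursive: true creates missing parent folders and is a no-op if the directory exists
  fs.mkdirSync(targetDir, { recursive: true });
  const targetPath = path.join(targetDir, fileName);
  fs.renameSync(tempPath, targetPath);
  return targetPath;
}
```

For example, moveUpload(file.file.path, path.join(__dirname, 'public', 'img'), 'abc.png'). Note that fs.renameSync cannot move a file between different file systems or partitions (it fails with EXDEV); in that case copying the file and then deleting the original is the usual fallback.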
You mentioned a form, so note that Formidable doesn't work out of the box with Node.js alone: it only handles the submitted data on the server. Unless you take input with something like the prompt module, you'll need an HTML front end (built with something like Angular, React or Browserify) to give it access to your interface.
Right now I'm using this script in PHP. I pass it the image and size (large/medium/small) and if it's on my server it returns the link, otherwise it copies it from a remote server then returns the local link.
function getImage ($img, $size) {
    if (@filesize("./images/".$size."/".$img.".jpg")) {
        return './images/'.$size.'/'.$img.'.jpg';
    } else {
        copy('http://www.othersite.com/images/'.$size.'/'.$img.'.jpg', './images/'.$size.'/'.$img.'.jpg');
        return './images/'.$size.'/'.$img.'.jpg';
    }
}
It works fine, but I'm trying to do the same thing in Node.js and I can't seem to figure it out. The filesystem seems to be unable to interact with any remote servers so I'm wondering if I'm just messing something up, or if it can't be done natively and a module will be required.
Anyone know of a way in Node.js?
You should check out http.Client and http.ClientResponse. Using those you can make a request to the remote server and write out the response to a local file using fs.WriteStream.
Something like this:
var http = require('http');
var fs = require('fs');
var google = http.createClient(80, 'www.google.com');
var request = google.request('GET', '/',
  {'host': 'www.google.com'});
request.end();
var out = fs.createWriteStream('out');
request.on('response', function (response) {
  response.setEncoding('utf8');
  response.on('data', function (chunk) {
    out.write(chunk);
  });
});
I haven't tested that, and I'm not sure it'll work out of the box. But I hope it'll guide you to what you need.
To give a more up-to-date version (the most recent answer is 4 years old, and http.createClient is now deprecated), here is a solution using the request module:
var fs = require('fs');
var request = require('request');
function getImage (img, size, filesize) {
  var imgPath = size + '/' + img + '.jpg';
  if (filesize) {
    return './images/' + imgPath;
  } else {
    request('http://www.othersite.com/images/' + imgPath).pipe(fs.createWriteStream('./images/' + imgPath));
    return './images/' + imgPath;
  }
}
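The filesize argument above has to be supplied by the caller. As a sketch of how the existence check from the PHP version could be done inside the function, using only built-in modules: the function below returns a promise for the local path and downloads the file only when no non-empty local copy exists. The baseDir and remoteBase parameters and the download helper are assumptions added for illustration, and plain http is assumed for the remote server:

```javascript
const fs = require('fs');
const http = require('http');
const path = require('path');

// Download url into targetPath; resolves once the file is fully written.
function download(url, targetPath) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body of a failed request
        return reject(new Error('Unexpected status code: ' + res.statusCode));
      }
      const out = fs.createWriteStream(targetPath);
      res.pipe(out);
      out.on('finish', resolve);
      out.on('error', reject);
    }).on('error', reject);
  });
}

// Resolve with the local path, fetching the image first if it is missing or empty.
async function getImage(img, size, baseDir, remoteBase) {
  const relative = size + '/' + img + '.jpg';
  const localPath = path.join(baseDir, size, img + '.jpg');
  try {
    const stats = await fs.promises.stat(localPath);
    if (stats.size > 0) return localPath; // a local copy already exists
  } catch (err) {
    if (err.code !== 'ENOENT') throw err; // only "file not found" means "download it"
  }
  await fs.promises.mkdir(path.dirname(localPath), { recursive: true });
  await download(remoteBase + '/' + relative, localPath);
  return localPath;
}
```

A call might look like getImage('photo', 'large', './images', 'http://www.othersite.com/images').then(console.log). Unlike the PHP version, the result arrives asynchronously, so the caller has to wait for the promise before using the path.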
If you can't use the remote user's password for some reason and need to authenticate with an identity key (RSA), then running scp programmatically with child_process is a good option:
const { exec } = require('child_process');
exec(`scp -i /path/to/key username@example.com:/remote/path/to/file /local/path`,
  (error, stdout, stderr) => {
    if (error) {
      console.log(`There was an error ${error}`);
    }
    console.log(`The stdout is ${stdout}`);
    console.log(`The stderr is ${stderr}`);
  });
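A variation on the same idea, as a sketch: execFile takes the arguments as an array and does not go through a local shell, so values such as file paths are not subject to shell interpretation on the local machine. The helper names and parameters below are made up for illustration:

```javascript
const { execFile } = require('child_process');

// Build the argument list for: scp -i <key> <user>@<host>:<remotePath> <localPath>
function buildScpArgs(keyPath, user, host, remotePath, localPath) {
  return ['-i', keyPath, user + '@' + host + ':' + remotePath, localPath];
}

// Run scp and call back with an error, or with null on success.
function copyFromRemote(keyPath, user, host, remotePath, localPath, callback) {
  const args = buildScpArgs(keyPath, user, host, remotePath, localPath);
  execFile('scp', args, (error, stdout, stderr) => {
    if (error) {
      // stderr usually carries scp's own explanation of the failure
      return callback(new Error('scp failed: ' + (stderr || error.message)));
    }
    callback(null);
  });
}
```

This assumes an scp binary is available on the PATH of the machine running Node.js.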