Get external website content using Node.js

On my website I am using Node.js for the backend and HTML for the front end. I need to get external websites' metadata (keywords).
Is there any package for getting this metadata in Node.js?
For example, I have 100 website URLs in an array, like this:
var arrayName = ["http://www.realsimple.com/food-recipes/9-healthy-predinner-snacks", "http://www.womenshealthmag.com/weight-loss/100-calorie-snacks", "https://www.pinterest.com/explore/healthy-snacks/", "http://www.rd.com/slideshows/healthy-snacks-for-adults/", "http://greatist.com/snacking", "http://www.bodybuilding.com/fun/26-best-healthy-snacks.html"]
I need to get each website's metadata, particularly the keywords.
Is there a package in Node.js for this?
I found some code on Google:
var http = require('http');

var options = {
  host: 'www.google.com',
  port: 80,
  path: '/index.html'
};

http.get(options, function (res) {
  console.log("Got response: " + res.statusCode);
}).on('error', function (e) {
  console.log("Got error: " + e.message);
});
Are there any other options?
Expected Outputs:
Array1 = ["keyword1","keyword2","keyword3"];
Array2 = ["keyword1","keyword2","keyword3"];
Array3 = ["keyword1","keyword2","keyword3"];
Array1, Array2 and Array3 correspond to Site1, Site2 and Site3, and so on.

I'd suggest you use the following packages:
http://npm.im/cheerio
http://npm.im/request
Note: you will need to write the code that extracts the keywords from the site data yourself.
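As a rough sketch (not a drop-in solution), fetching each URL with request and reading the keywords meta tag with cheerio could look something like the following. Note that many sites simply don't define a keywords meta tag, so some of the resulting arrays may be empty:
var request = require('request');
var cheerio = require('cheerio');

// The array of site URLs from the question (shortened here).
var arrayName = ["http://www.realsimple.com/food-recipes/9-healthy-predinner-snacks", "http://www.womenshealthmag.com/weight-loss/100-calorie-snacks"];

arrayName.forEach(function (url) {
  request(url, function (err, res, body) {
    if (err) {
      return console.log('Failed to fetch ' + url + ': ' + err.message);
    }
    var $ = cheerio.load(body);
    // The keywords meta tag, when present, holds a comma-separated list.
    var content = $('meta[name="keywords"]').attr('content') || '';
    var keywords = content.split(',').map(function (k) {
      return k.trim();
    }).filter(Boolean);
    console.log(url, keywords); // e.g. [ 'keyword1', 'keyword2', ... ]
  });
});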

Related

Node.js: Transform Request Options into Final URL

If I'm using Node.js, is there a way I can automatically turn a set of options for the request function into the final URL that Node.js will use for its HTTP request?
That is, if I have a set of options that I use like this:
var http = require('http');

var options = {
  host: 'www.random.org',
  path: '/integers/?num=1&min=1&max=10&col=1&base=10&format=plain&rnd=new'
};

var callback = function (response) {
  var str = '';

  // another chunk of data has been received, so append it to `str`
  response.on('data', function (chunk) {
    str += chunk;
  });

  // the whole response has been received, so we just print it out here
  response.on('end', function () {
    console.log(str);
  });
};

const req = http.request(options, callback).end();
Is there a way for me to transform
var options = {
  host: 'www.random.org',
  path: '/integers/?num=1&min=1&max=10&col=1&base=10&format=plain&rnd=new'
};
Into the string
www.random.org/integers/?num=1&min=1&max=10&col=1&base=10&format=plain&rnd=new
I realize that for the above case this would be a trivial string concatenation:
const url = 'http://' + options.host + options.path;
What I'm interested in is code that can transform any options object into its final URL. If I look at the manual, there are twenty-one possible options for a request. Some might affect the final URL; some might not. I'm hoping Node.js or npm has a built-in way of turning those options into a URL and saving me the tedious work of doing it myself.
Node.js originally offered the querystring module, which has functions that seem to do what you need. For instance, the stringify function:
https://nodejs.org/dist/latest-v15.x/docs/api/querystring.html#querystring_querystring_stringify_obj_sep_eq_options
querystring.stringify({ foo: 'bar', baz: ['qux', 'quux'], corge: '' });
// Returns 'foo=bar&baz=qux&baz=quux&corge='
More recently, objects like URLSearchParams were introduced in the url module to better align with the WHATWG spec and therefore be more in line with the APIs available in browsers:
https://nodejs.org/dist/latest-v15.x/docs/api/url.html#url_class_urlsearchparams
const myURL = new URL('https://example.org/?abc=123');
console.log(myURL.searchParams.get('abc'));
// Prints 123
myURL.searchParams.append('abc', 'xyz');
console.log(myURL.href);
// Prints https://example.org/?abc=123&abc=xyz
Which approach you choose in the end depends on your specific needs.
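For the simple case in the question (just host, path and perhaps port), a rough sketch with the WHATWG URL class could look like the following; optionsToUrl is only an illustrative helper, it assumes http when no protocol is given, and it ignores the many other request options:
const { URL } = require('url');

function optionsToUrl(options) {
  // Assume plain http when the options object carries no protocol.
  const protocol = options.protocol || 'http:';
  const url = new URL(options.path || '/', protocol + '//' + options.host);
  if (options.port) {
    url.port = options.port;
  }
  return url.href;
}

console.log(optionsToUrl({
  host: 'www.random.org',
  path: '/integers/?num=1&min=1&max=10&col=1&base=10&format=plain&rnd=new'
}));
// http://www.random.org/integers/?num=1&min=1&max=10&col=1&base=10&format=plain&rnd=new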

Why do GitHub webhooks give me circular JSON objects?

With the code below I am trying to get some basic pull request info from my GitHub repo when someone submits a pull request for it, but I get this output instead:
$ node poc2.js
sha1=fef5611617bf56a36d165f73d41e527ee2c97d49
res [Circular]
req [Circular]
Since the secret hash is printed, the webhook is received. I can also verify this by running nc -l 8080 instead of my Node.js app; there I see a large JSON object, which is where I got the JSON structure I use in the console.log calls below.
Question
Can anyone figure out why I get a "circular" JSON object from both req and res?
And how can I get the values I am trying to console.log?
const secret = "sdfsfsfsfwerwergdgv";
const http = require('http');
const crypto = require('crypto');

http.createServer(function (req, res) {
  req.on('data', function (chunk) {
    let sig = "sha1=" + crypto.createHmac('sha1', secret).update(chunk.toString()).digest('hex');
    console.log(sig);
    if (req.headers['x-hub-signature'] == sig) {
      console.log("res %j", res);
      console.log("req %j", req);
      if (res.action == 'opened') {
        console.log('PR for: ' + res.repository.html_url);
        console.log('PR for: ' + res.repository.full_name);
        console.log('PR: ' + res.pull_request.html_url);
        console.log('PR title: ' + res.pull_request.title);
        console.log('PR description' + res.pull_request.description);
        console.log('PR by user: ' + res.pull_request.user.login);
      }
    }
  });
  res.end();
}).listen(8080);
The req is circular because it's not a JSON object; it's the Request object, which includes a lot of internal machinery, some of which is circularly linked.
The actual incoming JSON document is in req.body, or wherever your particular JSON body parser puts it; in this example you don't have a body parser yet, so that's something you should fix.
Tip: you may want to use Express instead of the core Node http module; it's much more capable.
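A minimal sketch of that fix, staying with the core http module: buffer the whole request body, verify the signature against it, and then JSON.parse it into a payload object (payload is just a local name here, not something GitHub provides):
const http = require('http');
const crypto = require('crypto');

const secret = "sdfsfsfsfwerwergdgv";

http.createServer(function (req, res) {
  let body = '';
  req.on('data', function (chunk) {
    body += chunk;
  });
  req.on('end', function () {
    const sig = "sha1=" + crypto.createHmac('sha1', secret).update(body).digest('hex');
    if (req.headers['x-hub-signature'] === sig) {
      // The webhook document is the request body, not the req/res objects themselves.
      const payload = JSON.parse(body);
      if (payload.action === 'opened') {
        console.log('PR: ' + payload.pull_request.html_url);
        console.log('PR title: ' + payload.pull_request.title);
        console.log('PR by user: ' + payload.pull_request.user.login);
      }
    }
    res.end();
  });
}).listen(8080);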

Chrome DevTools. Save console output to file automatically

I'm trying to get text content from a webpage, for example Google.com.
In the console I write:
$('#SIvCob').innerText
and get:
"Google offered in: русский"
This is the text I'm looking for. Now I want to save it to a file (.txt).
Two points: there isn't just one item I'm after, there are actually 7-10 of them, and the page refreshes every second, so I'm going to write a loop.
I know about the copy() function and about right-clicking the console and choosing "Save as...", but I need code that will do it automatically.
Thanks in advance.
The browser has no API to write to the file system, since that would be a security risk, but you can use Node.js and its File System API to write your text file.
You will also need to use the HTTP API to get the web content, and you will need to parse the HTML; you can do that with fast-html-parser or any other module of your choice (high5, htmlparser, htmlparser2, htmlparser2-dom, hubbub, libxmljs, ms/file, parse5, ...).
var http = require('http');
var fs = require('fs');
var parser = require('node-html-parser');

var options = {
  host: 'www.google.com',
  port: 80,
  path: '/index.html'
};

var file = '/path/to/myFile.txt';

http.get(options, function (res) {
  res.setEncoding('utf8');
  var body = '';
  res.on('data', function (chunk) { body += chunk });
  res.on('end', function () {
    var dom = parser.parse(body);
    var text = dom.querySelector('#SIvCob').text;
    fs.writeFile(file, text, function (err) {
      if (err) throw err;
      console.log('The file has been saved!');
    });
  });
});
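The question also mentions that the values refresh every second, so some kind of loop is needed. One rough way to do that, reusing the http, fs, parser, options and file variables from the block above, is to wrap the fetch in a function and call it from setInterval, appending to the file instead of overwriting it:
function fetchAndSave() {
  http.get(options, function (res) {
    res.setEncoding('utf8');
    var body = '';
    res.on('data', function (chunk) { body += chunk });
    res.on('end', function () {
      var dom = parser.parse(body);
      var text = dom.querySelector('#SIvCob').text;
      // Append rather than overwrite so every refresh adds a new line.
      fs.appendFile(file, text + '\n', function (err) {
        if (err) throw err;
      });
    });
  });
}

// Re-fetch once per second, matching the refresh rate mentioned in the question.
setInterval(fetchAndSave, 1000);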

nodejs: node-http-proxy and harmon: rewriting the HTML response from the endpoint instead of the 302 redirect response

I'm using nodejs with node-http-proxy along with harmon. I am using harmon to rewrite the proxied response to include a javascript file and a css file. When I set the target of the proxy to be http://nodejs.org or anything other than localhost, I receive a 301 or 302 redirect. The script is rewriting the 301 response instead of the fully proxied response. How can I use harmon to rewrite the end response instead of the 302 response?
Here is the example script I am running, from the harmon examples folder:
var http = require('http');
var connect = require('connect');
var httpProxy = require('http-proxy');

var selects = [];
var simpleselect = {};

//<img id="logo" src="/images/logo.svg" alt="node.js">
simpleselect.query = 'img';
simpleselect.func = function (node) {
  //Create a read/write stream with the outer option
  //so we get the full tag and we can replace it
  var stm = node.createStream({ "outer": true });

  //variable to hold all the info from the data events
  var tag = '';

  //collect all the data in the stream
  stm.on('data', function (data) {
    tag += data;
  });

  //When the read side of the stream has ended..
  stm.on('end', function () {
    //Print out the tag; you can also parse it or use a regex if you want
    process.stdout.write('tag: ' + tag + '\n');
    process.stdout.write('end: ' + node.name + '\n');

    //Now on the write side of the stream write some data using .end()
    //N.B. if end isn't called it will just hang.
    stm.end('<img id="logo" src="http://i.imgur.com/LKShxfc.gif" alt="node.js">');
  });
}
selects.push(simpleselect);

//
// Basic Connect App
//
var app = connect();

var proxy = httpProxy.createProxyServer({
  target: 'http://nodejs.org'
});

app.use(require('../')([], selects, true));
app.use(
  function (req, res) {
    proxy.web(req, res);
  }
);
The problem is that a lot of sites are now redirecting HTTP to HTTPS.
nodejs.org is one of those.
I have updated the sample https://github.com/No9/harmon/blob/master/examples/doge.js to show how the http-proxy needs to be configured to deal with HTTPS.
If you still have problems with other arbitrary redirects please log an issue on harmon.
Thanks
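For reference, the key change in that sample is roughly the proxy configuration below. changeOrigin is a standard http-proxy option, and followRedirects is available in more recent versions of http-proxy; treat this as a sketch rather than the exact code from the linked example:
var httpProxy = require('http-proxy');

// Point the proxy at the HTTPS origin so the upstream no longer answers
// with a 301/302 redirect to https:// before harmon sees the real page.
var proxy = httpProxy.createProxyServer({
  target: 'https://nodejs.org',
  changeOrigin: true,     // rewrite the Host header to match the target
  followRedirects: true   // let the proxy follow any remaining redirects itself
});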

HTTP Get Node JS to parse using document elements

I have written some JavaScript that requests a page and returns the data from the site; however, I would like to get specific elements out of that HTML and use functions like document.getElementById. How can I get that sort of functionality here? Currently, console.log(chunk) simply spits out the entire HTML body; I want to be able to parse it.
var http = require("http");
var options = {
host: 'www.google.com',
port: 80,
path: '/news'
};
http.get(options, function(res) {
res.setEncoding('utf8');
res.on('data', function(chunk){
console.log(chunk);
});
}).on('error', function(e) {
console.log("Got error: " + e.message);
});
There are many npm modules that can do this; here are a couple:
1. cheerio
2. jsdom
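As a rough example with cheerio (the '#someId' selector below is just a placeholder for whatever element you would otherwise pass to document.getElementById), buffer the whole response and then query the parsed document:
var http = require('http');
var cheerio = require('cheerio');

var options = {
  host: 'www.google.com',
  port: 80,
  path: '/news'
};

http.get(options, function (res) {
  res.setEncoding('utf8');
  var body = '';
  res.on('data', function (chunk) { body += chunk });
  res.on('end', function () {
    var $ = cheerio.load(body);
    // cheerio takes CSS selectors, so '#someId' plays the role of document.getElementById('someId')
    console.log($('#someId').text());
    console.log($('title').text());
  });
}).on('error', function (e) {
  console.log("Got error: " + e.message);
});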
