Web scraping in Node.js

Lately I've been trying to scrape information from a website using Node.js, the request module, and cheerio. The code works fine (statusCode = 200) on my localhost (127.0.0.1), but when I push it to Heroku, I get statusCode = 403.
Is it because of a cookie? If so, why does it work on my localhost, where I don't add any cookie to the request?
request({
    method: 'GET',
    headers: {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    },
    url: 'https://www.example.com/login',
    json: true
}, (err, response, body) => {
    if (err) {
        return console.log('Failed to request: ', err);
    }
    console.log(response.statusCode);
});
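A 403 that appears only on Heroku often means the target site blocks requests from well-known datacenter IP ranges, which no cookie or header change will fix. If cookies are the suspect, though, request can persist them with an explicit jar. A minimal sketch (example.com is the placeholder from the question):

const request = require('request');

// one jar shared across calls, so Set-Cookie responses are replayed
const jar = request.jar();

request({
    method: 'GET',
    url: 'https://www.example.com/login', // placeholder URL, as in the question
    jar: jar,
    json: true,
    headers: {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
}, (err, response) => {
    if (err) return console.log('Failed to request: ', err);
    console.log(response.statusCode);
    console.log(jar.getCookies('https://www.example.com/')); // cookies the server set
});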

Related

How to fetch and get response headers with no CORS?

I'm trying to fetch a file-hosting page from the browser. I disabled CORS web security with the extension https://chrome.google.com/webstore/detail/allow-cors-access-control/lhobafahddgcelffkeicbaginigeejlf?hl=en and tested it: fetch returns the HTML, but the problem is that I don't get the response headers, like the cookie. It only gives me four keys. I need that response-header cookie to do the next fetch.
const initHeader = new Headers()
initHeader.append('accept', '*/*')
initHeader.append('connection', 'keep-alive')
initHeader.append('user-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36')

fetch('https://www56.zippyshare.com/v/d0M2gx7X/file.html', {
    method: 'GET',
    headers: initHeader
}).then(function (res) {
    res.headers.forEach((val, key) => {
        console.log(key + ': ' + val)
    })
})
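This is expected browser behaviour rather than a bug: the Fetch spec treats Set-Cookie as a forbidden response-header name, so browser JavaScript can never read it, and cross-origin responses only expose a small safelisted set of headers unless the server sends Access-Control-Expose-Headers. Running the request server-side avoids both restrictions. A minimal sketch with Node's built-in https module (same URL as in the question):

const https = require('https');

https.get('https://www56.zippyshare.com/v/d0M2gx7X/file.html', (res) => {
    // Node sees every response header, including Set-Cookie,
    // which browsers never expose to JavaScript
    console.log(res.headers['set-cookie']);
    res.resume(); // drain the body so the socket is released
});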

How to get the value of response headers in Node.js

const request = require('request-promise');
const URL = 'https://www.imdb.com/title/tt0816222/?ref_=fn_al_tt_2';
(async () => {
    const response = await request({
        uri: URL,
        headers: {
            'Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
        },
    });
    console.log(response);
})();
I need help with this code. How can I get the response header values in the Visual Studio Code console for this site?
Just handle the Promise from the request-promise library, passing resolveWithFullResponse: true so it resolves with the full response rather than just the body:
request({
    uri: 'https://www.imdb.com/title/tt0816222/?',
    headers: { /* your headers */ },
    resolveWithFullResponse: true
})
.then(function (response) {
    console.log(response.headers)
})
You will get the response headers in response.headers and can print them with console.log(response.headers).
This code prints the headers:
const URL = 'https://www.imdb.com/title/tt0816222/?ref_=fn_al_tt_2';
const request = require('request-promise'); // plain request has no promise interface
(async () => {
    const response = await request({
        uri: URL,
        headers: {
            'Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
        },
        resolveWithFullResponse: true // resolve with the full response, not just the body
    });
    console.log(response.headers);
})();
That happens because by default request-promise resolves with only the body of the response. Add resolveWithFullResponse: true to the request options:
const URL = 'https://www.imdb.com/title/tt0816222/?ref_=fn_al_tt_2';
(async () => {
    const response = await request({
        uri: URL,
        headers: {
            'Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
        },
        resolveWithFullResponse: true
    });
    console.log(response.headers);
})();
If you need only the headers:
const URL = 'https://www.imdb.com/title/tt0816222/?ref_=fn_al_tt_2';
(async () => {
    const { headers } = await request({
        uri: URL,
        headers: {
            'Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
        },
        resolveWithFullResponse: true
    });
    console.log(headers);
})();
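As a side note, the request and request-promise packages were deprecated in 2020. On Node 18+, which ships a built-in fetch, a rough equivalent (same IMDb URL, offered as a sketch rather than a tested answer) would be:

const url = 'https://www.imdb.com/title/tt0816222/?ref_=fn_al_tt_2';

(async () => {
    const response = await fetch(url, {
        headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
    });
    // response.headers is an iterable Headers object
    for (const [key, value] of response.headers) {
        console.log(key + ': ' + value);
    }
})();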

Automating Sign Up

I'm trying to log in to Amazon via the Node.js request module, and seem to be having difficulties.
My aim is to log in to the site via their form; here is my code:
const request = require("request");
const rp = require("request-promise");
var querystring = require("querystring");

var cookieJar = request.jar();
var mainUrl = "https://www.amazon.com/";
var loginUrl = "https://www.amazon.co.uk/ap/signin";

let req = request.defaults({
    headers: {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.61 Safari/537.36"
    },
    jar: cookieJar,
    gzip: true,
    followAllRedirects: true
});

var loginData = "email=email#me.com&create=0&password=password123";

req.post(loginUrl, { data: loginData }, function (err, res, body) {
    console.log(body);
});
I ran a debugger in the background and found that this seemed to be the URL being called. I'm wondering if anyone knows what I may have done incorrectly.
Thank you.
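One observation, offered as a sketch rather than a confirmed fix: request has no data option, so the body above is silently dropped. URL-encoded form fields go in form. Something like the following (credentials are the placeholders from the question; note Amazon's real sign-in form also requires hidden fields and tokens scraped from the login page, which this sketch omits):

// hypothetical corrected POST: 'form' sends an
// application/x-www-form-urlencoded body, which request
// actually supports (unlike 'data', which it ignores)
req.post({
    url: loginUrl,
    form: {
        email: "email#me.com",   // placeholder credentials from the question
        create: "0",
        password: "password123"
    }
}, function (err, res, body) {
    if (err) return console.error(err);
    console.log(res.statusCode);
});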

Node.js can't make a request to zomato.com

I would like to make a request to https://zomato.com/, but there is no response. I can connect anywhere else, but with zomato I get a timeout error every time. I tried setting the user-agent but it didn't work. I'm using node 6.6.0 and request 2.79.0. Any ideas?
var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');
var http = require('http');

request.get({
    url: 'http://zomato.com/',
    headers: {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
    }
}, function (error, response, body) {
    if (error) {
        console.log("Error: " + error);
        return;
    } else {
        console.log("Status code: " + response.statusCode);
    }
});
Update:
I've noticed that this:
curl -X GET "https://zomato.com/"
returns a 301 redirect.
I had some problems trying to do something similar with some websites. Try NightmareJS instead of request.
I haven't tested it on zomato, but here is the code I used for another website:
var Nightmare = require('nightmare'); // required for the snippet to run
var cheerio = require('cheerio');

var website = new Nightmare()
    .useragent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36")
    .goto('http://zomatoorwhateverwebsite.com/')
    .evaluate(function () {
        return document.documentElement.innerHTML;
    })
    .end()
    .then(function (html) {
        var $ = cheerio.load(html);
        // Do what you need here
    });
I hope this helps. Sometimes you need to add some wait() calls; check the documentation for the extra functions.
If you look at the output of curl zomato.com -v, you can see that we are being redirected:
HTTP/1.1 301 Moved Permanently
So we need to add followAllRedirects: true to the options:
request.get({
    url: 'http://zomato.com/',
    followAllRedirects: true,
    headers: {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
    }
}, function (error, response, body) {
    console.log("Status code: " + response.statusCode);
});
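If redirects turn out not to be the culprit and the request still hangs, it may help to hit the https host directly and set an explicit timeout so the failure surfaces quickly. A minimal sketch (the www host and the 10-second limit are assumptions, not tested against zomato):

var request = require('request');

request.get({
    url: 'https://www.zomato.com/',  // skip the http -> https redirect entirely
    timeout: 10000,                  // give up after 10s instead of hanging
    headers: {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
    }
}, function (error, response) {
    if (error) {
        return console.log('Error: ' + (error.code || error));
    }
    console.log('Status code: ' + response.statusCode);
});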

How can I maintain a request session across async-waterfall calls?

I'm trying to run a series of requests in node.js. I was advised to use async-waterfall. I need to log into a remote vBulletin install and do a search for posts.
waterfall([
    function (callback) {
        var options = {
            jar: true,
            form: {
                'vb_login_password': '',
                'vb_login_username': mtfConfig.user,
                's': '',
                'do': 'login',
                'vb_login_md5password': crypto.createHash('md5').update(mtfConfig.password).digest("hex"),
                'vb_login_md5password_utf': crypto.createHash('md5').update(mtfConfig.password).digest("hex"),
                'submit.x': 13,
                'submit.y': 9
            },
            //formData: form,
            url: targetBaseURL + '/login.php',
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.2 Safari/537.36'
            },
            followAllRedirects: true,
            proxy: 'http://127.0.0.1:8888'
        };
        request.post(options, function (err, resp, body) {
            //console.log(res)
            $ = cheerio.load(body);
            var href = $('div.panel a').attr('href');
            callback(null, href, this.jar);
        });
    },
    function (href, jar, callback) {
        console.log('second callback called');
        request.get({
            jar: jar,
            url: href,
            followAllRedirects: true,
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.2 Safari/537.36'
            },
            proxy: 'http://127.0.0.1:8888'
        }, function (err, resp, body) {
            $ = cheerio.load(body);
            console.log($('div.signup h2').text());
            callback(null, request.jar);
        });
    },
    function (jar, callback) {
        console.log('third callback - search.php');
        request.get({
            jar: jar,
            url: targetBaseURL + '/search.php',
            followAllRedirects: true,
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.2 Safari/537.36'
            },
            proxy: 'http://127.0.0.1:8888'
        }, function (err, resp, body) {
            $ = cheerio.load(body);
            console.log(jar);
        });
    }
], done);
I've tried passing the cookie jar through from the first request, but when I reach search.php I am not logged in. How can I maintain the session cookies across requests and chained callbacks?
I found the answer here
Working code (first function only):
function (callback) {
    var jar = request.jar();
    var options = {
        jar: jar,
        form: {
            'vb_login_password': '',
            'vb_login_username': mtfConfig.user,
            's': '',
            'do': 'login',
            'vb_login_md5password': crypto.createHash('md5').update(mtfConfig.password).digest("hex"),
            'vb_login_md5password_utf': crypto.createHash('md5').update(mtfConfig.password).digest("hex"),
            'submit.x': 13,
            'submit.y': 9
        },
        //formData: form,
        url: targetBaseURL + '/login.php',
        headers: {
            'User-Agent': userAgent
        },
        followAllRedirects: true,
        proxy: 'http://127.0.0.1:8888'
    };
    request.post(options, function (err, resp, body) {
        //console.log(res)
        $ = cheerio.load(body);
        var href = $('div.panel a').attr('href');
        callback(null, href, jar);
        console.log(jar.cookies); //jar now contains the cookies we need
    });
},
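The crucial change is creating one explicit jar with request.jar() and threading that same object through every waterfall step, rather than jar: true (which uses request's global jar) or this.jar (which is undefined inside the callback). A minimal sketch of how a later step would reuse it, under those assumptions:

// a later waterfall step reusing the same jar; the vBulletin session
// survives because every request reads and writes the same jar object
function (href, jar, callback) {
    request.get({
        jar: jar,                          // the jar created in the first step
        url: targetBaseURL + '/search.php',
        followAllRedirects: true
    }, function (err, resp, body) {
        callback(err, jar);                // keep threading the same jar forward
    });
},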
