403 Forbidden when using Cheerio - javascript

I'm trying to webscrape a Website so I can gather some information for a project, here is my code and it's returning in the Console 403. I'm using request and cheerio to do this, why is this happening? Note I do know the what the majority of status codes mean.
const request = require('request');
const cheerio = require('cheerio');
request('http://www.realmeye.com/forum/', function(err, resp, html) {
if (!err) {
const gatherInformation = cheerio.load(html);
console.log(html);
}
})

You should add a "User-Agent" header to the request, which fits for some browser (e.g. chrome). The server probably checks it to avoid unfamiliar clients.
A thumb rule for web scraping:
Use chrome dev tools / fiddler / other similar tool to inspect the request firing up from your client (chrome, firefox, etc') before trying to reproduce it on your framework (Inspect headers, cookies, etc').
The raw request I saw on Fiddler in your case (when hitting your url on chrome):
GET /forum/ HTTP/1.1
Host: www.realmeye.com
Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36
Sec-Fetch-Mode: same-origin
Sec-Fetch-Site: same-origin
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,he;q=0.8
Most of servers would check "Accept" and "User-Agent" headers before returning 200 OK response.
The fixed code snippet:
const request = require('request');
const cheerio = require('cheerio');
let options = {
url: 'https://www.realmeye.com/forum/',
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
};
request(options, function(err, resp, html) {
if (!err) {
const gatherInformation = cheerio.load(html);
console.log(html);
}
})

Related

Scraping web data

I'd like to scrape the data from a website: https://en.macromicro.me/charts/773/baltic-dry-index
,which comprises 4 data sets.
I've discovered that the website use javascript to send request to https://en.macromicro.me/charts/data/773
to get the data,but for some reason i just can't get the data by using Postman or my script. i keep getting the result: {'success': 0, 'data': [], 'msg': 'error #240'}
did I miss anything here?
here is my code:
import requests
import json
import datetime
import pandas as pd
url = 'https://en.macromicro.me/charts/data/773'
header = {
'sec-ch-ua':'"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
'Accept':'application/json, text/javascript, */*; q=0.01',
'Docref': 'https://www.google.com/',
'X-Requested-With':'XMLHttpRequest',
'sec-ch-ua-mobile':'?0',
'Authorization':'Bearer ee1c7b87258a902bde1129df2b64abac',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}
r = requests.get(url,headers = header)
response = json.loads(r.text)
response
Missing Cookie in headers.
Refresh the page to get cookies.

pjax request change status to canceled as soon as it fires

here is my code
$(document).on('click', '#global-search-button', function() {
$.pjax({
url: "http://localhost/project/brows",
container: '#pjax-container',
timeout: 9000000 ,
})
});
when i click on the button
her here is more info on request
Request URL: http://localhost/project/brows?_pjax=%23pjax-container
Referrer Policy: no-referrer-when-downgrade
Provisional headers are shown
Accept: text/html, */*; q=0.01
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Referer: http://localhost/project/brows
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36
X-PJAX: true
X-PJAX-Container: #pjax-container
X-Requested-With: XMLHttpRequest
_pjax: #pjax-container
i have to add some more info to be able to post the question , so here it is
Often I had a similar problem because I included the hostname as well.
Try calling pjax without it like so
$.pjax({
url: "/project/brows",
container: '#pjax-container',
timeout: 9000000,
})
the reason was very simple and odd , basically if the href of the link should be a # it would cancel as soon as it fires .... i changed it to javascript:void(0) and it worked

retrive a result from a website by an ajax post request

I'm tryng to perfom a simple ajax post request to retrieve some data from a website.
In detail i'm trying to contact a page that a website recall to have some informations.
So i have the main website and a page that it calls to retrive data.
I discovered that page using the google inspection section, in particular in the xhr section of network field of the inspector.
In my code i used all the headers and the payload data that are used by the website to contact the page.
This is the code that i'm using to reach my goal:
var XMLHttpRequest = require("xmlhttprequest").XMLHttpRequest;
var url = 'https://www.remax.pt/Webservices/MainWebService.asmx/GetCityList';
var body = {"SiteRegionID":"12","RegionID":"12","RegionRowID":"78","ProvinceID":"0",
"LanguageCode":"ITA","MinInternetCount":"0","SearchType":"","OfficeAgent":"0",
"EncodingLanguage":"PTG","OfficeAgentId":"0"};
var xhr = new XMLHttpRequest();
xhr.onload = function () {
var data = xhr.responseText;
if (xhr.readyState == 4 && xhr.status == "200") {
console.table("results: "+data);
} else {
console.error("error: "+data);
}
}
xhr.open("POST", url, true);
xhr.setRequestHeader('Content-Type', 'application/json; charset=UTF-8');
//xhr.setRequestHeader("Content-Type","text/html");
xhr.setRequestHeader("Access-Control-Allow-Origin","*");
xhr.setRequestHeader("accept", "application/json, text/javascript, */*; q=0.01");
xhr.setRequestHeader("authority", "www.remax.pt");
xhr.setRequestHeader("scheme", "https");
xhr.setRequestHeader("path", "/Webservices/MainWebService.asmx/GetCityList");
xhr.setRequestHeader("accept-language","it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7");
//xhr.setRequestHeader('accept-encoding', 'gzip, deflate, br');
//xhr.setRequestHeader("host", "https://www.remax.pt");
//xhr.setRequestHeader('referer', 'https://www.remax.pt/PublicListingList.aspx');
//xhr.setRequestHeader('content-length', '192');
//xhr.setRequestHeader('cookie','__cfduid=dc7dd48ccff40ee4f85840bfc35685b311531384150; PersonalizationMap=; PersonalizationGallery=SelectedCountryID=12; GtTransLang=ITA; SLINGSHOT=LanguageCode=it-IT; SessionId=1ac0ec84-6a03-4965-ba90-7eb686f66bf5; ASP.NET_SessionId=rgia1pblms2abf11ypsbiqgz; GtTrans=ENU; LastSearch=SiteRegionID=12&TransactionTypeUID=260&RegionID=12&RegionRowID=78&LocationText=Porto&LocationValue=YR78&PriceCurrency=EUR&ComRes=2; PersonalizationRegion=#mode=list&tt=260&cr=2&r=78&cur=EUR&la=All&sb=PriceIncreasing&page=1&sc=12&sid=a81a1d1d-ee36-4236-a72e-31343349c574; PersonalizationDate=2018-7-24 10:0:30');
xhr.setRequestHeader("user-agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36");
xhr.setRequestHeader("x-requested-with", "XMLHttpRequest");
xhr.send(JSON.stringify(body);
Actually i never receive an answer. I think that the flow of operations never enter in the onLoad section because the string that there are in if and else sections are never printed.
I wanna specify that some headers are commented because i had an answer of this type:
Refused to set unsafe header 'nameHeader'
So i decided to do not use them for the moment.
I tried to change some headers or add something new but the problem remains and honestly i don't have idea if it is a problem of syntax of some fields or if i need other things to perform an acceptable request.
for completeness i insert the 4 fields that i found in the inspector tool that specify the parameters passed by the website to call the page:
GENERAL:
1. Request URL:
https://www.remax.pt/Webservices/MainWebService.asmx/GetCityList
2. Request Method: POST
3. Status Code: 200
4. Remote Address: 104.25.40.105:443
5. Referrer Policy: no-referrer-when-downgrade
RESPONSE HEADERS:
List item
cache-control: private, max-age=0
cf-ray: 43f532b9e9886260-LIS
content-encoding: br
content-type: application/json; charset=utf-8
date: Tue, 24 Jul 2018 09:00:44 GMT
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
status: 200
x-aspnet-version: 4.0.30319
x-ua-compatible: IE=9, IE=8
REQUEST HEADERS:
authority: www.remax.pt
method: POST
path: /Webservices/MainWebService.asmx/GetCityList
scheme: https
accept: application/json, text/javascript, /; q=0.01
accept-encoding: gzip, deflate, br
accept-language: it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7
content-length: 192
content-type: application/json; charset=UTF-8
cookie: __cfduid=dc7dd48ccff40ee4f85840bfc35685b311531384150; PersonalizationMap=; PersonalizationGallery=SelectedCountryID=12; GtTransLang=ITA; SLINGSHOT=LanguageCode=it-IT; SessionId=1ac0ec84-6a03-4965-ba90-7eb686f66bf5; ASP.NET_SessionId=rgia1pblms2abf11ypsbiqgz; GtTrans=ENU; LastSearch=SiteRegionID=12&TransactionTypeUID=260&RegionID=12&RegionRowID=78&LocationText=Porto&LocationValue=YR78&PriceCurrency=EUR&ComRes=2; PersonalizationRegion=#mode=list&tt=260&cr=2&r=78&cur=EUR&la=All&sb=PriceIncreasing&page=1&sc=12&sid=a81a1d1d-ee36-4236-a72e-31343349c574; PersonalizationDate=2018-7-24 10:0:30
origin: https://www.remax.pt
referer: https://www.remax.pt/PublicListingList.aspx
user-agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
x-requested-with: XMLHttpRequest
REQUEST PAYLOAD:
{"SiteRegionID":"12","RegionID":"12","RegionRowID":"78","ProvinceID":"0","LanguageCode":"ITA","MinInternetCount":"0","SearchType":"","OfficeAgent":0,"EncodingLanguage":"PTG","OfficeAgentId":0}

How do I make a request using dojo.xhrGet() with Basic Authentication?

I'm trying to make a dojo xhrGet request using basic authentication but I keep getting 403 forbidden error. I can make the request with curl from the command line, so I know my credentials are valid. When I check the request headers, the Authorization: Basic header isn't even being set. What am I doing wrong:
var lookupArgs = {
url: "https://myendpoint.com/myapi/endpoint",
user:"myemail#myendpoint.com",
password:"mypassword",
handleAs: "text",
load: function(data) {
content_node.innerHTML = data;
},
error: function(error) {
content_node.innerHTML = error;
}
}
dojo.xhrGet(lookupArgs);
Request Header:
GET /myapi/endpoint HTTP/1.1
Host: myendpoint.com
Connection: keep-alive
Origin: http://my-origin
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3315.3 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: */*
Referer: http://my-origin
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9

Why can I not see my HTTP headers in Chrome?

I am performing a fetch:
fetch(url, fetchOptions);
fetchOptions is configured like so:
var fetchOptions = {
method: options.method,
headers: getHeaders(),
mode: 'no-cors',
cache: 'no-cache',
};
function getHeaders() {
var headers = new Headers(); // Headers is part of the fetch API.
headers.append('User-ID', 'foo');
return headers;
}
Checking fetchOptions at runtime it looks as follows:
fetchOptions.headers.keys().next() // Object {done: false, value: "user-id"}
fetchOptions.headers.values().next() // Object {done: false, value: "foo"}
But user-id is nowhere to be found in the request headers per Chrome dev tools:
GET /whatever?a=long_name&searchTerm=g HTTP/1.1
Host: host:8787
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
accept: application/json
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36
Referer: http://localhost:23900/
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8,fr;q=0.6
Why can I not see my "User-ID" header in Chrome dev tools, and why does the header key appear to have been lowercased?
Incase someone else has a similar problem there were two possible culprits for this:
I might not have been starting Chrome with the correct flags.
The following didn't work when run from a Windows shortcut:
"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -enable-precise-memory-info -disable-web-security
The following did work when run from a Windows command prompt:
C:\whatever\48.0.2564.82\application\chrome.exe --disable-web-security --user-data-dir=C:\whatever\tmp\chrome
The addition of no-cors to the mode of the request might have caused an OPTIONS request to precede the GET request and the server did not support OPTIONS.

Categories

Resources