Scraping an AJAX web page using python and requests - javascript

I tried to scrape this page using BeautifulSoup's find method, but I could not find the table values in the HTML. It turns out the website generates the data through an internal API when the page loads. Any help?
Thanks in advance.

This works for me. I had to dig around in the browser dev tools to find the API request the page makes:
import requests
# Page that sets the session cookies, and the internal API the page calls
geturl = r'https://www.barchart.com/futures/quotes/CLJ19/all-futures'
apiurl = r'https://www.barchart.com/proxies/core-api/v1/quotes/get'
# Browser-like headers for the initial page load
getheaders = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
getpay = {
    'page': 'all'
}
# Load the page first so the session stores its cookies, including XSRF-TOKEN
s = requests.Session()
r = s.get(geturl, params=getpay, headers=getheaders)
# The API call sends the XSRF token back in the x-xsrf-token header
headers = {
    'accept': 'application/json',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://www.barchart.com/futures/quotes/CLJ19/all-futures?page=all',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'x-xsrf-token': s.cookies.get_dict()['XSRF-TOKEN']
}
# Query parameters copied from the request seen in the dev tools
payload = {
    'fields': 'symbol,contractSymbol,lastPrice,priceChange,openPrice,highPrice,lowPrice,previousPrice,volume,openInterest,tradeTime,symbolCode,symbolType,hasOptions',
    'list': 'futures.contractInRoot',
    'root': 'CL',
    'meta': 'field.shortName,field.type,field.description',
    'hasOptions': 'true',
    'raw': '1'
}
r = s.get(apiurl, params=payload, headers=headers)
j = r.json()
print(j)
>{'count': 108, 'total': 108, 'data': [{'symbol': 'CLY00', 'contractSymbol': 'CLY00 (Cash)', ........
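Once the JSON is in hand, the rows sit under the 'data' key shown in the output above. Here is a minimal sketch of flattening them into CSV text, assuming each row is a flat dict keyed by the requested field names. The helper name rows_to_csv and the sample values are made up for illustration; the real response may carry more structure per row.

```python
import csv
import io

def rows_to_csv(response_json, fields):
    """Render the 'data' rows of an API response as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator='\n')
    writer.writeheader()
    for row in response_json.get('data', []):
        # Keep only the requested fields; missing ones become empty cells
        writer.writerow({f: row.get(f, '') for f in fields})
    return buf.getvalue()

# Hypothetical sample shaped like the output above (the price is invented)
sample = {
    'count': 1,
    'total': 1,
    'data': [{'symbol': 'CLY00', 'contractSymbol': 'CLY00 (Cash)', 'lastPrice': 0.0}],
}
csv_text = rows_to_csv(sample, ['symbol', 'contractSymbol', 'lastPrice'])
print(csv_text)
```

With the real response you would pass j and the list of names from the 'fields' parameter instead of the sample.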

Related

Post request header: need fingerprint and re_token to pass it

So I'm trying to log in to anghami.com using a POST request. I am able to log in if I pass all of the required values.
My issue is that I'm not sure where to find the two missing values: "re_token" (I guess it's the reCAPTCHA token from Google) and "fingerprint".
Here is my code for a working test:
import requests
from requests.auth import HTTPProxyAuth
headers = {
'authority': '',
'accept': 'application/json, text/plain, */*',
'accept-language': 'en-US,en;q=0.9',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
'origin': 'https://play.anghami.com',
'referer': 'https://play.anghami.com/login',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-site',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
link = 'https://api.anghami.com/gateway.php'
s = requests.Session()
proxies = {
"http": "http://",
}
data = {
'm': 'an',
'u': 'email',
'p': 'password',
'devicename': 'Chrome 104',
're_token':'03ANYolqvEDNfBhyM-qJ77j92_vkw8yt1VtKKc9e6jZFl9mG4sysFOvVZ0LlQWsMecFRWRCMFGG8KAgdWw1S0kUPB-1yW5kfJ8B2XGLnlaW7XAReGvyYpB2WgZeGXPdxlTi0PINbN2Ga9wI2ecF9jltpf7gcUj9MLucb9KDaUYENySmFq2ts5qh9g_2nr6AXx_igsD53xvWPGrGi_n7evy224P7A0NitmjcXKlAKL_rlkkXqbwOd4qbzF_IkTKX6iSNLfb2FFso8S75OKa0dlbfLO_7eY2zU7VzVKa23XWet3RXDED7q8Rx8RKFaO9n_lvbG-PORGCpmajnbWtWoIhEZpY06mt41vx4AoW0JnCtV9Z3v5AsAoM_SIZNawTLVBKyI3iVk9AbsGskh5DZ0DzIQ2Hp_2325fuyhjp2gjW_yUud7DuGVZ9Zn7WjteVnE0Yv4ZQoWx5Z2Hz-s7Qy7G2Acm6WLbuIvS_5JsJsfLYh_hiB_DY79UyKHNpeQtulqS1wMGwHqDFbmfv',
'ngsw-bypass': 'true',
'type': 'authenticate',
'language': 'en',
'lang': 'en',
'web2': 'true',
'fingerprint': 'eyJmcCI6ImE2MzcxYTRjLTU1ODEtNDE2My1iMWRkLTA3NjBkMmI5OWZlYSIsImgiOiI0OTIxZDM1OCJ9',
'angh_type': 'authenticate',
}
auth = HTTPProxyAuth(proxyacc)
s.proxies = proxies
s.auth = auth
ext_ip = s.post(link, data=data, headers=headers)
print (ext_ip.text)
print (ext_ip.url)
How can I create the 'fingerprint' value, and how do I get 're_token'?
So:
re_token -> the reCAPTCHA token (the captcha answer)
fingerprint -> a base64-encoded string containing a fingerprint UID and an unknown 'h' parameter.
For more precise information you would need to debug the code in vendor.67cc4b67b66a6114.js.
As for the reCAPTCHA, you need to work out which type of captcha the site uses and then find the site key and any additional parameters it needs.
To solve it, you can use any ready-made service such as anti-captcha.
By the way, it's a bad idea to set the authority header yourself: it is added automatically and exists only in HTTP/2, so sending it by hand makes the request stand out even more.
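The fingerprint value posted in the question can be inspected directly, since it is plain base64 around JSON. Below is a small sketch using only the standard library; the helper name decode_fingerprint is made up here, and how the site computes the two fields remains unknown.

```python
import base64
import json

def decode_fingerprint(value):
    """Base64-decode a fingerprint string and parse the JSON inside it."""
    padded = value + '=' * (-len(value) % 4)  # restore padding if it was stripped
    return json.loads(base64.b64decode(padded))

# The fingerprint value from the question
fingerprint = 'eyJmcCI6ImE2MzcxYTRjLTU1ODEtNDE2My1iMWRkLTA3NjBkMmI5OWZlYSIsImgiOiI0OTIxZDM1OCJ9'
decoded = decode_fingerprint(fingerprint)
print(decoded)
```

The result has two keys: 'fp', which looks like a UUID, and 'h', a short hex string whose meaning would have to be traced in the site's JavaScript.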

UrlFetchApp fetch how to retrieve JSESSIONID object for the cookies?

I'm trying to use UrlFetchApp to send a request to a page. The request goes through, but the response contains
<p>Your browser is currently set to block cookies. Please enable cookies in your browser preferences and try again.</p>
in the HTML body.
Here is the code:
const res1 = UrlFetchApp.fetch('https://url.com', {
method: 'POST',
headers: {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Language': 'en-US,en;q=0.9',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded',
'Cookie': 'JSESSIONID=Dfaefaefaesfsaefgr',
'Origin': 'https://url.com',
'Referer': 'https://url.com',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"'
},
payload: 'payloaddata'});
I don't know how to enable cookies when using UrlFetchApp. Has anyone had the same issue?
Thanks for any suggestions.
What I have tried:
Separating the Cookie property from the headers into a JSON object:
cookies = {
'JSESSIONID': '778C494754356A23F080849C10F2A851',
'TS01ee6e39': '018c1954d58470c0adb8d6d0df850dba9363aa54c16fc4653b2178278f62a5f0dde96d5780b3338857c8ea3ff92c6e99b9bd3a5867cef327bd847ac05ab9242a7414fa0832',
'X-HR-ClientSessionId': '10_107.162.4.39_1651626597548',
'locale': 'en',
'TS0189a565': '018c1954d5ae31eb4b4d18a85d43414cdcd9158bcc6fc4653b2178278f62a5f0dde96d578016166d716abd84dc80b66117923674fd68b4dd222e33d15361b818320e9dbd03333b4682ebc552fb370cd462eb3e5d2b5cc5273e789e87f2bd772fa800fa9e77744d459169cf3d8594422a3d7ae7968ba33103e373fbcf2c83f38da92d9643e5f9ef6a925938338bef881e38e827bf0a5635126e7297731cc06d71fd1a883702',}
But this does not work; the request immediately returns "session expired. Please try again."
Edit on 05/07/2022:
I'm trying to get the JSESSIONID from the fetch response. I can find it in the browser, but not in the response returned by UrlFetchApp.

Why browsers do not send my explicit `Date` header?

Code example:
fetch('https://httpbin.org/get', {
'headers': {
'Date': (new Date()).toUTCString(),
}
})
Response:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Connection": "close",
"Host": "httpbin.org",
"Origin": "http://localhost:8000",
"Referer": "http://localhost:8000/",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36 OPR/58.0.3135.53"
},
"origin": "146.120.13.65",
"url": "https://httpbin.org/get"
}
Date is listed in the forbidden header names in the fetch spec.
These are forbidden so the user agent remains in full control over them.
Accept-Charset
Accept-Encoding
Access-Control-Request-Headers
Access-Control-Request-Method
Connection
Content-Length
Cookie
Cookie2
Date
DNT
Expect
Host
Keep-Alive
Origin
Referer
TE
Trailer
Transfer-Encoding
Upgrade
Via
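To make the effect concrete, here is a rough Python model of what the browser does with a headers object before sending. It uses the names listed above plus the Proxy- and Sec- prefixes, which the fetch spec also forbids. The function name split_headers and the sample headers are invented, and the current spec's list may differ slightly from the one quoted.

```python
# Forbidden header names from the list above; matching is case-insensitive
FORBIDDEN = frozenset(name.lower() for name in (
    'Accept-Charset', 'Accept-Encoding', 'Access-Control-Request-Headers',
    'Access-Control-Request-Method', 'Connection', 'Content-Length', 'Cookie',
    'Cookie2', 'Date', 'DNT', 'Expect', 'Host', 'Keep-Alive', 'Origin',
    'Referer', 'TE', 'Trailer', 'Transfer-Encoding', 'Upgrade', 'Via',
))
FORBIDDEN_PREFIXES = ('proxy-', 'sec-')

def split_headers(headers):
    """Split headers into (kept, dropped), mimicking a browser's silent filtering."""
    kept, dropped = {}, {}
    for name, value in headers.items():
        lowered = name.lower()
        is_forbidden = lowered in FORBIDDEN or lowered.startswith(FORBIDDEN_PREFIXES)
        (dropped if is_forbidden else kept)[name] = value
    return kept, dropped

kept, dropped = split_headers({
    'Date': 'Tue, 01 Jan 2019 00:00:00 GMT',  # forbidden, silently dropped
    'X-Request-Id': 'abc',                    # custom header, passed through
})
print(kept, dropped)
```

This is why the httpbin response in the question shows no Date header and no error: the browser discards the value without complaint.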

What's the difference between this Javascript request and Python request?

I wrote something in Python and am trying to figure out why the hell the seemingly equivalent code in JS isn't working.
Working Python -
Headers used:
self.session = requests.Session()
#Set headers
self.headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Connection': 'keep-alive',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.8',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
}
Code:
link = 'https://www.kith.com/cart'
data = [
('updates'+'['+'888074764295'+']', '1'),
('updates'+'['+'888463982599'+']', '0'),
]
click = self.session.post(link, headers= self.headers, data=data, verify = False)
Not working JS -
const secondaryVar = `updates[888463982599]`;
const desiredVariant = `updates[888074764295]`;
const checkoutForm = {};
checkoutForm[desiredVariant] = '1';
checkoutForm[secondaryVar] = '0';
//Post request to cart to update it with desired product
request({
url: 'https://www.kith.com/cart',
followAllRedirects: true,
method: 'post',
formData: checkoutForm,
headers : {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'en-US,en;q=0.9',
'Cache-Control':'max-age=0',
'Connection':'keep-alive',
'Upgrade-Insecure-Requests':'1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
},
},
function(err, res, body) {
I've narrowed it down to this bit of code, but as far as I can tell there is no significant difference between the code in Python and the JS code. My guess is it has something to do with the session or headers...but again I don't know.
Thanks for any responses
I think Python is not subject to CORS, which only browsers enforce, and that would explain the difference. I don't know which JavaScript framework you are using, but with jQuery the following works when run from the kith.com website.
To avoid any issues with CORS, I removed the headers that the browser sets automatically and changed the URL from www.kith.com to kith.com.
jQuery.ajax("https://kith.com/cart", {method: "post", headers: {
    'Accept': 'application/json',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
  },
  // Keys must be the actual form field names, not the JS variable names
  data: {"updates[888074764295]": 1, "updates[888463982599]": 0}
}).fail(function(err) { console.log("error" + err); })
  .done(function(res) { console.log(res); });
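One difference that may be worth checking, separate from CORS: requests sends data= as an application/x-www-form-urlencoded body, whereas the formData option of the Node request library produces multipart/form-data (its form option is the URL-encoded one). The sketch below shows the body the Python version puts on the wire, using the standard library's urlencode as a stand-in for what requests does internally. Whether the server actually rejects multipart bodies is an assumption that would need testing.

```python
from urllib.parse import urlencode

# Same field list as the working Python code in the question
data = [
    ('updates[888074764295]', '1'),
    ('updates[888463982599]', '0'),
]

# requests URL-encodes a list of tuples passed as data= in this way
body = urlencode(data)
print(body)
```

Comparing this string with the raw body sent by the JS version (for example through a request-echo service) should show quickly whether the two requests really are equivalent.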

Adding cookie to 'get' request to log in a website with Google Apps Script

I've been trying to make a simple script work for the past couple of days, without success so far.
Here is the problem:
Authentication on the website (as on many) consists of a first POST request, which redirects you with a 302 response, followed by a GET request that takes you to the home page, logged in.
So I'm trying to log in with a POST request, read the cookie from the response, and then add this cookie to my GET request.
Here is my code:
var headers = {
'Upgrade-Insecure-Requests' : '1',
'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Content-Type' : 'application/x-www-form-urlencoded',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding' : 'gzip, deflate, br',
'Connection' : 'keep-alive',
'Accept-Language' : 'fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4,it;q=0.2,es;q=0.2'
};
var payload = {
'ptl' : 'edt',
'codensa' : 'pblv',
'taiga_user' : 'user',
'taiga_mdp' : 'mdp',
'submit' : 'connexion'
};
var options = {
'method' : 'POST',
'headers': headers,
'payload' : payload,
'followRedirects' : false
};
var login = UrlFetchApp.fetch('https://etudiant.archi.fr/taiga/etd/pages/login.php', options);
var login_cookie = login.getAllHeaders()['Set-Cookie'].split(';')[0];
return login_cookie;
And then, my GET request :
var headers2 = {
'Upgrade-Insecure-Requests' : '1',
'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding' : 'gzip, deflate, br',
'Accept-Language' : 'fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4,it;q=0.2,es;q=0.2',
'Connection' : 'keep-alive',
'Cookie' : login_cookie,
};
var options2 = {
'method': 'GET',
'headers': headers2,
'followRedirects' : false
};
var index = UrlFetchApp.fetch('https://etudiant.archi.fr/taiga/etd/pages/index.php?im', options2);
And this doesn't work...
However, I suppose my script is more or less correct, since if I replace 'login_cookie' in headers2 with an actual cookie that I copy manually from my browser, it works!
Can anyone help me? I've tried everything... haha
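One thing that may be worth checking: getAllHeaders() returns an array rather than a string when a header such as Set-Cookie occurs more than once, and in that case taking a single name=value pair would not be enough. The logic for turning one or several Set-Cookie values into a single Cookie header is sketched here in Python rather than Apps Script. The function name cookie_header and the cookie names and values are invented, and whether this site sets several cookies is an assumption.

```python
def cookie_header(set_cookie):
    """Build a Cookie request header from one or more Set-Cookie values."""
    # Accept either a single string or a list of strings
    values = [set_cookie] if isinstance(set_cookie, str) else list(set_cookie)
    # Keep only the leading name=value pair; drop attributes such as Path or HttpOnly
    pairs = [value.split(';', 1)[0].strip() for value in values]
    return '; '.join(pairs)

header = cookie_header(['PHPSESSID=abc123; path=/', 'lang=fr; path=/; HttpOnly'])
print(header)
```

The same steps carry over to Apps Script: normalise the Set-Cookie value to an array, strip the attributes from each entry, and join the pairs with "; " before putting them in the Cookie header of the GET request.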
