Scraping web data - JavaScript

I'd like to scrape the data from this website: https://en.macromicro.me/charts/773/baltic-dry-index, which comprises 4 data sets.
I've discovered that the website uses JavaScript to send a request to https://en.macromicro.me/charts/data/773 to get the data, but for some reason I just can't get the data using Postman or my script. I keep getting this result: {'success': 0, 'data': [], 'msg': 'error #240'}
Did I miss anything here?
Here is my code:
import requests
import json
import datetime
import pandas as pd

url = 'https://en.macromicro.me/charts/data/773'
header = {
    'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Docref': 'https://www.google.com/',
    'X-Requested-With': 'XMLHttpRequest',
    'sec-ch-ua-mobile': '?0',
    'Authorization': 'Bearer ee1c7b87258a902bde1129df2b64abac',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}
r = requests.get(url, headers=header)
response = json.loads(r.text)
response

You are missing the Cookie header in your request.
Refresh the page to get fresh cookies, then include them in the request.
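A minimal sketch of that fix, assuming the bearer token from your snippet is still valid (it is read from the page source and may rotate): a requests.Session first loads the chart page, which populates its cookie jar with whatever the server sets, and then reuses those cookies on the data endpoint.

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
})

# Load the chart page first so the session's cookie jar is populated.
session.get('https://en.macromicro.me/charts/773/baltic-dry-index')

# Reuse the same session (and its cookies) for the data endpoint.
r = session.get(
    'https://en.macromicro.me/charts/data/773',
    headers={
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'Authorization': 'Bearer ee1c7b87258a902bde1129df2b64abac',  # assumption: may expire; re-read it from the page source if so
        'Referer': 'https://en.macromicro.me/charts/773/baltic-dry-index',
    },
)
print(r.json())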

Related

How to web scrape the "key statistics" from morningstar.com into Python and pandas

So the link to the data I want to scrape is this one:
https://www.morningstar.com/stocks/xidx/smsm/valuation
and the data I want is the valuation table shown below.
[screenshot of the Morningstar valuation table]
Please help me :(
I would like to have the table in my Jupyter notebook so I can use pandas and Python to do my stock and investing analysis.
The page loads the data through an API endpoint that requires an API key when accessed directly, but you can simulate the request the browser sends to get the same data in Python.
Use the network inspector to find the exact request, copy it as cURL, and convert it to Python with an online tool.
Here is the converted request. You can get the results as JSON with response.json().
import requests

headers = {
    'authority': 'api-global.morningstar.com',
    'accept': '*/*',
    'accept-language': 'en-US,en;q=0.9',
    'apikey': 'lstzFDEOhfFNMLikKa0am9mgEKLBl49T',
    'origin': 'https://www.morningstar.com',
    'referer': 'https://www.morningstar.com/stocks/xidx/smsm/valuation',
    'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'x-api-realtime-e': 'eyJlbmMiOiJBMTI4R0NNIiwiYWxnIjoiUlNBLU9BRVAifQ.X-h4zn65XpjG8cZnL3e6hj8LMbzupQBglHZce7tzu-c4utCtXQ2IYoLxdik04usYRhNo74AS_2crdjLnBc_J0lFEdAPzb_OBE7HjwfRaYeNhfXIDw74QCrFGqQ5n7AtllL-vTGnqmI1S9WJhSwnIBe_yRxuXGGbIttizI5FItYY.bB3WkiuoS1xzw78w.iTqTFVbxKo4NQQsNNlbkF4tg4GCfgqdRdQXN8zQU3QYhbHc-XDusH1jFii3-_-AIsqpHaP7ilG9aBxzoK7KPPfK3apcoMS6fDM3QLRSZzjkBoxWK75FtrQMAN5-LecdJk97xaXEciS0QqqBqNugoSPwoiZMazHX3rr7L5jPM-ecXN2uEjbSR0wfg-57iHAku8jvThz4mtGpMRAOil9iZaL6iRQ.o6tR6kuOQBhnpcsdTQeZWw',
    'x-api-requestid': 'cdbb5a73-9654-4b31-a845-32844eb44ca8',
    'x-sal-contenttype': 'e7FDDltrTy+tA2HnLovvGL0LFMwT+KkEptGju5wXVTU=',
}

params = {
    'languageId': 'en',
    'locale': 'en',
    'clientId': 'MDC',
    'component': 'sal-components-valuation',
    'version': '3.79.0',
}

response = requests.get(
    'https://api-global.morningstar.com/sal-service/v1/stock/valuation/v3/0P0000BPTU',
    params=params,
    headers=headers,
)
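From there the payload can go straight into pandas for your notebook; a minimal sketch, assuming you still need to flatten whatever nesting the endpoint returns (the exact keys depend on the response you get back):

import pandas as pd

data = response.json()  # parse the JSON body returned by the request above
# Flatten the nested payload into a table; adjust to the actual keys you see.
df = pd.json_normalize(data)
print(df.head())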

How to call a Javascript function from Python requests.get

I am trying to call a JavaScript function using the Python requests lib, so I can be redirected to another URL containing a PDF file that I will then download.
The steps are:
requests.get(url) from a session;
call the JavaScript function embedded in the current web page;
download the binary content to a PDF file.
Is it possible to do this using only the requests lib?
What I have so far - which throws an exception:
import os
import shutil
import requests

main_dir = os.path.dirname(__file__)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36'}

def download_pdf(url):
    with requests.Session() as s:
        response = s.get(url, verify=False, stream=True, headers=headers)
        if response.status_code == 200:
            s.get('javascript:submitForm("pdf")')  # this is the line that raises
            with response as content:
                with open(os.path.join(main_dir, 'nfe.pdf'), 'wb') as f:
                    shutil.copyfileobj(content.raw, f)
            return True
        return False
Exception thrown:
requests.exceptions.InvalidSchema: No connection adapters were found for 'javascript:submitForm("pdf")'
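requests only speaks HTTP(S), so a javascript: URL has no connection adapter and the function cannot be executed outside a browser. What usually works instead is reproducing the HTTP request that submitForm("pdf") fires. A minimal sketch of that idea, assuming the page contains an HTML form the function fills in and submits; the form lookup and the 'tipo' field are hypothetical, so inspect the real page (or the network tab) for the actual action URL and fields:

import os
import requests
from bs4 import BeautifulSoup

main_dir = os.path.dirname(__file__)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36'}

def download_pdf(url):
    with requests.Session() as s:
        page = s.get(url, verify=False, headers=headers)
        soup = BeautifulSoup(page.text, 'html.parser')
        form = soup.find('form')  # assumption: the form submitForm("pdf") submits
        # Carry over the form's existing fields, then add what the JS would set.
        fields = {i['name']: i.get('value', '') for i in form.find_all('input') if i.get('name')}
        fields['tipo'] = 'pdf'  # hypothetical field mimicking submitForm("pdf")
        action = requests.compat.urljoin(url, form.get('action', ''))
        pdf = s.post(action, data=fields, headers=headers)
        with open(os.path.join(main_dir, 'nfe.pdf'), 'wb') as f:
            f.write(pdf.content)
        return pdf.ok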

Python3 - grabbing data from dynamically refreshed table in webpage

I am pulling some data from this web page using the back-end call below in Python. By playing around with the &items parameter I have managed to get all the channels on the page returned, but I cannot seem to capture the request in the console, or see any relevant params that would change which times of day the EPG data is populated for:
import requests

session = requests.Session()
url = 'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lang=en&lu=36625D&tz=America%2FToronto&items=3100&st=&wd=940&nc=0&div=tvmds_frames&si=0'
headers = {
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
r = session.get(url, headers=headers)
status_code2 = r.status_code
data2 = r.content.decode('utf-8', errors='ignore')

filename = 'C:\\Users\\myuser\\mypath\\test_file.txt'
with open(filename, "a", encoding='utf-8', errors="ignore") as text_file:
    text_file.write(data2)  # the with block closes the file automatically
...all I get is an error in Google Chrome Dev Tools saying [Violation] Forced reflow while executing JavaScript.
Can anyone assist? I need to be able to grab program data for a full 24-hour period and across different days...
Thanks

403 Forbidden when using Cheerio

I'm trying to web scrape a website so I can gather some information for a project. Here is my code; it returns 403 in the console. I'm using request and cheerio to do this - why is this happening? Note that I do know what the majority of status codes mean.
const request = require('request');
const cheerio = require('cheerio');

request('http://www.realmeye.com/forum/', function(err, resp, html) {
    if (!err) {
        const gatherInformation = cheerio.load(html);
        console.log(html);
    }
});
You should add a "User-Agent" header to the request that matches some browser (e.g. Chrome). The server probably checks it to turn away unfamiliar clients.
A rule of thumb for web scraping:
Use Chrome dev tools / Fiddler / another similar tool to inspect the request your client (Chrome, Firefox, etc.) fires before trying to reproduce it in your framework (inspect headers, cookies, etc.).
The raw request I saw in Fiddler in your case (when hitting your URL in Chrome):
GET /forum/ HTTP/1.1
Host: www.realmeye.com
Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36
Sec-Fetch-Mode: same-origin
Sec-Fetch-Site: same-origin
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,he;q=0.8
Most servers check the "Accept" and "User-Agent" headers before returning a 200 OK response.
The fixed code snippet:
const request = require('request');
const cheerio = require('cheerio');

let options = {
    url: 'https://www.realmeye.com/forum/',
    headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
    }
};

request(options, function(err, resp, html) {
    if (!err) {
        const gatherInformation = cheerio.load(html);
        console.log(html);
    }
});
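(A side note: the request package has since been deprecated; the same User-Agent fix carries over unchanged to current HTTP clients such as axios or node-fetch.)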

Outlook login script in python3: Javascript not enabled - cannot login

I am currently trying to log into an Outlook account using only requests in Python. I did the same thing with Selenium before, but because of better performance and easier proxy support I would like to use requests now. The problem is that whenever I send a POST request to the Outlook post URL, it returns a page saying that it will not function without JavaScript.
I read on here that I would have to make the requests that JavaScript would make myself, so I used the network analysis tool in Firefox and made requests to all the URLs the browser requested. It still returns the same page, saying that JS is not enabled and it will not function without it.
import json
import requests
from time import sleep

s = requests.Session()
s.headers.update({
    "Accept": "application/json",
    "Accept-Language": "en-US",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Host": "www.login.live.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
})

url = "https://login.live.com/"
r = s.get(url, allow_redirects=True)

# Simulate JS
sim0 = s.get("https://logincdn.msauth.net/16.000.28299.9
# ...requests 1-8 here
sim9 = s.get("https://logincdn.msauth.net/16.000.28299.9/images/Backgrounds/0.jpg?x=a5dbd4393ff6a725c7e62b61df7e72f0", allow_redirects=True)

post_url = "https://login.live.com/ppsecure/post.srf?contextid=EFE1326315AF30F4&bk=1567009199&uaid=880f460b700f4da9b692953f54786e1c&pid=0"
payload = {"username": username}  # username defined elsewhere
sleep(4)

s.cookies.update({"logonLatency": "LGN01=637025992217600244"})
s.cookies.update({"MSCC": "1567001943"})
s.cookies.update({"CkTst": "G1567004031095"})

print(s.cookies.get_dict())
print("---------------------------")
print(json.dumps(payload))
print("---------------------------")

rp = s.post(post_url, data=json.dumps(payload), allow_redirects=True)
rg = s.get("https://logincdn.msauth.net/16.000.28299.9/images/arrow_left.svg?x=a9cc2824ef3517b6c4160dcf8ff7d410")
print(rp.text)
print(rp.status_code)
If anyone could point me in the right direction on how to fix this, I would highly appreciate it! Thanks in advance.
