I am trying to call a JavaScript function using the Python requests lib, so I can be redirected to another URL containing a PDF file that I will then download.
Steps are:
requests.get(url) from a session;
call the JavaScript function embedded in the current web page;
download the binary content to a PDF file.
Is it possible to do it using only requests lib?
What I have so far - which throws an Exception:
import requests
import shutil
import os
main_dir = os.path.dirname(__file__)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36'}
def download_pdf(url):  # wrapped in a function so the return statements below are valid
    with requests.Session() as s:
        response = s.get(url, verify=False, stream=True, headers=headers)
        if response.status_code == 200:
            s.get('javascript:submitForm("pdf")')
            with response as content:
                with open(os.path.join(main_dir, 'nfe.pdf'), 'wb') as f:
                    shutil.copyfileobj(content.raw, f)
            return True
        return False
Exception thrown:
requests.exceptions.InvalidSchema: No connection adapters were found for 'javascript:submitForm("pdf")'
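A javascript: URL has no connection adapter in requests (hence the exception), so that call can never work; the usual approach is to replicate whatever HTTP request submitForm("pdf") actually triggers. Below is a rough sketch only, assuming the function simply submits a <form> on the page; the form lookup and the 'tipo' field are assumptions, so take the real action URL and field names from the page source or the browser's network tab:
import os
import shutil
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # same UA string as above
main_dir = os.path.dirname(__file__)

def download_pdf_via_form(url):
    with requests.Session() as s:
        page = s.get(url, verify=False, headers=headers)
        soup = BeautifulSoup(page.text, 'html.parser')
        form = soup.find('form')  # the form that submitForm() would submit
        action = requests.compat.urljoin(url, form['action'])
        # echo back every named input field, then add whatever submitForm("pdf") sets
        fields = {i['name']: i.get('value', '') for i in form.find_all('input') if i.get('name')}
        fields['tipo'] = 'pdf'  # hypothetical field name
        pdf = s.post(action, data=fields, headers=headers, stream=True, verify=False)
        with open(os.path.join(main_dir, 'nfe.pdf'), 'wb') as f:
            shutil.copyfileobj(pdf.raw, f)
        return pdf.ok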
Related
I'd like to scrape the data from a website: https://en.macromicro.me/charts/773/baltic-dry-index, which comprises 4 data sets.
I've discovered that the website uses JavaScript to send a request to https://en.macromicro.me/charts/data/773 to get the data, but for some reason I just can't get the data using Postman or my script. I keep getting the result: {'success': 0, 'data': [], 'msg': 'error #240'}
Did I miss anything here?
here is my code:
import requests
import json
import datetime
import pandas as pd
url = 'https://en.macromicro.me/charts/data/773'
header = {
    'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Docref': 'https://www.google.com/',
    'X-Requested-With': 'XMLHttpRequest',
    'sec-ch-ua-mobile': '?0',
    'Authorization': 'Bearer ee1c7b87258a902bde1129df2b64abac',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}
r = requests.get(url,headers = header)
response = json.loads(r.text)
response
You are missing the Cookie header. Refresh the page in the browser to get fresh cookies and include them in your request headers.
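A minimal sketch of that idea: open the chart page first in a requests.Session so the session collects the cookies, then call the data endpoint (the Bearer token below is the one from the question and may have expired):
import requests

s = requests.Session()
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://en.macromicro.me/charts/773/baltic-dry-index',
    'Authorization': 'Bearer ee1c7b87258a902bde1129df2b64abac',
})

# visiting the chart page first lets the session pick up the cookies
# that the data endpoint checks
s.get('https://en.macromicro.me/charts/773/baltic-dry-index')

r = s.get('https://en.macromicro.me/charts/data/773')
print(r.json())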
I am pulling some data from this web page using the back-end call below in Python. I have managed to get all the channels on the page returned by playing around with the &items parameter; however, I cannot seem to grab the request in the console, or see any relevant params that will change the times of day for which EPG data is populated:
import requests
session = requests.Session()
url = 'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lang=en&lu=36625D&tz=America%2FToronto&items=3100&st=&wd=940&nc=0&div=tvmds_frames&si=0'
headers = {
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
r = session.get(url, headers=headers)
status_code2 = r.status_code
data2 = r.content.decode('utf-8', errors='ignore')
filename = 'C:\\Users\\myuser\\mypath\\test_file.txt'
with open(filename, "a", encoding='utf-8', errors="ignore") as text_file:
    text_file.write(data2)  # the with-block closes the file, so no explicit close() is needed
...all I get is an error in Google Chrome Dev Tools saying [Violation] Forced reflow while executing JavaScript.
Can anyone assist? I need to be able to grab program data from a full 24 hour period and across different days...
Thanks
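One way to experiment is to split the query string into a params dict so individual values can be varied between calls; whether any of these parameters (the empty st value, for example) controls the time window is an assumption to verify in the network tab, not something the snippet below settles:
import requests

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
base_url = 'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php'
params = {
    'subid': 'postmedia',
    'lang': 'en',
    'lu': '36625D',
    'tz': 'America/Toronto',
    'items': '3100',
    'st': '',            # empty in the original URL; possibly a start-time filter (unverified)
    'wd': '940',
    'nc': '0',
    'div': 'tvmds_frames',
    'si': '0',
}
r = session.get(base_url, params=params, headers=headers)
print(r.status_code)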
I want to scrape the contents of LinkedIn using requests and bs4, but I'm facing a problem with the JavaScript that loads the page after I sign in (I don't get the home page directly). I don't want to use Selenium.
Here is my code:
import requests
from bs4 import BeautifulSoup
class Linkedin():
    def __init__(self, url):
        self.url = url
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) "
                                     "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}

    def saveRsulteToHtmlFile(self, nameOfFile=None):
        if nameOfFile is None:
            nameOfFile = "Linkedin_page"
        with open(nameOfFile + ".html", "wb") as file:
            file.write(self.response.content)

    def getSingInPage(self):
        self.sess = requests.Session()
        self.response = self.sess.get(self.url, headers=self.header)
        soup = BeautifulSoup(self.response.content, "html.parser")
        self.csrf = soup.find(attrs={"name": "loginCsrfParam"})["value"]

    def connecteToMyLinkdin(self):
        self.form_data = {"session_key": "myemail#mail.com",
                          "loginCsrfParam": self.csrf,
                          "session_password": "mypassword"}
        self.url = "https://www.linkedin.com/uas/login-submit"
        self.response = self.sess.post(self.url, headers=self.header, data=self.form_data)

    def getAnyPage(self, url):
        self.response = self.sess.get(url, headers=self.header)


url = "https://www.linkedin.com/"
likedin_page = Linkedin(url)
likedin_page.getSingInPage()
likedin_page.connecteToMyLinkdin()  # I'm connected but the JavaScript is still loading
likedin_page.getAnyPage("https://www.linkedin.com/jobs/")
likedin_page.saveRsulteToHtmlFile()
I want help getting past the JavaScript loading without using Selenium...
Although it's technically possible to simulate all the calls from Python, for a dynamic page like LinkedIn I think it will be quite tedious and brittle.
Anyway, you'd open the "developer tools" in your browser before you open LinkedIn and see what the traffic looks like. You can filter for the requests made from JavaScript (in Firefox, the filter is called XHR).
You would then simulate the necessary/interesting requests in your code. The benefit is that the servers usually return structured data to JavaScript, such as JSON, so you won't need to do as much HTML parsing.
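For instance, a sketch of replaying one such XHR call in the same logged-in session (the endpoint and headers are placeholders; copy the real ones from the network tab):
import requests

sess = requests.Session()
# ...log in with sess first so it carries the auth cookies...

xhr_url = "https://www.linkedin.com/some-xhr-endpoint"  # placeholder, copy the real URL from the network tab
xhr_headers = {
    "accept": "application/json",
    "x-requested-with": "XMLHttpRequest",
}
r = sess.get(xhr_url, headers=xhr_headers)
data = r.json()  # structured data instead of HTML to parse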
If you find you're not progressing very much this way (it really depends on the particular site), then you will probably have to use Selenium or an alternative such as:
https://robotframework.org/
https://miyakogi.github.io/pyppeteer/ (port of Puppeteer to Python)
You should send all the XHR and JS requests manually [in the same session which you created during login]. Also, pass all the fields in the request headers (copy them from the network tools).
self.header_static = {
    'authority': 'static-exp2.licdn.com',
    'method': 'GET',
    'path': '/sc/h/c356usw7zystbud7v7l42pz0s',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'referer': 'https://www.linkedin.com/jobs/',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Mobile Safari/537.36'
}
def postConnectionRequests(self):
    urls = [
        "https://static-exp2.licdn.com/sc/h/62mb7ab7wm02esbh500ajmfuz",
        "https://static-exp2.licdn.com/sc/h/mpxhij2j03tw91bpplja3u9b",
        "https://static-exp2.licdn.com/sc/h/3nq91cp2wacq39jch2hz5p64y",
        "https://static-exp2.licdn.com/sc/h/emyc3b18e3q2ntnbncaha2qtp",
        "https://static-exp2.licdn.com/sc/h/9b0v30pbbvyf3rt7sbtiasuto",
        "https://static-exp2.licdn.com/sc/h/4ntg5zu4sqpdyaz1he02c441c",
        "https://static-exp2.licdn.com/sc/h/94cc69wyd1gxdiytujk4d5zm6",
        "https://static-exp2.licdn.com/sc/h/ck48xrmh3ctwna0w2y1hos0ln",
        "https://static-exp2.licdn.com/sc/h/c356usw7zystbud7v7l42pz0s",
    ]
    for url in urls:
        self.sess.get(url, headers=self.header_static)
        print("REQUEST SENT TO " + url)
I called the postConnectionRequests() function after connecting and before saving the HTML content, and received the complete page.
Hope this helps.
XHR requests are sent by JavaScript, and Python will not run JavaScript code when it fetches a page using requests and BeautifulSoup. Tools like Selenium load the page and run the JavaScript. You can also use headless browsers.
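For completeness, a minimal headless-Chrome sketch with Selenium (the URL is just a placeholder; the point is that the browser executes the page's JavaScript before you read the HTML):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
driver.get('https://www.linkedin.com/jobs/')  # placeholder URL
html = driver.page_source  # HTML after the JavaScript has run
driver.quit()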
I am currently trying to log into an Outlook account using requests only in Python. I did the same thing with Selenium before, but because of better performance and the ability to use proxies more easily, I would now like to use requests. The problem is that whenever I send a POST request to the Outlook post URL, it returns a page saying that it will not function without JavaScript.
I read on here that I would have to make the requests that the JavaScript would make, so I used the network analysis tool in Firefox and made requests to all the URLs that the browser made requests to. It still returns the same page, saying that JS is not enabled and it will not function without JS.
import requests
import json
from time import sleep

s = requests.Session()
s.headers.update({
    "Accept": "application/json",
    "Accept-Language": "en-US",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Host": "www.login.live.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
})
url = "https://login.live.com/"
r = s.get(url, allow_redirects=True)
#Simulate JS
sim0 = s.get("https://logincdn.msauth.net/16.000.28299.9
#...requests 1-8 here
sim9 = s.get("https://logincdn.msauth.net/16.000.28299.9/images/Backgrounds/0.jpg?x=a5dbd4393ff6a725c7e62b61df7e72f0", allow_redirects=True)
post_url = "https://login.live.com/ppsecure/post.srf?contextid=EFE1326315AF30F4&bk=1567009199&uaid=880f460b700f4da9b692953f54786e1c&pid=0"
payload = {"username": username}
sleep(4)
s.cookies.update({"logonLatency": "LGN01=637025992217600244"})
s.cookies.update({"MSCC": "1567001943"})
s.cookies.update({"CkTst": "G1567004031095"})
print(s.cookies.get_dict())
print ("---------------------------")
print(json.dumps(payload))
print ("---------------------------")
rp = s.post(post_url, data=json.dumps(payload), allow_redirects=True)
rg = s.get("https://logincdn.msauth.net/16.000.28299.9/images/arrow_left.svg?x=a9cc2824ef3517b6c4160dcf8ff7d410")
print (rp.text)
print (rp.status_code)
If anyone could point me in the right direction on how to fix this, I would highly appreciate it! Thanks in advance.
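For what it's worth, the usual pattern for a login form like this is to parse the hidden fields out of the login page and echo them back in the POST; the sketch below only illustrates that pattern, and the field names are hypothetical, so take the real ones from the page's HTML or the network tab:
import requests
from bs4 import BeautifulSoup

s = requests.Session()
login_page = s.get("https://login.live.com/")

# collect every hidden <input> the form expects to be posted back
soup = BeautifulSoup(login_page.text, "html.parser")
fields = {i["name"]: i.get("value", "")
          for i in soup.find_all("input", type="hidden") if i.get("name")}

fields["login"] = "user@example.com"  # hypothetical field name for the username
r = s.post("https://login.live.com/ppsecure/post.srf", data=fields)
print(r.status_code)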
I'd like to grab the table content from this page. The following is my code, and I got NaN (without the data). How come the numbers are not showing up? How do I grab the table with the corresponding data? Thanks.
You can get a nice JSON response from the API:
import requests
import pandas as pd
url = 'https://api.blockchain.info/stats'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
params = {'cors': 'true'}
data = requests.get(url, headers=headers, params=params).json()
# if you want it as a table
df = pd.DataFrame(data.items())
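If you'd rather have named columns (purely cosmetic, assuming the two-column key/value layout above):
df = pd.DataFrame(list(data.items()), columns=['metric', 'value'])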
Option 2:
Let the page fully render. There is a better way to use waits with Selenium, but I just quickly threw a 5-second wait in there to show:
from selenium import webdriver
import pandas as pd
import time
url = 'https://www.blockchain.com/stats'
browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
browser.get(url)
time.sleep(5)
dfs = pd.read_html(browser.page_source)
print(dfs[0])
browser.close()
Output:
0 1 2 3
0 Blocks Mined 150 150 NaN
1 Time Between Blocks 9.05 minutes 9.05 minutes NaN
2 Bitcoins Mined 1,875.00000000 BTC 1,875.00000000 BTC NaN
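Following up on the "better way to wait" remark, here is a sketch using an explicit wait instead of a fixed sleep; it assumes the stats are rendered into a <table> element, so adjust the locator if the page structure differs:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = 'https://www.blockchain.com/stats'
browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
browser.get(url)

# wait up to 15 seconds until at least one table is present, instead of sleeping blindly
WebDriverWait(browser, 15).until(
    EC.presence_of_element_located((By.TAG_NAME, 'table'))
)

dfs = pd.read_html(browser.page_source)
print(dfs[0])
browser.close()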