Scraping: Extract data from chart - JavaScript

I found a way to download data from the chart page of the Wall Street Journal by looking at the network tab (in the dev tools panel) and reproducing the request that is created when the chart refreshes. It works as follows:
import requests
import json
import time
from datetime import datetime as dt
from urllib.parse import urlencode
data = {
    "Step": "PT5M",
    "TimeFrame": "D1",
    "StartDate": int(dt(2019, 5, 1).timestamp() * 1000),
    "EndDate": int(dt(2019, 5, 5).timestamp() * 1000),
    "EntitlementToken": "57494d5ed7ad44af85bc59a51dd87c90",
    "IncludeMockTick": True,
    "FilterNullSlots": False,
    "FilterClosedPoints": True,
    "IncludeClosedSlots": False,
    "IncludeOfficialClose": True,
    "InjectOpen": False,
    "ShowPreMarket": False,
    "ShowAfterHours": False,
    "UseExtendedTimeFrame": True,
    "WantPriorClose": False,
    "IncludeCurrentQuotes": False,
    "ResetTodaysAfterHoursPercentChange": False,
    "Series": [{"Key":"STOCK/US/XNYS/ABB","Dialect":"Charting","Kind":"Ticker","SeriesId":"s1","DataTypes":["Last"],"Indicators":[{"Parameters":[{"Name":"Period","Value":"50"}],"Kind":"SimpleMovingAverage","SeriesId":"i2"},{"Parameters":[],"Kind":"Volume","SeriesId":"i3"}]}]
}
data = {
    'json': json.dumps(data)
}
data = urlencode(data)
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Dylan2010.EntitlementToken': '57494d5ed7ad44af85bc59a51dd87c90',
    'Origin': 'https://quotes.wsj.com',
    'Referer': 'https://quotes.wsj.com/ABB/advanced-chart',
    'Sec-Fetch-Mode': 'cors',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
url = 'https://api.wsj.net/api/michelangelo/timeseries/history?' + data + '&ckey=57494d5ed7'
print(url)
r = requests.get(url, headers=headers)
print(r.text)
This is great as it's quite simple and works, but I can only retrieve minute data that is at most 25 days old or so.
On the other hand, the Morningstar charting seems to have much more minute data available, and I would like to do the same thing with it: simply get the data from the website by looking up the JavaScript calls that are made in the background while updating the chart. But when I look at the network tab, I cannot see any call being made when changing the date range. I don't know much about JavaScript and would like to know what alternative mechanism they use to achieve that (maybe async/fetch?).
Does anyone know how it would be possible for me to see those calls?

You will need to investigate whether any of the other query string params are time-based (or ticker-dependent). I have replaced several params with a fixed value that still seems to work.
import requests
start_date = '20170814'
end_date = '20190102'
r = requests.get(f'https://quotespeed.morningstar.com/ra/uniqueChartData?instid=MSSAL&sdkver=2.39.0&CToken=1&productType=sdk&cdt=7&ed={end_date}&f=d&hasPreviousClose=true&ipoDates=19801212&pids=0P000000GY&sd={start_date}&tickers=126.1.AAPL&qs_wsid=27E31E614F74FC7D8828E941CAC2D319&tmpid=1&instid=MSSAL&_=1').json()
print(r)
Original params as observed with Fiddler (screenshot omitted).
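As a side note, the same request can be written more readably by letting requests build the query string from a dict; this is an untested sketch with the parameter names and values copied from the URL above:
import requests

start_date = '20170814'
end_date = '20190102'
params = {
    'instid': 'MSSAL', 'sdkver': '2.39.0', 'CToken': '1', 'productType': 'sdk',
    'cdt': '7', 'ed': end_date, 'f': 'd', 'hasPreviousClose': 'true',
    'ipoDates': '19801212', 'pids': '0P000000GY', 'sd': start_date,
    'tickers': '126.1.AAPL', 'qs_wsid': '27E31E614F74FC7D8828E941CAC2D319',
    'tmpid': '1', '_': '1',
}
# requests url-encodes the dict into the query string for us
r = requests.get('https://quotespeed.morningstar.com/ra/uniqueChartData',
                 params=params).json()
print(r)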

Related

Scraping web data

I'd like to scrape the data from a website: https://en.macromicro.me/charts/773/baltic-dry-index, which comprises 4 data sets.
I've discovered that the website uses JavaScript to send a request to https://en.macromicro.me/charts/data/773 to get the data, but for some reason I just can't get the data using Postman or my script. I keep getting the result: {'success': 0, 'data': [], 'msg': 'error #240'}
Did I miss anything here?
Here is my code:
import requests
import json
import datetime
import pandas as pd
url = 'https://en.macromicro.me/charts/data/773'
header = {
    'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Docref': 'https://www.google.com/',
    'X-Requested-With': 'XMLHttpRequest',
    'sec-ch-ua-mobile': '?0',
    'Authorization': 'Bearer ee1c7b87258a902bde1129df2b64abac',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}
r = requests.get(url, headers=header)
response = json.loads(r.text)
print(response)
The Cookie header is missing from your request. Refresh the page in your browser and copy the cookies it sends.
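A minimal sketch of that flow, reusing the header dict from the question (whether the first GET sets every cookie the endpoint checks, and whether the Bearer token stays valid, are assumptions; if it still fails, copy the Cookie header from DevTools instead):
import requests

s = requests.Session()
# Load the chart page first so the session picks up the site's cookies.
s.get('https://en.macromicro.me/charts/773/baltic-dry-index', headers=header)
# Call the data endpoint with the same session, which now sends those cookies.
r = s.get('https://en.macromicro.me/charts/data/773', headers=header)
print(r.json())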

How to scrape a dynamic JavaScript variable value using BeautifulSoup and Requests

I am scraping a login page, and I only need the var salt = variable inside a script tag.
This is the website: https://ib.muamalatbank.com/ib-app/loginpage
After reading all the answers here and using BeautifulSoup and requests, I can get these 2 variables (maybe because they're static):
var muserid='User ID must be filled';
var mpassword= 'Password must be filled';
But when I try to scrape var salt=, it gives me the values of all the vars.
I just need the var salt value, with no quotation marks (screenshots of my result and of the page source omitted).
I have already tried re.search, re.compile and re.findall, but I am a newbie and keep getting errors like "Object cannot string...".
from bs4 import BeautifulSoup as bs
import requests
import re
import lxml
import json
URL = 'https://ib.muamalatbank.com/ib-app/loginpage'
REF = 'https://ib.muamalatbank.com'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0', 'origin': URL, 'referer': REF}
s = requests.session()
soup = bs(s.get(URL, headers=HEADERS, timeout=5, verify=False).text,"html.parser")
script = soup.find_all("script")[11]
ambilteks = soup.find_all(text=re.compile("salt=(.*?)"))
print(ambilteks)
Note: I need help, but I am not interested in using Selenium.
I have a fully working script in PHP (Laravel), but I need this in Python and have no knowledge of Laravel; if anyone asks, I will share the Laravel code.
Please help me, thank you very much.
Try using re.compile and include the quotation marks in your regex, then extract the first result; the capture group itself excludes them. This is not tested against the page response, so first verify the string is actually present in the response.
p = re.compile(r"var salt='(.*?)'")
res = p.findall(s.get(URL, headers=HEADERS, timeout=5, verify=False).text)[0]
print(res)
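If you prefer a single match with a guard in case the pattern is absent, a small variant of the same idea:
m = re.search(r"var salt='(.*?)'",
              s.get(URL, headers=HEADERS, timeout=5, verify=False).text)
salt = m.group(1) if m else None  # the capture group excludes the quotation marks
print(salt)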

Python3 - grabbing data from dynamically refreshed table in webpage

I am pulling some data from this web page using the back-end call below in Python. I have managed to get all the channels on the page returned by playing around with the &items parameter. However, I cannot seem to grab the request in the console, or see any relevant params that would change which times of day the EPG data is populated for:
import requests
session = requests.Session()
url = 'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lang=en&lu=36625D&tz=America%2FToronto&items=3100&st=&wd=940&nc=0&div=tvmds_frames&si=0'
headers = {
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
r = session.get(url, headers=headers)
status_code2 = r.status_code
data2 = r.content.decode('utf-8', errors='ignore')
filename = 'C:\\Users\\myuser\\mypath\\test_file.txt'
with open(filename, "a", encoding='utf-8', errors="ignore") as text_file:
    text_file.write(data2)
# no explicit close() needed: the with statement closes the file
...all I get is a warning in the Google Chrome DevTools console saying [Violation] Forced reflow while executing JavaScript.
Can anyone assist? I need to be able to grab program data for a full 24-hour period and across different days...
Thanks

How to find all the JavaScript requests made from my browser when I'm accessing a site

I want to scrape the contents of LinkedIn using requests and bs4, but I'm facing a problem with the JavaScript that loads the page after I sign in (I don't get the home page directly). I don't want to use Selenium.
Here is my code:
import requests
from bs4 import BeautifulSoup
class Linkedin():
    def __init__(self, url):
        self.url = url
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) "
                                     "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}

    def saveRsulteToHtmlFile(self, nameOfFile=None):
        if nameOfFile is None:
            nameOfFile = "Linkedin_page"
        with open(nameOfFile + ".html", "wb") as file:
            file.write(self.response.content)

    def getSingInPage(self):
        self.sess = requests.Session()
        self.response = self.sess.get(self.url, headers=self.header)
        soup = BeautifulSoup(self.response.content, "html.parser")
        self.csrf = soup.find(attrs={"name": "loginCsrfParam"})["value"]

    def connecteToMyLinkdin(self):
        self.form_data = {"session_key": "myemail@mail.com",
                          "loginCsrfParam": self.csrf,
                          "session_password": "mypassword"}
        self.url = "https://www.linkedin.com/uas/login-submit"
        self.response = self.sess.post(self.url, headers=self.header, data=self.form_data)

    def getAnyPage(self, url):
        self.response = self.sess.get(url, headers=self.header)

url = "https://www.linkedin.com/"
likedin_page = Linkedin(url)
likedin_page.getSingInPage()
likedin_page.connecteToMyLinkdin()  # I'm connected, but JavaScript is still loading
likedin_page.getAnyPage("https://www.linkedin.com/jobs/")
likedin_page.saveRsulteToHtmlFile()
I want help getting past the JavaScript loading without using Selenium...
Although it's technically possible to simulate all the calls from Python, for a dynamic page like LinkedIn I think it will be quite tedious and brittle.
Anyway, you'd open the developer tools in your browser before you open LinkedIn and see what the traffic looks like. You can filter for the requests made from JavaScript (in Firefox, the filter is called XHR).
You would then simulate the necessary/interesting requests in your code; see the sketch after the links below. The benefit is that the servers usually return structured data to JavaScript, such as JSON, so you won't need to do as much HTML parsing.
If you find you're not progressing very much this way (it really depends on the particular site), then you will probably have to use Selenium or some alternative such as:
https://robotframework.org/
https://miyakogi.github.io/pyppeteer/ (port of Puppeteer to Python)
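As an illustration of replaying such a request (the endpoint URL below is a hypothetical placeholder; use whatever XHR you actually see in the network tab, with its headers copied over):
import requests

s = requests.Session()
# Hypothetical XHR endpoint spotted in the network tab; replace with the real one.
xhr_url = 'https://www.linkedin.com/some/xhr/endpoint'
r = s.get(xhr_url, headers={'User-Agent': 'Mozilla/5.0'})
data = r.json()  # XHR endpoints typically return JSON rather than HTML
print(data)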
You should send all the XHR and JS requests manually, in the same session you created during login. Also, pass all the fields in the request headers (copy them from the network tools).
self.header_static = {
    'authority': 'static-exp2.licdn.com',
    'method': 'GET',
    'path': '/sc/h/c356usw7zystbud7v7l42pz0s',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'referer': 'https://www.linkedin.com/jobs/',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Mobile Safari/537.36'
}

def postConnectionRequests(self):
    urls = [
        "https://static-exp2.licdn.com/sc/h/62mb7ab7wm02esbh500ajmfuz",
        "https://static-exp2.licdn.com/sc/h/mpxhij2j03tw91bpplja3u9b",
        "https://static-exp2.licdn.com/sc/h/3nq91cp2wacq39jch2hz5p64y",
        "https://static-exp2.licdn.com/sc/h/emyc3b18e3q2ntnbncaha2qtp",
        "https://static-exp2.licdn.com/sc/h/9b0v30pbbvyf3rt7sbtiasuto",
        "https://static-exp2.licdn.com/sc/h/4ntg5zu4sqpdyaz1he02c441c",
        "https://static-exp2.licdn.com/sc/h/94cc69wyd1gxdiytujk4d5zm6",
        "https://static-exp2.licdn.com/sc/h/ck48xrmh3ctwna0w2y1hos0ln",
        "https://static-exp2.licdn.com/sc/h/c356usw7zystbud7v7l42pz0s",
    ]
    for url in urls:
        self.sess.get(url, headers=self.header_static)
        print("REQUEST SENT TO " + url)
I called the postConnectionRequests() function after logging in and before saving the HTML content, and received the complete page.
Hope this helps.
XHR requests are sent by JavaScript, and Python will not run JavaScript code when it fetches a page using requests and BeautifulSoup. Tools like Selenium load the page and run the JavaScript. You can also use headless browsers.
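For example, a minimal pyppeteer sketch (assuming the target page is reachable without logging in; otherwise the session cookies would have to be injected first):
import asyncio
from pyppeteer import launch

async def render(url):
    # Launch headless Chromium, let it execute the page's JavaScript,
    # and return the fully rendered HTML.
    browser = await launch()
    page = await browser.newPage()
    await page.goto(url)
    html = await page.content()
    await browser.close()
    return html

html = asyncio.get_event_loop().run_until_complete(render('https://www.linkedin.com/jobs/'))
print(html[:500])  # first 500 characters of the rendered page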

Outlook login script in Python 3: JavaScript not enabled - cannot log in

I am currently trying to log into an Outlook account using requests only in Python. I did the same thing with Selenium before, but because of better performance and the ability to use proxies more easily, I would like to use requests now. The problem is that whenever I send a POST request to the Outlook post URL, it returns a page saying that it will not function without JavaScript.
I read on here that I would have to make the requests that JavaScript would make, so I used the network analysis tool in Firefox and made requests to all the URLs that the browser made requests to. It still returns the same page, saying that JS is not enabled and it will not function without JS.
import json
import requests
from time import sleep

s = requests.Session()
s.headers.update({
    "Accept": "application/json",
    "Accept-Language": "en-US",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Host": "www.login.live.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
})
url = "https://login.live.com/"
r = s.get(url, allow_redirects=True)
#Simulate JS
sim0 = s.get("https://logincdn.msauth.net/16.000.28299.9
#...requests 1-8 here
sim9 = s.get("https://logincdn.msauth.net/16.000.28299.9/images/Backgrounds/0.jpg?x=a5dbd4393ff6a725c7e62b61df7e72f0", allow_redirects=True)
post_url = "https://login.live.com/ppsecure/post.srf?contextid=EFE1326315AF30F4&bk=1567009199&uaid=880f460b700f4da9b692953f54786e1c&pid=0"
payload = {"username": username}
sleep(4)
s.cookies.update({"logonLatency": "LGN01=637025992217600244"})
s.cookies.update({"MSCC": "1567001943"})
s.cookies.update({"CkTst": "G1567004031095"})
print(s.cookies.get_dict())
print("---------------------------")
print(json.dumps(payload))
print("---------------------------")
rp = s.post(post_url, data=json.dumps(payload), allow_redirects=True)
rg = s.get("https://logincdn.msauth.net/16.000.28299.9/images/arrow_left.svg?x=a9cc2824ef3517b6c4160dcf8ff7d410")
print(rp.text)
print(rp.status_code)
If anyone could point me in the right direction on how to fix this, I would greatly appreciate it! Thanks in advance.
