How to scrape a dynamic JavaScript variable value using BeautifulSoup and Requests - javascript

I am scraping a login page and I only need the var salt= variable inside the JavaScript tag.
The website is https://ib.muamalatbank.com/ib-app/loginpage
After reading all the answers here and using BeautifulSoup and requests, I can get these 2 variables (maybe because they are static):
var muserid='User ID must be filled';
var mpassword= 'Password must be filled';
But when I try to scrape var salt= , it gives me every var value.
My Python code is below.
I just need the var salt value only, with no quotation marks.
I have already tried re.search, re.compile and re.findall, but I am a newbie and keep getting errors like "Object cannot string....".
from bs4 import BeautifulSoup as bs
import requests
import re

URL = 'https://ib.muamalatbank.com/ib-app/loginpage'
REF = 'https://ib.muamalatbank.com'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0', 'origin': URL, 'referer': REF}

s = requests.session()
soup = bs(s.get(URL, headers=HEADERS, timeout=5, verify=False).text, "html.parser")
# grab the <script> tag that holds the salt, then search every text node for it
script = soup.find_all("script")[11]
ambilteks = soup.find_all(text=re.compile("salt=(.*?)"))
print(ambilteks)
Note: 1) I need help, but I am not interested in using Selenium.
2) I have a script in PHP-Laravel that is fully working (I need it in Python), but I have no knowledge of Laravel; anyone who asks can have the Laravel code.
Please help me, thank you very much.

Try using re.compile and include the single quotes in your regex, then extract the first result. This is not tested against the page response, so first verify that the string is actually present in the response.
# the capture group matches only what is between the quotes, so the result has no quotation marks
p = re.compile(r"var salt='(.*?)'")
res = p.findall(s.get(URL, headers=HEADERS, timeout=5, verify=False).text)[0]
print(res)
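Building on that, here is a minimal end-to-end sketch, assuming the page really does embed a line of the form var salt='...'; only the URL and the regex come from the snippets above, the rest is illustrative:
import re
import requests

URL = 'https://ib.muamalatbank.com/ib-app/loginpage'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0'}

html = requests.get(URL, headers=HEADERS, timeout=5, verify=False).text
match = re.search(r"var salt='(.*?)'", html)   # group(1) is the value without the quotes
if match:
    print(match.group(1))
else:
    # if this prints, the variable is probably injected by JavaScript at runtime
    print("salt not found in the raw HTML")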

Related

How to find all the JavaScript requests made from my browser when I'm accessing a site

I want to scrape the contents of LinkedIn using requests and bs4, but I'm facing a problem with the JavaScript that loads the page after I sign in (I don't get the home page directly). I don't want to use Selenium.
Here is my code:
import requests
from bs4 import BeautifulSoup

class Linkedin():
    def __init__(self, url):
        self.url = url
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) "
                                     "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}

    def saveRsulteToHtmlFile(self, nameOfFile=None):
        if nameOfFile == None:
            nameOfFile = "Linkedin_page"
        with open(nameOfFile + ".html", "wb") as file:
            file.write(self.response.content)

    def getSingInPage(self):
        self.sess = requests.Session()
        self.response = self.sess.get(self.url, headers=self.header)
        soup = BeautifulSoup(self.response.content, "html.parser")
        self.csrf = soup.find(attrs={"name": "loginCsrfParam"})["value"]

    def connecteToMyLinkdin(self):
        self.form_data = {"session_key": "myemail#mail.com",
                          "loginCsrfParam": self.csrf,
                          "session_password": "mypassword"}
        self.url = "https://www.linkedin.com/uas/login-submit"
        self.response = self.sess.post(self.url, headers=self.header, data=self.form_data)

    def getAnyPage(self, url):
        self.response = self.sess.get(url, headers=self.header)

url = "https://www.linkedin.com/"
likedin_page = Linkedin(url)
likedin_page.getSingInPage()
likedin_page.connecteToMyLinkdin()  # I'm connected but JavaScript is still loading
likedin_page.getAnyPage("https://www.linkedin.com/jobs/")
likedin_page.saveRsulteToHtmlFile()
I want help getting past the JavaScript loading without using Selenium...
Although it's technically possible to simulate all the calls from Python, with a dynamic page like LinkedIn I think it will be quite tedious and brittle.
Anyway, you'd open the "developer tools" in your browser before you open LinkedIn and see what the traffic looks like. You can filter for the requests made from JavaScript (in Firefox, the filter is called XHR).
You would then simulate the necessary/interesting requests in your code. The benefit is that the servers usually return structured data to JavaScript, such as JSON, so you won't need to do as much HTML parsing.
If you find you are not progressing very much this way (it really depends on the particular site), then you will probably have to use Selenium or some alternative such as:
https://robotframework.org/
https://miyakogi.github.io/pyppeteer/ (a port of Puppeteer to Python)
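As a rough illustration of that workflow (this is not LinkedIn's real API -- the endpoint, header values and JSON keys below are placeholders you would copy from your own dev-tools capture):
import requests

session = requests.Session()
session.headers.update({
    # copy the real header values from the request you see in dev tools
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
})

# hypothetical endpoint spotted in the XHR/Fetch tab of dev tools
response = session.get("https://example.com/api/jobs?start=0&count=25")
response.raise_for_status()

data = response.json()                   # structured data instead of HTML
for item in data.get("elements", []):    # "elements" is an assumed key name
    print(item)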
You should send all the XHR and JS requests manually [in the same session which you created during login]. Also, pass all the fields in the request headers (copy them from the network tools).
self.header_static = {
    'authority': 'static-exp2.licdn.com',
    'method': 'GET',
    'path': '/sc/h/c356usw7zystbud7v7l42pz0s',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'referer': 'https://www.linkedin.com/jobs/',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Mobile Safari/537.36'
}

def postConnectionRequests(self):
    urls = [
        "https://static-exp2.licdn.com/sc/h/62mb7ab7wm02esbh500ajmfuz",
        "https://static-exp2.licdn.com/sc/h/mpxhij2j03tw91bpplja3u9b",
        "https://static-exp2.licdn.com/sc/h/3nq91cp2wacq39jch2hz5p64y",
        "https://static-exp2.licdn.com/sc/h/emyc3b18e3q2ntnbncaha2qtp",
        "https://static-exp2.licdn.com/sc/h/9b0v30pbbvyf3rt7sbtiasuto",
        "https://static-exp2.licdn.com/sc/h/4ntg5zu4sqpdyaz1he02c441c",
        "https://static-exp2.licdn.com/sc/h/94cc69wyd1gxdiytujk4d5zm6",
        "https://static-exp2.licdn.com/sc/h/ck48xrmh3ctwna0w2y1hos0ln",
        "https://static-exp2.licdn.com/sc/h/c356usw7zystbud7v7l42pz0s",
    ]
    for url in urls:
        self.sess.get(url, headers=self.header_static)
        print("REQUEST SENT TO " + url)
I called the postConnectionRequests() function after connecting and before saving the HTML content, and received the complete page.
Hope this helps.
XHR requests are sent by JavaScript, and Python will not run JavaScript code when it fetches a page using requests and BeautifulSoup. Tools like Selenium load the page and run the JavaScript. You can also use a headless browser.
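For completeness, a minimal headless-browser sketch of that last suggestion, assuming a recent Selenium with Chrome installed; the URL is just the one from the question:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")   # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.linkedin.com/")
    # page_source now contains the HTML after the JavaScript has run
    with open("Linkedin_page.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)
finally:
    driver.quit()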

Scraping: Extract data from chart

I found a way to download data from the chart page of the Wall Street Journal by looking at the network tab (in the dev tools panel) and reproducing the request that is created while refreshing the chart. It works as follows:
import requests
import json
import time
from datetime import datetime as dt
from urllib.parse import urlencode

data = {
    "Step": "PT5M",
    "TimeFrame": "D1",
    "StartDate": int(dt(2019, 5, 1).timestamp() * 1000),
    "EndDate": int(dt(2019, 5, 5).timestamp() * 1000),
    "EntitlementToken": "57494d5ed7ad44af85bc59a51dd87c90",
    "IncludeMockTick": True,
    "FilterNullSlots": False,
    "FilterClosedPoints": True,
    "IncludeClosedSlots": False,
    "IncludeOfficialClose": True,
    "InjectOpen": False,
    "ShowPreMarket": False,
    "ShowAfterHours": False,
    "UseExtendedTimeFrame": True,
    "WantPriorClose": False,
    "IncludeCurrentQuotes": False,
    "ResetTodaysAfterHoursPercentChange": False,
    "Series": [{"Key": "STOCK/US/XNYS/ABB", "Dialect": "Charting", "Kind": "Ticker", "SeriesId": "s1", "DataTypes": ["Last"], "Indicators": [{"Parameters": [{"Name": "Period", "Value": "50"}], "Kind": "SimpleMovingAverage", "SeriesId": "i2"}, {"Parameters": [], "Kind": "Volume", "SeriesId": "i3"}]}]
}
data = {
    'json': json.dumps(data)
}
data = urlencode(data)

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Dylan2010.EntitlementToken': '57494d5ed7ad44af85bc59a51dd87c90',
    'Origin': 'https://quotes.wsj.com',
    'Referer': 'https://quotes.wsj.com/ABB/advanced-chart',
    'Sec-Fetch-Mode': 'cors',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}

url = 'https://api.wsj.net/api/michelangelo/timeseries/history?' + data + '&ckey=57494d5ed7'
print(url)
r = requests.get(url, headers=headers)
r.text
This is great as it's quite simple and it works; however, I can only retrieve minute data that is at most about 25 days old.
On the other hand, the Morningstar charting seems to have much more minute data available, and I would like to do the same thing with it: simply get the data from the website by looking up the JavaScript calls that are made in the background while updating the chart. But when I look at the network tab, I cannot see any call being made when I change the date range. I don't know much about JavaScript and would like to know what alternative mechanism they use to achieve that (maybe async / fetch?).
Does anyone know how it would be possible for me to see those calls?
You will need to investigate whether any of the other query string params are time based (or ticker dependent). I have replaced several params with a fixed value that still seems to work.
import requests
start_date = '20170814'
end_date = '20190102'
r = requests.get(f'https://quotespeed.morningstar.com/ra/uniqueChartData?instid=MSSAL&sdkver=2.39.0&CToken=1&productType=sdk&cdt=7&ed={end_date}&f=d&hasPreviousClose=true&ipoDates=19801212&pids=0P000000GY&sd={start_date}&tickers=126.1.AAPL&qs_wsid=27E31E614F74FC7D8828E941CAC2D319&tmpid=1&instid=MSSAL&_=1').json()
print(r)
Original params as observed with Fiddler (screenshot omitted):
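If you do need to vary more of those params, passing them as a dict keeps the request readable. This is just the call above restructured (the duplicated instid collapses into a single key), with the values copied verbatim from the URL:
import requests

start_date = '20170814'
end_date = '20190102'

params = {
    'instid': 'MSSAL',
    'sdkver': '2.39.0',
    'CToken': '1',
    'productType': 'sdk',
    'cdt': '7',
    'ed': end_date,
    'f': 'd',
    'hasPreviousClose': 'true',
    'ipoDates': '19801212',
    'pids': '0P000000GY',
    'sd': start_date,
    'tickers': '126.1.AAPL',
    'qs_wsid': '27E31E614F74FC7D8828E941CAC2D319',
    'tmpid': '1',
    '_': '1',
}
# requests builds the same query string from the dict
r = requests.get('https://quotespeed.morningstar.com/ra/uniqueChartData', params=params).json()
print(r)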

How do I grab the table content from a webpage with javascript using python?

I'd like to grab the table content from this page. The following is my code, and I get NaN (without the data). How come the numbers are not showing up? How do I grab the table with the corresponding data? Thanks.
You can get a nice JSON response from the API:
import requests
import pandas as pd
url = 'https://api.blockchain.info/stats'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
params = {'cors': 'true'}
data = requests.get(url, headers=headers, params=params).json()
# if you want it as a table
df = pd.DataFrame(data.items())
Option 2:
Let the page fully render. There is a better way to wait with Selenium, but I just quickly threw a 5-second sleep in there to show the idea (a more robust explicit wait is sketched after the output below):
from selenium import webdriver
import pandas as pd
import time
url = 'https://www.blockchain.com/stats'
browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
browser.get(url)
time.sleep(5)
dfs = pd.read_html(browser.page_source)
print(dfs[0])
browser.close()
Output:
0 1 2 3
0 Blocks Mined 150 150 NaN
1 Time Between Blocks 9.05 minutes 9.05 minutes NaN
2 Bitcoins Mined 1,875.00000000 BTC 1,875.00000000 BTC NaN
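A sketch of the explicit-wait variant mentioned above; waiting for a <table> element is an assumption, so adjust the condition to whatever the stats page actually renders:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = 'https://www.blockchain.com/stats'
browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
browser.get(url)

# wait up to 15 seconds for a <table> to appear instead of sleeping blindly
WebDriverWait(browser, 15).until(
    EC.presence_of_element_located((By.TAG_NAME, 'table'))
)

dfs = pd.read_html(browser.page_source)
print(dfs[0])
browser.close()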

A Node.js function gives different results while working on my machine and on AWS-Lambda

I have a Node function that gets a link to a YouTube video and sends a request to https://www.convertmp3.io/ . It's a website that allows you to download an MP3 from YouTube.
It then parses the response HTML to get a direct download link from the document (using a library called "cheerio", but if you're not familiar with it, it's just for scraping the link from the HTML), and afterwards opens the link to download the MP3.
My code:
const request = require("request")
const cheerio = require("cheerio")

let link = "https://www.convertmp3.io/download/?video=https://www.youtube.com/watch?v=RRSDTE5nWnc"

const options = {
    url: link,
    headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    }
};

request(options, (error, response, html) => {
    if (!error && response.statusCode == 200) {
        const $ = cheerio.load(html)
        const document = $('.infoBox')
        let href = "https://www.convertmp3.io" + document.find("#download").attr('href')
        console.log(href)
    }
})
While working on my machine, this code works correctly. The console.log gives a link which I can open in my browser and see that the MP3 is actually downloading.
But when I run it on Lambda, the link I get has the correct format, but it simply doesn't work. It redirects to some non-existing domain.
I'm not entirely sure how this is even possible. The only thing I can think of is that the website may detect that the program is a bot (logically) and give a wrong link (which sounds pretty bizarre). But I decided to also send some User-Agent headers. That didn't work either.
I'm really confused about how this can happen and don't even know what else to try. Any thoughts?

Get audio source link from Website with python

I am writing a script to fetch audio source links from a website. By crawling the main page I get a list of the available links, but when I crawl the generated links I can't find the source (it should be inside the href of an <audio> tag).
Here is my code:
# -*- coding: utf-8 -*-
import urllib.request
from bs4 import BeautifulSoup

def getHTML(st):
    with urllib.request.urlopen(st + '/', timeout=100) as response:
        return response.read()

site = 'http://www.e-radio.gr'
soup = BeautifulSoup(getHTML(site), 'html.parser')

# Parse the main page and collect the player links
lst = list()
for a in soup.body.find_all('a', {'class': 'erplayer'}):
    item = a.get('href')
    if site in item:
        lst.append(item)
    else:
        lst.append(site + item)
print("\n".join(lst))
It seems that the website doesn't load completely and the audio source isn't present when fetching with urllib.request. What else can I use instead of urllib.request so that it waits for the full page to load? What I thought of was to use some external web browser to generate the HTML, but I don't know how to do that.
This is a bit tricky, but we can approach it step by step: first, get the player's HTML by following the iframe link; then get the flash player link and follow it; then extract the link to the MP3 and download the stream. All of that under the same web-scraping session:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def download_file(session, link, path):
    r = session.get(link, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r:
                f.write(chunk)

base_url = "http://www.e-radio.gr"
url = "http://www.e-radio.gr/Rainbow-89-Thessaloniki-i92/live"

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'}

    # station page -> embedded player iframe
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    frame = soup.find(id="playerControls1")
    frame_url = urljoin(base_url, frame["src"])

    # player iframe -> flash player fallback link
    response = session.get(frame_url)
    soup = BeautifulSoup(response.content, "html.parser")
    link = soup.select_one(".onerror a")['href']
    flash_url = urljoin(response.url, link)

    # flash player page -> mp3 stream url inside the flashvars param
    response = session.get(flash_url)
    soup = BeautifulSoup(response.content, "html.parser")
    mp3_link = soup.select_one("param[name=flashvars]")['value'].split("url=", 1)[-1]
    print(mp3_link)

    download_file(session, mp3_link, "download.mp3")
