Scraping a JavaScript Website with BeautifulSoup 4 & requests_html

I'm learning how to build a scraper for a new website, Reverb.com, after getting my scraper for a different site to work properly. Reverb, however, has been more challenging to extract information from, and the approach from my old scraper isn't working the same way. I did some research, and using requests_html instead of requests seemed to be the option most people recommended for JavaScript-rendered pages like Reverb.com.
I'm essentially trying to scrape out text versions of the headline and price information and either paginate through the different pages or loop through a list of URLs to get all the content. I'm sort of there but hitting roadblocks. Below are two versions of code I'm fiddling with.
The first version prints what looks like only 3 of many pages of content, but it does print all the instrument names and prices, still wrapped in markup. In the CSV, however, all of those items end up together on only 3 rows, not one item/price pair per row.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")
#content scrape
b = soup.findAll("h4", class_="grid-card__title")  # titles
for i in b:
    print(i)
p = soup.findAll("div", class_="grid-card__price")  # prices
for i in p:
    print(i)
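For the CSV problem in this first version, one fix is to pair each title with its price and write one row per pair. A minimal sketch, continuing from the b and p lists above and assuming the two lists line up one-to-one:
import csv

# continuing from the findAll results above
with open("rvscrape.csv", "w", newline="") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["bass_name", "bass_price"])
    # zip() pairs the i-th title with the i-th price, one listing per row
    for title, price in zip(b, p):
        csv_writer.writerow([title.text.strip(), price.text.strip()])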
Conversely, this version prints only 3 lines to the CSV, but the name and price are stripped of all markup. That only happened when I changed findAll to just find. I read that for html in r.html was a way to loop through pages without having to make a list of URLs.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
#make csv file
csv_file = open("rvscrape.csv", "w", newline='') #added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name","bass_price"])
session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")
for html in r.html:
    # content scrape
    bass_name = []
    b = soup.find("h4", class_="grid-card__title").text.strip()  # title
    #for i in b:
    #    bass_name.append(i)
    #for i in bass_name:
    #    print(i)
    price = []
    p = soup.find("div", class_="grid-card__price").text.strip()  # price
    #for i in p:
    #    print(i)
    csv_writer.writerow([b, p])

To extract all the pages of search results, you need to pull the link to the next page and keep going until no next page is available. We can do this with a while loop that checks for the existence of the next-page anchor tag.
The following script performs that loop and writes the results to the CSV. It also prints the URL of each page, so we have an estimate of which page the program is on.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
# make csv file
# added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_file = open("rvscrape.csv", "w", newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name", "bass_price"])
session = HTMLSession()
r = session.get(
    "https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
stop = False
next_url = ""
while not stop:
    print(next_url)
    soup = BeautifulSoup(r.html.raw_html, "html.parser")
    titles = soup.findAll("h4", class_="grid-card__title")  # titles
    prices = soup.findAll("div", class_="grid-card__price")  # prices
    for i in range(len(titles)):
        title = titles[i].text.strip()
        price = prices[i].text.strip()
        csv_writer.writerow([title, price])
    next_link = soup.find("li", class_="pagination__page--next")
    if not next_link:
        stop = True
    else:
        next_url = next_link.find("a").get("href")
        r = session.get("https://reverb.com/marketplace" + next_url)
        r.html.render(sleep=5)
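One detail the script above leaves out: the CSV file is never closed, so the last buffered rows may not be flushed to disk. Closing it once the loop finishes (or opening the file in a with block instead) takes care of that:
# after the while loop ends, flush and close the output file
csv_file.close()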
Output-schema mismatches like this are common when scraping JavaScript-heavy websites; they can also be solved with browser-driven (dynamic) scrapers such as Selenium.

Related

Selenium doesn't seem to load the JavaScript part of the website

Hello, I have a script using Python and Selenium, and I don't understand why it can't retrieve the JS part of the website (the same script works perfectly fine on my other machine):
import json

import chromedriver_binary  # puts the bundled chromedriver on PATH
from bs4 import BeautifulSoup
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("window-size=1024,768")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--enable-javascript")
url = "https://deliveroo.co.uk/restaurants/london/holborn?geohash=gcpvj6kxet58&collection=pizza"
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")
show_data = soup.find_all("script", id="__NEXT_DATA__")
mydata = json.loads(show_data[0].text)
I get the following error, meaning that it couldn't see this part of JSON:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Not too sure why this is working on my other machine and not on my current one.
The .text attribute doesn't work here. To get the right data, use encode_contents() instead, changing only the definition of mydata:
mydata = json.loads(show_data[0].encode_contents())
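For reference, a minimal self-contained version of the fixed script might look like the following. The headless flags are carried over from the question, and whether the site still serves the same __NEXT_DATA__ payload to a headless browser is an assumption, not something verified here:
import json

from bs4 import BeautifulSoup
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://deliveroo.co.uk/restaurants/london/holborn?geohash=gcpvj6kxet58&collection=pizza")

soup = BeautifulSoup(driver.page_source, "lxml")
show_data = soup.find_all("script", id="__NEXT_DATA__")
# encode_contents() returns the tag body as raw bytes, which json.loads accepts
mydata = json.loads(show_data[0].encode_contents())
driver.quit()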

Input a value into an HTML webpage using Python

I am attempting to automate inputting values into a webpage. However, the major issue is that the Mechanize library does not work, because my webpage has no forms that Mechanize's form.name recognizes; the field is a bare <input> element rather than part of a <form>.
I have spent the past hour researching alternatives to Mechanize that might work, but to no avail. Google is no help, as it only returns results about extracting data from websites.
My current code:
from mechanize import Browser
import csv
csv_file = 'city_names.csv' # file name
cities = [] # array to save values from csv into
with open(csv_file, 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        cities.append(row.get('cities'))
br = Browser()
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
br.open("https://iafisher.com/projects/cities/world") # The website if you are curious
for form in br.forms():
    print(form.name)  # Prints nothing
for i in range(len(cities)):
    br.select_form(class="city-input")  # ISSUE IS THROWN HERE
    control = br.form.find_control("controlname")
    # Browser passes through unknown attributes (including methods)
    # to the selected HTMLForm (from ClientForm).
    br[control] = [cities[i]]  # (the method here is __setitem__)
    response = br.submit()  # submit current form
The input value as seen in developer tools:
<input data-v-018e983a="" id="city-input" type="text" placeholder="Try 'Tokyo' or 'Kingston, Jamaica'" autocomplete="off" spellcheck="false" class="city-input ">
If there is any alternative to Mechanize or a method in Mechanize that would work, it would be appreciated.
After trying multiple different search options, I finally stumbled across a tutorial that manually tells Selenium what to do.
Final code:
from selenium import webdriver
import csv
from selenium.webdriver.common.keys import Keys
csv_file = 'city_names.csv'
cities = []
with open(csv_file, 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        cities.append(row.get('cities'))
chromedriver_location = "C:/Users/blais/Downloads/chromedriver"
driver = webdriver.Chrome(chromedriver_location)
driver.get('https://iafisher.com/projects/cities/world')
submit = '//*[@id="city-input"]'
for c in range(len(cities)):
    driver.find_element_by_xpath(submit).send_keys(cities[c] + Keys.ENTER)
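As an aside, the find_element_by_* helpers were removed in Selenium 4, so on a current install the same loop would be written with the By API. A sketch, assuming chromedriver is discoverable and cities is the list built above:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()  # Selenium 4 can resolve the driver itself
driver.get('https://iafisher.com/projects/cities/world')
for city in cities:
    # same XPath as above, expressed with the By locator API
    driver.find_element(By.XPATH, '//*[@id="city-input"]').send_keys(city + Keys.ENTER)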

Using Selenium to scrape a webpage with JavaScript

I want to scrape a Google Scholar page with a 'show more' button. I understand from my previous question that it is not plain HTML but rendered by JavaScript, and that there are several ways to scrape such pages. I tried Selenium with the following code.
from selenium import webdriver
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
chrome_path = r"....path....."
driver = webdriver.Chrome(chrome_path)
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
driver.find_element_by_xpath('/html/body/div/div[13]/div[2]/div/div[4]/form/div[2]/div/button/span/span[2]').click()
soup = BeautifulSoup(driver.page_source,'html.parser')
papers = soup.find_all('tr',{'class':'gsc_a_tr'})
for paper in papers:
    title = paper.find('a', {'class': 'gsc_a_at'}).text
    author = paper.find('div', {'class': 'gs_gray'}).text
    journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]
    print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)
The browser now clicks the 'show more' button and displays the entire page, but I am still getting the information only for the first 20 papers. I don't understand why. Please help!
Thanks!
I believe your problem is that the new elements haven't completely loaded in when your program checks the website. Try importing time and sleeping for a few seconds, like this (I removed the headless options so you can see the program work):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
driver = webdriver.Chrome()
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
time.sleep(3)
driver.find_element_by_id("gsc_bpf_more").click()
time.sleep(4)
soup = BeautifulSoup(driver.page_source, 'html.parser')
papers = soup.find_all('tr', {'class': 'gsc_a_tr'})
for paper in papers:
    title = paper.find('a', {'class': 'gsc_a_at'}).text
    author = paper.find('div', {'class': 'gs_gray'}).text
    journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]
    print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)
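A fixed sleep works but is fragile: too short and elements are missed, too long and time is wasted. An explicit wait polls for a condition instead. Here is a sketch of the same click using Selenium's standard WebDriverWait API; the timeout values are guesses:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
wait = WebDriverWait(driver, 10)  # poll for up to 10 seconds
# click 'Show more' once it becomes clickable
wait.until(EC.element_to_be_clickable((By.ID, "gsc_bpf_more"))).click()
# wait until more than the initial 20 rows are in the table before parsing
wait.until(lambda d: len(d.find_elements(By.CSS_SELECTOR, "tr.gsc_a_tr")) > 20)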
An alternative approach loads all the available articles first and then iterates over them:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.page_load_strategy = 'normal'
driver = webdriver.Chrome(options=options)
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")

# Awkward method:
# load all available articles, then iterate over them
for i in range(1, 3):
    driver.find_element_by_css_selector('#gsc_bpf_more').click()
    # waits until elements are loaded
    time.sleep(3)

# Container where all the data is located
for result in driver.find_elements_by_css_selector('#gsc_a_b .gsc_a_t'):
    title = result.find_element_by_css_selector('.gsc_a_at').text
    authors = result.find_element_by_css_selector('.gsc_a_at+ .gs_gray').text
    publication = result.find_element_by_css_selector('.gs_gray+ .gs_gray').text
    print(title)
    print(authors)
    print(publication)
    # just for separation
    print()
Part of the output:
Tax/subsidy policies in the presence of environmentally aware consumers
S Bansal, S Gangopadhyay
Journal of Environmental Economics and Management 45 (2), 333-355
Choice and design of regulatory instruments in the presence of green consumers
S Bansal
Resource and Energy economics 30 (3), 345-368

Scrape a JavaScript variable from a webpage

I am scraping a site with Beautiful Soup, but all the content is hidden inside a script tag, in a JS variable like the one below.
I can't seem to find any solution other than using Selenium, which in this case is not an option; I won't go into detail why, but it just doesn't work. I can already scrape the variable by grabbing the inside of the script tag and calling eval() on it, but that introduces a few problems (unexpected indents, unwanted function calls). I can use Python, JavaScript, and maybe C# if anything there helps.
Expected behaviour: whatever gets the info (the variable in the last line) into any of those three languages (preferably Python).
The code (sorry for the formatting, but I can't fix it since it's so long; this isn't even the full variable, it's huge):
barLoadGoogleFont('Open Sans'); barCssLoad('/global/pics/js/jquery/royalSlider/skins/universal/rs-universal.css?v=e449c4'); barCssLoad('/global/pics/css/material-icons.css?v=e6d856'); barCssLoad('/user/pics/css/user.css?v=eced9d');
barCssLoad('/user/pics/css/userIcons.css?v=6f9a03');
barCssLoad('/timeline/pics/css/timeline.css?v=8ec2ca'); barJsLibraryLoad('/global/pics/js/jquery/jquery.royalslider.min.js?v=515a43'); barJsLibraryLoad('/anketa/pics/js/utilsAnketa.js?v=9383d5'); barJsLibraryLoad('/znamky/pics/js/utilsZnamky.js?v=7afc9e'); barJsLibraryLoad('/exam/pics/js/utilsExam.js?v=033d55'); barJsLibraryLoad('/timeline/pics/js/utilsTimeline.js?v=29cf0e'); barJsLibraryLoad('/timeline/pics/js/timelineItemCreator.js?v=c37c99'); barJsLibraryLoad('/timeline/pics/js/timelineInputbox.js?v=2fde70'); barJsLibraryLoad('/timeline/pics/js/timelineViewer.js?v=f35e45');
barJsLibraryLoad('/user/pics/js/DailyPlan.js?v=e81fb9'); barJsLibraryLoad('/user/pics/js/userHomeEtest.js?v=6166f3');
$j(document).ready(function() { $j('#jwbcddd3da_md').userhome({"items":[{"timelineid":"2140963","timestamp":"2020-12-09 09:59:13","reakcia_na":"692638","typ":"h_clearplany","user":"Plan5077","target_user":null,"user_meno":"Kvarta aj2","ineid":"clearplany","text":"","cas_pridania":"2020-12-09 09:59:13","cas_udalosti":null,"data":"null","vlastnik":"Ucitel8678605","vlastnik_meno":"Barbora Drugajov\u00e1","pocet_reakcii":"0","posledna_reakcia":"","pomocny_zaznam":"1","removed":"0","cas_pridania_btc":"2020-12-09 09:59:13","posledna_reakcia_btc":null},{"timelineid":"2287814","timestamp":"2020-12-09 09:59:12","reakcia_na":"2290613","typ":"h_dailyplan","user":"Trieda8694210","target_user":null,"user_meno":"Kvarta A","ineid":"daily2020-12-09","text":"","cas_pridania":"2020-12-09 09:59:12","cas_udalosti":null,"data":"[]","vlastnik":"Ucitel8678605","vlastnik_meno":"Barbora Drugajov\u00e1","pocet_reakcii":"0","posledna_reakcia":"","pomocny_zaznam":"1","removed":"0","cas_pridania_btc":"2020-12-09 09:59:12","posledna_reakcia_btc":null},{"timelineid":"1439827","timestamp":"2020-12-09 08:56:57","reakcia_na":null,"typ":"h_clearplany","user":"*","target_user":null,"user_meno":"Cel\u00e1 \u0161kola","ineid":"clearplany","text":"","cas_pridania":"2020-12-09 08:56:57","cas_udalosti":null,"data":"null","vlastnik":"Ucitel16434","vlastnik_meno":"Ivor Dian","pocet_reakcii":"0","posledna_reakcia":"","pomocny_zaznam":null,"removed":"0","cas_pridania_btc":"2020-12-09 08:56:57","posledna_reakcia_btc":null},{"timelineid":"2290324","timestamp":"2020-12-09 08:37:22","reakcia_na":null,"typ":"sprava","user":"CustPlan5075","target_user":null,"user_meno":"Kvarta A+Kvarta B - nj4 \u00b7 nemeck\u00fd jazyk","ineid":null,"text":"Ahojte, zajtra...
OK, it's a little tough to debug without actually working with it, but you'll need to pull out that JSON structure. You can do that with splits; the following is a fairly generic version:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json

url = 'www.thesite.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if '.userhome({' in script.text:
        json_str = script.text

data = json_str.split('.userhome(')[-1]
loop = True
while loop == True:
    try:
        jsonData = json.loads(data)
        loop = False
        break
    except:
        # trailing JS (e.g. the closing ');') blocks the parse;
        # trim back to the last ';' and try again
        data = data.rsplit(';', 1)[0]

rows = []
for row in jsonData['items']:
    rows.append(row)
table = pd.DataFrame(rows)
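To sanity-check the result, you can inspect the frame before doing anything else with it:
# quick look at what was extracted
print(table.shape)
print(table.head())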

Search a string in JavaScript using Python

Following my previous question :
how to fetch javascript contents in python
I tried to make another script that fetches the data from the page's JavaScript, after getting the webpage contents of course.
But it's just not showing the content I want. I want to find "content_id" in the JavaScript of the page. This is the page: http://www.hulu.com/watch/815743
Here's what I have right now.
import re
import requests
from bs4 import BeautifulSoup
import os
import fileinput

Link = 'http://www.hulu.com/watch/815743'
q = requests.get(Link)
soup = BeautifulSoup(q.text, 'html.parser')
#print(soup)
subtitles = soup.findAll('script', {'type': 'text/javascript'})
pattern = re.compile(r'"content_id":"(.*?)"', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print(pattern.search(script.text).group(1))
I get this error:
AttributeError: 'NoneType' object has no attribute 'text'
Any idea how to solve this issue?
There are two problems in your regular expression pattern:
the quotes are escaped with backslashes in the script contents, take that into account
there is a whitespace after the colon
Here is the fixed version:
pattern = re.compile(r'\\"content_id\\":\s*\\"(.*?)\\"', re.MULTILINE | re.DOTALL)
Works for me, getting 60585710 as a result.
FYI, here is the complete code that I'm executing:
import re
import requests
from bs4 import BeautifulSoup

Link = 'http://www.hulu.com/watch/815743'
q = requests.get(Link)
soup = BeautifulSoup(q.text, 'html.parser')
pattern = re.compile(r'\\"content_id\\":\s*\\"(.*?)\\"', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print(pattern.search(script.text).group(1))
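Since the pattern is ultimately matched against text, the same value can also be pulled straight from the raw response without locating the script tag first. A simpler variant of the same idea (the page has likely changed since this was written):
import re
import requests

html = requests.get('http://www.hulu.com/watch/815743').text
match = re.search(r'\\"content_id\\":\s*\\"(.*?)\\"', html)
if match:
    print(match.group(1))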
