Using Selenium to scrape webpage with javascript - javascript

I want to scrape a Google Scholar page that has a 'show more' button. I understand from my previous question that the content is rendered by JavaScript rather than served as static HTML, and that there are several ways to scrape such pages. I tried Selenium with the following code.
from selenium import webdriver
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
chrome_path = r"....path....."
driver = webdriver.Chrome(chrome_path)
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
driver.find_element_by_xpath('/html/body/div/div[13]/div[2]/div/div[4]/form/div[2]/div/button/span/span[2]').click()
soup = BeautifulSoup(driver.page_source,'html.parser')
papers = soup.find_all('tr',{'class':'gsc_a_tr'})
for paper in papers:
    title = paper.find('a',{'class':'gsc_a_at'}).text
    author = paper.find('div',{'class':'gs_gray'}).text
    journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]
    print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)
The browser now clicks the 'show more' button and displays the entire page, but I am still getting information for only the first 20 papers. I don't understand why. Please help!
Thanks!

I believe the problem is that the new elements haven't finished loading when your program reads the page. Try importing time and sleeping for a few seconds. Like this (I removed the headless option so you can watch the program work):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
# pass the options object, otherwise the arguments above are ignored
driver = webdriver.Chrome(options=options)
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
time.sleep(3)
driver.find_element_by_id("gsc_bpf_more").click()
time.sleep(4)
soup = BeautifulSoup(driver.page_source, 'html.parser')
papers = soup.find_all('tr', {'class': 'gsc_a_tr'})
for paper in papers:
    title = paper.find('a', {'class': 'gsc_a_at'}).text
    author = paper.find('div', {'class': 'gs_gray'}).text
    journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]
    print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)
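
A fixed sleep can turn out too short on a slow connection or needlessly long on a fast one. One alternative is to poll until the number of rows actually grows. The sketch below shows the idea with the standard library only; `wait_for_growth` and `count_rows` are hypothetical names, not part of Selenium. In a real script `count_rows` might wrap something like `len(driver.find_elements_by_css_selector('tr.gsc_a_tr'))`. Selenium's own `WebDriverWait(driver, timeout).until(...)` provides the same kind of polling.

```python
import time


def wait_for_growth(count_rows, previous, timeout=10.0, poll=0.5):
    """Poll count_rows() until it exceeds `previous` or `timeout` seconds pass.

    Returns the new count on success, or None if the count never grew.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = count_rows()
        if current > previous:
            return current
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


if __name__ == "__main__":
    # Stand-in for a page that shows 20 rows at first and 40 once loading ends.
    readings = iter([20, 20, 40])
    print(wait_for_growth(lambda: next(readings), previous=20, poll=0.01))
```

Returning None on timeout lets the caller decide whether to retry, give up, or parse whatever has loaded so far.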

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.page_load_strategy = 'normal'
driver = webdriver.Chrome(options=options)
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
# Awkward method
# Loading all available articles and then iterating over them
for i in range(1, 3):
    driver.find_element_by_css_selector('#gsc_bpf_more').click()
    # waits until elements are loaded
    time.sleep(3)
# Container where all data located
for result in driver.find_elements_by_css_selector('#gsc_a_b .gsc_a_t'):
    title = result.find_element_by_css_selector('.gsc_a_at').text
    authors = result.find_element_by_css_selector('.gsc_a_at+ .gs_gray').text
    publication = result.find_element_by_css_selector('.gs_gray+ .gs_gray').text
    print(title)
    print(authors)
    print(publication)
    # just for separating purpose
    print()
Part of the output:
Tax/subsidy policies in the presence of environmentally aware consumers
S Bansal, S Gangopadhyay
Journal of Environmental Economics and Management 45 (2), 333-355
Choice and design of regulatory instruments in the presence of green consumers
S Bansal
Resource and Energy economics 30 (3), 345-368
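
The hard-coded `range(1, 3)` above clicks the button exactly twice, which only fits profiles of a certain size. A possible refinement is to keep clicking until the row count stops increasing. This is a minimal sketch under assumptions: `load_all_rows`, `click_more` and `count_rows` are hypothetical names, the two callables would wrap the Selenium calls in a real script, and `count_rows` is expected to wait for loading before it counts. `FakePage` only simulates a page so the logic can be tried without a browser.

```python
def load_all_rows(click_more, count_rows, max_clicks=50):
    """Call click_more() until count_rows() stops increasing.

    Returns the final row count. max_clicks is a safety cap.
    """
    seen = count_rows()
    for _ in range(max_clicks):
        click_more()
        current = count_rows()
        if current <= seen:
            break  # nothing new appeared, so assume everything is loaded
        seen = current
    return seen


class FakePage:
    """Simulated page that reveals `step` more rows per click, up to `total`."""

    def __init__(self, total, step=20):
        self.total = total
        self.step = step
        self.shown = min(step, total)
        self.clicks = 0

    def click_more(self):
        self.clicks += 1
        self.shown = min(self.shown + self.step, self.total)

    def count_rows(self):
        return self.shown


if __name__ == "__main__":
    page = FakePage(total=45)
    print(load_all_rows(page.click_more, page.count_rows))
```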

Related

How to scrape links that do not have a href and not available in page source

I'm trying to use Selenium WebDriver to extract data, and one of the links that I want to click does not have an href. The HTML tags I see in Inspect Element are also not present in the page source. I need to click the link and proceed to the next page.
The anchor tag that I see when inspecting is shown below; it appears to use AngularJS.
<a id="docs" ng-click="changeFragment('deal.docs')">
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get('URL here');
time.sleep(5) # Let the user actually see something!
username = driver.find_element_by_name('USERID')
username.send_keys('12345')
password = driver.find_element_by_name('PASSWORD')
password.send_keys('password')
#search_box.submit()
driver.find_element_by_id("submitInput").submit()
time.sleep(5) # Let the user actually see something!
lnum = driver.find_element_by_name('Number')
lnum.send_keys('0589403823')
checkbox = driver.find_element_by_name('includeInactiveCheckBox').click()
driver.find_element_by_id("searchButton").click()
time.sleep(5)
driver.execute_script("​changeFragment('deal.docs')").click()
driver.quit()
I tried find element by XPath and execute_script, but neither worked.
The URL I'm trying to access can't be shared, as it is reachable only from a specific network.

Scraping Javascript Website With BeautifulSoup 4 & Requests_HTML

I'm learning how to build another scraper for another website, Reverb.com, after getting my scraper on another website to work properly. Reverb, however, has been more challenging to extract information from and the model with my old scraper isn't working the same. I did some research and using requests_html instead of requests seemed like the option most were using for Javascript like what Reverb.com has.
I'm essentially trying to scrape out text versions of the headline and price information and either paginate through the different pages or loop through a list of URLs to get all the content. I'm sort of there but hitting road blocks. Below are 2 versions of code I'm fiddling with.
The first version below prints out all of what looks like only 3 of many pages of content but it prints out all the instrument names and prices with the markup. In the CSV, however, all of those items are printed together on 3 rows only, not 1 item/price pair per row.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")
#content scrape
b = soup.findAll("h4", class_="grid-card__title") #title
for i in b:
    print(i)
p = soup.findAll("div", class_="grid-card__price") #price
for i in p:
    print(i)
Conversely, this version prints out 3 lines only to a CSV but the name and price are stripped of all the markup. But it only happens when I changed the findAll to just find. I read that the for html in r.html was a way to loop through pages without having to make a list of urls.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
#make csv file
csv_file = open("rvscrape.csv", "w", newline='') #added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name","bass_price"])
session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")
for html in r.html:
    #content scrape
    bass_name = []
    b = soup.find("h4", class_="grid-card__title").text.strip() #title
    #for i in b:
    # bass_name.append(i)
    # for i in bass_name:
    # print(i)
    price = []
    p = soup.find("div", class_="grid-card__price").text.strip() #price
    #for i in p:
    # print(i)
    csv_writer.writerow([b, p])
In order to extract all the pages of search results, you need to extract the link of the next page and keep going until there is no next page available. We can do this using a while loop and checking the existence of the next anchor tag.
The following script performs the loop and also adds the results to the csv. It also prints the url of the page, so that we have an estimate of what page the program is on.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
# make csv file
# added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_file = open("rvscrape.csv", "w", newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name", "bass_price"])
session = HTMLSession()
r = session.get(
    "https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
stop = False
next_url = ""
while not stop:
    print(next_url)
    soup = BeautifulSoup(r.html.raw_html, "html.parser")
    titles = soup.findAll("h4", class_="grid-card__title") # titles
    prices = soup.findAll("div", class_="grid-card__price") # prices
    for i in range(len(titles)):
        title = titles[i].text.strip()
        price = prices[i].text.strip()
        csv_writer.writerow([title, price])
    next_link = soup.find("li", class_="pagination__page--next")
    if not next_link:
        stop = True
    else:
        next_url = next_link.find("a").get("href")
        r = session.get("https://reverb.com/marketplace" + next_url)
        r.html.render(sleep=5)
Output problems like this are common when scraping JavaScript-rendered websites. They can also be solved with a browser-driven scraper.
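
The pagination loop above builds the next URL by string concatenation, which only works if the href has exactly the expected shape, and it indexes `prices` by position, which fails if a card has no price. A hedged sketch of two small helpers that sidestep both issues: `next_page_url` and `pair_rows` are hypothetical names, and the hrefs in the usage lines are made up, since the real shape of Reverb's "next" link is not shown here.

```python
from urllib.parse import urljoin

BASE = "https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022"


def next_page_url(current_url, href):
    """Resolve the href of a 'next' link against the page it was found on."""
    return urljoin(current_url, href)


def pair_rows(titles, prices):
    """Pair each title with its price, stopping at the shorter list."""
    return [[t.strip(), p.strip()] for t, p in zip(titles, prices)]


if __name__ == "__main__":
    # Relative, query-only and absolute hrefs are all resolved correctly.
    print(next_page_url(BASE, "/marketplace/bass-guitars?page=2"))
    print(next_page_url(BASE, "?page=2"))
    print(pair_rows([" Fender Jazz Bass ", "Hofner"], ["$900 "]))
```

Each list returned by `pair_rows` can be passed straight to `csv_writer.writerow`.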

Web scraping with BeautifulSoup won't work

Ultimately, I'm trying to open all articles of a news website and then make a top 10 of the words used in all the articles. To do this, I first wanted to see how many articles there are so I could iterate over them at some point, haven't really figured out how I want to do everything yet.
To do this, I wanted to use BeautifulSoup4. I think the class I'm trying to get is Javascript as I'm not getting anything back.
This is my code:
url = "http://ad.nl"
ad = requests.get(url)
soup = BeautifulSoup(ad.text.lower(), "xml")
titels = soup.findAll("article")
print(titels)
for titel in titels:
    print(titel)
The article name is sometimes an h2 or an h3. It always has one and the same class, but I can't get anything through that class. It has some parents but it uses the same name but with the extension -wrapper for example. I don't even know how to use a parent to get what I want but I think that those classes are Javascript as well. There's also an href which I'm interested in. But once again, that is probably also Javascript as it returns nothing.
Does anyone know how I could use anything (preferably the href, but the article name would be ok as well) by using BeautifulSoup?
In case you don't want to use Selenium, this works for me. I've tried it on two PCs with different internet connections. Can you try it?
from bs4 import BeautifulSoup
import requests
cookies={"pwv":"2",
"pws":"functional|analytics|content_recommendation|targeted_advertising|social_media"}
page=requests.get("https://www.ad.nl/",cookies=cookies)
soup = BeautifulSoup(page.content, 'html.parser')
articles = soup.findAll("article")
Then follow kimbo's code to extract h2/h3.
As @Sri mentioned in the comments, when you open up that url, you have a page show up where you have to accept the cookies first, which requires interaction.
When you need interaction, consider using something like selenium (https://selenium-python.readthedocs.io/).
Here's something that should get you started.
(Edit: you'll need to run pip install selenium before running this code below)
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://ad.nl'
# launch firefox with your url above
# note that you could change this to some other webdriver (e.g. Chrome)
driver = webdriver.Firefox()
driver.get(url)
# click the "accept cookies" button
btn = driver.find_element_by_name('action')
btn.click()
# grab the html. It'll wait here until the page is finished loading
html = driver.page_source
# parse the html soup
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.findAll("article")
for article in articles:
    # check for article titles in both h2 and h3 elems
    h2_titles = article.findAll('h2', {'class': 'ankeiler__title'})
    h3_titles = article.findAll('h3', {'class': 'ankeiler__title'})
    for t in h2_titles:
        # first I was doing print(t.text), but some of them had leading
        # newlines and things like '22:30', which I assume was the hour of the day
        text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
        print(text)
    for t in h3_titles:
        text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
        print(text)
# close the browser
driver.close()
This may or may not be exactly what you have in mind, but this is just an example of how to use selenium and beautiful soup. Feel free to copy/use/modify this as you see fit.
And if you're wondering about what selectors to use, read the comment by @JL Peyret.
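
Once the titles or article texts have been collected, the top-10 word count mentioned in the question could be done with `collections.Counter`. This is only a sketch: `top_words` is a hypothetical helper, the sample strings are made up, and a real run would probably also want to filter out very common words.

```python
import re
from collections import Counter


def top_words(texts, n=10):
    """Return the n most common lowercase words across all texts."""
    counts = Counter()
    for text in texts:
        # \w+ keeps letters and digits together and drops punctuation
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_words(["De kat en de hond", "de hond blaft"], n=2))
```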

How to grab data using XPath on javascript websites?

I would like to grab data on this news site. http://www.inquirer.net/
I want to grab news titles on the tiles.
Here's the screen shot of the inspected code
As you can see, one of the tile titles that I want to grab is already there. When I copy the XPath from the browser, it returns //*[@id="tgs3_info"]/h2
I tried to run my python code.
import lxml.html
import lxml.etree
import requests
link = 'http://www.inquirer.net/'
res = requests.get(link)
r = res.content
html_content = lxml.html.fromstring(r)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
print(root)
but it returns an empty list.
I tried to search for an answer here on Stack Overflow and elsewhere on the internet, but I don't really get it. When you view the page source of the site, the data that I want is not inside a JavaScript function. It is in a div, so I don't understand why I can't grab it. I hope I can find an answer here.
With inputs from Xurasky's solution to avoid a 403 error
import lxml.html
import lxml.etree
from urllib.request import Request, urlopen
req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})
r = urlopen(req).read()
html_content = lxml.html.fromstring(r)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
for a in root:
    print(a.text_content())
Output
Duterte, Roque meeting set in Malacañang
2 senators welcome Ventura's revelations in Atio hazing case
Paolo Duterte vows to retire from politics in 2019
NBA: DeMarcus Cousins regrets being loyal to Sacramento Kings
PH bet Elizabeth Durado Clenci wins 2nd runner-up at Miss Grand International 2017
DOJ wants Divina, 50 others in `Atio' hazing case added on BI watchlist
Georgina Wilson Shares Messages From Fans on Baby Blues
I believe you are getting a urllib.error.HTTPError: HTTP Error 403: Forbidden Error.
You can fix this by using
import lxml.html
import lxml.etree
from urllib.request import Request, urlopen
req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})
res = urlopen(req).read()
html_content = lxml.html.fromstring(res)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
print(root)
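
In XPath, an attribute is matched with `@`, as in `[@id="tgs3_info"]`. The sketch below illustrates that predicate with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The markup is a made-up, well-formed fragment and `headlines` is a hypothetical helper; a real page is rarely well-formed XML, so `lxml.html` as used above remains the more practical parser.

```python
import xml.etree.ElementTree as ET

# Made-up fragment that mimics several tiles sharing the same id.
SAMPLE = """
<html><body>
  <div id="tgs3_info"><h2>First headline</h2></div>
  <div id="other"><h2>Not wanted</h2></div>
  <div id="tgs3_info"><h2>Second headline</h2></div>
</body></html>
"""


def headlines(markup, element_id="tgs3_info"):
    """Return the text of every h2 directly under an element with this id."""
    root = ET.fromstring(markup.strip())
    return [h2.text for h2 in root.findall(f".//*[@id='{element_id}']/h2")]


if __name__ == "__main__":
    print(headlines(SAMPLE))
```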

Python, scrapy: scrape links then iterate over those links to scrape further links

I am trying to create a Scrapy spider that focuses on a site called weedmaps.com. Weedmaps uses the Google Maps API to generate information on dispensaries based on state and then regional location information. What I ultimately want the spider to do is start from the top layer, dive into states, scrape the regional links within those states, go into the regional links one at a time, scrape the dispensary links, and then go into the dispensary links one at a time and scrape specific information about those individual dispensaries. Given that the site is dynamic, I have been using Selenium to account for JavaScript. Thanks to help from this site, I have been able to scrape regional links and dispensary links, but separately. When I try to combine them, the spider only collects the first regional link, goes directly into that regional link to collect dispensary information, and then ends. I would also like to somehow create a list of regional links and a list of dispensary links that are populated as the spider runs.
Any insights on how to accomplish this would be fantastic. Below is the code for what I have thus far, and I just can't seem to figure this thing out. Thank you in advance!
import scrapy
import time
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from scrapybot import __init__
class scrapybot_spider(scrapy.Spider):
    name = "scrapybot_spider"
    allowed_domains = ['https://weedmaps.com']
    regionlinks = []
    dispensarylinks = []
    start_urls = ["https://weedmaps.com/dispensaries/in/united-states/colorado"]
    #initialize the selenium webdriver via Firefox
    def __init__(self):
        self.browser = webdriver.Firefox()
    #scraping regional links in States
    def parse(self, response):
        self.browser.get('https://weedmaps.com/dispensaries/in/united-states/colorado')
        wait = WebDriverWait(self.browser, 10)
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.subregion a.region-ajax-link")))
        for region in self.browser.find_elements_by_css_selector("div.subregion a.region-ajax-link"):
            time.sleep(5)
            region = region.get_attribute("href")
            self.regionlinks.append(region)
            print regionlinks
    #scraping dispensary links within regional links from above
    def dispensaryparse(self, response):
        global region
        self.browser.get(region)
        wait = WebDriverWait(self.browser, 10)
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.dispensary div.name a")))
        for dispensary in self.browser.find_elements_by_css_selector("div.dispensary div.name a"):
            dispensary = dispensary.get_attribute("href")
            self.dispensarylinks.append(dispensary)
            print dispensary
        return dispensary
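
The traversal described in the question (state page, then region pages, then dispensary links) can be sketched independently of Scrapy and Selenium. The sketch collects every region href into a plain list before visiting any of them, because navigating the browser to a new page generally invalidates elements found on the previous one. Everything here is an assumption for illustration: `crawl` and `get_links` are hypothetical names, `get_links(url)` stands in for the Selenium lookups, and the URLs in the fake site are made up.

```python
def crawl(state_url, get_links):
    """Collect region links for a state, then dispensary links per region.

    get_links(url) is a caller-supplied callable that returns the links
    found on that page. Returns (region_links, dispensary_links).
    """
    # Finish collecting region links before navigating anywhere else.
    region_links = list(get_links(state_url))
    dispensary_links = []
    for region in region_links:
        dispensary_links.extend(get_links(region))
    return region_links, dispensary_links


if __name__ == "__main__":
    # Made-up site map standing in for the real pages.
    fake_site = {
        "/colorado": ["/colorado/denver", "/colorado/boulder"],
        "/colorado/denver": ["/dispensaries/a", "/dispensaries/b"],
        "/colorado/boulder": ["/dispensaries/c"],
    }
    regions, dispensaries = crawl("/colorado", lambda url: fake_site.get(url, []))
    print(regions)
    print(dispensaries)
```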
