Scraping elements rendered using React JS with BeautifulSoup


I want to scrape anchor links with class="_1UoZlX" from the search results from this particular page - https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4io
When I created a soup from the page I realised that the search results are being rendered using React JS and hence I can't find them in the page source (or in the soup).
Here's my code
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
listUrls = ['https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4iof']
PHANTOMJS_PATH = './phantomjs'
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
urls=[]
for url in listUrls:
    browser.get(url)
    wait = WebDriverWait(browser, 20)
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "_1UoZlX")))
    soup = BeautifulSoup(browser.page_source,"html.parser")
    results = soup.findAll('a',{'class':"_1UoZlX"})
    for result in results:
        link = result["href"]
        print link
        urls.append(link)
print urls
This is the error I'm getting.
Traceback (most recent call last):
  File "fetch_urls.py", line 19, in <module>
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "_1UoZlX")))
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Screenshot: available via screen
Someone mentioned in this answer that there is a way to use selenium to process the javascript on a page. Can someone elaborate on that? I did some googling but couldn't find an approach that works for this particular case.

The problem is not your code but the website you are scraping: for some reason it never finishes loading, which prevents the page from being parsed and the rest of your code from running.
I tried with wikipedia to confirm the same:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
listUrls = ["https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"]
# browser = webdriver.PhantomJS('/usr/local/bin/phantomjs')
browser = webdriver.Chrome("./chromedriver")
urls=[]
for url in listUrls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source,"html.parser")
    results = soup.findAll('a',{'class':"mw-redirect"})
    for result in results:
        link = result["href"]
        urls.append(link)
print urls
Outputs:
[u'/wiki/List_of_states_and_territories_of_India_by_area', u'/wiki/List_of_Indian_states_by_GDP_per_capita', u'/wiki/Constitutional_republic', u'/wiki/States_and_territories_of_India', u'/wiki/National_Capital_Territory_of_Delhi', u'/wiki/States_Reorganisation_Act', u'/wiki/High_Courts_of_India', u'/wiki/Delhi_NCT', u'/wiki/Bengaluru', u'/wiki/Madras', u'/wiki/Andhra_Pradesh_Capital_City', u'/wiki/States_and_territories_of_India', u'/wiki/Jammu_(city)']
P.S. I'm using a chrome driver in order to run the script against the real chrome browser for debugging purposes. Download the chrome driver from https://chromedriver.storage.googleapis.com/index.html?path=2.27/

Selenium renders the page, including its JavaScript, and your code is working as written: it waits for the element to be generated. In your case Selenium never finds that element, because the URL you gave does not render the results page. Instead, it produces the following error page:
http://imgur.com/a/YwFyE
That page does not contain the CSS class, while your code keeps waiting for that particular element. Try the Firefox web driver to see what is happening.
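One more detail worth checking, separate from the error page: in a CSS selector, a bare name such as "_1UoZlX" is read as a tag name, while a class needs a leading dot ("._1UoZlX"). Below is a minimal sketch of how the wait could be written with a class selector. The helper names class_selector and wait_for_result_links are made up for illustration, and the wait will still time out if the site serves the error page instead of the results.

```python
def class_selector(class_name, tag=""):
    """Build a CSS selector that matches by class, e.g. 'a._1UoZlX'."""
    return f"{tag}.{class_name}"


def wait_for_result_links(browser, class_name="_1UoZlX", timeout=20):
    """Wait for visible <a class=...> elements and return their hrefs.

    Hypothetical helper; `browser` is any Selenium WebDriver instance.
    """
    # Selenium is imported lazily so class_selector() works without it installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.CSS_SELECTOR, class_selector(class_name, tag="a"))
    links = WebDriverWait(browser, timeout).until(
        EC.visibility_of_all_elements_located(locator)
    )
    return [link.get_attribute("href") for link in links]
```

With the elements returned by Selenium itself, the hrefs can be read directly and the BeautifulSoup pass becomes optional.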

Related

Collecting links from a JS-Based Webpage using Selenium

I need to collect all the links from the webpage below (25 links on each of 206 pages, around 5200 links in total), which also has a "load more news" button (shown as three dots). I wrote a script, but it does not return any of the links I am trying to collect. I updated some of the Selenium attributes, but I really don't know why I cannot get all the links.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from selenium.webdriver.common.by import By
from selenium.webdriver import Chrome
#Initialize the Chrome driver
driver = webdriver.Chrome()
driver.get("https://www.mfa.gov.tr/sub.en.mfa?ad9093da-8e71-4678-a1b6-05f297baadc4")
page_count = driver.find_element(By.XPATH, "//span[@class='rgInfoPart']")
text = page_count.text
page_count = int(text.split()[-1])
links = []
for i in range(1, page_count + 1):
    # Click on the page number
    driver.find_element(By.XPATH, f"//a[text()='{i}']").click()
    time.sleep(5)
    # Wait for the page to load
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # Extract the links from the page
    page_links = soup.find_all('div', {'class': 'sub_lstitm'})
    for link in page_links:
        links.append("https://www.mfa.gov.tr"+link.find('a')['href'])
    time.sleep(5)
driver.quit()
print(links)
I tried to run my code but actually I couldn't. I need to have some solution for this.

You can do everything in Selenium with the following method:
1. Wait for the links to be visible on the page.
2. Get the titles and URLs.
3. Get the current page number.
4. If there is a button for the next page, click it and repeat from step 1; otherwise we are on the last page and execution ends.
What follows is the complete code to scrape all 206 pages:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.mfa.gov.tr/sub.en.mfa?ad9093da-8e71-4678-a1b6-05f297baadc4")
titles, urls = [], []
while 1:
    print('current page:', driver.find_element(By.CSS_SELECTOR, 'td span').text, end='\r')
    links = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.sub_lstitm a")))
    for link in links:
        titles.append( link.text )
        urls.append( link.get_attribute('href') )
    try:
        driver.find_element(By.XPATH, '//td//span/parent::td/following-sibling::td[1]').click()
    except:
        print('next page button not found')
        break
for i in range(len(titles)):
    print(titles[i],'\n',urls[i],'\n')

Using only Selenium, you can collect all the links from the webpage by inducing WebDriverWait for visibility_of_all_elements_located(), with either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://www.mfa.gov.tr/sub.en.mfa?ad9093da-8e71-4678-a1b6-05f297baadc4')
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.sub_lstitm > a")))])
Using XPATH:
driver.get('https://www.mfa.gov.tr/sub.en.mfa?ad9093da-8e71-4678-a1b6-05f297baadc4')
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='sub_lstitm']/a")))])
Console Output:
['https://www.mfa.gov.tr/no_-17_-turkiye-ve-yemen-arasinda-gerceklestirilecek-konsolosluk-istisareleri-hk.en.mfa', 'https://www.mfa.gov.tr/no_-16_-sayin-bakanimizin-abd-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-15_-iran-islam-cumhuriyeti-disisleri-bakani-huseyin-emir-abdullahiyan-in-ulkemize-yapacagi-ziyaret-hk.en.mfa', 'https://www.mfa.gov.tr/no_-14_-nepal-de-meydana-gelen-ucak-kazasi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-13_-bosna-hersek-bakanlar-konseyi-baskan-yrd-ve-disisleri-bakani-bisera-turkovic-in-ulkemizi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-12_-turkiye-iran-konsolosluk-istisareleri-hk.en.mfa', 'https://www.mfa.gov.tr/no_-11_-kuzey-kibris-turk-cumhuriyeti-kurucu-cumhurbaskani-sayin-rauf-raif-denktas-in-vefatinin-on-birinci-yildonumu-hk.en.mfa', 'https://www.mfa.gov.tr/no_-10_-italya-basbakan-yardimcisi-ve-disisleri-ve-uluslararasi-isbirligi-bakani-antonio-tajani-nin-ulkemizi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-9_-kirim-tatar-soydaslarimiz-hakkinda-mahkumiyet-karari-verilmesi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-8_-kuzeybati-suriye-ye-yonelik-bm-sinir-otesi-insani-yardim-mekanizmasinin-uzatilmasi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-7_-brezilya-da-devlet-baskani-lula-da-silva-hukumeti-ni-ve-demokratik-kurumlari-hedef-alan-siddet-olaylari-hk.en.mfa', 'https://www.mfa.gov.tr/no_-6_-sudan-daki-gelismeler-hk.en.mfa', 'https://www.mfa.gov.tr/no_-5_-senegal-in-gniby-kentinde-meydana-gelen-kaza-hk.en.mfa', 'https://www.mfa.gov.tr/no_-4_-sayin-bakanimizin-afrika-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-3_-deas-teror-orgutu-ile-iltisakli-bir-sebekenin-malvarliklarinin-abd-makamlari-ile-eszamanli-olarak-dondurulmasi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-2_-somali-de-meydana-gelen-teror-saldirisi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-1_-israil-ulusal-guvenlik-bakani-itamar-ben-gvir-in-mescid-i-aksa-ya-baskini--hk.en.mfa', 'https://www.mfa.gov.tr/no_-386_-sayin-bakanimizin-brezilya-yi-ziyareti-hk.en.mfa', 
'https://www.mfa.gov.tr/sc_-32_-gkry-nin-dogu-akdeniz-de-devam-eden-hidrokarbon-faaliyetleri-hk-sc.en.mfa', 'https://www.mfa.gov.tr/no_-385_-afganistan-da-yuksekogretimde-kiz-ogrencilere-getirilen-egitim-yasagi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-384_-isvec-disisleri-bakani-tobias-billstrom-un-turkiye-yi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-383_-yemen-cumhuriyeti-disisleri-ve-yurtdisindaki-yemenliler-bakani-dr-ahmed-awad-binmubarak-in-ulkemizi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-382_-gambiya-disisleri-uluslararasi-isbirligi-ve-yurtdisindaki-gambiyalilar-bakani-nin-ulkemizi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-381_-bosna-hersek-e-ab-adaylik-statusu-verilmesi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-380_-turkiye-meksika-ust-duzey-iki-uluslu-komisyonu-siyasi-komitesinin-ikinci-toplantisinin-duzenlenmesi-hk.en.mfa']
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

JS/AJAX Content Not loading when URL is accessed using Selenium(Python)

I'm trying to scrape this URL: https://www.wheel-size.com/size/acura/mdx/2001/
The values that I want to scrape are loaded dynamically, e.g. Center Bore.
If you open the link in a normal browser the content loads just fine, but if I use Selenium (chromedriver) it just keeps loading and the values are never displayed.
Any idea how I can scrape it?
Below is a picture of how it looks. You can also see the loading for 1-2 seconds when you open the link in a normal browser.

To extract the desired texts, e.g. 64.1 mm, 5x114.3, etc., you need to induce WebDriverWait for visibility_of_element_located(), as the elements are Google Tag Manager enabled elements, and you can use the following locator strategies:
# Imports for webdriver, Options and Service (the remaining imports are listed in the note below)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
driver.get('https://www.wheel-size.com/size/acura/mdx/2001/')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[contains(., 'Center Bore')]//following::span[1]"))).text)
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[contains(., 'PCD')]//following::span[1]"))).text)
Console Output:
64.1 mm
5x114.3
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python

How to scrape links that do not have a href and not available in page source

I'm trying to use Selenium WebDriver to extract data, and one of the links that I want to click does not have an href. The HTML tags I see in Inspect Element are also not available in the page source. I really need to click this link and proceed to the next page.
The anchor tag that I see in the inspector is shown below, and it appears to use AngularJS:
<a id="docs" ng-click="changeFragment('deal.docs')">
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get('URL here');
time.sleep(5) # Let the user actually see something!
username = driver.find_element_by_name('USERID')
username.send_keys('12345')
password = driver.find_element_by_name('PASSWORD')
password.send_keys('password')
#search_box.submit()
driver.find_element_by_id("submitInput").submit()
time.sleep(5) # Let the user actually see something!
lnum = driver.find_element_by_name('Number')
lnum.send_keys('0589403823')
checkbox = driver.find_element_by_name('includeInactiveCheckBox').click()
driver.find_element_by_id("searchButton").click()
time.sleep(5)
driver.execute_script("changeFragment('deal.docs')").click()
driver.quit()
I tried to use find element by XPath and a script, but neither worked.
The URL I'm trying to access can't be shared, as it can only be accessed through a specific network.
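Two things in the last script line stand out. execute_script() returns whatever the JavaScript returns, which is None when the script has no return statement, so chaining .click() onto it raises an AttributeError. Also, a function named in ng-click is probably defined on the Angular scope rather than as a global, so calling changeFragment(...) directly from execute_script is unlikely to find it. A minimal sketch of an alternative, assuming the anchor really has id="docs" as shown above: wait until it is clickable and click it, falling back to a DOM click. The helper names js_click_by_id and click_docs_link are hypothetical.

```python
import json


def js_click_by_id(element_id):
    """Return a JavaScript snippet that clicks the element with the given id."""
    # json.dumps produces a safely quoted JavaScript string literal.
    return f"document.getElementById({json.dumps(element_id)}).click();"


def click_docs_link(driver, element_id="docs", timeout=10):
    """Click the ng-click anchor once it is clickable; fall back to a JS click."""
    # Lazy imports so js_click_by_id() can be used without Selenium installed.
    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    element = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.ID, element_id))
    )
    try:
        element.click()
    except WebDriverException:
        # For example, the click was intercepted by an overlay:
        # trigger the click through the DOM instead.
        driver.execute_script(js_click_by_id(element_id))
```

Calling click_docs_link(driver) in place of the execute_script line lets Angular's own ng-click handler run, so there is no need to call changeFragment by hand.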

Scraping a dynamic website with Selenium/BeautifulSoup

I'm trying to scrape comments from a website using Selenium and BeautifulSoup. The site I'm trying to scrape is generated dynamically by JavaScript, which is a little beyond what I've learned in the tutorials I've seen (I'm not very familiar with JavaScript). My best working solution so far is:
browser = webdriver.Chrome(executable_path=chromedriver_path)
browser.get('https://nationen.ebcomments.dk/embed/stream?asset_id=7627366')
def load_data():
    time.sleep(1) # The site needs to load
    browser.execute_script("document.querySelector('#stream > div.talk-stream-tab-container.Stream__tabContainer___2trkn > div:nth-child(2) > div > div > div > div > div:nth-child(3) > button').click()") # Click on load more comments button
htmlSource = browser.page_source
soup = BeautifulSoup(browser.page_source, 'html.parser')
load_data() # i should call this few times to load all comments, but in this example i only do it once.
for text in soup.findAll(class_="talk-plugin-rich-text-text"):
    print(text.get_text(), "\n") # Print the comments
It works, but it's very slow, and I'm sure there is a better solution, especially if I want to scrape several hundred articles with comments.
I think all the comments come in JSON format (I looked at the Network tab in Chrome's dev tools and can see a response containing the JSON with the comments; see the pic). Then I tried to use seleniumrequests to get the data, but I'm not at all sure what I'm doing, and it's not working. It says "b'POST body missing. Did you forget to use body-parser middleware?'". Maybe I could get the JSON from the comments API, but I'm not sure if that's possible?
from seleniumrequests import Chrome
chromedriver_path = 'C:/chromedriver.exe'
webdriver = Chrome(executable_path=chromedriver_path)
response = webdriver.request('POST', 'https://nationen.ebcomments.dk/api/v1/graph/ql/', data={"assetId": "7627366", "assetUrl": "", "commentId": "","excludeIgnored": "false","hasComment": "false", "sortBy": "CREATED_AT", "sortOrder": "DESC"})

If it is only the comments you are after, the following implementation should get you there:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://nationen.ebcomments.dk/embed/stream?asset_id=7627366"
with webdriver.Chrome() as driver:
    wait = WebDriverWait(driver,10)
    driver.get(link)
    while True:
        try:
            wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,".talk-load-more > button"))).click()
        except Exception: break
    for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"[data-slot-name='commentContent'] > .CommentContent__content___ZGv1q"))):
        print(item.text)
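On the "POST body missing" error from the question: that message usually means the server did not receive a JSON body it could parse. Passing a dict through data= typically sends form-encoded fields, whereas a GraphQL endpoint generally expects a JSON object with a "query" string and an optional "variables" object. The sketch below only builds such a request with the standard library, without sending it. The query text is a placeholder and not the site's real query, and whether the endpoint also needs cookies or other headers is unknown.

```python
import json
import urllib.request

GRAPHQL_URL = "https://nationen.ebcomments.dk/api/v1/graph/ql/"


def build_graphql_request(query, variables):
    """Build (but do not send) a POST request with a JSON-encoded GraphQL body."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder only: copy the real query text from the request payload shown
# in the browser's network tab. The variables are taken from the question.
PLACEHOLDER_QUERY = "query { __typename }"
request = build_graphql_request(
    PLACEHOLDER_QUERY,
    {"assetId": "7627366", "sortBy": "CREATED_AT", "sortOrder": "DESC"},
)
# To actually send it: urllib.request.urlopen(request).read()
```

If this works for one article, the same request can be repeated per asset ID without starting a browser, which should be much faster than clicking through the page.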

Python Selenium: Iteration Error

I'm trying to download all xml files from a webpage. The process requires locating xml file download link one after the other, and once such a download link is clicked it leads to a form which needs to be submitted for the download to begin. The issue I'm facing lies in the iteration of these loops, once the first file is downloaded from the webpage I receive an error:
"selenium.common.exceptions.StaleElementReferenceException: Message: The element reference of stale: either the element is no longer attached to the DOM or the page has been refreshed"
The "97081 data-extension xml" is the 2nd downloadable file in the iteration. I've hereby attached the code, any suggestions to rectify this will be much appreciated.
import os
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", "F:\Projects\Poli_Map\DatG_Py_Dat")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/xml")
driver = webdriver.Firefox(firefox_profile=fp)
driver.get('https://data.gov.in/catalog/variety-wise-daily-market-prices-data-cauliflower')
wait = WebDriverWait(driver, 10)
allelements = driver.find_elements_by_xpath("//a[text()='xml']")
for element in allelements:
    element.click()
    class FormPage(object):
        def fill_form(self, data):
            driver.execute_script("document.getElementById('edit-download-reasons-non-commercial').click()")
            driver.execute_script("document.getElementById('edit-reasons-d-rd').click()")
            driver.find_element_by_xpath('//input[@name = "name_d"]').send_keys(data['name_d'])
            driver.find_element_by_xpath('//input[@name = "mail_d"]').send_keys(data['mail_d'])
            return self
        def submit(self):
            driver.execute_script("document.getElementById('edit-submit').click()")
    data = {
        'name_d': 'xyz',
        'mail_d': 'xyz@outlook.com',
    }
    time.sleep(5)
    FormPage().fill_form(data).submit()
    time.sleep(5)
    window_before = driver.window_handles[0]
    driver.switch_to_window(window_before)
    driver.back()
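The StaleElementReferenceException quoted above typically appears because the references in allelements belong to the page as it was before the first click; after the form is submitted and driver.back() reloads the list, those references no longer point into the current DOM. A minimal sketch of one common way around this is to re-locate the links before every step and pick them by index. The function name process_each_fresh is made up, and the sketch assumes the links come back in the same order each time.

```python
def process_each_fresh(find_all, handle):
    """Call handle() on every element, re-locating the list before each step.

    find_all: zero-argument callable returning the current list of elements.
    handle:   callable that processes one element (it may navigate away and back).
    Returns the number of elements processed.
    """
    index = 0
    while True:
        elements = find_all()  # fresh references for the current DOM
        if index >= len(elements):
            return index
        handle(elements[index])
        index += 1


# Usage with Selenium (assumes `driver`, `By` and a hypothetical
# download_via_form(element) that clicks, submits the form and goes back):
#   process_each_fresh(
#       lambda: driver.find_elements(By.XPATH, "//a[text()='xml']"),
#       download_via_form,
#   )
```

Because the list is fetched again on every pass, each click uses an element that belongs to the page currently loaded.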

I found a workaround for you that does not require submitting any fields.
You need to get the ID from the class field at the bottom of this picture (here, for instance, it's 962721).
Then use it in a URL like this:
https://data.gov.in/node/962721/download
This was found just doing a bit of "reverse-engineering". When you do web scraping, always have a look at the .js files and at your networking tab to see all the requests made.
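As a small illustration of the URL pattern above, here is a sketch that builds the download link from a node ID and fetches the file with the standard library. How the IDs are read from the page is not covered here, so they are assumed to be already collected; the function names are hypothetical and the download itself has not been tested against the live site.

```python
from urllib.request import urlretrieve

DOWNLOAD_URL_TEMPLATE = "https://data.gov.in/node/{node_id}/download"


def build_download_url(node_id):
    """Return the direct download URL for a data.gov.in node ID."""
    return DOWNLOAD_URL_TEMPLATE.format(node_id=int(node_id))


def download_node(node_id, target_path):
    """Fetch one file directly, skipping the form."""
    urlretrieve(build_download_url(node_id), target_path)


print(build_download_url(962721))  # https://data.gov.in/node/962721/download
```

Looping download_node over the collected IDs would avoid clicking through the form for each file.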
