This is a follow-up to an earlier question I asked about scraping web pages.
My earlier question: Pin down exact content location in html for web scraping urllib2 Beautiful Soup
This question is about doing the same thing, but recursively over multiple pages/views.
Here is my code:
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating
From the URL, you'll see that nothing changes when we navigate to the second page, otherwise this wouldn't have been an issue. In this case, the next-page control calls JavaScript from the server. Is there a way we can still scrape this using Selenium in Python, just by some slight modification of my presented code? Please let me know if there is.
Thanks.
Just click Next after reading each page:
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

while True:
    for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
        title = review.find_element_by_class_name('BVRRReviewTitle').text
        rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
        print title, rating
    try:
        driver.find_element_by_link_text('Next').click()
    except:
        break

driver.quit()
Or if you want to limit the number of pages that you are reading:
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

maxNumOfPages = 10  # for example
for pageId in range(2, maxNumOfPages + 2):
    for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
        title = review.find_element_by_class_name('BVRRReviewTitle').text
        rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
        print title, rating
    try:
        driver.find_element_by_link_text(str(pageId)).click()
    except:
        break

driver.quit()
I think this would work. Although the Python might be a little off, this should give you a starting point:
keep_going = True  # 'continue' is a reserved word in Python, so it can't be used as a variable name
while keep_going:
    try:
        for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
            title = review.find_element_by_class_name('BVRRReviewTitle').text
            rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
            print title, rating
        driver.find_element_by_name('BV_TrackingTag_Review_Display_NextPage').click()
    except:
        print "Done!"
        keep_going = False
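One caveat that applies to the snippets above: a bare except: also swallows unrelated failures (stale elements, timeouts), making the loop look finished when it isn't. A minimal sketch, assuming the pager keeps exposing a 'Next' link, that only treats the missing link as the end of the reviews:

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')
while True:
    for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
        title = review.find_element_by_class_name('BVRRReviewTitle').text
        rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
        print title, rating
    try:
        # NoSuchElementException is raised only when the 'Next' link is gone,
        # i.e. on the last page; any other error still surfaces.
        driver.find_element_by_link_text('Next').click()
    except NoSuchElementException:
        break
driver.quit()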
The following code works fine ONLY when I am looking at the web page (i.e., the Chrome tab being manipulated by Selenium).
Is there a way to make it work even when I'm browsing another tab/window?
(I wonder how the website knows I'm actually looking at the web page or not...)
import time
import selenium.webdriver

#This is a job website in Japanese
login_url = "https://mypage.levtech.jp/"
driver = selenium.webdriver.Chrome("./chromedriver")

#Account and password are required to log in.
#I logged in and got to the following page, which displays a list of companies that I have applied for:
#https://mypage.levtech.jp/recruits/screening

#Dictionary to store company names and their job postings
#(company_names is a list of company names built earlier, not shown here)
jobs = {}
for i, company in enumerate(company_names):
    time.sleep(1)
    element = driver.find_elements_by_class_name("ScreeningRecruits_ListItem")[i]
    while element.text == "":
        #While loops and time.sleep() are there because the webpage seems to take a while to load
        time.sleep(0.1)
        element = driver.find_elements_by_class_name("ScreeningRecruits_ListItem")[i]
    td = element.find_element_by_tag_name("td")
    while td.text == "":
        time.sleep(0.1)
        td = element.find_element_by_tag_name("td")
    if td.text == company:
        td.click()
        time.sleep(1)
        jobs[company] = get_job_desc(driver)  #The get_job_desc function checks HTML tags and extracts info from certain elements
        time.sleep(1)
        driver.back()
        time.sleep(1)

print(jobs)
By the way, I have tried adding a user agent and scrolling down the page using the following code, in the hope that the web page would believe that I'm "looking at it." Well, I failed :(
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
I think the answer to your question comes down to window_handles. Whenever we open a new tab, the window focus changes on us (obviously). Because the focus is on another page, we need to use the driver.switch_to.window(handle_here) method to switch to the proper tab. To demonstrate this, I found a website with similar functionality (also in Japanese) that might help you out.
MAIN PROGRAM - For Reference
from selenium import webdriver
from selenium.webdriver.chrome.webdriver import WebDriver as ChromeDriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as DriverWait
from selenium.webdriver.support import expected_conditions as DriverConditions
from selenium.common.exceptions import WebDriverException
import time
def get_chrome_driver():
    """This sets up our Chrome Driver and returns it as an object"""
    path_to_chrome = r"F:\Selenium_Drivers\Windows_Chrome85_Driver\chromedriver.exe"
    chrome_options = webdriver.ChromeOptions()
    # Browser is displayed in a custom window size
    chrome_options.add_argument("window-size=1500,1000")
    return webdriver.Chrome(executable_path = path_to_chrome,
                            options = chrome_options)

def wait_displayed(driver: ChromeDriver, xpath: str, timeout: int = 5):
    try:
        DriverWait(driver, timeout).until(
            DriverConditions.presence_of_element_located(locator = (By.XPATH, xpath))
        )
    except:
        raise WebDriverException(f'Timeout: Failed to find {xpath}')
# Gets our chrome driver and opens our site
chrome_driver = get_chrome_driver()
chrome_driver.get("https://freelance.levtech.jp/project/search/?keyword=&srchbtn=top_search")
wait_displayed(chrome_driver, "//div[@class='l-contentWrap']//ul[@class='asideCta']")
wait_displayed(chrome_driver, "//div[@class='l-main']//ul[@class='prjList']")
wait_displayed(chrome_driver, "//div[@class='l-main']//ul[@class='prjList']//li[contains(@class, 'prjList__item')][1]")

# Click on the first item title link
titleLinkXpath = "(//div[@class='l-main']//ul[@class='prjList']//li[contains(@class, 'prjList__item')][1]//a[contains(@href, '/project/detail/')])[1]"
chrome_driver.find_element(By.XPATH, titleLinkXpath).click()
time.sleep(2)

# Get the currently displayed window handles
tabs_open = chrome_driver.window_handles
if len(tabs_open) != 2:
    raise Exception("Failed to click on our Link's Header")
else:
    print(f'You have: {len(tabs_open)} tabs open')

# Switch to the 2nd tab and then close it
chrome_driver.switch_to.window(tabs_open[1])
chrome_driver.close()

# Check how many tabs we have open
tabs_open = chrome_driver.window_handles
if len(tabs_open) != 1:
    raise Exception("Failed to close our 2nd tab")
else:
    print(f'You have: {len(tabs_open)} tabs open')

# Switch back to our main tab
chrome_driver.switch_to.window(tabs_open[0])
chrome_driver.quit()
chrome_driver.service.stop()
For scrolling, you could use this method
def scroll_to_element(driver: ChromeDriver, xpath: str, timeout: int = 5):
    try:
        webElement = DriverWait(driver, timeout).until(
            DriverConditions.presence_of_element_located(locator = (By.XPATH, xpath))
        )
        driver.execute_script("arguments[0].scrollIntoView();", webElement)
    except:
        raise WebDriverException(f'Timeout: Failed to find element using xpath {xpath}\nResult: Could not scroll')
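For example, to scroll the project list into view (the xpath is the one already used in the main program above):

scroll_to_element(chrome_driver, "//div[@class='l-main']//ul[@class='prjList']")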
I want to automate some searching stuff for myself, but I have a bit of a problem here.
On this website:
https://shop.orgatop.de/
The program can't find the search bar, and I don't really know why.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://shop.orgatop.de/')
input_search = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="solrSearchTerm"]')))
input_search.click()
input_search.send_keys('asd')
input_search.send_keys(Keys.RETURN)
The element is inside nested iframes (innerFrame > catalog > content). You need to switch into those frames first in order to access the search input.
Induce WebDriverWait() with frame_to_be_available_and_switch_to_it():
driver = webdriver.Firefox()
driver.get('https://shop.orgatop.de/')
WebDriverWait(driver,10).until(EC.frame_to_be_available_and_switch_to_it((By.NAME,"innerFrame")))
WebDriverWait(driver,10).until(EC.frame_to_be_available_and_switch_to_it((By.NAME,"catalog")))
WebDriverWait(driver,10).until(EC.frame_to_be_available_and_switch_to_it((By.NAME,"content")))
input_search = WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="solrSearchTerm"]')))
input_search.click()
input_search.send_keys('asd')
input_search.send_keys(Keys.RETURN)
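Note that after this the driver is still focused on the innermost frame. To interact with anything outside these frames again, switch back to the top-level document first:

driver.switch_to.default_content()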
Ok, so I am trying to create a program that lets the background of my GNOME desktop tap into the stream of wallpapers used by Google's Chromecast devices.
Currently I use a loop function in Python that uses selenium and a Chrome webdriver to get the images that are dynamically displayed here:
https://clients3.google.com/cast/chromecast/home/
My function works exactly like I want it to. The problem: whenever I visit this website in a browser, it starts displaying random wallpapers in a random order. There seems to be a big selection of possible wallpapers, but any single page load only ever shows about five different ones. Since my script reloads the page on each loop, I only get about five different wallpapers out of it, while there should be many more available via the website.
That leads me to the question: can I use Selenium for Python to somehow trick the website into thinking I've been around longer than just a few seconds, and thereby maybe get it to show me a different wallpaper?
NB: I know I could also get wallpapers from non-dynamic websites such as this one; I already got that to work, but the goal now is to actually tap into the live Chromecast stream. I've searched for an API for it but couldn't find one, so I decided to go with my current approach.
My current code:
import io
import os
from PIL import Image
from pyvirtualdisplay import Display
from random import shuffle
import requests
import sched
from selenium import webdriver
import subprocess
import time

s = sched.scheduler(time.time, time.sleep)

def change_desktop():
    display = Display(visible=0, size=(800, 600))
    display.start()
    browser = webdriver.Chrome()
    urllist = ["https://clients3.google.com/cast/chromecast/home/v/c9541b08", "https://clients3.google.com/cast/chromecast/home"]
    shuffle(urllist)
    browser.get(urllist[0])
    element = browser.find_element_by_id("picture-background")
    image_source = element.get_attribute("src")
    browser.quit()
    display.stop()
    request = requests.get(image_source)
    image = Image.open(io.BytesIO(request.content))
    image_format = image.format
    current_dir = os.path.dirname(os.path.realpath(__file__))
    temp_local_image_location = current_dir + "/interactive_wallpaper." + image_format
    image.save(temp_local_image_location)
    subprocess.Popen(["/usr/bin/gsettings", "set", "org.gnome.desktop.background", "picture-uri", "'" + temp_local_image_location + "'"], stdout=subprocess.PIPE)
    # Pass the function itself; change_desktop() would run it immediately
    # and schedule its return value (None) instead.
    s.enter(30, 1, change_desktop)

s.enter(30, 1, change_desktop)
s.run()
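Since each reload apparently resets the small rotation of wallpapers, one direction worth trying is to keep a single browser session open and poll for the image to change, instead of reloading on every run. A minimal, untested sketch, reusing the picture-background element id from the code above:

# Sketch: reuse one session so the page can rotate through its full selection.
browser = webdriver.Chrome()
browser.get("https://clients3.google.com/cast/chromecast/home")
last_src = None
while True:
    src = browser.find_element_by_id("picture-background").get_attribute("src")
    if src and src != last_src:
        last_src = src
        # download the image and set it as wallpaper here, as in change_desktop()
    time.sleep(30)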
I'm using Scrapy and Splash to scrape this page: https://www.athleteshop.nl/shimano-voor-as-108mm-37184
Here's what I get in Scrapy Shell with view(response): the page renders correctly (screenshot omitted; it highlighted the barcode in red).
I need that barcode. But it's generated in JavaScript, as can be seen in the source code in Chrome with F12.
However, although the page displays correctly in both Scrapy Shell and Splash localhost, and although Splash localhost gives me the right HTML, the barcode I want to select always comes back as None with response.xpath("//table[@class='data-table']//tr[@class='even']/td[@class='data last']/text()").extract_first().
The selector isn't the problem, since it works on the page source in Chrome.
I've been looking for an answer on the web and SO for two days and no one seems to have the same problem. Is it just that Splash doesn't support this?
The settings are the classic ones, as follows:
SPLASH_URL = 'http://192.168.99.100:8050/'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
My code is as follows (the parse method aims at clicking on the link provided by a search engine inside the website; that part works fine):
def parse(self, response):
    try:
        link = response.xpath("//li[@class='item last']/a/@href").extract_first()
        yield SplashRequest(link, self.parse_item, endpoint='render.html', args={'wait': 20})
    except Exception as e:
        print(str(e))

def parse_item(self, response):
    product = {}
    product['name'] = response.xpath("//div[@class='product-name']/h1/text()").extract_first()
    product['ean'] = response.xpath("//table[@class='data-table']//tr[@class='even']/td[@class='data last']/text()").extract_first()
    product['price'] = response.xpath("//div[@class='product-shop']//p[@class='special-price']/span[@class='price']/text()").extract_first()
    product['image'] = response.xpath("//div[@class='item image-photo']//img[@class='owl-product-image']/@src").extract_first()
    print(product['name'])
    print(product['ean'])
    print(product['image'])
The prints of the name and the image URL work perfectly fine, since those aren't generated by JavaScript.
The code looks alright, the settings are fine, and Splash localhost shows me something good, but my selectors don't work when the script executes (which shows no errors), nor in Scrapy Shell.
The problem might be that scrapy-splash renders instantly without honoring the wait time (20 seconds!) passed as an argument. What did I do wrong?
Thanks in advance.
It doesn't seem to me that the content of the barcode field is generated dynamically; I can see it in the page source and can extract it from scrapy shell with response.css('.data-table tbody tr:nth-child(2) td:nth-child(2)::text').extract_first().
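A quick way to confirm this before touching the spider: fetch the page without Splash and try the selector on the raw response.

# In a terminal: fetch the raw page, no Splash in between
#   scrapy shell 'https://www.athleteshop.nl/shimano-voor-as-108mm-37184'
# Then, in the shell:
response.css('.data-table tbody tr:nth-child(2) td:nth-child(2)::text').extract_first()
# If this returns the barcode, the field is static HTML and a plain
# scrapy.Request (no SplashRequest) is enough for it.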
I am learning JavaScript, Node.js and Selenium WebDriver.
As part of my education process I am developing a simple bot for Instagram.
To emulate a browser I use the Chrome web driver.
I faced a problem when trying to get the list of followers and the number of followers for an account.
This code opens the Instagram login page, enters credentials, goes to some account and opens the followers list for this account.
Data like the username and password are taken from settings.json.
var webdriver = require('selenium-webdriver'),
by = webdriver.By,
Promise = require('promise'),
settings = require('./settings.json');
var browser = new webdriver
.Builder()
.withCapabilities(webdriver.Capabilities.chrome())
.build();
browser.manage().window().setSize(1024, 700);
browser.get('https://www.instagram.com/accounts/login/');
browser.sleep(settings.sleep_delay);
browser.findElement(by.name('username')).sendKeys(settings.instagram_account_username);
browser.findElement(by.name('password')).sendKeys(settings.instagram_account_password);
browser.findElement(by.xpath('//button')).click();
browser.sleep(settings.sleep_delay);
browser.get('https://www.instagram.com/SomeAccountHere/');
browser.sleep(settings.sleep_delay);
browser.findElement(by.partialLinkText('followers')).click();
This part should open all followers, but it is not working:
var FollowersAll = browser.findElement(by.className('_4zhc5 notranslate _j7lfh'));
I also tried by XPath:
var FollowersAll = browser.findElement(by.xpath('/html/body/div[2]/div/div[2]/div/div[2]/ul/li[3]/div/div[1]/div/div[1]/a'));
When I run the following in the browser's console:
var i = document.getElementsByClassName('_4zhc5 notranslate _j7lfh');
it works fine.
I ran the code in debug mode (using WebStorm) and it shows in each case that the variable FollowersAll is undefined.
The same happens when I try to check the number of followers for the account.
Thanks in advance.
In the DOM, a class name may be used multiple times. In that case, findElement by className won't work.
XPaths should be relative, not absolute.
Try an XPath with a unique HTML attribute. For example:
//div[@id='value']
In the Chrome browser, open Developer Tools (press F12). Once you have framed an XPath, press Ctrl+F and paste it in. If it shows 1 of 1, you can surely use that XPath.
If it shows 1 of many, then you need to dig deeper to get an exact XPath.