I'm trying to handle the "infinite scrolling" on the Quora website. I'm using the Selenium library with Python. After trying the send_keys method, I tried running a JavaScript command to scroll down the page.
It doesn't work when I run the code, but if I run it in the Firefox console, it works.
How can I fix this problem? Also, is it possible to use PhantomJS?
import time
import platform

from selenium import webdriver

class getFrom(object):
    def scrapying(self):
        print(platform.system())
        browser = webdriver.Firefox()
        #browser = webdriver.PhantomJS(executable_path='/usr/local/bin/node_modules/phantomjs/lib/phantom/bin/phantomjs')
        browser.get("https://www.quora.com/C-programming-language")
        #browser.get("https://answers.yahoo.com/dir/index?sid=396545660")
        time.sleep(10)
        #elem = browser.find_element_by_class_name("topic_page content contents main_content fixed_header ContentWrapper")
        no_of_pagedowns = 500
        while no_of_pagedowns:
            #elem.send_keys(Keys.SPACE)
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(0.5)
            no_of_pagedowns -= 1
        browser.quit()

myClassObject = getFrom()
myClassObject.scrapying()
One option would be to repeatedly scroll the last loaded post on the page into view, which triggers the next batch to load:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.quora.com/C-programming-language")

NUM_POSTS = 200

posts = driver.find_elements_by_css_selector("div.pagedlist_item")
while len(posts) < NUM_POSTS:
    driver.execute_script("arguments[0].scrollIntoView();", posts[-1])
    posts = driver.find_elements_by_css_selector("div.pagedlist_item")

print(len(posts))
This would scroll the page down until at least NUM_POSTS posts are loaded.
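If the instant re-query ever races the lazy loader, a variation (a sketch, not from the original answer; the 10-second timeout is an assumption) is to wait explicitly for the post count to grow after each scroll, reusing driver and NUM_POSTS from the snippet above:

from selenium.webdriver.support.ui import WebDriverWait

def more_posts_than(count):
    # Condition: true once more than `count` posts are present in the DOM.
    return lambda d: len(d.find_elements_by_css_selector("div.pagedlist_item")) > count

posts = driver.find_elements_by_css_selector("div.pagedlist_item")
while len(posts) < NUM_POSTS:
    driver.execute_script("arguments[0].scrollIntoView();", posts[-1])
    WebDriverWait(driver, 10).until(more_posts_than(len(posts)))  # raises TimeoutException if nothing new loads
    posts = driver.find_elements_by_css_selector("div.pagedlist_item")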
I'm also not able to get the infinite scroll to trigger with this approach while using Firefox, although the gist of the code works in the console:
for i in range(0, 5):
    self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
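A common workaround (a sketch, not from the original post; it keeps the 3-second pause from above) is to wait for the document height to actually change before scrolling again, and to stop once the page stops growing:

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.quora.com/C-programming-language")

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # give the lazy loader time to fetch the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # the page stopped growing; no more posts are loading
    last_height = new_height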
I want to automate some searching stuff for myself, but I have a bit of a problem here.
On this website:
https://shop.orgatop.de/
The program can't find the search bar, and I don't really know why.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://shop.orgatop.de/')
input_search = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="solrSearchTerm"]')))
input_search.click()
input_search.send_keys('asd')
input_search.send_keys(Keys.RETURN)
The element is inside nested iframes (innerFrame > catalog > content > input). You need to switch to those frames first in order to access the search input box.
Induce WebDriverWait() with frame_to_be_available_and_switch_to_it():
driver = webdriver.Firefox()
driver.get('https://shop.orgatop.de/')
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, "innerFrame")))
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, "catalog")))
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, "content")))
input_search = WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="solrSearchTerm"]')))
input_search.click()
input_search.send_keys('asd')
input_search.send_keys(Keys.RETURN)
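Note that after working inside the nested frames you have to switch back before interacting with anything outside them; a one-line sketch:

driver.switch_to.default_content()  # return to the top-level document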
I'm using Scrapy and Splash to scrape this page: https://www.athleteshop.nl/shimano-voor-as-108mm-37184
Here's the image I get in Scrapy shell with view(response): [screenshot of the page as rendered in Scrapy shell]
I need the barcode highlighted in red, but it's generated by JavaScript, as can be seen in the page source in Chrome with F12.
However, although the page displays correctly in both Scrapy shell and Splash localhost, and although Splash localhost gives me the right HTML, the barcode I want to select always comes back as None with response.xpath("//table[@class='data-table']//tr[@class='even']/td[@class='data last']/text()").extract_first().
The selector isn't the problem since it works in Chrome's source code.
I've been looking for the answer on the web and on SO for two days, and no one seems to have the same problem. Is it just that Splash doesn't support it?
The settings are the classic ones, as follows:
SPLASH_URL = 'http://192.168.99.100:8050/'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
My code is as follows (the parse part aims at clicking on the link provided by a search engine inside the website; it works fine):
def parse(self, response):
    try:
        link = response.xpath("//li[@class='item last']/a/@href").extract_first()
        yield SplashRequest(link, self.parse_item, endpoint='render.html', args={'wait': 20})
    except Exception as e:
        print(str(e))

def parse_item(self, response):
    product = {}
    product['name'] = response.xpath("//div[@class='product-name']/h1/text()").extract_first()
    product['ean'] = response.xpath("//table[@class='data-table']//tr[@class='even']/td[@class='data last']/text()").extract_first()
    product['price'] = response.xpath("//div[@class='product-shop']//p[@class='special-price']/span[@class='price']/text()").extract_first()
    product['image'] = response.xpath("//div[@class='item image-photo']//img[@class='owl-product-image']/@src").extract_first()
    print(product['name'])
    print(product['ean'])
    print(product['image'])
The print calls on the name and the image URL work perfectly fine, since those aren't generated by JavaScript.
The code looks alright, the settings are fine, and Splash localhost shows me the right result, but my selectors don't work during execution of the script (which doesn't show any errors), nor in Scrapy shell.
The problem might be that Scrapy Splash renders instantly without honoring the wait time (20 secs!) passed as an argument. What did I do wrong, please?
Thanks in advance.
It doesn't seem to me that the content of the barcode field is generated dynamically; I can see it in the page source and extract it from Scrapy shell with response.css('.data-table tbody tr:nth-child(2) td:nth-child(2)::text').extract_first().
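A quick way to double-check this (a sketch; it assumes the field really is in the static HTML, in which case plain Scrapy suffices for it and Splash isn't needed):

# Run: scrapy shell https://www.athleteshop.nl/shimano-voor-as-108mm-37184
# Then, inside the shell, test the selector against the raw, non-rendered response:
response.css('.data-table tbody tr:nth-child(2) td:nth-child(2)::text').extract_first()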
I am trying to work out a comparable command to use in the JMeter WebDriver Sampler (JavaScript) for the waitForPopUp command. There must be a way. I have something that works for waiting for an element, but I can't work it out for a popup.
Update
I am using this code for waiting for an element:
var wait = new support_ui.WebDriverWait(WDS.browser, 5000)

WaitForLogo = function() {
    var logo = WDS.browser.findElement(org.openqa.selenium.By.xpath("//img[@src='/images/power/ndpowered.gif']"))
}

wait.until(new com.google.common.base.Function(WaitForLogo))
And this works, but I can't work out how to reuse it to wait for a popup that has no name. In Java I have used:
selenium.waitForPopUp("_blank", "30000");
selenium.selectWindow("_blank");
And that works, but I can't work out comparable JavaScript that will work in JMeter for performance testing, as I can't get Java working in JMeter.
I was able to get this working using:
var sui = JavaImporter(org.openqa.selenium.support.ui)
and:
wait.until(sui.ExpectedConditions.numberOfWindowsToBe(2))
In WebDriver Sampler you have the following methods:
WDS.browser.switchTo().frame('frame name or handle') - for switching to a frame
WDS.browser.switchTo().window('window name or handle') - for switching to a window
WDS.browser.switchTo().alert() - for switching to a modal dialog
WDS.browser.getWindowHandles() - for getting all open browser window handles
See the JavaDoc for the WebDriver.switchTo() method and The WebDriver Sampler: Your Top 10 Questions Answered guide for more details.
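For reference, the equivalent wait-and-switch exists in the Python Selenium bindings, which may help map the concepts (a sketch; the 30-second timeout is an assumption):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the popup has opened a second window, then switch to the newest handle.
WebDriverWait(driver, 30).until(EC.number_of_windows_to_be(2))
driver.switch_to.window(driver.window_handles[-1])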
I'm using the Selenium Python WebDriver to browse some pages. I want to inject JavaScript code into a page before any other JavaScript code on that page is loaded and executed; in other words, my JS code needs to run as the first JS code of that page. Is there a way to do that with Selenium?
I googled it for a couple of hours, but I couldn't find a proper answer!
Selenium now supports the Chrome DevTools Protocol (CDP) API, so it is really easy to execute a script on every page load. Here is example code for that:
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': 'alert("Hooray! I did it!")'})
And it will execute that script for EVERY page load. More information about this can be found at:
Selenium documentation: https://www.selenium.dev/documentation/en/support_packages/chrome_devtools/
Chrome Devtools Protocol documentation: https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-addScriptToEvaluateOnNewDocument
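A fuller end-to-end sketch (the marker variable and example URL are illustrative; note that execute_cdp_cmd is only available on Chromium-based drivers):

from selenium import webdriver

driver = webdriver.Chrome()
# Register the script before navigating; CDP then runs it before any page script, on every load.
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument',
                       {'source': 'window.__injected = true;'})
driver.get('https://example.com')
print(driver.execute_script('return window.__injected'))  # True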
Since version 1.0.9, selenium-wire has gained the ability to modify responses to requests. Below is an example of using this functionality to inject a script into a page before it reaches the web browser.
from gzip import compress, decompress

from lxml import html
from lxml.etree import ParserError
from lxml.html import builder

from seleniumwire import webdriver

script_elem_to_inject = builder.SCRIPT('alert("injected")')

def inject(req, req_body, res, res_body):
    # Various checks to make sure we're only injecting the script on appropriate responses:
    # the content type must be HTML, the status code 200, and the encoding gzip.
    if res.headers.get_content_subtype() != 'html' or res.status != 200 or res.getheader('Content-Encoding') != 'gzip':
        return None
    try:
        parsed_html = html.fromstring(decompress(res_body))
    except ParserError:
        return None
    try:
        parsed_html.head.insert(0, script_elem_to_inject)
    except IndexError:  # no head element
        return None
    return compress(html.tostring(parsed_html))

drv = webdriver.Firefox(seleniumwire_options={'custom_response_handler': inject})
drv.header_overrides = {'Accept-Encoding': 'gzip'}  # ensure we only get gzip-encoded responses
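From here, any page fetched through the driver passes through the handler; a usage sketch (the URL is illustrative):

drv.get('https://example.com')  # the injected alert fires before the page's own scripts run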
Another way in general to control a browser remotely and inject a script before the page's content loads is to use a library based on a separate protocol entirely, e.g. the Chrome DevTools Protocol. The most fully featured one I know of is Playwright.
If you want to inject something into the HTML of a page before it gets parsed and executed by the browser, I would suggest using a proxy such as mitmproxy.
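For the mitmproxy route, a minimal addon sketch (the injected tag is illustrative; save as inject.py and run with mitmproxy -s inject.py, pointing the browser at the proxy):

from mitmproxy import http

SCRIPT_TAG = '<script>console.log("injected before page scripts")</script>'

def response(flow: http.HTTPFlow) -> None:
    # Only touch HTML responses; insert the tag right after <head> so it runs first.
    if 'text/html' in flow.response.headers.get('content-type', ''):
        flow.response.text = flow.response.text.replace('<head>', '<head>' + SCRIPT_TAG, 1)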
If you cannot modify the page content, you may use a proxy, or use a content script in an extension installed in your browser. Doing it within Selenium, you would write some code that injects the script as one of the children of an existing element, but you won't be able to have it run before the page is loaded (i.e., before your driver's get() call returns):
String name = (String) ((JavascriptExecutor) driver).executeScript(
    "(function () { ... })();" ...
The documentation leaves unspecified the moment at which the code starts executing. You would want it to run before the DOM starts loading, and that guarantee might only be satisfiable via the proxy or extension content script routes.
If you can instrument your page with a minimal harness, you may detect the presence of a special URL query parameter and load additional content, but you need to do so using an inline script. Pseudocode:
<html>
  <head>
    <script type="text/javascript">
      (function () {
        if (location && location.href && location.href.indexOf("SELENIUM_TEST") >= 0) {
          var injectScript = document.createElement("script");
          injectScript.setAttribute("type", "text/javascript");
          // another option is to perform a synchronous XHR and inject via innerText.
          injectScript.setAttribute("src", URL_OF_EXTRA_SCRIPT);
          document.documentElement.appendChild(injectScript);
          // optional: cleaner to remove; it has already been loaded at this point.
          document.documentElement.removeChild(injectScript);
        }
      })();
    </script>
    ...
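With such a harness in place, the test only needs to navigate with the marker in the URL (a sketch; the site URL is illustrative, and the marker matches the pseudocode above):

driver.get("https://your-instrumented-site.example/page?SELENIUM_TEST=1")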
So I know it's been a few years, but I've found a way to do this without modifying the webpage's content and without using a proxy! I'm using the Node.js version, but presumably the API is consistent across languages. What you want to do is the following:
const {Builder, By, Key, until, Capabilities} = require('selenium-webdriver');

const capabilities = new Capabilities();
capabilities.setPageLoadStrategy('eager'); // Options are 'eager', 'none', 'normal'

let driver = await new Builder().forBrowser('firefox').setFirefoxOptions(capabilities).build();
await driver.get('http://example.com');
driver.executeScript(`
    console.log('hello')
`)
That 'eager' option works for me. You may need to use the 'none' option.
Documentation: https://seleniumhq.github.io/selenium/docs/api/javascript/module/selenium-webdriver/lib/capabilities_exports_PageLoadStrategy.html
EDIT: Note that the 'eager' option has not been implemented in Chrome yet...
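For what it's worth, the same page load strategy can also be set from the Python bindings (a sketch in Selenium 4 style):

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.page_load_strategy = 'eager'  # or 'none' / 'normal'
driver = webdriver.Firefox(options=options)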
I'm trying to scrape a site that's jQuery based, and I'm having trouble getting the page to load completely before extracting the elements with Selenium. The page has multiple modules, each of which is a different query. I tried using the wait commands I found in the documentation, but they would usually hang the browser after one of the multiple queries loads.
For reference, my OS is Windows 7, Firefox 30.0, Python 2.7 and Selenium 2.42.1
The commands and results I tried are as follows:
Explicit Wait: Browser hangs after loading the first query (Firefox Not Responding)
try:
    element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, path)))
finally:
    browser.quit()
Expected Conditions: Browser hangs after loading the first query (Firefox Not Responding)
wait = WebDriverWait(browser, 10)
element = wait.until(EC.element_to_be_clickable((By.XPATH,path)))
Implicit Wait: Firefox hangs after loading the first query (Firefox Not Responding)
browser.implicitly_wait(10) # seconds
myDynamicElement = browser.find_element_by_xpath(path)
Custom Function: Page loads, but selenium starts scraping before the second query is loaded resulting in an error
def wait_for_condition(browser, c):
    for _ in range(1, 10):
        print "Waiting for jquery: " + c
        result = browser.execute_script("return " + c)
        if result:
            return
        time.sleep(1)

def main():
    wait_for_condition(browser, "jQuery.active == 0")
    # First element to be clicked on to scrape:
    path = "//a[starts-with(@class, 'export db')]"
    browser.find_element_by_xpath(path).click()
The error is:
selenium.common.exceptions.NoSuchElementException: Message: u'Unable to locate element: {"method":"xpath","selector":"//a[starts-with(@class, \'export db\')]"}' ;
Catching this exception and running wait_for_condition again in the except block causes the browser to stop loading the rest of the queries and hang:
wait_for_condition(browser, "jQuery.active == 0")

try:
    path = "//a[starts-with(@class, 'export db')]"
    browser.find_element_by_xpath(path).click()
except NoSuchElementException:
    wait_for_condition(browser, "jQuery.active == 0")
    path = "//a[starts-with(@class, 'export db')]"
    browser.find_element_by_xpath(path).click()
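A sketch of one more variant that combines the jQuery-idle check with an explicit clickable wait before clicking (the 30-second timeouts are guesses):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def jquery_idle(driver):
    # True once all jQuery AJAX requests have finished.
    return driver.execute_script("return jQuery.active == 0")

path = "//a[starts-with(@class, 'export db')]"
WebDriverWait(browser, 30).until(jquery_idle)
element = WebDriverWait(browser, 30).until(EC.element_to_be_clickable((By.XPATH, path)))
element.click()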
Please let me know if you have any suggestions for solving these problems.
Thanks in advance,
Teresa