Scrapy Splash not respecting Rendering "wait" time - javascript

I'm using Scrapy and Splash to scrape this page: https://www.athleteshop.nl/shimano-voor-as-108mm-37184
Here's the image I get in Scrapy Shell with view(response):
[screenshot: the page as rendered by view(response), with the barcode field highlighted in red]
I need the barcode highlighted in red, but it's generated by JavaScript, as can be seen in the source code in Chrome with F12.
However, although the page is displayed correctly in both Scrapy Shell and the Splash localhost, and although the Splash localhost gives me the right HTML, the barcode I want to select always equals None with response.xpath("//table[@class='data-table']//tr[@class='even']/td[@class='data last']/text()").extract_first().
The selector isn't the problem, since it works in Chrome's source code.
I've been looking for the answer on the web and SO for two days and no one seems to have the same problem. Is it just that Splash doesn't support it?
The settings are the classic ones, as follows:
SPLASH_URL = 'http://192.168.99.100:8050/'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
My code is as follows (the parse part aims at clicking on the link provided by a search engine inside the website; it works fine):
def parse(self, response):
    try:
        link = response.xpath("//li[@class='item last']/a/@href").extract_first()
        yield SplashRequest(link, self.parse_item, endpoint='render.html', args={'wait': 20})
    except Exception as e:
        print(str(e))

def parse_item(self, response):
    product = {}
    product['name'] = response.xpath("//div[@class='product-name']/h1/text()").extract_first()
    product['ean'] = response.xpath("//table[@class='data-table']//tr[@class='even']/td[@class='data last']/text()").extract_first()
    product['price'] = response.xpath("//div[@class='product-shop']//p[@class='special-price']/span[@class='price']/text()").extract_first()
    product['image'] = response.xpath("//div[@class='item image-photo']//img[@class='owl-product-image']/@src").extract_first()
    print(product['name'])
    print(product['ean'])
    print(product['image'])
The prints of the name and the image URL work perfectly fine, since those fields aren't generated by JavaScript.
The code is alright, the settings are fine, and the Splash localhost shows me the right HTML, but my selector doesn't work when the script runs (which shows no errors), nor in Scrapy Shell.
The problem might be that Scrapy Splash instantly renders without respecting the wait time (20 secs!) passed as an argument. What did I do wrong?
Thanks in advance.
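A common workaround when the wait seems to be ignored is to move it into a Lua script on Splash's execute endpoint instead of render.html; a minimal sketch, assuming the same link variable as in the parse method above:

script = """
function main(splash, args)
    assert(splash:go(splash.args.url))
    splash:wait(5)  -- explicit wait inside Splash itself
    return splash:html()
end
"""

yield SplashRequest(link, self.parse_item, endpoint='execute',
                    args={'lua_source': script})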

It doesn't seem to me that the content of the barcode field is generated dynamically; I can see it in the page source and can extract it from scrapy shell with response.css('.data-table tbody tr:nth-child(2) td:nth-child(2)::text').extract_first().
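A quick way to confirm this is a plain scrapy shell session against the page, with no Splash involved; if the value comes back, the field is static (a sketch, output omitted):

scrapy shell 'https://www.athleteshop.nl/shimano-voor-as-108mm-37184'
>>> response.css('.data-table tbody tr:nth-child(2) td:nth-child(2)::text').extract_first()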

Related

Cucumber+Ruby+Capybara+Selenium: How to make the 'visit' method wait for dynamic content

Here is an issue that has been nagging me for weeks, and all the solutions found online do not seem to work, e.g. wait-for-ajax, etc.
Here are the versions of the gems:
capybara (2.10.1, 2.7.1)
selenium-webdriver (3.0.1, 3.0.0)
rspec (3.5.0)
running ruby 2.2.5
ruby 2.2.5p319 (2016-04-26 revision 54774) [x64-mingw32]
In env.rb:
Capybara.register_driver :selenium do |app|
  browser = (ENV['browser'] || 'firefox').to_sym
  Capybara::Driver::Selenium.new(app, :browser => browser.to_sym, :resynchronize => true)
  Capybara.default_max_wait_time = 5
end
Here is my dynamicpage.feature:
Given I visit page X
Then placeholder text appears
And the placeholder text is replaced by the content provided by the json service
and the steps file, step.rb:
When(/^I visit page X$/) do
  visit('mysite.com/productx/')
end

When(/^placeholder text appears$/) do
  expect(page).to have_css(".text-replacer-pending")
end

Then(/^the placeholder text is replaced by the content provided by the json service$/) do
  expect(page).to have_css(".text-replacer-done")
end
The webpage in question, which I cannot link here as it is not publicly accessible, contains the following on page load:
1- <span class="text-replacer-pending">Placeholder Text</span>
After a call to an external service (which provides the JSON data), the same span gets refreshed/updated to the following:
2- <span class="text-replacer-done">Correct Data</span>
The problem I have with the "visit" method in Capybara + Selenium is that as soon as it visits the page, it thinks everything has loaded and freezes the browser, never letting the service be called to dynamically update the content.
I tried the following solutions but without success:
Capybara.default_max_wait_time = 5
Capybara::Driver::Selenium.new(app, :browser => browser.to_sym, :resynchronize => true)
add sleep 5 after the visit method
wait for ajax solution from several websites, etc...
adding after hooks
etc...
I am at a complete loss as to why "visit" can't wait, or at least provide a simple solution to an issue I am sure is very common.
I am aware of the Capybara methods that wait and those that don't (such as 'visit'), but the issue is:
there is no content that goes from hidden to displayed
there is no user interaction either; just the content is getting updated
I am also unsure if this is a Capybara issue, a Selenium issue, or both.
Does anyone have insight on any solutions? I am fairly new to Ruby and Cucumber, so specifics about what code goes in which file/folder would be much appreciated.
Mel
Restore the wait_until method (add it to your spec_helpers.rb):
def wait_until(timeout = DEFAULT_WAIT_TIME)
  Timeout.timeout(timeout) do
    sleep(0.1) until value = yield
    value
  end
end
And then:
# time in seconds
wait_until(20) { has_no_css?('.text-replacer-pending') }
expect(page).to have_css(".text-replacer-done")
@maxple and @nattf0dd
Just to close the loop on our issue here...
After looking at this problem from a different angle,
we finally found out that Cucumber/Capybara is not the problem at all :-)
The issue we are having lies with the Firefox browser driver (SSL related), since we have no issues when running the same test with the Chrome driver.
I do appreciate the replies and suggestions and will keep those in mind for the future.
thanks again!

Phantomjs: take screenshot of the current page?

I'm trying to use PhantomJS to capture a screenshot from the same page that the user is on.
For example, a user is on my-page.html and has made some changes to the elements of this page; now I need to take a screenshot of an element (DIV) inside this page (my-page.html) and save it.
I found a few examples of PhantomJS and PHP which I tested and which worked on my server, storing the image on my server too, BUT all of the examples I found are for taking screenshots of external pages/URLs and not the current page.
This is a fairly straightforward process with html2canvas, but the quality of the produced image is not good at all, so I decided to use PhantomJS, which produces higher-quality screenshots and also allows me to zoom in on the page.
Here is a simple example of using PhantomJS to take a screenshot of an external URL:
var system = require("system");
if (system.args.length > 0) {
    var page = require('webpage').create();
    page.open(system.args[1], function() {
        // viewportSize being the actual size of the headless browser
        page.viewportSize = { width: 3000, height: 3000 };
        // the clipRect is the portion of the page you are taking a screenshot of
        page.clipRect = { top: 0, left: 0, width: 3000, height: 3000 };
        page.zoomFactor = 300.0 / 72.0;
        var pageTitle = system.args[1].replace(/http.*\/\//g, "").replace("www.", "").split("/")[0];
        var filePath = "img/" + pageTitle + '.png';
        page.render(filePath);
        console.log(filePath);
        phantom.exit();
    });
}
Could someone please let me know if this is possible at all?
EDIT (answer to my own question):
It turns out that you cannot take a screenshot of the current page if the page's elements have been edited live by the user; the only screenshots you can take with PhantomJS are of a bare-bones version of the page.
Reason: PhantomJS is a headless browser based on QtWebKit that runs on the server; it is not a JavaScript library like html2canvas.
As explained and experienced by others HERE:
Another use case that is an issue for a project I’m working on is that you need drag and drop. Headless drivers have some basic functionality, but if you need to be able to set precise coordinates you’re stuck with Selenium.
To take a screenshot of the current page, you must pass the correct URL to the PhantomJS script.
Syntax:
phantomjs <"Phantom code url (as in the documentation, report.js)"> <"url of the page you want to take a screenshot of"> <"result saving url">
Now, assuming you are passing the correct URL:
In my case I was unable to take a screenshot of my page because there was a Spring Security annotation on it that was not letting me proceed, so please check for any security you added to your page; if there is any, remove it and then try again.
If case 1 does not apply to you, there is surely a problem with the URL you are passing; please double-check it.
Please let me know if the problem still persists, and please post any errors you are getting.

Selenium: How to Inject/execute a Javascript in to a Page before loading/executing any other scripts of the page?

I'm using the Selenium Python WebDriver to browse some pages. I want to inject JavaScript code into a page before any other JavaScript code gets loaded and executed; in other words, I need my JS code to be executed as the first JS code of that page. Is there a way to do that with Selenium?
I googled it for a couple of hours, but I couldn't find any proper answer!
Selenium now supports the Chrome DevTools Protocol (CDP) API, so it is really easy to execute a script on every page load. Here is example code for that (a fuller sketch follows the links below):
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': 'alert("Hooray! I did it!")'})
And it will execute that script for EVERY page load. More information about this can be found at:
Selenium documentation: https://www.selenium.dev/documentation/en/support_packages/chrome_devtools/
Chrome Devtools Protocol documentation: https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-addScriptToEvaluateOnNewDocument
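For context, a minimal self-contained sketch of the above (assuming the Python bindings and a Chrome driver; the injected flag is just a placeholder):

from selenium import webdriver

driver = webdriver.Chrome()
# Register a script that runs before any of the page's own scripts,
# on every navigation handled by this session.
driver.execute_cdp_cmd(
    'Page.addScriptToEvaluateOnNewDocument',
    {'source': 'window.injectedFlag = true;'},
)
driver.get('https://example.com')
print(driver.execute_script('return window.injectedFlag'))  # True
driver.quit()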
Since version 1.0.9, selenium-wire has gained the ability to modify responses to requests. Below is an example of this functionality being used to inject a script into a page before it reaches the web browser.
import os
from seleniumwire import webdriver
from gzip import compress, decompress
from urllib.parse import urlparse
from lxml import html
from lxml.etree import ParserError
from lxml.html import builder

script_elem_to_inject = builder.SCRIPT('alert("injected")')

def inject(req, req_body, res, res_body):
    # Various checks to make sure we're only injecting the script on appropriate responses:
    # the content type must be HTML, the status code 200, and the encoding gzip.
    if res.headers.get_content_subtype() != 'html' or res.status != 200 or res.getheader('Content-Encoding') != 'gzip':
        return None
    try:
        parsed_html = html.fromstring(decompress(res_body))
    except ParserError:
        return None
    try:
        parsed_html.head.insert(0, script_elem_to_inject)
    except IndexError:  # no head element
        return None
    return compress(html.tostring(parsed_html))

drv = webdriver.Firefox(seleniumwire_options={'custom_response_handler': inject})
drv.header_overrides = {'Accept-Encoding': 'gzip'}  # ensure we only get gzip-encoded responses
Another way in general to control a browser remotely and inject a script before the page's content loads would be to use a library based on a separate protocol entirely, e.g. the Chrome DevTools Protocol. The most fully featured such library I know of is Playwright.
If you want to inject something into the HTML of a page before it gets parsed and executed by the browser, I would suggest that you use a proxy such as mitmproxy.
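To illustrate, a minimal mitmproxy addon sketch (the file name inject.py is hypothetical; run it with mitmproxy -s inject.py and point the browser at the proxy):

# inject.py - prepend a script tag to the <head> of every HTML response
from mitmproxy import http

SCRIPT_TAG = '<script>window.injectedFlag = true;</script>'

def response(flow: http.HTTPFlow) -> None:
    # Only touch HTML responses; everything else passes through unchanged.
    if 'text/html' in flow.response.headers.get('content-type', ''):
        flow.response.text = flow.response.text.replace('<head>', '<head>' + SCRIPT_TAG, 1)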
If you cannot modify the page content, you may use a proxy, or use a content script in an extension installed in your browser. Doing it within Selenium, you would write some code that injects the script as one of the children of an existing element, but you won't be able to have it run before the page is loaded (i.e. before your driver's get() call returns):
String name = (String) ((JavascriptExecutor) driver).executeScript(
    "(function () { ... })();" ...
The documentation leaves unspecified the moment at which the code would start executing. You would want it to run before the DOM starts loading, so that guarantee might only be satisfiable via the proxy or extension content-script route.
If you can instrument your page with a minimal harness, you may detect the presence of a special URL query parameter and load additional content, but you need to do so using an inline script. Pseudocode:
<html>
<head>
    <script type="text/javascript">
        (function () {
            if (location && location.href && location.href.indexOf("SELENIUM_TEST") >= 0) {
                var injectScript = document.createElement("script");
                injectScript.setAttribute("type", "text/javascript");
                // another option is to perform a synchronous XHR and inject via innerText
                injectScript.setAttribute("src", URL_OF_EXTRA_SCRIPT);
                document.documentElement.appendChild(injectScript);
                // optional; cleaner to remove, as it has already been loaded at this point
                document.documentElement.removeChild(injectScript);
            }
        })();
    </script>
    ...
So, I know it's been a few years, but I've found a way to do this without modifying the webpage's content and without using a proxy! I'm using the Node.js version, but presumably the API is consistent for other languages as well. What you want to do is as follows:
const {Builder, By, Key, until, Capabilities} = require('selenium-webdriver');

const capabilities = new Capabilities();
capabilities.setPageLoadStrategy('eager'); // Options are 'eager', 'none', 'normal'

let driver = await new Builder().forBrowser('firefox').setFirefoxOptions(capabilities).build();
await driver.get('http://example.com');
driver.executeScript(`
    console.log('hello');
`);
That 'eager' option works for me. You may need to use the 'none' option.
Documentation: https://seleniumhq.github.io/selenium/docs/api/javascript/module/selenium-webdriver/lib/capabilities_exports_PageLoadStrategy.html
EDIT: Note that the 'eager' option has not been implemented in Chrome yet...
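Since the original question is about the Python bindings, here is a rough Python equivalent (a sketch assuming Selenium 4, where Options exposes page_load_strategy):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.page_load_strategy = 'eager'  # or 'none' / 'normal'

driver = webdriver.Firefox(options=options)
# With 'eager', get() returns at DOMContentLoaded, before subresources finish loading.
driver.get('http://example.com')
driver.execute_script("console.log('hello');")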

Recursively iterate over multiple web pages and scrape using selenium

This is a follow-up question to a query I had about scraping web pages.
My earlier question: Pin down exact content location in html for web scraping urllib2 Beautiful Soup
This question is about doing the same, but recursively over multiple pages/views.
Here is my code:
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating
From the URL you'll see that nothing changes when we navigate to the second page, otherwise this wouldn't have been an issue. In this case, the next-page control calls JavaScript from the server. Is there a way we can still scrape this using Selenium in Python, with just a slight modification of the code presented? Please let me know if there is.
Thanks.
Just click Next after reading each page:
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

while True:
    for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
        title = review.find_element_by_class_name('BVRRReviewTitle').text
        rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
        print title, rating
    try:
        driver.find_element_by_link_text('Next').click()
    except:
        break

driver.quit()
Or if you want to limit the number of pages that you are reading:
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

maxNumOfPages = 10  # for example
for pageId in range(2, maxNumOfPages + 2):
    for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
        title = review.find_element_by_class_name('BVRRReviewTitle').text
        rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
        print title, rating
    try:
        driver.find_element_by_link_text(str(pageId)).click()
    except:
        break

driver.quit()
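One caveat for both loops above: since the next page is loaded by JavaScript, the for loop may re-read the old reviews before the new ones have rendered. A minimal sketch of an explicit wait to add after the click, assuming the old review elements go stale when the page updates:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

first_review = driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main')[0]
driver.find_element_by_link_text('Next').click()
# Block (for up to 10 seconds) until the old element is detached from the DOM.
WebDriverWait(driver, 10).until(EC.staleness_of(first_review))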
I think this would work. Although the Python might be a little off, this should give you a starting point:
keep_going = True  # 'continue' is a reserved word in Python, so it can't be used as a variable name
while keep_going:
    try:
        for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
            title = review.find_element_by_class_name('BVRRReviewTitle').text
            rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
            print title, rating
        driver.find_element_by_name('BV_TrackingTag_Review_Display_NextPage').click()
    except:
        print "Done!"
        keep_going = False

How to completely read a site that contains javascript from an android service?

I'm trying to read a node from a website that contains JavaScript.
In VB.NET I just use the following code:
Dim listSpan As IHTMLElementCollection = bodyel.getElementsByTagName("span")
For Each spanItem As IHTMLElement In listSpan
    If spanItem.className & "" = "span_name" Then
        If Not spanItem.innerText Is Nothing Then
            str_result = spanItem.innerText.ToString
            Console.WriteLine("Found it: " & str_result)
        Else
            str_result = "NO"
            Console.WriteLine("Not Found")
            Console.Beep(500, 500)
        End If
    End If
Next
But I just can't find a way to convert this code to work in an Android service (Java).
I tried Jsoup, but Jsoup only reads the "view source" elements, not the HTML produced by the JavaScript.
try {
    Document doc = Jsoup.connect(str_link).get();
    Elements links = doc.select(".span_name");  // select elements by class, as in the VB code
    for (Element link : links) {
        String result = link.text();
        Log.d("TMA Service", "result: " + result);
        list.add(result);
    }
} catch (IOException e) {
    Log.e("TMA Service", "failed to load page", e);
}
I mean, this code in VB can find everything (just like if I right-click an element in Google Chrome and select "Inspect Element"). That shows everything, and I'd like to know how to get the same data on Android.
Can someone give me an example?
Thanks.
Unfortunately you can't handle JavaScript and dynamic content with Jsoup. Please see my answer here for more information and some examples of Java libraries that may help you.
Edit:
HtmlUnit - Getting started (section Getting started)
HtmlUnit: A Simple Example: Check Yahoo Email
How to use HtmlUnit in Java?
HtmlUnit: A Quick Introduction
HtmlUnit – A quick introduction
Getting started with HtmlUnit
