querySelectorAll in the rvest package - JavaScript

I am trying to execute this JavaScript command:
document.querySelectorAll('div.Dashboard-section div.pure-u-1-1 span.ng-scope')[0].innerText
in R using the rvest package, with the following code:
library(rvest)
url <- read_html("")
url %>%
  html_nodes("div.Dashboard-section div.pure-u-1-1 span.ng-scope") %>%
  html_text()
but I get this result:
character(0)
whereas I expected this:
"Displaying results 1-25 of 10,897"
What can I do?

In a nutshell, the rvest package can fetch HTML, but it cannot execute JavaScript. The page you tried to fetch loads its data via AJAX/JavaScript.
As a workaround you could use the RSelenium package, as user neoFox suggested. Selenium WebDriver would start Firefox or Chrome for you, navigate to the page, wait until it is loaded, and pull the data fragment out of the HTML DOM.
Or use the much smaller PhantomJS headless browser, which would download the HTML page to a file without popping up a browser GUI; you would then read in and parse the downloaded HTML file with R.
Both need some serious configuration. Selenium is Java-based.
PhantomJS requires at least reading its documentation.
You could also inspect the page, find out the POST request the site is making, and send that POST yourself. Then fetch the JSON it returns and count the result items yourself.
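As a language-agnostic illustration of that last approach, here is a minimal sketch in Python with requests (from R the equivalent would be httr::POST). The endpoint, payload, and JSON field names below are hypothetical placeholders, since the actual site and its API are not shown in the question:

import requests

# Hypothetical endpoint and payload: replace with the request you see in your
# browser's dev tools (Network tab) when the dashboard loads its data.
api_url = "https://example.com/api/search"
payload = {"page": 1, "pageSize": 25}

r = requests.post(api_url, json=payload)
r.raise_for_status()
data = r.json()

# Hypothetical field names: count the result items from the JSON yourself.
total = data.get("totalResults", len(data.get("results", [])))
print(f"Displaying results 1-25 of {total:,}")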

Related

How to properly use Xpath to scrape AJAX data with scrapy?

I am scraping this website; most of the data I need is rendered with AJAX.
I was at first trying to scrape it with Ruby (as it is the language I know best), but it did not work out.
Then I was advised to do it with Python and Scrapy, which I tried, but I do not understand why I can't get the data.
import scrapy

class TaricSpider(scrapy.Spider):
    name = 'taric'
    allowed_domains = ['ec.europa.eu']
    start_urls = ['http://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&Taric=01042090&SimDate=20190912/']

    def parse(self, response):
        code = response.css(".td_searhed_criteria::text").extract()
        tarifs = response.xpath("//div[contains(@class, 'measures_detail')]").extract_first()
        print(code)
        print(tarifs)
When I run this in my terminal, I get the expected result for code, but for tarifs I get "None".
Do you have any idea what is wrong in my code? I have tried different ways to scrape, but none has worked.
Maybe the XPath is not correct? Or maybe my Python syntax is bad; I have only been using Python since I started trying to scrape this webpage.
The reason your XPath does not work is that this data is added by AJAX requests. If you open the dev console in your browser and go to Network -> XHR, you will see the AJAX requests. There are then two possible solutions:
1. Make this request manually in your script
2. Use a JS renderer like Splash (see the sketch below)
In this case, using Splash will be easiest, because the responses to those AJAX calls are JS files and not all the data is present in them.
Also, I would recommend looking at Aquarium, a tool that bundles Splash, HAProxy, and docker-compose.
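As a rough illustration of the Splash route, here is a minimal sketch using the scrapy-splash plugin. It assumes a Splash instance is running on localhost:8050 and that the scrapy-splash middlewares are enabled in settings.py; the URL and selectors are the ones from the question:

import scrapy
from scrapy_splash import SplashRequest

class TaricSplashSpider(scrapy.Spider):
    name = 'taric_splash'

    def start_requests(self):
        url = 'http://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&Taric=01042090&SimDate=20190912'
        # Let Splash render the page so the AJAX content has time to load
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # Same selectors as in the question, now run against the rendered HTML
        code = response.css(".td_searhed_criteria::text").extract()
        tarifs = response.xpath("//div[contains(@class, 'measures_detail')]").extract_first()
        print(code)
        print(tarifs)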

Can I wget the result from the javascript generated webpage

The URL https://live.eservice-hk.net/viutv
returns a text result (one text line) and shows it in the browser.
I want to get that result via wget, but I can't.
Then I looked at the page source and discovered that the page is generated by JavaScript.
How do I get the result instead of the JavaScript?
No! You cannot wget (or even curl) the dynamically generated JavaScript result from the page. You need a webdriver like Selenium for that, or you could use Chrome in headless mode.
But for that particular page (and more specifically for that particular text result), you can use curl to get the text link:
curl -X POST -d '{"channelno":"099","deviceId":"0000anonymous_user","format":"HLS"}' https://api.viu.now.com/p8/1/getLiveURL | jq '.asset.hls.adaptive[0]'
Note: the POST data and link are taken from the page's source. jq is a nice little command-line utility for handling JSON data on the command line.
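If you would rather do this from a script than shell out to curl and jq, a rough Python equivalent with requests might look like this (same endpoint and POST body as the curl command above; the indexing mirrors the jq filter):

import requests

url = "https://api.viu.now.com/p8/1/getLiveURL"
# Same raw POST body as the curl example above; if the API turns out to expect
# a JSON content type, pass json={...} instead of data=...
body = '{"channelno":"099","deviceId":"0000anonymous_user","format":"HLS"}'

r = requests.post(url, data=body)
r.raise_for_status()

# Equivalent of jq '.asset.hls.adaptive[0]'
print(r.json()["asset"]["hls"]["adaptive"][0])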

Get Html page after jQuery and Javascript execution with Python

I've imported the content of a webpage into a variable in Python, but I'm not getting the final structure (the one that's modified by AJAX and jQuery in general).
How can I solve this?
I would like to get the HTML as it appears when I save the page from the browser.
This is my code:
import urllib.request

urlAddress = "http:// ... /"
getPage = urllib.request.urlopen(urlAddress)
outputPage = str(getPage.read())
print(outputPage)
You can't, by just getting the page source from the server. You need to do one of the following:
Use a headless browser or similar solution (Selenium, Splash, PhantomJS, ...) to run the JS code in the page itself and see the results (see the sketch below).
Figure out what the JS code actually does and recreate the same in Python. If it's making another call to the server, you can see that in the XHR tab in Developer Tools in Chrome.
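As an illustration of the first option, here is a minimal sketch using Selenium with headless Chrome. It assumes the selenium package and a matching chromedriver are installed; the URL placeholder is the one from the question:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)

driver.get("http:// ... /")  # replace with the real URL from urlAddress
# page_source holds the DOM after the page's JS and jQuery have run
outputPage = driver.page_source
print(outputPage)
driver.quit()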

Get element from website with python without opening a browser

I'm trying to write a Python script which parses one element from a website and simply prints it.
I couldn't figure out how to achieve this without Selenium's webdriver, which opens a browser that runs the scripts needed to properly display the website.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft which will print the contents, including the element of interest, <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
Edit: I've previously tried urllib, and wget in bash, both of which lack the required JavaScript interpretation.
As other answers mentioned, this webpage requires JavaScript to render its content, so you can't simply get and process the page with lxml, Beautiful Soup, or a similar library. But there's a much simpler way to get the information you want.
I noticed that the link you provided fetches data from an internal API in a structured fashion. It appears that the product number is 910000800509, based on the URL. If you look at the Network tab in Chrome dev tools (or your browser's equivalent), you'll see that a GET request is being made to the following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.
You can make the request with just the requests module:
import requests

url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)
price = r.json()['items'][0]['price']
print(price)  # £13.00
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
How could I get the element of interest without the browser opening,
or even without a browser at all?
After inspecting the page you're trying to parse:
http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if JavaScript is enabled; based on that, you need to use a real browser.
Conclusion:
The way to go, if you need to automate this, is:
selenium
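If the main concern is the browser window popping up, a small variation of the script in the question runs Firefox in headless mode and pulls out just the prod-price-inner element. This is a sketch; it assumes a Selenium version with FirefoxOptions support and geckodriver on the PATH:

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.FirefoxOptions()
options.add_argument("--headless")  # no visible browser window
browser = webdriver.Firefox(options=options)

browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
price = browser.find_element(By.CSS_SELECTOR, "span.prod-price-inner").text
print(price)
browser.quit()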

Scrape JavaScript download links from ASP website

I am trying to download all the files from this website for backup and mirroring; however, I don't know how to go about parsing the JavaScript links correctly.
I need to organize all the downloads in the same way, in named folders. For example, for the first one I would have a folder named "DAP-1150", inside that a folder named "DAP-1150 A1 FW v1.10" with the file "DAP1150A1_FW110b04_FOSS.zip" in it, and so on for each file. I tried using BeautifulSoup in Python, but it didn't seem to be able to handle ASP links properly.
When you struggle with JavaScript links you can give Selenium a try: http://selenium-python.readthedocs.org/en/latest/getting-started.html
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get("http://www.python.org")
time.sleep(3) # Give your Selenium some time to load the page
link_elements = driver.find_elements_by_tag_name('a')
links = [link.get_attribute('href') for link in link_elements]
You can use the links and pass them to urllib2 to download them accordingly.
If you need more than a script, I can recommend you a combination of Scrapy and Selenium:
selenium with scrapy for dynamic page
Here's what the page is doing. I just used the standard Network inspector in Firefox to snapshot the POST operation. Bear in mind, as with my other answer that I pointed you to, this is not a particularly well-written website - JS/POST should not have been used at all.
First of all, here's the JS - it's very simple:
function oMd(pModel_, sModel_) {
    obj = document.form1;
    obj.ModelCategory_.value = pModel_;
    obj.ModelSno_.value = sModel_;
    obj.Model_Sno.value = '';
    obj.ModelVer.value = '';
    obj.action = 'downloads2008detail.asp';
    obj.submit();
}
That writes to these fields:
<input type=hidden name=ModelCategory_ value=''>
<input type=hidden name=ModelSno_ value=''>
So, you just need a POST form, targeting this URL:
http://tsd.dlink.com.tw/downloads2008detail.asp
And here's an example set of data from FF's network analyser. There are only two items you need to change - grabbed from the JS link - and you can get those with an ordinary scrape:
Enter=OK
ModelCategory=0
ModelSno=0
ModelCategory_=DAP
ModelSno_=1150
Model_Sno=
ModelVer=
sel_PageNo=1
OS=GPL
You'll probably find by experimentation that not all of them are necessary. I did try using GET for this, in the browser, but it looks like the target page insists upon POST.
Don't forget to leave a decent amount of time inside your scraper between clicks and submits, as each one represents a hit on the remote server; I suggest 5 seconds, emulating a human delay. If you do this too quickly - all too possible if you are on a good connection - the remote side may assume you are DoSing them, and might block your IP. Remember the motto of scraping: be a good robot!
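For illustration, here is a minimal sketch of that POST with Python's requests, using the form fields captured above. The DAP/1150 values are the example from the question; in a real mirroring script you would scrape them from each oMd(...) link and pause between requests as advised:

import time
import requests

url = "http://tsd.dlink.com.tw/downloads2008detail.asp"
form_data = {
    "Enter": "OK",
    "ModelCategory": "0",
    "ModelSno": "0",
    "ModelCategory_": "DAP",   # scraped from the oMd(...) call in the page
    "ModelSno_": "1150",       # scraped from the oMd(...) call in the page
    "Model_Sno": "",
    "ModelVer": "",
    "sel_PageNo": "1",
    "OS": "GPL",
}

response = requests.post(url, data=form_data)
print(response.status_code, len(response.text))

time.sleep(5)  # be a good robot: pause between hits on the remote server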
