How to properly use XPath to scrape AJAX data with Scrapy? - javascript

I am scraping this website; most of the data I need is rendered with AJAX.
At first I tried to scrape it with Ruby (as it is the language I know best), but it did not work out.
Then I was advised to do it with Python and Scrapy, which I tried, but I do not understand why I can't get the data.
import scrapy

class TaricSpider(scrapy.Spider):
    name = 'taric'
    allowed_domains = ['ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&Taric=01042090&SimDate=20190912']
    start_urls = ['http://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&Taric=01042090&SimDate=20190912/']

    def parse(self, response):
        code = response.css(".td_searhed_criteria::text").extract()
        tarifs = response.xpath("//div[contains(@class, 'measures_detail')]").extract_first()
        print(code)
        print(tarifs)
And when I run this in my terminal, I get the expected result for code, but for tarifs I get "None".
Do you have any idea of what is wrong in my code? I have tried different ways to scrape but none has worked.
Maybe the XPath is not correct? Or maybe my Python syntax is bad; I have only been using Python since I started trying to scrape this webpage.

The reason your XPath does not work is that this data is added by AJAX requests. If you open the dev console in your browser and go to Network -> XHR, you will see the AJAX requests. From there, there are two possible solutions:
1. Make this request manually in your script
2. Use a JS renderer like Splash
In this case, using Splash will be easiest, because the AJAX responses are JS files and not all of the data is present in them; a sketch is shown below.
Also, I would recommend looking at Aquarium, a tool that bundles Splash, HAProxy, and docker-compose.
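If you go the Splash route, a minimal sketch using the scrapy-splash plugin could look like the following. The selectors are the ones from the question; the Splash address and settings are assumptions (they presume a Splash instance running locally on port 8050, e.g. via Docker):
import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash

class TaricSpider(scrapy.Spider):
    name = 'taric'

    # Minimal scrapy-splash wiring; assumes Splash is listening on localhost:8050
    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100},
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        url = ('https://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp'
               '?Lang=en&Taric=01042090&SimDate=20190912')
        # 'wait' gives the page time to finish its AJAX calls before Splash renders it
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        code = response.css(".td_searhed_criteria::text").extract()
        tarifs = response.xpath("//div[contains(@class, 'measures_detail')]").extract_first()
        print(code)
        print(tarifs)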

Related

python javascript scrape automatically

Python novice here.
I am trying to scrape company information from the Dutch Transparency Benchmark website for a number of different companies, but I'm at a loss as to how to make it work. I've tried
pd.read_html("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and
requests.get("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and then working from there. However, it seems like the data is dynamically generated/queried, and thus not actually contained in the HTML source code these methods retrieve.
If I go to my browser's developer tools and copy the "final" HTML as shown there in the "Elements" tab, all the information is in there. But as I'd like to repeat the process for several of the companies, is there any way to automate it?
Alternatively, if there's no direct way to obtain the info from the HTML, there might be a second possibility. The site allows you to download the information as an Excel file for each individual company. Is it possible to somehow automatically "click" the download button and save the file somewhere? Then I might be able to loop over all the companies I need.
Please excuse me if this question is poorly worded, and thank you very much in advance.
Tusen takk! (Many thanks!)
Edit: I have also tried it using BeautifulSoup, as @pmkroeker suggested, but I'm not really sure how to make it work so that it first runs all the JavaScript and the site actually contains the data.
I think you will either want to use a library to render the page, or look for the AJAX calls directly (more on that below). This answer seems to apply to Python. I will also copy the code from that answer for completeness.
You can pip install selenium from a command line, and then run something like:
from selenium import webdriver
from urllib.request import urlopen  # urllib2 in Python 2

url = 'http://www.google.com'
file_name = 'C:/Users/Desktop/test.txt'

# Download the raw HTML and save it to a local file
conn = urlopen(url)
data = conn.read()
conn.close()
with open(file_name, 'wb') as f:
    f.write(data)

# Let a real browser render the saved page (and its JavaScript), then grab the result
browser = webdriver.Firefox()
browser.get('file:///' + file_name)
html = browser.page_source
browser.quit()
I think you could probably skip the file write and just pass the URL straight to that browser.get call, but I'll leave that to you to find out.
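For example, dropping the intermediate file and loading the live page directly would look roughly like this (a sketch, using the URL from the question; heavily scripted pages may still need an explicit wait before reading page_source):
from selenium import webdriver

browser = webdriver.Firefox()
# Load the live page and let the browser execute its JavaScript
browser.get('https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793')
html = browser.page_source
browser.quit()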
The other thing you can do is look for the AJAX calls in your browser's developer tools, i.e. when using Chrome the 3 dots -> More tools -> Developer tools, or press something like F12. Then look at the Network tab. There will be various requests. You will want to click one, click the Preview tab, and then go through each until you find a response that looks like JSON data. You are effectively looking for the API calls they used to get the data to generate things. Once you find one, click the Headers tab and you will see a Request URL.
e.g. this https://sa-tb.nl/api/widget/chart/survey/4/sector/38 has lots of data.
The problem here is that it may or may not be repeatable (the API may change, IDs may change). You may have a similar problem with plain HTML scraping, as the HTML could change just as easily.
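As a rough sketch, once you have found such a Request URL you can call it directly with requests and work with the JSON. The exact shape of the response is an assumption here, so inspect it before relying on particular keys:
import requests

# API endpoint spotted in the Network tab (see above); the IDs in it may change over time
url = 'https://sa-tb.nl/api/widget/chart/survey/4/sector/38'
data = requests.get(url).json()

# Print the top-level structure to see what the endpoint actually returns
print(list(data) if isinstance(data, dict) else data[:3])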

Scraping elements generated by JavaScript queries using Python

I am trying to access the text in an element whose content is generated by JavaScript, for example getting the number of Twitter shares from this site.
I've tried using urllib and PyQt to obtain the HTML of the page; however, since the content requires JavaScript to be generated, it is not present in the response from urllib/PyQt. I am currently using Selenium for this task, but it is taking longer than I would like.
Is it possible to get access to this data without opening the page in a browser?
This question has already been asked in the past, but the results I found are either C#-specific or provide a link to a solution that has since gone dead.
Working example:
import urllib
import requests
import json
url = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"
encoded = urllib.parse.quote_plus(url)
# encoded = urllib.quote_plus(url) # for python 2 replace previous line by this
j = requests.get('https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded).text
obj = json.loads(j)
print(obj['clicks']['twitter'] + obj['shares']['twitter'])
# => 5008
Explanation:
Inspecting the webpage, you can see that it makes a request to this:
https://count-server.sharethis.com/v2.0/get_counts?url=https%3A%2F%2Fdaphnecaruanagalizia.com%2F2017%2F10%2Fcrook-schembri-court-today-pleading-not-crook%2F&cb=stButtons.processCB&wd=true
If you paste it in your browser, you'll have all your answers. Then, playing a bit with the URL, you can see that removing the extra parameters gives you a nice JSON response.
So as you can see, you just have to replace the url parameter of the request with the URL of the page you want the Twitter counts for.
You could do something similar to the following after launching a Selenium web browser, then passing driver.page_source to the BeautifulSoup library (unfortunately I cannot test this at work with firewalls in place):
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')
shares = soup.find('span', {'class': 'st_twitter_hcount'}).find('span', {'class': 'stBubble_hcount'})

Query selector all in rvest package

I am trying to execute this JavaScript command:
document.querySelectorAll('div.Dashboard-section div.pure-u-1-1 span.ng-scope')[0].innerText
in R using the rvest package, with the following code:
library(rvest)
url <- read_html("")
url %>%
  html_nodes("div.Dashboard-section div.pure-u-1-1 span.ng-scope") %>%
  html_text()
but I get this as a result:
character(0)
whereas I was expecting this:
"Displaying results 1-25 of 10,897"
What can I do?
In a nutshell, the rvest package can fetch HTML, but it cannot execute JavaScript. The page you tried to fetch loads its data via AJAX/JavaScript.
For a workaround, you could use the RSelenium package, as user neoFox suggested. Selenium WebDriver would start Firefox or Chrome for you, navigate to the page, wait until it is loaded, and get the data fragment from the HTML DOM.
Or use the much smaller PhantomJS headless browser, which would download the HTML page to an HTML file without popping up a browser GUI. Then read in and parse the downloaded HTML file with R.
Both need some serious configuration. Selenium is Java-based; PhantomJS requires at least reading its documentation.
You could also inspect the page, find out the POST request the site is making, and send that POST yourself. Then fetch the JSON it returns and count the result items yourself.

Convert Django template to PDF

My application needs to mail out reports to clients, and hence I need an effective method to convert a dynamic template into a PDF report (including images generated via Chart.js). I have tried pdfkit, but it needs a URL (on which it most likely performs a GET, while the template only generates the report after a few AJAX calls, so the GET is just going to return the plain vanilla page with some filters) and it DOESN'T include images (which I am guessing I can solve by converting the chart image into a PNG using toDataURL and saving it on the server).
The only option I see here is to save all the dynamically generated data along with the HTML tags, recreate the file on the server, and then convert it to PDF. I am sure there's a better solution. Apologies if this appears basic, but I am not a programmer by profession.
Django has a few options for outputting PDFs, the most flexible of which is ReportLab.
However, to just render a Django template to PDF while passing context data, WeasyPrint/xhtml2pdf are dead simple. Below is an example view using the xhtml2pdf library; it's a standard Django view.
To be clear, all of these libraries take a Django template, render it, and return a PDF. There are limitations (Pisa, for example, has only a handful of CSS parameters it can render). Regardless, take a look at these three; at least one will do exactly what you need.
from django.http import HttpResponse
from django_xhtml2pdf.utils import generate_pdf

def myview(request):
    resp = HttpResponse(content_type='application/pdf')
    dynamic_variable = request.user.some_special_something
    context = {'some_context_variable': dynamic_variable}
    # Render my_template.html with the context and write the PDF into the response
    result = generate_pdf('my_template.html', file_object=resp, context=context)
    return result
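If you would rather use WeasyPrint (the other library mentioned above), a roughly equivalent view could look like the sketch below; the template name and context mirror the example above and are placeholders:
from django.http import HttpResponse
from django.template.loader import render_to_string
from weasyprint import HTML  # pip install weasyprint

def my_weasyprint_view(request):
    # Render the Django template to an HTML string, passing context data as usual
    html_string = render_to_string('my_template.html',
                                   {'some_context_variable': request.user.some_special_something})
    # WeasyPrint converts the rendered HTML (and most of its CSS) into PDF bytes
    pdf_bytes = HTML(string=html_string).write_pdf()
    response = HttpResponse(pdf_bytes, content_type='application/pdf')
    response['Content-Disposition'] = 'attachment; filename="report.pdf"'
    return response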
You can use a paid library, e.g. pdfcrowd, which converts a webpage into a PDF, like this:
First install:
pip install pdfcrowd
Then use the library:
import pdfcrowd
from django.http import HttpResponse

def generate_pdf_view(request):
    try:
        # create an API client instance
        client = pdfcrowd.Client("username", "apikey")

        # convert a web page and store the generated PDF in a variable
        pdf = client.convertURI("http://www.yourwebpage.com")

        # set HTTP response headers
        response = HttpResponse(content_type="application/pdf")
        response["Cache-Control"] = "max-age=0"
        response["Accept-Ranges"] = "none"
        response["Content-Disposition"] = "attachment; filename=google_com.pdf"

        # send the generated PDF
        response.write(pdf)
    except pdfcrowd.Error as why:
        response = HttpResponse(content_type="text/plain")
        response.write(str(why))
    return response
You can get the username and API key by signing up here: http://pdfcrowd.com/pricing/api/
Option A: Scraping
You could use something like PhantomJS or CasperJS to navigate and scrape the HTML page.
Option B: Generation
You could use something like PyPDF as suggested here.
Which option is better?
Scraping saves you from having to maintain two templates. With generation you get more control, because you're writing specifically for PDF, but you implicitly end up with two templates.

Get element from website with Python without opening a browser

I'm trying to write a Python script which parses one element from a website and simply prints it.
I couldn't figure out how to achieve this without Selenium's webdriver, which opens a browser that runs the scripts needed to properly display the website.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft which will print the contents, including the element of interest <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
Edit: I've previously tried to use urllib, or wget in bash, both of which lack the required JavaScript interpretation.
As other answers mentioned, this webpage requires JavaScript to render its content, so you can't simply fetch and process the page with lxml, Beautiful Soup, or a similar library. But there's a much simpler way to get the information you want.
I noticed that the link you provided fetches data from an internal API in a structured fashion. It appears that the product number is 910000800509 based on the URL. If you look at the Network tab in Chrome dev tools (or your browser's equivalent dev tools), you'll see that a GET request is being made to the following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.
You can make the request like this with just the json and requests modules:
import json
import requests

url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)
# The JSON response contains a list of items; take the price of the first one
price = r.json()['items'][0]['price']
print(price)
# => £13.00
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
How could I get the element of interest without the browser opening,
or even without a browser at all?
After inspecting the page you're trying to parse:
http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if JavaScript is enabled; based on that, you need to use a real browser.
Conclusion:
The way to go, if you need to automate this, is:
Selenium
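If the main objection is the browser window popping up, Selenium can also run headless. A minimal sketch follows; the CSS class comes from the snippet in the question, and the headless flag is the usual Firefox option (the page's markup may have changed since):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')  # run Firefox without opening a visible window
browser = webdriver.Firefox(options=options)
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
# The price span mentioned in the question
price = browser.find_element(By.CSS_SELECTOR, 'span.prod-price-inner').text
print(price)
browser.quit()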
