Scraping elements generated by JavaScript using Python

I am trying to access the text in an element whose content is generated by JavaScript, for example the number of Twitter shares on this site.
I've tried using urllib and PyQt to obtain the HTML of the page; however, since the content requires JavaScript to be generated, it is not present in the response from urllib/PyQt. I am currently using Selenium for this task, but it is taking longer than I would like.
Is it possible to get access to this data without opening the page in a browser?
This question has been asked in the past, but the answers I found are either C#-specific or link to a solution that has since gone dead.

Working example:
import urllib.parse
import json
import requests
url = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"
encoded = urllib.parse.quote_plus(url)
# encoded = urllib.quote_plus(url)  # for Python 2, replace the previous line with this
# Ask the ShareThis count server for this page's share counts
j = requests.get('https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded).text
obj = json.loads(j)
print(obj['clicks']['twitter'] + obj['shares']['twitter'])
# => 5008
Explanation:
Inspecting the webpage, you can see that it makes a request to this URL:
https://count-server.sharethis.com/v2.0/get_counts?url=https%3A%2F%2Fdaphnecaruanagalizia.com%2F2017%2F10%2Fcrook-schembri-court-today-pleading-not-crook%2F&cb=stButtons.processCB&wd=true
If you paste it into your browser, you'll have all your answers. Playing a bit with the URL, you can see that removing the extra parameters gives you clean JSON.
So, as you can see, you just have to replace the url parameter of the request with the URL of the page whose Twitter counts you want.
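For instance, the working example above can be wrapped in a small helper so that any page URL can be queried the same way (a minimal sketch, assuming the endpoint still returns the same JSON shape as above):
import urllib.parse
import json
import requests

def twitter_count(page_url):
    # Build the ShareThis counts URL for the page in question
    encoded = urllib.parse.quote_plus(page_url)
    counts_url = 'https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded
    obj = json.loads(requests.get(counts_url).text)
    # Clicks and shares are reported separately, so add them up
    return obj['clicks']['twitter'] + obj['shares']['twitter']

print(twitter_count("https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"))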

You could do something similar to the following after launching a Selenium web browser, passing driver.page_source to the BeautifulSoup library (unfortunately I cannot test this at work with firewalls in place):
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')
shares = soup.find('span', {'class': 'st_twitter_hcount'}).find('span', {'class': 'stBubble_hcount'})
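Assuming those class names are present in the rendered page (the markup is taken on trust here, since I can't test it), the count should then be in the tag's text:
print(shares.get_text(strip=True))  # e.g. 5008, if the element was found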

Related

How to properly use XPath to scrape AJAX data with Scrapy?

I am scraping this website; most of the data I need is rendered with AJAX.
I first tried to scrape it with Ruby (the language I know best), but it did not work out.
I was then advised to do it with Python and Scrapy, which I tried, but I do not understand why I can't get the data.
import scrapy

class TaricSpider(scrapy.Spider):
    name = 'taric'
    allowed_domains = ['ec.europa.eu']
    start_urls = ['http://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&Taric=01042090&SimDate=20190912']

    def parse(self, response):
        code = response.css(".td_searhed_criteria::text").extract()
        tarifs = response.xpath("//div[contains(@class, 'measures_detail')]").extract_first()
        print(code)
        print(tarifs)
When I run this in my terminal, I get the expected result for code, but for tarifs I get "None".
Do you have any idea what is wrong in my code? I have tried different ways to scrape, but none has worked.
Maybe the XPath is not correct? Or maybe my Python syntax is bad; I have only been using Python since I started trying to scrape this webpage.
The reason your XPath does not work is that this data is added by AJAX requests. If you open the dev console in your browser and go to Network -> XHR, you will see the AJAX requests. There are then two possible solutions:
1. Make the requests manually in your script
2. Use a JS renderer like Splash
In this case, using Splash will be easiest, because the AJAX responses are JS files and not all of the data is present in them; a sketch of the Splash route follows below.
Also, I would recommend looking at Aquarium, a tool that bundles Splash, HAProxy, and docker-compose.
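For completeness, here is a rough sketch of the Splash route, assuming a Splash instance running on localhost:8050 and the scrapy-splash package installed and wired into the project settings (SPLASH_URL plus the middleware from the scrapy-splash README):
import scrapy
from scrapy_splash import SplashRequest

class TaricSplashSpider(scrapy.Spider):
    name = 'taric_splash'

    def start_requests(self):
        url = ('https://ec.europa.eu/taxation_customs/dds2/taric/'
               'measures.jsp?Lang=en&Taric=01042090&SimDate=20190912')
        # Let Splash render the page so the AJAX content is in the response
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # The measures_detail div should now exist in the rendered HTML
        tarifs = response.xpath("//div[contains(@class, 'measures_detail')]").extract_first()
        self.logger.info(tarifs)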

How to scrape websites that use Django

I wanted to create a robot to scrape a website at this address:
https://1xxpers100.mobi/en/line/
The problem is that when I tried to get data from this website, I realized it is using Django, because it uses phrases like {{if group_name}} and others. A loop created with this kind of method generates the table rows, and the information I want is in there.
When I work with Python and download the HTML code, I can't find any content but "{{code}}" in there; however, in the Chrome developer tools (inspect) console I can see the content inside the table that I want.
How can I get the HTML code that holds the content of that table, like the Chrome tools show, so I can extract the information I want from this website?
My way of getting the code is with Python:
import urllib.request
# Download the raw HTML; note that no JavaScript is executed here
fp = urllib.request.urlopen("https://1xxpers100.mobi/en/line/")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
This should work for what you want:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://1xxpers100.mobi/en/line/')
soup = BeautifulSoup(r.content, 'lxml')
print(soup.encode("utf-8"))
Here 'lxml' is the parser I use because it worked for the site I tested it on. If you have trouble with it, just try another parser.
Another problem is that there is a character that isn't recognized by default, so read the contents of soup using utf-8.
Extra Info
This has nothing to do with Django. HTML has what is described as a "tree"-like structure, where each set of tags is the parent of all child tags immediately inside it. You just weren't reading deep enough into the tree.
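To illustrate the tree point, find() searches the whole subtree rather than just the direct children, so nested content is reachable in one call (toy markup, just for illustration):
from bs4 import BeautifulSoup

html = '<table><tr><td><span class="odd">1.57</span></td></tr></table>'
soup = BeautifulSoup(html, 'lxml')
# find() descends through tr and td to reach the span
print(soup.find('span', {'class': 'odd'}).text)  # => 1.57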

Simple login function for XBMC (Python) issue

I'm trying to scrape sections of a JavaScript calendar page through Python (XBMC/Kodi).
So far I've been able to scrape static HTML variables, but not the JavaScript-generated sections.
The variables I'm trying to retrieve are <strong class="item-title">**this**</strong>, <span class="item-daterange">**this**</span>, and <div class="item-location">**this**</div>; note that they are in separate sections of the HTML source and rendered through JavaScript. All of the scraped variables should be appended into one string and displayed.
response = net.http_GET('my URL')
link = response.content
match = re.compile('<strong class="gcf-item-title">(.+?)</strong>').findall(link)
for name in match:
    print name
With the regex above I can scrape just one of those variables; since I need all the variables displayed together in one string, how can that be done?
I get that the page has to be pre-rendered for the JavaScript variables to be scraped, but since I'm using XBMC I am not sure how to import additional Python libraries such as dryscrape to get this done. Downloading dryscrape gives me a setup.py and an __init__.py file along with some others, but how can I use them all together?
Thanks.
Is your question about the steps to scrape the JavaScript, how to use Python on XBMC/Kodi, or how to install packages that come with a setup.py file?
Just based on your regex above, if your entries are always like <strong class="item-title">**this**</strong> you won't get a match, since your re pattern is for elements with class="gcf-item-title".
Are you using, or able to use, BeautifulSoup? If you're not using it but can, you should; it's life-changing in terms of scraping websites.
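If you can get BeautifulSoup onto XBMC, a sketch of pulling all three classes and joining them into a single string could look like this (class names are taken from your question, and this still assumes link holds the rendered HTML, per the pre-rendering caveat above):
from bs4 import BeautifulSoup

soup = BeautifulSoup(link, 'html.parser')
parts = []
for cls in ('item-title', 'item-daterange', 'item-location'):
    el = soup.find(class_=cls)
    if el is not None:
        parts.append(el.get_text(strip=True))
print(' - '.join(parts))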

Convert Django template to PDF

My application needs to mail reports out to clients, so I need an effective method to convert a dynamic template into a PDF report (including images generated via Chart.js). I have tried pdfkit, but it needs a URL (on which it most likely performs a GET; since the template only generates a report after a few AJAX calls, the GET is going to return the plain vanilla page with some filters), and it DOESN'T include images (which I am guessing I can solve by converting the chart image into a PNG using toDataURL and saving it on the server).
The only option I see here is to save all the dynamically generated data along with the HTML tags, recreate the file on the server, and then convert it to PDF. I am sure there's a better solution. Apologies if this seems basic, but I am not a programmer by profession.
Django has a few options for outputting PDFs, the most flexible of which is ReportLab.
However, to just render a Django template to PDF while passing context data, WeasyPrint/xhtml2pdf are dead simple. Below is an example view using the xhtml2pdf library; it's a standard Django view.
To be clear, all of these libraries take a Django template, render it, and return a PDF. There are limitations (Pisa, for example, can only render a handful of CSS parameters). Regardless, take a look at these three; at least one will do exactly what you need.
from django.http import HttpResponse
from django_xhtml2pdf.utils import generate_pdf

def myview(request):
    resp = HttpResponse(content_type='application/pdf')
    dynamic_variable = request.user.some_special_something
    context = {'some_context_variable': dynamic_variable}
    # Render my_template.html with the context and write the PDF into resp
    result = generate_pdf('my_template.html', file_object=resp, context=context)
    return result
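Hooking the view into your URLconf is then the usual one-liner (the pattern and name here are illustrative):
# urls.py (illustrative)
from django.conf.urls import url
from . import views

urlpatterns = [
    url(r'^report/pdf/$', views.myview, name='report_pdf'),
]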
You can use a paid library, e.g. pdfcrowd, which converts a webpage into a PDF.
First install:
pip install pdfcrowd
Then use the library:
import pdfcrowd
from django.http import HttpResponse

def generate_pdf_view(request):
    try:
        # create an API client instance
        client = pdfcrowd.Client("username", "apikey")
        # convert a web page and store the generated PDF to a variable
        pdf = client.convertURI("http://www.yourwebpage.com")
        # set HTTP response headers
        response = HttpResponse(mimetype="application/pdf")
        response["Cache-Control"] = "max-age=0"
        response["Accept-Ranges"] = "none"
        response["Content-Disposition"] = "attachment; filename=google_com.pdf"
        # send the generated PDF
        response.write(pdf)
    except pdfcrowd.Error, why:
        response = HttpResponse(mimetype="text/plain")
        response.write(why)
    return response
You can get the username and API key by signing up here: http://pdfcrowd.com/pricing/api/
Option A: Scraping
You could use something like PhantomJS or CasperJS to navigate and scrape the HTML page.
Option B: Generation
You could use something like PyPDF as suggested here.
Which option is better?
Scraping saves you from having to maintain two templates. With generation you get more control, since you're writing specifically for PDF, but you implicitly have two templates.

Get element from website with python without opening a browser

I'm trying to write a Python script which parses one element from a website and simply prints it.
I couldn't figure out how to achieve this without Selenium's webdriver, which opens a browser that handles the scripts needed to properly display the website.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft which prints the contents, including the element of interest, <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
Edit: I've previously tried urllib, and wget in bash, both of which lack the required JavaScript interpretation.
As other answers mentioned, this webpage requires JavaScript to render its content, so you can't simply fetch and process the page with lxml, Beautiful Soup, or a similar library. But there's a much simpler way to get the information you want.
I noticed that the link you provided fetches data from an internal API in a structured fashion. Based on the URL, the product number appears to be 910000800509. If you look at the Network tab in the Chrome dev tools (or your browser's equivalent), you'll see a GET request being made to the following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.
You can make the request with just the requests module:
import requests
url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)
# The response is JSON; pull the price of the first (and only) item
price = r.json()['items'][0]['price']
print(price)
# => £13.00
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
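For example, any other field in the payload is one key lookup away (the 'name' key below is an assumption about the response shape, so check the actual JSON first):
item = r.json()['items'][0]
# List the available keys, then pull whichever detail you need
print(item.keys())
print(item.get('name'))  # assumed field; verify against the real response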
How could I get the element of interest without the browser opening, or even without a browser at all?
After inspecting the page you're trying to parse:
http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if JavaScript is enabled; based on that, you need to use a real browser.
Conclusion:
The way to go, if you need to automate, is:
selenium
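That said, you can at least stop the browser window from appearing by running it headless (a sketch assuming Selenium 3.8+ with geckodriver and a recent Firefox):
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # render the page without a visible window
browser = webdriver.Firefox(options=options)
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
# JavaScript has run by now, so the price element is in the DOM
print(browser.find_element_by_class_name('prod-price-inner').text)
browser.quit()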
