My application needs to mail out reports to clients, so I needed an effective method to convert a dynamic template into a PDF report (including images generated via Chart.js). I have tried pdfkit, but it needs a URL (on which it most likely performs a GET; since the template only generates the report after a few AJAX calls, the GET will just return the plain vanilla page with some filters), and it DOESN'T include images (which I am guessing I can solve by converting the chart image into a PNG using canvas.toDataURL() and saving it on the server).
The only option I see here is to save all the dynamically generated data along with the HTML tags, recreate the file on the server, and then convert it to PDF. I am sure there's a better solution. Apologies if this appears basic, but I am not a programmer by profession.
Django has a few options for outputting PDFs, the most flexible of which is ReportLab.
However, to just render a Django template to PDF while passing context data, WeasyPrint and xhtml2pdf are dead simple. Below is an example using the xhtml2pdf library; it's a standard Django view.
To be clear, all of these libraries take a Django template, render it, and return a PDF. There are limitations (Pisa, for example, can only render a handful of CSS properties). Regardless, take a look at these three; at least one will do exactly what you need.
from django_xhtml2pdf.utils import generate_pdf
from django.http import HttpResponse

def myview(request):
    resp = HttpResponse(content_type='application/pdf')
    dynamic_variable = request.user.some_special_something
    context = {'some_context_variable': dynamic_variable}
    result = generate_pdf('my_template.html', file_object=resp, context=context)
    return result
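If you prefer WeasyPrint, a minimal sketch of the equivalent view looks like this (the template name and context variable are the same placeholders as above; WeasyPrint's HTML(string=...).write_pdf() returns the PDF as bytes):

from django.http import HttpResponse
from django.template.loader import render_to_string
from weasyprint import HTML

def myview_weasyprint(request):
    # Render the Django template to an HTML string with the context data
    html_string = render_to_string('my_template.html', {
        'some_context_variable': request.user.some_special_something,
    })
    # Convert the rendered HTML to PDF bytes and return them
    pdf_bytes = HTML(string=html_string).write_pdf()
    return HttpResponse(pdf_bytes, content_type='application/pdf')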
You can use a paid service, e.g. pdfcrowd, which converts a web page into a PDF. Like this:
First, install:
pip install pdfcrowd
Then use the library:
import pdfcrowd
from django.http import HttpResponse

def generate_pdf_view(request):
    try:
        # create an API client instance
        client = pdfcrowd.Client("username", "apikey")

        # convert a web page and store the generated PDF in a variable
        pdf = client.convertURI("http://www.yourwebpage.com")

        # set HTTP response headers
        response = HttpResponse(content_type="application/pdf")
        response["Cache-Control"] = "max-age=0"
        response["Accept-Ranges"] = "none"
        response["Content-Disposition"] = "attachment; filename=google_com.pdf"

        # send the generated PDF
        response.write(pdf)
    except pdfcrowd.Error as why:
        response = HttpResponse(content_type="text/plain")
        response.write(str(why))
    return response
You can get the username and API key by signing up here: http://pdfcrowd.com/pricing/api/
Option A: Scraping
You could use something like PhantomJS or CasperJS to navigate and scrape the HTML page.
Option B: Generation
You could use something like PyPDF as suggested here.
Which option is better?
Scraping saves you from having to maintain two templates. With generation you get more control, since you're writing specifically for PDF, though it implicitly means maintaining two templates.
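As a rough sketch of the generation route, here is a minimal ReportLab example (ReportLab is the flexible option mentioned above; the report content here is made up):

from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

# Draw a tiny report directly onto a PDF canvas
c = canvas.Canvas('report.pdf', pagesize=A4)
c.setFont('Helvetica-Bold', 14)
c.drawString(72, 800, 'Client report')
c.setFont('Helvetica', 10)
for i, (label, value) in enumerate([('Revenue', 1200), ('Costs', 800)]):
    c.drawString(72, 780 - 15 * i, '%s: %s' % (label, value))
c.save()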
Related
I need to create a PDF from HTML inside a react-js app.
Many packages I have found prompt a download button in the browser (like jsPDF), but I actually need the PDF as a binary string. I need this string to be sent to a private API that stores this PDF (binary string) in S3 as a PDF file. This private API call already exists, and I cannot change anything about this code.
I am struggling to understand why this is so hard. How would you go about converting HTML to a PDF binary string? Thanks for any suggestions, packages, ... It can be JavaScript, as long as I can implement it inside my ReactJS app.
Bonus points if the solution can accept HTML tags, since the input is done inside a WYSIWYG editor.
This server-side solution works with any HTML framework.
https://github.com/PDFTron/web-to-pdf
This is from the company I work for, but it is AGPL-3.0 licensed, so you should be able to use it without a problem.
I am scraping this website; most of the data I need is rendered with AJAX.
At first I tried to scrape it with Ruby (as it is the language I know best), but it did not work out.
Then I was advised to do it with Python and Scrapy, which I tried, but I do not understand why I can't get the data.
import scrapy

class TaricSpider(scrapy.Spider):
    name = 'taric'
    allowed_domains = ['ec.europa.eu']
    start_urls = ['http://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&Taric=01042090&SimDate=20190912']

    def parse(self, response):
        code = response.css(".td_searhed_criteria::text").extract()
        tarifs = response.xpath("//div[contains(@class, 'measures_detail')]").extract_first()
        print(code)
        print(tarifs)
And when I run this in my terminal, I get the expected result for code, but for tarifs I get "None".
Do you have any idea what is wrong in my code, please? I have tried different ways to scrape, but none has worked.
Maybe the XPath is not correct? Or maybe my Python syntax is bad; I have only been using Python since I started trying to scrape this web page.
The reason your XPath does not work is that this data is added by AJAX requests. If you open the dev console in your browser and go to Network -> XHR, you will see the AJAX requests. There are then two possible solutions:
1. Make the request manually in your script
2. Use a JS renderer like Splash
In this case, using Splash will be easiest, because the responses to the AJAX requests are JS files and not all of the data is present there (see the sketch below).
Also, I would recommend looking at Aquarium, a tool that bundles Splash, HAProxy, and docker-compose.
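As a rough sketch of option 2, using the scrapy-splash plugin (this assumes a Splash instance is running on localhost:8050 and that the middleware is configured in settings.py as described in the scrapy-splash README):

import scrapy
from scrapy_splash import SplashRequest

class TaricSplashSpider(scrapy.Spider):
    name = 'taric_splash'

    def start_requests(self):
        url = ('http://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp'
               '?Lang=en&Taric=01042090&SimDate=20190912')
        # Render the page in Splash so the AJAX-loaded content is present
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # The rendered HTML now contains the AJAX-generated markup
        yield {'tarifs': response.xpath("//div[contains(@class, 'measures_detail')]").get()}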
I am trying to access the text in an element whose content is generated by JavaScript. For example, getting the number of Twitter shares from this site.
I've tried using urllib and PyQt to obtain the HTML of the page; however, since the content requires JavaScript to be generated, its HTML is not present in the response from urllib/PyQt. I am currently using Selenium for this task, but it is taking longer than I would like.
Is it possible to get access to this data without opening the page in a browser?
This question has already been asked in the past, but the results I found were either C#-specific or linked to a solution that has since gone dead.
Working example:

import urllib.parse
import requests
import json

url = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"
encoded = urllib.parse.quote_plus(url)
# encoded = urllib.quote_plus(url)  # for Python 2, replace the previous line with this

j = requests.get('https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded).text
obj = json.loads(j)
print(obj['clicks']['twitter'] + obj['shares']['twitter'])
# => 5008
Explanation:
Inspecting the web page, you can see that it makes a request to this:
https://count-server.sharethis.com/v2.0/get_counts?url=https%3A%2F%2Fdaphnecaruanagalizia.com%2F2017%2F10%2Fcrook-schembri-court-today-pleading-not-crook%2F&cb=stButtons.processCB&wd=true
If you paste it into your browser, you'll have all your answers. Playing a bit with the URL, you can see that removing the extra parameters gives you a nice JSON response.
So, as you can see, you just have to replace the url parameter of the request with the URL of the page whose Twitter counts you want.
You could do something like the following after launching a Selenium web browser, then passing driver.page_source to the BeautifulSoup library (unfortunately I cannot test this at work with firewalls in place):

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')
shares = soup.find('span', {'class': 'st_twitter_hcount'}).find('span', {'class': 'stBubble_hcount'})
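For completeness, a minimal sketch of the Selenium launch that would precede this (assuming chromedriver is installed and on the PATH; the sleep time is arbitrary):

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/')
time.sleep(5)  # crude wait so the JavaScript share widgets have time to render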
I'm trying to scrape sections of a JavaScript calendar page through Python (XBMC/Kodi).
So far I've been able to scrape static HTML variables, but not the JavaScript-generated sections.
The variables I'm trying to retrieve are <strong class="item-title">**this**</strong>, <span class="item-daterange">**this**</span> and <div class="item-location">**this**</div>; note that they are in separate sections of the HTML source and are rendered through JavaScript. All of the scraped variables should be appended into one string and displayed.
response = net.http_GET('my URL')
link = response.content
match = re.compile('<strong class="gcf-item-title">(.+?)</strong>').findall(link)
for name in match:
    print name
With the regex above I can scrape just one of those variables; since I need a list of strings containing all the variables together, how can that be done?
I get that the page has to be pre-rendered for the JavaScript variables to be scraped, but since I'm using XBMC, I am not sure how I can import additional Python libraries such as dryscrape to get this done. Downloading dryscrape gives me a setup.py and an __init__.py file along with some others, but how can I use all of them together?
Thanks.
Is your question about the steps to scrape the JavaScript, how to use Python on XBMC/Kodi, or how to install packages that come with a setup.py file?
Just based on your regex above, if your entries are always like <strong class="item-title">**this**</strong>, you won't get a match, since your re pattern is for elements with class="gcf-item-title".
Are you using, or able to use, BeautifulSoup? If you're not using it but can, you should; it's life-changing in terms of scraping websites.
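For example, here is a minimal sketch of collecting the three fields from your question with BeautifulSoup and joining them into one string (the class names come from your question; html is assumed to hold the already-rendered page source):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
items = []
# Pull the three fields out of their separate sections and combine them
for title, dates, location in zip(soup.select('strong.item-title'),
                                  soup.select('span.item-daterange'),
                                  soup.select('div.item-location')):
    items.append(' - '.join([title.get_text(strip=True),
                             dates.get_text(strip=True),
                             location.get_text(strip=True)]))
print('\n'.join(items))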
I'm working on automatically generating a local HTML file, and all the relevant data I need is in a Python script. Without a web server in place, I'm not sure how to proceed, because otherwise I think an AJAX/JSON solution would be possible.
Basically, in Python I have a few list and dictionary objects that I need to use to create graphs using JavaScript and HTML. One solution I have (which really sucks) is to literally write HTML/JS from within Python using strings and then save it to a file.
What else could I do here? I'm pretty sure JavaScript doesn't have file I/O capabilities.
Thanks.
You just need to get the data you have in your Python code into a form readable by your JavaScript, right?
Why not just take the data structure, convert it to JSON, and then write a .js file that your .html file includes, containing simply var data = { json: "object here" };
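A minimal sketch of that idea (the file name and data are placeholders):

import json

data = {'labels': ['a', 'b', 'c'], 'values': [1, 2, 3]}

# Write the Python data out as a JavaScript file; the HTML page then
# includes it with <script src="data.js"></script> before the charting code.
with open('data.js', 'w') as f:
    f.write('var data = %s;' % json.dumps(data))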
What do you think about using a templating system? It would fit your needs.
I know you've specifically mentioned "without a web server", but unless you really want to go out of your way, over-complicate this, and restrict flexibility for future use:
Could you not use a very simple web server such as http://docs.python.org/library/simplehttpserver.html? That way, should you need to expose the site, you've already got the URLs in place to set up a proper web server.
Maybe you could write to a cookie and then access that via JavaScript? Similar to this SO answer?
You could use Python's JSON Encoder and Decoder library. This way you could encode your Python data into JSON format and include that in your HTML document. You would then use Javascript in the HTML file to work with the JSON encoded data.
http://docs.python.org/library/json.html
If this only needs to be for localhost, you could do something like the following.
To access it, you would make a call to http://localhost:8080/foo; this can cause some issues due to cross-origin request protections, but these are readily solved by Googling around.
On the JS side, you would make an AJAX call like this (assuming jQuery):
$.get('http://localhost:8080/foo', function (data) { console.log(data); });
And on the Python side you would have this file in the same directory as the HTML file you are seeking to serve (index.html) on your computer, and execute it.
import BaseHTTPServer
import json

class WebRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        desiredDict = {'something': 'sent to JS'}
        if self.path == '/foo':
            self.send_response(200)
            self.send_header("Content-type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps(desiredDict))
        elif self.path == '/index.html' or self.path == '/':
            htmlFile = open('index.html', 'rb')
            self.send_response(200)
            self.send_header("Content-type", "text/html")
            self.send_header("Access-Control-Allow-Origin", "http://localhost:8080/")
            self.end_headers()
            self.wfile.write(htmlFile.read())
        else:
            self.send_error(404)

server = BaseHTTPServer.HTTPServer(('', 8080), WebRequestHandler)
server.serve_forever()