I am trying to get the html-content of several pages with python 2.7.3 and urllib2.
For the most pages, it works fine, but some pages like http://www.bbc.co.uk/news/entertainment-arts-22441507#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa return me this content:
This page is best viewed in an up-to-date web browser with style sheets (CSS) enabled. While you will be able to view the content of this page in your current browser, you will not be able to get the full visual experience. Please consider upgrading your browser software or enabling style sheets (CSS) if you are able to do so.
This problem also occurs with pages where javascript is required. I only get the content within the noscript-tag returned.
Here is how I get the content:
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
response = urllib2.urlopen(url).read().decode("utf-8")
Are there additional headers needed?
Sounds like you're fetching the original HTML page, before javascript/ajax has had a go at it. Try using webkit to get the page with JavaScript applied. See here for an answer with links.
Related
Python novice here.
I am trying to scrape company information from the Dutch Transparency Benchmark website for a number of different companies, but I'm at a loss as to how to make it work. I've tried
pd.read_html(https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793)
and
requests.get("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and then working from there. However, it seems like the data is dynamically generated/queried, and thus not actually contained in the html source code these methods retrieve.
If I go to my browser's developer tools and copy the "final" html as shown there in the "Elements" tab, the whole information is in there. But as I'd like to repeat the process for several of the companies, is there any way to automate it?
Alternatively, if there's no direct way to obtain the info from the html, there might be a second possibility. The site allows to download the information as an Excel-file for each individual company. Is it possible to somehow automatically "click" the download button and save the file somewhere? Then I might be able to loop over all the companies I need.
Please excuse if this question is poorly worded, and thank you very much in advance
Tusen takk!
Edit: I have also tried it using BeautifulSoup, as #pmkroeker suggested. But I'm not really sore how to make it work so that it first runs all the javascript so the site actually contains the data.
I think you will either want use a library to render the page. This answer seems to apply to python. I will also copy the code from that answer for completeness.
You can pip install selenium from a command line, and then run something like:
from selenium import webdriver
from urllib2 import urlopen
url = 'http://www.google.com'
file_name = 'C:/Users/Desktop/test.txt'
conn = urlopen(url)
data = conn.read()
conn.close()
file = open(file_name,'wt')
file.write(data)
file.close()
browser = webdriver.Firefox()
browser.get('file:///'+file_name)
html = browser.page_source
browser.quit()
I think you could probably skip the file write and just pass it to that browser.get call, but I'll leave that to you to find out.
The other thing you can do is look for the ajax calls in a browser developer tool. i.e. when using chrome the 3 dots -> more tools -> developer tools or press something like F12. Then look at the network tab. There will be various requests. You will want to click one, click the Preview tab, and then go through each until you find a response that looks like json data. You are effectively look for their API calls that they used to get the data to generate things. Once you find one, click the Headers tab and you will see a Request URL.
i.e. this https://sa-tb.nl/api/widget/chart/survey/4/sector/38 has lots of data
The problem here is it may or may not be repeatable (API may change, id's may change). You may have a similar problem with just HTML scraping as the HTML could change just as easily.
I'm trying to write a python script which parses one element from a website and simply prints it.
I couldn't figure out how to achieve this, without selenium's webdiver, in order to open a browser which handles the scripts to properly display the website.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft which will print the contents, including the element of interest <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
edit: I've previously tried to use urllib or in bash wget, which both lack the required javascript interpretation.
As other answers mentioned, this webpage requires javascript to render content, so you can't simply get and process the page with lxml, Beautiful Soup, or similar library. But there's a much simpler way to get the information you want.
I noticed that the link you provided fetches data from an internal API in a structured fashion. It appears that the product number is 910000800509 based on the url. If you look at the networking tab in Chrome dev tools (or your brower's equivalent dev tools), you'll see that a GET request is being made to following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.
You can make the request like this with just the json and requests modules:
import json
import requests
url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)
price = r.json()['items'][0]['price']
print price
£13.00
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
How could I get the element of interest without the browser opening,
or even without a browser at all?
After inspecting the page you're trying to parse :
http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if javascript is enabled, based on that, you need to use a real browser.
Conclusion:
The way to go, if you need to automatize, is:
selenium
I am currently using javascript and XMLHttpRequest on a static html page to create a view of a record in Zotero. This works nicely except for one thing: The page html title.
I can of course also change the <title>...</title> tag, but if someone wants to post the view to for example facebook the static title on the web page will be shown there.
I can't think of any way to fix this with just a static page with javascript. I believe I need a dynamically created page from a server that does something similar to XMLHttpRequest.
For PHP there is HTTPRequest. Now to the problem. In the javascript version I can use asynchronous calls. With PHP I think I need synchronous calls. Is that something to worry about?
Is there perhaps some other way to handle this that I am not aware of?
UPDATE: It looks like those trying to answer are not at all familiar with Zotero. I should have been more clear. Zotero is a reference db located at http://zotero.org/. It has an API that can be used through XMLHttpRequest (which is what I said above).
Now I can not use that in my scenario which I described above. So I want to call the Zotero server from my server instead. (Through PHP or something else.)
(If you are not familiar with the concepts it might be hard to understand and answer the question. Of course.)
UPDATE 2: For those interested in how Facebook scraps an URL you post there, please test here: https://developers.facebook.com/tools/debug
As you can see by testing there no javascript is run.
Sorry, im not sure if i understand what you are trying to ask, are you just wanting to change the pages title?
Why not use javascript?
document.title = newTitle
Facebook expects the title (or opengraph :title tags) to be present when it fetches the page. It won't execyte any JavaScript for you to fill in the blanks.
A cool workaround would be to detect the Facebook scraper with PHP by parsing the User Agent string, and serving a version of the page with the information already filled in by PHP instead of JavaScript.
As far as I know, the Facebook scraper uses this header for User Agent: "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
You can check to see if part of that string is present in the header and load the page accordingly.
if (strpos($_SERVER['HTTP_USER_AGENT'], 'facebookexternalhit') !== false)
{
//synchronously load the title and opengraph tags here.
}
else
{
//load the page normally
}
I want to load a external webpage on my own server and add my own header. Also i need to use the data from the external website like url and content (i need to search and find specific data, check if i got that data in my system and show my data in the header). The external webpage needs to be working (like the buttons for opening other pages, no new windows).
I know i can play with .NET to create software but i want to create a website that will do the trick. Can this be done? Php + iframe is to simple i think, that won't give me the data from external website and my server won't see changes in the external url (what i need).
If it's supposed to be client-side, then you can acquire the data necessary by using an Ajax request, parsing it in JavaScript and then just inserting it into an element. However you have to take into account that if the host doesn't support cross-origin resource sharing, then you won't be able to do it like this.
Ajax page source request: get full html source code of page through ajax request through javascript
Parsing elements from the source: http://ajaxian.com/archives/html-parser-in-javascript (not sure if useful)
Changing the element body:
// data --> the content you want to display in your element
document.getElementById('yourElement').innerHtml = data;
Other approach (server-side though) is to "act" like a browser by faking your user-agent to some browser's and then using cUrl for example to get the source. But you don't want to fake it, because that's not nice and you would feel bad..
Hope it gets you started!
I have the following problem:
HTML blank page on server 1.
WordPress site on server 2.
What I need is to call the content from www.wordpress.site/sample-page/ to HTML page on server 1, but not the entire page, only the part that I can edit from wp-admin; so without header and footer.
Also, I don't know if there is any other method, but I need it to be done via JavaScript/jQuery or Ajax.
I've used Google, but is hard to get a tutorial for this, I've tried a lot of tutorials, but none is what I need, and I don't know that much JavaScript to make it work.
SO, can someone help me please?
BIG Thanks!
Andrei
L.E.:
I've found this working: http://jsfiddle.net/mdawaffe/hLWdH/
It is working as it is written, if I try to change the domain with mine, will not work.
What script do I have to implement on the server from which the content is called (taken)?
For more information, as you asked:
I have a HTML + CSS + JS template that I will use with phonegap (if you don't know about it, try it, it's very useful) to create a mobile app for Android, iOS, and BlackBerry.
Now, I have this site: m.trafficvoice.ro (I hope I can post links here).
In the 'live stream' page (it's called services.html), I have a HTML5 audio tag/player.
What I need, is to get from www.trafficvoice.ro/whatever-the-name-page, the content, but only the part that I can edit in WordPress (so without header and footer).
Why? Because in the future there will be more stream to add, and maybe some of them will be down due to unknown reason, so I need to update that page, without making an update for the entire app, upload it to the store, wait for approval, the client to download it, etc.
Big thanks!
Andrei
Could you just use an iframe instead? You could modify a template in your theme to not display header/footer and then use that in the iframe.