I wish to download data for thousands of records from a government site using Python 2.7. One example of a record is http://camara.cl/pley/pley_detalle.aspx?prmID=1252&prmBL=1-07. Two related problems:
(1) the site relies on mouse clicks (in the source, a control labeled Urgencias) to access another part of the data of interest to me; and
(2) I am illiterate in web scraping in general and Python in particular.
Learning-by-doing has so far taken me about half-way. Internet resources here, here, and here pushed me in the right direction. But I hit a wall.
I can get source code for the information that fills the screen when the url is invoked.
import requests
id = '1252'
bl = '1-07'
url = 'http://camara.cl/pley/pley_detalle.aspx'
parametros = {'prmID': id, 'prmBL': bl}
r = requests.get(url, params = parametros)
hitos = r.text
print hitos
But I've had no success in getting info from the 'Urgencias' tab. One attempt looks like this:
import json
parametros = {'prmID': id, 'prmBL': bl, '__EVENTTARGET': 'ctl00$mainPlaceHolder$btnUrgencias'}
headers = {'content-type': 'application/x-www-form-urlencoded; charset=utf-8'}
p = requests.post(url, data = json.dumps(parametros), headers = headers)
urgencias = p.text
print urgencias
I am obviously not building/sending the request properly. (I am also missing cookies, I believe.)
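Based on what I've read about ASP.NET postbacks, I suspect the request needs to be a regular form-encoded POST that echoes back the page's hidden state fields and reuses the session cookies. A sketch of what I mean (the hidden-field handling is a guess on my part; the control name is the one from my attempt above):
import requests
from bs4 import BeautifulSoup   # pip install beautifulsoup4

url = 'http://camara.cl/pley/pley_detalle.aspx'
parametros = {'prmID': '1252', 'prmBL': '1-07'}

with requests.Session() as s:                 # the Session keeps the ASP.NET cookies
    r = s.get(url, params=parametros)
    soup = BeautifulSoup(r.text, 'html.parser')

    # ASP.NET postbacks expect the hidden state fields (__VIEWSTATE etc.) echoed back
    form = {}
    for campo in soup.find_all('input', {'type': 'hidden'}):
        if campo.get('name'):
            form[campo['name']] = campo.get('value', '')
    form['__EVENTTARGET'] = 'ctl00$mainPlaceHolder$btnUrgencias'
    form['__EVENTARGUMENT'] = ''

    # send it as a regular form post (a plain dict, not json.dumps)
    p = s.post(r.url, data=form)
    print p.text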
Any help will be greatly appreciated. (Am open to use any method that will work from a Ubuntu machine!)
Related
I have a link here: https://fantasy.espn.com/football/players/add?leagueId=1589782588 and I've been wanting to pull data from it. In the developer console I typed out
let players = document.getElementsByClassName("AnchorLink link clr-link pointer")
players[0].text
and it works perfectly. How can I get this to work in my IDE?
Disclaimer: the following is for teaching purposes only and should not be abused.
Use a public API if provided by the website owner.
Investigate what happens to the request by using the Network tab.
You'll notice that a request is made to a URI https://site.api.espn.com/apis/..... which ends in something like ....ffl/news/players?days=30&playerId=2576623.
If you click that link you'll go directly to a page that serves an API response as JSON.
Search the entire site again (Ctrl + Shift + F) for that player ID, 2576623, and you'll notice that it is stored inside each player image URI. So let's collect all those IDs.
Open Dev Tools Console and run:
var _i = document.querySelectorAll("tbody .player__column img[src*='full/']");
console.log(_i);
Now that you have your image elements it's time to collect all the IDs:
var _ids = [..._i].map(el => el.src.match(/(?<=full\/)[^\.]+(?=\.)/)[0]);
console.log(_ids)
From this point on, you can use any server-side script (or even JS, if there's no cross-origin limitation) to fetch that JSON data.
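From a server-side script, fetching that JSON could look roughly like this Python sketch (the full API path is whatever you copied from the Network tab; the placeholder below is not the real path):
import requests

# Placeholder: paste the full endpoint copied from the Network tab here; only the
# "days" and "playerId" parameters below are taken from the request shown above.
API_URL = "https://site.api.espn.com/apis/<path-from-network-tab>/ffl/news/players"

player_ids = ["2576623"]          # the IDs collected with the console snippet above

for pid in player_ids:
    resp = requests.get(API_URL, params={"days": 30, "playerId": pid})
    resp.raise_for_status()
    print(resp.json())            # raw JSON; inspect it to pick out the fields you need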
With my poor knowledge of web scraping, I've run into a very complex issue (for me), which I will try to explain as best I can (so I'm open to suggestions or edits to my post).
I started using the web crawling framework Scrapy for my web scraping long ago, and it's still the one I use nowadays. Lately I came across this website, and found that my framework (Scrapy) was not able to iterate over the pages, since this website uses fragment URLs (#) to load the data (the next pages). Then I made a post about that problem (having no idea of the main problem yet): my post
After that, I realized that my framework can't manage it without a JavaScript interpreter or a browser imitation, so the Selenium library was suggested. I read as much as I could about that library (e.g. example1, example2, example3 and example4). I also found this StackOverflow post that gives some pointers about my issue.
So finally, my biggest questions are:
1 - Is there any way to iterate/yield over the pages of the website shown above, using Selenium along with Scrapy?
So far this is the code I'm using, but it doesn't work...
EDIT:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The require imports...
def getBrowser():
path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
"(KHTML, like Gecko) Chrome/15.0.87")
browser = webdriver.PhantomJS(executable_path=path_to_phantomjs, desired_capabilities=dcap)
return browser
class MySpider(Spider):
name = "myspider"
browser = getBrowser()
def start_requests(self):
the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)
def parse(self, response):
self.get_page_links()
def get_page_links(self):
""" This first part, goes through all available pages """
for i in xrange(1, 3): # 210
new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
"config": {"page": str(i)}}
json_data = json.dumps(new_data)
new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
self.browser.get(new_url)
print "\nThe new URL is -> ", new_url, "\n"
content = self.browser.page_source
self.get_item_links(content)
def get_item_links(self, body=""):
if body:
""" This second part, goes through all available items """
raw_links = re.findall(r'listclickable.+?>', body)
links = []
if raw_links:
for raw_link in raw_links:
new_link = re.findall(r'data-link=\".+?\"', raw_link)[0].replace("data-link=\"", "").replace("\"",
"")
links.append(str(new_link))
if links:
ids = self.get_ids(links)
for link in links:
current_id = self.get_single_id(link)
print "\nThe Link -> ", link
# If commented the line below, code works, doesn't otherwise
yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)
def get_ids(self, list1=[]):
if list1:
ids = []
for elem in list1:
raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
ids.append(raw_id)
return ids
else:
return []
def get_single_id(self, text=""):
if text:
raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
return raw_id
else:
return ""
def parse_room(self, response):
# More scraping code...
So this is mainly my problem. I'm almost sure that what I'm doing isn't the best way, which is why I asked my second question. And to avoid having to deal with these kinds of issues in the future, I asked my third question.
2 - If the answer to the first question is negative, how could I tackle this issue? I'm open to other means.
3 - Can anyone tell me or show me pages where I can learn how to combine web scraping with JavaScript and Ajax? Nowadays, more and more websites use JavaScript and Ajax to load their content.
Many thanks in advance!
Selenium is one of the best tools for scraping dynamic data. You can use Selenium with any web browser to fetch data that is loaded by scripts; it works exactly like browser click operations. But it is not what I prefer.
For getting dynamic data you can use the Scrapy + Splash combo: Scrapy gets you all the static data and Splash renders the other, dynamic content.
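A minimal sketch of that combo (the spider name and selector are placeholders; it assumes a running Splash instance and the scrapy-splash settings from its README, i.e. SPLASH_URL and the downloader middlewares):
import scrapy
from scrapy_splash import SplashRequest

class RenderedSpider(scrapy.Spider):
    name = "rendered_example"

    def start_requests(self):
        url = "http://www.atraveo.com/es_es/islas_canarias"
        # Splash renders the JavaScript before the response reaches parse()
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # response.text is now the rendered HTML, so normal selectors work
        for href in response.css("a::attr(href)").extract():
            self.logger.info(href)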
Have you looked into BeautifulSoup? It's a very popular web scraping library for Python. As for JavaScript, I would recommend something like Cheerio (if you're asking for a scraping library in JavaScript).
If you mean that the website uses HTTP requests to load content, you could always try to replicate those requests manually with something like the requests library.
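A rough example with requests and BeautifulSoup (the URL and selector here are just placeholders):
import requests
from bs4 import BeautifulSoup

# If the listings are loaded by AJAX, point requests at the underlying XHR
# endpoint found in the Network tab instead of the page URL itself.
response = requests.get("http://www.atraveo.com/es_es/islas_canarias")
soup = BeautifulSoup(response.text, "html.parser")

for a in soup.find_all("a", href=True):
    print(a["href"])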
Hope this helps
You can definitely use Selenium as a standalone tool to scrape webpages with dynamic content (like AJAX loading).
Selenium will just rely on a WebDriver (basically a web browser) to seek content over the Internet.
Here are a few of them (the most often used):
ChromeDriver
PhantomJS (my favorite)
Firefox
Once you're set up, you can start your bot and parse the HTML content of the webpage.
I included a minimal working example below using Python and ChromeDriver:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path='chromedriver')
driver.get('https://www.google.com')
# Then you can search for any element you want on the webpage
search_bar = driver.find_element(By.CLASS_NAME, 'tsf-p')
search_bar.click()
driver.close()
See the documentation for more details!
Diving into my first Chrome extension, and trying to figure out how to modify some data in HTTP requests.
I'm using the documentation here: https://developer.chrome.com/extensions/webRequest
I was able to setup the extension to listen for requests, but am not able to access the data I want.
When I'm in the chrome dev tools, on the Network tab, I right click the particular request I'm trying to modify and copy as cURL. The data I want to modify shows up after --data. I want to access this and change an integer value one of the parameters is set to.
I'm not sure what the equivalent is with these http requests, but I've tried the following:
chrome.webRequest.onBeforeRequest.addListener(
function(details) {
var bkg = chrome.extension.getBackgroundPage();
bkg.console.log("onBeforeRequest");
bkg.console.log(JSON.stringify(details));
blockingResponse = {};
return blockingResponse;
},
{urls: [/*URL*/]},
['requestBody','blocking']
);
I can find the request with the URL that I am looking at in the Network tab of the dev tools, so I'll be able to parse that and make sure I'm only modifying the requests I want to. But printing the details doesn't show the data that I actually want to modify. Any idea how to obtain the HTTP request equivalent of the --data argument of a cURL request? And, well, modify it?
Edit: Here's the progress I've made.
When I log those details, I get ..."requestBody":{"raw":[{"bytes":{}}]},...
However, if I change onBeforeRequest to:
chrome.webRequest.onBeforeRequest.addListener(
function(details) {
var bkg = chrome.extension.getBackgroundPage();
bkg.console.log("onBeforeRequest");
bkg.console.log(JSON.stringify(details));
var encodedData = getBase64FromArrayBuffer(details.requestBody.raw[0].bytes);
bkg.console.log("unencoded data: " + details.requestBody.raw[0].bytes);
bkg.console.log("encodedData: " + encodedData);
blockingResponse = {};
return blockingResponse;
},
{urls: ["*://*.facebook.com/*"], types: ["xmlhttprequest"]},
['requestBody','blocking']
);
function getBase64FromArrayBuffer(responseData) {
var uInt8Array = new Uint8Array(responseData);
var i = uInt8Array.length;
var binaryString = new Array(i);
while (i--)
{
binaryString[i] = String.fromCharCode(uInt8Array[i]);
}
var data = binaryString.join('');
var base64 = window.btoa(data);
return base64;
}
The encoded data exists, showing a long string of chars, though it's gibberish. Does this mean that I won't be able to access this data and modify it? Or is there a way to decode this data?
The chrome.webRequest API does allow you to access POST data. It does not, however, allow you to modify the POST data.
You are able to modify some of the header info, but not the POST data.
It appears the ability to modify POST data was intended, but the Google dev working on it got moved to something else, sat on the bug/feature request for two years, and only recently released it so someone else could pick it up. If this is a feature that interests you, head to https://bugs.chromium.org/p/chromium/issues/detail?id=91191# and star the bug (requires a Gmail account), and perhaps some renewed interest will lead to someone completing the functionality.
I'm very new to JavaScript, so be patient.
I've been trying to scrape a site and get all the product URLs in a list that I will use later in another function, like this:
url='https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx'
var http = require('http-get');
var request = require("request");
var cheerio = require("cheerio");
function getURLS(url) {
request(url, function(err, resp, body){
var linklist = [];
$ = cheerio.load(body);
var links = $('#productResults a');
for(valor in links) {
if(links[valor].attribs && links[valor].attribs.href && linklist.indexOf(links[valor].attribs.href) == -1){
linklist.push(links[valor].attribs.href);
}
}
var extended_links = [];
linklist.forEach(function(link){
extended_link = 'https://www.fromuthtennis.com/frm/' + link;
extended_links.push(extended_link);
})
console.log(extended_links);
})
};
This does work unless you go to the second page of items like this:
url='https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]'
var http = require('http-get');
var request = require("request");
var cheerio = require("cheerio"); //etc...
As far as I know this happens because the content on the page is loaded dynamically.
To get the contents of the page I believe I need to use PhantomJS, because that would allow me to get the HTML code after the page has fully loaded, so I installed the phantomjs-node module. I want to use NodeJS to get the URL list because the rest of my code is written in it.
I've been reading a lot about PhantomJS, but using phantomjs-node is tricky and I still don't understand how I could get the URL list with it, because I'm very new to JavaScript and to coding in general.
If someone could guide me a little bit I'd appreciate it a lot.
Yes, you can. That page looks like it implements Google's Ajax Crawling URL.
Basically it allows websites to generate crawler friendly content for Google. Whenever you see a URL like this:
https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]
You need to convert it to this:
https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx?_escaped_fragment_=Filter%3D%5Bpagenum%3D2*ava%3D1%5D
The conversion simply takes the base path https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx and adds a query param _escaped_fragment_ whose value is the URL fragment Filter=[pagenum=2*ava=1] encoded into Filter%3D%5Bpagenum%3D2*ava%3D1%5D using standard URI encoding.
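To make the conversion concrete, here is a quick sketch (Python 2's urllib just to illustrate; encodeURIComponent gives the same encoding in Node):
import urllib

def to_escaped_fragment(url):
    base, _, fragment = url.partition('#')
    # safe='*' keeps the asterisk unencoded, as standard URI encoding does
    return base + '?_escaped_fragment_=' + urllib.quote(fragment, safe='*')

print(to_escaped_fragment(
    'https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]'))
# https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx?_escaped_fragment_=Filter%3D%5Bpagenum%3D2*ava%3D1%5D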
You can read the full specification here: https://developers.google.com/webmasters/ajax-crawling/docs/specification
Note: This does not apply to all websites, only websites that implement Google's Ajax Crawling URL. But you're in luck in this case.
You can see any product you want without using dynamic content by using this URL:
https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID={product_id}
For example to see product 37023:
https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID=37023
All you have to do is for(var productid=0; productid<40000; productid++) {request...}.
Another approach is to use the phantom module (https://www.npmjs.com/package/phantom). It will let you run PhantomJS commands directly from your NodeJS app.
I need to develop an in-house real-time analytics solution (similar to GA or mixpanel for example) that collects:
Information from the website itself (URL)
Information from the user’s browser (language, device, OS, etc.)
Information from the referring source, etc.
.. and sends this data to the server with a single-pixel image request. Similar to how GA and other solutions work:
Google Analytics works by the inclusion of a block of JavaScript code
on pages in your website. When users to your website view a page, this
JavaScript code references a JavaScript file which then executes the
tracking operation for Analytics. The tracking operation retrieves
data about the page request through various means and sends this
information to the Analytics server via a list of parameters attached
to a single-pixel image request.
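To be clear about the mechanism: on the receiving end this boils down to an endpoint that logs the query string and returns a 1x1 image, roughly like this sketch (Flask and the route name are arbitrary choices of mine):
import base64
from flask import Flask, request, Response

app = Flask(__name__)

# a pre-encoded transparent 1x1 GIF
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

@app.route("/collect.gif")
def collect():
    # request.args carries the url, referrer, browser details, etc. appended by the JS snippet
    app.logger.info(dict(request.args))
    return Response(PIXEL, mimetype="image/gif")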
I wonder if there's any open source project available that does this part, which I could use as a base to build further. There's Piwik, but it's too feature-packed and too heavy for my requirements.
Edited to add: I'm doing something specific with the data, otherwise I'd just use the existing solutions.
Try
var img = new Image;
img.width = img.height = 1; // numeric 1, not "1px" (width/height are numbers)
var res = window.navigator;
var data = {};
var _plugins = {};
Array.prototype.slice.call(navigator.plugins).forEach(function(v, k) {
    _plugins[v.name.toLowerCase().replace(/\s/g, "-")] = {
        "name": v.name,
        "description": v.description,
        "filename": v.filename
    };
});
data.url = window.location.href;
data.ref = document.referrer;
// navigator itself does not JSON-serialize, so copy the fields you need
data.nav = {
    userAgent: res.userAgent,
    language: res.language,
    platform: res.platform
};
data._plugins = _plugins;
// set `img` `dataset` with `data`,
// send `img` to server, decode `img` `dataset` at server
img.dataset.stats = JSON.stringify(data);
There are 2 big solutions for open source analytics.
Piwik, as you mentioned, is a well documented and pretty mature solution. Drilling down into its code to see how Piwik does things will give you some insights.
Open Web Analytics is the other big player in the game. It is a simpler tool that will help you understand how basic tracking is done.
Depending on the data you want to track, I would also suggest taking a look at this tutorial, which uses sockets in order to track real-time data.
Last but not least, you can also check what Crazy Egg does if you want to track user interactivity.