Cannot get entire web page after query - javascript

I'm trying to scrape the historical NAVPS tables found on this page:
http://www.philequity.net/pefi_historicalnavps.php
The code here is the entire contents of my minimal working script. It starts with:
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
opener = urllib2.build_opener()
urllib2.install_opener(opener)
After studying the web page using Chrome's Inspect Element, I found that the form data sent is the following:
form_data = {}
form_data['mutualFund'] = '1'
form_data['year'] = '1995'
form_data['dmonth'] = 'Month'
form_data['dday'] = 'Day'
form_data['dyear'] = 'Year'
So I continue building up the request:
url = "http://www.philequity.net/pefi_historicalnavps.php"
params = urllib.urlencode(form_data)
request = urllib2.Request(url, params)
I expect this to be the equivalent of clicking "Get NAVPS" after filling in the form:
page = urllib2.urlopen(request)
Then I read it with BeautifulSoup:
soup = BeautifulSoup(page.read())
print soup.prettify()
But alas! I only get the web page as though I didn't click "Get NAVPS" :( Am I missing something? Is the server sending the table in a separate stream? How do I get to it?

When I look at the POST request in Firebug, I see one more parameter that you aren't passing: "type" is set to "Year". I don't know whether adding it will get you the data; there are any number of other reasons the server might not serve it.
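If that parameter is the only missing piece, the fix is a one-line addition to the form data. A minimal sketch in the question's own Python 2 idiom, untested against the live server:

import urllib
import urllib2

form_data = {
    'mutualFund': '1',
    'year': '1995',
    'dmonth': 'Month',
    'dday': 'Day',
    'dyear': 'Year',
    'type': 'Year',  # the parameter Firebug shows but the script omits
}
url = "http://www.philequity.net/pefi_historicalnavps.php"
request = urllib2.Request(url, urllib.urlencode(form_data))
page = urllib2.urlopen(request)
print page.read()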

Related

Is it possible to trigger Google Maps API's JavaScript click listener and then scrape the data using Python?

http://ihe.istanbul/satis-noktalari
I would like to scrape the latLng points of the targeted company's dealerships from a map that uses the Google Maps API.
I tried using requests_html to render the JavaScript on the page, then used BeautifulSoup to reach the element.
from bs4 import BeautifulSoup
from requests_html import HTMLSession
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
resp = session.get("http://ihe.istanbul/satis-noktalari")
# Run JavaScript code on webpage
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
html_content = soup.contents[1]
# the script tag at index 23 turned out to hold the map-initialisation code
_script = html_content.find_all("script")[23]
print(_script)
The print statement pointed me to the part of the script where the latLng point values appear once the click event is triggered.
However, the site's URL does not update to reflect the selected zone of the city.
To explain myself clearly, I created two pictures that show exactly what I want to do:
[Screenshot: the output with no city selected]
[Screenshot: the triggered click event, showing the desired result]
If the URL were updated after the JavaScript event is triggered via the Google Maps API, I could simply request that URL.
How can I trigger the event from Python, or how can I scrape the data it produces? The code I provided only shows the untriggered state.
The page dynamically issues POST XHR requests using the value attribute of the option elements inside the select dropdown. You can extract those values, mimic the POST requests the page makes, then use a regex to pull the lat/lon out of each response. The following shows the logic for grabbing the map-centre co-ordinates for each district:
import re
import requests
from bs4 import BeautifulSoup as bs

headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Referer': 'http://ihe.istanbul/satis-noktalari',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

results = {}

with requests.Session() as s:
    s.headers = headers
    r = s.get('http://ihe.istanbul/satis-noktalari')
    soup = bs(r.content, 'lxml')
    # map district name -> its option value, skipping the placeholder first option
    options = {i.text: i['value'] for i in soup.select('[name=ilceID] option:nth-child(n+2)')}
    for k, v in options.items():
        data = {'ilceID': v, 'SatisBufe': '1'}
        r = s.post('http://ihe.istanbul/satis-noktalari', data=data)
        lat, lon = re.search(r'google\.maps\.LatLng\(([\d.]+),\s?([\d.]+)\)', r.text).groups()
        print(k, f'lat = {lat}', f'lon = {lon}')
        results[k] = [lat, lon]
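If you want to keep the co-ordinates rather than just print them, the results dict serialises straight to JSON (a small follow-up sketch; the file name is arbitrary):

import json

with open('dealership_coords.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, ensure_ascii=False, indent=2)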

Webscraping Blockchain data seemingly embedded in Javascript through Python, is this even the right approach?

I'm referencing this url: https://tracker.icon.foundation/block/29562412
If you scroll down to "Transactions", it shows 2 transactions with separate links; those links are essentially what I'm trying to grab. If I try a simple pd.read_csv(url) command, it clearly omits the data I'm looking for, so I thought it might be JavaScript-based and tried the following code instead:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://tracker.icon.foundation/block/29562412')
r.html.links
r.html.absolute_links
and I get the result "set()"
even though I was expecting the following:
['https://tracker.icon.foundation/transaction/0x9e5927c83efaa654008667d15b0a223f806c25d4c31688c5fdf34936a075d632', 'https://tracker.icon.foundation/transaction/0xd64f88fe865e756ac805ca87129bc287e450bb156af4a256fa54426b0e0e6a3e']
Is JavaScript even the right approach? I tried BeautifulSoup instead and had no luck on that end either.
You're right. This page is populated asynchronously using JavaScript, so BeautifulSoup and similar tools won't be able to see the specific content you're trying to scrape.
However, if you log your browser's network traffic, you can see some (XHR) HTTP GET requests being made to a REST API, which serves its results in JSON. This JSON happens to contain the information you're looking for. It actually makes several such requests to various API endpoints, but the one we're interested in is called txList (short for "transaction list" I'm guessing):
def main():
    import requests

    url = "https://tracker.icon.foundation/v3/block/txList"
    params = {
        "height": "29562412",
        "page": "1",
        "count": "10"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    base_url = "https://tracker.icon.foundation/transaction/"
    for transaction in response.json()["data"]:
        print(base_url + transaction["txHash"])

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
https://tracker.icon.foundation/transaction/0x9e5927c83efaa654008667d15b0a223f806c25d4c31688c5fdf34936a075d632
https://tracker.icon.foundation/transaction/0xd64f88fe865e756ac805ca87129bc287e450bb156af4a256fa54426b0e0e6a3e
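The request above asks for at most 10 transactions (count=10). If a block ever holds more than that, you could presumably page through the same endpoint; a sketch, assuming the API returns an empty data list past the last page:

import requests

def all_tx_hashes(height, count=10):
    url = "https://tracker.icon.foundation/v3/block/txList"
    page = 1
    while True:
        response = requests.get(url, params={"height": height, "page": page, "count": count})
        response.raise_for_status()
        data = response.json().get("data") or []
        if not data:
            break
        for transaction in data:
            yield transaction["txHash"]
        page += 1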

Python: Retrieve post parameter from javascript button

I am writing a Python script to obtain statistical data about the public schools of the city where I live. With the following code I get the source of a page that shows, paginated, the first 100 results of a total of 247 schools:
import requests
url = "http://www.madrid.org/wpad_pub/run/j/BusquedaAvanzada.icm"
post_call = {'Public title': 'S', 'cdMuni': '079', 'cdNivelEdu': '6545'}
r = requests.post(url, data = post_call)
The page can be viewed here.
On that page there is a button that triggers a JavaScript function to download a CSV file with all 247 results. I was thinking of using Selenium to download that file, but I have seen with Tamper Data that when the button is pressed a POST call occurs in which the parameter codCentrosExp is sent with the codes of all 247 schools. The parameter looks like this:
CodCentrosExp = 28077877%3B28077865%3B28063751%3B28018392%3B28018393%...(thus up to the 247 codes)
This makes my work easier, since I don't have to download the CSV file, open it, select the code column, and so on. I could capture the parameter with Tamper Data, but my question is: how can I get it from my Python script, without having to use Tamper Data?
I finally found the parameter in the page's source code and extracted it as follows:
import re
from bs4 import BeautifulSoup

schools = BeautifulSoup(r.content, "lxml")
school_codes = schools.findAll(attrs={"name": "codCentrosExp", "value": re.compile("^.+$")})[0]["value"]
school_codes = school_codes.split(";")
Anyway, if anyone knows how to respond to the original question, I would be grateful to know how it could be done.
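As for the original question (capturing the POST parameter programmatically instead of with Tamper Data), one option is to drive a real browser and log its traffic from Python. A sketch assuming the Selenium Wire library, which records the requests a Selenium-driven browser makes; the form-filling and button click are elided, and the filter is only a guess at what the request will look like:

from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Firefox()
driver.get("http://www.madrid.org/wpad_pub/run/j/BusquedaAvanzada.icm")
# ... fill in the search form and click the CSV-download button here ...
for request in driver.requests:
    body = (request.body or b"").decode("latin-1", "ignore")
    if request.method == "POST" and "codCentrosExp" in body:
        print(body)  # contains the 247 semicolon-separated school codes
driver.quit()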

Python: Javascript rendered webpage not parsing

I want to parse the information at the following URL: the name of the trade, the strategy description, and the transactions in the "Trading History" and "Open Positions" sections. When I parse the page, I do not get this data.
I am new to parsing JavaScript-rendered web pages, so I would appreciate an explanation of why my code below isn't working.
import bs4 as bs
import urllib
import dryscrape
import sys
import time
url = 'https://www.zulutrade.com/trader/314062/trading'
sess = dryscrape.Session()
sess.visit(url)
time.sleep(10)
sauce = sess.body()
soup = bs.BeautifulSoup(sauce, 'lxml')
Thanks!
The URL in your code won't get you anything, because the original URL you should play with is the one pasted below. The one you tried to work with automatically fetches its data from it:
https://www.zulutrade.com/zulutrade-client/traders/api/providers/314062/tradeHistory?
Scraping the JSON data behind the table goes as follows:
import requests

r = requests.get('https://www.zulutrade.com/zulutrade-client/traders/api/providers/314062/tradeHistory?')
j = r.json()
items = j['content']
for item in items:
    print(item['currency'], item['pips'], item['tradeType'], item['transactionCurrency'], item['id'])
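Since the endpoint returns plain JSON, the same items list also drops straight into pandas if you want the history as a table (a small follow-up sketch; the columns are the fields printed above):

import pandas as pd

df = pd.DataFrame(items)
print(df[['currency', 'pips', 'tradeType', 'transactionCurrency', 'id']].head())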

Scrapy - making selections from dropdown(e.g.date) on webpage

I'm new to Scrapy and Python, and I am trying to scrape data off the following start URL.
After login, this is my start URL:
start_urls = ["http://www.flightstats.com/go/HistoricalFlightStatus/flightStatusByFlight.do?"]
(a) From there I need to interact with the web page to select ---by-airport--- and then make the ---airport, date, time period--- selections. How can I do that? I would like to loop over all time periods and past dates (one possible approach is sketched after the code at the end of this post).
I have used Firebug to see the source; I cannot show it here as I do not have enough points to post images.
I read a post mentioning the use of Splinter.
(b) After the selections, it will lead me to a page with links to the eventual pages holding the information I want. How do I collect the links and make Scrapy look into every one to extract the information? Using rules? Where should I insert the rules/LinkExtractor function?
I am willing to try myself; I hope I can be pointed to posts that can guide me. I am a student and have spent more than a week on this: I have done the Scrapy tutorial and the Python tutorial, read the Scrapy documentation, and searched previous Stack Overflow posts, but did not manage to find any that cover this.
A million thanks.
My code so far, to log in, plus the items to scrape via XPath from the eventual target site:
import scrapy
from scrapy.http import Request, FormRequest
from scrapy.spiders.init import InitSpider
from tutorial.items import FlightItem

class flightSpider(InitSpider):  # init_request()/initialized() are InitSpider hooks, not plain Spider ones
    name = "flight"
    allowed_domains = ["flightstats.com"]
    login_page = 'https://www.flightstats.com/go/Login/login_input.do;jsessionid=0DD6083A334AADE3FD6923ACB8DDCAA2.web1:8009?'
    start_urls = [
        "http://www.flightstats.com/go/HistoricalFlightStatus/flightStatusByFlight.do?"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(
            response,
            formdata={'loginForm_email': 'marvxxxxxx@hotmail.com',
                      'password': 'xxxxxxxx'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by the login request to see if we are
        successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()  # ****THIS LINE FIXED THE LAST PROBLEM*****
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        for sel in response.xpath('/html/body/div[2]/div[2]/div'):
            item = FlightItem()  # was flightstatsItem, which is never imported
            item['flight_number'] = sel.xpath('/div[1]/div[1]/h2').extract()
            item['aircraft_make'] = sel.xpath('/div[4]/div[2]/div[2]/div[2]').extract()
            item['dep_date'] = sel.xpath('/div[2]/div[1]/div').extract()
            item['dep_airport'] = sel.xpath('/div[1]/div[2]/div[2]/div[1]').extract()
            item['arr_airport'] = sel.xpath('/div[1]/div[2]/div[2]/div[2]').extract()
            item['dep_gate_scheduled'] = sel.xpath('/div[2]/div[2]/div[1]/div[2]/div[2]').extract()
            item['dep_gate_actual'] = sel.xpath('/div[2]/div[2]/div[1]/div[3]/div[2]').extract()
            item['dep_runway_actual'] = sel.xpath('/div[2]/div[2]/div[2]/div[3]/div[2]').extract()
            item['dep_terminal'] = sel.xpath('/div[2]/div[2]/div[3]/div[2]/div[1]').extract()
            item['dep_gate'] = sel.xpath('/div[2]/div[2]/div[3]/div[2]/div[2]').extract()
            item['arr_gate_scheduled'] = sel.xpath('/div[3]/div[2]/div[1]/div[2]/div[2]').extract()
            item['arr_gate_actual'] = sel.xpath('/div[3]/div[2]/div[1]/div[3]/div[2]').extract()
            item['arr_terminal'] = sel.xpath('/div[3]/div[2]/div[3]/div[2]/div[1]').extract()
            item['arr_gate'] = sel.xpath('/div[3]/div[2]/div[3]/div[2]/div[2]').extract()
            yield item
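For part (a), one Scrapy-native approach is to skip driving the dropdowns altogether and submit the search form once per date with FormRequest. The following is a hedged sketch of two methods to add to the spider above, not tested against the live site: the form field names and date format are hypothetical, so inspect the real form with Firebug and substitute the actual ones. For part (b), the second callback follows every result link to the detail page, which does the same job as a Rule/LinkExtractor pair:

import datetime

import scrapy
from scrapy.http import FormRequest

def browse_dates(self, response):
    """Submit the search form once per past date (part (a))."""
    start = datetime.date(2014, 1, 1)  # arbitrary earliest date for the sketch
    for offset in range((datetime.date.today() - start).days):
        day = start + datetime.timedelta(days=offset)
        yield FormRequest.from_response(
            response,
            formdata={
                'airport': 'SIN',                  # hypothetical field name/value
                'date': day.strftime('%Y-%m-%d'),  # hypothetical date format
            },
            callback=self.parse_result_links)

def parse_result_links(self, response):
    """Follow every link on the results page (part (b)); the existing
    parse() callback then extracts the items from each detail page."""
    for href in response.xpath('//a[contains(@href, "flightStatusByFlight")]/@href').extract():
        yield scrapy.Request(response.urljoin(href), callback=self.parse)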
