Scrapy - making selections from a dropdown (e.g. date) on a webpage - javascript

I'm new to Scrapy and Python, and I am trying to scrape data starting from the following URL.
After login, this is my start URL:
start_urls = ["http://www.flightstats.com/go/HistoricalFlightStatus/flightStatusByFlight.do?"]
(a) From there I need to interact with the webpage to select "by airport",
and then make the airport, date, and time-period selections.
How can I do that? I would like to loop over all time periods and past dates.
I have used Firebug to see the source; I cannot show it here as I do not have enough reputation to post images.
I read a post mentioning the use of Splinter.
(b) After the selections, it leads me to a page with links to the eventual pages holding the information I want. How do I collect those links and make Scrapy look into every one of them to extract the information?
Using rules? Where should I insert the rules / LinkExtractor function?
I am willing to try myself and just hope for pointers to posts that can guide me. I am a student and I have spent more than a week on this. I have done the Scrapy tutorial, the Python tutorial, read the Scrapy documentation, and searched previous Stack Overflow posts, but I did not manage to find any that cover this.
A million thanks.
My code so far, to log in, and the items to scrape via XPath from the eventual target site:
import scrapy
from scrapy.http import Request, FormRequest
from scrapy.spiders.init import InitSpider

from tutorial.items import FlightItem


class FlightSpider(InitSpider):
    # InitSpider is what provides init_request()/initialized(), so the
    # login flow below runs before the start_urls are crawled.
    name = "flight"
    allowed_domains = ["flightstats.com"]
    login_page = 'https://www.flightstats.com/go/Login/login_input.do;jsessionid=0DD6083A334AADE3FD6923ACB8DDCAA2.web1:8009?'
    start_urls = [
        "http://www.flightstats.com/go/HistoricalFlightStatus/flightStatusByFlight.do?",
    ]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request from the login form."""
        return FormRequest.from_response(
            response,
            formdata={'loginForm_email': 'marvxxxxxx#hotmail.com',
                      'password': 'xxxxxxxx'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by the login request to see if we are
        successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()  # ****THIS LINE FIXED THE LAST PROBLEM*****
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        for sel in response.xpath('/html/body/div[2]/div[2]/div'):
            # FlightItem is the item class imported above; the XPaths are
            # relative to the selected div ('./...'), not absolute ('/...').
            item = FlightItem()
            item['flight_number'] = sel.xpath('./div[1]/div[1]/h2').extract()
            item['aircraft_make'] = sel.xpath('./div[4]/div[2]/div[2]/div[2]').extract()
            item['dep_date'] = sel.xpath('./div[2]/div[1]/div').extract()
            item['dep_airport'] = sel.xpath('./div[1]/div[2]/div[2]/div[1]').extract()
            item['arr_airport'] = sel.xpath('./div[1]/div[2]/div[2]/div[2]').extract()
            item['dep_gate_scheduled'] = sel.xpath('./div[2]/div[2]/div[1]/div[2]/div[2]').extract()
            item['dep_gate_actual'] = sel.xpath('./div[2]/div[2]/div[1]/div[3]/div[2]').extract()
            item['dep_runway_actual'] = sel.xpath('./div[2]/div[2]/div[2]/div[3]/div[2]').extract()
            item['dep_terminal'] = sel.xpath('./div[2]/div[2]/div[3]/div[2]/div[1]').extract()
            item['dep_gate'] = sel.xpath('./div[2]/div[2]/div[3]/div[2]/div[2]').extract()
            item['arr_gate_scheduled'] = sel.xpath('./div[3]/div[2]/div[1]/div[2]/div[2]').extract()
            item['arr_gate_actual'] = sel.xpath('./div[3]/div[2]/div[1]/div[3]/div[2]').extract()
            item['arr_terminal'] = sel.xpath('./div[3]/div[2]/div[3]/div[2]/div[1]').extract()
            item['arr_gate'] = sel.xpath('./div[3]/div[2]/div[3]/div[2]/div[2]').extract()
            yield item
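For (a) and (b), one way to avoid driving a real browser is to reproduce the search form's POST with FormRequest, looping over the dates and time periods, and then follow the result links from the callback. Below is a rough sketch along those lines; the form field names, the example airport code, the time-period values and the results XPath are all assumptions that would need to be checked against what Firebug shows for the real form:

from datetime import date, timedelta

import scrapy
from scrapy.http import FormRequest


class FlightSearchSketch(scrapy.Spider):
    """Illustrative only: field names and values are placeholders, not the real form."""
    name = "flight_search_sketch"
    start_urls = ["http://www.flightstats.com/go/HistoricalFlightStatus/flightStatusByFlight.do?"]

    def parse(self, response):
        # Loop over the last 30 days and a few assumed time periods per day.
        for days_back in range(1, 31):
            day = date.today() - timedelta(days=days_back)
            for period in ['morning', 'afternoon', 'evening']:      # assumed values
                yield FormRequest.from_response(
                    response,
                    formdata={
                        'departureAirport': 'SIN',                   # assumed field name
                        'departureDate': day.strftime('%Y-%m-%d'),   # assumed format
                        'departureTime': period,                     # assumed field name
                    },
                    callback=self.parse_results)

    def parse_results(self, response):
        # Follow every link in the results list to the flight detail page.
        for href in response.xpath('//table//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_flight)

    def parse_flight(self, response):
        # The item extraction from the spider above would go here.
        pass

Note that CrawlSpider rules and LinkExtractor only kick in when you subclass CrawlSpider and leave its built-in parse() alone, so with a plain Spider it is usually simpler to follow the result links manually as in parse_results.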

Related

Python Web Scraping in Pagination in Single Page Application

I am currently researching how to scrape web content with Python when the pagination is driven by JavaScript in a single page application (SPA).
For example,
https://angular-8-pagination-example.stackblitz.io/
I googled and found that Scrapy by itself cannot scrape JavaScript / SPA driven content.
It needs to be paired with Splash. I am new to both Scrapy and Splash.
Is this correct?
Also, how do I call the JavaScript pagination method? When I inspect the element, it is just an anchor without an href, only a JavaScript event.
Please advise.
Thank you,
Hatjhie
You need to use a SplashRequest to render the JS. You then need to get the pagination text. Generally I use re.search with an appropriate regex pattern to extract the relevant numbers. You can then assign them to a current-page variable and a total-pages variable.
Typically a website moves to the next page by incrementing ?page=x or ?p=x at the end of the URL. You can then increment this value to scrape all the relevant pages.
The overall pattern looks like this:
import re

import scrapy
from scrapy_splash import SplashRequest

from ..items import Item

proxy = 'http://your.proxy.com:PORT'
current_page_xpath = '//div[your x path selector]/text()'
last_page_xpath = '//div[your other x path selector]/text()'


class MySpider(scrapy.Spider):
    name = 'my_spider'
    allowed_domains = ['domain.com']
    start_urls = ['https://www.domaintoscrape.com/page=1']

    def start_requests(self):
        for url in self.start_urls:
            # Render the JS with Splash so the pagination text is in the response.
            yield SplashRequest(url=url, callback=self.parse, meta={'proxy': proxy})

    def get_page_nbr(self, value):
        # You may need a more complex regex to get page numbers;
        # most of the time they are in the form "page X of Y".
        match = re.search(r'\d+', value or '')
        return match.group(0) if match else None

    def parse(self, response):
        # Get the last and current page numbers from the response.
        last_page = self.get_page_nbr(response.xpath(last_page_xpath).get())
        current_page = self.get_page_nbr(response.xpath(current_page_xpath).get())

        # ...do something with your response here...

        # If the current page is less than the last page, make another request
        # by incrementing the page number in the URL.
        if current_page and last_page and int(current_page) < int(last_page):
            next_url = response.url.replace(
                f'page={int(current_page)}', f'page={int(current_page) + 1}')
            yield SplashRequest(url=next_url, callback=self.parse, meta={'proxy': proxy})
        elif current_page == last_page:
            self.logger.info(f'processed {last_page} pages for {response.url}')
Finally, it's worth having a look on YouTube, as there are a number of tutorials on scrapy_splash and pagination.
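One thing to set up before any of this works: SplashRequest only takes effect when a Splash instance is running (e.g. the Docker image) and the scrapy-splash middleware is enabled in the project's settings.py. The values below follow the scrapy-splash README; SPLASH_URL is whatever host and port your instance listens on:

# settings.py
SPLASH_URL = 'http://localhost:8050'  # adjust to where your Splash instance runs

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'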

Scrapy, scraping elements only visible during runtime

I am new to Scrapy, HTML, and JavaScript. I am trying to source a list of all branches and agents for our agency from its website. Most of the information I need can be extracted from an AJAX result: www.tysonprop.co.za/ajax/agents/?branch_id=[id]
The challenge is twofold:
The branch names displayed on the website (https://www.tysonprop.co.za/agents/) are contained within span elements that are not visible when viewing the page source, which means Scrapy cannot find the information. For example, "Tyson Properties Fourways Office" should in theory be located at xpath(//div[@id="select2-result-label-76"]/span[@class="select2-match"]/text()) (see the Inspect Element screenshot: https://i.stack.imgur.com/1kjk8.png).
The AJAX call requires the branch id, and I can't figure out how the page translates the branch name selected in the drop-down into a branch id, so I can't intercept that logic. I.e. how do I extract a list of branch names with their corresponding IDs?
I have done an extensive web search without much success. Any help would be appreciated.
import json

import scrapy

from ..items import Agent  # the Agent item defined in the project's items.py


class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/ajax/agents/?branch_id=25'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        json_data = json.loads(response.text)
        branch_id = json_data['branch']['id']
        branch_name = json_data['branch']['branch_name']
        branch_tel = json_data['branch']['get_dialable_telephone_number']

        # Loop through all of the agents in the branch.
        for agent_data in json_data['agents']:
            agent = Agent()
            agent['id'] = agent_data['id']
            agent['branch_id'] = branch_id
            agent['branch_name'] = branch_name
            agent['branch_tel'] = branch_tel
            agent['privy_seal_url'] = agent_data['privy_seal_url']
            yield agent
Related question: Scrapy xpath not extracting div containing special characters <%=
If you look at the page source, you can see the branch ids and names are present in the HTML, located under the element with name="agent_search".
With the logic below you go through the different branches and get their id & name:
branches_xpath = response.xpath('//*[@name="agent_search"]//option')
for branch_xpath in branches_xpath[1:]:  # skip the first option as that one is empty
    branch_id = branch_xpath.xpath('./@value').get()
    branch_name = branch_xpath.xpath('./text()').get()
    print(f"branch_id: {branch_id}, branch_name: {branch_name}")

Scrapy + Selenium 302 redirection handling

So I'm building a web crawler that logs into my bank account and gathers data about my spending. I was originally going to use only Scrapy, but it didn't work since the First Merit page uses JavaScript to log in, so I piled Selenium on top.
My code logs in (you first need to input the username, and then the password on a separate page, not both together as on most sites) through a series of yielded Requests with specific callback functions that handle the next step.
import scrapy
from scrapy import Request
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait


class LoginSpider(scrapy.Spider):
    name = 'www.firstmerit.com'
    # allowed_domains = ['https://www.firstmeritib.com']
    start_urls = ['https://www.firstmeritib.com/AccountHistory.aspx?a=1']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        # Obtain the elements I need to fill in with my own details.
        username = WebDriverWait(self.driver, 10).until(
            lambda driver: driver.find_element_by_xpath('//*[@id="txtUsername"]'))
        login_button = WebDriverWait(self.driver, 10).until(
            lambda driver: driver.find_element_by_xpath('//*[@id="btnLogin"]'))

        # The actual interaction.
        username.send_keys("username")
        login_button.click()

        # Logging in is split across two functions since the website requires me
        # to enter my username first, which redirects to a password page where I
        # can finally reach my account (after inputting the password).
        yield Request(url=self.driver.current_url,
                      callback=self.password_handling,
                      meta={'dont_redirect': True,
                            'handle_httpstatus_list': [302],
                            'cookiejar': response})

    def password_handling(self, response):
        print("^^^^^^")
        print(response.url)

        password = WebDriverWait(self.driver, 10).until(
            lambda driver: driver.find_element_by_xpath('//*[@id="MainContent_txtPassword"]'))
        login_button2 = WebDriverWait(self.driver, 10).until(
            lambda driver: driver.find_element_by_xpath('//*[@id="MainContent_btnLogin"]'))
        password.send_keys("password")
        login_button2.click()

        print("*****")
        print(self.driver.current_url)
        print("*****")

        yield Request(url=self.driver.current_url,
                      callback=self.after_login,  # dont_filter=True,
                      meta={'dont_redirect': True,
                            'handle_httpstatus_list': [302],
                            'cookiejar': response.meta['cookiejar']})

    def after_login(self, response):
        print("***")
        print(response.url)
        print("***")
        print(response.body)
        if "Account Activity" not in response.body:
            self.logger.error("Login failed")
            return
        else:
            print("you got through!")
The issue is that once I finally get to my account page where all my spending is displayed, I can't actually access the HTML data. I've properly handled the 302 redirections, but the "meta =" options seem to take me to the page through Selenium while not letting me scrape it.
Instead of getting all the data from response.body in the after_login function, I get the following:
<html><head><title>Object moved</title></head><body>
<h2>Object moved to here.</h2>
</body></html>
How do I actually get hold of that information so I can scrape it?
Is this redirection put in place by the bank to protect the account from being crawled?
Thank you!
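One detail worth checking: the plain Request to driver.current_url carries none of the Selenium session's cookies, so the bank bounces it and the response is only the bare "Object moved" redirect body. Two common workarounds are to parse the page Selenium has already rendered, or to copy the browser's cookies onto the Scrapy request. A sketch of both (the table XPath is a placeholder):

from scrapy import Request
from scrapy.selector import Selector


def rows_from_rendered_page(driver):
    # Option 1: skip the extra HTTP request entirely and scrape the DOM the
    # Selenium driver is already showing after the final click.
    sel = Selector(text=driver.page_source)
    return sel.xpath('//table//tr')  # placeholder XPath for the activity table


def request_as_logged_in_user(driver, url, callback):
    # Option 2: copy the browser session's cookies onto the Scrapy request so
    # it is not bounced back with the "Object moved" body.
    cookies = {c['name']: c['value'] for c in driver.get_cookies()}
    return Request(url=url, callback=callback, cookies=cookies)

Inside password_handling that would mean yielding request_as_logged_in_user(self.driver, self.driver.current_url, self.after_login) instead of the bare Request, or not yielding a request at all and reading rows_from_rendered_page(self.driver) directly.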

Cannot get entire web page after query

I'm trying to scrape the historical NAVPS tables found on this page:
http://www.philequity.net/pefi_historicalnavps.php
All the code here is the contents of my minimal working script. So it starts with:
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
opener = urllib2.build_opener()
urllib2.install_opener(opener)
After studying the web page using Chrome's Inspect Element, I find that the Form Data sent are the following:
form_data = {}
form_data['mutualFund'] = '1'
form_data['year'] = '1995'
form_data['dmonth'] = 'Month'
form_data['dday'] = 'Day'
form_data['dyear'] = 'Year'
So I continue building up the request:
url = "http://www.philequity.net/pefi_historicalnavps.php"
params = urllib.urlencode(form_data)
request = urllib2.Request(url, params)
I expect this to be the equivalent of clicking "Get NAVPS" after filling in the form:
page = urllib2.urlopen(request)
Then I read it with BeautifulSoup:
soup = BeautifulSoup(page.read())
print soup.prettify()
But alas! I only get the web page as though I didn't click "Get NAVPS" :( Am I missing something? Is the server sending the table in a separate stream? How do I get to it?
When I look at the POST request in Firebug, I see one more parameter that you aren't passing: "type" is "Year". I don't know whether this will get the data for you; there are any number of other reasons it might not serve you the data.
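In the question's script, that would just mean one more key in form_data before encoding the POST body. A sketch (whether 'Year' is the right value for every query, and what the returned table looks like, are assumptions to verify against the real response):

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.philequity.net/pefi_historicalnavps.php"
form_data = {
    'mutualFund': '1',
    'year': '1995',
    'dmonth': 'Month',
    'dday': 'Day',
    'dyear': 'Year',
    'type': 'Year',  # the extra parameter seen on the POST in Firebug
}

# POST the form the same way the browser does and parse the returned page.
request = urllib2.Request(url, urllib.urlencode(form_data))
page = urllib2.urlopen(request)
soup = BeautifulSoup(page.read())

# The NAVPS history should come back as an HTML table, so pull out its rows.
for row in soup.findAll('tr'):
    print [''.join(cell.findAll(text=True)) for cell in row.findAll('td')]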

Python- Get Stock Information for a company for a range of dates

I'm trying to get more than just the stock information for the current time period, and I can't figure out whether Google Finance allows retrieving information for more than just one date. For example, if I wanted to find the Google stock value over the last 30 days and return that data as a list, how would I go about doing it?
Using the code below only gets me a single value:
import json
import urllib2


class GoogleFinanceAPI:
    def __init__(self):
        self.prefix = "http://finance.google.com/finance/info?client=ig&q="

    def get(self, symbol, exchange):
        url = self.prefix + "%s:%s" % (exchange, symbol)
        u = urllib2.urlopen(url)
        content = u.read()
        # The response is prefixed with "// ", so strip it before parsing the JSON.
        obj = json.loads(content[3:])
        return obj[0]


c = GoogleFinanceAPI()
quote = c.get("MSFT", "NASDAQ")
print quote
Here is a recipe to get historical values from Google Finance:
http://code.activestate.com/recipes/576495-get-a-stock-historical-value-from-google-finance/
It looks like it returns the data in .csv format.
Edit: Here is your script modified to get the .csv. It works for me.
import urllib2
import csv


class GoogleFinanceAPI:
    def __init__(self):
        self.url = "http://finance.google.com/finance/historical?client=ig&q={0}:{1}&output=csv"

    def get(self, symbol, exchange):
        page = urllib2.urlopen(self.url.format(exchange, symbol))
        content = page.readlines()
        page.close()
        reader = csv.reader(content)
        for row in reader:
            print row


c = GoogleFinanceAPI()
c.get("MSFT", "NASDAQ")
The best way forward is to use the APIs provided by Google. Specifically, look for the returns parameter where you specify how long a period you want.
Alternatively, if you want to do it via Python, work out the query pattern, i.e. where the date entry goes in the URL, substitute it, do a GET, then parse the result and add it to your result list.
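Building on the .csv version above, the rows can be collected into a list instead of printed, which covers the "return that data as a list" part of the question. A sketch; it assumes the CSV has a header row and lists the newest trading days first, which should be checked against the actual output:

import csv
import urllib2

HIST_URL = "http://finance.google.com/finance/historical?client=ig&q={0}:{1}&output=csv"


def last_n_days(symbol, exchange, n=30):
    """Return the most recent n rows of the historical CSV as dictionaries."""
    page = urllib2.urlopen(HIST_URL.format(exchange, symbol))
    reader = csv.DictReader(page)   # uses the CSV header row as keys
    rows = list(reader)
    page.close()
    return rows[:n]                 # assumes newest rows come first


for day in last_n_days("GOOG", "NASDAQ"):
    print day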
