Scrapy, scraping elements only visible during runtime - javascript

I am new to Scrapy, HTML and Javascript. I am trying to source a list of all branches and agents for our agency from the website. Most of the information I need can be extracted from an AJAX result: www.tysonprop.co.za/ajax/agents/?branch_id=[id]
The challenge is twofold:
The branch names displayed on the website (https://www.tysonprop.co.za/agents/) are contained within span elements that are not visible when viewing the page source, so Scrapy cannot find the information there. For example, "Tyson Properties Fourways Office" should in theory be located at: //div[@id="select2-result-label-76"]/span[@class="select2-match"]/text() (see the inspect-element screenshot [1]).
The AJAX call requires the branch id. I can't figure out how the page translates the branch name selected in the drop-down into a branch id, so I can't intercept that logic. I.e., how do I extract a list of branch names with their corresponding ids?
I have done an extensive web search without much success. Any help would be appreciated.
[1]: https://i.stack.imgur.com/1kjk8.png
import json

import scrapy

from ..items import Agent  # assuming Agent is the item class defined in the project's items.py


class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/ajax/agents/?branch_id=25'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        json_data = json.loads(response.text)
        branch_id = json_data['branch']['id']
        branch_name = json_data['branch']['branch_name']
        branch_tel = json_data['branch']['get_dialable_telephone_number']

        # Loop through all of the agents
        for agent_data in json_data['agents']:
            agent = Agent()
            agent['id'] = agent_data['id']
            agent['branch_id'] = branch_id
            agent['branch_name'] = branch_name
            agent['branch_tel'] = branch_tel
            agent['privy_seal_url'] = agent_data['privy_seal_url']
            yield agent
Related question: Scrapy xpath not extracting div containing special characters <%=

If you look at the page source, you can see that the branch ids and names are present in the HTML, inside the select element with name="agent_search".
With the logic below you go through the different branches and get their id & name:
branches_xpath = response.xpath('//*[@name="agent_search"]//option')
for branch_xpath in branches_xpath[1:]:  # skip the first option, as that one is empty
    branch_id = branch_xpath.xpath('./@value').get()
    branch_name = branch_xpath.xpath('./text()').get()
    print(f"branch_id: {branch_id}, branch_name: {branch_name}")


Python Web Scraping in Pagination in Single Page Application

I am currently researching how to scrape web content with Python when pagination is driven by JavaScript in a single page application (SPA).
For example:
https://angular-8-pagination-example.stackblitz.io/
I googled and found that Scrapy on its own cannot scrape JavaScript / SPA driven content and that it needs to be used with Splash. I am new to both Scrapy and Splash. Is this correct?
Also, how do I call the JavaScript pagination method? When I inspect the element, it's just an anchor without an href or a JavaScript event.
Please advise.
Thank you,
Hatjhie
You need to use a SplashRequest to render the JS. You then need to get the pagination text. Generally I use re.search with an appropriate regex pattern to extract the relevant numbers, which you can then assign to a current-page variable and a total-pages variable.
Typically a website will move to the next page by incrementing ?page=x or ?p=x at the end of the URL. You can then increment this value to scrape all the relevant pages.
The overall pattern looks like this:
import re

import scrapy
from scrapy_splash import SplashRequest

from ..items import Item

proxy = 'http://your.proxy.com:PORT'
current_page_xpath = '//div[your xpath selector]/text()'
last_page_xpath = '//div[your other xpath selector]/text()'


class MySpider(scrapy.Spider):
    name = 'my_spider'
    allowed_domains = ['domain.com']
    start_urls = ['https://www.domaintoscrape.com/page=1']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, meta={'proxy': proxy})

    def get_page_nbr(self, value):
        # You may need a more complex regex to get the page numbers;
        # most of the time they are in the form "page X of Y". Google is your friend.
        match = re.search(r'\d+', value)
        return match[0] if match else None

    def parse(self, response):
        # Get the last and current page numbers from the response.
        last_page = self.get_page_nbr(response.xpath(last_page_xpath).get())
        current_page = self.get_page_nbr(response.xpath(current_page_xpath).get())

        # ... do something with your response ...

        # If the current page is less than the last page, make another request
        # by incrementing the page number in the URL.
        if current_page and last_page and int(current_page) < int(last_page):
            next_url = response.url.replace(f'page={current_page}',
                                            f'page={int(current_page) + 1}')
            yield scrapy.Request(url=next_url, callback=self.parse, meta={'proxy': proxy})

        # Optional
        if current_page == last_page:
            print(f'processed {last_page} items for {response.url}')
Finally, it's worth having a look on YouTube, as there are a number of tutorials on scrapy_splash and pagination.
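For the Splash part itself, here is a minimal sketch of a JS-rendered request. It assumes a Splash instance is running and scrapy-splash is enabled in settings.py; the spider name and URL are placeholders.

import scrapy
from scrapy_splash import SplashRequest


class JsPageSpider(scrapy.Spider):
    # Hypothetical spider showing only the Splash-rendered request.
    name = 'js_page'
    start_urls = ['https://www.domaintoscrape.com/page=1']

    def start_requests(self):
        for url in self.start_urls:
            # 'wait' gives the page's JavaScript time to run before the HTML is returned.
            yield SplashRequest(url, callback=self.parse, args={'wait': 1})

    def parse(self, response):
        # response.text now contains the JavaScript-rendered HTML.
        self.logger.info('Rendered %d bytes for %s', len(response.text), response.url)

The scrapy-splash README lists the SPLASH_URL, downloader/spider middleware and dupefilter settings that have to be added to settings.py for this to work.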

Scrapy - making selections from dropdown (e.g. date) on webpage

I'm new to Scrapy and Python and I am trying to scrape data starting from the following URL.
After login, this is my start URL:
start_urls = ["http://www.flightstats.com/go/HistoricalFlightStatus/flightStatusByFlight.do?"]
(a) From there I need to interact with the webpage to select "by airport" and then make the airport, date and time period selections. How can I do that? I would like to loop over all time periods and past dates.
I have used Firebug to see the source; I cannot show it here as I do not have enough points to post images. I read a post mentioning the use of Splinter.
(b) After the selections, I am led to a page with links to the eventual pages holding the information I want. How do I collect those links and make Scrapy follow every one of them to extract the information? Using rules? Where should I insert the rules / link extractor function?
I am willing to try this myself and hope I can be pointed to posts that can guide me. I am a student and I have spent more than a week on this: I have done the Scrapy tutorial, the Python tutorial, read the Scrapy documentation and searched previous posts on Stack Overflow, but I did not manage to find posts that cover this.
A million thanks.
My code so far to log in, plus the items to scrape via XPath from the eventual target site:
import scrapy
from scrapy.http import Request, FormRequest
from scrapy.spiders.init import InitSpider

from tutorial.items import FlightItem


class FlightSpider(InitSpider):
    name = "flight"
    allowed_domains = ["flightstats.com"]
    login_page = 'https://www.flightstats.com/go/Login/login_input.do;jsessionid=0DD6083A334AADE3FD6923ACB8DDCAA2.web1:8009?'
    start_urls = [
        "http://www.flightstats.com/go/HistoricalFlightStatus/flightStatusByFlight.do?",
    ]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(
            response,
            formdata={'loginForm_email': 'marvxxxxxx@hotmail.com', 'password': 'xxxxxxxx'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by the login request to see if we are successfully logged in."""
        if "Sign Out" in response.text:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()  # THIS LINE FIXED THE LAST PROBLEM
        else:
            self.log("Login failed, bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        for sel in response.xpath('/html/body/div[2]/div[2]/div'):
            item = FlightItem()
            item['flight_number'] = sel.xpath('./div[1]/div[1]/h2').extract()
            item['aircraft_make'] = sel.xpath('./div[4]/div[2]/div[2]/div[2]').extract()
            item['dep_date'] = sel.xpath('./div[2]/div[1]/div').extract()
            item['dep_airport'] = sel.xpath('./div[1]/div[2]/div[2]/div[1]').extract()
            item['arr_airport'] = sel.xpath('./div[1]/div[2]/div[2]/div[2]').extract()
            item['dep_gate_scheduled'] = sel.xpath('./div[2]/div[2]/div[1]/div[2]/div[2]').extract()
            item['dep_gate_actual'] = sel.xpath('./div[2]/div[2]/div[1]/div[3]/div[2]').extract()
            item['dep_runway_actual'] = sel.xpath('./div[2]/div[2]/div[2]/div[3]/div[2]').extract()
            item['dep_terminal'] = sel.xpath('./div[2]/div[2]/div[3]/div[2]/div[1]').extract()
            item['dep_gate'] = sel.xpath('./div[2]/div[2]/div[3]/div[2]/div[2]').extract()
            item['arr_gate_scheduled'] = sel.xpath('./div[3]/div[2]/div[1]/div[2]/div[2]').extract()
            item['arr_gate_actual'] = sel.xpath('./div[3]/div[2]/div[1]/div[3]/div[2]').extract()
            item['arr_terminal'] = sel.xpath('./div[3]/div[2]/div[3]/div[2]/div[1]').extract()
            item['arr_gate'] = sel.xpath('./div[3]/div[2]/div[3]/div[2]/div[2]').extract()
            yield item
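For part (a), one possible approach, sketched under the assumption that the airport/date form submits as an ordinary GET/POST request, is to replay that form submission with FormRequest.from_response and loop over the dates you want. The field names below are placeholders, not the real flightstats names; inspect the actual form in Firebug/devtools and substitute them.

from datetime import date, timedelta

from scrapy.http import FormRequest


def request_history(self, response):
    # A method to add to the spider above; yields one search request per past date.
    for days_back in range(1, 8):  # loop over the past week as an example
        day = date.today() - timedelta(days=days_back)
        yield FormRequest.from_response(
            response,
            formdata={
                'departureAirport': 'JFK',           # hypothetical field name
                'departureDate': day.isoformat(),    # hypothetical field name
                'departureTimePeriod': 'ALL',        # hypothetical field name
            },
            callback=self.parse,  # or a dedicated callback that collects the result links
        )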

Grails controller - link to select element from .gsp file

So, I'm working on a Grails application to create a web dashboard. So far, I have created a controller that queries metrics from my database and renders them as JSON, which I feed to D3 and other JavaScript libraries in the front-end GSP file.
My question is this: I have a certain drop down menu on my front end as follows:
<select onchange="loadData()" id="metric" class="dropdown">
    <option value="sales">Sales</option>
    <option value="visits">Visits</option>
</select>
And my corresponding controller, in its simplest form, has the following actions:
(Importing grails.converters and groovy.sql.Sql)
def dataSource

def listJson = {
    def sql = new Sql(dataSource)
    def rows = sql.rows("select date_hour, total_revenue as sales, visits from table")
    sql.close()
    render rows as JSON
}
The problem is that I have a bunch of drop-down menus with quite a lot of options each, and if I did as above I would have to create a new JSON action for each one for D3 to use. Instead, can't I somehow insert the value of the selected option from the select element above into the SQL statement in the controller?
Something like the snippet below, but I don't know if it's possible, and if it is, what the right syntax would be. I'm using Grails 2.3.4 right now.
def listJson = {
    def sql = new Sql(dataSource)
    def rows = sql.rows("select date_hour, total_revenue as sales, ${index #metric} from table")
    sql.close()
    render rows as JSON
}
where index is my index.gsp file (where the select element is) and #metric is the id of that element.
Thanks in advance!
You can get the value of the select from params in your controller. For example:
def listJson = {
    def metric = params.metric
    // build query...
    def query = "... ${metric} ..."
}
However, I'd advise against building a SQL query like this: any time you accept user input as part of a SQL query you open a huge opportunity for SQL injection attacks. Why not use a higher-level database abstraction like GORM? Also note that Groovy expands GString placeholders in SQL queries differently from regular strings: they become parameters of a PreparedStatement, so they cannot stand in for a column name. You'd need to write your example like this: sql.rows("select date_hour, total_revenue as sales, " + metric + " from table")
Finally, while it depends on how you're submitting the request in loadData(), the usual convention for HTML input elements is to submit the value with the element's name attribute as the key, not its id.

Scrapy: Modify rules for scraping web page

I've started to use Scrapy for a project of mine to scrape data off a tennis website. Here is an example page that I want to scrape data from. As you can see, I want to scrape data for a tennis player. I need to recursively go through the entire page and gather 'Match Stats' (there's a link titled 'Match Stats' next to every match) for a player's matches. I've already written code to parse data from the opened match stats popup. All I need to do now is open these match stats pages through the initial spider.
In all the examples I've read, you can write rules to point Scrapy at the different URLs that need scraping. In my case, I just want to write a rule for the different match stats links. However, if you look at the page I want to scrape, the 'Match Stats' links are in the following format: javascript:makePopup('match_stats_popup.php?matchID=183704502'). As I've read online (I might be wrong!), Scrapy can't deal with JavaScript and hence can't 'click' on that link. However, since the links are JavaScript popups, it's possible to append the match_stats_popup.php?matchID=183704502 part of the link to the main URL to get a standard HTML page:
http://www.tennisinsight.com/match_stats_popup.php?matchID=183704502
I am hoping I can modify the rules before scraping. In summary, I just want to find the links of the type javascript:makePopup('match_stats_popup.php?matchID=183704502') and modify them so that they become http://www.tennisinsight.com/match_stats_popup.php?matchID=183704502.
This is what I've written in the rules so far, which doesn't open any pages:
rules = (
    Rule(SgmlLinkExtractor(allow='/match_stats_popup.php?matchID=\d+'),
         'parse_match', follow=True,
    ),
)
parse_match is the method which parses data from the opened match stats popup.
Hope my problem is clear enough!
Using BaseSgmlLinkExtractor or SgmlLinkExtractor you can specify both the tag(s) from which to extract links and a process_value function used to transform the extracted value. There is a nice example in the official documentation. Here is the code for your case:
import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from ..items import TennisItem  # assuming TennisItem is defined in the project's items.py


class GetStatsSpider(CrawlSpider):
    name = 'GetStats'
    allowed_domains = ['tennisinsight.com']
    start_urls = ['http://www.tennisinsight.com/player_activity.php?player_id=1']

    def getPopLink(value):
        # Turn javascript:makePopup('match_stats_popup.php?matchID=...') into a plain relative URL.
        m = re.search(r"javascript:makePopup\('(.+?)'\)", value)
        if m:
            return m.group(1)

    rules = (
        Rule(SgmlLinkExtractor(allow=r"match_stats_popup\.php\?matchID=\d+",
                               restrict_xpaths='//td[@class="matchStyle"]',
                               tags='a', attrs='href', process_value=getPopLink),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        i = TennisItem()
        i['url_stats'] = response.url
        return i
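SgmlLinkExtractor has since been removed from Scrapy. A rough sketch of the same idea on current releases, using scrapy.linkextractors.LinkExtractor (which also accepts allow, restrict_xpaths, tags, attrs and process_value); the spider name is hypothetical and the item is replaced by a plain dict:

import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def get_popup_link(value):
    # Strip the javascript:makePopup('...') wrapper so the extractor sees a normal relative URL.
    m = re.search(r"javascript:makePopup\('(.+?)'\)", value)
    return m.group(1) if m else None


class GetStatsModernSpider(CrawlSpider):
    # Hypothetical port of the spider above to current Scrapy releases.
    name = 'GetStatsModern'
    allowed_domains = ['tennisinsight.com']
    start_urls = ['http://www.tennisinsight.com/player_activity.php?player_id=1']

    rules = (
        Rule(LinkExtractor(allow=r'match_stats_popup\.php\?matchID=\d+',
                           restrict_xpaths='//td[@class="matchStyle"]',
                           tags='a', attrs='href', process_value=get_popup_link),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url_stats': response.url}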

Python - Get stock information for a company for a range of dates

I'm trying to get more than just the stock information at the current point in time, and I can't figure out whether Google Finance allows retrieving information for more than one date. For example, if I wanted to find out the value of Google stock over the last 30 days and return that data as a list, how would I go about doing this?
Using the code below only gets me a single value:
import json
import urllib2


class GoogleFinanceAPI:
    def __init__(self):
        self.prefix = "http://finance.google.com/finance/info?client=ig&q="

    def get(self, symbol, exchange):
        url = self.prefix + "%s:%s" % (exchange, symbol)
        u = urllib2.urlopen(url)
        content = u.read()
        obj = json.loads(content[3:])  # skip the leading characters before the JSON payload
        return obj[0]


c = GoogleFinanceAPI()
quote = c.get("MSFT", "NASDAQ")
print quote
Here is a recipe to get historical values from Google Finance:
http://code.activestate.com/recipes/576495-get-a-stock-historical-value-from-google-finance/
It looks like it returns the data in .csv format.
Edit: Here is your script modified to get the .csv. It works for me.
import urllib2
import csv


class GoogleFinanceAPI:
    def __init__(self):
        self.url = "http://finance.google.com/finance/historical?client=ig&q={0}:{1}&output=csv"

    def get(self, symbol, exchange):
        page = urllib2.urlopen(self.url.format(exchange, symbol))
        content = page.readlines()
        page.close()
        reader = csv.reader(content)
        for row in reader:
            print row


c = GoogleFinanceAPI()
c.get("MSFT", "NASDAQ")
The best way forward is to use the APIs provided by Google; specifically, look for the parameter where you specify how long a period of returns you want.
Alternatively, if you want to do it via Python, work out the query pattern, substitute the date where it belongs in the URL, do a GET, parse the result, and append it to your result list.
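To turn that into the list the question asks for, here is a small sketch that reuses the historical-CSV URL from the answer above and parses the rows into Python dicts. It assumes the CSV header row is Date,Open,High,Low,Close,Volume and that the most recent trading day comes first; both are assumptions, not something this thread confirms.

import csv
import urllib2


def get_history_as_list(symbol, exchange, days=30):
    # Reuses the historical CSV endpoint from the answer above.
    url = "http://finance.google.com/finance/historical?client=ig&q={0}:{1}&output=csv"
    page = urllib2.urlopen(url.format(exchange, symbol))
    rows = list(csv.DictReader(page))  # one dict per trading day, keyed by the header row
    page.close()
    # Assuming the feed lists the most recent trading day first,
    # the first `days` rows cover roughly the last month.
    return rows[:days]


history = get_history_as_list("MSFT", "NASDAQ")
for row in history:
    print row["Date"], row["Close"]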
