Scrapy: Modify rules for scraping web page - javascript

I've started to use Scrapy for a project of mine to scrape data off a tennis website. Here is an example page that I want to scrape data from. As you can see, I want to scrape data for a tennis player. I need to recursively go through the entire page and gather 'Match Stats' (there's a link titled 'Match Stats' next to every match) for a player's matches. I've already written code to parse data from the opened match stats popup. All I need to do now is open these match stats pages through the initial spider.
In all the examples I've read up on, we can write rules to navigate Scrapy to the different URLs that need scraping. In my case, I just want to write a rule for the different match stats links. However, if you saw the page I want to scrape, 'Match Stats' links are in the following format: javascript:makePopup('match_stats_popup.php?matchID=183704502'). As I've read online (I might be wrong!), Scrapy can't deal with JavaScript and hence can't 'click' on that link. However, since the links are JavaScript popups, it's possible to add the match_stats_popup.php?matchID=183704502 part of the link to the main URL to get a standard HTML page:
http://www.tennisinsight.com/match_stats_popup.php?matchID=183704502
I am hoping I could modify the rules before scraping. In summary, I just want to find the links that are of the type javascript:makePopup('match_stats_popup.php?matchID=183704502') and modify them so that they are of the type http://www.tennisinsight.com/match_stats_popup.php?matchID=183704502.
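For illustration, the transformation I'm after is roughly this (a standalone sketch, not spider code; the helper name popup_to_url is just for the example):
import re

def popup_to_url(href):
    # pull the relative URL out of the javascript:makePopup('...') wrapper
    m = re.match(r"javascript:makePopup\('(.+?)'\)", href)
    if m:
        return 'http://www.tennisinsight.com/' + m.group(1)

print(popup_to_url("javascript:makePopup('match_stats_popup.php?matchID=183704502')"))
# http://www.tennisinsight.com/match_stats_popup.php?matchID=183704502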
This is what I've written in the rules so far, which doesn't open any pages:
rules = (
    Rule(SgmlLinkExtractor(allow='/match_stats_popup.php?matchID=\d+'),
        'parse_match', follow=True,
    ),
)
parse_match is the method which parses data from the opened match stats popup.
Hope my problem is clear enough!

Using BaseSgmlLinkExtractor or SgmlLinkExtractor you can specify both the tag(s) from which to extract links and a process_value function used for transforming each extracted link. There is a nice example in the official documentation. Here is the code for your example:
import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from myproject.items import TennisItem  # adjust to your project's items module


class GetStatsSpider(CrawlSpider):
    name = 'GetStats'
    allowed_domains = ['tennisinsight.com']
    start_urls = ['http://www.tennisinsight.com/player_activity.php?player_id=1']

    def getPopLink(value):
        # pull the relative URL out of javascript:makePopup('...')
        m = re.search(r"javascript:makePopup\('(.+?)'\)", value)
        if m:
            return m.group(1)

    rules = (
        Rule(SgmlLinkExtractor(allow=r"match_stats_popup\.php\?matchID=\d+",
                               restrict_xpaths='//td[@class="matchStyle"]',
                               tags='a', attrs='href', process_value=getPopLink),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        i = TennisItem()
        i['url_stats'] = response.url
        return i

Related

Python Web Scraping with Pagination in a Single Page Application

I am currently researching how to scrape web content using Python when pagination is driven by JavaScript in a single-page application (SPA).
For example,
https://angular-8-pagination-example.stackblitz.io/
I googled and found that Scrapy alone cannot scrape JavaScript/SPA-driven content.
It needs to use Splash. I am new to both Scrapy and Splash.
Is this correct?
Also, how do I call the JavaScript pagination method? I inspected the element; it's just an anchor without an href, driven by a JavaScript event.
Please advise.
Thank you,
Hatjhie
You need to use a SplashRequest to render the JS. You then need to get the pagination text. Generally I use re.search with the appropriate regex pattern to extract the relevant numbers. You can then assign them to a current-page variable and a total-pages variable.
Typically a website will move to the next page by incrementing ?page=x or ?p=x at the end of the URL. You can then increment this value to scrape all the relevant pages.
The overall pattern looks like this:
import re

import scrapy
from scrapy_splash import SplashRequest

from ..items import Item  # your project's item definition

proxy = 'http://your.proxy.com:PORT'
current_page_xpath = '//div[your xpath selector]/text()'
last_page_xpath = '//div[your other xpath selector]/text()'


def get_page_nbr(value):
    # you may need a more complex regex to get page numbers;
    # most of the time they are in the form "page X of Y" --
    # google is your friend
    m = re.search(r'\d+', value)
    return m.group(0) if m else None


class MySpider(scrapy.Spider):
    name = 'my_spider'
    allowed_domains = ['domain.com']
    start_urls = ['https://www.domaintoscrape.com/page=1']

    def start_requests(self):
        for url in self.start_urls:
            # render the JS with Splash before parsing
            yield SplashRequest(url=url, callback=self.parse,
                                meta={'proxy': proxy})

    def parse(self, response):
        # get the last and current page numbers from the response
        last_page = get_page_nbr(response.xpath(last_page_xpath).get())
        current_page = get_page_nbr(response.xpath(current_page_xpath).get())
        # ... do something with your response ...
        # if the current page is less than the last page, make another
        # request by incrementing the page number in the URL
        if int(current_page) < int(last_page):
            ajax_url = response.url.replace(f'page={int(current_page)}',
                                            f'page={int(current_page) + 1}')
            yield SplashRequest(url=ajax_url, callback=self.parse,
                                meta={'proxy': proxy})
        # optional
        if current_page == last_page:
            print(f'processed {last_page} items for {response.url}')
Finally, it's worth having a look on YouTube, as there are a number of tutorials on scrapy_splash and pagination.
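If the pagination control really is an anchor with no href (as in your question), another option is to let Splash click it from a Lua script instead of rewriting the URL. A minimal sketch, assuming the anchor can be reached with a CSS selector like 'a.next' (hypothetical):
from scrapy_splash import SplashRequest

# Lua script run by Splash: load the page, click the pagination anchor,
# wait for the JS to update the DOM, then return the rendered HTML
click_script = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(1))
    local link = splash:select('a.next')  -- hypothetical selector
    link:mouse_click()
    assert(splash:wait(1))
    return splash:html()
end
"""

def next_page_request(url, callback):
    # the 'execute' endpoint runs the Lua script above
    return SplashRequest(url=url, callback=callback, endpoint='execute',
                         args={'lua_source': click_script})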

Scrapy, scraping elements only visible during runtime

I am new to Scrapy, HTML and JavaScript. I am trying to source a list of all branches and agents for our agency from the website. Most of the information I need can be extracted from an AJAX result: www.tysonprop.co.za/ajax/agents/?branch_id=[id]
The challenge is twofold:
The branch names displayed on the website (https://www.tysonprop.co.za/agents/) are contained within span elements not visible when viewing the page source. This means that Scrapy cannot find the information. For example, "Tyson Properties Fourways Office" should in theory be located at: //div[@id="select2-result-label-76"]/span[@class="select2-match"]/text() (see the inspect-element screenshot: https://i.stack.imgur.com/1kjk8.png).
The AJAX call requires the branch ID. I can't figure out how the page translates the branch name selected in the drop-down into a branch ID, so I can't intercept the logic. I.e., how do I extract a list of branch names with their corresponding IDs?
I have done an extensive web search without much success. Any help would be appreciated.
import json

import scrapy

from ..items import Agent  # your project's item definition


class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/ajax/agents/?branch_id=25'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        json_data = json.loads(response.text)
        branch_id = json_data['branch']['id']
        branch_name = json_data['branch']['branch_name']
        branch_tel = json_data['branch']['get_dialable_telephone_number']
        # loop through all of the agents, yielding one item per agent
        for agent_data in json_data['agents']:
            agent = Agent()
            agent['id'] = agent_data['id']
            agent['branch_id'] = branch_id
            agent['branch_name'] = branch_name
            agent['branch_tel'] = branch_tel
            agent['privy_seal_url'] = agent_data['privy_seal_url']
            yield agent
Related question: Scrapy xpath not extracting div containing special characters <%=
If you look into the page source, you can see the branch IDs and names are present in the HTML, located under the element with name="agent_search".
With the logic below you go through the different branches and get their ID and name:
branches_xpath = response.xpath('//*[@name="agent_search"]//option')
for branch_xpath in branches_xpath[1:]:  # skip the first option, as that one is empty
    branch_id = branch_xpath.xpath('./@value').get()
    branch_name = branch_xpath.xpath('./text()').get()
    print(f"branch_id: {branch_id}, branch_name: {branch_name}")

Python Flask data feed from Pandas Dataframe, dynamically define with unique endpoint

Hi, I am building a web app with Flask (Python). I've got a problem here:
@app.route('/analytics/signals/<ticker_url>')
def analytics_signals_com_page(ticker_url):
    all_ticker = full_list
    ticker_name = com_name
    ticker = ticker_url.upper()
    pricerec = sp500[ticker_url.upper()].tolist()
    timerec = sp500[ticker_url.upper()].index.tolist()
    return render_template('company.html', all_ticker=all_ticker, ticker_name=ticker_name, ticker=ticker, pricerec=pricerec, timerec=timerec)
Here I am defining company pages based on the ticker_url, so each page will contain different content. The problem is that everything is fine up to ticker = ticker_url.upper(); that works perfectly. But pricerec and timerec cause problems.
sp500 is a pandas DataFrame whose columns are companies like "AAPL", "GOOG", "MSFT", and so forth (505 companies); the index holds timestamps, and the values are the prices at each time.
So for pricerec, I take the ticker_url and use it to pull the specific company's prices as a list. And timerec takes the index (timestamps) as a list. I am passing these two variables into the company.html page.
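Concretely, the lookup described above behaves like this (a toy sketch with made-up prices):
import pandas as pd

# toy stand-in for the sp500 frame described above (made-up numbers)
sp500 = pd.DataFrame(
    {'AAPL': [150.0, 151.2], 'GOOG': [2700.0, 2710.5]},
    index=pd.to_datetime(['2021-01-04', '2021-01-05']),
)

ticker_url = 'aapl'
pricerec = sp500[ticker_url.upper()].tolist()       # [150.0, 151.2]
timerec = sp500[ticker_url.upper()].index.tolist()
# note: timerec holds pandas Timestamp objects, not plain strings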
But it causes an internal server error, and I don't know why.
My expectation was that when a user clicks a button that links to "~/analytics/signals/aapl", the company.html page would contain pricerec and timerec for me to draw a graph. But it didn't work like that; it produces an internal server error. I defined those two variables in the JavaScript as well, like I did for the other variables (all_ticker, ticker_name, and ticker).
Can anyone help me with this issue?
Thanks!

Static page with custom content depending on url

When I worked with PHP, I used to use parts of the URL to add words to a page template.
I've started looking at static pages using https://gohugo.io and am trying to reproduce this behaviour in JavaScript, so that I don't need to generate multiple pages (although generating pages is the point of static sites): every URL will use the same page template, just with different text taken from the URL.
Example (from my PHP site):
url = www.domain.tld/city/washington/
where I take the word after /city/ and put "washington" in my page content.
url = www.domain.tld/city/somecityname/
where I take the word after /city/ and put "somecityname" in my page content.
I looked at https://gohugo.io/extras/datafiles/ and https://gohugo.io/extras/datadrivencontent/, but this won't fix it the way I want (although I do have a CSV file with the city names).
The page will be hosted on GitHub Pages, so I can only use JavaScript/jQuery for this.
Try this code to get the city name:
var qrStr1 = window.location.href; // e.g. "http://www.domain.tld/city/washington/"
var data1 = qrStr1.split("/");     // ["http:", "", "www.domain.tld", "city", "washington", ""]
alert(data1[4]); // try a different index if your URL structure differs

base link and search api

I am attempting to query a database through an API which I don't fully understand. I have been sent an example of the API being used with a keyword search form. The form is an HTML file and uses jQuery to fetch JSON documents, format the items into an array, and display them.
I tried to build the design of my application and manipulate the form to work within my pages. The file that uses the API requires that a base link be used.
<base href="{{app_root}}">
If I remove this base link, the search functionality is lost. If I use the base link, all of my presentation and CSS is lost.
I thought maybe I could change the base link dynamically when I needed to call the search file:
<script type="text/javascript">
function setbasehref(basehref) {
    var thebase = document.getElementsByTagName("base");
    thebase[0].href = basehref;
}
setbasehref("{{app_root}}");
</script>
Then I would use setbasehref() to change it back to my original base link, but that didn't work.
I'm new to JavaScript and JSON, and I'm not entirely sure what app_root is doing. Any thoughts?
