Crawling through pages with PostBack data (JavaScript, Python, Scrapy)

I'm crawling through some directories built with ASP.NET using Scrapy.
The pages to crawl through are encoded as such:
javascript:__doPostBack('ctl00$MainContent$List','Page$X')
where X is an integer between 1 and 180. The MainContent argument is always the same. I have no idea how to crawl into these pages. I would love to add something as simple as allow=('Page$') or attrs='__doPostBack' to the SLE rules, but my guess is that I have to be trickier in order to pull the info from the javascript "link."
If it's easier to "unmask" each of the absolute links from the javascript code and save those to a csv, then use that csv to load requests into a new scraper, that's okay, too.
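(For what it's worth, "unmasking" a single link just means recovering the __doPostBack target and argument; a minimal standalone sketch, using the href pattern from above with a made-up page number:)

import re

# href pattern taken from the question; the page number here is made up
href = "javascript:__doPostBack('ctl00$MainContent$List','Page$42')"

# capture the __EVENTTARGET and __EVENTARGUMENT that an ASP.NET postback needs
match = re.search(r"__doPostBack\('([^']+)','([^']+)'\)", href)
if match:
    event_target, event_argument = match.groups()
    print(event_target)    # ctl00$MainContent$List
    print(event_argument)  # Page$42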

This kind of pagination is not as trivial as it may seem. It was an interesting challenge to solve. There are several important notes about the solution provided below:
the idea here is to follow the pagination page by page, passing the current page around in the Request.meta dictionary
using a regular BaseSpider, since there is some logic involved in the pagination
it is important to provide headers pretending to be a real browser
it is important to yield FormRequests with dont_filter=True, since we are basically making POST requests to the same URL but with different parameters
The code:
import re
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

HEADERS = {
    'X-MicrosoftAjax': 'Delta=true',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
}
URL = 'http://exitrealty.com/agent_list.aspx?firstName=&lastName=&country=USA&state=NY'


class ExitRealtySpider(BaseSpider):
    name = "exit_realty"
    allowed_domains = ["exitrealty.com"]
    start_urls = [URL]

    def parse(self, response):
        # submit a form (first page)
        self.data = {}
        for form_input in response.css('form#aspnetForm input'):
            name = form_input.xpath('@name').extract()[0]
            try:
                value = form_input.xpath('@value').extract()[0]
            except IndexError:
                value = ""
            self.data[name] = value

        self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList'
        self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
        self.data['__EVENTARGUMENT'] = 'Page$1'

        return FormRequest(url=URL,
                           method='POST',
                           callback=self.parse_page,
                           formdata=self.data,
                           meta={'page': 1},
                           dont_filter=True,
                           headers=HEADERS)

    def parse_page(self, response):
        current_page = response.meta['page'] + 1

        # parse agents (TODO: yield items instead of printing)
        for agent in response.xpath('//a[@class="regtext"]/text()'):
            print agent.extract()
        print "------"

        # request the next page
        data = {
            '__EVENTARGUMENT': 'Page$%d' % current_page,
            '__EVENTVALIDATION': re.search(r"__EVENTVALIDATION\|(.*?)\|", response.body, re.MULTILINE).group(1),
            '__VIEWSTATE': re.search(r"__VIEWSTATE\|(.*?)\|", response.body, re.MULTILINE).group(1),
            '__ASYNCPOST': 'true',
            '__EVENTTARGET': 'ctl00$MainContent$agentList',
            'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
            '': ''
        }

        return FormRequest(url=URL,
                           method='POST',
                           formdata=data,
                           callback=self.parse_page,
                           meta={'page': current_page},
                           dont_filter=True,
                           headers=HEADERS)
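As a usage note: saved to a single file, a spider like this can be run with Scrapy's runspider command. On current Scrapy versions BaseSpider no longer exists, so scrapy.Spider (and Python 3 print()) would be needed instead, but the postback mechanics stay the same.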

Related

Control JavaScript translation in Google search

The code performs a Google search using the __init__.py below:
def search(term, num_results=10, lang="en", lr="lang_en"):
    usr_agent = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/61.0.3163.100 Safari/537.36'}

    def fetch_results(search_term, number_results, language_code):
        escaped_search_term = search_term.replace(' ', '+')
        google_url = 'https://www.google.com/search?q={}&num={}&hl={}&lr={}'.format(
            escaped_search_term, number_results + 1, language_code, lr)
        ...
Some of the returned links use JavaScript to translate the website:
<script type="text/javascript">
var home = '/de/', root = '/', country = 'ch', language = 'de', w = {
"download_image": "Bild download (Niedrige Qualität)",
...
another example:
<script>
dataLayer.push({
'brand' : 'Renault',
'countryCode' : 'BE',
'googleAccount' : 'UA-23041452-1',
'adobeAccount' : 'renaultbeprod',
'languageCode' : 'nl',
...
Is there a way to filter out the results translated through JavaScript and get search results in one language only?
My apologies, please disregard this question. I made a trivial mistake of putting a triple-quoted comment inside a function call in another part of the code, which effectively prevented the parameters from being passed.
After removing the triple-quoted comment shown below, I get results only in English:
for google_url in search(query,           # The query you want to run
                         lang='en',       # User interface language (host language)
                         num_results=10,  # Number of results per page
                         lr="lang_en"     # Language of the documents received
                         '''
                         lr - parameter is implemented in __init__.py of googlesearch.
                         It should be handled only here.
                         Other useful search parameters not used yet are:
                         cr - restricts search results to documents originating in a particular country
                              (ex. cr=countryCA)
                         gl - boosts search results whose country of origin matches the parameter value
                              (ex. gl=uk)
                         '''
                         ):
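For clarity, the working call after deleting that triple-quoted block (same googlesearch-style search() signature as above, with query defined elsewhere) would simply be:

# Corrected call: the triple-quoted block between the arguments is gone,
# so lr really is "lang_en" instead of being concatenated with the comment text.
for google_url in search(query,           # the query you want to run
                         lang='en',       # user interface (host) language
                         num_results=10,  # number of results per page
                         lr="lang_en"):   # language of the documents received
    print(google_url)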

Scrapy: scraping JSON data from URL that is constructed with dates

I have read many posts on using Scrapy to scrape JSON data, but haven't found one with dates in the URL.
I am using Scrapy 2.1.0 and am trying to scrape a site that populates based on date ranges in the URL. The rest of my code includes headers I copied from the site I am trying to scrape; I am using the following while loop to generate the URLs:
while start_date <= end_date:
    start_date += delta
    dates_url = (str(start_date) + "&end=" + str(start_date))
    ideas_url = base_url + dates_url
    request = scrapy.Request(
        ideas_url,
        callback=self.parse_ideas,
        headers=self.headers
    )
    print(ideas_url)
    yield request
Then I am trying to scrape using the following:
def parse_ideas(self, response):
    raw_data = response.body
    data = json.loads(raw_data)
    yield {
        'Idea': data['dates']['data']['idea']
    }
Here is the error output from when I run the spider and export to a CSV; I keep getting:
File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Is this the best approach for scraping a site that uses dates in its URL to populate? And, if so, what am I doing wrong with my JSON request such that I am not getting any results?
Note in case it matters, in settings.py I enabled and edited the following:
USER_AGENT = 'Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/4537.36'
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
And I added the following at the end of settings.py
HTTPERROR_ALLOWED_CODES = [400]
DOWNLOAD_DELAY = 2
The problem with this is that you are trying to scrape a JavaScript app. In the html it says:
<noscript>You need to enable JavaScript to run this app.</noscript>
Another problem is that the app pulls its data from a separate API which requires authorization. So I guess your best bet is to either use Splash or Selenium to wait for the page to load and use the HTML they generate.
Personally I mostly use something very similar to scrapy-selenium; there is also a ready-made package available for it.
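As a rough sketch of that suggestion, a spider built on the scrapy-selenium package could look like the following; the settings lines mirror that package's README, while the URL and the wait condition are placeholders standing in for the question's date-based URLs, so treat this as an untested outline:

# settings.py wiring for scrapy-selenium (per that package's README):
# SELENIUM_DRIVER_NAME = 'chrome'
# SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'
# SELENIUM_DRIVER_ARGUMENTS = ['--headless']
# DOWNLOADER_MIDDLEWARES = {'scrapy_selenium.SeleniumMiddleware': 800}

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


class IdeasSpider(scrapy.Spider):
    name = 'ideas'

    def start_requests(self):
        # placeholder URL; in the question this would be base_url + dates_url
        url = 'https://example.com/ideas?start=2020-01-01&end=2020-01-01'
        # let the browser render the JavaScript app before Scrapy sees the response
        # (the wait condition here is a trivial placeholder)
        yield SeleniumRequest(
            url=url,
            callback=self.parse_ideas,
            wait_time=10,
            wait_until=EC.presence_of_element_located((By.TAG_NAME, 'body')),
        )

    def parse_ideas(self, response):
        # response.text is now the browser-rendered HTML, not raw JSON,
        # so it is parsed with selectors instead of json.loads()
        self.logger.info(response.css('title::text').get())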

How to yield fragment URLs in scrapy using Selenium?

From my poor knowledge of web scraping, I've come across an issue that is very complex for me, which I will try to explain as best I can (so I'm open to suggestions or edits to my post).
I started using the web crawling framework Scrapy long ago to do my web scraping, and it's still the one I use today. Lately I came across this website and found that my framework (Scrapy) was not able to iterate over the pages, since the website uses fragment URLs (#) to load the data for the next pages. I then made a post about that problem (having no idea of the root cause yet): my post
After that, I realized that my framework can't do it without a JavaScript interpreter or a browser imitation, so the Selenium library was suggested to me. I read as much as I could about that library (i.e. example1, example2, example3 and example4). I also found a StackOverflow post that gives some leads on my issue.
So finally, my biggest questions are:
1 - Is there any way to iterate/yield over the pages from the website shown above, using Selenium along with Scrapy?
So far, this is the code I'm using, but it doesn't work...
EDIT:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# The required imports...


def getBrowser():
    path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
        "(KHTML, like Gecko) Chrome/15.0.87")
    browser = webdriver.PhantomJS(executable_path=path_to_phantomjs, desired_capabilities=dcap)
    return browser


class MySpider(Spider):
    name = "myspider"
    browser = getBrowser()

    def start_requests(self):
        the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
        yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.get_page_links()

    def get_page_links(self):
        """ This first part goes through all available pages """
        for i in xrange(1, 3):  # 210
            new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
                        "config": {"page": str(i)}}
            json_data = json.dumps(new_data)
            new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
            self.browser.get(new_url)
            print "\nThe new URL is -> ", new_url, "\n"
            content = self.browser.page_source
            self.get_item_links(content)

    def get_item_links(self, body=""):
        if body:
            """ This second part goes through all available items """
            raw_links = re.findall(r'listclickable.+?>', body)
            links = []
            if raw_links:
                for raw_link in raw_links:
                    new_link = re.findall(r'data-link=\".+?\"', raw_link)[0].replace("data-link=\"", "").replace("\"", "")
                    links.append(str(new_link))
            if links:
                ids = self.get_ids(links)
                for link in links:
                    current_id = self.get_single_id(link)
                    print "\nThe Link -> ", link
                    # If the line below is commented out, the code works; otherwise it doesn't
                    yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)

    def get_ids(self, list1=[]):
        if list1:
            ids = []
            for elem in list1:
                raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
                ids.append(raw_id)
            return ids
        else:
            return []

    def get_single_id(self, text=""):
        if text:
            raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
            return raw_id
        else:
            return ""

    def parse_room(self, response):
        # More scraping code...
        pass
So this is mainly my problem. I'm almost sure that what I'm doing isn't the best way, hence my second question. And to avoid having to deal with these kinds of issues in the future, my third question.
2 - If the answer to the first question is negative, how could I tackle this issue? I'm open to other means as well.
3 - Can anyone tell me or show me pages where I can learn how to combine web scraping with JavaScript and Ajax? Nowadays more and more websites use JavaScript and Ajax scripts to load their content.
Many thanks in advance!
Selenium is one of the best tools for scraping dynamic data. You can use Selenium with any web browser to fetch data that is loaded by scripts; it works exactly like a browser's click operations. But I don't prefer it.
For getting dynamic data you can use the scrapy + splash combo. From Scrapy you will get all the static data, and Splash will handle the other dynamic content.
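For reference, a minimal scrapy-splash sketch (assuming a Splash instance running locally on port 8050 and the middleware settings from the scrapy-splash README; the spider name and selector are made up for illustration):

# settings.py additions, per the scrapy-splash README:
# SPLASH_URL = 'http://localhost:8050'
# DOWNLOADER_MIDDLEWARES = {
#     'scrapy_splash.SplashCookiesMiddleware': 723,
#     'scrapy_splash.SplashMiddleware': 725,
#     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
# }
# SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100}
# DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

import scrapy
from scrapy_splash import SplashRequest


class SplashExampleSpider(scrapy.Spider):
    name = 'splash_example'
    start_urls = ['http://www.atraveo.com/es_es/islas_canarias']

    def start_requests(self):
        for url in self.start_urls:
            # args={'wait': 2} gives the page's JavaScript time to render
            yield SplashRequest(url, callback=self.parse, args={'wait': 2})

    def parse(self, response):
        # response.text is the rendered HTML, so normal selectors work on it
        for link in response.css('a::attr(href)').getall():
            self.logger.info(link)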
Have you looked into BeautifulSoup? It's a very popular web scraping library for Python. As for JavaScript, I would recommend something like Cheerio (if you're asking for a scraping library in JavaScript).
If you mean that the website uses HTTP requests to load content, you could always try to replicate those requests manually with something like the requests library.
Hope this helps.
You can definitely use Selenium as a standalone tool to scrape webpages with dynamic content (like AJAX loading).
Selenium will just rely on a WebDriver (basically a web browser) to seek content over the Internet.
Here are a few of them (the most often used):
ChromeDriver
PhantomJS (my favorite)
Firefox
Once it's set up, you can start your bot and parse the HTML content of the webpage.
I included a minimal working example below using Python and ChromeDriver:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path='chromedriver')
driver.get('https://www.google.com')
# Then you can search for any element you want on the webpage
search_bar = driver.find_element(By.CLASS_NAME, 'tsf-p')
search_bar.click()
driver.close()
See the documentation for more details!
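One caveat: in Selenium 4 the executable_path argument is deprecated (driver binaries are resolved by Selenium Manager or passed via a Service object), and PhantomJS support has been removed entirely, so a headless ChromeDriver or geckodriver is the usual replacement today.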

Issue scraping javascript generated content with Selenium and python

I'm trying to scrape real estate data off of this website: example
As you can see the relevant content is placed into article tags.
I'm running selenium with phantomjs:
driver = webdriver.PhantomJS(executable_path=PJSpath)
Then I generate the URL in Python; because all the search parameters are part of the link, I can search for what I'm looking for from the program without needing to fill out the form.
Before calling
driver.get(engine_link)
I copy engine_link to the clipboard and it opens fine in chrome.
Next I wait for all possible redirects to happen:
def wait_for_redirect(wdriver):
    elem = wdriver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 5:
            print("Waited for redirect for 5 seconds!")
            return
        time.sleep(1)
        try:
            elem = wdriver.find_element_by_tag_name("html")
        except StaleElementReferenceException:
            return
Now at last I want to iterate over all <article> tags on the current page:
for article in driver.find_elements_by_tag_name("article"):
But this loop never returns anything. The program doesn't find any article tags; I've tried it with XPath and CSS selectors. Moreover, the articles are enclosed in a section tag, which can't be found either.
Is there a problem with these specific kinds of tags in Selenium, or am I missing something JS-related here? At the bottom of the page there are JavaScript templates whose naming suggests that they generate the search results.
Any help appreciated!
Pretend not to be PhantomJS and add an Explicit Wait (worked for me):
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# set a custom user-agent
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36"
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = user_agent
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get("http://www.seloger.com/list.htm?cp=40250&org=advanced_search&idtt=2&pxmin=50000&pxmax=200000&surfacemin=20&surfacemax=100&idtypebien=2&idtypebien=1&idtypebien=11")
# wait for articles to be present
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.TAG_NAME, "article")))
# get articles
for article in driver.find_elements_by_tag_name("article"):
    print(article.text)
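As a design note: the custom user-agent matters because some sites serve different (or blocked) markup to the default PhantomJS user-agent, and the explicit wait gives the JavaScript templates mentioned in the question time to render the article elements before they are queried.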

Click and scrape link with javascript:__doPostBack href in Python

I wish to download data for thousands of records from a government site using Python 2.7. One example of a record is http://camara.cl/pley/pley_detalle.aspx?prmID=1252&prmBL=1-07. Two related problems:
(1) the site relies on mouse clicks (an "Urgencias" link in the source whose href is a javascript:__doPostBack call) to access another part of the data of interest to me; and
(2) I am illiterate in web scraping in general and Python in particular.
Learning-by-doing has so far taken me about half-way. Internet resources here, here, and here pushed me in the right direction. But I hit a wall.
I can get source code for the information that fills the screen when the url is invoked.
import requests
id = '1252'
bl = '1-07'
url = 'http://camara.cl/pley/pley_detalle.aspx'
parametros = {'prmID': id, 'prmBL': bl}
r = requests.get(url, params = parametros)
hitos = r.text
print hitos
But I've had no success in getting info from the 'Urgencias' tab. One attempt looks like this:
import json
parametros = {'prmID': id, 'prmBL': bl, '__EVENTTARGET': 'ctl00$mainPlaceHolder$btnUrgencias'}
headers = {'content-type': 'application/x-www-form-urlencoded; charset=utf-8'}
p = requests.post(url, data = json.dumps(parametros), headers = headers)
urgencias = p.text
print urgencias
I am obviously not building/sending the request properly. (I am also missing cookies, I believe.)
Any help will be greatly appreciated. (I am open to any method that will work from an Ubuntu machine!)
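There is no accepted answer here, but following the same pattern as the Scrapy solution at the top of this page (collect the hidden ASP.NET state fields from the initial GET, then POST them back with __EVENTTARGET set to the control being "clicked"), a hedged requests-only sketch might look like this; the regex over hidden inputs and the exact field set the server expects are assumptions:

import re
import requests

url = 'http://camara.cl/pley/pley_detalle.aspx'
parametros = {'prmID': '1252', 'prmBL': '1-07'}

session = requests.Session()  # keeps the ASP.NET session cookies between requests
r = session.get(url, params=parametros)

# Collect the hidden ASP.NET state fields (__VIEWSTATE, __EVENTVALIDATION, ...)
# from the initial page; this regex is a rough sketch, not production parsing.
hidden = dict(re.findall(r'<input[^>]+name="(__[A-Z]+)"[^>]*value="([^"]*)"', r.text))

data = dict(hidden)
data['__EVENTTARGET'] = 'ctl00$mainPlaceHolder$btnUrgencias'
data['__EVENTARGUMENT'] = ''

# Post the form back as ordinary form-encoded data (not JSON)
p = session.post(r.url, data=data)
print(p.text)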
