I want to build a general-purpose scraper that can crawl and scrape data from any type of website, including AJAX websites. I have searched extensively but could not find a resource that explains how Scrapy and Splash can be used together to scrape AJAX websites (including pagination, form data, and clicking a button before the page is displayed). Every link I have found says that JavaScript websites can be rendered with Splash, but there is no good tutorial or explanation of how to actually do it. Please don't suggest solutions that rely on driving a regular browser. I want to do everything programmatically; headless browser suggestions are welcome, but I want to use Splash.
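import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FlipSpider(CrawlSpider):
    name = "flip"
    allowed_domains = ["www.amazon.com"]
    start_urls = ['https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=mobile']
    # rules must be a tuple, hence the trailing comma and closing parenthesis
    rules = (Rule(LinkExtractor(), callback='lol', follow=True),)

    def parse_start_url(self, response):
        # Re-request the start URL through Splash so that its JavaScript is rendered
        yield scrapy.Request(response.url,
                             self.lol,
                             meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 5, 'iframes': 1}}})

    def lol(self, response):
        """
        Some code
        """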
The problem with Splash and pagination is the following:
I wasn't able to produce a Lua script that, after clicking the pagination link, delivers the new page in the form of a response object rather than pure HTML.
So my solution is to click the link, extract the newly generated URL, and direct the crawler to that URL.
On the page that has the pagination link, I execute
yield SplashRequest(url=response.url, callback=self.get_url, endpoint="execute", args={'lua_source': script})
with the following Lua script
def parse_categories(self, response):
    script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(1)
        splash:runjs('document.querySelectorAll(".next-page")[0].click()')
        splash:wait(1)
        return splash:url()
    end
    """
and the get_url function
def get_url(self, response):
    # The Lua script returns splash:url(), so the response text is the new URL
    yield SplashRequest(url=response.text, callback=self.parse_categories)
This way I was able to loop my queries.
In the same way, if you don't expect a new URL, your Lua script can just return pure HTML, which you then have to work through with regexes (that is bad), but this is the best I was able to do.
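A possible refinement, sketched under the assumption that scrapy-splash's default response handling is enabled: if the Lua script returns a table instead of a string, Splash replies with JSON, and scrapy-splash exposes the decoded table as response.data. With an html key in the table, the response body should also be filled from it, so selectors can be used instead of regexes. LUA_CLICK_NEXT, splash_execute_args and unpack_result are hypothetical names used only for illustration.

```python
# Lua script: click the pagination link, then return both the new URL and the HTML.
LUA_CLICK_NEXT = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(1)
    splash:runjs('document.querySelectorAll(".next-page")[0].click()')
    splash:wait(1)
    -- Returning a table makes Splash answer with JSON instead of plain text.
    return {url = splash:url(), html = splash:html()}
end
"""


def splash_execute_args(lua_source=LUA_CLICK_NEXT):
    """Build the args dict for SplashRequest(..., endpoint="execute", args=...)."""
    return {"lua_source": lua_source}


def unpack_result(data):
    """Split the decoded table (response.data in scrapy-splash) into (url, html)."""
    return data["url"], data["html"]
```

In a spider this would be used roughly as SplashRequest(url, callback, endpoint="execute", args=splash_execute_args()), with the callback calling unpack_result(response.data).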
You can emulate behaviors such as a click or a scroll by writing a JavaScript function and telling Splash to execute it when it renders your page.
A small example:
You define a JavaScript function that selects an element in the page, and then you click on it:
(source: Splash docs)
# Get button element dimensions with javascript and perform mouse click.
_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    local get_dimensions = splash:jsfunc([[
        function () {
            var rect = document.getElementById('button').getClientRects()[0];
            return {"x": rect.left, "y": rect.top}
        }
    ]])
    splash:set_viewport_full()
    splash:wait(0.1)
    local dimensions = get_dimensions()
    splash:mouse_click(dimensions.x, dimensions.y)
    -- Wait split second to allow event to propagate.
    splash:wait(0.1)
    return splash:html()
end
"""
Then, when you make the request, set the endpoint to "execute" and add "lua_source": _script to the args.
Example:
def parse(self, response):
    yield SplashRequest(response.url, self.parse_elem,
                        endpoint="execute",
                        args={"lua_source": _script})
You will find all the information about Splash scripting in the Splash scripting documentation.
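To make it clearer what the "execute" endpoint receives, here is a minimal sketch that builds the equivalent raw HTTP request with the standard library only. It assumes a Splash instance listening on http://localhost:8050 (the usual Docker default, but an assumption here); SPLASH_URL and build_execute_request are hypothetical names. The request is only constructed, not sent.

```python
import json
from urllib.request import Request

SPLASH_URL = "http://localhost:8050"  # assumption: local Splash instance


def build_execute_request(page_url, lua_source, splash_url=SPLASH_URL):
    """Build a POST request for Splash's /execute endpoint.

    Every key in the JSON body becomes available to the Lua script
    through splash.args (for example splash.args.url).
    """
    payload = {"url": page_url, "lua_source": lua_source}
    return Request(
        splash_url.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would send it; with the script above, the reply body would be whatever main returns, in this case the rendered HTML.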
I just answered a similar question here: scraping ajax based pagination. My solution is to get the current and last pages and then replace the page variable in the request URL.
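As a rough illustration of replacing the page variable, the following sketch generates one URL per page with the standard library. It assumes the site keeps the page number in a query parameter called page; the function name page_urls and the example URL are made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(url, first, last, param="page"):
    """Yield copies of url with the page parameter set to first..last."""
    parts = urlsplit(url)
    # Keep every query parameter except the page number itself.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    for page in range(first, last + 1):
        new_query = urlencode(query + [(param, str(page))])
        yield urlunsplit(parts._replace(query=new_query))
```

For example, list(page_urls("https://example.com/list?q=phone&page=1", 2, 3)) gives the URLs for pages 2 and 3, each of which can then be requested in turn.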
Also, the other thing you can do is look at the Network tab in the browser dev tools and see if you can identify any API that is called. If you filter the requests by XHR, you can see those that return JSON.
You can then call the API directly and parse the JSON/HTML response. Here is the relevant section of the Scrapy docs: The Network-tool
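A minimal sketch of calling such an API directly, using only the standard library. The helper names build_api_request and fetch_json are hypothetical, and the headers are an assumption: some sites need more of the headers seen in the Network tab, others need none.

```python
import json
from urllib.request import Request, urlopen


def build_api_request(url, user_agent="Mozilla/5.0"):
    """Build a GET request that resembles the browser's XHR call."""
    return Request(url, headers={"User-Agent": user_agent,
                                 "Accept": "application/json"})


def fetch_json(url, timeout=10):
    """Send the request and decode the JSON body (performs network I/O)."""
    with urlopen(build_api_request(url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

fetch_json(api_url) would return plain Python dicts and lists, so no HTML parsing is needed.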
I was trying to download this data from the website.
https://www.nseindia.com/market-data/oi-spurts
How can I scrape this using Python?
The JavaScript function downloadCSV is part of gijgo.min.js. It invokes getCSV, which goes through the fields in the table and generates a CSV file on the fly.
Fortunately, you don't have to deal with CSVs or scraping anything from the page. To get the data you want, all you have to do is make an HTTP GET request to the same RESTful API that your browser makes a request to when visiting the page:
def main():
    import requests

    url = "https://www.nseindia.com/api/live-analysis-oi-spurts-underlyings"
    headers = {
        "user-agent": "Mozilla/5.0"
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    data = response.json()["data"]
    print(f"There are {len(data)} items in total.")
    print(f"The first item is:\n{data[0]}")
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
There are 143 items in total.
The first item is:
{'symbol': 'MOTHERSUMI', 'latestOI': 7182, 'prevOI': 4674, 'changeInOI': 2508, 'avgInOI': 53.66, 'volume': 12519, 'futValue': 53892.6066, 'optValue': 3788085280, 'total': 55585.0344, 'premValue': 1692.4278, 'underlyingValue': 104}
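If a CSV file is still what you want in the end, the list of dictionaries can be written out with the csv module. This is a small sketch that assumes every item has the same keys as the first one; items_to_csv is a hypothetical helper name.

```python
import csv
import io


def items_to_csv(items):
    """Render a list of flat dicts as CSV text, using the first item's keys as header."""
    if not items:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(items[0].keys()),
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()
```

With the data from above, items_to_csv(data) returns the text of the file, which can be saved with open("oi-spurts.csv", "w").write(...).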
One way to find the final download link for a file is to open your browser's developer tools and click the download link while watching the Network tab.
Normally you will see the request that the page's JavaScript made, including its URL, its headers, its content, and so on.
From there you just need to replicate the request that the JavaScript sent.
I am trying to log into a website with Scrapy, but the response received is an HTML document containing only inline JavaScript. The JS redirects to the page I want to scrape data from. But Scrapy does not execute the JS and therefore doesn't route to the page I want it to.
I use the following code to submit the login form required:
def parse(self, response):
    request_id = response.css('input[name="request_id"]::attr(value)').extract_first()
    data = {
        'userid_placeholder': self.login_user,
        'foilautofill': '',
        'password': self.login_pass,
        'request_id': request_id,
        'username': self.login_user[1:]
    }
    yield scrapy.FormRequest(url='https://www1.up.ac.za/oam/server/auth_cred_submit', formdata=data,
                             callback=self.print_p)
The print_p callback function is as follows:
def print_p(self, response):
    print(response.text)
I have looked at scrapy-splash but I could not find a way to execute the JS in the response with scrapy-splash.
I'd suggest using Splash as a rendering service. Personally, I found it more reliable than Selenium. Using scripts, you can instruct it to interact with the page.
Selenium can probably help you get past this JS.
If you haven't tried it yet, you can start from some examples like this. If you manage to reach the page, you can get its URL with:
self.driver.current_url
And scrape it afterwards.
I am not sure if I worded my question correctly. I'm not actually sure how to go about this at all.
I have a site load.html. Here I can use a textbox to enter an ID, for example 123, and the page will display some information (retrieved via a Javascript function that calls AJAX from the Flask server).
I also have a site, account.html. Here it displays all the IDs associated with an account.
I want to make it so if you click the ID in account.html, it will go to load.html and show the information required.
Basically, after I press the link, I need to change the URL to load.html, then call the Javascript function to display the information associated with the ID.
My original thought was to use variable routes in Flask, like @app.route('/load/<int:id>') instead of simply @app.route('/load')
But all /load does is show load.html, not actually load the information. That is done in the Javascript function I talked about earlier.
I'm not sure how to go about doing this. Any ideas?
If I need to explain more, please let me know. Thanks!
To make this more clear, I can go to load.html and call the Javascript function from the web console and it works fine. I'm just not sure how to do this with variable routes in Flask (is that the right way?) since showing the information depends on some Javascript to parse the data returned by Flask.
Flask code loading load.html
@app.route('/load')
def load():
    return render_template('load.html')
Flask code returning information
@app.route('/retrieve')
def retrieve():
    return jsonify({
        'in':in(),
        'sb':sb(),
        'td':td()
    })
/retrieve just returns a data structure from the database that is then parsed by the JavaScript and output into the HTML. Now that I think about it, I suppose the variable route has to be in retrieve? Right now I'm using AJAX to send an ID over; should I change that to /retrieve/<int:id>? But how exactly would I retrieve the information from, for example, /retrieve/5? In AJAX I can just read the data in the success callback, but I can't do that with a plain web address.
Suppose you are passing the data to retrieve in the browser URL as
www.example.com/retrieve?Data=5
Then you can get the value like this:
dataValue = request.args.get('Data')
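To show what that lookup amounts to, here is a standard-library sketch that extracts the same value from the URL. It is not Flask's implementation, only an illustration; get_query_value is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlsplit


def get_query_value(url, name, default=None):
    """Return the first value of a query parameter, or default if it is absent."""
    values = parse_qs(urlsplit(url).query).get(name)
    return values[0] if values else default
```

Note that the value comes back as the string "5", not the integer 5. In Flask, request.args.get('Data', type=int) performs the conversion for you.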
You can specify a parameter in the URL, like /retrieve/<page>.
There are several ways to do this in Flask.
One way is
@app.route('/retrieve/', defaults={'page': 0})
@app.route('/retrieve/<page>')
def retrieve(page):
    if page:
        # Do page stuff here
        pass
    return jsonify({
        'in':in(),
        'sb':sb(),
        'td':td()})
Another way is
@app.route('/retrieve/')
@app.route('/retrieve/<page>')
def retrieve(page=0):
    if page:
        # Do your page stuff here
        pass
    return jsonify({
        'in':in(),
        'sb':sb(),
        'td':td()
    })
Note: You can also specify a converter, like <int:page>.
I'm using Flask and want to render html pages and directly focus on a particular dom element using /#:id.
Below is the code for default / rendering
@app.route('/')
def my_form():
    return render_template("my-form.html")
Below is the function being called upon POST request.
@app.route('/', methods=['POST'])
def my_form_post():
    #...code goes here
    return render_template("my-form.html")
I want to render my my-form.html page as my-form.html/#output so that it directly focuses on the desired element in the DOM.
But trying return render_template("my-form.html/#output") fails because there is no such file, and trying @app.route('/#output', methods=['POST']) doesn't work either.
UPDATE:
Consider this web-app JSON2HTML - http://json2html.herokuapp.com/
What is happening: Whenever a person clicks the send button, the textarea input and the styling settings are sent over to my Python backend Flask app as below.
@app.route('/', methods=['POST'])
def my_form_post():
    #...code for converting json 2 html goes here
    return render_template("my-form.html", data=processed_data)
What I want: Whenever the send button is clicked, the form data is POSTed and processed, and the same page is rendered again with a new parameter that contains the processed_data to be displayed. My problem is how to render the same page with the fragment identifier #outputTable appended, so that after the conversion the page directly focuses on the output the user wants.
The fragment identifier part of the URL is never sent to the server, so Flask does not have that. This is supposed to be handled in the browser. Within Javascript you can access this element as window.location.hash.
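A short sketch of why the server cannot see it: when a URL is split, only the path and the query string end up in the HTTP request, while the fragment stays in the browser. request_target is a hypothetical helper written for this illustration.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return (what is sent to the server, what stays in the browser)."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target, parts.fragment
```

For http://example.com/route?id=123#outputTable this returns ("/route?id=123", "outputTable"), so a route such as '/#output' can never match anything.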
If you need to do your highlighting on the server then you need to look for an alternative way of indicating what to highlight that the server receives so that it can give it to the template. For example:
# focus element in the query string
# http://example.com/route?id=123
@app.route('/route')
def route():
    id = request.args.get('id')
    # id will be the given id or None if not available
    return render_template('my-form.html', id=id)
And here is another way:
# focus element in the URL
# http://example.com/route/123
@app.route('/route')
@app.route('/route/<id>')
def route(id=None):
    # id is sent as an argument
    return render_template('my-form.html', id=id)
Response to the Edit: as I said above, the fragment identifier is handled by the web browser; Flask never sees it. To begin, you have to add <a name="outputTable">Output</a> to your template at the place you want to jump to. You then have two options. The hacky one is to include the hash fragment in the action attribute of your form. The better choice is to add an onload JavaScript event handler that jumps to the anchor after the page has loaded.
I have one of those websites that basically gives you a yes or no response to a question posed by the url. An example being http://isnatesilverawitch.com.
My site is more of an in-joke and the answer changes frequently. What I would like to be able to do is store a short one or two word string and be able to change it without editing the source on my site if that is possible using only javascript. I don't want to set up an entire database just to hold a single string.
Is there a way to write to a file without too much trouble, or possibly a web service designed to retrieve and change a single string that I could use to power such a site? I know it's a strange question, but the people in my office will definitely get a kick out of it. I am even considering building a mobile app to manipulate the answer on the fly.
ADDITIONAL:
To be clear I just want to change the value of a single string but I can't just use a random answer. Without being specific, think of it as a site that states if the doctor is IN or OUT, but I don't want it to spit out a random answer, it needs to say IN when he is IN and OUT when he is out. I will change this value manually, but I would like to make the process simple and something I can do on a mobile device. I can't really edit source (nor do I want to) from a phone.
If I understand correctly you want a simple text file that you change a simple string value in and have it appear someplace on your site.
var string = "loading";
$.get('filename.txt', function(result) {
    string = result;
    // use string
});
Since you don't want to have server-side code or a database, one option is to have javascript retrieve values from a Google Spreadsheet. Tabletop (http://builtbybalance.com/Tabletop/) is one library designed to let you do this. You simply make a public Google Spreadsheet and enable "Publish to web", which gives you a public URL. Here's a simplified version of the code you'd then use on your site:
function init() {
    Tabletop.init({
        url: your_public_spreadsheet_url,
        callback: function (data) {
            console.log(data);
        },
        simpleSheet: true
    });
}
Two ideas for you:
1) Using only JavaScript, generate the value randomly (or perhaps based on a schedule, which you can hard code ahead of time once and the script will take care of the changes over time).
2) Using Javascript and a server-side script, you can change the value on the fly.
Use JavaScript to make an AJAX request to a text file that contains the value. Shanimal's answer gives you the code to achieve that.
To change the value on the fly you'll need another server-side script that writes the value to some sort of data store (your text file in this case). I'm not sure what server-side scripting (e.g. PHP, Perl, ASP, Python) runtime you have on your web server, but I could help you out with the code for PHP where you could change the value by pointing to http://yoursite.com/changeValue.php?Probably in a browser. The PHP script would simply write Probably to the text file.
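Since Python is one of the options mentioned, here is a rough standard-library sketch of such a script. It assumes the value lives in words.txt next to the script and that the change URL looks like /changeValue?Probably; both names, the port, and the helper names are made up for the example. It has no authentication, so anyone who knows the URL can change the value.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from urllib.parse import unquote, urlsplit

STATUS_FILE = Path("words.txt")  # hypothetical file read by the page's AJAX call


def save_value(raw_query, path=STATUS_FILE):
    """Validate the value taken from the query string and write it to the file."""
    value = unquote(raw_query).strip()
    # Accept only short strings made of letters, digits and spaces.
    if not value or len(value) > 40 or not all(c.isalnum() or c == " " for c in value):
        raise ValueError("value must be 1-40 letters, digits or spaces")
    path.write_text(value, encoding="utf-8")
    return value


class ChangeValueHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parts = urlsplit(self.path)
        if parts.path != "/changeValue":
            self.send_error(404)
            return
        try:
            value = save_value(parts.query)
        except ValueError as exc:
            self.send_error(400, str(exc))
            return
        body = f"Value set to {value}\n".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the tiny server until interrupted."""
    HTTPServer(("127.0.0.1", port), ChangeValueHandler).serve_forever()
```

After calling serve(), opening http://127.0.0.1:8000/changeValue?Probably in a browser (including a mobile one) would write Probably to the text file.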
Though a JavaScript solution is possible, it is discouraged. PHP is designed for things such as changing pieces of a site. Assuming you know that, I will jump to the JavaScript solution.
Because you want to store the word in a text file, you will need to download this file using AJAX, or store the value in a .js file as an array or string.
Then you will want to change the word. Using AJAX makes it possible to change the word while the page is loaded (so it may, but does not have to, change in front of the viewer's eyes).
Changing page HTML
A possible way of changing it (the word is stored in a variable):
wordlist.js
var status = "IN"; //Edit IN to OUT whenever you want
index.html
<script src="wordlist.js"></script>
<div>Doctor is <span id="changing">IN</span></div>
<script>
    function changeWord(s) { //Change to anything
        document.getElementById("changing").innerHTML = s;
    }
    changeWord(status); //Get the status defined in wordlist.js
</script>
Reloading from server
If you want to change the answer dynamically and have the change visible on all open pages, you will need AJAX, or you will have to make the browser reload the word list, as follows:
Reloading script
function reloadWords() {
    var script = document.createElement("script"); //Create <script>
    script.type = "text/javascript";
    script.src = "wordlist.js"; //Set the path
    script.onload = function() { changeWord(status); }; //Change answer after loading
    document.getElementsByTagName("head")[0].appendChild(script); //Append to <head> so it loads as script. Can be appended anywhere, but I like to use <head>
}
Using AJAX
Here we assume a text file is used, which is probably the simplest solution. With AJAX it looks like this:
// Use XMLHttpRequest where available; fall back to ActiveX only in old Internet Explorer.
var http = window.XMLHttpRequest ? new XMLHttpRequest() : new ActiveXObject("Microsoft.XMLHTTP");
http.onloadend = function() {
    document.getElementById("changing").innerHTML = this.responseText; //Set the new response, "IN" or "OUT"
};
http.open("GET", "words.txt");
http.send();
The performance of the AJAX approach may be improved by using long polling. I will not go into that here unless someone is interested.