How to download data behind buttons which call javascript functions? - javascript

I was trying to download the data from this website:
https://www.nseindia.com/market-data/oi-spurts
How can I scrape it using Python?

The JavaScript function downloadCSV is part of gijgo.min.js. It invokes getCSV, which goes through the fields in the table and generates a CSV file on the fly.
Fortunately, you don't have to deal with CSVs or scraping anything from the page. To get the data you want, all you have to do is make an HTTP GET request to the same RESTful API that your browser makes a request to when visiting the page:
def main():
    import requests

    url = "https://www.nseindia.com/api/live-analysis-oi-spurts-underlyings"
    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    data = response.json()["data"]

    print(f"There are {len(data)} items in total.")
    print(f"The first item is:\n{data[0]}")

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
There are 143 items in total.
The first item is:
{'symbol': 'MOTHERSUMI', 'latestOI': 7182, 'prevOI': 4674, 'changeInOI': 2508, 'avgInOI': 53.66, 'volume': 12519, 'futValue': 53892.6066, 'optValue': 3788085280, 'total': 55585.0344, 'premValue': 1692.4278, 'underlyingValue': 104}

One way to find the final download link for a file is to open your browser's developer tools, switch to the Network tab, and then click the download link on the page.
You will normally see the request that the page's JavaScript issued, including the URL, the request payload, and so on.
From there you just need to replicate that request in your own code.

Related

Display Progressbar In Django View When downloading video (youtube-dl)

I am calling 'search/' through an AJAX call when the button is clicked. I want to show these details {"file_name":d['filename'],"percentage":d['_percent_str'],"speed":d['_eta_str']} in a progress bar on the web page while the download runs.
How do I get the JSON response from video_progress_hook each time it is called via the 'progress_hooks' parameter in ydl_opts?
I want to receive the response in JavaScript.
import uuid
import youtube_dl
from django.http import JsonResponse

def search(request):
    file_name = "" + str(uuid.uuid1()).split('-')[0] + ".mp3"
    query = request.GET.get("query")
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{'key': 'FFmpegExtractAudio',
                            'preferredcodec': 'mp3',
                            'preferredquality': '192'}],
        'outtmpl': 'media/' + file_name,
        'progress_hooks': [video_progress_hook],
        'quiet': False,
    }
    ydl = youtube_dl.YoutubeDL(ydl_opts)
    ydl.download([query])
    args = {'url_link': file_name}
    return JsonResponse(args)

def video_progress_hook(d):
    args = {}
    if d['status'] == 'downloading':
        args = {"file_name": d['filename'], "percentage": d['_percent_str'], "speed": d['_eta_str']}
    return JsonResponse(args)
In general there is no easy solution.
You need to store the download request in a database and return its id to the client. Then create a separate endpoint that returns the progress for a given request id, and have the client poll that endpoint every few seconds while the progress hook updates the stored record.
A good ready-made solution for this pattern is https://github.com/czue/celery-progress
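For illustration only, here is a minimal sketch of that polling pattern with a plain Django model. The model name DownloadProgress, its fields, and the URL wiring are assumptions, not part of the original question, and for polling to be useful the download itself would normally run in a background task (which is what celery-progress handles for you):

from django.http import JsonResponse
from .models import DownloadProgress  # hypothetical model with task_id, file_name, percentage, speed fields

def make_progress_hook(task_id):
    # youtube-dl calls each hook with a single dict, so bind the task id via a closure.
    def hook(d):
        if d['status'] == 'downloading':
            # Persist the progress instead of trying to return an HTTP response from the hook.
            DownloadProgress.objects.update_or_create(
                task_id=task_id,
                defaults={
                    'file_name': d['filename'],
                    'percentage': d['_percent_str'],
                    'speed': d['_eta_str'],
                },
            )
    return hook

def progress(request, task_id):
    # Polled from JavaScript every few seconds, e.g. setInterval(() => fetch('/progress/' + taskId), 2000).
    p = DownloadProgress.objects.filter(task_id=task_id).first()
    if p is None:
        return JsonResponse({'status': 'pending'})
    return JsonResponse({'file_name': p.file_name, 'percentage': p.percentage, 'speed': p.speed})

In the search view you would then pass make_progress_hook(task_id) in 'progress_hooks' and return the task id to the client so it knows which progress record to poll.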

Execute inline JavaScript in Scrapy response

I am trying to log into a website with Scrapy, but the response received is an HTML document containing only inline JavaScript. The JS redirects to the page I want to scrape data from. But Scrapy does not execute the JS and therefore doesn't route to the page I want it to.
I use the following code to submit the login form required:
def parse(self, response):
    request_id = response.css('input[name="request_id"]::attr(value)').extract_first()
    data = {
        'userid_placeholder': self.login_user,
        'foilautofill': '',
        'password': self.login_pass,
        'request_id': request_id,
        'username': self.login_user[1:]
    }
    yield scrapy.FormRequest(url='https://www1.up.ac.za/oam/server/auth_cred_submit', formdata=data,
                             callback=self.print_p)
The print_p callback function is as follows:
def print_p(self, response):
    print(response.text)
I have looked at scrapy-splash but I could not find a way to execute the JS in the response with scrapy-splash.
I'd suggest using Splash as a rendering service. Personally, I found it more reliable than Selenium. Using scripts, you can instruct it to interact with the page.
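For illustration, a minimal sketch of how that could look with scrapy-splash, assuming the project is already configured for Splash as described in the scrapy-splash README; the wait time is an assumption. The form is submitted through Splash so the inline JavaScript in the response is executed (and its redirect followed) before the callback runs:

from scrapy_splash import SplashFormRequest

def parse(self, response):
    request_id = response.css('input[name="request_id"]::attr(value)').extract_first()
    data = {
        'userid_placeholder': self.login_user,
        'foilautofill': '',
        'password': self.login_pass,
        'request_id': request_id,
        'username': self.login_user[1:]
    }
    # Submit the login form via Splash so the JS-driven redirect is rendered.
    yield SplashFormRequest('https://www1.up.ac.za/oam/server/auth_cred_submit',
                            formdata=data,
                            args={'wait': 2},
                            callback=self.print_p)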
Probably Selenium can help you get past this JS.
If you haven't tried it yet, there are examples of combining Scrapy with Selenium you can start from. If you manage to reach the target page, you can get its URL with:
self.driver.current_url
and scrape it from there.

How to scrape AJAX based websites by using Scrapy and Splash?

I want to make a general scraper which can crawl and scrape data from any type of website, including AJAX websites. I have searched the internet extensively but could not find any proper link that explains how Scrapy and Splash together can scrape AJAX websites (including pagination, form data, and clicking a button before the page is displayed). Every link I have found says that JavaScript websites can be rendered using Splash, but there is no good tutorial/explanation of how to use Splash to render JS websites. Please don't give me solutions that involve driving a full browser (I want to do everything programmatically; headless browser suggestions are welcome, but I want to use Splash).
class FlipSpider(CrawlSpider):
    name = "flip"
    allowed_domains = ["www.amazon.com"]
    start_urls = ['https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=mobile']

    rules = (Rule(LinkExtractor(), callback='lol', follow=True),)

    def parse_start_url(self, response):
        yield scrapy.Request(response.url,
                             self.lol,
                             meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 5, 'iframes': 1}}})

    def lol(self, response):
        """
        Some code
        """
The problem with Splash and pagination is the following:
I wasn't able to produce a Lua script that delivers a new webpage (after clicking the pagination link) in the form of a response object rather than pure HTML.
So my solution is to click the link, extract the newly generated URL, and direct the crawler to this new URL.
On the page that has the pagination link I execute
yield SplashRequest(url=response.url, callback=self.get_url, endpoint="execute", args={'lua_source': script})
with the following Lua script
def parse_categories(self, response):
    script = """
        function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(1)
            splash:runjs('document.querySelectorAll(".next-page")[0].click()')
            splash:wait(1)
            return splash:url()
        end
    """
and the get_url function
def get_url(self, response):
    yield SplashRequest(url=response.body_as_unicode(), callback=self.parse_categories)
This way I was able to loop my queries.
In the same way, if you don't expect a new URL, your Lua script can just return the raw HTML, which you then have to work through with regexes (which is bad), but this is the best I was able to do.
You can emulate behaviors, like a click or a scroll, by writing a JavaScript function and telling Splash to execute that script when it renders your page.
A little example:
You define a JavaScript function that selects an element on the page and then clicks on it:
(source: splash doc)
# Get button element dimensions with javascript and perform mouse click.
_script = """
function main(splash)
    assert(splash:go(splash.args.url))

    local get_dimensions = splash:jsfunc([[
        function () {
            var rect = document.getElementById('button').getClientRects()[0];
            return {"x": rect.left, "y": rect.top}
        }
    ]])

    splash:set_viewport_full()
    splash:wait(0.1)

    local dimensions = get_dimensions()
    splash:mouse_click(dimensions.x, dimensions.y)

    -- Wait split second to allow event to propagate.
    splash:wait(0.1)

    return splash:html()
end
"""
Then, when you make your request, you modify the endpoint, setting it to "execute", and you add "lua_source": _script to the args.
Example:
def parse(self, response):
    yield SplashRequest(response.url, self.parse_elem,
                        endpoint="execute",
                        args={"lua_source": _script})
You will find all the information about Splash scripting here.
I just answered a similar question here: scraping ajax based pagination. My solution is to get the current and last pages and then replace the page variable in the request URL.
Also, the other thing you can do is look at the Network tab in the browser dev tools and see if you can identify any API that is being called. If you look at the requests under XHR you can see those that return JSON.
You can then call the API directly and parse the JSON/HTML response. Here is the link from the Scrapy docs: The Network-tool
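To illustrate the page-variable replacement idea mentioned above, here is a minimal sketch; the CSS selectors and the page query parameter are hypothetical and would need to be adapted to the actual site:

import scrapy

def parse(self, response):
    # Hypothetical selectors: read the current and last page numbers from the rendered page.
    current = int(response.css('.pagination .current::text').extract_first(default='1'))
    last = int(response.css('.pagination .last::text').extract_first(default='1'))

    # Request the remaining pages by swapping the page parameter in the URL.
    for page in range(current + 1, last + 1):
        yield scrapy.Request(response.urljoin('?page=%d' % page), callback=self.parse)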

PythonAnywhere How to handle multiple "web workers" or processes

Summary of my website: a user fills in some information and, after hitting "submit", the information is sent to the backend via AJAX. When the backend receives the information, it generates a DOCX file from it and serves that DOCX file back to the user.
Here is the AJAX code in my HTML file:
$.ajax({
    type: 'POST',
    url: '/submit/',
    data: {
        // data that I submit
    },
    dataType: 'json',
    success: function() {
        document.location = "/submit";
    }
})
My view function for /submit/ that uses send_file to return the file:
def submit(request):
    # Receive data
    # Create a file with the data and save it to the server
    return send_file(request)

def send_file(request):
    lastName = get_last_name() + '.docx'
    filename = get_full_path()  # Select your file here.
    wrapper = FileWrapper(open(filename, 'rb'))
    response = HttpResponse(wrapper, content_type='application/vnd.openxmlformats-officedocument.wordprocessingml.document')
    response['Content-Disposition'] = 'attachment; filename=' + lastName
    response['Content-Length'] = os.path.getsize(filename)
    return response
This has worked flawlessly for some time now. However, I started having problems when I increased the number of "web workers"/processes from 1 to 4 in my hosting account. What's happening is that a different web worker is being used to send the file, which creates a new instance of the site to do that. The problem is that the new instance does not have the file path that was created by the worker that generated the file.
Like I said, this worked flawlessly when my web app had only one web worker or process. Now I only have roughly a 50% success rate.
It's almost as if a process is trying to send the file before it has been created, or the process does not have access to the file name that the creating process does.
Any help would be much appreciated. Thanks!
Code attempting to send the path name through the request and then back to the server.
Submit view returning the file info back to AJAX:
def submit(request):
    # Receive data
    # Generate file with data
    lastName = get_last_name() + '.docx'
    filename = get_full_path()  # Select your file here.
    return HttpResponse(json.dumps({'lastname': lastName, 'filename': filename}), content_type="application/json")
Success Function of AJAX
success: function(fileInfo) {
    name_last = fileInfo['lastname']
    filepath = fileInfo['filepath']
    document.location = "/send";
}
So how can I get the fileInfo to be sent along with the "/send" request?
Each web worker is a separate process. They do not have access to variables set in another worker. Each request could go to any worker so there is no guarantee that you'd be using the file name that was set for a particular user. If you need to transfer information between requests, you need to store it outside of the worker's memory - you could do that in a cookie, or in a database or a file.
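For illustration, a minimal sketch of one of those options: keeping the generated file path in Django's session (database-backed by default), so whichever worker handles the follow-up request can read it. The view and helper names come from the question; the session keys are assumptions:

def submit(request):
    # Generate the file as before, then stash its path in the session,
    # which is stored outside any single worker process.
    lastName = get_last_name() + '.docx'
    filename = get_full_path()
    request.session['docx_name'] = lastName
    request.session['docx_path'] = filename
    return HttpResponse(json.dumps({'lastname': lastName}), content_type="application/json")

def send(request):
    # Any worker can serve this request because the path comes from the session, not local memory.
    lastName = request.session['docx_name']
    filename = request.session['docx_path']
    wrapper = FileWrapper(open(filename, 'rb'))
    response = HttpResponse(wrapper, content_type='application/vnd.openxmlformats-officedocument.wordprocessingml.document')
    response['Content-Disposition'] = 'attachment; filename=' + lastName
    response['Content-Length'] = os.path.getsize(filename)
    return response

A database row or a file keyed by a generated id would work just as well; the essential point is that the path is looked up per request rather than held in a worker's memory.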

jquery prompted variable value given to python script

I have what I think is a simple problem, but I cannot figure out the solution. I have a JavaScript form with options, and when the user selects an option they get prompted to input a value as below:
var integer_value = window.prompt("What is the integer value?", "defaultText")
I then need this integer value to be used by a Python script. So if, say, my Python script contains:
integer_value = 2
it would need to change to whatever value the user enters in the prompt window.
The code below should do what you need. There may be better ways to do this, but at least this way is fairly simple.
Next time you have a Web programming question, tell us what server you're using, and what framework, and any JavaScript things like jQuery or Ajax. And post a small working demo of your code, both the HTML/JavaScript and the Python, that we can actually run so we can see exactly what you're talking about.
Firstly, here's a small HTML/JavaScript page that prompts the user for data and sends it to the server.
send_to_python.html
<!DOCTYPE html>
<html>
<head><title>Send data to Python demo</title>
<script>
function get_and_send_int()
{
    var s, integer_value, default_value = 42;

    s = window.prompt("What is the integer value?", default_value);
    if (s == null || s == '')
    {
        alert('Cancelled');
        return;
    }

    //Check that the data is an integer before sending
    integer_value = parseInt(s);
    if (integer_value !== +s)
    {
        alert(s + ' is not an integer, try again');
        return;
    }

    //Send it as if it were a GET request.
    location.href = "cgi-bin/save_js_data.py?data=" + integer_value;
}
</script>
</head>
<body>
<h4>"Send data to Python" demo</h4>
<p>This page uses JavaScript to get integer data from the user via a prompt<br>
and then sends the data to a Python script on the server.</p>
<p>Click this button to enter the integer input and send it.<br>
<input type="button" value="get & send" onclick="get_and_send_int()">
</p>
</body>
</html>
And now for the CGI Python program that receives the data and logs it to a file. A proper program would have some error checking & data validation, and a way of reporting any errors. The Python CGI module would make this task a little easier, but it's not strictly necessary.
The Web server normally looks for CGI programs in a directory called cgi-bin, and that's generally used as the program's Current Working Directory. A CGI program doesn't run in a normal terminal: its stdin and stdout are essentially connected to the Web page that invoked the program, so anything it prints gets sent back to the Web page. And (depending on the request method used) it may receive data from the page via stdin.
The program below doesn't read stdin (and will appear to hang if you try). The data is sent to it as part of the URL used to invoke the program, and the server extracts that data and puts it into an environment variable that the CGI program can access.
save_js_data.py
#! /usr/bin/env python

''' save_js_data
A simple CGI script to receive data from "send_to_python.html"
and log it to a file
'''

import sys, os, time

# MUST use CRLF in HTTP headers
CRLF = '\r\n'

# Name of the file to save the data to.
outfile = 'outfile.txt'

def main():
    # Get the data that was sent from the Web page
    query = os.environ['QUERY_STRING']

    # Get the number out of QUERY_STRING
    key, val = query.split('=')
    if key == 'data':
        # We really should check that val contains a valid integer
        # Save the current time and the received data to outfile
        s = '%s, %s\n' % (time.ctime(), val)
        with open(outfile, 'a+t') as f:
            f.write(s)

    # Send the browser back to the referring Web page
    href = os.environ['HTTP_REFERER']
    s = 'Content-type: text/html' + CRLF * 2
    s += '<meta http-equiv="REFRESH" content="0;url=%s">' % href
    print s

if __name__ == '__main__':
    main()
When you save this program to your hard drive, make sure that you give it execute permissions. Since you're using Flask, I assume your server is configured to run Python CGI programs and that you know what directory it looks in for such programs.
I hope this helps.
