Execute inline JavaScript in Scrapy response - javascript

I am trying to log into a website with Scrapy, but the response received is an HTML document containing only inline JavaScript. The JS redirects to the page I want to scrape data from. But Scrapy does not execute the JS and therefore doesn't route to the page I want it to.
I use the following code to submit the login form required:
def parse(self, response):
    request_id = response.css('input[name="request_id"]::attr(value)').extract_first()
    data = {
        'userid_placeholder': self.login_user,
        'foilautofill': '',
        'password': self.login_pass,
        'request_id': request_id,
        'username': self.login_user[1:]
    }
    yield scrapy.FormRequest(url='https://www1.up.ac.za/oam/server/auth_cred_submit', formdata=data,
                             callback=self.print_p)
The print_p callback function is as follows:
def print_p(self, response):
    print(response.text)
I have looked at scrapy-splash but I could not find a way to execute the JS in the response with scrapy-splash.

I'd suggest using Splash as a rendering service. Personally, I found it more reliable than Selenium. Using scripts, you can instruct it to interact with the page.
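As a rough sketch of what "using scripts" can look like: Splash's HTTP API has an `execute` endpoint that takes a Lua script in a `lua_source` argument. The snippet below only builds such a request with the standard library and does not send it. The Splash address `http://localhost:8050`, the two-second wait, the example URL and the helper name `build_splash_request` are all assumptions for illustration, and the script does a plain GET, so the login POST from the question would still have to be adapted.

```python
import json
from urllib.request import Request

# Assumed address of a locally running Splash instance.
SPLASH_EXECUTE = "http://localhost:8050/execute"

# Lua script: load the page, give the inline JS time to redirect,
# then report where the browser ended up and the rendered HTML.
LUA_FOLLOW_JS_REDIRECT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {url = splash:url(), html = splash:html()}
end
"""


def build_splash_request(url, wait=2.0):
    """Build (but do not send) a POST request for Splash's execute endpoint."""
    payload = {"lua_source": LUA_FOLLOW_JS_REDIRECT, "url": url, "wait": wait}
    return Request(
        SPLASH_EXECUTE,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical target URL, used only to show the call.
req = build_splash_request("https://example.com/after-login")
print(req.get_method(), req.full_url)
```

With scrapy-splash the same `lua_source` string would go into the `args` of a `SplashRequest` with `endpoint="execute"`.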

Selenium can probably help you get past this JS.
If you haven't tried it yet, you can start from examples like this. If you manage to reach the target page, you can get its URL with:
self.driver.current_url
And scrape it afterwards.

Related

Use AJAX POST to return razor pages partial

I have searched for this, tried the accepted solutions from the questions stackoverflow has suggested might work and I am here as a last resort, after trying everything I can think of or find.
I would like a button on my razor page to send a post request, via an ajax function I am obliged to use, and return a razor page with no layout.
HTML
<button id="myawesomebutton">Go get a partial</button>
javascript
var myawesomeajaxobject=new ajax('/myawesomeurl');
myawesomeajaxobject.done=function(dat)
{
    document.getElementById('myawesomediv').innerHTML=dat
}
myawesomeajaxobject.go('myawesomeparameter01=1&myawesomeparameter02=2');
The AJAX object I am obliged to use adds the following headers:
Content-type, application/x-www-form-urlencoded
Access-Control-Allow-Origin, *
As well as adding a '?' followed by a unix time code to the url endpoint.
It is my understanding that the request must first be sent to the cshtml.cs class behind the razor page, which will then redirect to my partial.
No matter what I name my C# method, onPost, myawesomeurl, many other names, I get a 404 error instead of rendering the partial inside myawesomediv.
I have attempted to add an anti forgery token, set CORS values on the server to 'accept all' and tried sending the request directly to the partial's onPost but I'm getting nowhere.
UPDATE.
I have added:
services.AddRazorPages().AddRazorPagesOptions(options =>
{
    options.Conventions.ConfigureFilter(new IgnoreAntiforgeryTokenAttribute());
});
to my Startup.
My javascript reads:
var myawesomeajaxobject=new ajax('/myawesomeurl');
myawesomeajaxobject.done=function(dat)
{
    document.getElementById('myawesomediv').innerHTML=dat
}
myawesomeajaxobject.go('handler=myawesomeurl&myawesomeparameter01=1&myawesomeparameter02=2');
This is the handler method in the cshtml.cs file:
public void OnPostmyawesomeurl()
{
//myawesomecoded added, so I have a break point to hit.
}
and I'm still getting a 404.
Is this actually possible in razor pages?
It looks like you are making a POST request to a named handler method. The handler still needs On[Http Method] incorporated into its name so that it can be found. If it is a POST request, the name of the handler method should be OnPostMyAwesomeUrl.
You also need to handle the fact that Request verification is built in to Razor Pages, so you either need to include the token in your AJAX request (https://www.learnrazorpages.com/security/request-verification#ajax-post-requests-and-json), or disable it for that page entirely (https://www.learnrazorpages.com/security/request-verification#opting-out):
[IgnoreAntiforgeryToken(Order = 1001)]
public class IndexModel : PageModel
{
    ...
}
Thank you everyone for your input. This is how I achieved the desired results.
HTML
<button id="myawesomebutton">Go get a partial</button>
javascript
var myawesomeajaxobject=new ajax('/myawesomeurl');
myawesomeajaxobject.done=function(dat)
{
    document.getElementById('myawesomediv').innerHTML=dat
}
myawesomeajaxobject.go('/shared/myawesomepartial/?handler=myawesomehandler&myawesomeparameter01=1&myawesomeparameter02=2');
Startup
services.AddRazorPages().AddRazorPagesOptions(options =>
{
    options.Conventions.ConfigureFilter(new IgnoreAntiforgeryTokenAttribute());
});
myawesomepartial.cshtml.cs
public void OnPostmyawesomehandler(MyAwesomeModel myawesomemodel)
{
//Do something with myawesomemodel and return myawesomepartial.cshtml
}
It was just a question of making the URL and endpoint method names match up.

How to scrape AJAX based websites by using Scrapy and Splash?

I want to make a general scraper that can crawl and scrape data from any type of website, including AJAX websites. I have searched the internet extensively but could not find a proper resource that explains how Scrapy and Splash together can scrape AJAX websites (including pagination, form data and clicking a button before the page is displayed). Every link I have found tells me that JavaScript websites can be rendered using Splash, but there is no good tutorial or explanation of how to use Splash to render JS websites. Please don't give me solutions that rely on browsers (I want to do everything programmatically; headless browser suggestions are welcome, but I want to use Splash).
class FlipSpider(CrawlSpider):
    name = "flip"
    allowed_domains = ["www.amazon.com"]
    start_urls = ['https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=mobile']
    # One-element tuple of crawl rules (note the trailing comma).
    rules = (Rule(LinkExtractor(), callback='lol', follow=True),)
    def parse_start_url(self, response):
        yield scrapy.Request(response.url,
                             self.lol,
                             meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 5, 'iframes': 1}}})
    def lol(self, response):
        """
        Some code
        """
The problem with Splash and pagination is the following:
I wasn't able to produce a Lua script that delivers the new web page (after clicking the pagination link) as a regular response object rather than as pure HTML.
So my solution is to click the link, extract the newly generated URL, and direct the crawler to that URL.
On the page that has the pagination link, I execute
yield SplashRequest(url=response.url, callback=self.get_url, endpoint="execute", args={'lua_source': script})
with the following Lua script
def parse_categories(self, response):
    script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(1)
        splash:runjs('document.querySelectorAll(".next-page")[0].click()')
        splash:wait(1)
        return splash:url()
    end
    """
and the get_url function
def get_url(self, response):
    # The Lua script returns the new URL as the response body.
    yield SplashRequest(url=response.text, callback=self.parse_categories)
This way I was able to loop my queries.
In the same way, if you don't expect a new URL, your Lua script can just return pure HTML, which you then have to work through with regex (that is bad), but this is the best I was able to do.
You can emulate behaviors such as a click or a scroll by writing a JavaScript function and telling Splash to execute that script when it renders your page.
A small example:
You define a JavaScript function that selects an element in the page and then clicks on it:
(source: splash doc)
# Get button element dimensions with javascript and perform mouse click.
_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    local get_dimensions = splash:jsfunc([[
        function () {
            var rect = document.getElementById('button').getClientRects()[0];
            return {"x": rect.left, "y": rect.top}
        }
    ]])
    splash:set_viewport_full()
    splash:wait(0.1)
    local dimensions = get_dimensions()
    splash:mouse_click(dimensions.x, dimensions.y)
    -- Wait split second to allow event to propagate.
    splash:wait(0.1)
    return splash:html()
end
"""
Then, when you make the request, you set the endpoint to "execute" and add "lua_source": _script to the args.
Example:
def parse(self, response):
    yield SplashRequest(response.url, self.parse_elem,
                        endpoint="execute",
                        args={"lua_source": _script})
You will find all the information about Splash scripting here
I just answered a similar question here: scraping ajax based pagination. My solution is to get the current and last pages and then replace the page variable in the request URL.
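A minimal sketch of that page-replacement idea, using only the standard library. It assumes the page number lives in a query parameter called `page` and that the last page number has already been scraped from the pagination widget; the URL and the helper name `page_urls` are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def page_urls(url, last_page, param="page"):
    """Yield the URLs of every page after the current one, up to last_page."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    # Treat a missing page parameter as page 1 (an assumption).
    current = int(query.get(param, ["1"])[0])
    for number in range(current + 1, last_page + 1):
        query[param] = [str(number)]
        yield urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


# Hypothetical listing URL currently on page 2 of 4.
for next_url in page_urls("https://example.com/list?q=mobile&page=2", last_page=4):
    print(next_url)
```

In a spider, each generated URL would be passed to a new request instead of being printed.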
Also - the other thing you can do is look on the network tab in the browser dev tools and see if you can identify any API that is called. If you look at the requests under XHR you can see those that return json.
You can then call the API directly and parse the json/ html response. Here is the link from the scrapy docs:The Network-tool
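To illustrate the parsing half of that approach, here is a small sketch that works on a JSON body once you have it. The payload shape (`products`, `name`, `next_page`) is entirely hypothetical; a real site's API will use its own field names, which you would read off the XHR responses in the Network tab.

```python
import json


def parse_page(body):
    """Extract item names and the next page number from a JSON API response body."""
    data = json.loads(body)
    names = [product["name"] for product in data.get("products", [])]
    # None means the API reported no further pages (assumed convention).
    next_page = data.get("next_page")
    return names, next_page


# Made-up sample body standing in for what the XHR request returned.
sample_body = '{"products": [{"name": "Phone A"}, {"name": "Phone B"}], "next_page": 3}'
names, next_page = parse_page(sample_body)
print(names, next_page)
```

In a Scrapy callback the `body` argument would be `response.text`, and a non-empty `next_page` would drive the next request to the same API.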

How do I change pages and call a certain Javascript function with Flask?

I am not sure if I worded my question correctly. I'm not actually sure how to go about this at all.
I have a site load.html. Here I can use a textbox to enter an ID, for example 123, and the page will display some information (retrieved via a Javascript function that calls AJAX from the Flask server).
I also have a site, account.html. Here it displays all the IDs associated with an account.
I want to make it so if you click the ID in account.html, it will go to load.html and show the information required.
Basically, after I press the link, I need to change the URL to load.html, then call the Javascript function to display the information associated with the ID.
My original thoughts were to use variable routes in Flask, like @app.route('/load/<int:id>') instead of simply @app.route('/load')
But all /load does is show load.html, not actually load the information. That is done in the Javascript function I talked about earlier.
I'm not sure how to go about doing this. Any ideas?
If I need to explain more, please let me know. Thanks!
To make this more clear, I can go to load.html and call the Javascript function from the web console and it works fine. I'm just not sure how to do this with variable routes in Flask (is that the right way?) since showing the information depends on some Javascript to parse the data returned by Flask.
Flask code loading load.html
@app.route('/load')
def load():
    return render_template('load.html')
Flask code returning information
@app.route('/retrieve')
def retrieve():
    return jsonify({
        'in':in(),
        'sb':sb(),
        'td':td()
    })
/retrieve just returns a data structure from the database that is then parsed by the Javascript and output into the HTML. Now that I think about it, I suppose the variable route has to be in retrieve? Right now I'm using AJAX to send an ID over, should I change that to /retrieve/<int:id>? But how exactly would I retrieve the information, from, example, /retrieve/5? In AJAX I can just have data under the success method, but not for a simple web address.
Suppose you are passing the data to retrieve in the browser URL as
www.example.com/retrieve?Data=5
You can then read the value with
dataValue = request.args.get('Data')
You can also specify the parameter in the URL itself, like /retrieve/<page>
There are several ways to do this in Flask.
One way is
@app.route('/retrieve/', defaults={'page': 0})
@app.route('/retrieve/<page>')
def retrieve(page):
    if page:
        # Do page stuff here
        pass
    return jsonify({
        'in':in(),
        'sb':sb(),
        'td':td()})
Another way is
@app.route('/retrieve/<page>')
def retrieve(page=0):
    if page:
        # Do your page stuff here
        pass
    return jsonify({
        'in':in(),
        'sb':sb(),
        'td':td()
    })
Note: You can specify converter also like <int:page>

Grails - rendering div with a javascript call within a remoteSubmit

I have a situation where I want to hit a button in the GSP (actionSubmit) and update a div when the call finishes (which includes a call to a javascript function). I ultimately want to end up in the controller rendering the searchResults parameter and the div with the results (which is currently working).
Problem is, I need to (presumably) wrap my actionSubmit in a remoteForm. But how do I:
1) Run the javascript method already existent in the onClick
2) Render the page in the controller.
If I try both wrapped in a controller, I finish the remoteForm action and the javascript action "hangs" and never finishes.
Any ideas?
List.gsp
<g:actionSubmit type="button" value="Ping All" onclick="getIds('contactList');"/>
function getIds(checkList)
{
    var idList = new Array();
    jQuery("input[name=" + checkList + "]:checked").each
    (
        function() {
            idList.push(jQuery(this).val());
        }
    );
    $.ajax({
        url: "pingAll",
        type:"GET",
        data:{ids:JSON.stringify(idList)}
    });
}
controller:
def pingAll() {
    String ids = params.ids
    if(ids == "[]") {
        render(template:'searchResults', model:[searchResults:""])
        return
    }
    def idArray = contactService.formatIDString(ids)
    idArray.each {
        def contactInstance = Contact.get(Integer.parseInt(it))
        emailPingService.ping(contactInstance)
    }
    /**
     * Added this on 3/13. Commented out line was initial code.
     */
    def searchResults = contactSearchService.refactorSearchResults(contactSearchService.searchResults)
    render(template:'searchResults', model:[searchResults:searchResults, total:searchResults.size()])
}
You have a couple options:
1) You can avoid using the Grails remote tags (formRemote, remoteField, etc.), though I really encourage you to explore and understand how they work. The Grails remote tags are generally not very flexible. The best way to learn how they work is to write some sample tags using the examples from the Grails online docs and then look at the rendered page in a web browser. Generally speaking, all the tags do is output basic HTML with the attributes you define in your Grails tags. Open up your favorite HTML source viewer (e.g. Firebug) and see what Grails outputs for the rendered HTML.
The reason I say this is because, the code you've written so far somewhat accomplishes what I've stated above, without using any GSP tags.
g:actionSubmit submits the form you are working in using the controller action you define (which you haven't here, so it runs the action named in your value attribute). However, you also have an onClick on your actionSubmit that is running an AJAX call that also submits data to your pingAll action. Without seeing the rest of your code and what else is involved in your form, you are submitting your form twice!
You can drop actionSubmit altogether and use an input of type button (not submit) with an onClick. Then, in the javascript function that runs, define a jQuery success option for your AJAX call
$.ajax({
    url: "pingAll",
    type:"GET",
    data:{ids:JSON.stringify(idList)},
    success:function(data) {
        $('#your-updatedDiv-id-here').html(data);
    }
});
2) If you want to use the GSP tags, I think you are using the wrong one. Without knowing the full extent of your usage and form data involved, it looks like g:formRemote, g:submitToRemote, and g:remoteFunction could serve your purposes. All have attributes you can define to call javascript before the remote call, as well as defining a div to update and various event handlers.

Check iframe status after AJAX File Upload with Rails

There is a similar post Retrieving HTTP status code from loaded iframe with Javascript but the solution requires the server-side to return javascript calling a function within the iframe. Instead, I would simply like to check the HTTP status code of the iframe without having to call a function within the iframe itself since my app either returns the full site through HTML or the single object as JSON. Essentially I've been trying to implement a callback method which returns success|failure dependent upon the HTTP status code.
Currently I have uploadFrame.onLoad = function() { ... so far pretty empty ... } and I am unsure what to check for when looking for HTTP status codes. Up until now, I've mainly relied upon jQuery's $.ajax() to handle success|failure but would like to further understand the mechanics behind XHR calls and iframe use. Thanks ahead of time.
UPDATE
The solution I came up with using jQuery
form.submit(function() {
    uploadFrame.load(function() {
        //using eval because the return data is JSON
        eval( '(' + uploadFrame[0].contentDocument.body.children[0].innerHTML + ')' );
        //code goes here
    });
});
I think the best solution is to inject a <script> tag into your iframe's <head> and put your "detecting" javascript code there.
Something like this:
$('#iframeHolderDivId').html($.get('myPage.php'));
$('#iframeHolderDivId iframe head').delay(1000).append($('<script/>').text('your js function to detect load status'));
Maybe it's not the best solution but I think it works
