I'm attempting to use Scrapy and Splash to retrieve the staff names, job titles, and emails from a particular website's staff page: https://www.kennedaleisd.net/Page/3884. I'm running Splash in Docker since the emails are hidden behind dynamic JavaScript code.
The spider works on the first page of the staff listing, but I can't get it to work on the 2nd or 3rd pages. I opened developer tools, copied the request that is sent when you click one of the pagination links, and attempted to replicate that request in the spider. The problem appears to be that the response for that request returns only a subset of the page's code (just the staff for that page) rather than everything, including the accompanying JavaScript. So when that response is passed to Splash, it doesn't have the script needed to create the dynamic content. I also noticed that the request has a cookie entry named RedirectTo which points back to the parent page. I tried including that cookie in the requests and passing cookies from the first request to the paginated pages, but neither worked. I also tried some Lua scripts in the Splash request, but that didn't get me what I wanted either. Below is the spider as I have it right now.
I'm not sure whether there's some way to re-use the JavaScript with subsequent requests, or to use that redirect cookie in some way to get the rest of the needed code. Any help would be appreciated. I realize my pagination is probably not the proper way to loop through pages, but I figured I could work on that once I get the reading of the data figured out.
import scrapy
from scrapy_splash import SplashRequest


class TestSpider(scrapy.Spider):
    name = 'TestSpider'
    start_urls = ['https://www.kennedaleisd.net/Page/3884']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        for item in response.css('div.staff'):
            name = item.css('li.staffname::text').get()
            title = item.css('li.staffjob::attr(data-value)').get()
            email = item.css('li.staffemail a::attr(href)').get()
            staffURL = response.request.url
            yield {
                'name': name,
                'title': title,
                'email': email,
                'staffURL': staffURL
            }

        if response.css('a.ui-page-number-current-span::text').get() == '1':
            pagination_results = response.css(
                'li.ui-page-number a:not([class^="ui-page-number-current-span"])::text').getall()
            base_url = 'https://www.kennedaleisd.net//cms/UserControls/ModuleView/ModuleViewRendererWrapper.aspx?DomainID=2042&PageID=3884&ModuleInstanceID=6755&PageModuleInstanceID=7911&Tag=&PageNumber='
            # backend_url = '&RenderLoc=0&FromRenderLoc=0&IsMoreExpandedView=false&EnableQuirksMode=0&Filter=&ScreenWidth=922&ViewID=00000000-0000-0000-0000-000000000000&_=1584114139549'
            for i in pagination_results:
                next_page = base_url + str(i)  # + backend_url
                yield response.follow(next_page, callback=self.parse, meta={
                    'splash': {
                        'endpoint': 'render.html',
                        'args': {'wait': 3}
                    }
                })
Well, after a bit of tinkering I figured out how to handle this with the Lua script I had been toying with: it loads the first page in Splash and then clicks the pagination link for the page I want. I'd still much prefer a different method if there is something more official than scripting.
import scrapy
from scrapy_splash import SplashRequest

script_frontend = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
    })
    assert(splash:wait(3))
    assert(splash:select('#ui-paging-container > ul > li:nth-child("""

script_backend = """) > a'):mouse_click())
    assert(splash:wait(3))
    local entries = splash:history()
    local last_response = entries[#entries].response
    return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""


class TestSpider(scrapy.Spider):
    name = 'TestSpider'
    start_urls = ['https://www.kennedaleisd.net/Page/3884']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        for item in response.css('div.staff'):
            name = item.css('li.staffname::text').get()
            title = item.css('li.staffjob::attr(data-value)').get()
            email = item.css('li.staffemail a::attr(href)').get()
            staffURL = response.request.url
            yield {
                'name': name,
                'title': title,
                'email': email,
                'staffURL': staffURL
            }

        if response.css('a.ui-page-number-current-span::text').get() == '1':
            pagination_results = response.css(
                'li.ui-page-number a:not([class^="ui-page-number-current-span"])::text').getall()
            for i in pagination_results:
                script = script_frontend + str(i) + script_backend
                yield SplashRequest(self.start_urls[0], self.parse,
                                    endpoint='execute',
                                    cache_args=['lua_source'],
                                    args={'lua_source': script},
                                    headers={'X-My-Header': 'value'},
                                    session_id='foo'
                                    )
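As a variation on the string concatenation above, the page index could be passed to Splash as a request argument and read inside the script through splash.args, so the Lua source stays identical for every page (which is also what makes cache_args=['lua_source'] worthwhile). This is only a sketch: the argument name page_index and the helper splash_args_for_page are made up for illustration, and the script returns fewer fields than the one above.

```python
# Lua source that reads the page index from the request arguments instead of
# having it pasted into the script text. "page_index" is a hypothetical name.
LUA_CLICK_PAGE = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go(splash.args.url))
    assert(splash:wait(3))
    local selector = '#ui-paging-container > ul > li:nth-child('
        .. splash.args.page_index .. ') > a'
    assert(splash:select(selector):mouse_click())
    assert(splash:wait(3))
    return {html = splash:html(), cookies = splash:get_cookies()}
end
"""


def splash_args_for_page(page_index):
    """Build the `args` dict for a SplashRequest that clicks one page link."""
    index = int(page_index)  # rejects anything that is not a whole number
    if index < 1:
        raise ValueError('page_index must be 1 or greater')
    # Sent as a string so it concatenates cleanly into the CSS selector in Lua.
    return {'lua_source': LUA_CLICK_PAGE, 'page_index': str(index)}
```

In the spider this would be used as args=splash_args_for_page(i) in place of args={'lua_source': script}.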
My URLs are not matching the hashtags; I get a 404 (URL not found).
This is my base.html with the hashtag script:
$(document).ready(function() {
    $("p").each(function(data) {
        var strText = $(this).html();
        console.log('1. strText=', strText);
        var arrElems = strText.match(/#[a-zA-Z0-9]+/g);
        console.log('arrElems=', arrElems);
        $.each(arrElems, function(index, value){
            strText = strText.toString().replace(value, ''+value+'');
        });
        console.log('2. strText=', strText);
        $(this).html(strText);
    });
});
My hashtag models.py:
from django.db import models
from blog.models import post

# Create your models here.
class HashTag(models.Model):
    tag = models.CharField(max_length=120)
    timestamp = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.tag

    def get_postss(self):
        return post.objects.filter(content__icontains="#" + self.tag)
This is my hashtag view:
from django.shortcuts import render
from django.views import View
from .models import HashTag

# Create your views here.
class HashTagView(View):
    def get(self, request, hashtag, *args, **kwargs):
        obj, created = HashTag.objects.get_or_create(tag=hashtag)
        return render(request, 'hashtags/tag_view.html', {"obj": obj})
I put the URL into the primary urls.py of my site:
from hashtags.views import HashTagView
from django.urls import path, re_path, include

urlpatterns = [
    re_path(r'^tags/(?P<hashtag>.*)/$', HashTagView.as_view(), name='hashtag'),
]
There is one issue with your implementation: the browser does not send the # part of a URL to the server. So if your URL is /tags/#tag, then #tag is never sent. Further reading: Why is the hash part of the URL not available on the server side?
Because of this behavior, your browser requests the /tags/ URL, which your pattern does not match. That is the reason for your 404 error.
You can check how Twitter does it: if the hashtag is #DelhiElectionResults, then the URL for that hashtag is https://twitter.com/hashtag/DelhiElectionResults.
Solution: just remove the # from the URL so that it looks like /tags/tag/. In your JS, you can use value.replace('#', '') to strip the # from the value before building the link:
$.each(arrElems, function(index, value){
    strText = strText.toString().replace(value, ''+value+'');
});
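To make the point about the fragment concrete, here is a small sketch in Python using only the standard library. It shows which part of a URL reaches the server and how the question's pattern behaves against it. The helper names and the example.com host are invented for illustration, and the matching step assumes Django's usual behavior of testing patterns against the path without its leading slash.

```python
import re
from urllib.parse import urlsplit

# The pattern from the question's urls.py.
TAG_PATTERN = re.compile(r'^tags/(?P<hashtag>.*)/$')


def path_sent_to_server(url):
    """Return the path the browser actually requests; the fragment is dropped."""
    return urlsplit(url).path


def hashtag_path(tag):
    """Build a fragment-free URL path for a hashtag such as '#django'."""
    return '/tags/{}/'.format(tag.lstrip('#'))


requested = path_sent_to_server('http://example.com/tags/#django')
print(requested)                                      # /tags/
print(TAG_PATTERN.search(requested.lstrip('/')))      # None, hence the 404

fixed = hashtag_path('#django')
print(fixed)                                          # /tags/django/
print(TAG_PATTERN.search(fixed.lstrip('/')).group('hashtag'))  # django
```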
I need a little help here.
Context: I am able to log in using email with no problem. redirect and url_for work flawlessly.
When I log in with Google, though, the login succeeds but the redirect does not happen, so the page is not reloaded, the logout button is not shown, and so on.
Relevant code:
The Flask route behind the authorized login URL, "http://localhost:5000/oauth2callback":
By the way, I know I shouldn't use Google's ID; I am still testing it.
@app.route('/oauth2callback/<id>/<nome>/<email>', methods=['POST'])
def oauth2callback(id, nome, email):
    #print(f'o ID é {id}, o nome é {nome} e o email é {email}')
    try:
        if User().query.filter_by(email = email).first():
            usuario_google = User().query.filter_by(email = email).first()
            print(usuario_google)
            login_user(usuario_google)
            print('usuario logado')
            return redirect(url_for('home', next=request.url))
        else:
            sessao_google = User(username=email, email=email, nome=nome)
            senhas_google = Senha(senha='')
            db.session.add(sessao_google)
            db.session.commit()
            db.session.add(senhas_google)
            db.session.commit()
            print('Registrado')
            return redirect(url_for('login', next=request.url))
    except Exception as e:
        raise redirect(url_for('login'))
    finally:
        pass
    return redirect(url_for('home'))
I will add the JavaScript just in case:
function onSignIn(googleUser) {
    var profile = googleUser.getBasicProfile();
    var xhttps = new XMLHttpRequest();
    var novaurl = 'http://localhost:5000/oauth2callback/'+profile.getId()+'/'+profile.getName()+'/'+profile.getEmail();
    console.log(novaurl)
    xhttps.open('POST', novaurl);
    xhttps.send();
}
Thank you for any ideas/help.
I have written a Django script that runs a Python parser to scrape the web. I am sending the request to the Django script via AJAX. However, when the AJAX call runs, it comes back as 404 Not Found for the URL. Why is this happening?
My code is below:
Ajax (with jQuery):
//send a `post` request up to AWS, and then
//insert the data into the paths
$.post('/ca', function(data){
    //evaluate the JSON
    data = eval ("(" + data + ")");
    //insert the vars into the DOM
    var contentOne;
    contentOne = data.bridge_time;
    contentOne += 'min delay';
    $('#timeone').html(contentOne);
    var contentTwo;
    contentTwo = data.tunnel_time;
    contentTwo += 'min delay';
    $('#timetwo').html(contentTwo);
    //if this falls through, push an error.
    var tunnel_time = data.tunnel_time;
    var bridge_time = data.bridge_time;
    var tunnel = document.getElementById('tunnel');
    var bridge = document.getElementById('bridge');
    var tunnelText = document.getElementById('timeone');
    var bridgeText = document.getElementById('timetwo');
    //algo for the changing icons. Kudos to Vito
    if(tunnel_time<bridge_time){
        tunnel.src="tunnel3.png";
        bridge.src="bridge2r.png";
    }else if( bridge_time<tunnel_time){
        bridge.src="bridge21.png";
        tunnel.src="tunnel2r.png";
    }else{
        bridge.src="bridge2n.png";
        tunnel.src="tunnel2g.png";
    }
    $.fail(function() {
        alert("We're sorry. We are having an error. Check back later.");
    });
});
My urls.py:
from django.conf.urls.defaults import *
from views import views

urlpatterns = patterns('',
    (r'^/us', views.american_time),
    (r'^/ca', views.canadian_time),
)
My urls.py and my views.py are in the same folder, if that makes any difference. They are just titled views.py and urls.py. Thank you!
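Try anchoring each pattern with a trailing slash. Django matches patterns against the path without its leading slash, so the patterns should not start with one either:
from django.conf.urls.defaults import *
from views import views

urlpatterns = patterns('',
    (r'^us/$', views.american_time),
    (r'^ca/$', views.canadian_time),
)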
Also you have to add the trailing slash in your JavaScript.
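The effect of the leading and trailing slashes can be checked with plain regular expressions. The sketch below is not Django code: django_style_match is a made-up helper that only imitates the one relevant detail, namely that Django drops the leading "/" of the request path before trying the patterns in urls.py.

```python
import re


def django_style_match(pattern, request_path):
    """Roughly imitate how one urls.py pattern is tested against a request.

    Assumption: the single leading "/" is removed before the regex is applied,
    which is why a pattern such as r'^/ca' can never match.
    """
    path = request_path[1:] if request_path.startswith('/') else request_path
    return re.search(pattern, path) is not None


print(django_style_match(r'^/ca', '/ca'))    # False: pattern expects a leading slash
print(django_style_match(r'^ca/$', '/ca/'))  # True
print(django_style_match(r'^ca/$', '/ca'))   # False: the request lacks the trailing slash
```

The last case is why the JavaScript has to post to '/ca/' rather than '/ca' once the pattern ends in '/$'.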
I just resolved this: there was an error in my urls.py. My system was having trouble with the .defaults module that it was supposed to import from. Also, I didn't have a Django project set up, so it wouldn't import the views.
I'm trying to add 'Share on Twitter' functionality on one of the pages of my Django-powered site. Here's the relevant portion of link_page.html:
<a class="tweet_link metaSpacing" data-link_id={{ link.id }}>Share on Twitter</a>
Here's the JavaScript portion responsible for listening for events:
// tweet_link
$('.tweet_link').click(function(e){
    link_id = $(this).attr('data-link_id');
    var target = $(this);
    tweetLink(target);
});

function tweetLink(t){
    link_id = t.attr('data-link_id');
    $.ajax({
        type: "POST",
        data: { "link_id": link_id, "csrfmiddlewaretoken": csrfmiddlewaretoken},
        url: "/tweet_link",
    });
};
In Django I added the following line at the end of the urls.py:
url(r'^tweet_link/?$', 'portnoy.views.tweet_link'),
And here's the Django view itself:
# tweet the link
def tweet_link(request):
    c = RequestContext(request)
    twitter = Twython(
        twitter_token = TWITTER_KEY,
        twitter_secret = TWITTER_SECRET,
        oauth_token = request.session['request_token']['oauth_token'],
        oauth_token_secret = request.session['request_token']['oauth_token_secret']
    )
    twitter.updateStatus(status="See how easy this was?")
    return HttpResponse('')
However, when I click the 'Share on Twitter' link, I get this error in the Chrome console:
POST http://127.0.0.1:8000/tweet_link 404 (NOT FOUND)
Any idea how to fix this? Thanks in advance!
Your url shouldn't have a ? in it:
url(r'^tweet_link/$', 'portnoy.views.tweet_link'),
Also note that you can simply open a link with this URL to tweet something, without any server-side code:
https://twitter.com/intent/tweet?source=webclient&text=d+twitter+msg+goes+here
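A link like the one above could be generated on the server with the standard library, which takes care of the URL encoding. This is only a sketch: tweet_intent_url is a hypothetical helper, and it uses just the text parameter shown above plus the optional url parameter of Twitter's web intents.

```python
from urllib.parse import urlencode


def tweet_intent_url(text, url=None):
    """Build a Twitter web-intent link that pre-fills a tweet."""
    params = {'text': text}
    if url:
        params['url'] = url  # optional link appended to the tweet
    # urlencode escapes spaces, '?', '&' and other reserved characters.
    return 'https://twitter.com/intent/tweet?' + urlencode(params)


print(tweet_intent_url('See how easy this was?'))
```

The resulting string can be placed directly in the href of the 'Share on Twitter' anchor, so no AJAX call or OAuth token is needed for this simple case.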
I'm developing a web application and I'm stuck on a problem with one feature. You can check it out here: http://qlimp.com. You can log in with username/password dummy/dummy.
After logging in, please click the link "Go to cover settings". You will see a palette where you can upload images and enter some text.
When you upload an image, an ajax request that I've written in jQuery uploads it to the server and shows a full-page background preview of that image.
jQuery:
$('#id_tmpbg').live('change', function()
{
    $("#ajax-loader").show();
    $("#uploadform").ajaxForm({success: showResponse}).submit();
});

function showResponse(responseText, statusText, xhr, $form) {
    $.backstretch(responseText)
    $("#ajax-loader").hide();
}
So the problem here is, when I upload the image, it shows
ValueError at /cover/
The view cover.views.backgroundview didn't return an HttpResponse object.
Request Method: POST Request URL: http://qlimp.com/cover/
As far as I can tell, I am returning an HttpResponse object in the view.
Views.py:
@login_required
def backgroundview(request):
    if request.is_ajax():
        form = BackgroundModelForm(request.POST, request.FILES)
        if form.is_valid():
            try:
                g = BackgroundModel.objects.get(user=request.user)
            except BackgroundModel.DoesNotExist:
                data = form.save(commit=False)
                data.user = request.user
                data.save()
            else:
                if g.tmpbg != '' and g.tmpbg != g.background:
                    image_path = os.path.join(settings.MEDIA_ROOT, str(g.tmpbg))
                    try:
                        os.unlink(image_path)
                    except:
                        pass
                data = BackgroundModelForm(request.POST, request.FILES, instance=g).save()
                return HttpResponse(data.tmpbg.url)
    else:
        form = BackgroundModelForm()
        return render_to_response("cover.html", {'form': form}, context_instance=RequestContext(request))
Models.py:
class BackgroundModel(models.Model):
    user = models.OneToOneField(User)
    background = models.ImageField(upload_to='backgrounds', null=True, blank=True)
    tmpbg = models.ImageField(upload_to='backgrounds', null=True, blank=True)

class BackgroundModelForm(ModelForm):
    class Meta:
        model = BackgroundModel
        exclude = ('user','background')
This all works on my computer (the image is saved and the background preview is shown) but not on the production server. Why is that?
I've uploaded the same code to the server.
Could anyone help me? Thanks!
You are not returning a response if the form is valid.
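Django raises this ValueError whenever a view returns None, and in Python that is what happens when execution reaches the end of a function without hitting a return. The sketch below reproduces the effect without Django. Both functions and their string return values are stand-ins invented for illustration, not the view from the question.

```python
def leaky_view(is_ajax, form_is_valid):
    """Stand-in for a view in which one branch has no return statement."""
    if is_ajax:
        if form_is_valid:
            return 'HttpResponse(image url)'
        # No return here: an AJAX request with an invalid form falls through.
    else:
        return 'render(cover.html)'


def safe_view(is_ajax, form_is_valid):
    """Same flow, but every path ends in an explicit response."""
    if is_ajax:
        if form_is_valid:
            return 'HttpResponse(image url)'
        return 'HttpResponse(form errors, status=400)'
    return 'render(cover.html)'


print(leaky_view(True, False))  # None, which Django reports as the ValueError
print(safe_view(True, False))   # HttpResponse(form errors, status=400)
```

Tracing each combination of conditions through the real view in the same way shows which branch is missing its return; the difference between the two machines is then just a matter of which branch the request happens to take.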