Scrapy only scraping the first two pages

I'm trying to scrape a website, but I need to use Splash on every page because the content is created dynamically. Right now it renders only the first 2 pages, even though there are 47 pages in total.
Here's the code:
import scrapy
from scrapy.http import Request
from scrapy_splash import SplashRequest
class JobsSpider(scrapy.Spider):
    name = 'jobs'
    start_urls = ['https://jobs.citizensbank.com/search-jobs']
    def start_requests(self):
        filters_script = """function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(3)
            return splash:html()
        end"""
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='execute',
                                args={'lua_source': filters_script})
    def parse(self, response):
        cars_urls = response.xpath('.//section[@id="search-results-list"]/ul/li/a/@href').extract()
        for car_url in cars_urls:
            absolute_car_url = response.urljoin(car_url)
            yield scrapy.Request(absolute_car_url,
                                 callback=self.parse_car)
        script_at_page_1 = """function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(3)
            next_button = splash:select("a[class=next]")
            next_button.mouse_click()
            splash:wait(3)
            return {
                url = splash:url(),
                html = splash:html()
            }
        end"""
        script_at_page_2 = """function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(3)
            next_button = splash:select("a[class=next]")
            next_button.mouse_click()
            splash:wait(3)
            return {
                url = splash:url(),
                html = splash:html()
            }
        end"""
        script = None
        # Compare URLs by value; "is not" tests object identity
        if response.url != self.start_urls[0]:
            script = script_at_page_2
        else:
            script = script_at_page_1
        yield SplashRequest(url=response.url,
                            callback=self.parse,
                            endpoint='execute',
                            args={'lua_source': script})
    def parse_car(self, response):
        jobtitle = response.xpath('//h1[@itemprop="title"]/text()').extract_first()
        location = response.xpath('//span[@class="job-info"]/text()').extract_first()
        jobid = response.xpath('//span[@class="job-id job-info"]/text()').extract_first()
        yield {'jobtitle': jobtitle,
               'location': location,
               'jobid': jobid}
I've played with it in every way I could think of, but it didn't work.
I'm new to Scrapy, so any help is appreciated.

I think you do not need to use Splash for this. If you look at the network tab of your browser inspector you will see it is making requests to this URL under XHR:
https://jobs.citizensbank.com/search-jobs/results?ActiveFacetID=0&CurrentPage=3&RecordsPerPage=15&Distance=50&RadiusUnitType=0&Keywords=&Location=&Latitude=&Longitude=&ShowRadius=False&CustomFacetName=&FacetTerm=&FacetType=0&SearchResultsModuleName=Search+Results&SearchFiltersModuleName=Search+Filters&SortCriteria=0&SortDirection=0&SearchType=5&CategoryFacetTerm=&CategoryFacetType=&LocationFacetTerm=&LocationFacetType=&KeywordType=&LocationType=&LocationPath=&OrganizationIds=&PostalCode=&fc=&fl=&fcf=&afc=&afl=&afcf=
Try making requests to this URL, changing the page number each time. If you have trouble, you may need to look at the headers of the XHR request and replicate them as well. If you open the link, the JSON loads in your browser. So set the paged URL as your start_url and override start_requests as follows:
start_urls = ['https://jobs.citizensbank.com/search-jobs/results?ActiveFacetID=0&CurrentPage={}&RecordsPerPage=15&Distance=50&RadiusUnitType=0&Keywords=&Location=&Latitude=&Longitude=&ShowRadius=False&CustomFacetName=&FacetTerm=&FacetType=0&SearchResultsModuleName=Search+Results&SearchFiltersModuleName=Search+Filters&SortCriteria=0&SortDirection=0&SearchType=5&CategoryFacetTerm=&CategoryFacetType=&LocationFacetTerm=&LocationFacetType=&KeywordType=&LocationType=&LocationPath=&OrganizationIds=&PostalCode=&fc=&fl=&fcf=&afc=&afl=&afcf=']
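def start_requests(self):
    num_pages = 10  # set this to the real page count (47 in the question)
    # CurrentPage starts at 1, so include num_pages itself
    for page in range(1, num_pages + 1):
        yield scrapy.Request(self.start_urls[0].format(page), callback=self.parse)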
It's also worth noting that you can set the RecordsPerPage parameter. You may be able to set it higher and get all records on one page, or at least make fewer requests to get all records.
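The paging idea above can be sketched without Scrapy. This is a minimal sketch, not the site's documented API: it assumes the endpoint tolerates omitting the empty query parameters, and the helper names `build_page_url` and `page_count` are made up for illustration. Only parameter names that appear in the URL above are used.

```python
from math import ceil
from urllib.parse import urlencode

# Endpoint taken from the XHR URL above.
BASE_URL = "https://jobs.citizensbank.com/search-jobs/results"


def page_count(total_records, records_per_page):
    """Number of requests needed to cover every record."""
    return ceil(total_records / records_per_page)


def build_page_url(page, records_per_page=15):
    """Build the results URL for one page (hypothetical helper).

    Assumption: the server accepts this reduced parameter set. If it does
    not, copy the full query string from the browser's network tab.
    """
    params = {
        "ActiveFacetID": 0,
        "CurrentPage": page,
        "RecordsPerPage": records_per_page,
        "SearchResultsModuleName": "Search Results",
        "SearchFiltersModuleName": "Search Filters",
        "SearchType": 5,
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # 47 pages at 15 records per page is the figure from the question.
    for page in range(1, 47 + 1):
        print(build_page_url(page))
```

Raising `records_per_page` lowers the number of URLs that `page_count` reports, which is the trade-off mentioned above.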

Related

No data scraping a table using Apps Script

I'm trying to scrape the first table (FINRA TRACE Bond Market Activity) of this website using Google Apps Script and I'm getting no data.
https://finra-markets.morningstar.com/BondCenter/TRACEMarketAggregateStats.jsp
function myFunction() {
  const url = 'https://finra-markets.morningstar.com/BondCenter/TRACEMarketAggregateStats.jsp';
  const res = UrlFetchApp.fetch(url, { muteHttpExceptions: true }).getContentText();
  const $ = Cheerio.load(res);
  var data = $('table').first().text();
  Logger.log(data);
}
I have also tried fetching this page, and I do not get any result either.
https://finra-markets.morningstar.com/transferPage.jsp?path=http%3A%2F%2Fmuni-internal.morningstar.com%2Fpublic%2FMarketBreadth%2FC&_=1655503161665
I can't find a solution on the web, so I'm asking for help here.
Thanks in advance.

This page does a lot of things in the background. First, there is a POST request to https://finra-markets.morningstar.com/finralogin.jsp that initiates the session. Then, XHR requests are made to load the data tables. If you grab the cookie by POSTing to that login page, you can pass it along with the desired XHR call, which then returns the table. The date you want to fetch data for can be set with the date URL parameter. Here is an example:
function fetchFinra() {
  const LOGIN_URL = "https://finra-markets.morningstar.com/finralogin.jsp";
  const DATE = "06/24/2022" //the desired date
  let opts = {
    method: "POST",
    payload: JSON.stringify({redirectPage: "/BondCenter/TRACEMarketAggregateStats.jsp"})
  };
  let res = UrlFetchApp.fetch(LOGIN_URL, opts);
  let cookies = res.getAllHeaders()["Set-Cookie"];
  const XHR_URL = `https://finra-markets.morningstar.com/transferPage.jsp?path=http%3A%2F%2Fmuni-internal.morningstar.com%2Fpublic%2FMarketBreadth%2FC&_=${new Date().getTime()}&date=${DATE}`;
  res = UrlFetchApp.fetch(XHR_URL, { headers: {'Cookie': cookies.join(";")}} );
  const $ = Cheerio.load(res.getContentText());
  var data = $('table td').text();
  Logger.log(data);
}
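The cookie hand-off in the example above forwards the Set-Cookie values as they arrive. Those values usually also carry attributes such as Path or HttpOnly, whereas a Cookie request header conventionally holds only name=value pairs. A minimal Python sketch of that reduction follows; the helper name `cookie_header` and the sample cookie names in the usage note are made up for illustration and are not taken from the FINRA site.

```python
def cookie_header(set_cookie_values):
    """Build a Cookie request header from Set-Cookie response values.

    Accepts a single string or a list of strings, because an HTTP client
    may return either shape depending on how many cookies were set.
    """
    if isinstance(set_cookie_values, str):
        set_cookie_values = [set_cookie_values]
    pairs = []
    for value in set_cookie_values:
        # Everything after the first ';' is an attribute, not cookie data.
        pair = value.split(";", 1)[0].strip()
        if "=" in pair:
            pairs.append(pair)
    return "; ".join(pairs)
```

For example, `cookie_header(["a=1; Path=/; HttpOnly", "b=2; Secure"])` keeps only the two name=value pairs.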

How do I add a js map to Pyqt5?

I want to embed the map provided by MarineTraffic in a PyQt5 application. When I add the HTML code provided by MarineTraffic to my own program, it doesn't work.
The map I want to add:
MarineTraffic Map JS
from PyQt5 import QtCore, QtGui, QtWidgets, QtWebEngineWidgets, QtWebChannel
class Backend(QtCore.QObject):
    valueChanged = QtCore.pyqtSignal(str)
    def __init__(self, parent=None):
        super().__init__(parent)
        self._value = ""
    @QtCore.pyqtProperty(str)
    def value(self):
        return self._value
    @value.setter
    def value(self, v):
        self._value = v
        self.valueChanged.emit(v)
class Widget(QtWidgets.QWidget):
    def __init__(self, parent=None):
        super().__init__(parent)
        self.webEngineView = QtWebEngineWidgets.QWebEngineView()
        self.label = QtWidgets.QLabel(alignment=QtCore.Qt.AlignCenter)
        lay = QtWidgets.QVBoxLayout(self)
        lay.addWidget(self.webEngineView, stretch=1)
        lay.addWidget(self.label, stretch=1)
        backend = Backend(self)
        backend.valueChanged.connect(self.label.setText)
        backend.valueChanged.connect(self.foo_function)
        self.channel = QtWebChannel.QWebChannel()
        self.channel.registerObject("backend", backend)
        self.webEngineView.page().setWebChannel(self.channel)
        path = "index.html"
        self.webEngineView.setUrl(QtCore.QUrl.fromLocalFile(path))
    @QtCore.pyqtSlot(str)
    def foo_function(self, value):
        print(value)
if __name__ == "__main__":
    import sys
    app = QtWidgets.QApplication(sys.argv)
    w = Widget()
    w.show()
    sys.exit(app.exec_())
When I run it, I get a connection failed error.
I get the same error with every method I have found and tried. What am I doing wrong? Can you help?

Read the documentation for QtCore.QUrl.fromLocalFile:
"A file URL with a relative path only makes sense if there is a base URL to resolve it against."
So we add the base path:
import os
...
path = os.getcwd() + "\\index.html"
self.webEngineView.setUrl(QtCore.QUrl.fromLocalFile(path))
For path handling that works across operating systems, use pathlib (edited):
from pathlib import Path
...
base_path = Path.cwd()
full_path = base_path.joinpath('index.html')
self.webEngineView.setUrl(QtCore.QUrl.fromLocalFile(str(full_path)))
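The relative-versus-absolute distinction can be checked without Qt. The sketch below uses only pathlib; the helper name `local_file_url` is made up for illustration, and it assumes index.html lives in the current working directory unless another base directory is passed.

```python
from pathlib import Path


def local_file_url(filename, base_dir=None):
    """Return an absolute file:// URL for a file (hypothetical helper).

    Assumption: the file sits in the current working directory unless
    base_dir says otherwise. Path.as_uri() only accepts absolute paths,
    which is the same requirement the quoted documentation describes.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return (base / filename).resolve().as_uri()


if __name__ == "__main__":
    print(Path("index.html").is_absolute())  # a bare file name is relative
    print(local_file_url("index.html"))
```

If the application may be started from a different directory, passing the script's own folder as `base_dir` is usually more predictable than relying on the working directory.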

Video is freezed while video streaming

I found the following code for streaming video over a socket in Python 2.7. When I run it, the video freezes at the beginning on the server side (it shows the video in a web browser). I debugged the code and found that, in streamer.py, the third while loop never terminates because its condition, while len(data) < msg_size:, is always satisfied. In other words, len(data) is always less than msg_size, so streamer.py never returns the image to server.py. Could anyone help me solve this issue?
The server.py is:
from flask import Flask, render_template, Response
from streamer import Streamer
app = Flask(__name__)
def gen():
    streamer = Streamer('localhost', 8089)
    streamer.start()
    while True:
        if streamer.client_connected():
            yield (b'--frame\r\n'b'Content-Type: image/jpeg\r\n\r\n' +
                   streamer.get_jpeg() + b'\r\n\r\n')
@app.route('/')
def index():
    return render_template('index.html')
@app.route('/video_feed')
def video_feed():
    return Response(gen(), mimetype='multipart/x-mixed-replace; boundary=frame')
if __name__ == '__main__':
    app.run(host='localhost', threaded=True)
The streamer.py is:
import threading
import socket
import struct
import StringIO
import json
import numpy
import cv2
class Streamer (threading.Thread):
    def __init__(self, hostname, port):
        threading.Thread.__init__(self)
        self.hostname = hostname
        self.port = port
        self.connected = False
        self.jpeg = None
    def run(self):
        self.isRunning = True
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        print 'Socket created'
        s.bind((self.hostname, self.port))
        print 'Socket bind complete'
        data = ""
        payload_size = struct.calcsize("L")
        s.listen(10)
        print 'Socket now listening'
        while self.isRunning:
            conn, addr = s.accept()
            print 'while 1...'
            while True:
                data = conn.recv(4096)
                print 'while 2...'
                if data:
                    packed_msg_size = data[:payload_size]
                    data = data[payload_size:]
                    msg_size = struct.unpack("L", packed_msg_size)[0]
                    while len(data) < msg_size:# the infinite loop is here(my problem)!
                        data += conn.recv(4096)
                        print ("lenght of data is " , len(data) )
                        print ("message size is " , msg_size )
                    frame_data = data[:msg_size]
                    #frame_data = data[:len(data)]
                    memfile = StringIO.StringIO()
                    memfile.write(json.loads(frame_data).encode('latin-1'))
                    memfile.seek(0)
                    frame = numpy.load(memfile)
                    ret, jpeg = cv2.imencode('.jpg', frame)
                    self.jpeg = jpeg
                    self.connected = True
                    print 'recieving...'
                else:
                    conn.close()
                    self.connected = False
                    print 'connected=false...'
                    break
        self.connected = False
    def stop(self):
        self.isRunning = False
    def client_connected(self):
        return self.connected
    def get_jpeg(self):
        return self.jpeg.tobytes()
Client.py is:
import socket
import sys
import pickle
import struct
import StringIO
import json
import time
import cv2
import numpy as np
cap=cv2.VideoCapture(0)
clientsocket=socket.socket(socket.AF_INET,socket.SOCK_STREAM)
clientsocket.connect(('localhost',8089))
while(cap.isOpened()):
    ret,frame=cap.read()
    memfile = StringIO.StringIO()
    np.save(memfile, frame)
    memfile.seek(0)
    data = json.dumps(memfile.read().decode('latin-1'))
    clientsocket.sendall(struct.pack("L", len(data))+data)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
I want to show the video captured by my laptop's camera on a client machine in the same network. I expect a video stream, but in the browser I see only a single image that does not update.

When I analyzed this code, I noticed that the default implementation for sending OpenCV frames over the network was not working. I decided to replace it with a ZeroMQ implementation I have used before. You can check out the linked question for a deeper explanation of how the streaming works. I have also packaged it into classes, with unit tests and documentation, as SmoothStream; check that out too.
Coming back to the question, here is the working code.
client.py
import base64
import cv2
import zmq
context = zmq.Context()
footage_socket = context.socket(zmq.PUB)
footage_socket.connect('tcp://localhost:5555')
camera = cv2.VideoCapture(0) # init the camera
while True:
    try:
        grabbed, frame = camera.read() # grab the current frame
        frame = cv2.resize(frame, (640, 480)) # resize the frame
        encoded, buffer = cv2.imencode('.jpg', frame)
        jpg_as_text = base64.b64encode(buffer)
        footage_socket.send(jpg_as_text)
    except KeyboardInterrupt:
        camera.release()
        cv2.destroyAllWindows()
        break
server.py
from flask import Flask, render_template, Response
from streamer import Streamer
app = Flask(__name__)
def gen():
    streamer = Streamer('*', 5555)
    streamer.start()
    while True:
        if streamer.client_connected():
            yield (b'--frame\r\n'b'Content-Type: image/jpeg\r\n\r\n' + streamer.get_jpeg() + b'\r\n\r\n')
@app.route('/')
def index():
    return render_template('index.html')
@app.route('/video_feed')
def video_feed():
    return Response(gen(), mimetype='multipart/x-mixed-replace; boundary=frame')
if __name__ == '__main__':
    app.run(host='localhost', threaded=True)
streamer.py
import base64
import threading
import cv2
import numpy as np
import zmq
class Streamer(threading.Thread):
    def __init__(self, hostname, port):
        threading.Thread.__init__(self)
        self.hostname = hostname
        self.port = port
        self.connected = False
        self.jpeg = None
    def run(self):
        self.isRunning = True
        context = zmq.Context()
        footage_socket = context.socket(zmq.SUB)
        footage_socket.bind('tcp://{}:{}'.format(self.hostname, self.port))
        # An empty prefix subscribes to every message (np.unicode is gone from recent NumPy)
        footage_socket.setsockopt_string(zmq.SUBSCRIBE, u'')
        while self.isRunning:
            frame = footage_socket.recv_string()
            img = base64.b64decode(frame)
            # np.frombuffer replaces the deprecated np.fromstring for binary data
            npimg = np.frombuffer(img, dtype=np.uint8)
            source = cv2.imdecode(npimg, 1)
            ret, jpeg = cv2.imencode('.jpg', source)
            self.jpeg = jpeg
            self.connected = True
        self.connected = False
    def stop(self):
        self.isRunning = False
    def client_connected(self):
        return self.connected
    def get_jpeg(self):
        return self.jpeg.tobytes()
I understand that copy-pasting entire .py files is probably not the best way to post an answer here, but this is a complex question with a lot of moving parts, and I honestly could not think of a better way to help the OP.
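For readers who would rather keep plain sockets, the length-prefixed framing that the question attempts can be sketched as follows. This is not the answer's ZeroMQ approach and not a verified diagnosis of the original loop. It illustrates two things that may matter there: reading exactly the announced number of bytes, so bytes belonging to the next message are never discarded, and using a fixed-size network-order header ("!I") instead of the platform-dependent "L". The function names are made up for illustration, and the sketch targets Python 3.

```python
import socket
import struct

# 4-byte unsigned length in network byte order: same size on every platform.
HEADER = struct.Struct("!I")


def send_msg(sock, payload):
    """Send one message as <length header><payload bytes>."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before message was complete")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock):
    """Receive one message framed by send_msg."""
    (size,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, size)


if __name__ == "__main__":
    left, right = socket.socketpair()
    send_msg(left, b"frame-1")
    send_msg(left, b"frame-2")
    print(recv_msg(right), recv_msg(right))
    left.close()
    right.close()
```

Because `recv_exact` never asks for more than the bytes still missing, two messages sent back to back stay separate even when they arrive in a single TCP segment.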

Flask - Stream content keeping context

I am using the following Flask code to stream the output of a command:
@app.route('/', methods=['GET', 'POST'])
def index():
    if request.method == 'POST':
        ...
        # some logic to get cmd from POST request
        ...
        return redirect_to(url_to(stream, cmd=cmd))
    return render_template('index.html')
@app.route('/stream/<cmd>')
def stream(cmd):
    print("Executing %s" % cmd)
    g = proc.Group()
    p = g.run(cmd)
    def stream_cmd():
        while g.is_pending():
            lines = g.readlines()
            for proc, line in lines:
                print(line)
                yield line + '</br>'
    return Response(stream_cmd(), mimetype='text/html') # text/html is required for most browsers to show th$
When I post my form, it redirects to a blank page where I see the output of my stream, but I lose all my layout, CSS, HTML, etc.
How can I keep the current layout in place while still seeing the streamed output?
Ideally, I'd like to update a <div> element in the current page dynamically (with jQuery) with the stream output instead of redirecting, but I'm not sure that's even possible.

Following @reptilicus's recommendation, I rewrote the code to use websockets.
Here is the working code:
Python
@socketio.on('message', namespace='/stream')
def stream(cmd):
    # Streams output of a command
    from shelljob import proc
    g = proc.Group()
    p = g.run(cmd)
    while g.is_pending():
        lines = g.readlines()
        for proc, line in lines:
            send(line, namespace='/stream')
            eventlet.sleep(0) # THIS IS MANDATORY
The corresponding JavaScript that receives the messages sent by the send calls is as follows:
JavaScript (jQuery)
var socket = io.connect('http://' + document.domain + ':' + location.port + '/stream')
socket.on('message', function(msg){
    $('#streaming_text').append(msg);
})
...
jqxhr.done(function(cmd){ // This is executed after an AJAX post, but you can change this to whatever event you like
    socket.send(cmd)
    return false;
})
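Both the original route and the websocket handler depend on producing the command's output incrementally. The sketch below shows that part in isolation, using the standard library's subprocess module in place of shelljob (a substitution made here for illustration, not what the answer uses). The function name `stream_command` is hypothetical.

```python
import subprocess
import sys


def stream_command(args):
    """Yield a command's output line by line while it is still running."""
    proc = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave errors with normal output
        text=True,
        bufsize=1,  # line-buffered on our side of the pipe
    )
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        proc.wait()


if __name__ == "__main__":
    for line in stream_command([sys.executable, "-c", "print('a'); print('b')"]):
        print(line)
```

Each yielded line could be passed to a send call like the one in the handler above, or wrapped in markup and yielded from a streaming response. Passing the command as an argument list rather than a shell string also avoids running user input through a shell.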

how to call python script from javascript?

I have a backend Python script that retrieves data through a SQLAlchemy engine. I would like to show the data in a search box where you can scroll down the list and select an item. I have read answers to similar questions (use AJAX to call the Python script), but I'm still not clear on how to do this. Here is my Python script.
# models.py
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL
from sqlalchemy.ext.declarative import declarative_base
import pandas as pd
aURL = URL(drivername='mysql', username='chlee021690', database = 'recommender')
engine = create_engine(aURL, echo=True)
sql_command = 'SELECT product_id FROM bestbuy_data'
results = pd.read_sql(sql = sql_command, con = engine)
Can anybody tell me how to write JavaScript code that retrieves those results and renders them in my form? Thanks.

Step 1: make your script available as a web service. You can use CGI, or you can use one of the server frameworks that run standalone or under WSGI, such as CherryPy, web.py or Flask.
Step 2: make an AJAX call to the URL served by step 1, either manually (look for XMLHttpRequest examples) or, more easily, using jQuery or another framework (jQuery.ajax(), jQuery.get()).
These are two separate tasks, and both are well documented on the web. If you have a more specific question, I suggest you ask again, showing what you are stuck on.
There are also many examples of the complete package available (search for "python ajax example").
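Step 1 can be sketched with nothing but the Python 3 standard library. This is a minimal illustration rather than a production setup: the /products path and the PRODUCT_IDS list are placeholders invented here (in the question the list would come from the SQL query), and the built-in urllib client stands in for the browser's AJAX call from step 2.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Placeholder data; in the question this would be the product_id query result.
PRODUCT_IDS = [101, 102, 103]


class ProductHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/products":  # hypothetical endpoint name
            self.send_error(404)
            return
        body = json.dumps({"product_ids": PRODUCT_IDS}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_products(port):
    """Stand-in for the browser's AJAX call: GET the JSON and decode it."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open("http://127.0.0.1:%d/products" % port) as resp:
        return json.load(resp)


def demo():
    # Port 0 lets the operating system pick a free port.
    server = ThreadingHTTPServer(("127.0.0.1", 0), ProductHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_products(server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

On the page itself, the JavaScript side would request the same URL and fill the search box from the decoded product_ids list.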

Your Python server needs to do two things:
Serve the AJAX JavaScript file itself (via GET).
Respond to calls from the web client (via POST).
It should also be threaded to support multiple simultaneous connections.
Below is an example showing how to do all of the above with the built-in BaseHTTPServer module (Python 2).
JS (put in static/hello.html to serve via Python):
<html><head><meta charset="utf-8"/></head><body>
Hello.
<script>
var xhr = new XMLHttpRequest();
xhr.open("POST", "/postman", true);
xhr.setRequestHeader('Content-Type', 'application/json');
xhr.send(JSON.stringify({
    value: 'value'
}));
xhr.onload = function() {
    console.log("HELLO")
    console.log(this.responseText);
    var data = JSON.parse(this.responseText);
    console.log(data);
}
</script></body></html>
Python server (for testing):
import time, threading, socket, SocketServer, BaseHTTPServer
import os, traceback, sys, json
log_lock = threading.Lock()
log_next_thread_id = 0
# Local log function
def Log(module, msg):
    with log_lock:
        thread = threading.current_thread().__name__
        msg = "%s %s: %s" % (module, thread, msg)
        sys.stderr.write(msg + '\n')
def Log_Traceback():
    t = traceback.format_exc().strip('\n').split('\n')
    if ', in ' in t[-3]:
        t[-3] = t[-3].replace(', in','\n***\n*** In') + '(...):'
        t[-2] += '\n***'
    err = '\n*** '.join(t[-3:]).replace('"','').replace(' File ', '')
    err = err.replace(', line',':')
    Log("Traceback", '\n'.join(t[:-3]) + '\n\n\n***\n*** ' + err + '\n***\n\n')
    os._exit(4)
def Set_Thread_Label(s):
    global log_next_thread_id
    with log_lock:
        threading.current_thread().__name__ = "%d%s" \
            % (log_next_thread_id, s)
        log_next_thread_id += 1
class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        Set_Thread_Label(self.path + "[get]")
        try:
            Log("HTTP", "PATH='%s'" % self.path)
            with open('static' + self.path) as f:
                data = f.read()
            Log("Static", "DATA='%s'" % data)
            self.send_response(200)
            self.send_header("Content-type", "text/html")
            self.end_headers()
            self.wfile.write(data)
        except:
            Log_Traceback()
    def do_POST(self):
        Set_Thread_Label(self.path + "[post]")
        try:
            length = int(self.headers.getheader('content-length'))
            req = self.rfile.read(length)
            Log("HTTP", "PATH='%s'" % self.path)
            Log("URL", "request data = %s" % req)
            req = json.loads(req)
            response = {'req': req}
            response = json.dumps(response)
            Log("URL", "response data = %s" % response)
            self.send_response(200)
            self.send_header("Content-type", "application/json")
            self.send_header("content-length", str(len(response)))
            self.end_headers()
            self.wfile.write(response)
        except:
            Log_Traceback()
# Create ONE socket.
addr = ('', 8000)
sock = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(addr)
sock.listen(5)
# Launch 10 listener threads.
class Thread(threading.Thread):
    def __init__(self, i):
        threading.Thread.__init__(self)
        self.i = i
        self.daemon = True
        self.start()
    def run(self):
        httpd = BaseHTTPServer.HTTPServer(addr, Handler, False)
        # Prevent the HTTP server from re-binding every handler.
        # https://stackoverflow.com/questions/46210672/
        httpd.socket = sock
        httpd.server_bind = self.server_close = lambda self: None
        httpd.serve_forever()
[Thread(i) for i in range(10)]
time.sleep(9e9)
Console log (chrome):
HELLO
hello.html:14 {"req": {"value": "value"}}
hello.html:16 {req: {…}}
    req: {value: "value"}
    __proto__: Object
Console log (firefox):
GET
http://XXXXX:8000/hello.html [HTTP/1.0 200 OK 0ms]
POST
XHR
http://XXXXX:8000/postman [HTTP/1.0 200 OK 0ms]
HELLO hello.html:13:3
{"req": {"value": "value"}} hello.html:14:3
Object { req: Object }
Console log (Edge):
HTML1300: Navigation occurred.
hello.html
HTML1527: DOCTYPE expected. Consider adding a valid HTML5 doctype: "<!DOCTYPE html>".
hello.html (1,1)
Current window: XXXXX/hello.html
HELLO
hello.html (13,3)
{"req": {"value": "value"}}
hello.html (14,3)
[object Object]
hello.html (16,3)
{
[functions]: ,
__proto__: { },
req: {
[functions]: ,
__proto__: { },
value: "value"
}
}
Python log:
HTTP 8/postman[post]: PATH='/postman'
URL 8/postman[post]: request data = {"value":"value"}
URL 8/postman[post]: response data = {"req": {"value": "value"}}
Also you can easily add SSL by wrapping the socket before passing it to BaseHTTPServer.
