How to grab data using XPath on javascript websites? - javascript

I would like to grab data on this news site. http://www.inquirer.net/
I want to grab news titles on the tiles.
Here's the screen shot of the inspected code
As you can see, one of the tile titles that I want to grab is already there. When I copy the XPath from the browser it returns //*[@id="tgs3_info"]/h2
I tried to run my python code.
import lxml.html
import lxml.etree
import requests
link = 'http://www.inquirer.net/'
res = requests.get(link)
r = res.content
html_content = lxml.html.fromstring(r)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
print(root)
but it returns an empty list.
I tried to search for an answer here on Stack Overflow and on the internet, but I don't really get it. When you view the page source of the site, the data that I want is not inside a JavaScript function; it is in the div, so I don't understand why I can't grab the data. I hope I can find an answer here.
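A quick way to narrow this down (a small diagnostic sketch, not part of the original post) is to check what the plain requests call actually returns before parsing it:
import requests

res = requests.get('http://www.inquirer.net/')
# if this prints 403, the site is rejecting the default requests User-Agent,
# so the parsed document is an error page and the XPath will match nothing
print(res.status_code)
print('tgs3_info' in res.text)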

With inputs from Xurasky's solution to avoid a 403 error:
import lxml.html
import lxml.etree
from urllib.request import Request, urlopen
req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})
r = urlopen(req).read()
html_content = lxml.html.fromstring(r)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
for a in root:
    print(a.text_content())
Output
Duterte, Roque meeting set in Malacañang
2 senators welcome Ventura's revelations in Atio hazing case
Paolo Duterte vows to retire from politics in 2019
NBA: DeMarcus Cousins regrets being loyal to Sacramento Kings
PH bet Elizabeth Durado Clenci wins 2nd runner-up at Miss Grand International 2017
DOJ wants Divina, 50 others in `Atio' hazing case added on BI watchlist
Georgina Wilson Shares Messages From Fans on Baby Blues

I believe you are getting a urllib.error.HTTPError: HTTP Error 403: Forbidden Error.
You can fix this by using
import lxml.html
import lxml.etree
from urllib.request import Request, urlopen
req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})
r = urlopen(req).read()
html_content = lxml.html.fromstring(r)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
print(root)
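For what it's worth, the same User-Agent trick works with requests, which the question was already using; a small equivalent sketch (not part of the original answer):
import lxml.html
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get('http://www.inquirer.net/', headers=headers)
html_content = lxml.html.fromstring(res.content)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
print(root)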

Related

Scraping fanduel with BeautifulSoup, can't find values visible in HTML

I'm trying to scrape lines for a typical baseball game from fanduel using BeautifulSoup but I found (as this person did) that much of the data doesn't show up when I try something standard like
import requests
from bs4 import BeautifulSoup
page = requests.get(<some url>)
soup = BeautifulSoup(page.content, 'html.parser')
I know I can go to Dev Tools -> Network tab -> XHR to get a JSON with data the site is using, but I'm not able to find the same values I'm seeing in the HTML.
I'll give an example but it probably won't be good after a day since the page will be gone. Here's the page on lines for the Rangers Dodgers game tomorrow. You can click and see that (as of right now) the odds for the Dodgers at -1.5 are -146. I'd like to scrape that number (-146) but I can't find it anywhere in the json data.
Any idea how I can find that sort of thing either in the json or in the HTML? Thanks!
Looks like I offered the solution to the reference link you have there. Those lines are there in the json, it's just in the "raw" form, so you need to calculate it out:
import requests
jsonData = requests.get('https://sportsbook.fanduel.com/cache/psevent/UK/1/false/1027510.3.json').json()
money_line = jsonData['eventmarketgroups'][0]['markets'][1]['selections']
def calc_spread_line(priceUp, priceDown, spread):
    if priceDown < priceUp:
        line = int((priceUp / priceDown) * 100)
        spread = spread * -1
    else:
        line = int((priceDown / priceUp) * -100)
    return line, spread

for each in money_line:
    priceUp = each['currentpriceup']
    priceDown = each['currentpricedown']
    team = each['name']
    spread = each['currenthandicap']
    line, spread = calc_spread_line(priceUp, priceDown, spread)
    print('%s: %s %s' % (team, spread, line))
Output:
Texas Rangers: 1.5 122
Los Angeles Dodgers: -1.5 -146
Otherwise you could use selenium as suggested and parse the html that way. It would be less efficient though.
This may be happening because some web pages load their elements using JavaScript, in which case the HTML source you receive using requests may not contain all the elements. You can check this by right-clicking on the page and selecting "View source": if the data you require is in that source file, you can parse it with Beautiful Soup; otherwise, to get dynamically loaded content, I suggest Selenium.
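For example, a minimal Selenium sketch along those lines (the URL here is just illustrative; the selectors for the odds would still have to be found on the rendered page):
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("https://sportsbook.fanduel.com/")  # illustrative URL, not a specific game page
time.sleep(5)  # crude wait for the JavaScript to render the content
# hand the fully rendered DOM to BeautifulSoup instead of the raw requests response
soup = BeautifulSoup(driver.page_source, "html.parser")
print(len(soup.find_all("div")))  # sanity check that the rendered markup is present
driver.quit()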

Using Selenium to scrape webpage with javascript

I want to scrape a Google Scholar page with a 'show more' button. I understand from my previous question that it is not plain HTML but is rendered by JavaScript, and that there are several ways to scrape such pages. I tried Selenium with the following code:
from selenium import webdriver
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
chrome_path = r"....path....."
driver = webdriver.Chrome(chrome_path)
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
driver.find_element_by_xpath('/html/body/div/div[13]/div[2]/div/div[4]/form/div[2]/div/button/span/span[2]').click()
soup = BeautifulSoup(driver.page_source,'html.parser')
papers = soup.find_all('tr',{'class':'gsc_a_tr'})
for paper in papers:
    title = paper.find('a',{'class':'gsc_a_at'}).text
    author = paper.find('div',{'class':'gs_gray'}).text
    journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]
    print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)
The browser now clicks the 'show more' button and displays the entire page, but I am still getting the information only for the first 20 papers. I don't understand why. Please help!
Thanks!
I believe your problem is that the new elements haven't completely loaded in when your program checks the website. Try importing time and then sleeping for a few seconds, like this (I removed the headless option so you can see the program work):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
driver = webdriver.Chrome()
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
time.sleep(3)
driver.find_element_by_id("gsc_bpf_more").click()
time.sleep(4)
soup = BeautifulSoup(driver.page_source, 'html.parser')
papers = soup.find_all('tr', {'class': 'gsc_a_tr'})
for paper in papers:
    title = paper.find('a', {'class': 'gsc_a_at'}).text
    author = paper.find('div', {'class': 'gs_gray'}).text
    journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]
    print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)
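As a side note (not part of either answer), an explicit wait is usually more reliable than a fixed sleep; here is a sketch using WebDriverWait with the same Selenium 3 style API used above:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
wait = WebDriverWait(driver, 10)
# wait until the "show more" button is clickable, then click it
wait.until(EC.element_to_be_clickable((By.ID, "gsc_bpf_more"))).click()
# wait until more than the initial 20 rows are present before parsing
wait.until(lambda d: len(d.find_elements_by_class_name("gsc_a_tr")) > 20)
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(len(soup.find_all('tr', {'class': 'gsc_a_tr'})))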
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.page_load_strategy = 'normal'
driver = webdriver.Chrome(options=options)
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
# Awkward method
# Loading all available articles and then iterating over them
for i in range(1, 3):
    driver.find_element_by_css_selector('#gsc_bpf_more').click()
    # waits until elements are loaded
    time.sleep(3)

# Container where all data is located
for result in driver.find_elements_by_css_selector('#gsc_a_b .gsc_a_t'):
    title = result.find_element_by_css_selector('.gsc_a_at').text
    authors = result.find_element_by_css_selector('.gsc_a_at+ .gs_gray').text
    publication = result.find_element_by_css_selector('.gs_gray+ .gs_gray').text
    print(title)
    print(authors)
    print(publication)
    # just for separating purposes
    print()
Part of the output:
Tax/subsidy policies in the presence of environmentally aware consumers
S Bansal, S Gangopadhyay
Journal of Environmental Economics and Management 45 (2), 333-355
Choice and design of regulatory instruments in the presence of green consumers
S Bansal
Resource and Energy economics 30 (3), 345-368

Scraping Javascript Website With BeautifulSoup 4 & Requests_HTML

I'm learning how to build another scraper for another website, Reverb.com, after getting my scraper on a different site to work properly. Reverb, however, has been more challenging to extract information from, and the approach from my old scraper isn't working the same way. I did some research, and using requests_html instead of requests seemed to be the option most people were using for JavaScript-heavy sites like Reverb.com.
I'm essentially trying to scrape out text versions of the headline and price information and either paginate through the different pages or loop through a list of URLs to get all the content. I'm sort of there but hitting road blocks. Below are 2 versions of code I'm fiddling with.
The first version below prints out what looks like only 3 of many pages of content, but it prints all the instrument names and prices with the markup. In the CSV, however, all of those items are printed together on only 3 rows, not one item/price pair per row.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")
#content scrape
b = soup.findAll("h4", class_="grid-card__title") #title
for i in b:
    print(i)
p = soup.findAll("div", class_="grid-card__price") #price
for i in p:
    print(i)
Conversely, this version prints only 3 lines to the CSV, but the name and price are stripped of all the markup. That only happens when I change findAll to just find. I read that for html in r.html was a way to loop through pages without having to make a list of URLs.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
#make csv file
csv_file = open("rvscrape.csv", "w", newline='') #added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name","bass_price"])
session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")
for html in r.html:
    # content scrape
    bass_name = []
    b = soup.find("h4", class_="grid-card__title").text.strip()  # title
    # for i in b:
    #     bass_name.append(i)
    #     for i in bass_name:
    #         print(i)
    price = []
    p = soup.find("div", class_="grid-card__price").text.strip()  # price
    # for i in p:
    #     print(i)
    csv_writer.writerow([b, p])
In order to extract all the pages of search results, you need to extract the link of the next page and keep going until there is no next page available. We can do this using a while loop and checking the existence of the next anchor tag.
The following script performs the loop and also adds the results to the csv. It also prints the url of the page, so that we have an estimate of what page the program is on.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
# make csv file
# added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_file = open("rvscrape.csv", "w", newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name", "bass_price"])
session = HTMLSession()
r = session.get(
    "https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
stop = False
next_url = ""
while not stop:
    print(next_url)
    soup = BeautifulSoup(r.html.raw_html, "html.parser")
    titles = soup.findAll("h4", class_="grid-card__title")  # titles
    prices = soup.findAll("div", class_="grid-card__price")  # prices
    for i in range(len(titles)):
        title = titles[i].text.strip()
        price = prices[i].text.strip()
        csv_writer.writerow([title, price])
    next_link = soup.find("li", class_="pagination__page--next")
    if not next_link:
        stop = True
    else:
        next_url = next_link.find("a").get("href")
        r = session.get("https://reverb.com/marketplace" + next_url)
        r.html.render(sleep=5)
Such output formatting issues are common when scraping JavaScript-heavy websites; they can also be handled with a browser-driven (dynamic) scraper.

Get Jupyter notebook name [duplicate]

I am trying to obtain the current NoteBook name when running the IPython notebook. I know I can see it at the top of the notebook. What I am after something like
currentNotebook = IPython.foo.bar.notebookname()
I need to get the name in a variable.
Adding to the previous answers, to get the notebook name run the following in a cell:
%%javascript
IPython.notebook.kernel.execute('nb_name = "' + IPython.notebook.notebook_name + '"')
this gets you the file name in nb_name
then to get the full path you may use the following in a separate cell:
import os
nb_full_path = os.path.join(os.getcwd(), nb_name)
I have the following, which works with IPython 2.0. I observed that the name of the notebook is stored as the value of the attribute 'data-notebook-name' in the <body> tag of the page. The idea is first to ask JavaScript to retrieve the attribute (JavaScript can be invoked from a code cell thanks to the %%javascript magic), and then to pass the JavaScript variable back to the Python kernel with a command that sets a Python variable. Since this last variable is known to the kernel, it can be accessed in other cells as well.
%%javascript
var kernel = IPython.notebook.kernel;
var body = document.body,
attribs = body.attributes;
var command = "theNotebook = " + "'"+attribs['data-notebook-name'].value+"'";
kernel.execute(command);
From a Python code cell
print(theNotebook)
Out[ ]: HowToGetTheNameOfTheNoteBook.ipynb
A defect in this solution is that when one changes the title (name) of a notebook, the name does not seem to be updated immediately (there is probably some kind of cache) and it is necessary to reload the notebook to get the new name.
[Edit] On reflection, a more efficient solution is to look for the input field for the notebook's name instead of the <body> tag. Looking into the source, it appears that this field has id "notebook_name". It is then possible to catch this value with document.getElementById() and follow the same approach as above. The code becomes, still using the JavaScript magic:
%%javascript
var kernel = IPython.notebook.kernel;
var thename = window.document.getElementById("notebook_name").innerHTML;
var command = "theNotebook = " + "'"+thename+"'";
kernel.execute(command);
Then, from an IPython cell:
In [11]: print(theNotebook)
Out [11]: HowToGetTheNameOfTheNoteBookSolBis
Contrary to the first solution, modifications of notebook's name are updated immediately and there is no need to refresh the notebook.
As already mentioned you probably aren't really supposed to be able to do this, but I did find a way. It's a flaming hack though so don't rely on this at all:
import json
import os
import urllib2
import IPython
from IPython.lib import kernel
connection_file_path = kernel.get_connection_file()
connection_file = os.path.basename(connection_file_path)
kernel_id = connection_file.split('-', 1)[1].split('.')[0]
# Updated answer with semi-solutions for both IPython 2.x and IPython < 2.x
if IPython.version_info[0] < 2:
    ## Not sure if it's even possible to get the port for the
    ## notebook app; so just using the default...
    notebooks = json.load(urllib2.urlopen('http://127.0.0.1:8888/notebooks'))
    for nb in notebooks:
        if nb['kernel_id'] == kernel_id:
            print nb['name']
            break
else:
    sessions = json.load(urllib2.urlopen('http://127.0.0.1:8888/api/sessions'))
    for sess in sessions:
        if sess['kernel']['id'] == kernel_id:
            print sess['notebook']['name']
            break
I updated my answer to include a solution that "works" in IPython 2.0 at least with a simple test. It probably isn't guaranteed to give the correct answer if there are multiple notebooks connected to the same kernel, etc.
It seems I cannot comment, so I have to post this as an answer.
The accepted solution by @iguananaut and the update by @mbdevpl appear not to be working with recent versions of the Notebook.
I fixed it as shown below. I checked it on Python v3.6.1 + Notebook v5.0.0 and on Python v3.6.5 and Notebook v5.5.0.
import jupyterlab
if jupyterlab.__version__.split(".")[0] == "3":
    from jupyter_server import serverapp as app
    key_srv_directory = 'root_dir'
else:
    from notebook import notebookapp as app
    key_srv_directory = 'notebook_dir'
import urllib.request
import json
import os
import ipykernel

def notebook_path(key_srv_directory):
    """Returns the absolute path of the Notebook or None if it cannot be determined
    NOTE: works only when the security is token-based or there is also no password
    """
    connection_file = os.path.basename(ipykernel.get_connection_file())
    kernel_id = connection_file.split('-', 1)[1].split('.')[0]
    for srv in app.list_running_servers():
        try:
            if srv['token'] == '' and not srv['password']:  # No token and no password, ahem...
                req = urllib.request.urlopen(srv['url'] + 'api/sessions')
            else:
                req = urllib.request.urlopen(srv['url'] + 'api/sessions?token=' + srv['token'])
            sessions = json.load(req)
            for sess in sessions:
                if sess['kernel']['id'] == kernel_id:
                    return os.path.join(srv[key_srv_directory], sess['notebook']['path'])
        except:
            pass  # There may be stale entries in the runtime directory
    return None
As stated in the docstring, this works only when either there is no authentication or the authentication is token-based.
Note that, as also reported by others, the Javascript-based method does not seem to work when executing a "Run all cells" (but works when executing cells "manually"), which was a deal-breaker for me.
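For completeness, a call would then look something like this (a usage sketch, not part of the original answer):
# key_srv_directory was set above depending on the installed front end
print(notebook_path(key_srv_directory))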
The ipyparams package can do this pretty easily.
import ipyparams
currentNotebook = ipyparams.notebook_name
On Jupyter 3.0 the following works. Here I'm showing the entire path on the Jupyter server, not just the notebook name:
To store the NOTEBOOK_FULL_PATH on the current notebook front end:
%%javascript
var nb = IPython.notebook;
var kernel = IPython.notebook.kernel;
var command = "NOTEBOOK_FULL_PATH = '" + nb.base_url + nb.notebook_path + "'";
kernel.execute(command);
To then display it:
print("NOTEBOOK_FULL_PATH:\n", NOTEBOOK_FULL_PATH)
Running the first Javascript cell produces no output.
Running the second Python cell produces something like:
NOTEBOOK_FULL_PATH:
/user/zeph/GetNotebookName.ipynb
Yet another hacky solution since my notebook server can change. Basically you print a random string, save it and then search for a file containing that string in the working directory. The while is needed because save_checkpoint is asynchronous.
from time import sleep
from IPython.display import display, Javascript
import subprocess
import os
import uuid
def get_notebook_path_and_save():
    magic = str(uuid.uuid1()).replace('-', '')
    print(magic)
    # saves it (ctrl+S)
    display(Javascript('IPython.notebook.save_checkpoint();'))
    nb_name = None
    while nb_name is None:
        try:
            sleep(0.1)
            nb_name = subprocess.check_output(f'grep -l {magic} *.ipynb', shell=True).decode().strip()
        except:
            pass
    return os.path.join(os.getcwd(), nb_name)
There is no real way yet to do this in Jupyterlab. But there is an official way that's now under active discussion/development as of August 2021:
https://github.com/jupyter/jupyter_client/pull/656
In the meantime, hitting the api/sessions REST endpoint of jupyter_server seems like the best bet. Here's a cleaned-up version of that approach:
from jupyter_server import serverapp
from jupyter_server.utils import url_path_join
from pathlib import Path
import re
import requests
kernelIdRegex = re.compile(r"(?<=kernel-)[\w\d\-]+(?=\.json)")
def getNotebookPath():
    kernelId = kernelIdRegex.search(get_ipython().config["IPKernelApp"]["connection_file"])[0]
    for jupServ in serverapp.list_running_servers():
        for session in requests.get(url_path_join(jupServ["url"], "api/sessions"), params={"token": jupServ["token"]}).json():
            if kernelId == session["kernel"]["id"]:
                return Path(jupServ["root_dir"]) / session["notebook"]['path']
Tested working with
python==3.9
jupyter_server==1.8.0
jupyterlab==4.0.0a7
Modifying @jfb's method gives the function below, which worked fine on ipykernel-5.3.4.
def getNotebookName():
    display(Javascript('IPython.notebook.kernel.execute("NotebookName = " + "\'"+window.document.getElementById("notebook_name").innerHTML+"\'");'))
    try:
        _ = type(NotebookName)
        return NotebookName
    except:
        return None
Note that the display javascript will take some time to reach the browser, and it will take some time to execute the JS and get back to the kernel. I know it may sound stupid, but it's better to run the function in two cells, like this:
nb_name = getNotebookName()
and in the following cell:
for i in range(10):
    nb_name = getNotebookName()
    if nb_name is not None:
        break
However, if you don't need to define a function, the wise method is to run display(Javascript(..)) in one cell, and check the notebook name in another cell. In this way, the browser has enough time to execute the code and return the notebook name.
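A sketch of that two-cell pattern, assuming the classic Notebook front end targeted by the snippets above:
# Cell 1: ask the front end to push the notebook name into the kernel
from IPython.display import display, Javascript
display(Javascript('IPython.notebook.kernel.execute("NotebookName = \'" + window.document.getElementById("notebook_name").innerHTML + "\'");'))

# Cell 2, run afterwards: by now the browser has had time to execute the JS
print(NotebookName)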
If you don't mind using a library, the most robust way is:
import ipynbname
nb_name = ipynbname.name()
If you are using Visual Studio Code:
import IPython ; IPython.extract_module_locals()[1]['__vsc_ipynb_file__']
Assuming you have the Jupyter Notebook server's host, port, and authentication token, this should work for you. It's based off of this answer.
import os
import json
import posixpath
import subprocess
import urllib.request
import psutil
def get_notebook_path(host, port, token):
    process_id = os.getpid()
    notebooks = get_running_notebooks(host, port, token)
    for notebook in notebooks:
        if process_id in notebook['process_ids']:
            return notebook['path']

def get_running_notebooks(host, port, token):
    sessions_url = posixpath.join('http://%s:%d' % (host, port), 'api', 'sessions')
    sessions_url += f'?token={token}'
    response = urllib.request.urlopen(sessions_url).read()
    res = json.loads(response)
    notebooks = [{'kernel_id': notebook['kernel']['id'],
                  'path': notebook['notebook']['path'],
                  'process_ids': get_process_ids(notebook['kernel']['id'])} for notebook in res]
    return notebooks

def get_process_ids(name):
    child = subprocess.Popen(['pgrep', '-f', name], stdout=subprocess.PIPE, shell=False)
    response = child.communicate()[0]
    return [int(pid) for pid in response.split()]
Example usage:
get_notebook_path('127.0.0.1', 17004, '344eb91bee5742a8501cc8ee84043d0af07d42e7135bed90')
To see why you can't get the notebook name using these JS-based solutions, run this code and notice the delay before the message box appears, after Python has already finished executing the cell / entire notebook:
%%javascript
function sayHello() {
    alert('Hello world!');
}
setTimeout(sayHello, 1000);
More info
JavaScript calls are async and hence not guaranteed to complete before Python starts running another cell containing code that expects the notebook-name variable to already exist, resulting in a NameError when trying to access a non-existent variable that should contain the notebook name.
I suspect some upvotes on this page became locked before voters could discover that all %%javascript-based solutions ultimately don't work when the producer and consumer notebook cells are executed together (or in quick succession).
All JSON-based solutions fail if we execute more than one cell at a time, because the result will not be ready until after the end of the execution (it's not a matter of using sleep or waiting for any amount of time; check it yourself, but remember to restart the kernel and Run All for every test).
Based on previous solutions, this avoids using the %% magic in case you need to put it in the middle of some other code:
from IPython.display import display, Javascript
# can have comments here :)
js_cmd = 'IPython.notebook.kernel.execute(\'nb_name = "\' + IPython.notebook.notebook_name + \'"\')'
display(Javascript(js_cmd))
For Python 3, the following, based on the answer by @Iguananaut and updated for the latest Python and possibly multiple servers, will work:
import os
import json
try:
    from urllib2 import urlopen
except:
    from urllib.request import urlopen
import ipykernel
connection_file_path = ipykernel.get_connection_file()
connection_file = os.path.basename(connection_file_path)
kernel_id = connection_file.split('-', 1)[1].split('.')[0]
running_servers = !jupyter notebook list
running_servers = [s.split('::')[0].strip() for s in running_servers[1:]]
nb_name = '???'
for serv in running_servers:
    uri_parts = serv.split('?')
    uri_parts[0] += 'api/sessions'
    sessions = json.load(urlopen('?'.join(uri_parts)))
    for sess in sessions:
        if sess['kernel']['id'] == kernel_id:
            nb_name = os.path.basename(sess['notebook']['path'])
            break
    if nb_name != '???':
        break
print (f'[{nb_name}]')
Just use ipynbname, which is practical:
import ipynbname
nb_fname = ipynbname.name()
nb_path = ipynbname.path()
print(f"{nb_fname=}")
print(f"{nb_path=}")
I found this in https://stackoverflow.com/a/65907473/15497427

Issue in invoking "onclick" event using PyQt & javascript

I am trying to scrape data from a website using Beautiful Soup. By default, this webpage shows 18 items, and after clicking the JavaScript button "showAlldevices" all 41 items become visible. Beautiful Soup scrapes data only for the items visible by default, so to get data for all items I used the PyQt module and invoked the click event with JavaScript. Below is the referred code:
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()
url = 'http://www.att.com/shop/wireless/devices/smartphones.html'
r = Render(url)
jsClick = """var evObj = document.createEvent('MouseEvents');
evObj.initEvent('click', true, true );
this.dispatchEvent(evObj);
"""
allSelector = "a#deviceShowAllLink"
allButton = r.frame.documentElement().findFirst(allSelector)
allButton.evaluateJavaScript(jsClick)
html = allButton.webFrame().toHtml()
page = html
soup = BeautifulSoup(page)
soup.prettify()
with open('Smartphones_26decv2.0.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date", "Day of Week", "Device Name", "Price"])
    items = soup.findAll('a', {"class": "clickStreamSingleItem"}, text=True)
    prices = soup.findAll('div', {"class": "listGrid-price"})
    for item, price in zip(items, prices):
        textcontent = u' '.join(price.stripped_strings)
        if textcontent:
            spamwriter.writerow([time.strftime("%Y-%m-%d"), time.strftime("%A"), unicode(item.string).encode('utf8').strip(), textcontent])
I am feeding the HTML to Beautiful Soup using this line of code: html = allButton.webFrame().toHtml(). The code runs without any errors, but I am still not getting data for all 41 items in the output CSV.
I also tried feeding the HTML to Beautiful Soup using these lines of code:
allButton = r.frame.documentElement().findFirst(allSelector)
a = allButton.evaluateJavaScript(jsClick)
html = a.webFrame.toHtml()
page = html
soup = BeautifulSoup(page)
But I came across this error: html = a.webFrame.toHtml()
AttributeError: 'QVariant' object has no attribute 'webFrame'
Please pardon my ignorance if I am asking anything fundamental here, as I am new to programming. Please help me solve this issue.
I think there is a problem with your JavaScript code. Since you're creating a MouseEvent object you should use an initMouseEvent method for initialization. You can find an example here.
UPDATE2
But I think the simplest thing you can try is to use the JavaScript DOM method onclick of the a element instead of using your own JavaScript code. Something like this:
allButton.evaluateJavaScript("this.onclick()")
should work. I suppose you will have to reload the page after clicking, before passing it to the parser.
UPDATE 3
You can reload the page via r.action(QWebPage.ReloadAndBypassCache) or r.action(QWebPage.Reload), but it doesn't seem to have any effect. I've tried to display the page with QWebView, click the link and see what happens. Unfortunately I'm getting lots of Segmentation Fault errors, so I would swear there is a bug somewhere in PyQt4/Qt4. As the page being scraped uses jQuery, I've also tried to display it after loading jQuery in the QWebPage, but again no luck (the segfaults do not disappear). I'm giving up :( I hope other users here at SO will help you. Anyway, I recommend you ask for help on the PyQt4 mailing list. They provide excellent support to PyQt users.
UPDATE
The error you get when changing your code is expected: remember that allButton is a QWebElement object, and the QWebElement.evaluateJavaScript method returns a QVariant object (as stated in the docs), and that kind of object doesn't have a webFrame attribute, as you can check by reviewing this page.
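In other words, the fix in the question's own code is to keep the click and the HTML extraction separate rather than chaining off the return value; a sketch using the same objects defined above:
# evaluateJavaScript returns a QVariant, so don't call .webFrame on its result;
# read the HTML from the element's frame (or from r.frame) after the click instead
allButton.evaluateJavaScript(jsClick)
html = allButton.webFrame().toHtml()
soup = BeautifulSoup(html)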
