Input a value into an HTML webpage using Python - javascript

I am attempting to automate inputting values into a webpage. The major issue is that the Mechanize library does not work here: the page has no forms that Mechanize's form.name can recognize, because the field is a bare <input> element rather than part of a named form.
I have spent the past hour researching alternatives to Mechanize, but to no avail. Google is no help, as it assumes I want to extract data from a website.
My current code:
from mechanize import Browser
import csv

csv_file = 'city_names.csv'  # file name
cities = []  # list to hold values from the csv

with open(csv_file, 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        cities.append(row.get('cities'))

br = Browser()
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.open("https://iafisher.com/projects/cities/world")  # the website, if you are curious

for form in br.forms():
    print(form.name)  # prints nothing

for i in range(len(cities)):
    br.select_form(class="city-input")  # ISSUE IS THROWN HERE
    control = br.form.find_control("controlname")
    # Browser passes through unknown attributes (including methods)
    # to the selected HTMLForm (from ClientForm).
    br[control] = [cities[i]]  # (the method here is __setitem__)
    response = br.submit()  # submit current form
The input value as seen in developer tools:
<input data-v-018e983a="" id="city-input" type="text" placeholder="Try 'Tokyo' or 'Kingston, Jamaica'" autocomplete="off" spellcheck="false" class="city-input ">
If there is any alternative to Mechanize or a method in Mechanize that would work, it would be appreciated.

After trying multiple different search options, I finally stumbled across a tutorial that manually tells Selenium what to do.
Final code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import csv

csv_file = 'city_names.csv'
cities = []
with open(csv_file, 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        cities.append(row.get('cities'))

chromedriver_location = "C:/Users/blais/Downloads/chromedriver"
driver = webdriver.Chrome(chromedriver_location)
driver.get('https://iafisher.com/projects/cities/world')

submit = '//*[@id="city-input"]'
for c in range(len(cities)):
    driver.find_element_by_xpath(submit).send_keys(cities[c] + Keys.ENTER)
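As an aside, the CSV-loading step shared by both versions can be pulled into a small helper and exercised without a browser at all. This is just a sketch: the 'cities' column name comes from the code above, while the sample data is made up.

```python
import csv
import io

def load_cities(fileobj, column="cities"):
    """Read one named column out of a CSV file into a list."""
    return [row[column] for row in csv.DictReader(fileobj)]

# Hypothetical stand-in for city_names.csv
sample = 'cities\nTokyo\n"Kingston, Jamaica"\n'
cities = load_cities(io.StringIO(sample))
print(cities)  # ['Tokyo', 'Kingston, Jamaica']
```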

Related

Scraping Javascript Website With BeautifulSoup 4 & Requests_HTML

I'm learning how to build another scraper for another website, Reverb.com, after getting my scraper on a different site to work properly. Reverb, however, has been more challenging to extract information from, and the approach from my old scraper isn't working the same way. After some research, using requests_html instead of requests seemed to be the option most people were using for JavaScript-driven sites like Reverb.com.
I'm essentially trying to scrape out text versions of the headline and price information and either paginate through the different pages or loop through a list of URLs to get all the content. I'm sort of there but hitting roadblocks. Below are two versions of the code I'm fiddling with.
The first version prints what looks like only 3 of many pages of content, but it does print all the instrument names and prices, with the markup included. In the CSV, however, all of those items end up together on only 3 rows, not one item/price pair per row.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent

session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")

# content scrape
b = soup.findAll("h4", class_="grid-card__title")  # titles
for i in b:
    print(i)
p = soup.findAll("div", class_="grid-card__price")  # prices
for i in p:
    print(i)
Conversely, this version prints only 3 lines to the CSV, but the name and price are stripped of all markup. That only happened after I changed findAll to find. I read that "for html in r.html" was a way to loop through pages without having to build a list of URLs.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent

# make csv file
# (added newline='' on 5.17.20 to try to stop blank lines from being written)
csv_file = open("rvscrape.csv", "w", newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name", "bass_price"])

session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")

for html in r.html:
    # content scrape
    bass_name = []
    b = soup.find("h4", class_="grid-card__title").text.strip()  # title
    # for i in b:
    #     bass_name.append(i)
    # for i in bass_name:
    #     print(i)
    price = []
    p = soup.find("div", class_="grid-card__price").text.strip()  # price
    # for i in p:
    #     print(i)
    csv_writer.writerow([b, p])
In order to extract all pages of search results, you need to pull out the link to the next page and keep going until no next page is available. We can do this with a while loop that checks for the existence of the next-page anchor tag.
The following script performs that loop and also writes the results to the csv. It prints the url of each page as well, so we have an idea of which page the program is on.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent

# make csv file
# (newline='' added to stop blank lines from being written)
csv_file = open("rvscrape.csv", "w", newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name", "bass_price"])

session = HTMLSession()
r = session.get(
    "https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)

stop = False
next_url = ""
while not stop:
    print(next_url)
    soup = BeautifulSoup(r.html.raw_html, "html.parser")

    titles = soup.findAll("h4", class_="grid-card__title")   # titles
    prices = soup.findAll("div", class_="grid-card__price")  # prices
    for i in range(len(titles)):
        title = titles[i].text.strip()
        price = prices[i].text.strip()
        csv_writer.writerow([title, price])

    next_link = soup.find("li", class_="pagination__page--next")
    if not next_link:
        stop = True
    else:
        next_url = next_link.find("a").get("href")
        r = session.get("https://reverb.com/marketplace" + next_url)
        r.html.render(sleep=5)
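The next-link lookup at the heart of that loop can be sanity-checked in isolation against a static HTML snippet. The class name is taken from the code above; the snippet itself is invented for the test.

```python
from bs4 import BeautifulSoup

def next_page_url(page_html):
    """Return the next page's href, or None when on the last page."""
    soup = BeautifulSoup(page_html, "html.parser")
    next_link = soup.find("li", class_="pagination__page--next")
    if not next_link:
        return None
    return next_link.find("a").get("href")

html = '<ul><li class="pagination__page--next"><a href="/bass-guitars?page=2">Next</a></li></ul>'
print(next_page_url(html))         # /bass-guitars?page=2
print(next_page_url("<ul></ul>"))  # None
```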
Output-schema issues like this are very common when the target is a JavaScript-driven website. They can also be solved with dynamic scrapers.

Web scraping with BeautifulSoup won't work

Ultimately, I'm trying to open all the articles of a news website and then make a top 10 of the words used across all of them. To do this, I first wanted to see how many articles there are so I could iterate over them at some point; I haven't really figured out how I want to do everything yet.
To do this, I wanted to use BeautifulSoup4. I think the class I'm trying to get is rendered by JavaScript, as I'm not getting anything back.
This is my code:
import requests
from bs4 import BeautifulSoup

url = "http://ad.nl"
ad = requests.get(url)
soup = BeautifulSoup(ad.text.lower(), "xml")
titels = soup.findAll("article")
print(titels)
for titel in titels:
    print(titel)
The article name is sometimes an h2 and sometimes an h3. It always has one and the same class, but I can't get anything through that class. It has some parents that use the same name with the extension -wrapper, for example. I don't even know how to use a parent to get what I want, and I think those classes are generated by JavaScript as well. There's also an href I'm interested in, but once again, that probably comes from JavaScript too, as it returns nothing.
Does anyone know how I could use anything (preferably the href, but the article name would be ok as well) by using BeautifulSoup?
In case you don't want to use selenium, this works for me. I've tried it on 2 PCs with different internet connections. Can you try?
from bs4 import BeautifulSoup
import requests

cookies = {"pwv": "2",
           "pws": "functional|analytics|content_recommendation|targeted_advertising|social_media"}
page = requests.get("https://www.ad.nl/", cookies=cookies)
soup = BeautifulSoup(page.content, 'html.parser')
articles = soup.findAll("article")
Then follow kimbo's code to extract h2/h3.
As @Sri mentioned in the comments, when you open that url, a page shows up where you have to accept the cookies first, which requires interaction.
When you need interaction, consider using something like selenium (https://selenium-python.readthedocs.io/).
Here's something that should get you started.
(Edit: you'll need to run pip install selenium before running this code below)
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://ad.nl'

# launch firefox with your url above
# note that you could change this to some other webdriver (e.g. Chrome)
driver = webdriver.Firefox()
driver.get(url)

# click the "accept cookies" button
btn = driver.find_element_by_name('action')
btn.click()

# grab the html; it'll wait here until the page is finished loading
html = driver.page_source

# parse the html soup
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.findAll("article")

for article in articles:
    # check for article titles in both h2 and h3 elems
    h2_titles = article.findAll('h2', {'class': 'ankeiler__title'})
    h3_titles = article.findAll('h3', {'class': 'ankeiler__title'})
    for t in h2_titles:
        # first I was doing print(t.text), but some titles had leading
        # newlines and things like '22:30', which I assume was the hour of the day
        text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
        print(text)
    for t in h3_titles:
        text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
        print(text)

# close the browser
driver.close()
This may or may not be exactly what you have in mind, but this is just an example of how to use selenium and beautiful soup. Feel free to copy/use/modify this as you see fit.
And if you're wondering what selectors to use, read the comment by @JL Peyret.
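The findAll(text=True, recursive=False) trick used above keeps only the element's direct text children, dropping nested tags such as a timestamp. It can be checked on a static fragment; the markup here is invented to mirror the structure described:

```python
from bs4 import BeautifulSoup

snippet = '<h3 class="ankeiler__title"><time>22:30</time>\nSome headline</h3>'
t = BeautifulSoup(snippet, 'html.parser').h3

# Direct text nodes only: the <time> element's contents are skipped
text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
print(text)  # Some headline
```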

Call Javascript from VBA on Excel on a Mac

I want to create a VBA macro in Excel which, at the click of a button, opens a browser (Chrome or Safari), logs in to a website, extracts the desired float value, then populates a given cell in the sheet with that value.
There are examples online of achieving this using Internet Explorer, but that is not available on a Mac. I have also seen guides using Selenium, but this doesn't appear to work on a Mac either.
The javascript itself is along the lines of (after opening a browser at a certain website):
document.getElementById("username").value = "username"
document.getElementById("password").value = "password"
document.getElementsByClassName("button")[0].click()
value = parseFloat(document.getElementsByClassName("value")[1].innerText.slice(1))
I've solved this by using a combination of python-selenium and xlwings. My VBA calls RunPython ("import python_script; python_script.fun()")
python_script.py
import xlwings as xw
from selenium import webdriver

def fun():
    # creates a reference to the calling Excel file
    wb = xw.Book.caller()

    # opens chrome
    chrome_driver_path = '/usr/local/bin/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)

    # open website and log in
    driver.get('url')
    driver.find_element_by_id('username').send_keys('username')
    driver.find_element_by_id('password').send_keys('password')
    driver.find_element_by_name('buttonId').click()

    # finds member price sum
    table_body = driver.find_element_by_xpath("//*[@class='classname']").text
    price = float(table_body.split()[2][1:])

    # closes chrome
    driver.quit()

    # changes the cell value
    sheet = wb.sheets['sheetname']
    sheet.range('cell').value = price
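The price-parsing step (splitting the table text and dropping the leading currency symbol) is easy to verify on its own. The sample string below is a made-up stand-in for the real table text:

```python
def extract_price(table_text, index=2):
    """Take the word at `index`, strip its first character
    (the currency symbol), and parse the rest as a float."""
    return float(table_text.split()[index][1:])

# Hypothetical table text of the form "<label> <member> <price>"
print(extract_price("Total Member $12.34"))  # 12.34
```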

How to yield fragment URLs in scrapy using Selenium?

With my limited knowledge of web scraping, I've come upon a very complex issue for me, which I'll try to explain as best I can (so I'm open to suggestions or edits to my post).
I started using the web-crawling framework Scrapy long ago for my scraping, and it's still the one I use today. Lately I came across this website and found that my framework (Scrapy) could not iterate over the pages, since this website uses fragment URLs (#) to load the data for the next pages. I then made a post about that problem (having no idea of the main cause yet): my post.
After that, I realized my framework can't do it without a JavaScript interpreter or a browser imitation, and the Selenium library was mentioned. I read as much as I could about it (i.e. example1, example2, example3 and example4). I also found this StackOverflow post that gives some leads on my issue.
So finally, my biggest questions are:
1 - Is there any way to iterate/yield over the pages of the website shown above, using Selenium along with Scrapy?
So far, this is the code I'm using, but it doesn't work...
EDIT:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The required imports...

def getBrowser():
    path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
        "(KHTML, like Gecko) Chrome/15.0.87")
    browser = webdriver.PhantomJS(executable_path=path_to_phantomjs, desired_capabilities=dcap)
    return browser

class MySpider(Spider):
    name = "myspider"
    browser = getBrowser()

    def start_requests(self):
        the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
        yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.get_page_links()

    def get_page_links(self):
        """ This first part goes through all available pages """
        for i in xrange(1, 3):  # 210
            new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
                        "config": {"page": str(i)}}
            json_data = json.dumps(new_data)
            new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
            self.browser.get(new_url)
            print "\nThe new URL is -> ", new_url, "\n"
            content = self.browser.page_source
            self.get_item_links(content)

    def get_item_links(self, body=""):
        if body:
            """ This second part goes through all available items """
            raw_links = re.findall(r'listclickable.+?>', body)
            links = []
            if raw_links:
                for raw_link in raw_links:
                    new_link = re.findall(r'data-link=\".+?\"', raw_link)[0].replace("data-link=\"", "").replace("\"", "")
                    links.append(str(new_link))
                if links:
                    ids = self.get_ids(links)
                    for link in links:
                        current_id = self.get_single_id(link)
                        print "\nThe Link -> ", link
                        # If the line below is commented out the code works; it doesn't otherwise
                        yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)

    def get_ids(self, list1=[]):
        if list1:
            ids = []
            for elem in list1:
                raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
                ids.append(raw_id)
            return ids
        else:
            return []

    def get_single_id(self, text=""):
        if text:
            raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
            return raw_id
        else:
            return ""

    def parse_room(self, response):
        # More scraping code...
So this is mainly my problem. I'm almost sure that what I'm doing isn't the best way, which is why I ask my second question. And to avoid running into this kind of issue in the future, I ask my third.
2 - If the answer to the first question is negative, how could I tackle this issue? I'm open to other approaches.
3 - Can anyone point me to pages where I can learn how to combine web scraping with JavaScript and Ajax? More and more websites nowadays use JavaScript and Ajax scripts to load their content.
Many thanks in advance!
Selenium is one of the best tools for scraping dynamic data. You can use Selenium with any web browser to fetch data that is loaded by scripts; it works exactly like the browser's own click operations. But I don't prefer it.
For getting dynamic data you can use the scrapy + splash combo. From Scrapy you will get all the static data, and Splash handles the other, dynamic content.
Have you looked into BeautifulSoup? It's a very popular web-scraping library for Python. As for JavaScript, I would recommend something like Cheerio (if you're asking for a scraping library in JavaScript).
If you mean that the website uses HTTP requests to load content, you could always try to replicate those manually with something like the requests library.
Hope this helps.
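For this particular site, note that the URL fragment is just base64-encoded JSON (as the question's own code shows), so candidate page URLs can be generated without a browser at all. A minimal sketch, reusing the field names from the question's code:

```python
import base64
import json

BASE = "http://www.atraveo.com/es_es/islas_canarias#"

def page_url(page):
    """Build the fragment URL for a given page of results."""
    state = {"data": {"countryId": "ES", "regionId": "920",
                      "duration": 7, "minPersons": 1},
             "config": {"page": str(page)}}
    encoded = base64.b64encode(json.dumps(state).encode("utf-8")).decode("ascii")
    return BASE + encoded

url = page_url(1)
# Round-trip: the fragment decodes back to the same state
decoded = json.loads(base64.b64decode(url.split("#", 1)[1]))
print(decoded["config"]["page"])  # 1
```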
You can definitely use Selenium as a standalone tool to scrape webpages with dynamic content (like AJAX loading).
Selenium just relies on a WebDriver (basically a web browser) to fetch content over the Internet.
Here are a few of them (the most often used):
ChromeDriver
PhantomJS (my favorite)
Firefox
Once you're set up, you can start your bot and parse the HTML content of the webpage.
I've included a minimal working example below using Python and ChromeDriver:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path='chromedriver')
driver.get('https://www.google.com')
# Then you can search for any element you want on the webpage
search_bar = driver.find_element(By.CLASS_NAME, 'tsf-p')
search_bar.click()
driver.close()
See the documentation for more details!

Issue in invoking "onclick" event using PyQt & javascript

I am trying to scrape data from a website using Beautiful Soup. By default, the webpage shows 18 items, and after clicking a JavaScript button "showAlldevices" all 41 items become visible. Beautiful Soup scrapes data only for the items visible by default, so to get data for all items I used the PyQt module and invoked the click event via JavaScript. Below is the code in question:
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://www.att.com/shop/wireless/devices/smartphones.html'
r = Render(url)

jsClick = """var evObj = document.createEvent('MouseEvents');
             evObj.initEvent('click', true, true );
             this.dispatchEvent(evObj);
          """
allSelector = "a#deviceShowAllLink"
allButton = r.frame.documentElement().findFirst(allSelector)
allButton.evaluateJavaScript(jsClick)

html = allButton.webFrame().toHtml()
page = html
soup = BeautifulSoup(page)
soup.prettify()

with open('Smartphones_26decv2.0.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date", "Day of Week", "Device Name", "Price"])
    items = soup.findAll('a', {"class": "clickStreamSingleItem"}, text=True)
    prices = soup.findAll('div', {"class": "listGrid-price"})
    for item, price in zip(items, prices):
        textcontent = u' '.join(price.stripped_strings)
        if textcontent:
            spamwriter.writerow([time.strftime("%Y-%m-%d"), time.strftime("%A"),
                                 unicode(item.string).encode('utf8').strip(), textcontent])
I am feeding the html to Beautiful Soup using this line of code: html = allButton.webFrame().toHtml(). The code runs without any errors, but I am still not getting data for all 41 items in the output csv.
I also tried feeding html to beautiful soup using these lines of code:
allButton = r.frame.documentElement().findFirst(allSelector)
a = allButton.evaluateJavaScript(jsClick)
html = a.webFrame.toHtml()
page = html
soup = BeautifulSoup(page)
But I came across this error:
    html = a.webFrame.toHtml()
AttributeError: 'QVariant' object has no attribute 'webFrame'
Please pardon my ignorance if I am asking something fundamental here, as I am new to programming, and please help me solve this issue.
I think there is a problem with your JavaScript code. Since you're creating a MouseEvent object you should use an initMouseEvent method for initialization. You can find an example here.
UPDATE2
But I think the simplest thing you can try is to use the JavaScript DOM method onclick of the a element instead of your own JavaScript code. Something like this:
allButton.evaluateJavaScript("this.onclick()")
should work. I suppose you will have to reload the page after clicking, before passing it to the parser.
UPDATE 3
You can reload the page via r.action(QWebPage.ReloadAndBypassCache) or r.action(QWebPage.Reload), but it doesn't seem to have any effect. I've tried displaying the page with QWebView, clicking the link, and seeing what happens. Unfortunately I get lots of Segmentation Fault errors, so I would swear there is a bug somewhere in PyQt4/Qt4. As the page being scraped uses jQuery, I've also tried displaying it after loading jQuery in the QWebPage, but again no luck (the segfaults do not disappear). I'm giving up :( I hope other users here at SO will help you. Anyway, I recommend asking for help on the PyQt4 mailing list; they provide excellent support to PyQt users.
UPDATE
The error you get when changing your code is expected: remember that allButton is a QWebElement object, and the QWebElement.evaluateJavaScript method returns a QVariant (as stated in the docs), and that kind of object doesn't have a webFrame attribute, as you can check by reviewing this page.
