I'm trying to scrape some data from TripAdvisor, using Selenium with the Python bindings to get it done.
The review objects on the page sometimes have a 'More' button at the bottom that displays the full review content when clicked. It is actually a span element with an onclick JS function attached to it.
What I want to achieve is to load the page, find the 'More' links and click them, so that the page shows the fully expanded reviews before the scraping operations begin.
So far, I've tried the following code with no luck, and I can't make sense of the errors shown in the stack trace.
import os
import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.tripadvisor.ca/Attraction_Review-g304138-d317476-Reviews-Temple_of_the_Tooth_Sri_Dalada_Maligawa-Kandy_Central_Province.html#REVIEWS")
more = []
more = driver.find_elements_by_class_name('moreLink')
print(len(more))
for x in range(0, len(more)):
    if more[x].is_displayed():
        more[x].click()
        print("clicked")
These are the error logs that I'm getting in the console.
3
Traceback (most recent call last):
File "C:\Users\**\workspace\ReviewScraper\src\scraper\test3.py", line 13, in <module>
more[x].click();
File "C:\Users\**\AppData\Local\Programs\Python\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 75, in click
self._execute(Command.CLICK_ELEMENT)
File "C:\Users\**\AppData\Local\Programs\Python\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 454, in _execute
return self._parent.execute(command, params)
File "C:\Users\**\AppData\Local\Programs\Python\Python35-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 201, in execute
self.error_handler.check_response(response)
File "C:\Users\**\AppData\Local\Programs\Python\Python35-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 102, in check_response
value = json.loads(value_json)
File "C:\Users\**\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "C:\Users\**\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\**\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Any help is highly appreciated.
I managed to get this done by reverting to Selenium 1.48.0 and by logging into TA before scraping the reviews, every time. Once logged in, you can click the 'More' button and extract the full reviews easily.
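For anyone adapting the loop from the question, here is a minimal sketch that keeps going when a single click fails. It assumes only that each element offers is_displayed() and click(), as Selenium's WebElement does; click_all_visible is a hypothetical helper name, and the sketch does not replace the login step described above.

```python
def click_all_visible(elements):
    """Click every displayed element and return (clicked, failed) counts."""
    clicked = 0
    failed = 0
    for element in elements:
        try:
            if element.is_displayed():
                element.click()
                clicked += 1
        except Exception:
            # Any error raised while talking to the driver is counted,
            # so one bad link does not abort the whole run.
            failed += 1
    return clicked, failed
```

With a live driver this would be called as click_all_visible(driver.find_elements_by_class_name('moreLink')), and a non-zero failed count tells you some reviews may still be truncated.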
I'm coding my first django project with python3 and when I close my server I get this error message:
I'm trying to figure it out but I can't find a solution. Please help. Thanks!
^CTraceback (most recent call last):
File "manage.py", line 22, in <module>
main()
File "manage.py", line 18, in main
execute_from_command_line(sys.argv)
File "/usr/local/lib/python3.7/site-packages/django/core/management/__init__.py", line 401, in execute_from_command_line
utility.execute()
File "/usr/local/lib/python3.7/site-packages/django/core/management/__init__.py", line 395, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/local/lib/python3.7/site-packages/django/core/management/base.py", line 341, in run_from_argv
connections.close_all()
File "/usr/local/lib/python3.7/site-packages/django/db/utils.py", line 230, in close_all
connection.close()
File "/usr/local/lib/python3.7/site-packages/django/utils/asyncio.py", line 26, in inner
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py", line 261, in close
if not self.is_in_memory_db():
File "/usr/local/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py", line 380, in is_in_memory_db
return self.creation.is_in_memory_db(self.settings_dict['NAME'])
File "/usr/local/lib/python3.7/site-packages/django/db/backends/sqlite3/creation.py", line 12, in is_in_memory_db
return database_name == ':memory:' or 'mode=memory' in database_name
TypeError: argument of type 'PosixPath' is not iterable
I have also started learning Python and Django recently, and I have run into situations like this.
The traceback you posted shows the error being raised in Django's SQLite database backend.
In my opinion, it is better to take it slow: first run the Django project as installed, without any of your own code.
Then add to it gradually. Once you add a single app, run the migrations to verify that everything works before creating another; that way you can easily identify where a problem comes from.
Or just create test cases.
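The last frame of the traceback can be reproduced without Django. The sketch below copies the membership test from creation.py exactly as it appears in the traceback and feeds it a pathlib path; the project path is made up. It assumes (this is a guess about the asker's settings, not something shown in the question) that DATABASES['default']['NAME'] holds a Path object, in which case converting it with str() avoids the TypeError.

```python
from pathlib import PurePosixPath


def is_in_memory_db(database_name):
    # Same expression as the line quoted in the traceback.
    return database_name == ':memory:' or 'mode=memory' in database_name


# Hypothetical database location, for illustration only.
db_path = PurePosixPath("/home/user/project") / "db.sqlite3"

raised = False
try:
    is_in_memory_db(db_path)        # a Path object does not support "in"
except TypeError:
    raised = True

as_string = is_in_memory_db(str(db_path))   # a plain string works

print(raised, as_string)
```

If that assumption holds, writing the NAME entry as a string in settings.py should make the shutdown error go away.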
I want to send a message to this website with Python.
That is, I want to do the following, but with Python:
That's why I tried the following script with Selenium:
api_location = 'http://iphoneapp.spareroom.co.uk'
api_search_endpoint = 'flatshares'
api_details_endpoint = 'flatshares'
location = 'http://www.spareroom.co.uk'
details_endpoint = 'flatshare/flatshare_detail.pl?flatshare_id='

def contact_room(room_id):
    url = '{location}/{endpoint}/{id}?format=json'.format(location=api_location, endpoint=api_details_endpoint, id=room_id)
    from selenium import webdriver
    driver = webdriver.Chrome()
    # Go to your page url
    driver.get(url)
    # Get button you are going to click by its id ( also you could use find_element_by_css_selector to get element by css selector)
    button_element = driver.find_element_by_id('button id')
    button_element.click()
But it returns:
C:\Users\antoi\Documents\Programming\projects\roomfinder>python test_message.py
Traceback (most recent call last):
File "C:\Python36\lib\site-packages\selenium\webdriver\common\service.py", line 76, in start
stdin=PIPE)
File "C:\Python36\lib\subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "C:\Python36\lib\subprocess.py", line 997, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test_message.py", line 21, in <module>
contact_room(13829371)
File "test_message.py", line 14, in contact_room
driver = webdriver.Chrome() # Optional argument, if not specified will search path.
File "C:\Python36\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 73, in __init__
self.service.start()
File "C:\Python36\lib\site-packages\selenium\webdriver\common\service.py", line 83, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
But I have already added it to the PATH:
I am a JavaScript learner. If you have tips, and the time to show how to do this in JavaScript as well, I am always happy to learn :)
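The chromedriver executable needs to be in the same directory as your Python script, or you need to pass its location to the driver explicitly (use a raw string so the backslashes are not treated as escape sequences):
driver_path = r'Path\to\your\Driver'
driver = webdriver.Chrome(executable_path=driver_path)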
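To check whether Python can actually see the driver on PATH, something like the following may help. It uses only the standard library; locate_driver is a hypothetical helper name, and "chromedriver" is the executable name taken from the error message.

```python
import shutil


def locate_driver(name="chromedriver"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


path = locate_driver()
if path is None:
    print("chromedriver was not found on PATH")
else:
    print("chromedriver found at", path)
```

If this prints that the driver was not found, the PATH seen by the Python process is not the one you edited; opening a new terminal after changing PATH is often enough.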
Why are you using webdriver.Firefox() if you talk about Chrome?
I'm trying to download documents from a website.
When I inspect the element in my browser, this is what I get:
<td width="3%" align="left" id="tdvPDF0" colspan="3">
PDF
/
XML
/
DOCX
</td>
I would like to download all three documents, i.e. the PDF, XML, and DOCX.
This JavaScript can accept three arguments. In this case they are:
1. JRAOB2SNRXEAPX2 (string)
2. 0 (integer)
3. PDF (string)
I have no idea of how to ascertain the correct input for the first argument (in this example: "JRAOB2SNRXEAPX2")
I would like to have my code work regardless of the first argument.
Previously, when I've encountered JavaScript functions I used the following:
driver.execute_script(name_of_JavaScript_script())
that would generally work, however I have never encountered a JavaScript with arguments as in this case, e.g. downloadClicked('JRAOB2SNRXEAPX2', 0, 'PDF')
I tried the following without success:
driver.execute_script(downloadClicked('JRAOB2SNRXEAPX2', 0, 'PDF'))
driver.execute_script(downloadClicked(''JRAOB2SNRXEAPX2', 0, 'PDF''))
driver.execute_script(downloadClicked('JRAOB2SNRXEAPX2', 0, 'PDF')); return false;
and many other similar options.
I've also tried:
javascript = driver.find_element_by_id('tdvPDF0').click()
driver.execute_script(javascript)
In addition I've tried:
driver.find_element_by_id('tdvPDF0').click()
The code for the function currently looks like this:
def private_pair_ifw_downloader(driver, application_number, pause=1):
    private_pair_enter_application(driver, application_number)
    time.sleep(pause)
    driver.execute_script('submitTab("ifwtab")')
    time.sleep(pause)
    driver.execute_script('"javaScript:downloadClicked(''JRAOB2SNRXEAPX2', 0, 'PDF''); return false;"')
I expected the code to invoke the JavaScript function, which in turn should download the PDF file; however, I received the following error:
Traceback (most recent call last):
File "C:/Workspaces/patents_repo/USPTO_scraper/uspto_private_pair_scraper.py", line 41, in <module>
private_pair_ifw_downloader(driver, '15723211')
File "C:\Workspaces\patents_repo\utils\web_utils.py", line 211, in private_pair_ifw_downloader
driver.execute_script('"javaScript:downloadClicked(''JRAOB2SNRXEAPX2', 0, 'PDF''); return false;"')
File "C:\Users\eitan\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 636, in execute_script
'args': converted_args})['value']
File "C:\Users\eitan\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Users\eitan\Anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Runtime.evaluate threw exception: SyntaxError: Unexpected identifier
(Session info: chrome=76.0.3809.100)
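All you have to do is pass the JavaScript to execute_script as a single Python string. Wrapping it in double quotes lets you keep the single quotes around the JavaScript arguments: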
driver.execute_script("downloadClicked('JRAOB2SNRXEAPX2', 0, 'PDF');")
If you don't know the first argument in advance, you can try something like:
driver.execute_script('document.querySelector("a[onclick*=PDF]").onclick()')
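If the arguments come from variables, one way to avoid quoting mistakes is to build the JavaScript call with json.dumps, which produces valid JavaScript literals for strings and numbers. This is only a sketch: js_call is a hypothetical helper, not part of Selenium.

```python
import json


def js_call(function_name, *args):
    """Build a JavaScript call expression with safely quoted arguments."""
    rendered = ", ".join(json.dumps(arg) for arg in args)
    return "{}({});".format(function_name, rendered)


script = js_call("downloadClicked", "JRAOB2SNRXEAPX2", 0, "PDF")
print(script)  # downloadClicked("JRAOB2SNRXEAPX2", 0, "PDF");
```

The resulting string can then be passed to driver.execute_script(script).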
I've got a page.
I want to visit every page associated with an option in the drop-down menu at the top of that page, in order to get each URL.
I'm new to Selenium, so I'm trying some preliminary work:
Open the driver
Get it to the webpage
Select the drop-down menu
Select a "name" using an arbitrary value = 2
Go to the resulting page, get its URL, and print it.
Select a "name" using an arbitrary value = 3
ERROR.
The code I use:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
import time
driver = webdriver.Firefox()
driver.get("http://www.hillsproducts.com/General.aspx/en-GB/PD/a-d-canine/original/can")
select = Select(driver.find_element_by_xpath("//select[@id='productSpecifier_product']"))
value="2"
select.select_by_value(value)
print(driver.current_url)
time.sleep(10)
value="3"
select.select_by_value(value)
print(driver.current_url)
There is something I don't get.
The error I get is the following:
Traceback (most recent call last):
File "/Users/Luigi/Desktop/selenium_attempt.py", line 19, in <module>
select.select_by_value(value)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium-2.46.1-py3.4.egg/selenium/webdriver/support/select.py", line 76, in select_by_value
opts = self._el.find_elements(By.CSS_SELECTOR, css)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium-2.46.1-py3.4.egg/selenium/webdriver/remote/webelement.py", line 485, in find_elements
{"using": by, "value": value})['value']
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium-2.46.1-py3.4.egg/selenium/webdriver/remote/webelement.py", line 447, in _execute
return self._parent.execute(command, params)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium-2.46.1-py3.4.egg/selenium/webdriver/remote/webdriver.py", line 193, in execute
self.error_handler.check_response(response)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium-2.46.1-py3.4.egg/selenium/webdriver/remote/errorhandler.py", line 181, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: Element not found in the cache - perhaps the page has changed since it was looked up
Stacktrace:
at fxdriver.cache.getElementAt (resource://fxdriver/modules/web-element-cache.js:9348)
at Utils.getElementAt (file:///var/folders/8s/hl6bx6z91yq6r81hpqg995rw0000gn/T/tmpr37ozu9l/extensions/fxdriver@googlecode.com/components/driver-component.js:8942)
at FirefoxDriver.prototype.findElementsInternal_ (file:///var/folders/8s/hl6bx6z91yq6r81hpqg995rw0000gn/T/tmpr37ozu9l/extensions/fxdriver@googlecode.com/components/driver-component.js:10685)
at FirefoxDriver.prototype.findChildElements (file:///var/folders/8s/hl6bx6z91yq6r81hpqg995rw0000gn/T/tmpr37ozu9l/extensions/fxdriver@googlecode.com/components/driver-component.js:10706)
at DelayedCommand.prototype.executeInternal_/h (file:///var/folders/8s/hl6bx6z91yq6r81hpqg995rw0000gn/T/tmpr37ozu9l/extensions/fxdriver@googlecode.com/components/command-processor.js:12643)
at DelayedCommand.prototype.executeInternal_ (file:///var/folders/8s/hl6bx6z91yq6r81hpqg995rw0000gn/T/tmpr37ozu9l/extensions/fxdriver@googlecode.com/components/command-processor.js:12648)
at DelayedCommand.prototype.execute/< (file:///var/folders/8s/hl6bx6z91yq6r81hpqg995rw0000gn/T/tmpr37ozu9l/extensions/fxdriver@googlecode.com/components/command-processor.js:12590)
Any idea would be appreciated !
UPDATE after Alex's answer :
Traceback (most recent call last):
File "/Users/Luigi/Desktop/selenium_attempt.py", line 18, in <module>
if index >= len(select.options):
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium-2.46.1-py3.4.egg/selenium/webdriver/support/select.py", line 46, in options
return self._el.find_elements(By.TAG_NAME, 'option')
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium-2.46.1-py3.4.egg/selenium/webdriver/remote/webelement.py", line 485, in find_elements
{"using": by, "value": value})['value']
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium-2.46.1-py3.4.egg/selenium/webdriver/remote/webelement.py", line 447, in _execute
return self._parent.execute(command, params)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium-2.46.1-py3.4.egg/selenium/webdriver/remote/webdriver.py", line 193, in execute
self.error_handler.check_response(response)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium-2.46.1-py3.4.egg/selenium/webdriver/remote/errorhandler.py", line 181, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: Element not found in the cache - perhaps the page has changed since it was looked up
Stacktrace:
at fxdriver.cache.getElementAt (resource://fxdriver/modules/web-element-cache.js:9348)
at Utils.getElementAt (file:///var/folders/8s/hl6bx6z91yq6r81hpqg995rw0000gn/T/tmpzrilw39c/extensions/fxdriver@googlecode.com/components/driver-component.js:8942)
at FirefoxDriver.prototype.findElementsInternal_ (file:///var/folders/8s/hl6bx6z91yq6r81hpqg995rw0000gn/T/tmpzrilw39c/extensions/fxdriver@googlecode.com/components/driver-component.js:10685)
at FirefoxDriver.prototype.findChildElements (file:///var/folders/8s/hl6bx6z91yq6r81hpqg995rw0000gn/T/tmpzrilw39c/extensions/fxdriver@googlecode.com/components/driver-component.js:10706)
at DelayedCommand.prototype.executeInternal_/h (file:///var/folders/8s/hl6bx6z91yq6r81hpqg995rw0000gn/T/tmpzrilw39c/extensions/fxdriver@googlecode.com/components/command-processor.js:12643)
at DelayedCommand.prototype.executeInternal_ (file:///var/folders/8s/hl6bx6z91yq6r81hpqg995rw0000gn/T/tmpzrilw39c/extensions/fxdriver@googlecode.com/components/command-processor.js:12648)
at DelayedCommand.prototype.execute/< (file:///var/folders/8s/hl6bx6z91yq6r81hpqg995rw0000gn/T/tmpzrilw39c/extensions/fxdriver@googlecode.com/components/command-processor.js:12590)
You have to reinstantiate the Select() every time a new page is loaded:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
driver = webdriver.Firefox()
driver.get("http://www.hillsproducts.com/General.aspx/en-GB/PD/a-d-canine/original/can")
index = 0
while True:
    select = Select(driver.find_element_by_id("productSpecifier_product"))
    # exit the loop if all the options were seen
    if index >= len(select.options):
        break
    select.select_by_index(index)
    print(driver.current_url)
    index += 1
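If the element still goes stale between the lookup and its first use (the updated traceback in the question suggests the page may still be reloading at that moment), one option is to retry the whole lookup. The helper below is a generic sketch, not a Selenium API; retry_on and its parameters are hypothetical names.

```python
import time


def retry_on(exceptions, action, attempts=3, delay=0.5):
    """Call action(); if it raises one of `exceptions`, wait and try again."""
    for attempt in range(attempts):
        try:
            return action()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller see the error
            time.sleep(delay)
```

With Selenium it could be used as retry_on(StaleElementReferenceException, lambda: len(Select(driver.find_element_by_id("productSpecifier_product")).options)), importing the exception from selenium.common.exceptions. The lookup and the use of the element both sit inside the lambda, so each retry starts from a fresh element.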
This is my first time asking a question here, and I'm new to Python.
I installed mechanize and BeautifulSoup to fill in some forms on a page.
Now I use br.submit() to send the request, but it doesn't work!
Is there any way to call the onclick function (JavaScript)?
Here is the markup for the button that sends the data:
<div class="go_btm w_a1">
<p class="gogo">search</p>
<p class="gogo">cancel</p>
<br class="CLEAR" />
</div>
UPDATE:
Thank you for suggesting Selenium.
But I have another problem. My code below:
for i in range(len(all_options)):
    arr.append(all_options[i])
count = 0
for option in arr:
    print("Value is: %s" % option.get_attribute("value"))
    if count > 1:
        option.click()
        string = u'search'
        link2 = browser.find_element_by_link_text(string.encode('utf8'))
        response = link2.click()
        browser.back()
    count = count + 1
After going back to the same page, I get this:
Traceback (most recent call last):
File "C:\Users\pc2\Desktop\TEST.py", line 44, in <module>
print("Value is: %s" % option.get_attribute("value"))
File "C:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.py", line 93, in get_attribute
resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
File "C:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.py", line 385, in _execute
return self._parent.execute(command, params)
File "C:\Python27\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 173, in execute
self.error_handler.check_response(response)
File "C:\Python27\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 166, in check_response
raise exception_class(message, screen, stacktrace)
StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=40.0.2214.111)
(Driver info: chromedriver=2.9.248315,platform=Windows NT 6.1 SP1 x86_64)
I can only click an option once.
Does this mean the options stored in my array are no longer valid?
How can I keep the option variables usable so that the next loop iteration can click them?
mechanize cannot handle javascript:
How do I use Mechanize to process JavaScript?
Instead, you can automate a real browser via selenium. Example:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('myurl')
link = driver.find_element_by_link_text('search')
link.click()
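Regarding the StaleElementReferenceException in the UPDATE: element objects collected before a navigation generally stop being valid once the page is reloaded, so a common pattern is to store the option values (plain strings) and look the option up again on every iteration. A minimal sketch, assuming find_options is a callable you supply that returns the current option elements and that each element has get_attribute(); visit_each_option is a hypothetical helper name.

```python
def visit_each_option(find_options, handle):
    """Run handle() on every option, re-locating it by value each time."""
    # Plain strings survive page reloads; element references do not.
    values = [option.get_attribute("value") for option in find_options()]
    for value in values:
        for option in find_options():       # fresh lookup on the current page
            if option.get_attribute("value") == value:
                handle(option)
                break
    return values
```

Here handle would do the click, the search, and browser.back(), and find_options could be something like lambda: browser.find_elements_by_tag_name("option").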