Selenium WebDriver not executing JavaScript

I'm trying to scrape data from an AliExpress product page (example URL in the code below).
I need this section (the transaction history).
My code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

my_url = 'https://www.aliexpress.com/item/Cosmetic-Brush-Makeup-Blusher-Eye-Shadow-Kabuki-Brushes-Set-Tool-Kit-22pcs/32765190537.html?ws_ab_test=searchweb0_0'

chrome_options = Options()
chrome_options.add_argument("--enable-javascript")

driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(my_url)

# grab the rendered HTML via JavaScript and via page_source for comparison
innerHTML = driver.execute_script('return document.getElementsByTagName("html")[0].innerHTML')
page_html = driver.page_source
When I run
document.getElementsByTagName("html")[0].innerHTML
in the Chrome console, I get the entire HTML, including the section that I need.
However, the innerHTML object gives me the same HTML as driver.page_source (without the section that I need).
As far as I know, this section is not inside an iframe.
Some help please :-)

You probably want to look for this specific table. Running
innerHTML = document.querySelectorAll('table.transaction-feedback-table');
in the console will probably find it.
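If you want to pull that table back into Python rather than reading it in the browser console, a minimal sketch is to run the same query through execute_script. This assumes the table.transaction-feedback-table element exists and has already been rendered (as the next answer notes, it only appears after scrolling and an Ajax request):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get(my_url)  # the product URL from the question

# return the rendered table's HTML, or None if it has not been rendered yet
table_html = driver.execute_script(
    "var t = document.querySelector('table.transaction-feedback-table');"
    " return t ? t.outerHTML : null;"
)
print(table_html)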

The transactions are generated only after the element with ID j-transaction-feedback becomes visible; you have to scroll to the element and wait until the Ajax request has finished.
from selenium.webdriver.support.ui import WebDriverWait
....
....
driver.get(my_url)
# scroll to the element
driver.find_element_by_css_selector('#j-transaction-feedback').location_once_scrolled_into_view
# wait until Ajax finished and render the element
transaction = WebDriverWait(driver, 15).until(
    lambda d: d.find_element_by_css_selector('.transaction-feedback-content')
)
total_transaction = driver.find_element_by_css_selector('#j-transaction-feedback .text')
page_source = driver.page_source
print('total_transaction: ' + total_transaction.text)

Related

How to scrape links that do not have an href and are not available in the page source

I'm trying to use the Selenium web driver to extract data, and I see that one of the links I want to click does not have an href. The HTML tags I see in Inspect Element are also not available in the page source. I badly want the link to be clicked so I can proceed to the next page.
The anchor tag that I see during inspection is as below, and it seems to involve AngularJS:
<a id="docs" ng-click="changeFragment('deal.docs')">
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get('URL here');
time.sleep(5) # Let the user actually see something!
username = driver.find_element_by_name('USERID')
username.send_keys('12345')
password = driver.find_element_by_name('PASSWORD')
password.send_keys('password')
#search_box.submit()
driver.find_element_by_id("submitInput").submit()
time.sleep(5) # Let the user actually see something!
lnum = driver.find_element_by_name('Number')
lnum.send_keys('0589403823')
checkbox = driver.find_element_by_name('includeInactiveCheckBox').click()
driver.find_element_by_id("searchButton").click()
time.sleep(5)
driver.execute_script("changeFragment('deal.docs')")
driver.quit()
I tried to find the element by XPath and by executing the script, but neither worked.
The URL I'm trying to access can't be shared, as it can only be reached from within a specific network.
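Since the anchor does have an id ("docs"), one approach worth trying is sketched below: wait until the Angular-rendered link is clickable, then click it normally. This is only a sketch; it assumes the element with that id eventually becomes clickable, and it has not been tested against the actual (private) site.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait for the Angular-rendered link to become clickable, then click it
docs_link = WebDriverWait(driver, 15).until(
    EC.element_to_be_clickable((By.ID, "docs"))
)
docs_link.click()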

Web scraping with BeautifulSoup won't work

Ultimately, I'm trying to open all the articles of a news website and then make a top 10 of the words used across those articles. To do this, I first wanted to see how many articles there are so I could iterate over them at some point; I haven't really figured out how I want to do everything yet.
I wanted to use BeautifulSoup4 for this. I think the class I'm trying to get is rendered by JavaScript, since I'm not getting anything back.
This is my code:
url = "http://ad.nl"
ad = requests.get(url)
soup = BeautifulSoup(ad.text.lower(), "xml")
titels = soup.findAll("article")
print(titels)
for titel in titels:
print(titel)
The article name is sometimes in an h2 and sometimes in an h3. It always has one and the same class, but I can't get anything through that class. It has some parents that use the same class name with an extension such as -wrapper. I don't even know how to use a parent to get what I want, and I think those classes are generated by JavaScript as well. There's also an href I'm interested in, but once again, it returns nothing, so it is probably also generated by JavaScript.
Does anyone know how I could get any of this (preferably the href, but the article name would be fine as well) using BeautifulSoup?
In case you don't want to use Selenium: this works for me. I've tried it on two PCs with different internet connections. Can you try?
from bs4 import BeautifulSoup
import requests

cookies = {"pwv": "2",
           "pws": "functional|analytics|content_recommendation|targeted_advertising|social_media"}
page = requests.get("https://www.ad.nl/", cookies=cookies)
soup = BeautifulSoup(page.content, 'html.parser')
articles = soup.findAll("article")
Then follow kimbo's code (in the answer below) to extract the h2/h3 titles.
As @Sri mentioned in the comments, when you open that URL a page shows up where you have to accept the cookies first, which requires interaction.
When you need interaction, consider using something like Selenium (https://selenium-python.readthedocs.io/).
Here's something that should get you started.
(Edit: you'll need to run pip install selenium before running the code below.)
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://ad.nl'

# launch firefox with your url above
# note that you could change this to some other webdriver (e.g. Chrome)
driver = webdriver.Firefox()
driver.get(url)

# click the "accept cookies" button
btn = driver.find_element_by_name('action')
btn.click()

# grab the html. It'll wait here until the page is finished loading
html = driver.page_source

# parse the html soup
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.findAll("article")

for article in articles:
    # check for article titles in both h2 and h3 elems
    h2_titles = article.findAll('h2', {'class': 'ankeiler__title'})
    h3_titles = article.findAll('h3', {'class': 'ankeiler__title'})
    for t in h2_titles:
        # first I was doing print(t.text), but some of them had leading
        # newlines and things like '22:30', which I assume was the hour of the day
        text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
        print(text)
    for t in h3_titles:
        text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
        print(text)

# close the browser
driver.close()
This may or may not be exactly what you have in mind, but it's an example of how to use Selenium and BeautifulSoup together. Feel free to copy/use/modify this as you see fit.
And if you're wondering what selectors to use, read the comment by @JL Peyret.

execute_script fails to send long text although send_keys works well [duplicate]

My code inputs text into the text area of the web page line by line. How can I make it insert the entire text all at once instead? Is there a solution for this? Line by line takes a lot of time.
import time
from selenium import webdriver

def Translated_Content(content):
    driver = webdriver.Chrome("C:\\Users\\shricharan.arumugam\\Desktop\\PDF2txt\\chromedriver.exe")
    driver.get('https://translate.shell.com/')

    input_box = driver.find_element_by_id('translateText')
    input_box.send_keys(content)

    translate_button = driver.find_element_by_id('translate')
    translate_button.click()

    translated_text_element = driver.find_element_by_id('translatedText')
    time.sleep(4)
    translated_text = translated_text_element.get_attribute('value')
    driver.close()
    return translated_text
You can change the text of a textbox/textarea through the JavaScript DOM API in a silent way, without going through the front UI:
long_string = <the long string>
input_box = driver.find_element_by_id('translateText')
driver.execute_script('arguments[0].value=arguments[1]', input_box, long_string)
To send the entire chunk of text into the <textarea> using Selenium through Python and speed up the process, you can inject a script and use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
myText = """No, there is no way to hide the console window of the chromedriver.exe
in the .NET bindings without modifying the bindings source code. This is seen
as a feature of the bindings, as it makes it very easy to see when your code
hasn't correctly cleaned up the resources of the ChromeDriver, since the console window
remains open. In the case of some other languages, if your code does not properly clean up
the instance of ChromeDriver by calling the quit() method on the WebDriver object,
you can end up with a zombie chromedriver.exe process running on your machine."""
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver=webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('https://translate.shell.com/')
translate_from = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "textarea.form-control#translateText")))
translate_from._parent.execute_script("""
    var elm = arguments[0], text = arguments[1];
    if (!('value' in elm))
        throw new Error('Expected an <input> or <textarea>');
    elm.focus();
    elm.value = text;
    elm.dispatchEvent(new Event('change'));
    """, translate_from, myText)
driver.find_element_by_css_selector("input#translate").click()

Python Selenium: Iteration Error

I'm trying to download all XML files from a webpage. The process requires locating each XML file's download link one after the other, and once such a link is clicked it leads to a form that needs to be submitted before the download begins. The issue I'm facing lies in the iteration of these loops: once the first file has been downloaded from the webpage, I receive the error:
"selenium.common.exceptions.StaleElementReferenceException: Message: The element reference is stale: either the element is no longer attached to the DOM or the page has been refreshed"
The "97081 data-extension xml" link is the second downloadable file in the iteration. I've attached the code below; any suggestions to rectify this will be much appreciated.
import os
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", r"F:\Projects\Poli_Map\DatG_Py_Dat")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/xml")

driver = webdriver.Firefox(firefox_profile=fp)
driver.get('https://data.gov.in/catalog/variety-wise-daily-market-prices-data-cauliflower')
wait = WebDriverWait(driver, 10)

allelements = driver.find_elements_by_xpath("//a[text()='xml']")
for element in allelements:
    element.click()

    class FormPage(object):
        def fill_form(self, data):
            driver.execute_script("document.getElementById('edit-download-reasons-non-commercial').click()")
            driver.execute_script("document.getElementById('edit-reasons-d-rd').click()")
            driver.find_element_by_xpath('//input[@name = "name_d"]').send_keys(data['name_d'])
            driver.find_element_by_xpath('//input[@name = "mail_d"]').send_keys(data['mail_d'])
            return self

        def submit(self):
            driver.execute_script("document.getElementById('edit-submit').click()")

    data = {
        'name_d': 'xyz',
        'mail_d': 'xyz@outlook.com',
    }
    time.sleep(5)
    FormPage().fill_form(data).submit()
    time.sleep(5)
    window_before = driver.window_handles[0]
    driver.switch_to_window(window_before)
    driver.back()
I found a workaround for you; there is no need to submit any fields.
You need the ID shown in the class field of the catalog entry (here, for instance, it is 962721).
Then use a URL of the form:
https://data.gov.in/node/962721/download
This was found just by doing a bit of "reverse engineering". When you do web scraping, always have a look at the .js files and at your networking tab to see all the requests being made.
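A minimal sketch of that idea (using requests instead of Selenium) is below. It assumes the numeric node ID can be read from a class attribute on each xml link, which may not match the site's real markup, so treat it as a starting point rather than a working script.

import re
import requests
from bs4 import BeautifulSoup

page = requests.get('https://data.gov.in/catalog/variety-wise-daily-market-prices-data-cauliflower')
soup = BeautifulSoup(page.content, 'html.parser')

# hypothetical: pull the numeric node ID out of the class attribute of each xml link
for link in soup.find_all('a', string='xml'):
    match = re.search(r'\d+', ' '.join(link.get('class', [])))
    if match:
        node_id = match.group()
        download_url = 'https://data.gov.in/node/{}/download'.format(node_id)
        response = requests.get(download_url)
        with open('{}.xml'.format(node_id), 'wb') as f:
            f.write(response.content)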

Selenium WebDriver Unable to Find Element by Id, Using Python

I'm trying to pull up an element that only gets created after the JavaScript runs, but I keep getting the following error message:
selenium.common.exceptions.NoSuchElementException: Message: u'Unable to locate element: {"method":"id","selector":"post-count"}' ; Stacktrace: Method FirefoxDriver.prototype.findElementInternal_ threw an error in file:///tmp/tmpittNsw/extensions/fxdriver@googlecode.com/components/driver_component.js
I'm trying to pull this element up on cnn.com. My code:
import socket
from selenium import webdriver

socket.setdefaulttimeout(30)
browser = webdriver.Firefox()   # get a local session of Firefox
browser.get(article_url_txt)    # load the page
result = browser.find_element_by_id("post-count")
The element you are looking for is inside an iframe.
The following did the trick for me:
from selenium.webdriver.support.wait import WebDriverWait
# ...
frame = WebDriverWait(browser, 30).until(lambda x: x.find_element_by_id("dsq1"))
browser.switch_to_frame(frame)
result = WebDriverWait(browser, 30).until(lambda x: x.find_element_by_id("post-count"))
Note that I included the use of WebDriverWait(...).until(...) to allow the elements to be created dynamically just in case.
You can tell the WebDriver to wait implicitly until the element is present:
browser.implicitly_wait(30)
result = browser.find_element_by_id("post-count")

Categories

Resources