Part of my code is a function that scrolls the page. It is reproduced from the code for scraping Google Jobs here.
It throws the error "javascript error: Cannot read properties of null (reading 'scrollHeight')".
I'm not sure why document.querySelector('.zxU94d') is null:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
def scroll_page(url):
    service = Service(ChromeDriverManager().install())
    # Add the settings to run the Chrome browser
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument('--lang=en')
    options.add_argument("user-agent=AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(url)
    # Store the initial height of Google Jobs page
    old_height = driver.execute_script("document.querySelector('.zxU94d').scrollHeight")
When I open the URL manually (here), I can get the height in the browser console without a problem.
If I try to handle the null case, it returns None:
old_height = driver.execute_script("""
if (document.querySelector('.zxU94d')) {
document.querySelector('.zxU94d').scrollHeight;
}
""")
I'm the author of the Scrape Google Jobs organic results with Python blog you link to.
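When writing the script, I ran into the same problem as you: the page height was not read. execute_script only passes a value back to Python when the script ends with a return statement, so I solved this problem by declaring a function that returns the height of the page and returning the result of calling it: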
old_height = driver.execute_script("""
function getHeight() {
return document.querySelector('.zxU94d').scrollHeight;
}
return getHeight();
""")
You are using the scrollHeight property to get the height of a specific element, but reading it does not scroll the page. To scroll the page, use the scrollTo() method.
If you need to scroll down the page only once, then you can add the following script:
driver.execute_script("document.querySelector('.zxU94d').scrollTo(0, document.querySelector('.zxU94d').scrollHeight)")
If you need to scroll the page to the end, then the algorithm looks like this:
1. Find out the initial page height and write the result to the old_height variable.
2. Scroll the page using the script and wait a few seconds for the data to load.
3. Find out the new page height and write the result to the new_height variable.
4. If new_height and old_height are equal, the algorithm is complete; otherwise write the value of new_height to old_height and return to step 2.
import time

# Step 1: store the initial height (the script must "return" the value)
old_height = driver.execute_script("""
    return document.querySelector('.zxU94d').scrollHeight;
""")

while True:
    # Step 2: scroll to the bottom and wait for new data to load
    driver.execute_script("document.querySelector('.zxU94d').scrollTo(0, document.querySelector('.zxU94d').scrollHeight)")
    time.sleep(2)

    # Step 3: read the height again
    new_height = driver.execute_script("""
        return document.querySelector('.zxU94d').scrollHeight;
    """)

    # Step 4: stop when the height no longer changes
    if new_height == old_height:
        break

    old_height = new_height
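The four steps above can also be wrapped in a small helper that does not depend on Selenium directly. This is only a sketch: scroll_until_stable, get_height, scroll and max_rounds are names made up for illustration and are not part of the blog post or of Selenium. The max_rounds cap is an added assumption, so that a page which keeps growing cannot loop forever.

```python
import time
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll: Callable[[], None],
    pause: float = 2.0,
    max_rounds: int = 50,  # assumption: safety cap, not in the original loop
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll until the reported height stops changing; return the last height."""
    old_height = get_height()            # step 1
    for _ in range(max_rounds):
        scroll()                         # step 2
        sleep(pause)
        new_height = get_height()        # step 3
        if new_height == old_height:     # step 4
            break
        old_height = new_height
    return old_height


# Possible use with the driver from the question:
# scroll_until_stable(
#     lambda: driver.execute_script(
#         "return document.querySelector('.zxU94d').scrollHeight"),
#     lambda: driver.execute_script(
#         "document.querySelector('.zxU94d').scrollTo("
#         "0, document.querySelector('.zxU94d').scrollHeight)"),
# )
```

Passing the two actions in as callables keeps the loop testable without a browser.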
If there are difficulties, I will be happy to answer all questions.
When web scraping with Selenium, it's best to wait explicitly for elements to load. In your case, the driver is likely executing the JavaScript before the page has fully loaded.
You can use WebDriverWait:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.by import By
_timeout = 10  # seconds
WebDriverWait(driver, _timeout).until(
    expected_conditions.presence_of_element_located(
        (By.CSS_SELECTOR, ".zxU94d")
    )
)
# The script must "return" the value, otherwise execute_script gives None
old_height = driver.execute_script("""if (document.querySelector('.zxU94d')) {
    return document.querySelector('.zxU94d').scrollHeight;
}""")
What @granitosaurus said, but you can also drop the if and use the optional chaining operator (?.). Don't forget to "return" (very important):
old_height = driver.execute_script("""
return document.querySelector('.zxU94d')?.scrollHeight || 0
""")
Related
I am trying to automate the sign-up process on the VirusTotal site using Selenium in Python, but I'm having a problem getting an element by id. I am stuck on this; any help would be appreciated. Thanks.
Here is the code I am trying:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver =webdriver.Chrome()
driver.get('https://www.virustotal.com/gui/join-us')
print(driver.title)
search = driver.find_element_by_id("first_name")
search.send_keys("Muhammad Aamir")
search.send_keys(Keys.RETURN)
time.sleep(5)
driver.quit()
The First name field within the website https://www.virustotal.com/gui/join-us is located deep within multiple #shadow-root (open).
Solution
To send a character sequence to the First name field you have to use shadowRoot.querySelector() and you can use the following Locator Strategy:
Code Block:
from selenium import webdriver
import time
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://www.virustotal.com/gui/join-us")
time.sleep(7)
first_name = driver.execute_script("return document.querySelector('vt-virustotal-app').shadowRoot.querySelector('join-us-view.iron-selected').shadowRoot.querySelector('vt-ui-two-column-hero-layout').querySelector('vt-ui-text-input#first_name').shadowRoot.querySelector('input#input')")
first_name.send_keys("Muhammad Aamir")
References
You can find a couple of relevant discussions in:
Unable to locate the Sign In element within #shadow-root (open) using Selenium and Python
If you look at the HTML of the website, you can see that your input field is within a so-called #shadow-root.
These shadow roots prevent you from finding the elements they contain with a simple find_element_by_id. You can fix this by finding all the parent shadow roots that contain the element you are looking for. In each shadow root you need to use JavaScript's querySelector to find the next shadow root, until you can access the element you were looking for.
In your case you would need to do the following:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver =webdriver.Chrome()
driver.get('https://www.virustotal.com/gui/join-us')
print(driver.title)
# wait a bit until the form pops up
time.sleep(3)
# Retrieve the last shadowroot using javascript
javascript = """return document
.querySelector('vt-virustotal-app').shadowRoot
.querySelector('join-us-view').shadowRoot
.querySelector('vt-ui-text-input').shadowRoot"""
shadow_root = driver.execute_script(javascript)
# Find the input box
search = shadow_root.find_element_by_id("input")
search.send_keys("Muhammad Aamir")
search.send_keys(Keys.RETURN)
time.sleep(5)
driver.quit()
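Because every level follows the same pattern (querySelector, then shadowRoot), the script can also be generated from a list of selectors. This is a sketch under the assumption that every host in the list has an open shadow root. build_shadow_query is a hypothetical helper, and the selectors are the ones used in the answers above, which may not match the current page.

```python
import json


def build_shadow_query(hosts, target):
    """Build a JS snippet that walks through nested open shadow roots.

    hosts:  CSS selectors of the shadow hosts, outermost first
    target: CSS selector of the element inside the innermost shadow root
    """
    parts = ["document"]
    for host in hosts:
        # json.dumps produces a safely quoted JS string literal
        parts.append(f".querySelector({json.dumps(host)}).shadowRoot")
    parts.append(f".querySelector({json.dumps(target)})")
    return "return " + "".join(parts)


script = build_shadow_query(
    ["vt-virustotal-app", "join-us-view", "vt-ui-text-input"],
    "input#input",
)
# With Selenium: search = driver.execute_script(script)
```

If any selector in the chain does not match, the script fails with a JavaScript error about reading a property of null, which at least points to the level that needs fixing.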
I am using Electron. I have made a custom title bar with a div that unhides two other divs.
I want one of these divs (labeled "open") to open a file manager.
This can be done using shell.showItemInFolder(__dirname); but the problem is that I cannot retrieve any data from this method. Another way is to use the dialog module in Electron: const {dialog} = require('electron');
What I tried was to write console.log(dialog.showOpenDialog({properties:['openFile']}));
This (according to some YouTube videos I watched) should open a file manager, and if I select a file through it, it should log a pending promise. But I get the error: cannot get the property 'showOpenDialog' of undefined
const{dialog} = require('electron');
function openFS(){
win.openDevTools();
console.log(dialog.showOpenDialog({properties:['openFile']}));
}
This openFS function is called on the click of the div mentioned above.
How do I get around this?
cannot get the property 'showOpenDialog'of undefined
That error indicates that dialog is undefined. If you are executing this code in the renderer process, then you are not importing the dialog module correctly: it needs to be accessed through remote (assuming you have specified nodeIntegration: true for the renderer). Personally, I handle all dialog calls in the main process, but that's a matter of choice.
const {dialog} = require('electron').remote
HOWEVER . . the remote module is deprecated in Electron 12 as the linked doc indicates. I haven't yet used the method it recommends so I can't speak to any issues with that.
The remote module is deprecated in Electron 12, and will be removed in
Electron 14. It is replaced by the @electron/remote module.
// Deprecated in Electron 12:
const { BrowserWindow } = require('electron').remote
// Replace with:
const { BrowserWindow } = require('@electron/remote')
// In the main process:
require('@electron/remote/main').initialize()
I have a page where I need to automate some tasks and scrape some data, but the page runs some JS after loading to inject data into the DOM, which I cannot intercept (not in a good format, anyway). I was hoping to find a solution that is fast and not memory-hungry.
I attempted to fetch the scripts myself and execute them using a headless driver (namely PhantomJS), but that didn't update the page source, and I'm not sure how to retrieve the updated DOM from it:
var page = GetWebPage(url);
var scripts = page.Html.QuerySelectorAll("script");
var phantomDriver = new PhantomJSDriver(PhantomJSDriverService.CreateDefaultService(Directory.GetCurrentDirectory()));
phantomDriver.Navigate().GoToUrl(url);
foreach (var script in scripts)
phantomDriver.ExecuteScript(script.InnerText);
var at = phantomDriver.PageSource;
You could use a 'wait'. According to this link, Selenium has both implicit and explicit waits. The example below uses an explicit wait.
To use an explicit wait, use WebDriverWait and ExpectedConditions. I'm not sure what language you're using, but here's an example in Python. It uses WebDriverWait in a try/except block, allowing up to timeout seconds for the specified expected condition to be met. As of June 2019, conditions are available in:
Java;
Python; and
.NET
Example code in python:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
url = 'https://stackoverflow.com/questions/56724178/executing-page-scripts-before-retrieving-its-contents'
target = (By.XPATH, "//div[@class='gravatar-wrapper-32']")
timeout = 20 # Allow max 20 seconds to find the target
browser = webdriver.Chrome()
browser.get(url)
try:
    WebDriverWait(browser, timeout).until(EC.visibility_of_element_located(target))
except TimeoutException:
    print("Timed out waiting for page to load")
    browser.quit()
The important bit is between try and except which you would modify to use the specific 'expected condition' you're interested in.
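For the original problem (scripts that keep injecting data after load), there may be no single element to wait for. Since until() accepts any callable that takes the driver and returns a truthy value, one heuristic is to wait until the page source is identical on two consecutive polls. This is a sketch, not a Selenium built-in: page_source_is_stable is a made-up name, and a page that updates continuously (a clock, for instance) would never satisfy it.

```python
class page_source_is_stable:
    """Custom wait condition: True once page_source is unchanged between polls."""

    def __init__(self):
        self._previous = None  # source seen on the previous poll

    def __call__(self, driver):
        current = driver.page_source
        stable = current == self._previous
        self._previous = current
        return stable


# Possible use with the example above (WebDriverWait polls every 0.5 s by default):
# WebDriverWait(browser, timeout).until(page_source_is_stable())
# html = browser.page_source
```

The first poll always returns False because there is nothing to compare against yet, so the wait takes at least two polls.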
When I try browser.ExecuteJavascript("alert('ExecuteJavaScript works!');") it works fine (an alert pops up when the browser is created). When I try browser.ExecuteJavascript("document.getElementsByName('q')[0].value = 24;") nothing happens. So I know that ExecuteJavascript is working, but why doesn't the input element change when I try to set its value? The code I am trying is below. If anyone has an idea why that particular JavaScript will not execute, I would be very grateful.
from cefpython3 import cefpython as cef
import platform
import sys
def main():
    sys.excepthook = cef.ExceptHook
    cef.Initialize()
    browser = cef.CreateBrowserSync(url="https://www.google.com", window_title="Hello World!")
    browser.ExecuteJavascript("document.getElementsByName('q')[0].value = 24")
    cef.MessageLoop()
    cef.Shutdown()

if __name__ == '__main__':
    main()
The DOM is not yet ready right after the browser is created. Open the Developer Tools window using the mouse context menu and you will see the error. You should use LoadHandler to detect when the window finishes loading the web page or when the DOM is ready. Options:
1. Implement LoadHandler.OnLoadingStateChange:
def main():
    # ... create the browser as in the question ...
    browser.SetClientHandler(LoadHandler())
    # ...

class LoadHandler(object):
    def OnLoadingStateChange(self, browser, is_loading, **_):
        # is_loading becomes False once the page has finished loading
        if not is_loading:
            browser.ExecuteJavascript("document.getElementsByName('q')[0].value = 24")
2. Implement LoadHandler.OnLoadStart and inject JS code that adds a DOMContentLoaded event listener, which then executes the actual code.
See Tutorial > Client handlers:
https://github.com/cztomczak/cefpython/blob/master/docs/Tutorial.md#client-handlers
See also API reference for LoadHandler.
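The OnLoadStart option could look roughly like the sketch below. wrap_in_dom_ready and DomReadyLoadHandler are hypothetical names, and the readyState check is an added safeguard in case the script ends up running after DOMContentLoaded has already fired. OnLoadStart may also be called for sub-frames, which this sketch does not filter out.

```python
def wrap_in_dom_ready(js_code):
    """Return JS that runs js_code once the DOM is ready."""
    return (
        "(function () {\n"
        "  function run() {\n"
        f"    {js_code}\n"
        "  }\n"
        "  if (document.readyState === 'loading') {\n"
        "    document.addEventListener('DOMContentLoaded', run);\n"
        "  } else {\n"
        "    run();\n"
        "  }\n"
        "})();"
    )


class DomReadyLoadHandler(object):
    def __init__(self, js_code):
        self.js_code = js_code

    def OnLoadStart(self, browser, **_):
        # Inject early; the listener defers the real work until the DOM exists.
        browser.ExecuteJavascript(wrap_in_dom_ready(self.js_code))


# Possible use inside main(), after CreateBrowserSync():
# browser.SetClientHandler(DomReadyLoadHandler(
#     "document.getElementsByName('q')[0].value = 24;"))
```

Compared with OnLoadingStateChange, this runs as soon as the DOM is parsed rather than after all resources have loaded.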
I'm trying to use Protractor to wait for my input 'Username' to appear and then insert a value into it. How can I do it?
browser.get('http://localhost:5555/#');
var login = browser.driver.findElement(by.id('Username'));
Use Expected Conditions to wait for certain conditions, i.e. for an element to be present or visible. Use sendKeys to fill an input.
var login = element(by.id('Username'));
var EC = protractor.ExpectedConditions;
browser.wait(EC.presenceOf(login)).then(function() {
login.sendKeys('myuser');
});
This belongs in your spec, not your config.
If yours is an Angular app and you are doing everything right, then you need not wait for the element or page to load.
Protractor does it for you. Refer to the API doc for waitForAngular.
Also check this if you are interested. I wrote more about it on my blog, Protractor Over Selenium.
I would suggest a couple of things:
Use the $() shortcut instead of element(by.id()).
Declare variables with let instead of var.
Provide a timeout for wait, or it will wait indefinitely and fail only on the test timeout (wasting time waiting).
Provide a failure message for wait (the 3rd parameter of the wait function) for better readability on failures.
No need to put sendKeys() in a callback; Protractor's control flow will execute the commands in the correct order even without it.
Here is code example:
let loginField = $('#Username');
let EC = protractor.ExpectedConditions;
browser.wait(EC.visibilityOf(loginField), 3000, 'Login field should be visible before entering text');
loginField.sendKeys('myuser');