I have a page where I need to automate some tasks and scrape some data, but the page runs some JS after loading to inject data into the DOM, and I cannot intercept that data (not in a usable format, anyway). I was hoping to find a solution that is fast and not memory-hungry.
I attempted to fetch the scripts myself and execute them using a headless driver (namely PhantomJS), but that didn't update the page source, and I'm not sure how to retrieve the updated DOM from it.
var page = GetWebPage(url);
var scripts = page.Html.QuerySelectorAll("script");
var phantomDriver = new PhantomJSDriver(PhantomJSDriverService.CreateDefaultService(Directory.GetCurrentDirectory()));
phantomDriver.Navigate().GoToUrl(url);
foreach (var script in scripts)
    phantomDriver.ExecuteScript(script.InnerText);
var at = phantomDriver.PageSource;
You could use a 'wait'. According to this link, Selenium has both implicit and explicit waits. The example below is using an explicit wait.
To use an explicit wait, use WebDriverWait and ExpectedConditions. I'm not sure what language you're using, but here's an example in Python. It uses WebDriverWait in a try/except block, allowing up to timeout seconds for the specified expected condition to be met. As of June 2019, conditions are available in:
Java;
Python; and
.NET
Example code in python:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
url = 'https://stackoverflow.com/questions/56724178/executing-page-scripts-before-retrieving-its-contents'
target = (By.XPATH, "//div[@class='gravatar-wrapper-32']")
timeout = 20 # Allow max 20 seconds to find the target
browser = webdriver.Chrome()
browser.get(url)
try:
    WebDriverWait(browser, timeout).until(EC.visibility_of_element_located(target))
except TimeoutException:
    print("Timed out waiting for page to load")
    browser.quit()
The important bit is between try and except which you would modify to use the specific 'expected condition' you're interested in.
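For intuition, an explicit wait is essentially a polling loop: it calls a condition repeatedly until the condition returns something truthy or the timeout expires. The sketch below is a stripped-down stand-in written for illustration, not Selenium's implementation; `wait_until`, `WaitTimeout`, and the `poll` parameter are hypothetical names.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

With a real driver, the condition would be something like a function that looks for the target element and returns it once it is visible.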
Part of my code is to create a function to scroll the page. This is reproduced from the code to scrape Google Jobs here
It throws an error "javascript error: Cannot read properties of null (reading 'scrollHeight')"
I'm not sure why document.querySelector('.zxU94d') is null:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
def scroll_page(url):
    service = Service(ChromeDriverManager().install())
    # Add the settings to run the Chrome browser
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument('--lang=en')
    options.add_argument("user-agent=AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(url)
    # Store the initial height of Google Jobs page
    old_height = driver.execute_script("document.querySelector('.zxU94d').scrollHeight")
When I open the URL manually (here), I can get the height in the Console without a problem.
If I try to handle the null case, it returns None (NoneType):
old_height = driver.execute_script("""
    if (document.querySelector('.zxU94d')) {
        document.querySelector('.zxU94d').scrollHeight;
    }
""")
I'm the author of the Scrape Google Jobs organic results with Python blog you link to.
When writing the script, I ran into the same problem as you - the page height was not read. I solved this problem by declaring a function that returns the height of the page and calling it:
old_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('.zxU94d').scrollHeight;
    }
    return getHeight();
""")
You are using the scrollHeight property to get the height of a certain element; on its own it does not scroll the page. To scroll the page, use the scrollTo() method.
If you need to scroll down the page only once, then you can add the following script:
driver.execute_script("document.querySelector('.zxU94d').scrollTo(0, document.querySelector('.zxU94d').scrollHeight)")
If you need to scroll the page to the end, then the algorithm looks like this:
1. Find out the initial page height and write the result to the old_height variable.
2. Scroll the page using the script and wait a few seconds for the data to load.
3. Find out the new page height and write the result to the new_height variable.
4. If the variables new_height and old_height are equal, then we complete the algorithm; otherwise we write the value of new_height to old_height and return to step 2.
import time

# Step 1: read the initial height ("return" is needed to get the value back in Python)
old_height = driver.execute_script("""
    return document.querySelector('.zxU94d').scrollHeight;
""")
while True:
    # Step 2: scroll to the bottom and give new results time to load
    driver.execute_script("document.querySelector('.zxU94d').scrollTo(0, document.querySelector('.zxU94d').scrollHeight)")
    time.sleep(2)
    # Step 3: read the new height
    new_height = driver.execute_script("""
        return document.querySelector('.zxU94d').scrollHeight;
    """)
    # Step 4: stop once the height no longer changes
    if new_height == old_height:
        break
    old_height = new_height
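The same loop can be wrapped in a function with an upper bound on the number of scrolls, so that a page that keeps growing cannot loop forever. This is only a sketch: `scroll_to_end`, `pause`, and `max_rounds` are hypothetical names, and the `.zxU94d` selector is simply the one from the question, which may change over time. The `driver` argument is assumed to be any object with Selenium's `execute_script` method.

```python
import time

SELECTOR = ".zxU94d"  # class name taken from the question; may change over time
HEIGHT_JS = f"return document.querySelector('{SELECTOR}').scrollHeight;"
SCROLL_JS = (
    f"const el = document.querySelector('{SELECTOR}');"
    "el.scrollTo(0, el.scrollHeight);"
)


def scroll_to_end(driver, pause=2.0, max_rounds=50):
    """Scroll until the height stops growing; return the number of scrolls done."""
    old_height = driver.execute_script(HEIGHT_JS)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script(SCROLL_JS)
        time.sleep(pause)  # give the page time to load more results
        new_height = driver.execute_script(HEIGHT_JS)
        if new_height == old_height:
            return rounds
        old_height = new_height
    # Gave up: the height was still changing after max_rounds scrolls
    return max_rounds
```

Passing the driver in as a parameter also makes the loop easy to exercise with a stub object instead of a real browser.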
If there are difficulties, I will be happy to answer all questions.
When web scraping with Selenium, it's best to wait explicitly for elements to load. In your case, it's likely that your driver is executing the JavaScript before the page is fully loaded.
You can use WebDriverWait:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.by import By
_timeout = 10 # seconds
WebDriverWait(driver, _timeout).until(
    expected_conditions.presence_of_element_located(
        (By.CSS_SELECTOR, ".zxU94d")
    )
)
# "return" is required; without it execute_script returns None
old_height = driver.execute_script("""if (document.querySelector('.zxU94d')) {
    return document.querySelector('.zxU94d').scrollHeight;
}""")
What @granitosaurus said, but also drop the if and use the optional chaining operator (?.). Don't forget the "return" (very important):
old_height = driver.execute_script("""
return document.querySelector('.zxU94d')?.scrollHeight || 0
""")
I am trying to automate the sign-up process on the VirusTotal site using Selenium in Python, but I'm having a problem getting an element by ID. I am stuck on this; any help would be appreciated. Thanks.
Here is the code I am trying:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()
driver.get('https://www.virustotal.com/gui/join-us')
print(driver.title)
search = driver.find_element_by_id("first_name")
search.send_keys("Muhammad Aamir")
search.send_keys(Keys.RETURN)
time.sleep(5)
driver.quit()
The First name field within the website https://www.virustotal.com/gui/join-us is located deep within multiple #shadow-root (open).
Solution
To send a character sequence to the First name field you have to use shadowRoot.querySelector() and you can use the following Locator Strategy:
Code Block:
from selenium import webdriver
import time
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://www.virustotal.com/gui/join-us")
time.sleep(7)
first_name = driver.execute_script("return document.querySelector('vt-virustotal-app').shadowRoot.querySelector('join-us-view.iron-selected').shadowRoot.querySelector('vt-ui-two-column-hero-layout').querySelector('vt-ui-text-input#first_name').shadowRoot.querySelector('input#input')")
first_name.send_keys("Muhammad Aamir")
References
You can find a couple of relevant discussions in:
Unable to locate the Sign In element within #shadow-root (open) using Selenium and Python
If you look at the HTML of the website, you can see that your input field is inside a so-called #shadow-root.
Shadow roots prevent you from finding the elements they contain with a simple find_element_by_id. You can work around this by locating every parent shadow root that contains the element you are looking for. In each shadow root, use JavaScript's querySelector to find the next shadow root, until you can access the element you want.
In your case you would need to do the following:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()
driver.get('https://www.virustotal.com/gui/join-us')
print(driver.title)
# wait a bit until the form pops up
time.sleep(3)
# Retrieve the last shadowroot using javascript
javascript = """return document
.querySelector('vt-virustotal-app').shadowRoot
.querySelector('join-us-view').shadowRoot
.querySelector('vt-ui-text-input').shadowRoot"""
shadow_root = driver.execute_script(javascript)
# Find the input box
search = shadow_root.find_element_by_id("input")
search.send_keys("Muhammad Aamir")
search.send_keys(Keys.RETURN)
time.sleep(5)
driver.quit()
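Writing the chained querySelector calls by hand gets error-prone as the nesting grows. As a rough sketch, the script string can be generated from a list of host selectors; `shadow_query_script` is a hypothetical helper, and it assumes every host in the list exposes an open shadow root.

```python
import json


def shadow_query_script(host_selectors, target_selector):
    """Return JS that steps through each host's open shadow root, then finds the target."""
    parts = ["return document"]
    for host in host_selectors:
        # json.dumps yields a safely quoted JavaScript string literal
        parts.append(f".querySelector({json.dumps(host)}).shadowRoot")
    parts.append(f".querySelector({json.dumps(target_selector)});")
    return "".join(parts)


script = shadow_query_script(
    ["vt-virustotal-app", "join-us-view", "vt-ui-text-input"],
    "#input",
)
```

The resulting string can be passed to `driver.execute_script(script)`, which then returns the input element itself rather than the last shadow root.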
I am writing an automation tool for a website, using Selenium in Java. For the actual automation I mainly use JavaScript through JavascriptExecutor. Most of the time everything works fine, but sometimes it crashes (e.g. 1 out of 10 times). I have the impression that in those cases the code was simply too fast. I am using implicit waits and explicit waits from the WebDriverWait class. I think these waits only wait for the DOM or the elements within it, not until all scripts are done. Therefore I need a function or snippet that does that. As mentioned, the website uses Vue and Angular.
Thanks in advance!
In your case, it is possible that your script fails while navigating to another HTML page after clicking a link or button. If your application fails or crashes in these scenarios, account for the page load time in the implicit wait as well.
You can add an explicit wait that will wait for angular to finish processing any pending requests:
public static ExpectedCondition<Boolean> angularHasFinishedProcessing() {
    return new ExpectedCondition<Boolean>() {
        @Override
        public Boolean apply(WebDriver driver) {
            JavascriptExecutor jsexec = (JavascriptExecutor) driver;
            // executeScript returns Object, so convert it to a String before parsing
            String result = jsexec.executeScript("return (window.angular != null) && (angular.element(document).injector() != null) && (angular.element(document).injector().get('$http').pendingRequests.length === 0)").toString();
            return Boolean.valueOf(result);
        }
    };
}
To use it you would then do:
WebDriverWait wait = new WebDriverWait(driver, 15, 100);
wait.until(angularHasFinishedProcessing());
This used to remove a lot of flakiness in Angular automation for me.
I would suggest always using explicit waits and never implicit waits; implicit waits make negative checks take forever. Also, never mix implicit and explicit waits: it can cause all sorts of strange, undefined behaviour.
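For readers on the Python bindings, the same check can be written as a plain function, since `WebDriverWait.until()` accepts any callable that takes the driver. This is a sketch that reuses the JavaScript expression from the Java version; the names `ANGULAR_IDLE_JS` and `angular_has_finished_processing` are made up here. Because it relies on `window.angular`, it only covers AngularJS-style pages and says nothing about Vue.

```python
ANGULAR_IDLE_JS = (
    "return (window.angular != null)"
    " && (angular.element(document).injector() != null)"
    " && (angular.element(document).injector().get('$http')"
    ".pendingRequests.length === 0);"
)


def angular_has_finished_processing(driver):
    """Condition for WebDriverWait.until(): True once no $http requests are pending."""
    return bool(driver.execute_script(ANGULAR_IDLE_JS))
```

It would be used as `WebDriverWait(driver, 15).until(angular_has_finished_processing)`.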
When I try browser.ExecuteJavascript("alert('ExecuteJavaScript works!');") it works fine (it pops up an alert when the browser is created). When I try browser.ExecuteJavascript("document.getElementsByName('q')[0].value = 24;") nothing happens. So I know that ExecuteJavascript is working, but why doesn't the input element change when I try to set its value? The code I am trying is below. If anyone has an idea why that particular JavaScript will not execute, I would be very grateful.
from cefpython3 import cefpython as cef
import platform
import sys
def main():
    sys.excepthook = cef.ExceptHook
    cef.Initialize()
    browser = cef.CreateBrowserSync(url="https://www.google.com", window_title="Hello World!")
    browser.ExecuteJavascript("document.getElementsByName('q')[0].value = 24")
    cef.MessageLoop()
    cef.Shutdown()
if __name__ == '__main__':
    main()
The DOM is not yet ready right after the browser is created. Open the Developer Tools window from the mouse context menu and you will see the error. You should use a LoadHandler to detect when the window finishes loading the web page or when the DOM is ready. Options:
Implement LoadHandler.OnLoadingStateChange:
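# In main(), after creating the browser:
browser.SetClientHandler(LoadHandler())

class LoadHandler(object):
    def OnLoadingStateChange(self, browser, is_loading, **_):
        # Called whenever the loading state changes; act only once loading has finished
        if not is_loading:
            browser.ExecuteJavascript("document.getElementsByName('q')[0].value = 24")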
Implement LoadHandler.OnLoadStart and inject js code that adds an event listener DOMContentLoaded that will execute the actual code.
See Tutorial > Client handlers:
https://github.com/cztomczak/cefpython/blob/master/docs/Tutorial.md#client-handlers
See also API reference for LoadHandler.
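For the second option, the injected code has to defer itself until the DOM is ready. A minimal sketch of such a wrapper follows; `run_when_dom_ready` is a hypothetical helper, and it assumes the resulting string is passed to `browser.ExecuteJavascript` from an `OnLoadStart` handler as described above.

```python
def run_when_dom_ready(js_code):
    """Wrap js_code so it runs once the DOM has been parsed."""
    lines = [
        "(function () {",
        "  function run() {",
        js_code,
        "  }",
        # If parsing is still in progress, wait for DOMContentLoaded;
        # otherwise the DOM is already usable, so run immediately.
        "  if (document.readyState === 'loading') {",
        "    document.addEventListener('DOMContentLoaded', run);",
        "  } else {",
        "    run();",
        "  }",
        "})();",
    ]
    return "\n".join(lines)


wrapped = run_when_dom_ready("document.getElementsByName('q')[0].value = 24;")
```

The readyState check is there so the code still runs if the event has already fired by the time the script is injected.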
I'm trying to use Protractor to wait for my 'Username' input to appear and then enter a value into it. How can I do it?
browser.get('http://localhost:5555/#');
var login = browser.driver.findElement(by.id('Username'));
Use Expected Conditions to wait for certain conditions, i.e. for an element to be present or visible. Use sendKeys to fill an input.
var login = element(by.id('Username'));
var EC = protractor.ExpectedConditions;
browser.wait(EC.presenceOf(login)).then(function() {
    login.sendKeys('myuser');
});
This belongs in your spec, not your config.
If yours is an Angular app and you are doing everything right, then you do not need to wait for the element or page to load.
Protractor does it for you. Refer to the API doc for waitForAngular.
Also check this if you are interested. I wrote more on my blog, Protractor Over Selenium.
I would suggest a couple of things:
use the shortcut version of element(by.id()), i.e. $()
declare variables with let instead of var
provide a timeout for the wait; otherwise it is endless and fails only on the test timeout (wasting time waiting)
provide a failure message for the wait (3rd parameter of the wait function) for better readability on failures
There is no need to put sendKeys() in a callback; Protractor's control flow will execute the commands in the correct order even without it.
Here is code example:
let loginField = $('#Username');
let EC = protractor.ExpectedConditions;
browser.wait(EC.visibilityOf(loginField), 3000, 'Login field should be visible before entering text');
loginField.sendKeys('myuser');