Web scraping using Apify

I'm trying to scrape URLs from https://en.wikipedia.org/wiki/List_of_hedge_funds
Specifically, I'm trying to use Apify to scrape that page and return a list of URLs from anchor tags present in the HTML. In my console, I expect to see the value of the href attribute of one or more anchor tags that exist on the target page in a property called myValue. I also expect to see the page title in a property called title. Instead, I just see the following URL property and its value.
My Apify actor uses the Puppeteer platform. So I'm using a pageFunction similar to the way Puppeteer uses it.
Below is a screen shot of the Apify UI just before I run it.
Page function
function pageFunction( context ) {
// called on every page the crawler visits, use it to extract data from it
var $ = context.jQuery;
var result = {
title: $('.wikitable').text,
myValue: $('a[href]').text,
};
return result;
}
What am I doing wrong?

You have a typo in your code: text is a function, so you need to add parentheses:
var result = {
title: $('.wikitable').text(),
myValue: $('a[href]').text(),
};
But note that this will probably not do what you expect anyway: it will return the combined text of all matched elements. You probably need to use jQuery's each() function (https://api.jquery.com/jquery.each/) to iterate over the found elements, push a value from each of them to an array, and return that array from your page function.
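A minimal sketch of that each() approach, assuming context.jQuery is the jQuery instance the crawler passes in (as in the question) and that the page title should come from the <title> element rather than from .wikitable:

```javascript
// Sketch only: collects the href of every anchor instead of concatenated text.
function pageFunction(context) {
    var $ = context.jQuery;
    var links = [];
    // .each() invokes the callback once per matched element.
    $('a[href]').each(function (index, element) {
        links.push($(element).attr('href'));
    });
    return {
        // Assumption: the <title> element holds the page title the asker wants.
        title: $('title').text(),
        myValue: links,
    };
}
```

With this, myValue is an array of URL strings, one per anchor, rather than a single string.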

The page seems to be loaded by JavaScript so actually I have to use asynchronous code.

Related

Selenium how to get variable from script removed from DOM

<script id="[randomid]">
document.addEventListener("DOMContentLoaded",function(){
var quiz=new Quiz();
//// some logic here
document.getElementById('[randomid]').parentNode.removeChild(document.getElementById('[randomid]'));
});
</script>
I am trying to get the variable quiz in Selenium and call some functions on it. The problem is that before I can do anything, the script has already been removed from the DOM and keeps running in the background. I can print the object in the Chrome console with queryObjects(Quiz), but that only prints it and doesn't return the object. So I am looking for a function that would, for example, return all variables of a chosen type, or that would somehow restore this script to the DOM. Maybe it is also possible to prevent the script from being removed from the DOM.

In the end, I used Selenium Wire to intercept the response and modify it: I added document.quiz = quiz; right after the initialization.
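The rewrite step itself can be sketched as a plain string transformation. This is only an illustration: exposeQuiz is a hypothetical helper name, the regular expression assumes the initialization looks like the snippet in the question, and how the function is hooked into the response interceptor is not shown here.

```javascript
// Hypothetical helper: rewrites the page source so the Quiz instance stays
// reachable as document.quiz after the script tag removes itself.
function exposeQuiz(html) {
    // Matches "var quiz=new Quiz();" while tolerating whitespace differences.
    var init = /var\s+quiz\s*=\s*new\s+Quiz\(\s*\)\s*;/;
    // A replacer function is used so the matched text is kept verbatim.
    return html.replace(init, function (match) {
        return match + ' document.quiz = quiz;';
    });
}
```

Once the modified page has loaded, the object should be reachable from the test through document.quiz.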

How to assert element property value with Espresso-web?

I'm working on a subclass of the default Android WebView with some additional functionality (content filtering) on top. I override shouldInterceptRequest(...) and prevent some resources from loading (let's say images with the filename "image.png").
Now I want to test this functionality with Espresso-Web. I start an embedded HTTP server (with WireMock) and load a web page that uses "image.png", which is expected to be blocked.
How can I assert the state of the DOM element with id "someId"?
I expected to have something like:
onWebView()
.withElement(findElement(Locator.ID, blockImageId)) // finds the DOM node
.check(webMatches(getProperty("naturalWidth"), equalTo("0"))) // "getProperty"?
What I need is something like "getProperty()" that just generates JS to access a property of the node found by findElement(Locator.ID, blockImageId). Is there anything out of the box?
I was able to find getText(), but it seems to work in a completely different way (so I can't just request another "node property"):
/** Returns the visible text beneath a given DOM element. */
public static Atom<String> getText() {
return new GetTextTransformingAtom(new GetVisibleTextSimpleAtom(), castOrDie(String.class));
}
I was able to do it the JS way:
// JS-way helpers
const val naturalWidth = "naturalWidth"
const val notBlockedImageWidth = 50
const val blockedImageWidth = 0
fun getPropertyForElementId(elementId: String, property: String): Atom<String> =
Atoms.script(
"""return String(document.getElementById("$elementId").$property);""",
Atoms.castOrDie(String::class.java))
fun imageIsBlocked(elementId: String): WebAssertion<String>? = WebViewAssertions.webMatches(
getPropertyForElementId(elementId, naturalWidth),
equalTo(blockedImageWidth.toString())) {
"""An image with id="$elementId" IS expected to be blocked, but it's NOT."""
}
// part of the test
val redImage = "image.png"
load(
listOf(blockingPathPrefix),
"""
|<html>
|<body>
| <img id="$blockImageId" src="${blockingPathPrefix}$redImage"/>
| <img id="$notBlockImageId" src="$greenImage"/>
|</body>
|</html>
|""".trimMargin()
)
onWebView()
.withTimeout(1, TimeUnit.SECONDS) // it's already loaded
.check(imageIsBlocked(blockImageId)) // red image IS blocked
.check(imageIsNotBlocked(notBlockImageId)) // green image is NOT blocked
I understand that the way I did it is suboptimal, because it combines finding the node and accessing its property in a single step, and I wonder what the right way to do it is. Any suggestions or links?
PS. "naturalWidth" is just a property that helps me in this particular case. In general, I just need to access a property of the found DOM node, and it can be some other property next time (e.g. "style.display" to check element visibility).
PS. Can anybody explain how to write WebDriverAtomScripts, e.g. something similar to WebDriverAtomScripts.GET_VISIBLE_TEXT_ANDROID?

is there anything out-of-box?
The answer to your question is "no", I don't think there is anything better out-of-the-box than creating an Atom like you are doing (or similar approaches like Atoms.scriptWithArgs or subclassing SimpleAtom).
Your best bet is to file a feature request here (and then maybe propose/contribute an implementation): https://github.com/android/android-test
You can assert on the HTML document with XPath, but that won't work for computed DOM node properties like the one you are looking for.
onWebView()
.check(webContent(hasElementWithXpath("//*[@id=\"myButton\" and @type=\"button\"]")))

Awaiting For Elements To Appear Within TestCafe In The Context Of A Page Object Pattern

I use a page object model to store locators and business functions specific to a certain page of the application under test. You can see a sample in page object pattern style here:
https://github.com/bhreinb/SYSTAC
All the page objects go through a method called pageLoaded(), which can be found in the page object of the application under test. pageLoaded is used within the base page. Each page object must implement that method; otherwise the framework throws an exception to force the user to implement it. pageLoaded checks attributes of the page belonging to the application under test (for example the page title, a unique element on the page, etc.) to verify that we are on the target page.
I found this works OK in most cases, but when a page navigation involves multiple redirects, I have to wait for elements to exist and be visible, because TestCafe interacts with the application under test prematurely. Here is my code sample:
pageLoaded() {
this.userNameLabel = Selector('[data-automation-id="userName"]').with({ visibilityCheck: true });
this.passWordLabel = Selector('[data-automation-id="password"]').with({ visibilityCheck: true });
logger.info('Checking Elements That Contain UserName & PassWord Labels For Existence To Confirm We Are On Tenant\'s Target Page');
const promises = [
this.userNameLabel.exists,
this.passWordLabel.exists,
];
return Promise.all(promises).then((elementsExists) => {
// Results arrive in the same order as the promises array above.
const [userNameVisible, passWordVisible] = elementsExists;
if ((userNameVisible) && (passWordVisible)) {
return true;
}
logger.warn(`PageLoaded Method Failure -> Element Containing UserName Is Visible '[${userNameVisible}]'`);
logger.warn(`PageLoaded Method Failure -> Element Containing PassWord Is Visible '[${passWordVisible}]'`);
return false;
});
}
This fails, but it passes if I change the promises to this:
const promises = [
t.expect(this.userNameLabel.exists).ok({ timeout: 20000 }),
t.expect(this.passWordLabel.exists).ok({ timeout: 20000 }),
];
This is not a pattern I particularly like: I don't want assertion logic in the page object, and I want the promises to resolve to true or false, if that makes sense. Is there any way I can make the expectation operate on the method instead, for instance:
t.expect(pageLoaded()).ok({ timeout: 20000 })
Or is there any other alternative? In addition, how can I tell TestCafe to wait for the window.onload event rather than DOMContentLoaded when multiple page redirects happen? Thanks in advance for any help with the above.

As @AlexKamaev mentioned in his comment, you can specify the timeout option in the Selector constructor. That lets you avoid passing the timeout value as an assertion argument. In addition, have a look at the setPageLoadTimeout method, which lets you wait until the page is fully loaded.
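If the goal is for pageLoaded() to resolve to true or false instead of asserting, the waiting can also be sketched as a small polling helper. This is a generic sketch under stated assumptions: waitUntil is a hypothetical name and not part of TestCafe, and the check passed to it stands in for something like reading a selector's exists property.

```javascript
// Hypothetical helper (not a TestCafe API): polls an async check until it
// returns true or the timeout elapses, and resolves to a boolean either way.
async function waitUntil(check, timeoutMs, intervalMs) {
    const deadline = Date.now() + timeoutMs;
    while (true) {
        if (await check()) {
            return true;
        }
        if (Date.now() >= deadline) {
            return false;
        }
        // Pause before the next attempt so the page has time to settle.
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
}
```

In a page object this might be called as waitUntil(() => this.userNameLabel.exists, 20000, 250), assuming exists can be awaited as in the question's code. The assertion then stays in the test, and the page object only reports a boolean.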

How to click to the favorite buttons of tweets using selenium javascript

I am trying to favorite tweets using the JavaScript Selenium WebDriver. What I want to do is search for a keyword, go to the Live tab, favorite the last 50 tweets, and follow those people. The part of my code that favorites tweets fails, and I get a
StaleElementReferenceError: Element not found in the cache - perhaps the page has changed since it was looked up
error. Here is my code. Can you help me click the favorite buttons?
var button = driver.findElement(By.className('HeartAnimation'));
function buttoninthearray(driver, i) {
var buttons = driver.findElements(By.className('HeartAnimation'));
return webdriver.promise.filter(buttons, function(button) {
return button.isDisplayed();
}).then(function(visiblebuttons) {
return visiblebuttons[i];
});
}
for( i = 0; i <limit; i++){
buttoninthearray(driver, i).then(function(button){
button.click();
});
driver.sleep(1000);
}

This is a snippet from the Selenium docs on the Stale Element Reference Exception:
A common technique used for simulating a tabbed UI in a web app is to prepare DIVs for each tab, but only attach one at a time, storing the rest in variables. In this case, it's entirely possible that your code might have a reference to an element that is no longer attached to the DOM (that is, that has an ancestor which is "document.documentElement").
Here is a link to the docs:
http://docs.seleniumhq.org/exceptions/stale_element_reference.jsp
Since you are dealing with a tabbed UI, this might be the issue.
Like the page says, "If WebDriver throws a stale element exception in this case, even though the element still exists, the reference is lost. You should discard the current reference you hold and replace it, possibly by locating the element again once it is attached to the DOM."
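One way to follow that advice is to look the buttons up again right before every click, so that no reference is held across a re-render. The following is a sketch, not a guaranteed fix: clickVisibleButtons is a hypothetical helper, locator stands for something like By.className('HeartAnimation') from the question, and the driver is assumed to be a selenium-webdriver instance. It narrows the window in which a reference can go stale but does not eliminate it.

```javascript
// Hypothetical helper: re-locates the buttons on every pass instead of
// reusing element references obtained before the page changed.
async function clickVisibleButtons(driver, locator, limit, pauseMs) {
    let clicked = 0;
    for (let i = 0; i < limit; i++) {
        // Fresh lookup: references from an earlier pass are discarded.
        const buttons = await driver.findElements(locator);
        const visible = [];
        for (const button of buttons) {
            if (await button.isDisplayed()) {
                visible.push(button);
            }
        }
        // Stop when there is no i-th visible button left to click.
        if (i >= visible.length) {
            break;
        }
        await visible[i].click();
        clicked++;
        // Give the page time to update before the next lookup.
        await driver.sleep(pauseMs);
    }
    return clicked;
}
```

It could be called as clickVisibleButtons(driver, By.className('HeartAnimation'), limit, 1000), which mirrors the loop and the one-second pause in the question.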

Get the name of the HTML document that called a JS function

I'm a beginner with JS.
I am working with some JavaScript on a site, and I want to use only one JS file to determine the actions of the pages. Something like this:
function registerEvents(){
var NameOfHTMLDocument = null; // ?? Get the name of the document that called the registerEvents function.
switch(NameOfHTMLDocument)
{
case "homepage":
var hmp_btn = document.getElementById("hmp_btn");
hmp_btn.onclick=otherFunction;
break;
case "otherPage":
var elem = document.getElementById("elemID");
elem.onclick=fooFunction;
break;
//etc...
}
}
This function is called from a <body onload="registerEvents()"> that is "inherited" by all the pages.
The question is: how can I get the "NameOfHTMLDocument"? I don't want the JS to start doing weird things when it tries to get elements that don't exist.
I found that I can get the URL of the document and then manipulate it a little to get the string that I want, but I'm not sure if this is the best way of doing it.
It would be nice if you had a better suggestion.

Firstly, I would really suggest that you put functionality that is used only on one page in separate script tags in that HTML document, and common functionality in a separate file, for several reasons:
No code pollution
Ease of change
Smaller download
Secondly, instead of homepage, etc., you can switch on window.location.pathname, which holds everything after the domain.
For example:
url = vader.samplesite.com/light/saber/
window.location.pathname = /light/saber/
(look at http://www.developertutorials.com/questions/question/q-242.php )

window.location.pathname
All you need to do is some parsing, but I'm sure you'll figure that out :) If not, leave a comment.
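A minimal sketch of that parsing, assuming the page name is the last path segment with any file extension removed. pageNameFromPath is a hypothetical helper name, and the example paths in the comment are illustrative.

```javascript
// Hypothetical helper: derives a page name from a pathname, e.g.
// "/light/saber/" gives "saber" and "/site/homepage.html" gives "homepage".
function pageNameFromPath(pathname) {
    // Drop the empty segments produced by leading and trailing slashes.
    var segments = pathname.split('/').filter(function (segment) {
        return segment.length > 0;
    });
    if (segments.length === 0) {
        return ''; // The site root has no page name.
    }
    var last = segments[segments.length - 1];
    // Strip a trailing file extension such as ".html" if there is one.
    return last.replace(/\.[^.]+$/, '');
}
```

Inside registerEvents() the result of pageNameFromPath(window.location.pathname) could then drive the switch statement.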

In your <body onload="registerEvents()">, pass the object this (the BODY in the DOM) to your event function, like so: <body onload="registerEvents(this)">.
In the function itself, use the object you passed, for example object.ownerDocument.URL to get the URL including the HTML document name, or object.ownerDocument.title to get the page title.
