I'm trying to write a simple script with Windmill to open a page (which uses JavaScript) and then download the entire HTML. My code is:
from windmill.authoring import setup_module, WindmillTestClient
from windmill.conf import global_settings
import sys
global_settings.START_FIREFOX = True
setup_module(sys.modules[__name__])
def my_func():
    url = "a certain url"
    client = WindmillTestClient(__name__)
    client.open(url=url)
    html = client.commands.getPageText()
This last line, with getPageText(), just seems to hang: nothing happens and it never returns.
Also, is it necessary for Windmill to open up the whole GUI every time? And if it is, is there a Python function to close it when I'm done? (A link to any actual documentation would be helpful; all I've found are a few examples.)
Edit: I solved the problem by just using Selenium instead; it took about 15 minutes, versus the 3 hours I spent trying to make Windmill work.
A colleague of mine came up with an alternative solution, which was to watch the network traffic coming into the browser and scrape the GET requests. I'm not totally sure how he did it, though.
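For anyone who wants to try something similar, here is a minimal sketch using the third-party selenium-wire package; this is an assumption on my part about how it could be done, not his actual code:
# Sketch only: assumes the selenium-wire package (pip install selenium-wire),
# which wraps Selenium and records the HTTP traffic the browser generates.
from seleniumwire import webdriver

driver = webdriver.Firefox()
driver.get("a certain url")  # placeholder URL, as in the script above

# Inspect every GET request the page made while loading.
for request in driver.requests:
    if request.method == "GET" and request.response:
        print(request.url, request.response.status_code)

driver.quit()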
I am trying to create a script using python and selenium to automate the checkout process at bestbuy.ca.
I get all the way to the final stage where you click to review the final order, but I get a 403 Forbidden response (as seen in the network response) when I try to click through to the final step.
Is there something server-side that has detected that I am using Selenium and is preventing me from proceeding?
How can I hide the fact that Selenium is being used?
These are the options I am using for Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument("start-maximized")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(options=options)
I currently have 10-second delays after each action (i.e. open page, wait, click add to cart, wait, click checkout, wait).
I have implemented a random user agent to be used on each run:
from fake_useragent import UserAgent

ua = UserAgent()
userAgent = ua.random
options.add_argument(f'user-agent={userAgent}')
I have also modified my chromedriver binary as per the comments in THIS THREAD
Error seen when proceeding to the order review page: a 403 Forbidden response.
After much testing over the last few days, here are the options that allowed me to bypass the restrictions I was facing:
Modified cdc_ string in my chromedriver
Chromedriver options:
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument("--disable-extensions")
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_driver = webdriver.Chrome(options=options)
Change the navigator.webdriver property value to undefined:
chrome_driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
After all three of these were implemented, I no longer faced any 403 errors when navigating the site or going through the cart/checkout process. A consolidated sketch of the setup follows below.
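For reference, here is a minimal consolidated sketch of that setup; the cdc_ patch to the chromedriver binary is a separate manual step and is not shown, and the URL is just an example from this thread:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Hide the most common automation hints Chrome exposes.
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument('--disable-extensions')
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option('excludeSwitches', ['enable-automation'])

chrome_driver = webdriver.Chrome(options=options)

# Mask navigator.webdriver so in-page checks see `undefined`.
chrome_driver.execute_script(
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)

chrome_driver.get('https://www.bestbuy.ca/')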
In my case, either using code to control the browser or simply starting Chrome through Python and using the browser manually always leads to the 403 error, even when just adding a product to the cart.
As you said, I think this site somehow knows that the user is using Selenium or some sort of automation tool, and the server is blocking API requests.
Searching on Stack Overflow I found https://stackoverflow.com/a/52108199/3228768, but editing the chromedriver still resulted in failure.
The only way I completed the flow was by setting these options:
from selenium import webdriver

u = 'https://www.bestbuy.ca/en-ca/category/appliances/26517'

# relevant part starts here
options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
# relevant part ends here

driver = webdriver.Chrome(executable_path=r"chromedriver.exe", options=options)
driver.maximize_window()
driver.get(u)
In this way I managed to add a product to the cart. I think you could use it to proceed through the flow until checkout.
Let me know.
Try this one: https://github.com/ultrafunkamsterdam/undetected-chromedriver
It avoids Selenium detection quite well; I've been having good results with it so far. Headless mode is not guaranteed to work, though.
import undetected_chromedriver as uc
driver = uc.Chrome()
driver.get('https://www.bestbuy.ca/')
I'm trying to make a small JavaScript snippet that will download all my e-books (for example) on Humble Bundle. I realize that something like this has been done before, but all the solutions I've encountered so far work in the purchases section, not the library. I also realize that Humble Bundle, at some point, added a "bulk download" button on each e-book purchase page, making the aforementioned solutions obsolete.
I prefer not asking for help, but at this point, I just want to make my script work and learn why it is not working. I don't want to take the easy way out by using any third-party add-on or application (e.g. download managers). I have tried this in jQuery as well, but got the same results below. I would like to do it in vanilla JS, but I welcome any helpful suggestions!
Here is my code:
var domItem = document.querySelectorAll("div.selector-content div.text-holder h2"),
    domItemName, domItemDownload;

domItem.forEach(function(itemBtn) {
    itemBtn.click();
    domItemName = document.querySelector("div.details-holder div.details-heading div.text-holder h2");
    domItemDownload = document.querySelectorAll("div.details-holder div.js-button-holder div.js-download-button h4");
    domItemDownload.forEach(function(downloadBtn) {
        console.log(domItemName.innerText + ": " + downloadBtn.innerText);
        downloadBtn.click();
    });
});
What I expect to happen for each e-book is that the script logs the e-book's name and the type of file it is downloading (PDF, etc.), and then navigates to the URL obtained by clicking the download button. An example of the URL is here: https://dl.humble.com/torrents/unixpowertools.mobi.torrent?gamekey=xxxxx&ttl=xxxxx&t=xxxxx.
This works as expected up to the point of actually downloading the torrent files: the browser console log will say that it has navigated to each URL to download the needed file, but only the last entry gets downloaded. For example, say I have three e-books and each of them has a PDF torrent file: the script will click everything as expected and the browser will say something like the following:
CSS Refactoring: PDF main.min.js:10:15514
CSS: The Definitive Guide: PDF main.min.js:10:15514
D3 Data-Driven Documents Pocket Primer: PDF main.min.js:10:15514
Navigated to https://dl.humble.com/torrents/cssrefactoring.pdf.torrent?gamekey=xxxxx&ttl=xxxxx&t=xxxxx
Navigated to https://dl.humble.com/torrents/css_thedefinitiveguide.pdf.torrent?gamekey=xxxxx&ttl=xxxxx&t=xxxxx
Navigated to https://dl.humble.com/torrents/d3datadrivendocuments_pocketprimer.pdf.torrent?gamekey=xxxxx&ttl=xxxxx&t=xxxxx
However, I will only get the torrent file for that last entry. No matter what type of e-book it is, whether it is a direct download or the torrent file, no matter where I start and end the loop, or whether I use Chrome or Firefox, I always download only the last entry's file.
So, after seeing that I can get the e-books' download URLs by clicking on the download buttons, I tried random ones directly in the browser and was able to download each of them individually, so I know the URLs are working as expected. To just get to an expected result, I then copy-pasted all the URLs in the console log and put them into an array. I then looped through the array with the following script, but still get the same result:
var urls = [
    'https://dl.humble.com/torrents/cssrefactoring.pdf.torrent?gamekey=xxxxx&ttl=xxxxx&t=xxxxx',
    'https://dl.humble.com/torrents/css_thedefinitiveguide.pdf.torrent?gamekey=xxxxx&ttl=xxxxx&t=xxxxx',
    'https://dl.humble.com/torrents/d3datadrivendocuments_pocketprimer.pdf.torrent?gamekey=xxxxx&ttl=xxxxx&t=xxxxx'
];

for (var i = 0; i < urls.length; i++) {
    document.location.href = urls[i];
}
Based on my research, this sounds just like a closure issue. However, using techniques like those found on https://dzone.com/articles/why-does-javascript-loop-only-use-last-value have not resolved the issue. Furthermore, my understanding of a closure issue is that I shouldn't be seeing the browser "navigating" to each URL, but instead expect it to say it is navigating to the same URL many times.
I also thought that maybe this was an issue with the browser trying to download too many files from the server too quickly, so I tried implementing a wait in three ways: setTimeout, setInterval, and wrote a function to while-loop until a specified time has elapsed (bad, I know). This still gave the same result, but slower.
I am sure the issue is something simple but having worked on and abandoned this particular task many times before, I just need a set of fresh, more experienced eyes on it.
This is my first post, so I appreciate your time reading this and let me know if there is any more information you may need or if I need to fix up my post.
It is not related to closures. When you click on a link, the browser leaves the current page and opens a new one. If you click on another link while that page is still loading, the browser aborts the load and opens the new one instead. You get the same behaviour with .click(): each click triggers a navigation that aborts the previous one, so only the last page is actually opened.
Instead you could open each link in a new tab:
for (var i = 0; i < urls.length; i++) {
    window.open(urls[i], "download");
}
I'm using Python to parse an auction site.
If I use a browser to open this site, it goes to a loading page, then jumps to the search result page automatically.
If I use urllib2 to open the webpage, the read() method only returns the loading page.
Is there any Python package that could wait until all content has loaded, so that read() returns the full results?
Thanks.
How does the search page work? If it loads anything using Ajax, you could do some basic reverse engineering and find the URLs involved using Firebug's Net panel or Wireshark and then use urllib2 to load those.
If it's more complicated than that, you could simulate the actions JS performs manually without loading and interpreting JavaScript. It all depends on how the search page works.
Lastly, I know there are ways to run scripting on pages without a browser, since that's what some functional testing suites do, but my guess is that this could be the most complicated approach.
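For illustration, a minimal sketch of the reverse-engineering approach, assuming you have already found the real search endpoint with Firebug or Wireshark; the URL and headers below are made up:
import urllib2

# Hypothetical endpoint discovered in Firebug's Net panel; replace it with
# whatever URL the page actually requests to fetch its results.
search_url = 'http://example.com/search/results.php?k=halo+reach'

request = urllib2.Request(search_url, headers={'User-Agent': 'Mozilla/5.0'})
response = urllib2.urlopen(request)
print response.read()  # the data the page would otherwise load via Ajax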
After tracing through the auction site's source code, I found that it uses a .php page to create the loading page and redirect to the result page. Reverse engineering to find the true URLs doesn't work because it's the same URL as the loading page.
And @Manoj Govindan, I've tried mechanize, but even if I add
br.set_handle_refresh(True)
br.set_handle_redirect(True)
it still reads only the loading page.
After hours of searching on the web, I found a possible solution: using pywin32.
import win32com.client
import time

url = 'http://search.ruten.com.tw/search/s000.php?searchfrom=headbar&k=halo+reach'

ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Visible = 0
ie.Navigate(url)

while 1:
    state = ie.ReadyState
    if state == 4:  # READYSTATE_COMPLETE
        break
    time.sleep(1)

print ie.Document.body.innerHTML
However, this only works on the win32 platform; I'm looking for a cross-platform solution.
If anyone knows how to deal with this, please tell me.
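One cross-platform variant of the same wait-for-readyState idea, sketched with Selenium (which other answers on this page use for similar problems); this is untested and only meant as a starting point:
import time
from selenium import webdriver

url = 'http://search.ruten.com.tw/search/s000.php?searchfrom=headbar&k=halo+reach'

# Assumes Firefox plus its driver are installed locally.
driver = webdriver.Firefox()
driver.get(url)

# Poll until the document reports it has finished loading, mirroring the
# ReadyState == 4 loop in the pywin32 version above.
while driver.execute_script("return document.readyState") != "complete":
    time.sleep(1)

print(driver.page_source)
driver.quit()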
For support reasons, I want a user to be able to take a screenshot of the current browser window as easily as possible and send it over to the server.
Any (crazy) ideas?
That would appear to be a pretty big security hole in JavaScript if you could do this. Imagine a malicious user installing that code on your site with a XSS attack and then screenshotting all of your daily work. Imagine that happening with your online banking...
However, it is possible to do this sort of thing outside of JavaScript. I developed a Swing application that used screen capture code like this which did a great job of sending an email to the helpdesk with an attached screenshot whenever the user encountered a RuntimeException.
I suppose you could experiment with a signed Java applet (shock! horror! noooooo!) that hung around in the corner. If executed with the appropriate security privileges given at installation it might be coerced into executing that kind of screenshot code.
For convenience, here is the code from the site I linked to:
import java.awt.Dimension;
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.image.BufferedImage;
import javax.imageio.ImageIO;
import java.io.File;
...
public void captureScreen(String fileName) throws Exception {
    Dimension screenSize = Toolkit.getDefaultToolkit().getScreenSize();
    Rectangle screenRectangle = new Rectangle(screenSize);
    Robot robot = new Robot();
    BufferedImage image = robot.createScreenCapture(screenRectangle);
    ImageIO.write(image, "png", new File(fileName));
}
...
Please see the answer shared here for a relatively successful implementation of this:
https://stackoverflow.com/a/6678156/291640
Utilizing:
https://github.com/niklasvh/html2canvas
You could try to render the whole page into a canvas and save that image back to the server. Have fun :)
A webpage can't do this (or at least, I would be very surprised if it could, in any browser) but a Firefox extension can. See https://developer.mozilla.org/en/Drawing_Graphics_with_Canvas#Rendering_Web_Content_Into_A_Canvas -- when that page says "Chrome privileges" that means an extension can do it, but a web page can't.
Seems to me that support needs (at least) the answers for two questions:
What does the screen look like? and
Why does it look that way?
A screenshot -- a visual -- is very necessary and answers the first question, but it can't answer the second.
As a first attempt, I'd try to send the entire page up to support. The support tech could display that page in his browser (answers the first question); and could also see the current state of the customer's html (helps to answer the second question).
I'd try to send as much of the page as is available to the client JS by way of AJAX or as the payload of a form. I'd also send info not on the page: anything that affects the state of the page, like cookies or session IDs or whatever.
The customer might have a submit-like button to start the process.
I think that would work. Let's see: it needs some CGI somewhere on the server that catches the incoming user page and makes it available to support, maybe by writing a disk file. Then the support person can load (or have loaded automatically) that same page. All the other info (cookies and so on) can be put into the page that support sees.
PLUS: the client JS that handles the submit button's onclick() could also include any useful JS variable values!
Hey, this can work! I'm getting psyched :-)
HTH
-- pete
I've seen people do this with two approaches:
Set up a separate server for screenshotting and run a bunch of Firefox instances on it. Check out these two gems if you're doing it in Ruby: selenium-webdriver and headless.
Use a hosted solution like http://url2png.com (way easier).
You can also do this with the Fireshot plugin. I use the following code (that I extracted from the API code so I don't need to include the API JS) to make a direct call to the Fireshot object:
var element = document.createElement("FireShotDataElement");
element.setAttribute("Entire", true);
element.setAttribute("Action", 1);
element.setAttribute("Key", "");
element.setAttribute("BASE64Content", "");
element.setAttribute("Data", "C:/Users/jagilber/Downloads/whatev.jpg");
if (typeof(CapturedFrameId) != "undefined")
    element.setAttribute("CapturedFrameId", CapturedFrameId);
document.documentElement.appendChild(element);
var evt = document.createEvent("Events");
evt.initEvent("capturePageEvt", true, false);
element.dispatchEvent(evt);
Note: I don't know if this functionality is only available for the paid version or not.
Perhaps http://html2canvas.hertzen.com/ could be used. Then you can capture the display and process it.
You might try PhantomJS, a headless browsing toolkit.
http://phantomjs.org/
The following JavaScript example demonstrates basic screenshot functionality:
var page = require('webpage').create();
page.settings.userAgent = 'UltimateBrowser/100';
page.viewportSize = { width: 1200, height: 1200 };
page.clipRect = { top: 0, left: 0, width: 1200, height: 1200 };
page.open('https://google.com/', function () {
    page.render('output.png');
    phantom.exit();
});
I understand this post is 5 years old, but for the sake of future visitors I'll add my own solution here, which I think solves the original post's question without any third-party libraries apart from jQuery.
pageClone = $('html').clone();
// Make sure that CSS and images load correctly when opening this clone
pageClone.find('head').append("<base href='" + location.href + "' />");
// OPTIONAL: Remove potentially interfering scripts so the page is totally static
pageClone.find('script').remove();
htmlString = pageClone.html();
You could remove other parts of the DOM you think are unnecessary, such as the support form if it is in a modal window. Or you could choose not to remove scripts if you prefer to maintain some interaction with dynamic controls.
Send that string to the server, either in a hidden field or by AJAX, and then on the server side just attach the whole lot as an HTML file to the support email.
The benefits of this are that you'll get not just a screenshot but the entire scrollable page in its current form, plus you can even inspect and debug the DOM.
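On the server side, the attach-it-to-an-email step might look something like this; a sketch using Python's standard email and smtplib modules, where the addresses and the local mail relay are assumptions:
from email.message import EmailMessage
import smtplib

def send_support_email(html_string, user_info):
    """Attach the submitted page snapshot to a support email (sketch)."""
    msg = EmailMessage()
    msg["Subject"] = "Support request from %s" % user_info
    msg["From"] = "support-bot@example.com"   # made-up addresses
    msg["To"] = "helpdesk@example.com"
    msg.set_content("Page snapshot attached; open it in a browser.")
    msg.add_attachment(html_string.encode("utf-8"),
                       maintype="text", subtype="html",
                       filename="page-snapshot.html")
    with smtplib.SMTP("localhost") as smtp:   # assumes a local mail relay
        smtp.send_message(msg)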
Print Screen? Old school and a couple of keypresses, but it works!
This may not work for you, but on IE you can use the snapsie plugin. It doesn't seem to be in development anymore, but the last release is available from the linked site.
I think you need an ActiveX control; without one, I can't imagine how. You could force the user to install it first; after the installation, the ActiveX control should work on the client side and you can capture the screen.
We are temporarily collecting Ajax states, data in form fields and session information. Then we re-render it at the support desk. Since we test and integrate for all browsers, there are hardly any support cases for display reasons.
Have a look at the red button at the bottom of holidaycheck.
Alternatively, there is html2canvas. But it only works in newer browsers, and I've never tried it.
In JavaScript? No. I do work for a security company (sort of NetNanny type stuff) and the only effective way we've found to do screen captures of the user is with a hidden application.
My project has an audit module, which requires each and every action of the user to be recorded.
When the user closes the browser, the audit record for the logout has to be stored in the database.
I found one solution on the net, but it works in IE on my machine and fails to work in IE on my friend's machine. Why?
The code is:
window.onbeforeunload = clean_up;

function clean_up()
{
    var flex = document.${application} || window.${application};
    flex.myFlexFunction();
}
I placed this code in the index.template.html file, in the html-template folder under the Flex src.
I also placed the below code in my main application.mxml file:
ExternalInterface.addCallback("myFlexFunction",btnLogout);
and I defined the logout function.
OK, here is the deal: this CANNOT BE DONE RELIABLY. If this is for auditing, you are out of luck; you would be delivering a half-baked approach to start with.
Why?
Go to your task manager and kill the IIS process: nothing logs out, no audit. Ergo, the solution most likely does not fulfill the legal audit requirements ;)
Another approach:
Call a service every X seconds from the running page, e.g. every 5 seconds.
Assume the client has died when you don't receive a call for 2 * X seconds (e.g. after 10 seconds).
This way you know when the client no longer connects. It does not stop the user from pulling the network cable and continuing to view the page, so a failure of the audit method call should also wipe the HTML content ;)
But at least you handle browser crashes / terminations, too.
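To illustrate the server-side bookkeeping for this heartbeat idea, here is a minimal sketch; the names are made up and a real implementation would use your session store and database rather than an in-memory dict:
import time

HEARTBEAT_INTERVAL = 5                    # X: the page pings every 5 seconds
DEAD_AFTER = 2 * HEARTBEAT_INTERVAL       # 2 * X: no ping for 10 seconds => gone

last_seen = {}  # session_id -> timestamp of the last heartbeat

def record_heartbeat(session_id):
    """Called by the endpoint the page pings every X seconds."""
    last_seen[session_id] = time.time()

def audit_dead_sessions():
    """Run periodically; write the logout audit for sessions that stopped pinging."""
    now = time.time()
    for session_id, seen in list(last_seen.items()):
        if now - seen > DEAD_AFTER:
            print("audit: session %s logged out (no heartbeat)" % session_id)
            del last_seen[session_id]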