Alright, so I'm in a small pickle. I'm running into issues with JSoup since the page needs JavaScript to finish loading some of its content. I've worked around this in the past by parsing the raw JavaScript code, but it's very tedious. Recently I tried to write a program to log in to a website, but the login requires a token from a form element. That form element is not present unless JavaScript is executed, so it won't show up at all for me to extract. So I decided to look into Selenium.
First question: is this the library I should be looking into? The reason I'm so bent on using HttpClient is that some of these websites are very high in traffic and don't load all the way, BUT I don't need these pages to load all the way. I just need them to load enough that I can retrieve the login token. I prefer to communicate with the web server with raw JSON/POST requests once I discover the requests required, vs. having Selenium automate a click/wait/type sequence.
Basically, I only need Selenium to load 1/4 of the page, just to retrieve request tokens. The rest of my program will send POST requests using HttpClient.
Or should I just let Selenium do all the work? My goal is speed. I need to log in and purchase an item fast.
Edit: Actually, I might go with HtmlUnit because it's very minimal. I only need to scrape information, and I don't want to run Selenium's StandAlone Server. Is this the better approach?
Basically, HtmlUnit is quicker than Selenium, so if you are going for speed you should use that. Keep in mind, though, that Selenium has its own HtmlUnit-based driver, HtmlUnitDriver, so as another option you could use Selenium with HtmlUnit. The difference between them is that HtmlUnit is itself a browser without a GUI, whereas Selenium works by driving a browser's features. You may want to take a look at this other question for further details: Selenium vs HtmlUnit?
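The hybrid approach described in the question (a browser only for the token, plain HTTP for everything else) can be sketched roughly as follows. The sketch is in Python rather than Java for brevity, and the names `build_login_request`, `grab_token_and_cookies`, and the form field names are hypothetical. It assumes the token lives in an input element reachable by a CSS selector and that the session is carried by cookies; a real site may need more than that.

```python
from urllib.parse import urlencode
from urllib.request import Request


def cookie_header(browser_cookies):
    """Build a Cookie header from the list of dicts that Selenium's
    driver.get_cookies() returns (each has at least 'name' and 'value')."""
    return "; ".join(f"{c['name']}={c['value']}" for c in browser_cookies)


def build_login_request(url, form_fields, browser_cookies):
    """Prepare a form POST that reuses the browser's session cookies."""
    return Request(
        url,
        data=urlencode(form_fields).encode("ascii"),
        headers={
            "Cookie": cookie_header(browser_cookies),
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def grab_token_and_cookies(login_url, token_selector, timeout=10):
    """Load the page in a real browser just long enough to read the token.

    Selenium is a third-party package, so it is imported lazily here.
    token_selector is a hypothetical CSS selector for the token field.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.FirefoxOptions()
    # "eager" returns once the DOM is ready instead of waiting for every
    # image and stylesheet, which suits the "load only enough" goal.
    options.page_load_strategy = "eager"
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(login_url)
        field = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, token_selector))
        )
        return field.get_attribute("value"), driver.get_cookies()
    finally:
        driver.quit()
```

The returned request can be sent with `urllib.request.urlopen`; after that point the browser is no longer involved, so the purchase step runs at plain HTTP speed.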
So there is a problem with JavaScript and requests (in Python): requests does not execute JavaScript when fetching a web page.
The website I'm working with (https://access.paylocity.com/) requires JavaScript and without it, it changes the content of the page to just a text at the top saying, "Please enable JavaScript to view the page content."
(I could be wrong here but) I think one solution is the use of Selenium, but that would replace requests which I'm fine with as long as there are no other ways of fixing/bypassing this JavaScript detection.
(For those wondering, this Python project of mine is supposed to automatically fetch the events on the Paylocity calendar, then port those events to another calendar that I use every day. It's also just intended for myself.)
Edit: Here is the code I have if that will help https://pastecode.io/s/GXTUO1BgtR (I didn't know where to paste my code, so I decided on that website. If I should change it, please comment or say something about it.)
Since the website you're working with loads its content dynamically with JavaScript, as far as I can tell, I think you have no choice but to use Selenium. I had a project of my own a couple of weeks ago and ran into a similar problem, which I was also able to solve using Selenium. But I'm no expert; I'm just sharing my thoughts on this.
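One way to keep plain HTTP where it works and fall back to Selenium only when needed is sketched below. It assumes the site signals the problem with the "Please enable JavaScript" text quoted in the question; the function names are hypothetical, and a login-protected site such as this one would still need authentication handled on top.

```python
from urllib.request import Request, urlopen

# Marker text taken from the question; other sites will use other wording.
JS_NOTICE = "Please enable JavaScript"


def needs_javascript(html):
    """Return True if the page is only the 'enable JavaScript' placeholder."""
    return JS_NOTICE.lower() in html.lower()


def fetch_static(url):
    """Fetch the page without executing any JavaScript."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_rendered(url):
    """Fetch the page through a real browser so its scripts run.

    Selenium is third-party, so it is imported only when this path is used.
    """
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def fetch(url):
    """Try the cheap route first and use the browser only if required."""
    html = fetch_static(url)
    return fetch_rendered(url) if needs_javascript(html) else html
```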
Because Selenium can traverse javascript websites (which Mechanize cannot), and Mechanize can make post requests (which Selenium cannot), in some cases it would be powerful to use the two in conjunction.
The answer by +Zarkonnen to this question suggests that one would use Selenium initially, then Mechanize would step in to make the post request and then pass that back to Selenium.
How would one integrate Mechanize post method into Selenium?
I am using the Ruby versions of these libraries, but any information would be useful.
EDIT Here's a Venn Diagram to hopefully clarify the functionality I am seeking.
"JavaScript website" in this case simply means a website whose functions in question will not work without JavaScript enabled. Say I needed to traverse a website to get to a form on that website. Along the way I ran into buttons which didn't work without JavaScript enabled. Then, in order for the form to work the way I wanted, I had to do a custom post. In this scenario, neither Selenium WebDriver nor Mechanize can handle it by themselves; they need help from each other.
How would you accomplish this? Would you use Selenium and then have Mechanize step in to help when you had to do the post? Would you use some other method to make a post within Selenium? Would you use the Capybara gem? I get that there are limitations with WebDrivers making posts, but I know there must be a workaround.
The question is a bit vague, but both Selenium (WebDriver) and a good non-interactive HTTP library (like Mechanize) are crucial elements in a tester's armoury.
In general I say that if you need to simulate a human being in an interactive scenario, then you can't beat WebDriver. However, the web is built upon HTTP, everything Selenium does is HTTP, and so the less interactive your scenario, the less you need to simulate a real user, and the more performance matters, the more you should look to Mechanize, and possibly even lower-level HTTP libraries.
Because of that, although the two technologies are complementary in a sense, I can't think of all that many good reasons to use them in conjunction. But perhaps the following:
WebDriver manages a user on a web site, Mechanize is used to query REST endpoints to dump metrics, clear caches, run usage reports, kick off simultaneous requests to simulate concurrency.
Mechanize is used to seed/prepare test data prior to a WebDriver run.
Those are both examples where WebDriver could be used for everything, but where it would be vastly easier and more efficient to use a non-interactive tool.
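On the question of making a post from within Selenium itself: one workaround is to have the browser issue the request through JavaScript, so it reuses the page's cookies and session. A minimal sketch follows, in Python rather than Ruby; `post_from_browser` is a hypothetical helper, `driver` is any WebDriver instance, and the target is assumed to be on the same origin as the currently loaded page (cross-origin posts are subject to the browser's CORS rules).

```python
from urllib.parse import urlencode

# JavaScript run inside the browser. WebDriver passes the callback that
# finishes an async script as the last entry of `arguments`.
POST_SCRIPT = """
var url = arguments[0];
var body = arguments[1];
var done = arguments[arguments.length - 1];
fetch(url, {
  method: 'POST',
  headers: {'Content-Type': 'application/x-www-form-urlencoded'},
  body: body,
  credentials: 'include'
}).then(function (r) {
  return r.text().then(function (t) { done({status: r.status, text: t}); });
}).catch(function (e) { done({status: 0, text: String(e)}); });
"""


def post_from_browser(driver, url, fields):
    """Send a form-encoded POST from inside the browser session.

    Returns whatever the script hands to the callback: a dict with the
    HTTP status and the response body as text.
    """
    return driver.execute_async_script(POST_SCRIPT, url, urlencode(fields))
```

Going the other way (handing the browser's cookies to an HTTP library) is also possible; which direction is simpler depends on whether the result of the post needs to be visible in the browser afterwards.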
So I'm using Python and beautifulsoup4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab the HTML of a page, it's not the entire page, because some of it is generated via JavaScript. Is there any way to get around this?
There are basically two main options to proceed with:
Using the browser developer tools, see which AJAX requests load the page's data and simulate them in your script. You will probably need the json module to load the response JSON string into a Python data structure.
Use tools like Selenium that open up a real browser. The browser can also be "headless"; see Headless Selenium Testing with Python and PhantomJS.
The first option is more difficult to implement and it's, generally speaking, more fragile, but it doesn't require a real browser and can be faster.
The second option is better in that you get what any real user gets and don't have to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page, so you may not need BeautifulSoup at all. But, anyway, this option is slower than the first one.
Hope that helps.
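A minimal sketch of the first option, assuming the developer tools revealed a JSON endpoint behind the page. The endpoint, the `X-Requested-With` header, and the `items`/`title` shape of the payload are all hypothetical stand-ins; the real ones have to be read off the network tab.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, extra_headers=None):
    """Call the endpoint the page itself calls and decode its JSON body."""
    headers = {
        "User-Agent": "Mozilla/5.0",
        # Some servers only answer AJAX-style requests; this is an assumption.
        "X-Requested-With": "XMLHttpRequest",
    }
    headers.update(extra_headers or {})
    with urlopen(Request(url, headers=headers), timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return json.loads(resp.read().decode(charset))


def extract_titles(payload):
    """Pull one field out of the decoded data (hypothetical payload shape)."""
    return [item["title"] for item in payload.get("items", [])]
```

With this route the data arrives already structured, so there is often nothing left for BeautifulSoup to parse.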
I want to analyse some data from a webpage, but here's the problem: the site has more pages, which get called with a __doPostBack function.
How can I "simulate" going one page further, analyse that page, and so on?
At this time I analyse the data with JSoup in java - but I'm open to use some other language if it's necessary.
A postback-based system (.NET, Prado/PHP, etc.) works by keeping a complete snapshot of the browser contents on the server side. This is called a pagestate. Any attempt to manipulate it with a client that is not JavaScript-capable is almost sure to fail.
What you need is a JavaScript-capable browser. The easiest solution I found is to use the framework Firefox is written in - XUL - to create such a desktop application. What you do is basically create a desktop application with a single browser element in it, which you can then script from the application itself without restrictions of the security container. Alternatively, you could also use the Greasemonkey plugin to do your bidding. The latter is a bit easier to get started with, but it's fairly limited since it's running on a per-page basis.
With both solutions you then have access to the page's DOM to gather data and you can also fire events (like clicking on a button). Unfortunately you have to learn JavaScript for this to work.
I used an automation library, Selenium, which is available in a lot of languages (C#, Java, Perl, ...).
For more information on how to start, this link is very helpful: this.
As well as Selenium, you can use http://watin.org/
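For ASP.NET WebForms specifically, the generated `__doPostBack(target, argument)` function usually just fills the hidden `__EVENTTARGET` and `__EVENTARGUMENT` fields and submits the form, so a plain HTTP client can sometimes reproduce it without a browser. A minimal sketch under that assumption follows; the helper names and the pager control name in the usage note are hypothetical, and pages that do additional work in JavaScript will still need one of the browser-based tools above.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect the hidden <input> fields of a page (e.g. __VIEWSTATE)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_postback_body(html, event_target, event_argument=""):
    """Build the POST body that __doPostBack(target, argument) would submit.

    All hidden fields from the current page are sent back unchanged, with
    the two event fields set to the values from the link's JavaScript call.
    """
    parser = HiddenFieldParser()
    parser.feed(html)
    fields = dict(parser.fields)
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields).encode("ascii")
```

For a link such as `javascript:__doPostBack('ctl00$Pager','Page$2')`, the body from `build_postback_body(html, "ctl00$Pager", "Page$2")` would be POSTed to the form's action URL, and the hidden fields re-read from each response before requesting the next page.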
I'm connecting to a website and logging in.
The website redirects me to new pages and Mechanize deals with all the cookie and redirection jobs, but I can't get to the last page. I used Firebug, did the same job again, and saw that there are two more pages I have to pass with Mechanize.
I took a quick look at the pages and saw that there is some JavaScript and HTML code, but I couldn't understand it because it doesn't look like normal page code. What are those pages for? How can they redirect to other pages? What should I do to get past them?
If you need to handle pages with Javascript, try WATIR or Selenium - those drive a real web browser, and can thus handle any Javascript. WATIR Classic requires either IE or Firefox with a certain extension installed, and you will see the pages flash on the screen as it works.
Your other option would be understanding what the Javascript on the offending page does and bypassing it manually, but that seems onerous.
At present, Mechanize doesn't handle JavaScript. There's talk of eventually merging Johnson's capabilities into Mechanize, but until that happens, you have two options:
Figure out the JavaScript well enough to understand how to traverse those pages.
Automate an actual browser that does understand JavaScript using Watir.
what are those pages for? how they can redirect to other pages. what should i do to pass these?
Sometimes work is done on those pages. Sometimes the JavaScript is there to prevent automated access like what you're trying to do :). A lot of websites have unnecessary checks to make sure you have a "good" browser, so make sure that your user_agent is set to something common, like IE. Sometimes setting the user_agent to look like an old browser will let you get past without JavaScript.
Website automation is fun because you have to outsmart the website and its software developers, using multiple strategies. Like the others said, Watir is the best tool for getting past JavaScript at the moment.
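The "figure out the JavaScript" option is sometimes less work than it sounds, because intermediate pages like these often do nothing more than redirect via a meta refresh or a `location` assignment. A minimal sketch, in Python rather than Ruby, assuming the redirect is written as a plain string literal; `find_redirect` is a hypothetical helper, and a page that computes its target dynamically will not match.

```python
import re
from urllib.parse import urljoin

# <meta http-equiv="refresh" content="0; url=/next">
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
    re.I,
)
# window.location = "/next", location.href = '/next', and similar
JS_LOCATION = re.compile(
    r'(?:window\.|document\.)?location(?:\.href)?\s*=\s*["\']([^"\']+)["\']',
    re.I,
)


def find_redirect(html, base_url):
    """Return the absolute URL an intermediate page redirects to, or None."""
    for pattern in (META_REFRESH, JS_LOCATION):
        match = pattern.search(html)
        if match:
            return urljoin(base_url, match.group(1))
    return None
```

The returned URL can then be requested with the same cookie-carrying HTTP client, repeating the check until a page without a redirect comes back.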