I would like to use Python to log in to a site that uses both JavaScript and HTTPS-encrypted communication.
Specifically - it's this site:
https://registration.orange.co.il/he-il/login/login/?TYPE=100663297&REALMOID=06-73b4ebbc-5fd9-4b19-b3f4-42671c0df793&GUID=&SMAUTHREASON=0&METHOD=GET&SMAGENTNAME=vmwebadmin3&TARGET=-SM-http%3a%2f%2fwww1%2eorange%2eco%2eil%2fSendSMS%2f
All I want to do is write a Python script that logs in successfully, and later port the algorithm to Java.
Every solution I've tried so far just sends me back to the same login form.
Thank you all!
Is using a browser automation tool too far off from what you're trying to do? Without knowing the exact goal this could be way off base, but what about using something like Selenium? You can use Selenium from Java, Python, C#, and Ruby.
I've used Selenium to automatically log in to a private wiki and retrieve, edit, and submit changes to articles. If that's similar to what you're trying to do, it could work.
It's a pretty heavyweight approach, though, since you actually have to have a real browser running to do the work.
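For reference, a minimal sketch of that approach in Python with Selenium WebDriver. The field names ("username", "password") are placeholders; inspect the real login form to find the actual ones.

    # A minimal sketch, not a finished solution: logging in via a real browser
    # so that any JavaScript on the login page actually runs.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    LOGIN_URL = "https://registration.orange.co.il/he-il/login/login/..."  # login URL from the question

    driver = webdriver.Firefox()          # or webdriver.Chrome()
    try:
        driver.get(LOGIN_URL)
        driver.find_element(By.NAME, "username").send_keys("my_user")      # placeholder field name
        driver.find_element(By.NAME, "password").send_keys("my_password")  # placeholder field name
        driver.find_element(By.NAME, "password").submit()                  # submits the enclosing form
        print(driver.current_url)         # should no longer be the login form
    finally:
        driver.quit()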
You should check out the spynner module; it can process JavaScript and HTTPS.
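Roughly, a spynner login session looks like the sketch below. This is based on spynner's documented load/fill/click usage; the selectors are placeholders and some method names may differ between spynner versions, so treat it as an assumption to verify against your installed version.

    import spynner

    browser = spynner.Browser()
    browser.load("https://registration.orange.co.il/he-il/login/login/...")  # login URL from the question
    # Selectors are placeholders; take the real input names from the form.
    browser.fill("input[name=username]", "my_user")
    browser.fill("input[name=password]", "my_password")
    browser.click("input[type=submit]")
    # Depending on the spynner version you may need one of its wait helpers here
    # before reading the result.
    print(browser.url)          # should be past the login form now
    print(len(browser.html))
    browser.close()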
Alright, so I'm in a bit of a pickle. I'm running into issues with JSoup since the page needs JavaScript to finish loading some of its content. Fortunately, I've worked around this in the past (by parsing the raw JavaScript code), but it's very tedious. Recently I tried to write a program to log in to a website, but it requires a token from a form element. That form element is not visible unless JavaScript is executed, so it won't show up at all for me to extract. So I decided to look into Selenium.
First question: is this the library I should be looking into? The reason I'm so set on using HttpClient is that some of these websites get very high traffic and don't load all the way, BUT I don't need these pages to load all the way. I just need them to load enough for me to retrieve the login token. I'd rather communicate with the web server using raw JSON/POST requests once I discover the methods required, instead of having Selenium automate a click/wait/type sequence.
Basically, I only need Selenium to load 1/4 of the page, just to retrieve the request tokens. The rest of my program will send POST requests using HttpClient.
Or should I just let Selenium do all the work? My goal is speed: I need to log in and purchase an item fast.
Edit: Actually, I might go with HtmlUnit because it's very minimal. I only need to scrape information, and I don't want to run the Selenium standalone server. Is this the better approach?
Basically, HtmlUnit is quicker than Selenium, so if you are going for speed you should use that. Keep in mind, though, that Selenium has its own HtmlUnit-based driver, HtmlUnitDriver. So, as another option, you could use Selenium with HtmlUnit. The difference between them is that HtmlUnit is itself a browser without a GUI, whereas Selenium works by driving real browsers. You may want to take a look at this other question for further details: Selenium vs HtmlUnit?
I am trying to create a dynamic tester which will get the input from a database and compare it to what comes out of the page. Can Selenium IDE do something like that using storeEval?
Selenium IDE is built on JavaScript technology. This tool will not allow you to do what you want.
You will need to upgrade to something more powerful. Consider Selenium WebDriver together with Java, which allows you to do (almost) anything, including accessing a database via JDBC.
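The answer suggests Java with JDBC; purely as an illustration of the same pattern (pull the expected value from a database, read the actual value off the page, compare), here is a sketch in Python with sqlite3 and Selenium WebDriver. The database file, table layout, URL, and element ID are all hypothetical.

    import sqlite3
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Hypothetical database with a table of test cases: (id, url, element_id, expected_text)
    conn = sqlite3.connect("testdata.db")
    url, element_id, expected = conn.execute(
        "SELECT url, element_id, expected_text FROM cases WHERE id = ?", (1,)
    ).fetchone()

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        actual = driver.find_element(By.ID, element_id).text
        print("PASS" if actual == expected else "FAIL: %r != %r" % (actual, expected))
    finally:
        driver.quit()
        conn.close()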
I am working on parsing an HTML page.
I tried spynner, Selenium, and mechanize, but I wasn't able to solve the JavaScript issue in this case.
Can anyone let me know how to work around this issue and fetch the data from the next page?
When I worked with Selenium, on this URL you first have to get the data into the other select box and then proceed, but using Selenium I only get the same URL back after clicking through to the next page.
The same happens with spynner.
From what I can tell, mechanize doesn't support JavaScript. So, if you're doing any automation of JavaScript-heavy sites, mechanize is probably not the way to go. Rather, you probably need Python to script a fully functional web browser. You can do this with Mozilla via PyXPCOM, with Ruby and WATIR, or with spynner. Of these options, I'd probably try spynner first, as it is well integrated with Python.
Good luck with your project, and happy coding!
Here LINK it is suggested that it is possible to "Figure out what the JavaScript is doing and emulate it in your Python code". That is what I would like help doing, i.e. my question: how do I emulate javascript:__doPostBack?
Code from a website (full page source here LINK):
<a style="color: Black;" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvSearchResults','Page$2')">2</a>
Of course, I basically have no idea where to go from here.
Thanks in advance for your help and ideas
OK, there are lots of posts asking how to CLICK a JavaScript button when web scraping with the Python libraries mechanize, BeautifulSoup, and similar. I see a lot of "that is not supported, use THIS non-Python solution" responses. I think a Python solution to this problem would be of great benefit to many. In that light, I am not looking for answers like "use x, y, or z", which are not Python code or require interacting with a browser.
The mechanize page is not suggesting that you can emulate JavaScript in Python. It is saying that you can change a hidden field in a form, thus tricking the web server into believing that a human¹ has selected the field. You still need to analyse the target yourself.
There will be no Python-based solution to this problem, unless you wish to create a JavaScript interpreter in Python.
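That said, __doPostBack itself does very little: the generated JavaScript just sets the hidden __EVENTTARGET and __EVENTARGUMENT fields and submits the form, so on a stock ASP.NET WebForms page the postback can often be replayed with a plain POST and no JavaScript engine at all. A rough sketch with requests and BeautifulSoup follows; the URL is a placeholder, and the target and argument values are taken from the link in the question.

    import requests
    from bs4 import BeautifulSoup

    URL = "http://example.com/search.aspx"   # placeholder for the real page
    session = requests.Session()

    # 1. GET the page and collect every hidden field (__VIEWSTATE, __EVENTVALIDATION, ...).
    soup = BeautifulSoup(session.get(URL).text, "html.parser")
    form_data = {
        inp["name"]: inp.get("value", "")
        for inp in soup.select("input[type=hidden]")
        if inp.get("name")
    }

    # 2. Fill in the two fields that __doPostBack would have set in JavaScript.
    form_data["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$gvSearchResults"
    form_data["__EVENTARGUMENT"] = "Page$2"

    # 3. POST it back; the response should be page 2 of the grid.
    page2 = session.post(URL, data=form_data)
    print(len(page2.text))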
My thoughts on this problem have led me to three possible solutions:
1. create an XULRunner application
2. browser automation
3. attempt to interpret the client-side code
Of those three, I've only really seen discussion of 2. I've seen something close to 1 in a commercial scraping application, where you basically create scripts by browsing sites and selecting things on the pages that you would like the script to extract in the future.
Option 1 could possibly be made to work with a Python script by accepting a serialisation (JSON?) of WSGI Request objects, getting the app to fetch the URL, then sending the processed page back as a WSGI Response object. You could possibly wrap some middleware around urllib2 to achieve this. Probably overkill, but kind of fun to think about.
Option 2 is usually achieved via Selenium RC (Remote Control), a testing-centric tool. It provides a few methods like getHtmlSource, but most people I've heard from who use it don't like its API (there is a short sketch of the RC client after this answer).
Option 3 I have no idea about. node.js is very hot right now, but I haven't touched it. I've never been able to build SpiderMonkey on my Ubuntu machine, so I haven't touched that either. My hunch is that in order to do this, you would provide the HTML source and your details to a JS interpreter, which would need to fake being your User-Agent etc., in case the JavaScript wanted to reconnect with the server.
¹ Well, more technically, a JavaScript-compliant User-Agent, which is almost always a web browser used by a human.
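The RC sketch referred to under option 2 above, in Python. It assumes the Selenium RC server is already running locally on its default port 4444; the path is a placeholder.

    from selenium import selenium   # old Selenium RC client, not the newer WebDriver API

    sel = selenium("localhost", 4444, "*firefox", "http://example.com/")
    sel.start()
    sel.open("/some/js-heavy/page")          # placeholder path
    sel.wait_for_page_to_load("30000")       # timeout in milliseconds, as a string
    html = sel.get_html_source()             # page source after the browser has run the JavaScript
    sel.stop()
    print(len(html))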
I'm connecting to a web site, logging in.
The website redirects me to new pages and Mechanize handles all the cookie and redirection work, but I can't get to the last page. I used Firebug, went through the same steps manually, and saw that there are two more pages I have to pass through with Mechanize.
I took a quick look at those pages and saw that there is some JavaScript and HTML code, but I couldn't understand it because it doesn't look like normal page code. What are those pages for? How can they redirect to other pages? What should I do to get past them?
If you need to handle pages with JavaScript, try WATIR or Selenium - those drive a real web browser and can thus handle any JavaScript. WATIR Classic requires either IE or Firefox with a certain extension installed, and you will see the pages flash on the screen as it works.
Your other option would be understanding what the Javascript on the offending page does and bypassing it manually, but that seems onerous.
At present, Mechanize doesn't handle JavaScript. There's talk of eventually merging Johnson's capabilities into Mechanize, but until that happens, you have two options:
Figure out the JavaScript well enough to understand how to traverse those pages.
Automate an actual browser that does understand JavaScript using Watir.
What are those pages for? How can they redirect to other pages? What should I do to get past them?
Sometimes work is done on those pages. Sometimes the JavaScript is there to prevent automated access like what you're trying to do :). A lot of websites have unnecessary checks to make sure you have a "good" browser, so make sure that your user_agent is set to something common, like IE. Sometimes setting the user_agent to look like an old browser will let you get past without JavaScript.
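Illustrating that user_agent trick with Python's mechanize for consistency with the rest of this page (Ruby's Mechanize has an equivalent user-agent setting); the URL and the chosen User-Agent string are placeholders.

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)
    # Pretend to be an older, "plain" browser; some sites skip their JavaScript
    # checks for user agents like this.
    br.addheaders = [("User-Agent",
                      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")]
    response = br.open("http://example.com/login")   # placeholder URL
    print(response.read()[:200])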
Website automation is fun because you have to outsmart the website and its software developers, using multiple strategies. Like the others said, Watir is the best tool for getting past JavaScript at the moment.