python mechanize - submit custom form - javascript

I'm interfacing with a page that requires a login, using mechanize. The front page uses some JavaScript, which makes driving it with mechanize straight-up more difficult. I know what form I have to submit to log in: the one the JS generates, the same every time. How can I make mechanize submit a custom form that isn't on the page? Basically the equivalent of this Perl problem, but in Python.

(NOTE: This came up again recently, and this time I got it to work.)
This seems to work:
br.open(URL)
res = mechanize._form.ParseString(FORM_HTML, BASE_URL)
br.form = res[1]
# continue as if the form were on the page and had been selected with .select_form()
br['username'] = 'foo'
br['password'] = 'bar'
br.submit()
URL is the full URL of the visited site. BASE_URL is the directory the URL is in. FORM_HTML is any HTML that has a form element, e.g.:
<form method='post' action='/login.aspx'>
<input type='text' name='username'>
<input type='text' name='password'>
<input type='hidden' name='important_js_thing' value='processed_with_python TM'>
</form>
For some reason, mechanize._form.ParseString returns two forms. The first is an empty GET form pointing at the base URL, with no inputs; the second is the properly parsed form from FORM_HTML.
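If you'd rather not rely on that index, a defensive variant (a sketch based on the behaviour above) is to take the last parsed form:
forms = mechanize._form.ParseString(FORM_HTML, BASE_URL)
br.form = forms[-1]  # forms[0] is the implicit, empty GET form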

Parse the page, extract the elements you want, rebuild the page, and inject it back into mechanize.
For a project I worked on, I had to employ a simulated browser, and I found Mechanize to be very poor at form handling: it would yank uninterpreted elements out of JavaScript blocks and die. I had to write a workaround that used BeautifulSoup to strip out all the bits that would kill it before they reached the form parser.
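Something along these lines (a sketch, assuming URL is the page being fetched; mechanize response objects support get_data/set_data, and Browser.set_response hands the modified page back to the browser):
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
response = br.open(URL)

# remove every <script> block so the form parser never sees them
soup = BeautifulSoup(response.get_data(), 'html.parser')
for script in soup.find_all('script'):
    script.extract()

response.set_data(str(soup))
br.set_response(response)  # mechanize now re-parses the cleaned page
br.select_form(nr=0)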
You may or may not run into that problem, but it's something to keep in mind. I ultimately abandoned the Mechanize approach and went with Selenium: its form handler was far superior, and it could handle JS. Selenium has its own issues (the browser adds a layer of complexity), but I found it much easier to work with.

How do I access the console of the website that I want to extract data from?

Sorry for the confusing title. I am a beginner in JavaScript and would like to build a little project to increase my skill level: an image extractor. The user enters a website address into the form input, presses Extract, and the links to all of that site's images show up.
Question: how do I access the website DOM that was entered into the input field?
As mentioned by @Quentin in the comments, browsers restrict cross-domain requests like this. The Same Origin Policy will prevent your site from pulling the HTML source of a page on a different domain.
Since this is a learning exercise, I'd recommend picking another task that doesn't get into the weeds of cross-origin security issues. Alternatively, you could implement a "scraper" like this outside the browser using Node (JavaScript), Python, PHP, Ruby, or many other scripting languages.
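For instance, a minimal server-side version in Python (requests and BeautifulSoup are my choice here, not the only option; the URL stands in for whatever the user typed):
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'  # the address the user entered
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for img in soup.find_all('img', src=True):
    print(img['src'])  # no Same Origin Policy outside the browser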
You could try something like this if you already have the html content:
var html = document.createElement('html');
html.innerHTML = "<html><body><div><img src='image-url.png'></div></body></html>";
console.log(html.querySelector("img").src);
If you also need content that arrives via AJAX calls, I would suggest moving the whole thing server side, using something like Scrapy.

XSS javascript, exploit check

I am currently working on a page where the user inputs several variables which, when submitted, are displayed throughout the page.
The problem is that the code needs to be 100% secure, and while I'm OK using PDO/MySQL etc., JavaScript is not something I'm very fluent in.
At the moment, I have the following:
<script language="JavaScript">
function showInput() {
    document.getElementById('var1').innerText =
        document.getElementById("user_var1").value;
    document.getElementById('var2').innerText =
        document.getElementById("user_var2").value;
}
</script>
with the HTML:
<form>
your variable 1 is = <input type="text" name="message" id="user_var1"><br />
your variable 2 is = <input type="text" name="message" id="user_var2"><br />
</form>
<input type="submit" onclick="showInput();">
<p>var1 = <span id='var1'></span></p>
<p>var2 = <span id='var2'></span></p>
From what I can tell, using .innerText should stop any HTML etc. from being interpreted, and I have tested with
<script>alert(document.cookie);</script>
which results in the above just being printed as is (not run).
e.g.
your variable 1 is = <script>alert(document.cookie);</script>
Is there anything else you would recommend doing to make sure it is secure (against XSS or otherwise)? The only characters that should need to be entered are /, A-Z and 0-9.
Thanks in advance :)
edit
Just to clarify: the only code is what is above; the page is not pulling data from a database etc. (what you see above is virtually the full PHP page, just missing the html/head/body tags etc.).
Based on what you're doing above, you're not going to have XSS: innerText assigns plain text, so nothing is ever interpreted as markup.
Making your site 100% secure is a tall order. Some of the things I'd look at are running your site over HTTPS with HSTS to prevent a network-level adversary from tampering with it, parameterizing your SQL queries, and adding CSRF tokens as necessary on form submission.
Specifically regarding XSS, one of the most common ways people get XSS'd is by performing insecure DOM manipulation. If you're concerned about security, I'd highly recommend porting your JS to React: you manipulate a "virtual DOM", which allows React to perform context-sensitive escaping and takes the burden of proper escaping off the developer.
One quick security win is adding a CSP (Content Security Policy) to your site and setting the script-src directive to 'self'. A CSP establishes the contexts in which content is allowed to run on your site. With script-src set to 'self', JS only runs when it is loaded via the src attribute of a <script> tag pointing at the same domain the HTML was served from, never inline on the page, so if someone does inject markup it will (most likely*) not run.
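The header itself is a single line:
Content-Security-Policy: script-src 'self'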
These are just some examples of different security solutions available to you and a brief intro to security-in-depth practices. I'm glad you're taking security seriously!
*There are some circumstances (if you're dynamically generating your scripts for example) in which their code could run.
There is no vulnerability here (please read before downvoting).
Just to clarify: the only code is what is above; the page is not pulling data from a database etc. (what you see above is virtually the full PHP page, just missing the html/head/body tags etc.).
Therefore the following two fields cannot be populated by anything other than the current user:
<input type="text" name="message" id="user_var1">
<input type="text" name="message" id="user_var2">
because there is no code present that populates these two fields.
The two DOM elements that are populated by code are as follows:
<span id='var1'></span>
<span id='var2'></span>
The code which does this is
document.getElementById('var1').innerText =
    document.getElementById("user_var1").value;
document.getElementById('var2').innerText =
    document.getElementById("user_var2").value;
It is using the non-standard innerText rather than textContent. However, innerText sets the element's text content rather than its HTML content, which prevents the browser from rendering any tags or running any script.
Even if it were setting the innerHTML property instead, all the user could do is attack themselves (just the same as they could by opening the developer tools in their browser).
That said, in the interests of correct functional behaviour and web standards, I would use textContent rather than innerText or innerHTML.
Note that
<script>alert(document.cookie);</script>
would not work anyway, it would have to be
<svg onload="alert(document.cookie)" />
or similar. HTML5 specifies that a <script> tag inserted via innerHTML should not execute.

Auto login using Javascript and POST

I am trying to create a link that navigates to a 3rd party site and automatically logs in.
There is no API, and the form doesn't support query strings. Security isn't an issue (I know passing credentials in links is bad practice, but in our situation that's OK).
I can get it to work using VBS but IE makes it really tough to execute scripts.
I am now using JavaScript:
function autoLogin() {
    document.Form1.submit();
}
My HTML:
<form name="namofform" method=post action="www.websiteofloginpage.com">
<input type=hidden id=ID name="USERNAME" value="USERNAME"/>
<input type=hidden id=ID name="PASSWORD" value="PASSWORD"/>
</form>
I change the fields to match the ones on the target form. When I execute the script (on load or from a link), it navigates to the page but isn't posting (logging in).
I noticed the submit button uses __doPostBack - is that why it's not working when triggered from a different site?
Have you looked into the other answers about cross-domain POSTing? There are certainly ways to circumvent the browsers' same-origin policies, but you won't be able to do it with simple JavaScript form POSTing.
See more here:
Cross Domain Form POSTing
Perhaps you can use a CORS-based or JSONp solution:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS
What is JSONP all about?
Did you try submitting the form to the right URI? Generally it will be something like www.example.com/login. There is also the point mentioned in Jim Miller's answer: cross-domain form POSTing.

Scrape JavaScript download links from ASP website

I am trying to download all the files from this website for backup and mirroring; however, I don't know how to go about parsing the JavaScript links correctly.
I need to organize all the downloads into named folders in the same way. For example, for the first one I would have a folder named "DAP-1150", and inside it a folder named "DAP-1150 A1 FW v1.10" containing the file "DAP1150A1_FW110b04_FOSS.zip", and so on for each file. I tried using BeautifulSoup in Python, but it didn't seem to be able to handle the ASP links properly.
When you struggle with JavaScript links, you can give Selenium a try: http://selenium-python.readthedocs.org/en/latest/getting-started.html
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://www.python.org")
time.sleep(3)  # give your Selenium some time to load the page
link_elements = driver.find_elements_by_tag_name('a')
links = [link.get_attribute('href') for link in link_elements]
You can then pass the links to urllib2 to download them accordingly.
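For example (a sketch; deriving the local filename from the URL is my assumption about how you want to save things):
import urllib2

for link in links:
    if not link:
        continue  # some anchors have no href
    filename = link.split('/')[-1] or 'index.html'
    with open(filename, 'wb') as f:
        f.write(urllib2.urlopen(link).read())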
If you need more than a script, I can recommend a combination of Scrapy and Selenium:
selenium with scrapy for dynamic page
Here's what it's doing. I just used the standard network inspector in Firefox to snapshot the POST operation. Bear in mind, as with the other answer I pointed you to, this is not a particularly well-written website - JS/POST should not have been used at all.
First of all, here's the JS - it's very simple:
function oMd(pModel_, sModel_) {
    obj = document.form1;
    obj.ModelCategory_.value = pModel_;
    obj.ModelSno_.value = sModel_;
    obj.Model_Sno.value = '';
    obj.ModelVer.value = '';
    obj.action = 'downloads2008detail.asp';
    obj.submit();
}
That writes to these fields:
<input type=hidden name=ModelCategory_ value=''>
<input type=hidden name=ModelSno_ value=''>
So, you just need a POST form targeting this URL:
http://tsd.dlink.com.tw/downloads2008detail.asp
And here's an example set of data from FF's network analyser. There are only two items you need to change - grabbed from the JS link - and you can get those with an ordinary scrape:
Enter=OK
ModelCategory=0
ModelSno=0
ModelCategory_=DAP
ModelSno_=1150
Model_Sno=
ModelVer=
sel_PageNo=1
OS=GPL
You'll probably find by experimentation that not all of them are necessary. I did try using GET for this, in the browser, but it looks like the target page insists upon POST.
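A sketch of that POST in Python, using the captured fields (the two Model values come from whichever JS link you scraped; as noted, some of the other fields may turn out to be unnecessary):
import urllib
import urllib2

data = urllib.urlencode({
    'Enter': 'OK',
    'ModelCategory': '0',
    'ModelSno': '0',
    'ModelCategory_': 'DAP',  # from the oMd(...) arguments in the scraped link
    'ModelSno_': '1150',      # from the oMd(...) arguments in the scraped link
    'Model_Sno': '',
    'ModelVer': '',
    'sel_PageNo': '1',
    'OS': 'GPL',
})
html = urllib2.urlopen('http://tsd.dlink.com.tw/downloads2008detail.asp', data).read()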
Don't forget to leave a decent amount of time between clicks and submits inside your scraper, as each one represents a hit on the remote server; I suggest 5 seconds, emulating a human delay. If you do this too quickly - all too easy on a good connection - the remote side may assume you are DoSing them and block your IP. Remember the motto of scraping: be a good robot!

Injecting HTML into a page using Mechanize

I am writing a web-scraping program to get my grades from a website. I used Mechanize to log into the site and navigate to the area I'm scraping. Unfortunately, the page uses JavaScript to encrypt its contents (possibly to stop people like me from scraping). I found the decryption script and ported it to Python; it works, and I used it to extract the encrypted string from the page. When I decode it, it becomes an HTML table.
So, to get to my point: is there any way to inject that HTML back into the page and have mechanize follow the links in the table to get my grades?
Thanks for the help!
EDIT: I also have Beautiful Soup, if that is any help.
I ended up just using this:
response = br.open("http://www.linknotonpagethatiwanttogoto.com")
page = response.read()
I found out that you can store the result of .open() on a link as a response, instead of using .follow_link(). The browser also keeps using the same cookie jar, so the session cookies are preserved. So after parsing the HTML, I popped the links into .open() and got the new page.
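A sketch of that flow, assuming the decrypted table HTML is in decrypted_html and BASE_URL is the page's address (both names are mine):
from urlparse import urljoin
from bs4 import BeautifulSoup

soup = BeautifulSoup(decrypted_html, 'html.parser')
for a in soup.find_all('a', href=True):
    # same br instance as the login, so the session cookies come along for free
    page = br.open(urljoin(BASE_URL, a['href'])).read()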
