How to manipulate JavaScript websites in Perl

I have been asked to automate logging into a web app (what I assume to be one; it runs a lot of .aspx and .js scripts) that currently can only run in IE. I am programming in Perl and have tried to use Win32::IE::Mechanize to drive the IE browser and log in. What I did was try to extract all the forms from the web app and, given the user's information, fill out the required ones. This is where the problem arises: when I try to run the subroutine, no forms appear.
So then I transitioned to WWW::Mechanize and used the post subroutine (from LWP::UserAgent), which solved the problem for the most part. Now I've run into a problem with the response from the server: I get the script below as the content of the response and I don't know what to do with it.
So my question is: using Perl, how can I manipulate the JavaScript functions in a website, so that I can fully log in to the web app? Would that even be a valid solution to the problem?
I am open to writing this in other programming languages as well. Thanks in advance for the help!
Update: The content of the response:
var msgTimerID;
var strForceLogOff = "false";
function WindowOnLoad(){
if ("false" == "true" && "false" == "false")
MerlinSystemMsg("",64);
if ("false"=="true")
msgTimerID = window.setInterval("MerlinSystemMsg(10095,64)", 300000,'javascript');
}
function MyShowModal(){
showModalDialog("", window, strFeatures);}
function clearMsgInterval(){
window.clearInterval(msgTimerID);
}
function WindowOnUnLoad(){
if(top.frames(0).document.getElementById("OPMODE").value =="LOGOFF"){
strFeatures = "width=1,height=1,left=1000,top=1000,toolbar=no,scrollbars=no,menubar=no,location=no,directories=no,status=yes,resizable=1";
window.open("ForceLogOff.aspx","forcelogout",strFeatures);
}
}
window.onbeforeunload = WindowOnUnLoad;
window.onload = WindowOnLoad;
There is also this FRAME, whose tag has the following SRC:
FRAME TITLE="Service Desk Express Navigator" SRC="options_nailogo.aspx" MARGINWIDTH=0 MARGINHEIGHT=0 NORESIZE scrolling=no

Trying to emulate the browser with a fully functioning JS engine is going to be a mighty big task. Instead, I'd suggest that you just try to emulate the actual interaction with the web site and not care what HTML/JS is actually sent back. Your server side code doesn't care how the HTTP submissions take place, only that they do. Admittedly this is more fragile if the forms change a lot, but at least you're not trying to implement a full browser.
So look at modules like LWP::UserAgent, HTTP::Request and HTTP::Response.

I'm copying and pasting my answer to your other, duplicate question here. (You should consider deleting one of them.)
That content is the website source :)
How WWW::Mechanize deals with FRAME SRC as a link:
Note that <FRAME SRC="..."> tags are parsed out of the HTML and treated as links, so this method works with them.
You'll want to use follow_link on that link.
As far as dealing with JavaScript goes, there is a Firefox add-on called MozRepl that you can use in conjunction with WWW::Mechanize::Firefox; I have used it in the past to call JavaScript code while crawling a page.

Related

Getting access to an XML file from JavaScript without Node, without jQuery

I am trying to develop modifications to a game. The thing is, the game is already compiled and the developers prefer not to decompile it (for the time being). Probably because of the compilation, every time I try to load jQuery or Node.js, whatever the version, I get the error that "a key already exists in the dictionary". Everything is fine without Node.js or jQuery.
What I am trying to achieve is to add some features to the game that unfortunately aren't available through the game's own API calls. I want to be able to access data inside the .xml files used for the specifications of items/weapons/devices/engines inside the game. I've tried pretty much everything I could find on Stack Exchange for what I searched, which was Node and jQuery. I'm sorry if you think this is a duplicate question, because it isn't: I can't use Node.js and I can't use jQuery. What else could I try? Can someone help me, please?
I am a bit new to programming, with only one year of experience in C# and JavaScript. Sorry if this feels really noobish to you.
What you need is ajax. Modern browsers provide a pretty functional XMLHttpRequest, so you don’t even need a framework anymore.
One important thing to know: you most likely won't be able to download the XML file using ajax if it's on a remote server, due to the same-origin policy. You need reliable access to it. The most convenient solution is to have a copy of the file on a local server such as WAMP, XAMPP, and the like.
I'm not going to write yet another ajax tutorial. Instead I'll just provide you with a working minimal HTML page and point you towards the XMLHttpRequest documentation.
<button>Request</button>
<script>
'use strict';
document.querySelector('button').addEventListener('click', function () {
    let req = new XMLHttpRequest();
    req.onload = function () {
        if (this.responseXML) {
            console.log(this.responseXML);
        } else {
            console.log(this.responseText);
        }
    };
    req.open('GET', xmlURL); // xmlURL should be the location of the .xml file
    req.send();
});
</script>
When you click on the button, the script will request, and then display the server’s response, if any, in your browser console. To open the console, press F12 and select the console tab.
Be aware that the responseXML property will only be populated if the XML sent by the server is strictly well-formed. XML parsing in JS is somewhat finicky, so you may want to rely on responseText as a fallback.
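If you need a parsed document even when responseXML comes back null, a small fallback sketch (my own addition, assuming the text is at least close to valid XML) is to run responseText through DOMParser yourself:
// Fallback sketch: parse the raw text manually when responseXML is null.
// Browsers report parse failures differently; here we just look for a parsererror node.
function parseXmlFallback(text) {
    var doc = new DOMParser().parseFromString(text, 'application/xml');
    if (doc.getElementsByTagName('parsererror').length > 0) {
        console.warn('XML was not well-formed; keeping the raw text instead');
        return null;
    }
    return doc;
}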

Best option for crawling a website that loads content via ajax [duplicate]

Please advise how to scrape AJAX pages.
Overview:
All screen scraping first requires a manual review of the page you want to extract resources from. When dealing with AJAX you usually need to analyze a bit more than just the HTML.
With AJAX, the value you want is not in the initial HTML document that you requested; instead, JavaScript is executed which asks the server for the extra information you want.
You can therefore usually just analyze the JavaScript, see which request it makes, and call that URL yourself from the start.
Example:
Take this as an example: assume the page you want to scrape has the following script:
<script type="text/javascript">
function ajaxFunction()
{
    var xmlHttp;
    try
    {
        // Firefox, Opera 8.0+, Safari
        xmlHttp = new XMLHttpRequest();
    }
    catch (e)
    {
        // Internet Explorer
        try
        {
            xmlHttp = new ActiveXObject("Msxml2.XMLHTTP");
        }
        catch (e)
        {
            try
            {
                xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
            }
            catch (e)
            {
                alert("Your browser does not support AJAX!");
                return false;
            }
        }
    }
    xmlHttp.onreadystatechange = function()
    {
        if (xmlHttp.readyState == 4)
        {
            document.myForm.time.value = xmlHttp.responseText;
        }
    };
    xmlHttp.open("GET", "time.asp", true);
    xmlHttp.send(null);
}
</script>
Then all you need to do is make an HTTP request to time.asp on the same server instead. (Example from w3schools.)
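For instance, a minimal sketch of requesting that endpoint directly from a scraper (using Node's built-in http module; the host name is a placeholder for the real server you analyzed):
// Hedged sketch: hit the AJAX endpoint itself instead of the page that calls it.
// 'example.com' stands in for the actual host serving time.asp.
const http = require('http');

http.get('http://example.com/time.asp', (res) => {
    let body = '';
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => console.log('Scraped value:', body));
}).on('error', (err) => console.error(err));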
Advanced scraping with C++:
For complex usage, and if you're using C++, you could also consider using SpiderMonkey, Mozilla's JavaScript engine, to execute the JavaScript on a page.
Advanced scraping with Java:
For complex usage, and if you're using Java, you could also consider using Rhino, the JavaScript engine for Java.
Advanced scraping with .NET:
For complex usage, and if you're using .NET, you could also consider using the Microsoft.Vsa assembly, recently replaced with ICodeCompiler/CodeDOM.
In my opinion the simplest solution is to use CasperJS, a framework based on PhantomJS, the headless WebKit browser.
The whole page is loaded, and it's very easy to scrape any ajax-related data.
You can check this basic tutorial to learn about Automating & Scraping with PhantomJS and CasperJS.
You can also take a look at this example code, showing how to scrape Google Suggest keywords:
/*global casper:true*/
var casper = require('casper').create();
var suggestions = [];
var word = casper.cli.get(0);

if (!word) {
    casper.echo('please provide a word').exit(1);
}

casper.start('http://www.google.com/', function() {
    this.sendKeys('input[name=q]', word);
});

casper.waitFor(function() {
    return this.fetchText('.gsq_a table span').indexOf(word) === 0;
}, function() {
    suggestions = this.evaluate(function() {
        var nodes = document.querySelectorAll('.gsq_a table span');
        return [].map.call(nodes, function(node) {
            return node.textContent;
        });
    });
});

casper.run(function() {
    this.echo(suggestions.join('\n')).exit();
});
If you can get at it, try examining the DOM tree. Selenium does this as a part of testing a page. It also has functions to click buttons and follow links, which may be useful.
The best way to scrape web pages that use Ajax, or pages that use JavaScript in general, is with a browser itself or a headless browser (a browser without a GUI). Currently PhantomJS is a well-promoted headless browser based on WebKit. An alternative that I have used with success is HtmlUnit (in Java, or in .NET via IKVM), which is a simulated browser. Another known alternative is to use a web automation tool like Selenium.
I wrote many articles about this subject like web scraping Ajax and Javascript sites and automated browserless OAuth authentication for Twitter. At the end of the first article there are a lot of extra resources that I have been compiling since 2011.
I like PhearJS, but that might be partially because I built it.
That said, it's a service you run in the background that speaks HTTP(S) and renders pages as JSON for you, including any metadata you might need.
It depends on the Ajax page. The first part of screen scraping is determining how the page works. Is there some sort of variable you can iterate through to request all the data from the page? Personally, I've used Web Scraper Plus for a lot of screen-scraping tasks because it is cheap, it's not difficult to get started with, and non-programmers can get it working relatively quickly.
Side note: the Terms of Use are probably something you want to check before doing this. Depending on the site, iterating through everything may raise some flags.
I think Brian R. Bondy's answer is useful when the source code is easy to read. I prefer an easier way: using tools like Wireshark or HttpAnalyzer to capture the packets and get the URL from the "Host" and "GET" fields.
For example, I captured a packet like the following:
GET /hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330 HTTP/1.1
Accept: */*
Referer: http://quote.hexun.com/stock/default.aspx
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Host: quote.tool.hexun.com
Connection: Keep-Alive
Then the URL is:
http://quote.tool.hexun.com/hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330
As a low cost solution you can also try SWExplorerAutomation (SWEA). The program creates an automation API for any Web application developed with HTML, DHTML or AJAX.
Selenium WebDriver is a good solution: you program a browser and you automate what needs to be done in the browser. Browsers (Chrome, Firefox, etc) provide their own drivers that work with Selenium. Since it works as an automated REAL browser, the pages (including javascript and Ajax) get loaded as they do with a human using that browser.
The downside is that it is slow (since you would most probably like to wait for all images and scripts to load before you do your scraping on that single page).
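For context, a minimal sketch of that approach with the Node selenium-webdriver bindings (the URL, selector, and timeout are placeholders, and you also need the matching browser driver installed):
// Hedged sketch: drive a real browser, wait for the Ajax-rendered element, then read it.
const { Builder, By, until } = require('selenium-webdriver');

(async function scrape() {
    const driver = await new Builder().forBrowser('chrome').build();
    try {
        await driver.get('http://example.com/ajax-page');           // placeholder URL
        const el = await driver.wait(
            until.elementLocated(By.css('#ajax-content')), 10000);  // placeholder selector
        console.log(await el.getText());
    } finally {
        await driver.quit();
    }
})();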
I have previously linked to MIT's Solvent and EnvJS as my answers for scraping Ajax pages. These projects seem to be no longer accessible.
Out of sheer necessity, I have invented another way to actually scrape Ajax pages, and it has worked for tough sites like findthecompany, which have ways of detecting headless JavaScript engines and showing them no data.
The technique is to use Chrome extensions to do the scraping. Chrome extensions are the best place to scrape Ajax pages because they actually give us access to the JavaScript-modified DOM. The technique is as follows (I will certainly open-source the code at some point; a rough sketch of the two scripts follows at the end of this answer). Create a Chrome extension (assuming you know how to create one, and its architecture and capabilities; this is easy to learn and practice as there are lots of samples):
Use content scripts to access the DOM via XPath. Pull pretty much the entire list, table, or dynamically rendered content, using XPath, into a variable as string HTML nodes. (Only content scripts can access the DOM, but they can't contact a URL using XMLHttpRequest.)
From the content script, using message passing, send the entire stripped DOM as a string to a background script. (Background scripts can talk to URLs but can't touch the DOM.) We use message passing to get these two to talk.
You can use various events to loop through web pages and pass each stripped HTML node's content to the background script.
Now use the background script to talk to an external server (on localhost), a simple one created using Node.js/Python. Just send the entire HTML nodes as strings to the server, which persists the content posted to it into files, with appropriate variables to identify page numbers or URLs.
Now you have scraped the AJAX content (HTML nodes as strings), but these are partial HTML nodes. You can now use your favorite XPath library to load them into memory and use XPath to scrape the information into tables or text.
Please comment if you can't follow this and I can write it up better (first attempt). Also, I am trying to release sample code as soon as possible.
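For illustration, here is a minimal, hedged sketch of that content-script/background-script split (classic Manifest V2 style message passing; the XPath expression, message name, and local server URL are placeholders of mine, not the author's promised sample code):
// content_script.js -- runs in the page, sees the Ajax-modified DOM, but cannot call external URLs.
var result = document.evaluate('//table[@id="results"]', document, null,
    XPathResult.FIRST_ORDERED_NODE_TYPE, null);
var node = result.singleNodeValue;
if (node) {
    // Hand the stripped HTML to the background script via message passing.
    chrome.runtime.sendMessage({ type: 'scrapedHtml', html: node.outerHTML, url: location.href });
}

// background.js -- cannot touch the page's DOM, but may talk to your local collector server.
chrome.runtime.onMessage.addListener(function (msg) {
    if (msg.type === 'scrapedHtml') {
        fetch('http://localhost:3000/save', {        // hypothetical Node.js/Python endpoint
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ url: msg.url, html: msg.html })
        });
    }
});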

load page and execute javascript in a url

Hello wonderful stackoverflow users.
I have a question about url loading.
In many browsers and web views, there is functionality to load a URL for a website, but also a URL that executes JavaScript.
Load a website: http://www.google.com
Load a script: javascript:alert("Hello!");
My question is: is there a way to load an HTTP request as well as JavaScript from a single URL?
The answer is most likely no, but I want to confirm because I can't find any resources that describe this.
I was thinking it would be something like:
http://www.google.com&&javascript:alert("Hello!");
but the problem is, of course, this is not correct.
The reason I am doing this is to provide a URL that, once clicked, will also execute a certain JavaScript function. This will be in Android.
I appreciate any response, and understand that the answer may be no.
It all depends on whether you have control of the page being linked to. If you cannot modify the source of the linked page, then the answer is quite simply, no.
But, if it is your page, you can pass arguments in the hash, and then read the hash when the page loads and execute script accordingly.
window.onload = function () {
    if (location.hash.indexOf("doSomething") > -1) {
        // do something
    }
};
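To trigger it, the link you hand out just carries the flag in the fragment; for example (the page name here is hypothetical):
// Navigating to the page with the flag in the hash makes the onload handler above
// take the "doSomething" branch.
window.location.href = 'http://example.com/page.html#doSomething';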
You can execute JavaScript when a page loads using browser plugins, such as Greasemonkey for Firefox or Tampermonkey for Chrome.
https://addons.mozilla.org/en-us/firefox/addon/greasemonkey/
http://tampermonkey.net/index.php?version=3.11&ext=dhdg&updated=true

Take Screenshot of Browser via JavaScript (or something else)

For support reasons I want it to be as easy as possible for a user to take a screenshot of the current browser window and send it over to the server.
Any (crazy) ideas?
That would appear to be a pretty big security hole in JavaScript if you could do this. Imagine a malicious user installing that code on your site with an XSS attack and then screenshotting all of your daily work. Imagine that happening with your online banking...
However, it is possible to do this sort of thing outside of JavaScript. I developed a Swing application that used screen capture code like this which did a great job of sending an email to the helpdesk with an attached screenshot whenever the user encountered a RuntimeException.
I suppose you could experiment with a signed Java applet (shock! horror! noooooo!) that hung around in the corner. If executed with the appropriate security privileges given at installation it might be coerced into executing that kind of screenshot code.
For convenience, here is the code from the site I linked to:
import java.awt.Dimension;
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.image.BufferedImage;
import javax.imageio.ImageIO;
import java.io.File;
...
public void captureScreen(String fileName) throws Exception {
    Dimension screenSize = Toolkit.getDefaultToolkit().getScreenSize();
    Rectangle screenRectangle = new Rectangle(screenSize);
    Robot robot = new Robot();
    BufferedImage image = robot.createScreenCapture(screenRectangle);
    ImageIO.write(image, "png", new File(fileName));
}
...
Please see the answer shared here for a relatively successful implementation of this:
https://stackoverflow.com/a/6678156/291640
Utilizing:
https://github.com/niklasvh/html2canvas
You could try to render the whole page into a canvas and save that image back to the server. Have fun :)
A webpage can't do this (or at least, I would be very surprised if it could, in any browser) but a Firefox extension can. See https://developer.mozilla.org/en/Drawing_Graphics_with_Canvas#Rendering_Web_Content_Into_A_Canvas -- when that page says "Chrome privileges" that means an extension can do it, but a web page can't.
Seems to me that support needs (at least) the answers for two questions:
What does the screen look like? and
Why does it look that way?
A screenshot -- a visual -- is very necessary and answers the first question, but it can't answer the second.
As a first attempt, I'd try to send the entire page up to support. The support tech could display that page in his browser (answers the first question); and could also see the current state of the customer's html (helps to answer the second question).
I'd try to send as much of the page as is available to the client JS by way of AJAX or as the payload of a form. I'd also send info not on the page: anything that affects the state of the page, like cookies or session IDs or whatever.
The cust might have a submit-like button to start the process.
I think that would work. Let's see: it needs some CGI somewhere on the server that catches the incoming user page and makes it available to support, maybe by writing a disk file. Then the support person can load (or have loaded automatically) that same page. All the other info (cookies and so on) can be put into the page that support sees.
PLUS: the client JS that handles the submit-button onclick( ) could also include any useful JS variable values!
Hey, this can work! I'm getting psyched :-)
HTH
-- pete
I've seen people do this with one of two approaches:
Set up a separate server for screenshotting and run a bunch of Firefox instances on it; check out these two gems if you're doing it in Ruby: selenium-webdriver and headless.
Use a hosted solution like http://url2png.com (way easier).
You can also do this with the Fireshot plugin. I use the following code (that I extracted from the API code so I don't need to include the API JS) to make a direct call to the Fireshot object:
var element = document.createElement("FireShotDataElement");
element.setAttribute("Entire", true);
element.setAttribute("Action", 1);
element.setAttribute("Key", "");
element.setAttribute("BASE64Content", "");
element.setAttribute("Data", "C:/Users/jagilber/Downloads/whatev.jpg");
if (typeof(CapturedFrameId) != "undefined")
element.setAttribute("CapturedFrameId", CapturedFrameId);
document.documentElement.appendChild(element);
var evt = document.createEvent("Events");
evt.initEvent("capturePageEvt", true, false);
element.dispatchEvent(evt);
Note: I don't know if this functionality is only available for the paid version or not.
Perhaps http://html2canvas.hertzen.com/ could be used. Then you can capture the display and process it.
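For reference, a minimal sketch of that idea (assuming the promise-based API of newer html2canvas releases; the upload endpoint is a placeholder):
// Render the visible page into a canvas, then POST the PNG data URL to the server.
// '/support/screenshot' is a hypothetical endpoint; adjust to your backend.
html2canvas(document.body).then(function (canvas) {
    var xhr = new XMLHttpRequest();
    xhr.open('POST', '/support/screenshot');
    xhr.setRequestHeader('Content-Type', 'application/json');
    xhr.send(JSON.stringify({ image: canvas.toDataURL('image/png'), page: location.href }));
});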
You might try PhantomJS, a headless browsing toolkit.
http://phantomjs.org/
The following Javascript example demonstrates basic screenshot functionality:
var page = require('webpage').create();
page.settings.userAgent = 'UltimateBrowser/100';
page.viewportSize = { width: 1200, height: 1200 };
page.clipRect = { top: 0, left: 0, width: 1200, height: 1200 };
page.open('https://google.com/', function () {
    page.render('output.png');
    phantom.exit();
});
I understand this post is 5 years old, but for the sake of future visitors I'll add my own solution here, which I think solves the original post's question without any third-party libraries apart from jQuery.
pageClone = $('html').clone();
// Make sure that CSS and images load correctly when opening this clone
pageClone.find('head').append("<base href='" + location.href + "' />");
// OPTIONAL: Remove potentially interfering scripts so the page is totally static
pageClone.find('script').remove();
htmlString = pageClone.html();
You could remove other parts of the DOM you think are unnecessary, such as the support form if it is in a modal window. Or you could choose not to remove scripts if you prefer to maintain some interaction with dynamic controls.
Send that string to the server, either in a hidden field or by AJAX, and then on the server side just attach the whole lot as an HTML file to the support email.
The benefits of this are that you'll get not just a screenshot but the entire scrollable page in its current form, plus you can even inspect and debug the DOM.
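A hedged sketch of the AJAX variant of that hand-off (jQuery is already assumed by this answer; the endpoint name is a placeholder):
// Hypothetical endpoint; the server side just stores the HTML and attaches it to the support email.
$.post('/support/page-snapshot', {
    html: htmlString,
    url: location.href,
    userAgent: navigator.userAgent
});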
Print Screen? Old school and a couple of keypresses, but it works!
This may not work for you, but on IE you can use the snapsie plugin. It doesn't seem to be in development anymore, but the last release is available from the linked site.
I think you need an ActiveX control; without one I can't imagine how. You can force the user to install it first; after installation, the ActiveX control should work on the client side and you can capture the screen.
We are temporarily collecting Ajax states, data in form fields and session information. Then we re-render it at the support desk. Since we test and integrate for all browsers, there are hardly any support cases for display reasons.
Have a look at the red button at the bottom on HolidayCheck.
Alternatively there is html2canvas, but it only works in newer browsers and I've never tried it.
In JavaScript? No. I do work for a security company (sort of NetNanny type stuff) and the only effective way we've found to do screen captures of the user is with a hidden application.

Creating a bookmarklet that doesn't get blocked

Goal: To create a bookmarklet that calls a remote javascript file that opens a popup window. The popup window is functionally similar to what Delicious's bookmarklet does.
Background: Currently I'm using window.open within this JavaScript file; however, the popup is getting blocked by pretty much every major browser.
The alternative solution is very similar to the way Delicious wrote their bookmarklet: calling window.open through a javascript: query within the bookmarklet itself. However, I need the ability to modify the other contents of my JavaScript file in the future without requiring users to continually grab the newest release of the bookmarklet.
What I've determined to be happening: Since the window.open call is not occurring directly as a result of a click by the user, the browser feels this is something that should be blocked. Here's a source on this.
This is the tutorial I referenced most recently in creating the call to the remote js file.
Here is a basic example of what my code is doing; the window.open/popup portion is the only significant part I'm including as it's the only part I feel is causing the complication:
Example of the remote javascript file:
if (typeof jQuery == 'undefined') {
    var jQ = document.createElement('script');
    jQ.type = 'text/javascript';
    jQ.onload = runthis;
    jQ.src = 'http://ajax.googleapis.com/ajax/libs/jquery/1/jquery.min.js';
    document.body.appendChild(jQ);
} else {
    runthis();
}

function runthis() {
    window.open('http://www.google.com/', 'a title',
        'location=yes,links=no,scrollbars=no,toolbar=no,width=550,height=550');
}
I'd really appreciate any help as this has been stumping me for a while!
An approach that looks better and side-steps the blocking issue is to have the bookmarklet insert an iframe into the page the user is currently viewing. I ended up taking this approach back when I asked this question, and it worked out fine.
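A minimal sketch of that iframe approach (the panel URL is a placeholder; the remotely hosted script can keep changing without users re-grabbing the bookmarklet):
// remote.js -- loaded by the bookmarklet; inserts an overlay iframe instead of opening a popup.
(function () {
    var frame = document.createElement('iframe');
    frame.src = 'https://example.com/panel.html?url=' + encodeURIComponent(location.href); // placeholder
    frame.style.cssText = 'position:fixed;top:10px;right:10px;width:400px;height:500px;' +
        'border:1px solid #ccc;background:#fff;z-index:2147483647;';
    document.body.appendChild(frame);
})();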
