wget + JavaScript? - javascript

I have this webpage that uses client-side JavaScript to format data on the page before it's displayed to the user.
Is it possible to somehow use wget to download the page and use some sort of client-side JavaScript engine to format the data as it would be displayed in a browser?

You could probably make that happen with something like PhantomJS.
You can write a PhantomJS script that loads the page like a browser would, and then either takes screenshots or uses JS to inspect the page and pull out data.

Here is a simple little phantomjs script that triggers javascript on a webpage and allows you to pull it down locally:
file: get.js
var page = require('webpage').create(),
    system = require('system'),
    address = system.args[1];
page.scrollPosition = { top: 4000, left: 0 };
page.open(address, function(status) {
    if (status !== 'success') {
        console.log('** Error loading url.');
    } else {
        console.log(page.content);
    }
    phantom.exit();
});
Use it as follows:
$> phantomjs /path/to/get.js "http://www.google.com" > "google.html"
Change /path/to, the URL, and the output filename to whatever you need.

Not with wget, as I doubt it includes any form of a JavaScript engine. However, you could use WebKit to process the page, and thus the output.
You could use something like this as a starting point for getting the content: http://situated.wordpress.com/2008/06/04/take-screenshots-of-a-website-from-the-command-line/

Related

Window.location file download

I am trying to download a file to client side using following javascript code:
window.location = InsightRoute + "GetOrderXML?orderNumber=" + txtOrderNoVal
If the file is available, it gets downloaded to the client machine. But the issue is that if no file is available for download, the browser is simply redirected to a blank page:
http://mysite/GetOrderXML?orderNumber=1
You should check that the values used to build the URL are set before redirecting, for example like this:
if (sdpInsightRoute && txtOrderNoVal)
    window.location = sdpInsightRoute + "GetOrderXML?orderNumber=" + txtOrderNoVal
This way, if the variable txtOrderNoVal is undefined, the redirection won't take place.
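The guard above can be wrapped in a small helper that also URL-encodes the order number. This is only a sketch: the name buildOrderUrl is made up for illustration, and it checks that the inputs are present, not that the file actually exists on the server.

```javascript
// Hypothetical helper (not from the original post): returns the download URL,
// or null when the route or the order number is missing or blank.
function buildOrderUrl(route, orderNumber) {
    if (!route || orderNumber === undefined || orderNumber === null) {
        return null;
    }
    var value = String(orderNumber).trim();
    if (value === '') {
        return null;
    }
    // encodeURIComponent keeps spaces and special characters from breaking the query string
    return route + 'GetOrderXML?orderNumber=' + encodeURIComponent(value);
}

// In the browser this might be used as:
// var url = buildOrderUrl(sdpInsightRoute, txtOrderNoVal);
// if (url) { window.location = url; }
```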
If the file is not available, use the following code inside the controller so that an alert pops up:
Response.Write("<script>alert('Item does not exist on this environment.');window.history.go(-1);</script>");
return null;
The call to window.history.go(-1) sends the browser back to the previous page. Without it, when there is no file, the user would be left on the new page http://mysite/Insight/GetOrderXML?orderNumber=1; going back avoids that.

How to get css files and js files when scraping a web page using phantomjs

I am working on a project where I need to scrape a webpage. After going through some tutorials, I found that PhantomJS would be the best choice, as it lets me get the HTML content of AngularJS sites and AJAX-based views. I have already written the code and it works fine, but I am not able to get the CSS and JS files when the page references them only by a short (relative) path.
If the target site uses a full URL, like below,
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
it works fine, because I can use the full URL of the JS file directly.
But if the target site uses a URL like
<script src="assets/js/jquery.min.js"></script>
then I cannot get the CSS and JS for my current HTML content. Below is the PhantomJS code I have written so far.
var page = new WebPage();
var fs = require('fs');
page.onLoadFinished = function() {
    console.log("page load finished");
    page.render('export.png');
    fs.write('1.html', page.content, 'w');
    phantom.exit();
};
page.open("http://insttaorder.com/", function() {
    page.evaluate(function() {
    });
});
What I need is all the CSS and JS files on my local computer. I have searched Google and GitHub but have not found a specific solution for this.
The strategy for solving the task is this:
1. Open the page in PhantomJS
2. Enumerate all the links to JS and CSS resources
3. Download them all
Although PhantomJS could be used to download and save the files, it would be very suboptimal to do so. Instead, let us follow the Unix philosophy that one program should do just one job but do it well. We will use the excellent wget utility to download files from a list that PhantomJS will prepare.
var page = require('webpage').create();
var fs = require('fs');
page.open('http://insttaorder.com/', function(status)
{
    // Get all links to CSS and JS on the page
    // (note: $ here assumes the target page itself loads jQuery)
    var links = page.evaluate(function(){
        var urls = [];
        $("[rel=stylesheet]").each(function(i, css){
            urls.push(css.href);
        });
        $("script").each(function(i, js){
            if(js.src) {
                urls.push(js.src);
            }
        });
        return urls;
    });
    // Save all links to a file
    var url_file = "list.txt";
    fs.write(url_file, links.join("\n"), 'w');
    // Launch wget program to download all files from the list.txt to current folder
    require("child_process").execFile("wget", ["-i", url_file], null, function (err, stdout, stderr) {
        console.log("execFileSTDOUT:", stdout);
        console.log("execFileSTDERR:", stderr);
        // After wget finished exit PhantomJS
        phantom.exit();
    });
});
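In the script above, css.href and js.src are DOM properties, so the browser engine has already resolved short paths such as assets/js/jquery.min.js against the page address. If you work from the raw attribute values instead (for example, when parsing the saved HTML outside PhantomJS), the same resolution can be sketched with the standard URL constructor found in Node.js and modern browsers. The function name toAbsoluteUrls is made up for illustration.

```javascript
// Hypothetical helper: resolve each raw src/href value against the page URL.
// Already-absolute URLs pass through unchanged.
function toAbsoluteUrls(pageUrl, rawPaths) {
    return rawPaths.map(function (rawPath) {
        return new URL(rawPath, pageUrl).href;
    });
}

var resolved = toAbsoluteUrls('http://insttaorder.com/', [
    'assets/js/jquery.min.js',
    'https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js'
]);
console.log(resolved.join('\n'));
// http://insttaorder.com/assets/js/jquery.min.js
// https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js
```

The resulting list can be written to list.txt and handed to wget -i in the same way.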
You can get all requested resources via the onResourceRequested event.
By checking the request method and URL, you can filter out the resources you don't want and download the rest yourself later.
You don't need to worry about the path; the URL you get from the event is always complete.
var webPage = require('webpage');
var page = webPage.create();
page.onResourceRequested = function(req) {
    if (req.method === 'GET') {
        if (req.url.endsWith('.css')) {
            console.log('requested css file', JSON.stringify(req));
        } else if (req.url.endsWith('.js')) {
            console.log('requested js file', JSON.stringify(req));
        }
    }
};
More about onResourceRequested
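One caveat with matching on the end of the URL: an address such as style.css?v=3 carries a query string, so a plain suffix check misses it. Below is a small sketch of a classifier that ignores the query string and fragment. It is written in ES5 style so that it does not depend on String.prototype.endsWith, which older JavaScript engines lack. The name resourceKind is hypothetical.

```javascript
// Hypothetical helper: returns 'css', 'js', or null for any other resource.
function resourceKind(url) {
    // Drop the fragment and the query string, keeping only the path part
    var path = url.split('#')[0].split('?')[0].toLowerCase();
    if (/\.css$/.test(path)) {
        return 'css';
    }
    if (/\.js$/.test(path)) {
        return 'js';
    }
    return null;
}

// Inside PhantomJS it might be used like this:
// page.onResourceRequested = function(req) {
//     var kind = resourceKind(req.url);
//     if (req.method === 'GET' && kind) {
//         console.log('requested ' + kind + ' file', req.url);
//     }
// };
```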

How to print a pdf on client side from a groovy webApp?

I am now developing a webapp that has to create a PDF document and print it on the client side.
Here is my problem:
I created the PDF using iText and stored it in a shared folder. I did that on the server side.
Now I need to print the created PDF on the client side; the client knows the path of the PDF and can access it.
Since this has to happen on the client side, I am trying to print the document using JavaScript or jQuery.
I tried using embed in my HTML but it didn't work.
Thanks for helping.
Here is working code on the server side:
FileInputStream fis = new FileInputStream("test.pdf");
DocFlavor psInFormat = DocFlavor.INPUT_STREAM.AUTOSENSE;
Doc pdfDoc = new SimpleDoc(fis, psInFormat, null);
PrintRequestAttributeSet aset = new HashPrintRequestAttributeSet();
PrintService services = PrintServiceLookup.lookupDefaultPrintService();
DocPrintJob job = services.createPrintJob();
job.print(pdfDoc, aset);
and here is what i tried on client side :
// Grabs the Iframe
document.write('<embed type="application/pdf" src="\\SN782629\TempPartage\test.pdf" id="PDF" name="PDF" width="100%" height="100%" />');
var ifr = document.getElementById("PDF");
//PDF is completely loaded. (.load() wasn't working properly with PDFs)
ifr.onreadystatechange = function() {
    if (ifr.readyState == 'complete') {
        ifr.contentWindow.focus();
        if ($.browser.msie) {
            document.execCommand('print', false, null);
        } else {
            window.print();
        }
    }
    ifr.parentNode.removeChild(ifr);
};
This second snippet runs in the success callback of an AJAX request; I can post the entire function if needed.
fyi: Recommended way to embed PDF in HTML?
Another point: you'll run into a lot of restrictions in different browsers when including a PDF from the local file system in a page loaded through HTTP.
I'd advise exposing your PDF through a URL on your server, for example http://myhost/getPdf/SN782629/TempPartage/test.pdf instead of "\\SN782629\TempPartage\test.pdf", and using that link in the rendered page.

Do we need a web server (like Apache) to access a .json file?

I was trying to read an info.json file, using the jQuery API. Please find the code below, which is part of test.html.
$.getJSON('info.json', function(data) {
    var items = [];
    $.each(data, function(key, val) {
        items.push('<li id="' + key + '">' + val + '</li>');
    });
});
The test.html file resides on my local machine and when I try to open it in the browser, the Ajax call is not getting triggered and the info.json file is not read.
Is it not working because I don't have a web server? Or am I doing anything wrong in the code? (I don't see any errors in the Firebug console though).
Thanks in advance.
You need to serve your page from a web server when it makes AJAX calls. Otherwise the browser throws this exception:
Origin null is not allowed by Access-Control-Allow-Origin
Host your page on a localhost server and everything should work nicely.
While technically you don't need a web server for this, some of the libraries you use to abstract network access may not work with local files and some browsers don't let local files do a lot, so something like a little test web server for static files would be very useful for your development and testing.
Install a small webserver like http://jetty.codehaus.org/jetty/
easy to install, and small download ;)
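If installing a separate server feels heavy, a static file server can also be sketched with nothing but Node.js built-in modules. Everything specific here is an assumption for illustration: the helper names, the small MIME table, the test.html default page, and port 8080. It is meant for local testing only, not for production use.

```javascript
var http = require('http');
var fs = require('fs');
var path = require('path');

// Minimal MIME table (assumption: only the types needed for this kind of test page)
var MIME = {
    '.html': 'text/html',
    '.json': 'application/json',
    '.js': 'application/javascript',
    '.css': 'text/css'
};

function contentTypeFor(filePath) {
    return MIME[path.extname(filePath).toLowerCase()] || 'application/octet-stream';
}

// Builds (but does not start) a server that serves files from rootDir.
function createStaticServer(rootDir) {
    var root = path.resolve(rootDir);
    return http.createServer(function (req, res) {
        var urlPath;
        try {
            urlPath = decodeURIComponent(req.url.split('?')[0]);
        } catch (e) {
            res.writeHead(400);
            res.end('Bad request');
            return;
        }
        if (urlPath === '/') {
            urlPath = '/test.html';
        }
        var filePath = path.join(root, urlPath);
        // Refuse any path that resolves outside the root directory
        if (filePath.indexOf(root + path.sep) !== 0) {
            res.writeHead(403);
            res.end('Forbidden');
            return;
        }
        fs.readFile(filePath, function (err, data) {
            if (err) {
                res.writeHead(404);
                res.end('Not found');
                return;
            }
            res.writeHead(200, { 'Content-Type': contentTypeFor(filePath) });
            res.end(data);
        });
    });
}

// Usage: createStaticServer('.').listen(8080);
// then open http://localhost:8080/test.html in the browser.
```

With the page served over http://localhost, the $.getJSON('info.json', ...) call is a same-origin request rather than a file:// access.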
By putting your JSON string into a text file and loading it in an iframe, you can extract the data. (Most browsers can load .txt files in iframes.)
var frame = document.createElement("IFRAME"); //Create new iframe
var body = document.body;
frame.onload = function() { //Extract JSON once loaded
    data = JSON.parse(frame.contentDocument.documentElement.innerText); //Loads as a global.
    body.removeChild(frame); //Removes the frame once no longer necessary.
}
frame.style.display = "none"; //Because the frame will have to be appended to the body.
body.appendChild(frame);
frame.src = "your JSON.txt"; //Select source after the onload function is set.

Issues in developing web scraper

I want to develop a platform where users can enter a URL and my website then opens the webpage in an iframe. The user can then modify the website by simply right-clicking, and I will provide options like "remove this element" and "copy this element". I am almost done. Many websites open perfectly in the iframe, but a few show errors. I could not identify the reason, so I am asking for your help.
I have solved other issues, like the XSS problem.
Here is the procedure I have followed:
I use JavaScript to send a request to my Java server, which connects to the URL specified by the user and fetches the HTML. It then uses the Jsoup HTML parser to convert relative URLs into absolute URLs and saves the HTML to disk. Finally, I render the saved HTML in my iframe.
Is something wrong with this approach?
A few websites work perfectly but a few do not.
For example:
When I tried to open http://www.snapdeal.com it gave me the
Uncaught TypeError: Cannot read property 'paddingTop' of undefined
error. I don't understand why this is happening.
Update
I really wonder how this is implemented: http://www.proxywebsites.in/browse.php?u=Oi8vd3d3LnNuYXBkZWFsLmNvbQ%3D%3D&b=13&f=norefer
Two possible issues, pick whichever you like:
Your server-side proxy code contains bugs.
Plenty of sites either have explicit frame-breaking code or at least expect to be the top-level frame.
You can try one more thing. Your proxy script saves the webpage to disk and then loads it into an iframe. Instead of loading the saved page in an iframe, try opening it directly in the browser. All the sites that restrict their pages from being loaded in an iframe should then open without any error.
Try this; I think it can work.
My proxy server-side code:
DateFormat df = new SimpleDateFormat("ddMMyyyyHHmmss");
String dirName = df.format(new Date());
String dirPath = "C:/apache-tomcat-7.0.23/webapps/offlineWeb/" + dirName;
String serverName = "http://localhost:8080/offlineWeb/" + dirName;
boolean directoryCreated = new File(dirPath).mkdir();
if (!directoryCreated)
    log.error("Error in creating directory");
String html = Jsoup.connect(url.toString()).get().html();
// Jsoup.parse(String, String) expects the base URI as a String
doc = Jsoup.parse(html, url.toString());
links = doc.select("link");
scripts = doc.select("script");
images = doc.select("img");
// The "abs:" prefix resolves the attribute against the document's base URI.
// Strings are compared by content with isEmpty(), not by reference with != "".
for (Element element : links) {
    String linkHref = element.attr("abs:href");
    if (!linkHref.isEmpty()) {
        element.attr("href", linkHref);
    }
}
for (Element element : scripts) {
    String scriptSrc = element.attr("abs:src");
    if (!scriptSrc.isEmpty()) {
        element.attr("src", scriptSrc);
    }
}
for (Element element : images) {
    String imgSrc = element.attr("abs:src");
    if (!imgSrc.isEmpty()) {
        element.attr("src", imgSrc);
        log.info(imgSrc);
    }
}
And now I am just returning the path where I saved my HTML file.
That's all there is to my server code.
