How to download the entire HTML of a webpage using JavaScript?

Is it possible to download the entire HTML of a webpage using JavaScript, given the URL? What I want to do is develop a Firefox add-on that downloads the content of all the links found in the source of the browser's current page.
Update: the URLs reside in the same domain.

It should be possible using jQuery's Ajax methods. JavaScript in a Firefox extension is not subject to the cross-origin restriction. Here are some tips for using jQuery in a Firefox extension:
Add the jQuery library to your extension's chrome/content/ directory.
Load jQuery in the window load event callback rather than including it in your browser overlay XUL. Otherwise it can cause conflicts (e.g. clobbers a user's customized toolbar).
(function (loader) {
    loader.loadSubScript("chrome://ryebox/content/jquery-1.6.2.min.js");
})(Components.classes["@mozilla.org/moz/jssubscript-loader;1"]
    .getService(Components.interfaces.mozIJSSubScriptLoader));
Use "jQuery" instead of "$". I experienced weird behavior when using $ instead of jQuery (a conflict of some kind I suppose)
Use jQuery(content.document) instead of jQuery(document) to access a page's DOM. In a Firefox extension "document" refers to the browser's XUL whereas "content.document" refers to the page's DOM.
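For example, a minimal sketch of the original goal (collecting every link in the current page) from an overlay script, assuming jQuery has been loaded as above:

// Collect the URL of every link in the page currently shown in the browser.
var urls = [];
jQuery("a[href]", content.document).each(function () {
    urls.push(this.href);
});
// Each URL could then be fetched, e.g. with jQuery.get(url, callback).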
I wrote a Firefox extension for getting bookmarks from my friend's bookmark site. It uses jQuery to fetch my bookmarks in a JSON response from his service, then creates a menu of those bookmarks so that I can easily access them. You can browse the source at https://github.com/erturne/ryebox

You can do XMLHttpRequests (XHRs) if the scheme://domain:port combination is the same as that of the page hosting the JavaScript that fetches the HTML.
Many JS frameworks give you easy XHR support: jQuery, Dojo, etc. Example using Dojo:
function getText() {
    dojo.xhrGet({
        url: "test/someHtml.html",
        load: function (response, ioArgs) {
            // The response is the HTML
            return response;
        },
        error: function (response, ioArgs) {
            return response;
        },
        handleAs: "text"
    });
}
If you prefer writing your own XMLHttpRequest-handler, take a look here: http://www.w3schools.com/xml/xml_http.asp
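For reference, a minimal hand-rolled version might look like this (a sketch, assuming a same-origin URL):

// Fetch a page's HTML as text with a plain XMLHttpRequest.
function getHtml(url, callback) {
    var xhr = new XMLHttpRequest();
    xhr.onreadystatechange = function () {
        if (xhr.readyState === 4 && xhr.status === 200) {
            callback(xhr.responseText); // the raw HTML of the page
        }
    };
    xhr.open("GET", url, true);
    xhr.send(null);
}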

For JavaScript in general, the short answer is no, not unless all pages are within the same domain. JavaScript is limited by the same-origin policy, so for security reasons, you cannot do cross-domain requests like that.
However, as pointed out by Max and erturne in the comments, when JavaScript is written as part of an extension/add-on to the browser, the regular rules about same-origin policy and cross-domain requests do not seem to apply - at least not for Firefox and Chrome. Therefore, using JavaScript to download the pages should be possible using an XMLHttpRequest, or one of the wrapper methods included in your favorite JS library.
If you, like me, prefer jQuery, have a look at jQuery's .load() method, which loads HTML from a given resource and injects it into an element that you specify.
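A minimal sketch of .load(), assuming a same-origin URL and a #result element (both placeholders):

// Fetch HTML from a resource and inject it into the selected element.
jQuery("#result").load("/some/page.html", function (response, status) {
    if (status === "error") {
        console.log("Could not load the page");
    }
});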
Edit:
Made some updates to my answer based on the comments about cross-domain requests made by add-ons.

If all you have in mind is a simple text/web-page downloader, and you only know HTML and JavaScript, you can write a downloader named "download.hta" that uses HTML and JavaScript to drive Msxml2.ServerXMLHTTP.6.0 and the FileSystemObject (FSO).
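A rough JScript sketch of that idea (the URL and output path are placeholders, and error handling is omitted):

// Fetch a page with ServerXMLHTTP and save it to disk with FSO.
var http = new ActiveXObject("Msxml2.ServerXMLHTTP.6.0");
http.open("GET", "http://example.com/page.html", false); // synchronous fetch
http.send();

var fso = new ActiveXObject("Scripting.FileSystemObject");
var file = fso.CreateTextFile("C:\\downloads\\page.html", true); // overwrite if present
file.Write(http.responseText);
file.Close();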

Related

How does "instant inject" in tampermonkey work?

In Tampermonkey's advanced settings, you can find a setting called "Inject Mode" on the "Experimental" tab. Here, you can select a mode called "Instant".
I was wondering, what does it do differently? How does it work? Is it similar to Violentmonkey's injection methods?
Judging by the source code in other extensions like Stylus or Violentmonkey (beta) that recently added the same feature:
The background script makes a Blob with the data, gets its URL via URL.createObjectURL, and puts it into a Set-Cookie header via the chrome.webRequest API.
The content script reads the URL from document.cookie and uses it in a synchronous XMLHttpRequest to fetch the original data.
This trick is based on an answer by Xan.
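A rough sketch of that flow under MV2-era APIs (the names, match patterns, and details are assumptions for illustration, not Tampermonkey's actual code):

// background.js - expose the payload's blob: URL to the page via a cookie.
var blobUrl = URL.createObjectURL(new Blob([scriptSource])); // scriptSource assumed
chrome.webRequest.onHeadersReceived.addListener(
    function (details) {
        details.responseHeaders.push({
            name: "Set-Cookie",
            value: "instantUrl=" + encodeURIComponent(blobUrl)
        });
        return { responseHeaders: details.responseHeaders };
    },
    { urls: ["*://example.com/*"], types: ["main_frame"] },
    ["blocking", "responseHeaders"]);

// content.js (run_at: document_start) - read the URL back and fetch the data
// synchronously, before any page script runs.
var match = document.cookie.match(/instantUrl=([^;]+)/);
var xhr = new XMLHttpRequest();
xhr.open("GET", decodeURIComponent(match[1]), false); // synchronous on purpose
xhr.send();
var data = xhr.responseText;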

Replacing a jQuery script using a Chrome Extension

A website is running a jQuery script. I want to use a Chrome Extension to have the site run my own version of the jQuery script, and not the normal one.
So far, I've managed to make the chrome extension find where the website calls the normal script, and I've replaced it with my own:
document.querySelector("script[src^='website.com/originaljqueryscript']").src = 'myjqueryscript';
As a test, I made myjqueryscript the exact same script as the originaljqueryscript. I set the run_at in the manifest to run at document_end.
When I try to open the website with my script enabled, the console logs the error $(...).dialog is not a function. I think this is because jQuery is not loaded in my Chrome extension. So I found which version of jQuery the website is using and added that to my extension. Now I get the same error: $(...).dialog is not a function. I believe that error is due to a conflict between the two jQuery instances that have been loaded (one on the website, one from my extension).
Am I on the right track, or is there a better way to replace a website's jQuery script with my own?
If this is for a very specific website, loading jQuery from a specific URL, you can use webRequest API to intercept the request to jQuery and redirect it to a file bundled in your extension. E.g.:
chrome.webRequest.onBeforeRequest.addListener(
    function (details) {
        return { redirectUrl: chrome.runtime.getURL("js/myjquery.js") };
    },
    { urls: ["https://example.com/js/jquery.js"] },
    ["blocking"]
);
(Note: this sample is very minimal; you may need to inspect request headers to make sure that the source page is your target site - you really don't want to replace a CDN-provided jQuery for all sites. Note also that, if I remember correctly, the bundled file must be listed under web_accessible_resources for a redirect to an extension resource to work.)
This assumes that the website does not use Subresource Integrity checks; however, I believe it will bypass a script-src Content Security Policy (the redirect is transparent).
Note that .dialog() is part of jQuery UI, not core jQuery; the site presumably loads both, and you'll need to intercept both. It's possible that the site actually bundles them together.

Best option for crawling a website that loads content via ajax [duplicate]

Please advise how to scrape AJAX pages.
Overview:
All screen scraping first requires a manual review of the page you want to extract resources from. When dealing with AJAX, you usually need to analyze a bit more than just the HTML.
When dealing with AJAX, this just means that the value you want is not in the initial HTML document that you requested; instead, JavaScript will be executed that asks the server for the extra information you want.
You can therefore usually simply analyze the JavaScript, see which request it makes, and just call this URL yourself from the start.
Example:
Take this as an example, assume the page you want to scrape from has the following script:
<script type="text/javascript">
function ajaxFunction()
{
var xmlHttp;
try
{
// Firefox, Opera 8.0+, Safari
xmlHttp=new XMLHttpRequest();
}
catch (e)
{
// Internet Explorer
try
{
xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
}
catch (e)
{
try
{
xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
}
catch (e)
{
alert("Your browser does not support AJAX!");
return false;
}
}
}
xmlHttp.onreadystatechange=function()
{
if(xmlHttp.readyState==4)
{
document.myForm.time.value=xmlHttp.responseText;
}
}
xmlHttp.open("GET","time.asp",true);
xmlHttp.send(null);
}
</script>
Then all you need to do is make an HTTP request to time.asp on the same server instead. Example from w3schools.
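In other words (a sketch using Node 18+'s built-in fetch; the host is a placeholder):

// Skip the page entirely and request the AJAX endpoint directly.
async function getTime() {
    const res = await fetch("http://www.example.com/time.asp");
    return res.text(); // the same text the page script would have received
}

getTime().then(console.log);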
Advanced scraping with C++:
For complex usage, and if you're using C++, you could also consider using Mozilla's SpiderMonkey JavaScript engine to execute the JavaScript on a page.
Advanced scraping with Java:
For complex usage, and if you're using Java, you could also consider using Rhino, Mozilla's JavaScript engine for Java.
Advanced scraping with .NET:
For complex usage, and if you're using .NET, you could also consider using the Microsoft.Vsa assembly, recently replaced by ICodeCompiler/CodeDOM.
In my opinion the simplest solution is to use CasperJS, a framework based on the WebKit headless browser PhantomJS.
The whole page is loaded, and it's very easy to scrape any AJAX-related data.
You can check this basic tutorial to learn Automating & Scraping with PhantomJS and CasperJS.
You can also take a look at this example code showing how to scrape Google Suggest keywords:
/*global casper:true*/
var casper = require('casper').create();
var suggestions = [];
var word = casper.cli.get(0);

if (!word) {
    casper.echo('please provide a word').exit(1);
}

casper.start('http://www.google.com/', function() {
    this.sendKeys('input[name=q]', word);
});

casper.waitFor(function() {
    return this.fetchText('.gsq_a table span').indexOf(word) === 0;
}, function() {
    suggestions = this.evaluate(function() {
        var nodes = document.querySelectorAll('.gsq_a table span');
        return [].map.call(nodes, function(node) {
            return node.textContent;
        });
    });
});

casper.run(function() {
    this.echo(suggestions.join('\n')).exit();
});
If you can get access to it, try examining the DOM tree. Selenium does this as part of testing a page. It also has functions to click buttons and follow links, which may be useful.
The best way to scrape web pages that use AJAX, or pages that use JavaScript in general, is with a browser itself or a headless browser (a browser without a GUI). Currently PhantomJS is a well-promoted headless browser using WebKit. An alternative that I used with success is HtmlUnit (in Java, or in .NET via IKVM), which is a simulated browser. Another known alternative is using a web automation tool like Selenium.
I wrote many articles about this subject like web scraping Ajax and Javascript sites and automated browserless OAuth authentication for Twitter. At the end of the first article there are a lot of extra resources that I have been compiling since 2011.
I like PhearJS, but that might be partially because I built it.
That said, it's a service you run in the background that speaks HTTP(S) and renders pages as JSON for you, including any metadata you might need.
It depends on the AJAX page. The first part of screen scraping is determining how the page works. Is there some sort of variable you can iterate through to request all the data from the page? Personally I've used Web Scraper Plus for a lot of screen-scraping tasks because it is cheap, not difficult to get started with, and non-programmers can get it working relatively quickly.
Side Note: Terms of Use is probably somewhere you might want to check before doing this. Depending on the site iterating through everything may raise some flags.
I think Brian R. Bondy's answer is useful when the source code is easy to read. I prefer an easier way: using tools like Wireshark or HttpAnalyzer to capture the packet and get the URL from the "Host" and "GET" fields.
For example, I captured a packet like the following:
GET /hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330 HTTP/1.1
Accept: */*
Referer: http://quote.hexun.com/stock/default.aspx
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Host: quote.tool.hexun.com
Connection: Keep-Alive
Then the URL is:
http://quote.tool.hexun.com/hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330
As a low cost solution you can also try SWExplorerAutomation (SWEA). The program creates an automation API for any Web application developed with HTML, DHTML or AJAX.
Selenium WebDriver is a good solution: you program against a browser, and you automate what needs to be done in the browser. Browsers (Chrome, Firefox, etc.) provide their own drivers that work with Selenium. Since it works as an automated REAL browser, the pages (including JavaScript and AJAX) get loaded just as they do for a human using that browser.
The downside is that it is slow (since you would most probably like to wait for all images and scripts to load before you do your scraping on that single page).
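A minimal sketch with selenium-webdriver for Node (npm install selenium-webdriver); the URL and the #results selector are placeholders:

// Drive a real Chrome, wait for the AJAX-rendered element, then read it.
const { Builder, By, until } = require("selenium-webdriver");

async function scrape() {
    const driver = await new Builder().forBrowser("chrome").build();
    try {
        await driver.get("http://www.example.com/ajax-page");
        // Wait up to 10s for the AJAX content to appear before reading it.
        const el = await driver.wait(until.elementLocated(By.css("#results")), 10000);
        console.log(await el.getText());
    } finally {
        await driver.quit();
    }
}

scrape();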
I have previously linked to MIT's Solvent and EnvJS as my answers for scraping AJAX pages. These projects no longer seem to be accessible.
Out of sheer necessity, I have invented another way to actually scrape AJAX pages, and it has worked for tough sites like findthecompany, which have methods to detect headless JavaScript engines and show no data.
The technique is to use Chrome extensions to do the scraping. Chrome extensions are the best place to scrape AJAX pages because they actually give us access to the JavaScript-modified DOM. The technique is as follows; I will certainly open-source the code sometime. Create a Chrome extension (assuming you know how to create one, and its architecture and capabilities - this is easy to learn and practice, as there are lots of samples):
Use content scripts to access the DOM using XPath. Pretty much get the entire list, table, or dynamically rendered content into a variable as string HTML nodes. (Only content scripts can access the DOM, but they can't contact a URL using XMLHTTP.)
From the content script, use message passing to send the entire stripped DOM, as a string, to a background script. (Background scripts can talk to URLs but can't touch the DOM.) We use message passing to get these to talk.
You can use various events to loop through web pages and pass each stripped HTML node's content to the background script.
Now use the background script to talk to an external server (on localhost) - a simple one created using Node.js/Python. Just send the entire HTML nodes as a string to the server, where the server persists the posted content into files, with appropriate variables to identify page numbers or URLs.
Now you have scraped the AJAX content (HTML nodes as strings), but these are partial HTML nodes. You can now use your favorite XPath library to load them into memory and use XPath to scrape the information into tables or text.
Please comment if you can't understand, and I can write it better (first attempt). Also, I am trying to release sample code as soon as possible.
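In the meantime, a minimal sketch of the message-passing core described in the steps above (the selector, endpoint, and message shape are all assumptions):

// content.js - grab the rendered nodes and ship them to the background script.
var nodes = document.querySelectorAll("#results .row"); // assumed selector
var html = Array.prototype.map.call(nodes, function (n) {
    return n.outerHTML;
}).join("\n");
chrome.runtime.sendMessage({ type: "scraped", page: location.href, html: html });

// background.js - receive the stripped DOM and persist it via a local server.
chrome.runtime.onMessage.addListener(function (msg) {
    if (msg.type !== "scraped") return;
    var xhr = new XMLHttpRequest();
    xhr.open("POST", "http://localhost:3000/save"); // assumed endpoint
    xhr.setRequestHeader("Content-Type", "application/json");
    xhr.send(JSON.stringify({ page: msg.page, html: msg.html }));
});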

javascript failing with permission denied error message

I have a classic ASP web page that used to work... but the network guys have made a lot of changes, including moving the app to a Windows 2008 server running IIS 7.5. We also upgraded to IE 9.
I'm getting a Permission denied error message when I try to click on the following link:
<a href=javascript:window.parent.ElementContent('SearchCriteria','OBJECT=321402.EV806','cmboSearchType','D',false)>
But other links like the following one work just fine:
<a href="javascript:ElementContent('SearchCriteria','OBJECT=321402.EV806', 'cmboSearchType','D',false)">
The difference is that the link that is failing is in an iframe. I noticed on other posts, it makes a difference whether or not the iframe content is coming from another domain.
In my case, it's not. But I am getting data from another server by doing the following...
set objhttp = Server.CreateObject("winhttp.winhttprequest.5.1")
objhttp.open "get", strURL
objhttp.send
and then I change the actual HTML that I get back ... add some hyperlinks etc. Then I save it to a file on my local server (saved as *.html files).
Then when my page is loading, I look for the specific HTML file and load it into the iframe.
I know some group policy options in IE have changed... and I'm looking into those changes. But the fact that one JavaScript link works makes me wonder whether the problem lies somewhere else...?
Any suggestions would be appreciated.
Thanks.
You could try with Msxml2.ServerXMLHTTP instead of WinHttp.WinHttpRequest.
See "differences between Msxml2.ServerXMLHTTP and WinHttp.WinHttpRequest?" for the differences between the two.
On this excellent site about ASP you will find plenty of code samples showing how to use Msxml2.ServerXMLHTTP, which is the more recent of the two:
http://classicasp.aspfaq.com/general/how-do-i-read-the-contents-of-a-remote-web-page.html
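For classic ASP with JScript, a minimal sketch might look like this (strURL is assumed to hold the remote address, mirroring the snippet in the question):

// Fetch a remote page server-side with Msxml2.ServerXMLHTTP.
var objHttp = Server.CreateObject("Msxml2.ServerXMLHTTP.6.0");
objHttp.open("GET", strURL, false); // synchronous request
objHttp.send();
var html = objHttp.responseText;    // the remote page's HTML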
About the IE9 issue: connect a PC with an older IE or another browser to test whether the browser is the culprit. Also, in IE9 (or better, in Firefox/Firebug) use the development tools (F12) and watch the console for errors while the contents of the iframe load.
Your method of getting dynamic pages is not efficient, I'm afraid; ASP itself can do that, and you could use e.g. a div instead of an iframe and replace its contents with what you get from the request. I would need to see more code to give better advice.

Chrome extension to modify page's script includes and JS

I work on a javascript library that customers include on their site to embed a UI widget. I want a way to test dev versions of the library live on the customer's site without requiring them to make any changes to their code. This would make it easy to debug issues and test new versions.
To do this I need to change the script include to point to my dev server, and then override the load() method that's called in the page to add an extra parameter to tell it what server to point to when making remote calls.
It looks like I can add JS to the page using a chrome extension, but I don't see any way to modify the page before it's loaded. Is there something I'm missing, or are chrome extensions not allowed to do this kind of thing?
I've done a fair amount of Chrome extension development, and I don't think there's any way to edit a page source before it's rendered by the browser. The two closest options are:
Content scripts allow you to toss in extra JavaScript and CSS files. You might be able to use these scripts to rewrite existing script tags in the page, but I'm not sure it would work out, since any script tags visible to your script through the DOM are already loaded or are being loaded.
The webRequest API allows you to hijack HTTP requests, so you could have an extension reroute a request for library.js to library_dev.js.
Assuming your site is www.mysite.com and you keep your scripts in the /js directory:
chrome.webRequest.onBeforeRequest.addListener(
    function (details) {
        if (details.url == "http://www.mysite.com/js/library.js")
            return { redirectUrl: "http://www.mysite.com/js/library_dev.js" };
    },
    { urls: ["*://www.mysite.com/*.js"] },
    ["blocking"]);
The HTML source will look the same, but the document pulled in by <script src="library.js"></script> will now be a different file. This should achieve what you want.
Here's a way to modify content before it is loaded on the page using the webRequest API. This requires the content to be loaded into a string variable before the onBeforeRequest listener returns. This example is for JavaScript, but it should work equally well for other types of content.
chrome.webRequest.onBeforeRequest.addListener(
    function (details) {
        var javascriptCode = loadSynchronously(details.url);
        // modify javascriptCode here
        return { redirectUrl: "data:text/javascript,"
            + encodeURIComponent(javascriptCode) };
    },
    { urls: ["*://*.example.com/*.js"] },
    ["blocking"]);
loadSynchronously() can be implemented with a regular XMLHttpRequest. Synchronous loading will block the event loop and is deprecated in XMLHttpRequest, but it is unfortunately hard to avoid with this solution.
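For example, a minimal sketch of loadSynchronously() (synchronous XHR, deprecated as noted, but required here because the listener must return the data):

// Block until the resource at url has been fetched, then return its text.
function loadSynchronously(url) {
    var xhr = new XMLHttpRequest();
    xhr.open("GET", url, false); // false = synchronous
    xhr.send();
    return xhr.responseText;
}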
You might be interested in the hooks available in the Opera browser. Opera used to have* very powerful hooks, available both to User JavaScript files (single-file things, very easy to write and deploy) and Extensions. Some of these are:
BeforeExternalScript:
This event is fired when a script element with a src attribute is encountered. You may examine the element, including its src attribute, change it, add more specific event listeners to it, or cancel its loading altogether.
One nice trick is to cancel its loading, load the external script in an AJAX call, perform text replacement on it, and then re-inject it into the webpage as a script tag, or using eval.
window.opera.defineMagicVariable:
This method can be used by User JavaScripts to override global variables defined by regular scripts. Any reference to the global name being overridden will call the provided getter and setter functions.
window.opera.defineMagicFunction:
This method can be used by User JavaScripts to override global functions defined by regular scripts. Any invocation of the global name being overridden will call the provided implementation.
*: Opera recently switched over to the WebKit engine, and it seems they have removed some of these hooks. You can still find Opera 12 for download on their website, though.
I had an idea that I haven't tried, but it should work in theory:
Run a content script that executes before the document is loaded, and register a ServiceWorker to replace the page's requested file content in real time. (A ServiceWorker can intercept all requests in the page, including those initiated directly through the DOM.)
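The interception half of that idea might look like this (untested, like the idea itself; registering the worker against the page's origin is the hard part, and the paths below are placeholders):

// sw.js - serve a replacement whenever the page requests the target script.
self.addEventListener("fetch", function (event) {
    if (event.request.url.endsWith("/js/library.js")) {   // assumed target
        event.respondWith(fetch("/js/library_dev.js"));   // assumed replacement
    }
});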
Chrome extensions (Manifest V3) allow us to add rules via declarativeNetRequest:
chrome.declarativeNetRequest.updateDynamicRules({
    addRules: [
        {
            "id": 1002,
            "priority": 1,
            "action": {
                "type": "redirect",
                "redirect": {
                    "url": "https://example.com/script.js"
                }
            },
            "condition": {
                "urlFilter": "https://www.replaceme.com/js/some_script_to_replace.js",
                "resourceTypes": [
                    "csp_report", "font", "image", "main_frame", "media",
                    "object", "other", "ping", "script", "stylesheet",
                    "sub_frame", "webbundle", "websocket", "webtransport",
                    "xmlhttprequest"
                ]
            }
        }
    ],
    removeRuleIds: [1002]
});
and debug it by adding a listener:
chrome.declarativeNetRequest.onRuleMatchedDebug.addListener(
    c => console.log('onRuleMatchedDebug', c)
);
It's not a Chrome extension, but Fiddler can change the script to point to your development server (see this answer for setup instructions from the author of Fiddler). Also, with Fiddler you can set up a search-and-replace to add that extra parameter you need.
