Retrieving the entire HTML with external JS/CSS/images through JavaScript

I already have a JavaScript file (performing some functions) that will be appended to a webpage. Now I want that JavaScript to collect the entire webpage along with its HTML tags, images, external JavaScript files and external CSS files. I don't want to use jQuery or any other external library here.
My goal is to get the entire webpage, save it, and display it looking as close to the original as possible.
Is this possible with Javascript?
Any help will be greatly appreciated.

Short Answer - No
No, it's not possible with JavaScript, especially the "saving" part, as JavaScript doesn't have file access rights in browser environments (which we assume here), except when developing browser extensions or when explicitly modifying your browser's security properties to allow this.
Long Answer - If You Really Must: The Long and Winding Road...
Loading the Right Content
First you need to figure out whether you want to fetch the page in its static state (as it is sent by the server on the first page load) or in its currently rendered state (after it has been rendered in the browser and scripts have executed, possibly adding content to the page).
Loading Resources
Then you'll need to iterate over all the elements of the DOM, and fetch all external resources (including the ones referenced in CSS files).
You'll probably want to fetch all resources using HTML or plain-text MIME types in your requests, as otherwise your browser might trigger visible downloads with end-user popups instead of performing your downloads transparently.
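As a rough illustration of the collection step, here is a minimal sketch assuming the standard DOM and XMLHttpRequest APIs; it deliberately covers only a few attribute types, and a real implementation would need many more cases (CSS url() references, srcset, iframes, etc.):

// Sketch: collect the URLs of external resources referenced by the document.
function collectResourceUrls() {
  var nodes = document.querySelectorAll('img[src], script[src], link[rel="stylesheet"][href]');
  return Array.prototype.map.call(nodes, function (node) {
    return node.src || node.href;
  });
}

// Fetch one resource as raw bytes; responseType "arraybuffer" keeps the
// browser from interpreting the payload.
function fetchResource(url, callback) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url);
  xhr.responseType = 'arraybuffer';
  xhr.onload = function () { callback(null, xhr.response); };
  xhr.onerror = function () { callback(new Error('Failed to fetch ' + url)); };
  xhr.send();
}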
Updating all references
Next you need to figure out how you'd want to organize your "downloaded" content, and where to put the resources and how to name them to avoid conflicts.
Once done, you need to iterate over all the DOM elements again and update the references to use the paths of your local resources instead of the original remote URLs.
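A minimal sketch of that rewriting pass might look like the following; localNameFor is a hypothetical helper that maps each original URL to the local path you chose in the previous step:

// Sketch: point each element at its local copy instead of the remote URL.
function rewriteReferences(localNameFor) {
  var nodes = document.querySelectorAll('img[src], script[src], link[rel="stylesheet"][href]');
  Array.prototype.forEach.call(nodes, function (node) {
    if (node.getAttribute('src')) {
      node.setAttribute('src', localNameFor(node.src));
    } else if (node.getAttribute('href')) {
      node.setAttribute('href', localNameFor(node.href));
    }
  });
}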
Writing Content to Disk
Now the last bit is to save all these resources to disk, using either your browser's custom APIs or the HTML5 File System APIs.
Exploring the HTML5 FileSystem APIs
Basic Concepts about the FileSystem API
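Note that the FileSystem API linked above was never widely adopted outside Chrome; a more portable (if cruder) sketch is to hand each captured resource to the user as a download via a Blob and a temporary link, roughly like this:

// Sketch: offer one captured resource as a download. This prompts the user
// rather than writing silently to disk, which from a plain web page is
// generally the best you can do.
function saveAsDownload(fileName, data, mimeType) {
  var blob = new Blob([data], { type: mimeType || 'application/octet-stream' });
  var url = URL.createObjectURL(blob);
  var a = document.createElement('a');
  a.href = url;
  a.download = fileName;
  document.body.appendChild(a);
  a.click();
  document.body.removeChild(a);
  URL.revokeObjectURL(url);
}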
Here Be Dragons
None of this guarantees that you'll achieve what you want, as some pages could still contain code that won't behave nicely once downloaded like this. There may be code requesting content from remote URLs or assuming some directory structures and endpoints, or using resource names that you may have modified, etc... (that would be strange, but is not that uncommon).

Related

Should JS dynamically generate metadata/the whole page?

So I am going to have many pages that have a bunch of text in them, which a JS and CSS file will turn into a fully styled webpage. I noticed that the text is usually going to be long, and since there are going to be many webpages, I should lower file size. Also, since I don't want to ruin quality, I have decided that my JS file is going to take the text and make a webpage out of it. Side note: what I am trying to do is make tutorial pages, so I am going to use JS to generate a lot of the things that are on every tutorial page, like the lessons list, to lower file size.
I have noticed that metadata (<head> content) usually takes up some space that JS could generate, so I thought, why don't I just generate this with JS? But then arose the problem that some browsers might not parse it, or it might be slow to parse. So I am asking here on Stack Overflow:
Should JavaScript generate metadata (and maybe almost the whole page, like remove the <head> tag completely and generate it with JS)?
It depends on your desired result.
Google has improved its SEO mechanisms to render your page before indexing it, see here:
https://developers.google.com/search/docs/guides/javascript-seo-basics
However, other bots may not do the same, such as social media crawlers like Facebook's or Twitter's that read Open Graph meta tags, or other search engines like Baidu.
If a bot doesn't render your document, the JavaScript doesn't get executed and your meta tags aren't present.
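For illustration, a minimal sketch of client-side metadata generation (the title string here is made up); a crawler that doesn't execute JavaScript will only ever see whatever was in the static <head>:

// Sketch: generating metadata client-side. Browsers apply this, but
// non-rendering crawlers never see it.
document.title = 'Lesson 3: CSS selectors'; // hypothetical page title

var og = document.createElement('meta');
og.setAttribute('property', 'og:title');
og.setAttribute('content', 'Lesson 3: CSS selectors');
document.head.appendChild(og);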
Additionally, if your initial document does not reference the stylesheets or other CDN resources, things take a bit longer for the client. Imagine the process:
With head
fetch document
fetch resources
render content
Without head
fetch document
render content
fetch resources
re-render
That's way over-simplified but it demonstrates my point.
Alternative:
If your content is that dynamic, you might consider Server-Side Rendering (SSR) or pre-rendering.
You would build your pages programmatically and store/serve them all, or build them on the server-side as they are requested.
https://developers.google.com/web/updates/2019/02/rendering-on-the-web

Is there a downside to including javascript in html instead of linking to an external file?

I'm thinking through ways to speed up a website I'm developing. I know socket connections are expensive, so I was thinking... Is there any downside to including CSS and JavaScript code in the actual HTML/PHP source code as opposed to linking to it?
It seems that instead of making 10 calls to various files I could simply put all of the code in the HTML source and not have any socket calls to external files.
I know I could put everything into one JavaScript file and call that, but that would still create a socket call.
I realize this is probably not going to make much difference and might just be a thought exercise, but is there any real downside to just inlining the code?
Different resources (i.e. HTML, CSS, JS, images ...) do not necessarily require a new socket connection. With HTTP/1.1 the same connection is usually used for multiple resources (but only one after another), and with HTTP/2 multiple resources can be loaded in parallel over the same TCP connection. Thus, rather than optimizing delivery by combining HTML, JS and CSS into a single file, it would be possible to optimize the transport instead by using HTTP/2.
Apart from that, resources like scripts, CSS and images are often shared between HTML pages. In this case, serving the same script again and again would just be wasteful. Proper caching instead enables reuse of shared resources across pages.
And finally, inline script is considered a security problem - just look up Cross-Site Scripting. Having the script separated from the content allows the use of a strict Content Security Policy, which prevents such attacks.
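As a rough sketch (a bare Node.js server used purely to illustrate the header, not your actual stack), a policy like script-src 'self' makes the browser refuse inline scripts unless you explicitly opt back in with 'unsafe-inline':

// Sketch: serve a page whose CSP only allows scripts from your own origin,
// so injected inline <script> blocks are blocked by the browser.
var http = require('http');

http.createServer(function (req, res) {
  res.writeHead(200, {
    'Content-Type': 'text/html',
    'Content-Security-Policy': "default-src 'self'; script-src 'self'"
  });
  res.end('<!DOCTYPE html><html><head><script src="/app.js"></script></head><body></body></html>');
}).listen(8080);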
It depends on different circumstances, for example the size of your JS file.
If your JS file is large and contains lots of code, then you should consider linking it, as there are many advantages -
(i) It can be cached by the browser locally, so that on the user's next visit your website will load faster.
If your JS file is very small, consider embedding it into the HTML, as that has advantages too -
(i) There will be fewer requests to your server.
(ii) If you want some variables to be assigned dynamically, you have to embed the JS in the HTML, because that cannot be done with a linked JS file.
and so on...

Reading a collection of external files locally in JavaScript

I'm working on an app that needs to access a collection of external files. It's basically a music player. It works as expected under a web server, but I also want it to work locally in the browser.
General overview:
index.htm (Small index file with markup, gather external js, css)
index.js (All the app code here)
dir.js (An array of file paths of all music files)
/AHX/ (location of the music files)
ahx.js (music player code)
The two main difficulties for this are:
JavaScript cannot list directory contents, even for a child directory. Instead, I express the file paths as an array of strings.
Loading external files is only possible using XMLHttpRequest, which has security restrictions when running locally/offline, but works in other environments (under HTTP or as a Chrome App, perhaps other platforms, I'm not sure).
Oddly, in the latest Firefox, 2) is not an issue anymore. XMLHttpRequest works locally without disabling security.fileuri.strict_origin_policy. I'm not sure if that is standard behavior, but Chrome doesn't appear to allow it.
In any case, my current solution is generating a list of file-paths in a .js file (previously I used a txt file that required XHR), and using XMLHttpRequest to load the music files. This of course means I need to keep the folder structure and the file-path database in sync, using a shell script to rebuild the dir.js file.
XHR is only supposed to work over HTTP, so the app requires a web server. I want the app to work locally (and not just force the user to install as a Chrome App). So I am asking this question to find alternative methods of reading the data.
One alternative I tried is encoding all 1000 files in base64 strings and storing it in a JS object. This produces a rather large 8MB .js file. It doesn't appear to be slow to load, but I am assuming it isn't exactly efficient... Plus it is a pain to update/maintain.
localStorage, IndexedDB and Web SQL are all options, but there is no way to pre-populate the storage before the app runs. Perhaps the File API could be utilized for a one-time setup of the storage database.
So back to my question: What are some solutions to accessing a large collection of binary files (200+ files, over 6MB etc) locally (i.e. opening the .html file directly)?
Edit: The app in question is on GitHub, to clear up any confusion about my use case. But in general, I'm looking for ways to automatically read these music files from the app locally, without cross-origin errors. Also, here is the 'js-database' version. It stores all 1000 files in an 8MB JS file like so:
[{data:"base64-string-of-data-here",path:"original-path-here"}, ...]
In that way it bypasses the need for XHR.
Edit2: A solution using jszip and IndexedDB appears promising. It is not possible to load multiple files from multiple selected folders, but if the directory tree is zipped, jszip can access an array of all files in the format /FOLDER_HERE/FILE_HERE. The paths and binary data can then be imported into IndexedDB in a one-time setup. It also works fine on file:// URLs which is important.
It is also possible that jszip could be used to effectively build/update a large JSON structure of BASE64 strings of the contents, which doesn't require any setup by the user. Still need to be tested though.
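A rough sketch of that one-time setup, assuming the JSZip library is loaded and using made-up database and store names:

// Sketch: import a zipped directory tree into IndexedDB, keyed by path.
function importZip(file) {
  var open = indexedDB.open('music-db', 1);
  open.onupgradeneeded = function () {
    open.result.createObjectStore('files', { keyPath: 'path' });
  };
  open.onsuccess = function () {
    var db = open.result;
    JSZip.loadAsync(file).then(function (zip) {
      zip.forEach(function (path, entry) {
        if (entry.dir) return;
        entry.async('arraybuffer').then(function (data) {
          db.transaction('files', 'readwrite')
            .objectStore('files')
            .put({ path: path, data: data });
        });
      });
    });
  };
}
// Wire this to a one-time <input type="file"> in a setup step, e.g.:
// document.querySelector('#zip-input').onchange = function (e) { importZip(e.target.files[0]); };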
Don't take this as a definitive answer; this subject interests me too. If people around don't want to take the time to elaborate an answer, please comment, it will be more useful than votes.
From what I've learnt about JavaScript resources, consider that you cannot really bypass the security aspect of the question. Even in open source, you should warn explicitly if you didn't take security into account; people could distribute a modified version of the resources, for example. It depends on what is done with the resources.
If this is for a player, I recommend treating it as a data resource, not as a script resource, because of security (as long as you don't eval strings or such). JSON data could do the job here, but that would mean processing the 1000 files. It's not so hard to write a script that processes the files, though.
HTML5 file API
I haven't used it yet, so I can just give you one or two links, with the downside that it restricts your player to recent browsers.
https://www.html5rocks.com/en/tutorials/file/dndfiles/
HTML5 File API read as text and binary
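To give a rough idea (a minimal sketch, assuming an <input type="file" multiple> with id "picker" somewhere in the page), reading user-selected files with the File API needs no XHR at all:

document.getElementById('picker').addEventListener('change', function (event) {
  Array.prototype.forEach.call(event.target.files, function (file) {
    var reader = new FileReader();
    reader.onload = function () {
      // reader.result is an ArrayBuffer holding the raw bytes of the file
      console.log(file.name, reader.result.byteLength + ' bytes');
    };
    reader.readAsArrayBuffer(file);
  });
});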
(I know, not an answer) Use a library:
Except that in this case, this might be an answer, just because there is no real universal data retrieval in JavaScript. A good library would add that, plus support for old browsers.
Among these solutions, jQuery's JSONP support, for example, allows dynamic cross-domain GET requests. As formatted data (and not script), it is much safer to inject. But keep in mind that you should know in detail what your player does with the binary data, and in what way it could be a risk.
http://api.jquery.com/jQuery.getJSON/
Direct inclusion of script: not recommended
<script src="./sameFolderFile.js"></script>
As for direct script inclusion in a local folder structure, it actually works locally. IE says there is ActiveX content and asks for permission, but it works in Firefox and Chrome. The tag can be dynamically added, but there is a big security risk here: malicious JavaScript code added to the resources will be executed. This can put users at risk.

Is it beneficial to inline all JavaScript when deploying a website

In our HTML page, we have a list of script tags that load many (small) JavaScript source files.
For deployment we plan to concatenate the individual JavaScript files into one bundle which will be included in the HTML page, to save on 'expensive' HTTP requests.
But would it be even more beneficial, to just write all the JavaScript directly into the HTML file, in an in-line Javascript tag?
If the JavaScript code changes on every request, then yes, it's beneficial.
Otherwise: No, because the browser will not be able to cache the JS files.
The best way would be to concatenate them, but don't put them directly into your HTML file. That way the JS file can be cached independently from the (probably) changing HTML source.
A file is better than writing the whole thing into the HTML, as you can cache the JavaScript file coming from your server; but unless you cache all your .html files as well, you won't get this benefit for inline code (i.e. browsers have to keep re-downloading all the inline scripts inside your HTML files).
But would it be even more beneficial, to just write all the JavaScript directly into the HTML file, in an in-line Javascript tag?
No! You would increase the size of every request and destroy cacheability. One big (but external) JS file is the way to go.
Make sure the JS file is emitting the proper caching headers, and it will be loaded only once per client. Unless your JS is exceedingly small (and your description doesn't sound so), that's pretty much the optimum.
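For illustration only (a minimal Node.js sketch, not tied to any particular server or framework), "proper caching headers" boils down to something like a long-lived Cache-Control on the bundled file, with cache-busting handled by renaming or versioning the file:

var http = require('http');
var fs = require('fs');

http.createServer(function (req, res) {
  if (req.url === '/bundle.js') {
    res.writeHead(200, {
      'Content-Type': 'application/javascript',
      // Cache for a year; change the file name (or a version query) to bust it
      'Cache-Control': 'public, max-age=31536000, immutable'
    });
    fs.createReadStream('bundle.js').pipe(res);
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);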
I'd suggest that you compile all your JavaScript into one file and load it with one <script> tag. Yes, HTTP requests take some time, and browsers limit the number of concurrent requests (to one domain).
I wouldn't put all the JavaScript in the HTML, because this mixes logic and presentation, prevents caching (of the JavaScript), etc. Avoid this.
This is the general rule I follow: separate content that changes often from content that changes rarely. This way static content will be cached efficiently. And you can optimize "fluid" content (gzip, minify, etc.) so that it takes less time to load.
I'm assuming that you mean 'embed inside a <script> block' rather than in 'on*' attributes inside the HTML elements. If that's not the case, the answer is a definite no - 'on*' attributes are harder to maintain, and typically bad for accessibility.
Normally the answer is no, because although the user's first request will be more expensive if it has to get external resources, those resources will be cached so future requests will be cheaper. If you embed everything, the user has to load them every time they load the page.
So it depends on a few things, the most important of which are probably:
Are users browsing multiple pages? Will they return? If the answer to both questions is 'no', then there is no benefit from caching, so embedded JavaScript can be quicker.
Is the JavaScript static? If it's dynamic - as in, changes on every page load, then again, caching is irrelevant. You could probably improve your JavaScript architecture to separate the static bits from the dynamic.
You can mix the JavaScript so that static JavaScript is linked, while dynamic or page-specific JavaScript is embedded. This is especially useful with libraries - it may already be cached in the client from another site, but if not, you're still loading from a CDN like Google, so it's very quick.
I wouldn't have thought so.
I always just include files and try to keep my base html looking as clean as possible.
Die-hards will say don't do it, separate content from styles and scripting, and I agree. But if it's not a lot of JS, you may as well save on the additional HTTP requests. Yes, the browser won't cache it, but that's because it won't need to. And on an SEO basis, page ranking is improved by faster page load, which is possibly determined on the first visit, not after caching.

Javascript and website loading time optimization

I know that best practice for including javascript is having all code in a separate .js file and allowing browsers to cache that file.
But when we begin to use many jQuery plugins, each with its own .js file, and our functions depend on them, wouldn't it be better to dynamically load only the JS function and the required .js files for the current page?
Wouldn't it be faster, on a page where I only need one function, to load it dynamically by embedding it in the HTML with a script tag, instead of loading the whole JS along with the plugins?
In other words, aren't there any cases in which there are better practices than keeping our whole javascript code in a separate .js?
It would seem at first glance that this would be a good idea, but in fact it would actually make matters worse. For example, if one page needs plugins 1, 2 and 3, then a file would be built server-side with those plugins in it. Now, the browser goes to another page that needs plugins 2 and 4. This would cause another file to be built; this new file would be different from the first one, but it would also contain the code for plugin 2, so the same code ends up getting downloaded twice, bypassing the version that the browser already has.
You are best off leaving the caching to the browser, rather than trying to second-guess it. However, there are options to improve things.
Top of the list is using a CDN. If the plugins you are using are fairly popular ones, then the chances are that they are being hosted on a CDN. If you link to the CDN-hosted plugins, then for any visitor hitting your site for the first time who has also happened to hit another site using the same plugins from the same CDN, the plugins will already be cached.
There are, of course, other things you can do to speed your JavaScript up. Best practice includes placing all your script include tags as close to the bottom of the document as possible, so as to not hold up page rendering. You should also look into lazy initialization. This involves, for anything that needs significant setup to work, attaching a minimal event handler that, when triggered, removes itself and sets up the real event handler.
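A minimal sketch of that lazy-initialization pattern (the function names are made up):

// Attach a cheap placeholder handler; on first use it removes itself,
// performs the expensive setup, and hands the event to the real handler.
function lazyBind(element, type, heavySetup) {
  function placeholder(event) {
    element.removeEventListener(type, placeholder);
    var realHandler = heavySetup(); // build widgets, load data, etc.
    element.addEventListener(type, realHandler);
    realHandler(event); // don't swallow the event that triggered the setup
  }
  element.addEventListener(type, placeholder);
}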
One problem with having separate JS files is that it will cause more HTTP requests.
Yahoo has a good best-practices guide on speeding up your site: http://developer.yahoo.com/performance/rules.html
I believe Google's Closure library has something for combining JavaScript files and dependencies, but I haven't looked too much into it yet, so don't quote me on it: http://code.google.com/closure/library/docs/calcdeps.html
Also there is a tool called jingo http://code.google.com/p/jingo/ but again, I haven't used it yet.
I keep separate files for each plug-in and page during development, but during production I merge-and-minify all my JavaScript files into a single JS file loaded uniformly throughout the site. My main layout file in my web framework (Sinatra) uses the deployment mode to automatically either generate script tags for all JS files (in order, based on a manifest file) or perform the minification and include a single querystring-timestamped script inclusion.
Every page is given a body tag with a unique id, e.g. <body id="contact">.
For those scripts that need to be specific to a particular page, I either modify the selectors to be prefixed by the body:
$('body#contact form#contact').submit(...);
or (more typically) I have the onload handlers for that page bail early:
jQuery(function($){
  if (!$('body#contact').length) return;
  // Do things specific to the contact page here.
});
Yes, including code (or even a plug-in) that may only be needed by one page of the site is inefficient if the user never visits that page. On the other hand, after the initial load the entire site's JS is ready to roll from the cache.
The network latency is the main problem. You can get a very responsive page if you reduce the HTTP calls to one.
That means all the JS and CSS are bundled into the HTML page. And if you can forget IE6/7, you can inline the images as data:image/png;base64.
When we release a new version of our web app, a shell script minifies and bundles everything into a single HTML page.
Then there is a second call for the data, and we render all the HTML client-side using a JS template library: PURE.
Ensure the page is cached and gzipped. There is probably a size limit to consider. We try to stay under 400KB unzipped, and load secondary resources later when needed.
You can also try a service like http://www.blaze.io. It automatically performs most front-end optimization tactics and also couples in a CDN.
It is currently in private beta, but it's worth submitting your website to.
I would recommend that you join common bits of functionality into individual JavaScript module files and load them only on the pages where they are being used, using RequireJS / head.js or a similar dependency management tool.
An example where you are using lightbox popups, contact forms, tracking, and image sliders in different parts of the website would be to separate these into 4 modules and load them only where needed. That way you optimize caching and make sure your site has no unnecessary flab.
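A rough sketch of that with RequireJS (the module names are made up); each page's entry point pulls in only the modules it actually uses:

// lightbox.js - one self-contained feature module
define(['jquery'], function ($) {
  return function initLightbox() {
    $('.thumbnail').on('click', function () { /* open the popup */ });
  };
});

// gallery page entry point - only loads what this page needs
require(['lightbox', 'slider'], function (initLightbox, initSlider) {
  initLightbox();
  initSlider();
});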
As a general rule it's always best to have fewer files rather than more. It's also important to work on the timing of each JS file, as some are needed BEFORE the page completes loading and some AFTER (i.e. when the user clicks something).
See a lot more tips in the article: 25 Techniques for Javascript Performance Optimization.
Including a section on managing Javascript file dependencies.
Cheers, hope this is useful.
