How to download and query html pages where JS processing is necessary?

How to download and query html pages where JS processing is necessary? - javascript

I often compile informal datasets by running some kind of XPath/XQuery on publicly available web pages. Usually the structure of the HTML is regular enough that useful information can be extracted easily.
But today I've come across tunefind.com. This website makes extensive use of the REACTJS framework, and so most of the structure of the page is configured client-side by Javascript. The pages, when initially downloaded, are very basic and missing a lot of information. The pages are populated by a script that uses a hopelessly messy blob of JSON data at the bottom of the page.
The only way I can think of to deal with this would be to use some kind of GUI-based web engine and just not display the GUI part. But that is a preposterous amount of work for these casual little CLI tools that I use to gather information.
Is there any way to perform the javascript preprocessing without dealing with unnecessary graphics?

Even if you were to process without the graphics the react javascript will be geared towards running in a browser context, at the very least it will expect a functioning DOM to exist, the application itself may also require clicks / transitions to happen before you can see some data.
Your best bet then is to load the page in a browser, to keep this simple, there are plenty of good browser automation frameworks designed for this.
I've used a fair few libraries over the years including phantomJS and recently I've gotten the most mileage out of nightmarejs.
It runs an electron browser for you and gives you a useful promisified javascript API to control it with, that has common browser functions such as clicking, following links etc.
You can configure it to hide the browser which is useful for making a CLI tool, however its a bit of a pseudo-headless mode and will still require a windowing/graphical context (e.g. x window).
Hope this helps.
PS - If you're at all used to docker it's not hard to make this just a running container!

Related

Client-side Website Localization Using URL Path

I'm working on localizing a website that I recently built - https://xmllint.com
The project is rather small, and I mostly use it to teach myself javascript along with Webpack and other web-related technologies/frameworks.
The website is 100% browser-based and does not have a lot of content. For that reason, I decided to go with this approach to translate the content itself.
The replacement of the placeholders with the 'real' content happens via javascript that is at the bottom of the HTML. Ultimately I want to have the content ready before the page renders. Just so that that search engines can index the new pages nicely.
What I want to achieve is that the page itself detects the language code (e.g., https://xmllint.com/es/ for Spanish) from the URL and then performs the translation based on that value.
What I'm struggling with is how to handle the part of the URL in the web page itself as the directory itself does not exist on the server directly.
So far, I tried redirecting all HTTP 404 codes to the index.html file itself (on the hosting side) - As suggested for SPAs.
This leads me to problems loading the resources as the relative paths now include the language code part of the URL.
Two ideas came to mind.
Improve the current Webpack build so that I only deliver a single file including all assets. That way I would not have problems with relative paths and I should be good. (Is Single page application just one page using for entire web application?)
Should I introduce a routing framework like Vue?
What I'm not asking for is
How to parse the URL itself.
For SEO reasons I also don't want to use URL parameters.
Hacky ideas or workarounds. I have no time pressure and want to know how this is done best.
Any help/ideas are greatly appreciated.

Under the circumstances that you have no time pressure, I'd personally recommend to use a JavaScript framework - or more specifically - Vue.js. Since you already mentioned it, I assume you have basic knowledge of it.
I see various ways to benefit from choosing this path:
The actual problem you're facing will no longer be an issue. The application will handle all the routing, so all you have to do is return the index.html and you're good to go
The developer experience (build process, hot reload, deployment, ...) will dramatically improve your daily work
Your bundle size will very likely reduce
You're prepared for future growth of your application
Best of all: you're challenging yourself by using a technology you probably have not much experience with. Speaking for myself, that should be reason enough. :-)
Happy coding!

PHP template system vs javascript AJAX template

My PHP template looks like this:
$html=file_get_contents("/path/to/file.html");
$replace=array(
"{title}"=>"Title of my webpage",
"{other}"=>"Other information",
...
);
foreach(replace AS $search=>$replace){
$html=str_replace($search,$replace,$html);
}
echo $html;
I am considering switching to a javascript/ajax template system. The AJAX will fetch the $replace array in JSON format and then I'll use javascript to replace the HTML.
The page would then be a plain .html file and a loading screen would be shown until the ajax was complete.
Is there any real advantages to this or is the transition a waste of time?
A few of the reasons I think this will be beneficial:
Page will still load even if the Mysql or PHP services are down. If the ajax fails I can handle it with an error message.
Bot traffic (and anything else that doesnt run JS) will cause very little load to my server since the ajax will never be sent.
Please let me know what your thoughts are.

My 2cents is it is better to do the logic on the template side (javascript). If you have a high traffic site you can off load some of the processing to each computer calling the site. Maybe less servers.
With Javascript frameworks like AngularJs the template stuff is pretty simple and efficient. And the framework will do caching for you.
Yes, SEO can be an issue with certain sites. There at proxy tools you can put in place that will render the site and return the static html to the bot. Plus I think some bots render javascript these days.
Lastly, I like to template on the front-end because I like the backend to be a generic data provider (RESTful API). This way I can build a generic backend that drives web / mobile and other platforms in a generic way. The UI logic can be its separate thing in javascript.
But it comes down to the design needs of your application. I build lots of Software as a service applications so a single page application works well for me.

I've worked with similar design pattern in other projects. There are several ways to do this and the task would involve managing multiple project or application modules. I am assume you are working with a team of developers and not using either PHP or JavaScript MVC framework.
PHP Template
For many reasons, I'm against using “search and replace” method especially using server-side scripting language to parse HTML documents as a templating kit.
Why?
As you maintain business rules and project becomes larger, you will
find yourself reading through a long list of regular expressions,
parse HTML into DOM, and/or complicated algorithms for searching
nodes to replace with correct text(s).
If you had a placeholder, such as {title}, that would help the
script to have fewer search and replace expressions but the design
pattern could lead to messy sharing with multiple developers.
It is ok to parse one or two HTML files to manage the output but not
the entire template. The network response could be slower with
multiple and repetitive trips to server and that's just only for
template. There could be other scripts that is also making trips to
the server for different reason unrelated to template.
AJAX/JavaScript
Initially, AJAX with JavaScript might sound like a neat idea but I'm still not convinced.
Why?
You can't assume web browser is JavaScript-enabled in every mobile
or desktop. You might need to structure the HTML template in few
ways to manage the output for non-JavaScript browsers. You might
need to include <noscript> and/or <iframe> tags on every page. And,
managing alternative template for non-JavaScript browser can be
tedious.
Every web browser interpret JavaScript differently. Most developers
should know few differences between IE, FireFox, Chrome, Safari, and
to name few. You might need to create multiple JavaScript files to
detect then load JavaScript for that specific web browser. You
update one feature, you have to update script for all web browsers.
JavaScript is visible in page source. I wouldn't want to display
confidential JavaScript functions that might include credentials,
leak sensitive data about web services, and/or SQL queries. The idea
is to secure your page as much as possible.
I'm not saying both are impossible. You could still do either PHP or JavaScript for templating.
However, my “ideal” web structure should consist of a reliable MVC like Zend, Spring, or Magnolia. Those MVC framework include many useful features such as web services, data mapping, and templating kits. Granted, it's difficult for beginners with configuration requirements to integrate MVC into your project. But in the end, you could delegate tasks in configurations, MVC concepts, custom SQL queries, and test cases to developers. That's my two cents.

I think the most important aspects you forgot are:
SEO : What about search engine bots ? They wont be able to index your content if it is set by javascript only.
Execution and Network Latency : When your service is working, the browser will wait until the page is loaded (let's say 800ms) before making the extra Ajax calls to get your values. This might add an extra 500ms to get it (depending on network speed and geographic location...). If you have sent all the generated data by your server, you would have spent only ~1ms more to prepare the complete response. You would have a lot of waiting on a blank page.
Caching : You could cache the generated pages on your web app. That way your load will be minimized as well. And also, if you still want to deliver content while your backend services (MySQL/PHP..) are down you could even use Apache or Nginx caching.
But I guess it really depends on what you want to do.
For fast and simple pages, which seems to be your case, stick with backend enhancements.
For a dynamic/interactive app which can afford loading times, and doesn't care about SEO, you can delegate most things to the front-end. But then use an advanced framework like Angular, to handle templating, caching, etc...

Using node.js to serve content from a Backbone.js app to search crawlers for SEO

Either my google-fu has failed me or there really aren't too many people doing this yet. As you know, Backbone.js has an achilles heel--it cannot serve the html it renders to page crawlers such as googlebot because they do not run JavaScript (although given that its Google with their resources, V8 engine, and the sobering fact that JavaScript applications are on the rise, I expect this to someday happen). I'm aware that Google has a hashbang workaround policy but it's simply a bad idea. Plus, I'm using PushState. This is an extremely important issue for me and I would expect it to be for others as well. SEO is something that cannot be ignored and thus cannot be considered for many applications out there that require or depend on it.
Enter node.js. I'm only just starting to get into this craze but it seems possible to have the same Backbone.js app that exists on the client be on the server holding hands with node.js. node.js would then be able to serve html rendered from the Backbone.js app to page crawlers. It seems feasible but I'm looking for someone who is more experienced with node.js or even better, someone who has actually done this, to advise me on this.
What steps do I need to take to allow me to use node.js to serve my Backbone.js app to web crawlers? Also, my Backbone app consumes an API that is written in Rails which I think would make this less of a headache.
EDIT: I failed to mention that I already have a production app written in Backbone.js. I'm looking to apply this technique to that app.

First of all, let me add a disclaimer that I think this use of node.js is a bad idea. Second disclaimer: I've done similar hacks, but just for the purpose of automated testing, not crawlers.
With that out of the way, let's go. If you intend to run your client-side app on server, you'll need to recreate the browser environment on your server:
Most obviously, you're missing the DOM (Document Object Model) - basically the AST on top of your parsed HTML document. The node.js solution for this is jsdom.
That however will not suffice. Your browser also exposes BOM (Browser Object Model) - access to browser features like, for example, history.pushState. This is where it gets tricky. There are two options: you can try to bend phantomjs or casperjs to run your app and then scrape the HTML off it. It's fragile since you're running a huge full WebKit browser with the UI parts sawed off.
The other option is Zombie - which is lightweight re-implementation of browser features in Javascript. According to the page it supports pushState, but my experience is that the browser emulation is far from complete - however give it a try and see how far you get.

I'm going to leave it to you to decide whether pushing your rendering engine to the server side is a sound decision.
Because Nodejs is built on V8 (Chrome's engine) it will run javascript, like Backbone.js. Creating your models and so forth would be done in exactly the same way.
The Nodejs environment of course lacks a DOM. So this is the part you need to recreate. I believe the most popular module is:
https://github.com/tmpvar/jsdom
Once you have an accessible DOM api in Nodejs, you simply build its nodes as you would for a typical browser client (maybe using jQuery) and respond to server requests with rendered HTML (via $("myDOM").html() or similar).

I believe you can take a fallback strategy type approach. Consider what would happen with javascript turned off and a link clicked vs js on. Anything you do on your page that can be crawled should have some reasonable fallback procedure when javascript is turned off. Your links should always have the link to the server as the href, and the default action happening should be prevented with javascript.
I wouldn't say this is backbone's responsibility necessarily. I mean the only thing backbone can help you with here is modifying your URL when the page changes and for your models/collections to be both client and server side. The views and routers I believe would be strictly client side.
What you can do though is make your jade pages and partial renderable from the client side or server side with or without content injected. In this way the same page can be rendered in either way. That is if you replace a big chunk of your page and change the url then the html that you are grabbing can be from the same template as if someone directly went to that page.
When your server receives a request it should directly take you to that page rather than go through the main entry point and the load backbone and have it manipulate the page and set it up in a way that the user intends with the url.
I think you should be able to achieve this just by rearranging things in your app a bit. No real rewriting just a good amount of moving things around. You may need to write a controller that will serve you html files with content injected or not injected. This will serve to give your backbone app the html it needs to couple with the data from the models. Like I said those same templates can be used when you directly hit those links through the routers defined in express/node.js

This is on my todo list of things to do with our app: have Node.js parse the Backbone routes (stored in memory when the app starts) and at the very least serve the main pages template at straight HTML—anything more would probably be too much overhead /processing for the BE when you consider thousands of users hitting your site.
I believe Backbone apps like AirBnB do it this way as well but only for Robots like Google Crawler. You also need this situation for things like Facebook likes as Facebook sends out a crawler to read your og:tags.

Working solution is to use Backbone everywhere
https://github.com/Morriz/backbone-everywhere but it forces you to use Node as your backend.
Another alternative is to use the same templates on the server and front-end.
Front-end loads Mustache templates using require.js text plugin and the server also renders the page using the same Mustache templates.
Another addition is to also render bootstrapped module data in javascript tag as JSON data to be used immediately by Backbone to populate models and collections.

Basically you need to decide what it is that you're serving: is it a true app (i.e. something that could stand in as a replacement for a dedicated desktop application), or is it a presentation of content (i.e. classical "web page")? If you're concerned about SEO, it's likely that it's actually the latter ("content site") and in that case the "single-page app" model isn't appropriate; you really want the "progressively enhanced website" model instead (look up such phrases as "unobtrusive JavaScript", "progressive enhancement" and "adaptive Web design").
To amplify a little, "server sends only serialized data and client does all rendering" is only appropriate in the "true app" scenario. For the "content site" scenario, the appropriate model is "server does main rendering, client makes it look better and does some small-scale rendering to avoid disruptive page transitions when possible".
And, by the way, the objection that progressive enhancement means "making sure that a user can see doesn't get anything better than a blind user who uses text-to-speech" is an expression of political resentment, not reality. Progressively enhanced sites can be as fancy as you want them to from the perspective of a user with a high-end rendering system.

Why do web applications send HTML over the wire?

This question pertains to web applications. I have very little web app development experience, so might be missing some very obvious points/issues. Please point them out.
As I understand, in most web applications, a web server sends HTML over the wire to a client (browser). This happens every time a HTTP request is made. I feel this is very wasteful of bandwidth.
1) Since browsers can run JavaScript, why don't we just send a JavaScript program which can generate the webpage's HTML content (which the browser then renders).
2) Further a browser might cache the JavaScript program and next time the server only need send the data. The protocol might involve the browser sending the "program version" it has.
Consider an example of a relatively simple website Hacker News [http://news.ycombinator.com]. Let us separate the data (30 posts + their metadata) from its presentation. Assuming 1) above, the server can just send the data (say in JSON) + a JavaScript program to generate HTML. This gist shows the idea. The data for the 30 posts is in JSON [http://www.json.org/js.html] format. For this particular example the data transferred is cut in 1/2 (size of data+JavaScript / size of HTML). Further if browsers can do 2) above, it reduces the data transferred on each visit to 1/4 (size of data / size of HTML). [Note: this analysis is without considering compression; gzip,deflate is very successful in reducing the size of HTML. But isn't prevention better than cure?]
I see atleast the following advantages of this :-
* For most web pages, it will reduce the size of data transferred over the wire.
* Forces web apps to separate data from its presentation.
Disadvantages might include - more complex browsers, time to run the JavaScript program to generate HTML (this might get offset by the reduction in data size).
Now my question is - why are web applications not developed this way, or, why do web applications send HTML over the wire? Surely the web server (sending out HTML) doesn't care about HTML at all, so why should it, first, generate it, and then send it over the wire?

There are a few reasons, some of them historical this is by no means a complete list but just some of my experiences:
HTML predates JS, and a lot of scripts and libraries predate JS
Older browsers (think IE<=6) had rubbish, inconsistent JS engines, their rendering engines were much more consistent in how they treat HTML. So many more libraries and scripts predate consistent JS
It is a nightmare to debug applications written as you suggest if they are not constructed right (we have one at my work, it takes 30 minutes to find where a piece of html is actually generated)
It is a lot more work to do it right - why not use templates or static docs or something much simpler
Its not really a problem - HTML compresses really well
What you suggest is done - its called AJAX (OK, so ajax is more general than this but you all know what i mean)
It simply doesn't work for most plain-text user agents including those used by most search engines. If this page is serving most of your content, its generally a good idea to make it easy for Google to parse

Well the obvious reason on why this is the case is that JavaScript wasn't around when we started sending HTML around, and HTML was an improvement to sending around plaintext documents.
The reason we don't do this now: we eschew complex solutions to problems that aren't really problems.
Average internet connections download nearly 1M bytes per second, and web browsers are quite adept at parsing and starting to render this HTML before it's even all ready to be. They're also great at parallelizing the downloading of resources on the page. If we want to save a few bytes at the cost of some compute cycles, we gzip content before sending it. Problem solved.
And for the record, we do this with AJAX in complex webpages (checkout Github's source browsing for a great example of how awesome this can be).

What you suggest can, and is, done. Remember, web pages used to be static documents. Full blown web-based applications are a relatively recent idea.
I might also suggest that it isn't necessarily more efficient, especially when your pages are sent gzipped.

What you suggest is basically what a JavaScript full stack framework like ExtJS does. You can create rich, data intensive applications without writing any HTML -- well, only enough to reference the necessary .js libraries. The complex DOM needed for layouts, grids, forms etc is all created by the framework.

The simple answer is that HTML is older. Why is C99 not fully implemented with a lot of compilers? They figure 1989 is new enough for them. Also, JavaScript exercises a lot more control over people's browsers than they seem to want. Conditional statements and encoded data pose a security concern, and some people want to keep that can of worms closed to begin with. True, HTML is a very inefficient markup, but the size is insignificant compared to the images you download from the internet. That favicon takes up as much data as the page itself, and it's only 16 pixels across.

A good reason that the server-side code of a web application might do lots of HTML template work on the server side is that in many server environments it's not made easy to bundle up server-side data structures (object graphs) for easy delivery to the client. There may be information kept in server-side data structures that really shouldn't be delivered out to the client. Thus in order to send out a "pure" data-only response, the server would have to trim off sensitive data before delivering out the JSON. That's not an unsolvable problem, but I don't know of many server frameworks that facilitate a solution.
The server has direct, unfettered access to the database and to everything else that makes an application work: user preferences, history, account details, system settings, etc. To build an application that's client-centric for rendering purposes would mean concocting ways of keeping all that information intact and up-to-date on the client. For a lot of applications, that might not be terribly easy.
Finally, it's only relatively recently that it would make sense to trust a browser to provide a stable enough platform for building a long-lived "application environment" as a continually-updating web page. By building a web app such that pages are sometimes completely reloaded, there are lots of little "reboots". That's a cheap and dumb way of keeping a lid on at least some kinds of memory leaks.

Most implementations of sites with heavy Javascript use won't start executing until the DOM has fully loaded; then you'll get every page with 'loading screens' when the page wrapper has downloaded, but none of the content has.
Also, do remember that not all users have Javascript enabled, and not all browsers support high-level Javascript (think mobiles).

I would send HTML in a response if I wanted my application to work without Javascript. I would write HTML rendering code in my server-side language (most of the time not Javascript), which could then be used for two purposes: serving whole HTML pages, and serving bits of HTML in response to XHRs.
If the Javascript code is restricted to things like reporting UI events and replacing innerHTML with server-generated code, I don't have to duplicate any of my application logic across languages/frameworks. This duplication problem is one of the reasons why server-side Javascript is getting people excited.

Building Standalone Applications in JavaScript

With the increased power of JavaScript frameworks like YUI, JQuery, and Prototype, and debugging tools like Firebug, doing an application entirely in browser-side JavaScript looks like a great way to make simple applications like puzzle games and specialized calculators.
Is there any downside to this other than exposing your source code? How should you handle data storage for this kind of program?
Edit: yes, Gears and cookies can be used for local storage, but you can't easily get access to files and other objects the user already has around. You also can't save data to a file for a user without having them invoke some browser feature like printing to PDF or saving page as a file.

I've written several application in JS including a spreadsheet.
Upside:
great language
short code-run-review cycle
DOM manipulation is great for UI design
clients on every computer (and phone)
Downside:
differences between browsers (especially IE)
code base scalability (with no intrinsic support for namespaces and classes)
no good debuggers (especially, again, for IE)
performance (even though great progress has been made with FireFox and Safari)
You need to write some server code as well.
Bottom line: Go for it. I did.

Another option for developing simple desktop like applications or games in JavaScript is Adobe AIR. You can build your app code in either HTML + JavaScript or using Flash/Flex or a combination of both. It has the advantage of being cross-platform (actually cross-platform, Linux, OS X, and Windows. Not just Windows and OS X).
Heck, it may be the only time in your career as a developer that you can write a web page and ONLY target ONE browser.

SproutCore is a wholly JavaScript-hosted application framework, borrowing concepts particularly from Cocoa (such as KVO) and Ruby on Rails (such as using a CLI generator for your models, views and controllers). It includes Prototype, but builds plenty of stuff such as sophisticated controls on top of that. Its Photos demo is arguably impressive (especially in Safari 3.1).
Greg already pointed you to Gears; in addition, HTML 5 will come with a standardized means of local storage. Safari 3.1 ships with an implementation where you have a per-site SQLite database with user-settable size maximums, as well as a built-in database browser with SQL querying. Unfortunately, it will be a long time until we can expect broad browser support. Until then, Gears is indeed an alternative (but not for Safari… yet!). For simpler storage, there is of course always cookies.

The downside to this would be that you are at the mercy of them having js enabled. I'm not sure that this is a big deal now. Virtually every browser supports js and has it enabled by default.
Of course the other downside would be performance. You are again at the mercy of the client handling all the intensive work. This also may not be that big of a deal, and would be dependent on the type of app you are building.
I've never used Gears, but it looks like it is worth a shot. The backup plan would be to run some server side script through ajax that dumps your data somewhere.
Not completely client side, but oh well.

Nihilogic (not my site) does a lot of stuff with Javascript. They even have several games that they've made in Javascript.
I've also seen a neat roguelike game made in Javascript. Unfortunately, I can't remember what it was called...

If you want to write a standalone JavaScript application, look at XULrunner. It's what Firefox is built on, but it is also built so that you can distribute it as an application runtime. You will write some of the interface in JavaScript and use JavaScript for your code.

Gears might provide the client-side persistent data storage you need. There isn't a terribly good way of not exposing your source code, though. You could obfuscate it but that only helps somewhat.
I've done simple apps like this for stuff like a Sudoku solver.

You might run into performance issues given that you're completely at the mercy of the client's Javascript interpreter. Gears would be a nice way of data storage, but I don't think it has penetrated the market that much. You could just use cookies if you're not fussy about that kind of thing.

I'm with ScottKoon here, Adobe AIR is great. I've really only made one really nice (imho) widget thus far, but I did so using jQuery and Prototype.js, which floored in such wonderful ways because I didn't have to learn a whole new event model. Adobe AIR is really sweet, the memory foot print isn't too bad, upgrading to a new version is built into AIR so it's almost automatic, and best of all it's cross-platform...they even have an alpha-version for Linux, but it works pretty well already on my Eee.

Standalone games in GWT:
http://gpokr.com/
http://kdice.com/

In regard to saving files from a javascript application:
I am really excited about the possibilities of client-side applications. Flash 10 introduced the ability to create files for save right in the browser. I thought it was super cool, so I built a javascript+flash component to wrap the saving feature. Right now it only works for creating text based files (vcard, ical, xml, html, css, etc.)
Downloadify Home Page
Source Code & Documentation on Github
See It In Use at Starter for jQuery
I am looking to add support for non-text files soon, but this is a start.

My RSS feeds have served me well- I found that Javascript roguelike!
It's called The Tombs of Asciiroth.

Given that you're going to be writing some server code anyway, it makes sense to keep storage on the server for a lot of domains (address books, poker scores, gui configuration, etc.,.) For anything the size of what you'll get in Webkit or Gears, you can probably also keep it on your server.
The advantage of keeping it on your server is two-fold:
You can integrate it fairly simply as a Model layer in a typical MVC framework, and,
Users get a consistent view without being tied to their browser/PC, or in a less-than-ideal environment (Internet Cafés).
The server code for handling this can also be fairly trivial, particularly if it's written with this task in mind, so it's not a huge cognitive burden.

Go with qooxdoo. They recently realsed 1.0, although most users of it say it was ripe for 1.0 at least two versions ago.
I compared qooxdoo with YUI and ext, and I think qooxdoo is the way to go for programmers - YUI isn't that polished as qooxdoo, from a programmer's point of view and ext has a not so friendly licensing model.
A few of the strong points (for me) of qooxdoo are:
extremely clean code
the nicest OO programming model I've seen among Javascript frameworks
an extremely rich UI widget library
It also features a test runner for unit tests, an API doc generator and reader, a logging facility, and several useful features for debugging, grouped under something called Inspector.
The only downside is that there aren't readymade themes (something like skins) for qooxdoo. But creating your own theme is quite easy.

Develop Reference

JavaScript is the programming language of the Web.