When I discovered that Node.js was built using the V8 JavaScript engine, I thought:
Great, web scraping will be easier as the page
will be rendered like in the browser, with a
"native" DOM supporting XPath and any AJAX calls on
the page executed.
Why doesn't it have a native DOM when it uses the same JavaScript engine as Chrome?
Why doesn't it have a mode to run JavaScript in retrieved pages?
What am I not understanding about JavaScript engines vs the engine in a web browser?
Many thanks!
The DOM is the DOM, and the JavaScript implementation is simply a separate entity. The DOM represents a set of facilities that a web browser exposes to the JavaScript environment. There's no requirement however that any particular JavaScript runtime will have any facilities exposed via the global object.
Node.js is a stand-alone JavaScript environment, completely independent of a web browser. There's no intrinsic link between web browsers and JavaScript; the DOM is not part of the JavaScript language or its specification.
I use the old Rhino Java-based JavaScript implementation in my Java-based web server. That environment also has nothing at all to do with any DOM. It's my own application that's responsible for populating the global object with facilities to do what I need it to be able to do, and it's not a DOM.
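The point that the host application, not the engine, populates the global object can be tried from Node itself. This is a minimal sketch using Node's built-in vm module; the `document` object injected here is a made-up stand-in for illustration, not a real DOM.

```javascript
const vm = require('node:vm');

// A fresh context gets the ECMAScript built-ins (Array, Object, ...) and nothing else.
const bareContext = vm.createContext({});
const inBare = vm.runInContext('typeof document + "," + typeof Array', bareContext);

// The host decides what else is visible. This "document" is a plain object
// supplied by us (hypothetical), just as a browser supplies its real DOM.
const hostedContext = vm.createContext({ document: { title: 'provided by the host' } });
const inHosted = vm.runInContext('document.title', hostedContext);

console.log(inBare);   // undefined,function
console.log(inHosted); // provided by the host
```

The script text is ordinary JavaScript in both cases; only what the embedding program placed on the global object differs.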
Note that there are projects like jsdom if you want a virtual DOM in your Node project. Because of its very nature as a server-side platform, a DOM is a facility that Node can do without and still make perfect sense for a wide variety of server applications. That's not to say that a DOM might not be useful to some people, but it's just not in the same category of services as things like process control, I/O, networking, database interop, and so on.
There may be some "official" answer to the question "why?" out there, but it's basically just the business of those who maintain Node (the Node Foundation now). If some intrepid developer out there decides that Node should ship by default with a set of modules to support a virtual DOM, and successfully works and works and makes that happen, then Node will have a DOM.
P.S: When reading this question I was also wondering if V8 (node.js is built on top of this) had a DOM
Why when it uses the same JS engine as Chrome doesn't it have a native
DOM?
But I searched Google and found Google's V8 page, which states the following:
JavaScript is most commonly used for client-side scripting in a
browser, being used to manipulate Document Object Model (DOM) objects
for example. The DOM is not, however, typically provided by the
JavaScript engine but instead by a browser. The same is true of
V8—Google Chrome provides the DOM. V8 does however provide all the
data types, operators, objects and functions specified in the ECMA
standard.
node.js uses V8 and not Google Chrome.
Likewise, why doesn't it have a mode to run JS in retrieved pages?
I also think we don't really need it that badly. Ryan Dahl created node.js as a single programmer. Maybe now he (and his team) will develop this, but I was already extremely amazed by the amount of code he produced (crazy). He wanted to make an easy, efficient, non-blocking library, which I think he did a mighty good job at.
But then again, another developer created a module which is pretty good and actively developed (today) at https://github.com/tmpvar/jsdom.
What am I not understanding about Javascript engines vs the engine in
a web browser? :)
Those are different things as is hopefully clear from the quote above.
The Document Object Model (DOM for short) is a programming interface for HTML and XML documents. It represents the page so that programs can change the document structure, style, and content.
The necessary distinction between client-side (browser) and server-side (Node.js) and their main goals:
Client-side: accessing and displaying information of the web
Server-side: providing stable and reliable ways to deliver web information
Why is there no DOM in Node.js by default?
By default, Node.js has neither access to nor any knowledge of the actual DOM in your own browser. Node.js just delivers the data, which your browser then uses to process and render the whole website, the DOM included. That is the intended way.
Why wouldn't you want to access the DOM in Node.js?
Accessing your browser's actual DOM from Node.js would simply be outside the purpose of the server. Your browser's role is to display the data coming from the server. However, it is certainly possible, and there are multiple solutions of varying depth to pre-render, manipulate, or change the DOM using AJAX calls. We'll see what future trends bring.
Why would you want to access the DOM in Node.js?
By default, you shouldn't access your own, actual DOM (or even some of its data) from Node.js. Client side and server side are separated in role, functionality, and responsibility, based on years of experience and knowledge. There are, however, several situations where there are solid reasons to do so:
Gathering usage data (A/B testing, UI/UX efficiency and feedback)
Headless testing (Development, automation, web-scraping)
How can you access the DOM in Node.js?
jsdom: pure-JavaScript implementation, good for testing your own DOM/browser-related project
cheerio: great solution if you like/often use jQuery
puppeteer: Google's own way to provide headless testing using Google Chrome
own solution (your possible future project link here)
These solutions do not provide a way to access your browser's own, actual DOM by default, but you can create a project that sends some form of data about your DOM to the server, then use, render, or manipulate that data based on your needs.
...and yes, web-scraping and web development in terms of tools and utilities became more sophisticated and certainly easier in several fields.
node.js chose not to include it in their standard library. For any functionality, there is an inevitable tradeoff between comprehensiveness, scalability, and maintainability.
That doesn't mean it's not potentially useful. There is at least one JavaScript DOM implementation intended for NodeJS (among other CommonJS implementations).
You seem to have a flawed assumption that V8 and the DOM are inextricably related. That's not the case. The DOM is actually handled by WebKit; V8 doesn't handle the DOM, it handles JavaScript calls to the DOM. Don't let this discourage you: Node.js has carved out a significant niche in the realtime server market, but don't let anybody tell you it's just for servers. Node makes it possible to build almost anything with JavaScript.
It is possible to do what you're talking about. For example, there is the very good jsdom library if you really need access to the DOM, as well as node-htmlparser. There are also some really good scraping libraries that take advantage of these, like apricot.
2018 answer: mainly for historical reasons, but this may change in future.
Historically, very little DOM manipulation was done on the server. Additionally, as other answers allude, the JS stdlib and the DOM are separate libraries: if you're using node for, say, Unix scripting, then HTMLElement, NodeList, etc. aren't really relevant to that.
However: server-side DOM manipulation is now a very common part of delivering web apps. Web servers need to understand the structure of pages, and, if asked to render a resource as HTML, deliver HTML content that reflects the initial state of a web application. This means web apps load much faster than if the server simply delivers a stub page and has the browsers then do the work of filling in the real content. Currently this is done with JSDom and similar, but in the same way node has Request and Response objects built in, having DOM functions maintained as part of the stdlib would help with this task.
Javascript != browser. Javascript as a language is not tied to browsers; node.js is simply an implementation of Javascript that is intended for servers, not browsers. Hence no DOM.
If you read DOM as 'linked objects immediately accessible from my script', then the answer is 'it does, but it's very different from the set of objects available to a web document's script'. The main reason is that node is 'evented I/O for V8', not 'HTML tree objects for V8'.
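Which objects are "immediately accessible" can be checked directly. This is a small sketch that probes the global object for a few names; the two lists are just examples chosen for illustration, and the results noted in the comments assume plain Node.js with no DOM library loaded.

```javascript
// Names Node.js puts on its global object for server-side work.
const nodeFacilities = ['process', 'Buffer', 'setTimeout'];
// Names a browser puts on its global object for working with a page.
const browserFacilities = ['window', 'document', 'HTMLElement', 'NodeList'];

const isDefined = (name) => typeof globalThis[name] !== 'undefined';

const presentInNode = nodeFacilities.filter(isDefined);
const presentFromBrowser = browserFacilities.filter(isDefined);

console.log(presentInNode);      // all three names under plain Node.js
console.log(presentFromBrowser); // empty under plain Node.js
```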
Node is a runtime environment, it does not render a DOM like a browser.
Because there isn't a DOM. DOM stands for Document Object Model. There is no document in Node, so there is no DOM to manipulate. That is definitely a browser thing.
You can use a library like cheerio though which gives you some simple DOM manipulation.
Node is server-level JavaScript. It's just the language applied to a basic system API, more like C++ or Java.
It seems people have answered 'why' but not 'how'. A quick answer to 'how' is that in a web browser, a document object is exposed (hence DOM, Document Object Model). On Windows this object is called the document object. You can refer to this page and look at the methods it exposes for handling HTML documents, like createElement. I don't use node.js and haven't done COM programming in a while, but I'd imagine you could use the DOM in node.js by simply calling the COM object IHTMLDocument3. Of course, for other platforms like Mac OS X or Linux you would probably have to use something from their OS API. This should allow you to easily build a web page server-side using the DOM, or to scrape incoming web pages.
Node.js is for serverside programming. There is no DOM to be rendered in the server.
1) What does it mean for it to have a Document Object Model? There's no document to represent.
2) Most of the time you're not retrieving pages. You can, but most Node apps probably won't be.
3) Without a document and a browser, JavaScript is just another programming language. So you may as well ask why there isn't a DOM in C# or Java.
I often compile informal datasets by running some kind of XPath/XQuery on publicly available web pages. Usually the structure of the HTML is regular enough that useful information can be extracted easily.
But today I've come across tunefind.com. This website makes extensive use of the REACTJS framework, and so most of the structure of the page is configured client-side by Javascript. The pages, when initially downloaded, are very basic and missing a lot of information. The pages are populated by a script that uses a hopelessly messy blob of JSON data at the bottom of the page.
The only way I can think of to deal with this would be to use some kind of GUI-based web engine and just not display the GUI part. But that is a preposterous amount of work for these casual little CLI tools that I use to gather information.
Is there any way to perform the javascript preprocessing without dealing with unnecessary graphics?
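If the data of interest is already sitting in that JSON blob, one option that avoids running the page's scripts at all is to pull the blob out of the downloaded HTML and parse it. The sketch below is only an illustration: the page markup, the script id "initial-state", and the field names are all invented, and a real page will need its own pattern after inspecting the source.

```javascript
// Hypothetical page shape; the real markup on any given site will differ.
const html = `
<html><body><div id="root"></div>
<script id="initial-state" type="application/json">
{"songs":[{"title":"Song A","artist":"Artist 1"},{"title":"Song B","artist":"Artist 2"}]}
</script>
</body></html>`;

// Find the <script> element with the given id and parse its body as JSON.
// Returns null when no such element is present.
function extractJsonBlob(page, scriptId) {
  const pattern = new RegExp(
    `<script[^>]*id="${scriptId}"[^>]*>([\\s\\S]*?)</script>`,
    'i'
  );
  const match = pattern.exec(page);
  return match ? JSON.parse(match[1]) : null;
}

const state = extractJsonBlob(html, 'initial-state');
const titles = state.songs.map((song) => song.title);
console.log(titles); // [ 'Song A', 'Song B' ]
```

This only works when the blob is valid JSON embedded in the page; if the script assigns it to a variable or builds it at runtime, a browser-based approach is still needed.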
Even if you were to process the page without the graphics, the React JavaScript will be geared towards running in a browser context. At the very least it will expect a functioning DOM to exist, and the application itself may also require clicks or transitions to happen before you can see some data.
Your best bet, then, is to load the page in a browser. To keep this simple, there are plenty of good browser automation frameworks designed for this.
I've used a fair few libraries over the years, including PhantomJS, and recently I've gotten the most mileage out of Nightmare (nightmarejs).
It runs an Electron browser for you and gives you a useful promisified JavaScript API to control it with, covering common browser functions such as clicking, following links, etc.
You can configure it to hide the browser, which is useful for making a CLI tool. However, it's a bit of a pseudo-headless mode and will still require a windowing/graphical context (e.g. X Window).
Hope this helps.
PS - If you're at all used to docker it's not hard to make this just a running container!
I've looked at the various ways to embed a web browser into an application (like IE or Safari via OS-specific means, or Firefox/Mozilla via XULRunner, or Chrome via the Chromium Embedded Framework) and I've managed to integrate CEF with my app up to a point where I'm convinced that it'll all work as expected. Now, it seems to me that whenever I want to modify the DOM (e.g. to add or remove elements), I'll have to do this via Javascript, i.e. my application calls out to Javascript where the actual work is done.
I wonder why this is so. My (naive?) belief is that if for example I call appendChild in Javascript, the actual "work" of appending a child will eventually be performed by a C/C++ function as the browser itself is written in C/C++ and not in Javascript. So, I'm wondering why in an embedded web browser I can't call this C/C++ function directly instead of going through Javascript. I understand that for general scripting you don't want other languages than Javascript for security reasons, but if the browser is embedded into an application I can control anyway this shouldn't be the reason, should it?
What am I missing?
CEF is implemented as a layer between Chromium's content API and your application. When using CEF, Chromium is a library inside CEF, and you only have access to CEF's public API, which is more or less restricted to whatever the Chromium content API offers (keep in mind that no browser was created as an embeddable plugin and then evolved into an application; it was always the other way around). The content API was the way Google engineers formalized some forms of introspection, but it isn't complete, simply because the browser isn't completely modular itself. There's work in progress in the Chromium code to separate specific "do-it-all" components into more general ones that you may pick at will.
Therefore you can't simply hook into Chromium's implementation details when using CEF: you'd need to patch it to implement something it doesn't expose by itself. CEF implements a class for DOM traversal (see here), but you can only inspect the DOM, not change it.
That said, on the C++ side you can do some arbitrary things such as inspecting/mangling HTTP requests (which allows you to inject JavaScript into pages, for instance) and running arbitrary JavaScript code straight from C++, which can, in turn, asynchronously call back into C++ code by various paths (AJAX -> HTTP handling in C++, or V8 extensions, which you can write directly in C++).
See https://bitbucket.org/chromiumembedded/cef/wiki/JavaScriptIntegration for more details.
One could customize CEF or go straight to the Chromium source code, but that thing is huge. Other solutions I've heard of are more or less alike in terms of API limitations, e.g. Awesomium, Mozilla's Gecko, etc.
I have some C++ code that I want to expose to client side of a web app. Ideally, I want to write Javascript wrapper objects for my C++ classes so that I can use them clientside.
Has this been done before? Does anyone have a link to show how this may be achieved?
There is a library to convert C++ code to javascript, it might help:
emscripten
Libjspp: a C++ template-based wrapper for embedding and extending the JavaScript engine SpiderMonkey 1.8.5 and later.
SpiderMonkey is the Mozilla project's JavaScript/ECMAScript engine.
Libjspp allows C++ developers to embed SpiderMonkey simply and easily into their applications. Libjspp allows multiple JavaScript engines to run within the same process, which suits the one-engine-per-thread paradigm and is helpful in achieving true parallelism. Libjspp also in no way stops the user from running multiple threads within an engine.
http://code.google.com/p/libjspp/
I guess that RPC is what you want. You'll need to wrap your functions on the server side using some sort of framework. I've not yet used it, but this one looks promising.
On the client side you use proxy objects to dispatch the function calls. The communication is handled usually either via XML-RPC or JSON-RPC. I used this client side framework and was quite content but I'm sure you'll find many others.
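A client-side proxy of the kind described above can be sketched in a few lines. This is an illustration only, not the API of any particular framework: the transport is an in-process stand-in for an HTTP call, and the method name `add` and the helper names are invented. The request and error shapes follow JSON-RPC 2.0.

```javascript
// Stand-in for the server: in a real setup this would be an HTTP endpoint
// in front of the wrapped C++ functions. The method name "add" is made up.
const serverMethods = { add: (a, b) => a + b };
const sentRequests = [];

async function fakeTransport(request) {
  sentRequests.push(request); // kept so the wire format can be inspected
  const method = serverMethods[request.method];
  if (!method) {
    return { jsonrpc: '2.0', id: request.id, error: { code: -32601, message: 'Method not found' } };
  }
  return { jsonrpc: '2.0', id: request.id, result: method(...request.params) };
}

// Every property read on the proxy yields a function that dispatches
// a JSON-RPC request under that property's name.
function createRpcProxy(send) {
  let nextId = 1;
  return new Proxy({}, {
    get(_target, name) {
      // Avoid being mistaken for a promise if the proxy itself is awaited.
      if (name === 'then') return undefined;
      return async (...params) => {
        const response = await send({ jsonrpc: '2.0', id: nextId++, method: String(name), params });
        if (response.error) throw new Error(response.error.message);
        return response.result;
      };
    },
  });
}

const calculator = createRpcProxy(fakeTransport);
calculator.add(2, 3).then((sum) => console.log(sum)); // logs 5
```

Swapping `fakeTransport` for a function that POSTs the request object to the server is the only change needed to go over the network; the calling code stays `calculator.add(2, 3)`.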
This is an old topic; however, I was in this exact situation just now, and all of the solutions I found on the net were complicated or outdated.
Recently, I ran across a library which supports V8 engine (including the new isolation API, which makes 90% of the libraries I found outdated) and provides great exposure and interaction API.
https://github.com/QuartzTechnologies/v8bridge
I hope that my solution will help anybody.
There's a relatively new library for doing this called nbind. Maybe that would suit you? It looks very good to me, and I'm just about to start using it.
I think you want a C++ JSON parser. You should be able to find one here http://www.json.org/. It may not do all you want because it just serializes and deserializes C++ objects without any behavior, but it should be good enough. See https://stackoverflow.com/questions/245973/whats-the-best-c-json-parser for some discussion.
If the C++ code has to be on the client, then there is no simple way to do this for a web app. A solution may involve coding plugins for the browsers you want to support, which may then be accessed from javascript code.
If, for example, you need this for a client application, that is another case. Such a thing has been done, and it involves linking your application to (or running it alongside) a library such as Chromium's, or any other JavaScript execution engine. That way you can create bindings to C++ classes and use such objects from JavaScript and vice versa. Note that this is also not a trivial solution and may be a big effort to implement (it also requires additional resources).
You could for example wrap the C++ classes in PHP or Python, and then implement an API over HTTP to access the required functions.
Or, if you insist on exposing the functions as JavaScript, you could try using Node.js and create a C++ add-on to wrap your classes. See the Node.js documentation here: http://nodejs.org/api/addons.html#addons_wrapping_c_objects
But either way, I don't think you can avoid creating some sort of API (HTTP SOAP, XML RPC) to access the functions on your server.
Though QML is not exactly JavaScript and Qt is not plain C++, what they do together seems to be just what you need.
I was using Greasemonkey earlier in the week to automate some calls to a page to scrape some data from a website. This was awkward for two reasons:
It's GUI-based instead of command-line-based.
I had to store all persisted information in JSON, and not directly in a database.
Would it be possible to use node.js as a Greasemonkey alternative, since node.js can store records directly in a database and won't be required to visually load pages the way Greasemonkey does?
Also, I would think that node.js would be easier to work with, since you don't have to re-deploy its scripts to Firefox the way you have to with Greasemonkey, allowing you to easily use version control on separate scripting projects.
On the other hand using node.js to do GreaseMonkey's job might just be using a hammer to pound in a screw, so I thought I would check here to find out if I am mistaken.
On the other hand using node.js to do GreaseMonkey's job might just be using a hammer to pound in a screw
I would say that the opposite is true; I believe you're using Greasemonkey to do the job of a server-side processing library. Greasemonkey runs in the browser and is designed to modify your web experience by running scripts on the pages you visit.
Indeed, I believe Node.js would be very well suited to this task. With libraries like jsdom and node-jquery, you can easily do JavaScript parsing over the DOM. You may also wish to take a look at node.io, a "distributed data scraping and processing framework." Finally, you may look into non-Node (but still JavaScript) based tools, such as PhantomJS and CasperJS, which can do scraping, DOM manipulation, screenshots, and more.
The question is a bit of a non sequitur.
Greasemonkey is for clients to tweak their individual browsing experience, client-side.
Node.js is for developers to deliver applications to the masses (hopefully), server-side.
For scraping data, in an automatable way, use Node.js or some server-side library (Python works well).
For "Mashups" of webpages that you browse, use Greasemonkey.
I am developing a multi-platform game/visualization framework that uses JavaScript for scripting purposes. The current Flash-based implementation, intended for use in browsers, injects framework-level scripts into the host page and executes the game scripts in that environment, marshaling calls/objects in and out of the SWF object as necessary.
This solution is working nicely and will allow alternate native (out-of-browser) framework implementations to use a dedicated JS engine (such as V8) as the scripting environment and run the scripts unaltered.
The framework uses a custom hierarchical document object model, used declaratively in XML. I'd now like to extend the model to allow runtime modification of the hierarchy. Rather than designing a new solution from scratch for tree operations and event binding, I'm looking into implementing or harnessing jQuery for this purpose.
For those of you familiar with behind-the-scenes jQuery, how extensible is it when it comes to working with alternate object models? Is it baked onto the HTML DOM, or can I wiggle my way into its internals and add support for my DOM?
Thanks for any insights.
jQuery (being built on Javascript) is built around the W3C's DOM (which presents itself as an extension built into the implementation of ECMAScript). The API for this is governed by the W3C's DOM specification. Web Browsers implement support for the DOM by exposing the API to their specific Javascript host, be it Chakra, V8, Tracemonkey etc.
From what I can see, if you can implement (or partially implement) the DOM specification which Javascript and jQuery (and other frameworks) respond to, there should be no reason why jQuery cannot be used in the way you want.
That seems like a lot of work though...
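To give a feel for what "partially implementing the DOM" means for a custom object model, here is a minimal sketch of just the tree-manipulation part (parentNode, childNodes, appendChild, removeChild). The class name `SceneNode` and the node names are invented, and this is nowhere near enough for jQuery on its own; selectors, attributes, and events would all need the same treatment.

```javascript
// A custom (non-HTML) node exposing a small, DOM-shaped tree interface.
class SceneNode {
  constructor(nodeName) {
    this.nodeName = nodeName;
    this.parentNode = null;
    this.childNodes = [];
  }

  get firstChild() {
    return this.childNodes[0] ?? null;
  }

  // Like the DOM, appending a node that already has a parent moves it.
  appendChild(child) {
    if (child.parentNode) child.parentNode.removeChild(child);
    child.parentNode = this;
    this.childNodes.push(child);
    return child;
  }

  removeChild(child) {
    const index = this.childNodes.indexOf(child);
    if (index === -1) throw new Error('Not a child of this node');
    this.childNodes.splice(index, 1);
    child.parentNode = null;
    return child;
  }
}

const root = new SceneNode('scene');
const layer = root.appendChild(new SceneNode('layer'));
const sprite = layer.appendChild(new SceneNode('sprite'));

// Re-parent the sprite at runtime: it leaves "layer" and joins "scene".
root.appendChild(sprite);
console.log(root.childNodes.map((node) => node.nodeName)); // [ 'layer', 'sprite' ]
```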