In Go I use an HTTP request to get a site's HTML, and some elements look different from what I see when using Inspect in Chrome. A Google search and some reading led me to understand that what I see in Inspect is a stage called the DOM, which takes the raw HTML and runs some JavaScript that adds info and alters elements (go easy on me, I'm new at this ^_^).
Is there a way I can receive the DOM in Go instead of the raw HTML? I know I can use chromedp, but I'm hoping for something more like an HTTP package, because chromedp is a bit heavy on performance.
I would really appreciate any suggestions, thank you.
A simple HTTP request (via Go or anything else) will only ever get the raw HTML. The DOM is a browser-generated interpretation of the raw HTML. Yes, there is even something like the Shadow DOM.
JavaScript is interpreted by the browser's JavaScript engine, which applies changes to the DOM, adds event listeners, and dynamically manipulates said DOM.
This is why you cannot get the DOM state you see in a browser through an HTTP request. The response does not contain any of the client-side DOM manipulations done by the browser's JavaScript engine. A request library is not a browser.
To get access to the fully rendered DOM you're accustomed to seeing in the Developer Tools, you're going to need a more involved web-scraping setup, usually involving a headless browser, like Puppeteer. However, Puppeteer is written for Node.js. Since you're working in Go, you may have better luck with chromedp or cdp.
DOM stands for "Document Object Model": a tree of nodes representing the underlying document, where nodes may correspond to elements, text, comments, etc. There are many Go-based DOM packages around. One you should look at is:
https://godoc.org/golang.org/x/net/html
It lets you parse HTML and traverse the elements of the document programmatically.
I understand that Web Workers cannot access the main document DOM, but has anyone successfully tried to build a partial DOM in a Web Worker using jQuery, output the resulting HTML and then attach that to the document DOM?
Did it provide enough of a performance improvement over rendering on the UI thread to be worth the extra pain of implementing this in a thread-safe way?
Would this be trying to use Web Workers for something they shouldn't be used for?
It depends a lot on what kind of HTML you are trying to generate. However, there is one big problem you might have overlooked: it's not just that you can't access the main thread's DOM.
You cannot create HTMLElement object instances in a worker at all.
So even if you go through the trouble of generating the HTML, you will then have to convert it to an HTML string or JSON, parse it in the main thread, and convert it into DOM objects there.
The only case where this is worth it is when generating the data for the HTML is CPU-intensive. And in that case, you can just generate the data in the worker, send it to the main thread, and render it as HTML there.
This is also a good approach for keeping data processing and visual rendering separate.
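A minimal sketch of that split, written in TypeScript and assuming a dedicated worker compiled to worker.js; the squared-number rows stand in for whatever CPU-heavy computation you actually need. The worker only ships plain data, and only the main thread builds DOM nodes:

// worker.ts -- runs inside the Web Worker; no DOM access here, only data crunching.
self.onmessage = (event: MessageEvent<number[]>) => {
  // Hypothetical CPU-heavy step: derive a result for each input value.
  const rows = event.data.map((n) => ({ value: n, square: n * n }));
  // Post back plain, structured-cloneable data; HTMLElement instances cannot exist here.
  (self as unknown as Worker).postMessage(rows);
};

// main.ts -- runs on the main thread, where the DOM is available.
const worker = new Worker('worker.js');
worker.onmessage = (event: MessageEvent<{ value: number; square: number }[]>) => {
  const list = document.createElement('ul');
  for (const row of event.data) {
    const item = document.createElement('li');
    item.textContent = `${row.value} squared is ${row.square}`;
    list.appendChild(item);
  }
  document.body.appendChild(list);
};
worker.postMessage([1, 2, 3, 4]);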
I'm trying to use Nokogiri to parse this ASCAP website to retrieve some song/artist information. Here's an example of what I'd want to query
https://mobile.ascap.com/aceclient/AceClient/#ace/writer/1628840/JAY%20Z
I can't seem to access the DOM properly because the source seems to be hidden behind some kind of JavaScript. I'm pretty new to web scraping so it has been pretty difficult trying to find a way to do this. I tried using Charles to see if data was being drawn from another site, and have been using XHelper to generate accurate XPath queries.
This returns nil, where it should return "1, 2 YA'LL"
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open('https://mobile.ascap.com/aceclient/AceClient/#ace/writer/1628840/JAY%20Z'))
puts page.xpath('/html/body/div[@id="desktopSearch"]/div[@id="ace"]/div[@id="aceMain"]/div[@id="aceResults"]/ul[@id="ace_list"]/li[@class="nav"][1]/div[@class="workTitle"]').text
Step #1 when spidering/scraping is to turn off JavaScript in your browser, then look at the page. What you see at that point is what Nokogiri sees. If the data you want is visible, the odds are really good you can get at it with a parser.
At that point, do NOT rely on a browser's XPath or CSS selector list seen when you inspect an element to show you the path to the node(s) you want. Browsers do a lot of fix-ups when displaying a page, and the source view usually reflects those, including displaying data retrieved dynamically. In other words, the browser is lying to you about what it originally retrieved from a page. To work around that, use wget, curl or nokogiri http://some_URL at the command-line to retrieve the original page, then locate the node you want.
If you don't see the node you want, then you're going to need to use other tools, such as something from the Watir suite, which lets you drive a browser that understands JavaScript. A browser can retrieve a page, interpret the JavaScript, and retrieve any dynamic page content. Then you should be able to get at the markup and pass it to Nokogiri.
Used the Google inspector tools to log the XMLHttpRequests and was easily able to figure out where the data was actually being loaded from. Thanks to @NickVeys!
I have been stuck on quite a tricky issue for a couple of days now. I have an auto-generated HTML page as a string in a variable in Node.js. I need to find the height of some HTML elements and do HTML manipulations (like tag creation, deletion, appending, CSS attribute setting, etc.).
Obviously I need to build a DOM-like structure from my HTML page first and then proceed.
For the HTML manipulations I have many options like cheerio, node.io, jsdom, etc., but none of these allow me to find the height of an element in Node.
So after wasting quite a lot of time on it, I have decided to look for heavier solutions, something like running a headless browser (PhantomJS, etc.) from Node and reading an element's offsetHeight through plain JavaScript.
Can anyone tell me if it is possible to reach my objective like this? Which headless browser would be best suited for this task?
If I am going in the wrong direction, can anyone suggest any other working solution?
At this point I am ready to try anything.
Thanks in advance!!
Note: Using JavaScript on the client side has many problems in my particular case, because the contents of the generated HTML page are supposed to be pasted by the client into his website. Leaving behind running JavaScript that restructures the HTML would make things difficult on his end.
Node's server-side HTML libraries (like cheerio and jsdom) are strictly DOM API emulation libraries. They do not attempt to actually render a document, which is necessary to compute element size and position.
If you really need to calculate the size of an element on the server, you need a headless browser like PhantomJS. It is a full WebKit renderer with a JavaScript API. It is entirely separate from Node, so you either need to write a utility script using Phantom's API, or use an npm module that lets you control Phantom from Node.
After reading the comments under your question, it is pretty clear that you should not be calculating heights on the server. Client-side code is the proper place to do it.
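If the measurement does end up on the client, it is straightforward once the markup is attached to the live DOM; a tiny TypeScript sketch, with a placeholder selector:

// Measure after the element has been attached and rendered.
document.addEventListener('DOMContentLoaded', () => {
  const el = document.querySelector<HTMLElement>('#generated-content'); // placeholder id
  if (el) {
    console.log('rendered height:', el.offsetHeight);
  }
});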
I was looking for a way to change the browser's address location without causing a full page reload, and I found some useful information, like this:
http://spoiledmilk.com/blog/html5-changing-the-browser-url-without-refreshing-page/
and, trying to get into the new HTML5 history mechanics, I also found this:
HTML5/jQuery: pushState and popState - deep linking?
so a completely different idea came to mind ...
I'm asking whether there could be a possible way (I don't necessarily mean "easy"), or whether something like a "framework" already exists, to build a web project that completely avoids page reloads, making intensive use of Ajax and/or jQuery, etc. (I mention these because they're what I usually work with).
I think this could improve the "user experience" when browsing such sites.
"The bbUI toolkit is designed to progressively enhance its capability based on the abilities of the Web rendering engine on BB5/BB6/BB7/PlayBook and BlackBerry 10. [...] By not adding any kind of layout logic to the screen elements, bbUI can then modify the DOM in any way that it needs in order to achieve the desired result.
All DOM manipulation occurs while the HTML fragment is not attached to the live DOM. This allows DOM manipulation to occur VERY, VERY, FAST and it does not incur any WebView layout computation until the entire fragment is inserted into the DOM. Layout computation during JavaScript DOM manipulation is one of the single most expensive operations that can slow down a Web based UI.
Each screen you create is an HTML fragment that gets loaded into the application via AJAX to keep the size of the DOM small and memory usage to a minimum."
https://github.com/blackberry/bbUI.js
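The detached-DOM technique bbUI describes can be sketched roughly like this in TypeScript (the list content is just a placeholder): build the whole subtree while it is off-DOM, then attach it in one operation so the engine computes layout only once.

// Build the subtree on a detached DocumentFragment; nothing here triggers layout.
function renderList(items: string[]): void {
  const fragment = document.createDocumentFragment();
  for (const text of items) {
    const li = document.createElement('li');
    li.textContent = text;
    fragment.appendChild(li); // still detached, so no layout work yet
  }
  const list = document.createElement('ul');
  list.appendChild(fragment);
  document.body.appendChild(list); // a single insertion, a single layout pass
}

renderList(['one', 'two', 'three']);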
GitHub is built this way. When you're within a project and you click on a link (like "bin"), it loads new content that corresponds to that link and updates the address bar URL, but the page itself doesn't reload.
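A bare-bones TypeScript sketch of that pattern, assuming the server can return an HTML fragment for each URL, that such links carry a data-spa attribute, and that the page has a #content container (all three are assumptions, not GitHub's actual implementation):

// Intercept in-site link clicks, fetch the new content, swap it in,
// and update the address bar without a full reload.
document.addEventListener('click', async (event) => {
  const target = event.target as Element | null;
  const link = target?.closest<HTMLAnchorElement>('a[data-spa]');
  if (!link) return;
  event.preventDefault();

  const url = link.href;
  const html = await (await fetch(url)).text(); // assumes the server returns a fragment here
  document.querySelector('#content')!.innerHTML = html; // '#content' is a placeholder container
  history.pushState({ url }, '', url); // change the URL without reloading
});

// Handle back/forward so history navigation re-renders the right content.
window.addEventListener('popstate', async (event) => {
  const url = (event.state && event.state.url) || location.href;
  const html = await (await fetch(url)).text();
  document.querySelector('#content')!.innerHTML = html;
});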
Just check out http://backbonejs.org/; it should help with both of your requirements: client-side binding and browser history.
I'm having some trouble figuring out how to make the "page load" architecture of a website.
The basic idea is that I would use XSLT to present it, but instead of doing it the classic way with XSL tags, I would do it with JavaScript. Each link would therefore refer to a JavaScript function that changes the content and menus of the page.
The reason I want to do it this way is to have the option of letting JavaScript dynamically show each page using the data provided in the initial XML file, instead of making a "complete" server request for the specific page, which simply has too many downsides.
The basic problem is that, after searching the web for a way to access the "underlying" XML of the document with JavaScript, I only find solutions for accessing external XML files.
I could of course just "print" all the XML data into a JavaScript array fully declared in the document header, but I believe this would be a very, very nasty solution. And ugly, for that matter.
My questions therefore are:
Is it even possible to do what I'm thinking of?
Would it be SEO-friendly to have all the website pages' content loaded initially in the XML file?
My alternative would be to dynamically load the specific page's content on demand using AJAX. However, I find it difficult to see how that could be done in a way that is SEO-friendly at all. I can't imagine that a search engine would execute any JavaScript.
I'm very sorry if this is unclear, but it's really freaking me out.
Thanks in advance.
Is it even possible to do what I'm thinking of?
Sure.
Would it be SEO-friendly to have all the website pages' content loaded initially in the XML file?
No, it would be total insanity.
I can't imagine that a search engine would execute any JavaScript.
Well, quite. It's also pretty bad for accessibility: non-JS browsers, or browsers with a slight difference in JS implementation (e.g. new reserved words) that causes your script to throw an error, and boom! No page. And unless you provide proper navigation through hash links, usability will be terrible too.
All-JavaScript in-page content creation can be useful for raw web applications (infamously, GMail), but for a content-driven site it would be largely disastrous. You'd essentially have to build up the same pages from the client side for JS browsers and the server side for all other agents, at which point you've lost the advantage of doing it all on the client.
Probably better to do it like SO: primarily HTML-based, but with client-side progressive enhancement to do useful tasks like checking the server for updates and printing the "this question has new answers" announcement.
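As a rough TypeScript sketch of that kind of enhancement (the /updates endpoint, its JSON shape, and the question id are made up for illustration):

// Progressive enhancement: the page works as plain HTML; this script only adds
// a periodic check for new answers and a small notice when there are any.
async function checkForUpdates(questionId: number): Promise<void> {
  try {
    const response = await fetch(`/updates?question=${questionId}`); // hypothetical endpoint
    const data: { newAnswers: number } = await response.json();
    if (data.newAnswers > 0) {
      const notice = document.createElement('div');
      notice.textContent = `This question has ${data.newAnswers} new answer(s). Reload to see them.`;
      document.body.prepend(notice);
    }
  } catch {
    // If the check fails, the page simply stays as it is: no enhancement, no breakage.
  }
}

setInterval(() => checkForUpdates(1234), 60_000); // poll once a minute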
Maybe the following scenario works for you:
1. A browser requests your XML file.
2. Once loaded, the XSLT associated with the XML file is executed. Result: your initial HTML is output, together with a script tag.
3. In the JavaScript, an Ajax call to the current location is made to get the "underlying" XML DOM (sketched below). From then on, your JavaScript manages all the XML processing.
4. Make sure that in step 3 the XML is not loaded from the server again but is taken from the browser cache.
That's it.
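Step 3 might look something like this TypeScript sketch; whether the second request is actually served from the cache depends on the caching headers your server sends, and the <page> elements queried at the end are hypothetical:

// Fetch the same XML document the browser already rendered via XSLT.
// With suitable Cache-Control headers this request should hit the browser cache.
async function loadUnderlyingXml(): Promise<Document> {
  const response = await fetch(location.href, { cache: 'force-cache' });
  const text = await response.text();
  return new DOMParser().parseFromString(text, 'application/xml');
}

loadUnderlyingXml().then((xml) => {
  // From here on, navigation can be handled client-side from the XML data.
  const pages = xml.querySelectorAll('page'); // hypothetical <page> elements in your XML
  console.log(`Loaded ${pages.length} page definitions from the XML.`);
});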