Is it possible to parse an HTML page that was fetched via a jQuery GET call in a web worker? I know that you can't access the DOM in a web worker, but I think it should be possible to pass the HTML to the worker and process it with a selector engine like Sizzle. I don't need to manipulate the main page's DOM here. Or is Sizzle also not available for use in this case?
The problem is that my processing/parsing is so heavy that the UI hangs for about 10 seconds, so doing this work in a separate thread would be a good idea. If this is possible, how could I implement it?
Related
I understand that Web Workers cannot access the main document DOM, but has anyone successfully tried to build a partial DOM in a Web Worker using jQuery, output the resulting HTML and then attach that to the document DOM?
Did it provide much of a performance improvement over rendering on the UI thread that is worth the extra pain of implementing this in a thread-safe way?
Would this be trying to use Web Workers for something they shouldn't be used for?
It depends a lot on what kind of HTML you are trying to generate. However, there is one big problem you might have overlooked. It's not just that you can't access the main thread's DOM.
You cannot create HTMLElement object instances in a worker at all.
So even if you go through the trouble of generating the HTML, you will then have to convert it to an HTML string or JSON, parse it in the main thread, and convert it to a DOM object.
The only case where this is worth it is when generating the data for the HTML is CPU-intensive. And in that case, you can just generate the data, send it to the main thread, and display it as HTML.
This is also a good approach for keeping data processing and visual rendering separate.
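A minimal sketch of that data-only pattern, assuming the heavy work can be expressed over plain objects; the file names, element ID, and payload shape are made up for illustration:

```javascript
// worker.js (hypothetical file name): do the CPU-heavy work off the UI thread.
// Only plain data is used here - no DOM or HTMLElement objects are available.
self.onmessage = function (e) {
  var results = e.data.map(function (row) {
    // ...the expensive computation would go here...
    var total = row.values.reduce(function (a, b) { return a + b; }, 0);
    return { name: row.name, total: total };
  });
  self.postMessage(results); // send plain data back to the main thread
};
```

```javascript
// main page script: only the main thread turns the worker's data into DOM.
var worker = new Worker('worker.js');

worker.onmessage = function (e) {
  var html = e.data.map(function (item) {
    return '<li>' + item.name + ': ' + item.total + '</li>';
  }).join('');
  document.getElementById('results').innerHTML = html; // assumes a #results element exists
};

worker.postMessage([{ name: 'example', values: [1, 2, 3] }]); // example payload
```

The worker never touches HTML at all; only the main thread turns the computed data into markup.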
So I'm using Python and BeautifulSoup 4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab the HTML of a page, it's not the entire page, because some of it is generated via JavaScript. Is there any way to get around this?
There are basically two main options to proceed with:
Using the browser developer tools, see which AJAX requests are made to load the page and simulate them in your script; you will probably need the json module to load the response JSON string into a Python data structure.
Use tools like Selenium that open up a real browser. The browser can also be "headless"; see Headless Selenium Testing with Python and PhantomJS.
The first option is more difficult to implement and it's, generally speaking, more fragile, but it doesn't require a real browser and can be faster.
The second option is better in the sense that you get what any other real user gets, and you don't have to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page - you may not need BeautifulSoup at all. But this option is slower than the first one.
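For the second option, a rough sketch in Python might look like the following, using headless Chrome via Selenium's WebDriver bindings rather than the PhantomJS setup mentioned above; the URL and CSS selector are placeholders:

```python
# Rough sketch: let a real (headless) browser render the page, then hand the
# final HTML to BeautifulSoup. The URL and CSS selector are placeholders.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run without opening a window
driver = webdriver.Chrome(options=options)  # assumes a chromedriver is available

try:
    driver.get("https://example.com/page-with-js")
    html = driver.page_source               # HTML *after* JavaScript has run
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
for item in soup.select(".result"):          # placeholder selector
    print(item.get_text(strip=True))
```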
Hope that helps.
We are using an HTML parser called Html Agility Pack in our .NET web services. We parse some HTML pages using this parser and extract some content. Though this parser is useful, we are looking for better productivity. We are wondering whether we can load JavaScript and jQuery in our web services and use jQuery just like we use it on web pages for extracting content. This would make our job a lot easier.
If we cannot do this, can we leverage the power of jQuery in some other way? We are really curious to know of a solution.
You could send the HTML content to the user's browser and parse it with jQuery there.
Then show it to the user.
Or, if needed for others, post the result back to the server to cache it.
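A rough sketch of that approach, assuming the web service can expose the raw HTML over HTTP; the /raw-html and /cache-result endpoints, the selectors, and the #output element are hypothetical:

```javascript
// Rough sketch: fetch the raw HTML, parse and extract it in the browser with
// jQuery, show it, and optionally post the result back for server-side caching.
$.get('/raw-html', function (html) {
  // $.parseHTML turns the string into detached DOM nodes (script tags are dropped by default)
  var $doc = $('<div>').append($.parseHTML(html));

  var extracted = $doc.find('h1, .article-body p').map(function () {
    return $(this).text();
  }).get();

  $('#output').text(extracted.join('\n')); // show it to the user

  // optional: let the server cache the extracted content for other users
  $.post('/cache-result', { data: JSON.stringify(extracted) });
});
```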
Many webpages use onload JavaScript to manipulate their DOM. Is there a way I can automate accessing the state of the HTML after these JavaScript operations?
A tool like wget is not useful here because it just downloads the original source.
Is there perhaps a way to use a web browser rendering engine?
Ideally I am after a solution that I can interface with from Python.
Thanks!
The only good way I know to do such things is to automate a browser, for example via Selenium RC. If you have no way to deduce that the page has finished running the relevant JavaScript, then, just like a real live user visiting that page, you'll have to wait a while, grab a snapshot, wait some more, grab another, and check that there was no change between them to convince yourself that it's really finished.
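A rough sketch of that polling idea using Selenium's Python bindings (the newer WebDriver API rather than the RC interface); the URL and timings are placeholders:

```python
# Poll the rendered source until two consecutive snapshots match, i.e. the
# page's JavaScript appears to be done. The URL and timings are placeholders.
import time

from selenium import webdriver

driver = webdriver.Firefox()  # assumes Firefox and its driver are installed
try:
    driver.get("https://example.com/js-heavy-page")

    previous = driver.page_source
    for _ in range(10):          # give up after roughly 10 seconds
        time.sleep(1)
        current = driver.page_source
        if current == previous:  # no change between snapshots: assume it's finished
            break
        previous = current

    html_after_js = previous     # HTML as left by the page's JavaScript
finally:
    driver.quit()
```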
Please see related info on Stack Overflow:
screen-scraping
Screen Scraping from a web page with a lot of Javascript
I'm having some trouble figuring out how to design the "page load" architecture of a website.
The basic idea is that I would use XSLT to present it, but instead of doing it the classic way with XSL tags, I would do it with JavaScript. Each link would therefore refer to a JavaScript function that changes the content and menus of the page.
The reason I want to do it this way is to have the option of letting JavaScript dynamically show each page using the data provided in the initial XML file, instead of making a "complete" server request for the specific page, which simply has too many downsides.
The basic problem is that, after searching the web for a way to access the "underlying" XML of the document with JavaScript, I only find solutions for accessing external XML files.
I could of course just "print" all the XML data into a JavaScript array fully declared in the document header, but I believe this would be a very, very nasty solution. And ugly, for that matter.
My questions therefore are:
Is it even possible to do what I'm thinking of?
Would it be SEO-friendly to have all the website pages' content loaded initially in the XML file?
My alternative would be to dynamically load the specific page's content using AJAX on demand. However, I find it difficult to see how that could be even slightly SEO-friendly. I can't imagine that a search engine would execute any JavaScript.
I'm very sorry if this is unclear, but it's really freaking me out.
Thanks in advance.
Is it even possible to do what I'm thinking of?
Sure.
Would it be SEO-friendly to have all the website pages' content loaded initially in the XML file?
No, it would be total insanity.
I can't imagine that a search engine would execute any JavaScript.
Well, quite. It's also pretty bad for accessibility: non-JS browsers, or browsers with a slight difference in JS implementation (e.g. new reserved words) that causes your script to throw an error, and boom: no page. And unless you provide proper navigation through hash links, usability will be terrible too.
All-JavaScript in-page content creation can be useful for raw web applications (infamously, GMail), but for a content-driven site it would be largely disastrous. You'd essentially have to build up the same pages from the client side for JS browsers and the server side for all other agents, at which point you've lost the advantage of doing it all on the client.
Probably better to do it like SO: primarily HTML-based, but with client-side progressive enhancement for useful tasks like checking the server for updates and printing the “this question has new answers” notice.
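As a rough illustration of that kind of enhancement (not SO's actual code), the page stays ordinary server-rendered HTML, and a small script polls a hypothetical endpoint for new answers:

```javascript
// Illustrative only: /questions/123/answer-count and #new-answers are hypothetical.
var initialCount = null;

setInterval(function () {
  $.getJSON('/questions/123/answer-count', function (data) {
    if (initialCount === null) {
      initialCount = data.count;               // remember the count at page load
    } else if (data.count > initialCount) {
      $('#new-answers')                        // initially hidden element
        .text('This question has new answers')
        .show();
    }
  });
}, 30000); // check every 30 seconds
```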
Maybe the following scenario works for you:
1. A browser requests your XML file.
2. Once loaded, the XSLT associated with the XML file is executed. Result: your initial HTML is output, together with a script tag.
3. In the JavaScript, an AJAX call to the current location is made to get the "underlying" XML DOM (a sketch follows below). From then on, your JavaScript manages all the XML processing.
4. Make sure that in step 3 the XML is not loaded from the server again but is taken from the browser cache.
That's it.
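A rough sketch of step 3, assuming jQuery is available on the page; the element names read from the XML are made up:

```javascript
// Re-request the current URL (ideally answered from the browser cache) and
// keep working with the parsed XML document from then on.
$.ajax({
  url: window.location.href,
  dataType: 'xml',   // jQuery hands the success callback a parsed XML document
  cache: true,       // allow the browser to reuse its cached copy
  success: function (xmlDoc) {
    $(xmlDoc).find('page').each(function () {  // element names are placeholders
      console.log($(this).attr('id'), $(this).find('title').text());
    });
  }
});
```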