I am trying to scrape the content of a website that appears to rely on JavaScript or some similar technology. I am using XPath to find the content on the page. I can see the content using Firebug in the browser, but if I save the page source or download it via curl/wget, the content is missing. How is this possible?
Thanks in advance.
Some content is loaded dynamically via JavaScript. You need to execute that JavaScript somehow, for example in a headless browser such as PhantomJS, and give it a few seconds to load the dynamic content. Then walk the DOM, similar to what .html() in jQuery does, to get the rendered content.
As far as I know, this is similar to how Opera Mini does it in its proxies before it re-encodes the page and sends it to your device:
The server sends the response back as normal — when this is received by the Opera transcoding servers, they parse the markup and styles, execute the JavaScript, and transcode the data into Opera Binary Markup Language (OBML). This OBML data is progressively loaded by Opera Mini on the user's device.
From Opera Mini's Wikipedia entry:
JavaScript will only run for a couple of seconds on the Mini server before pausing, due to resource constraints.
According to the documentation for Opera Mini 4, before the page is sent to the mobile device, its onLoad events are fired and all scripts are allowed a maximum of two seconds to execute. The setInterval and setTimeout functions are disabled, so scripts designed to wait a certain amount of time before executing will not execute at all. After the scripts have finished or the timeout is reached, all scripts are stopped and the page is compressed and sent to the mobile device.
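PhantomJS is no longer maintained, so a present-day equivalent of this approach is headless Chrome driven by Puppeteer. A minimal sketch in Node.js/TypeScript, assuming a placeholder URL (the idea is the same regardless of the headless browser used):

```typescript
// Render a JavaScript-heavy page in headless Chrome and dump the final DOM.
// Sketch only: the URL below is a placeholder for the site you are scraping.
import puppeteer from 'puppeteer';

async function fetchRenderedHtml(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Wait until the network has gone idle, so AJAX-driven content has a chance to load.
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 30000 });
    // page.content() returns the serialized DOM after scripts have run,
    // i.e. what you see in Firebug/DevTools rather than the raw source.
    return await page.content();
  } finally {
    await browser.close();
  }
}

fetchRenderedHtml('https://example.com/js-heavy-page')
  .then(html => console.log(html))
  .catch(err => console.error(err));
```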
Typically the page loads and then requests the content via AJAX, which is returned as JSON or JSONP. This is usually pretty handy for scraping because JSON is even easier to parse than HTML.
But if you haven't done it before, it can be a challenge to figure out how to make the right AJAX request.
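Once you have found the underlying request in the browser's network panel, you can usually call it directly. A hedged sketch, assuming a hypothetical JSON endpoint and made-up field names (copy the real URL and headers from the Network tab):

```typescript
// Call the page's data endpoint directly instead of scraping rendered HTML.
// The endpoint URL, headers and response shape below are illustrative placeholders.
// Requires Node.js 18+ for the global fetch API.
interface Item {
  id: number;
  title: string;
}

async function fetchItems(): Promise<Item[]> {
  const res = await fetch('https://example.com/api/items?page=1', {
    headers: {
      // Some endpoints only answer requests that look like AJAX calls.
      'X-Requested-With': 'XMLHttpRequest',
      'Accept': 'application/json',
    },
  });
  if (!res.ok) {
    throw new Error(`Request failed: ${res.status}`);
  }
  return (await res.json()) as Item[];
}

fetchItems().then(items => console.log(items.length, 'items'));
```

If the endpoint returns JSONP rather than plain JSON, strip the callback wrapper from the response text before parsing it.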
Is it possible to use C# to render an ASP.NET view on the server side and save it as a PDF, preserving all the visual elements that involve CSS and JavaScript, exactly as it renders in Chrome? The JavaScript includes the latest versions of the standard Bootstrap and d3 libraries, as well as code that uses d3 to draw SVG charts. The page's CSS relies heavily on Bootstrap.
I've tried a few things, including IronPdf, but it completely destroys the formatting no matter which options I try. The only good results I've been able to get are by actually viewing the web page in Chrome, printing it, and saving it as a PDF that way. I'm basically trying to get exactly the same result from backend C# code that generates the PDF, without any user interaction. Can this be done? If a perfect PDF rendering is impossible, I would also be open to other visual file formats that preserve the appearance of the web page.
You can run any install of Chrome in headless mode and send it a command to print:
chrome --headless --disable-gpu --print-to-pdf=file1.pdf https://www.google.co.in/
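If you want the same result programmatically rather than from the command line, the usual route is to drive headless Chrome from code. Below is a minimal sketch using Puppeteer (Node.js/TypeScript); from C# you could launch the chrome command above with Process.Start, or use a wrapper library such as PuppeteerSharp. The URL and output path are placeholders.

```typescript
// Print a JavaScript-heavy page (e.g. Bootstrap/d3 SVG charts) to PDF via headless Chrome.
// Sketch only: URL, output path and page size are placeholders.
import puppeteer from 'puppeteer';

async function printToPdf(url: string, outPath: string): Promise<void> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // 'networkidle0' gives d3 and other scripts time to fetch data and draw.
    await page.goto(url, { waitUntil: 'networkidle0' });
    await page.pdf({
      path: outPath,
      format: 'A4',
      printBackground: true, // keep Bootstrap background colours and chart fills
    });
  } finally {
    await browser.close();
  }
}

printToPdf('https://example.com/report', 'report.pdf').catch(console.error);
```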
I did this a couple of years ago. I created a microservice that took a URI, a test JavaScript, and a massage script, and returned a PDF.
Test Script:
The test script is injected into the page and is called repeatedly until it returns true. The script should verify that the components are in a properly loaded state. (This could be skipped by simply using a long delay before printing the PDF.)
Massage Script:
The massage script is not required. It is injected into the page to alter the JavaScript or HTML before the PDF is printed.
I used this heavily to load the entire user JOM including all Angular data stores (NGRX) since the user context was not present in the server-side Chrome instance.
Delayed Printing:
Since this is not a feature supported by Chrome, I made an endpoint on my server that would hold a GET connection open indefinitely. A script tag referencing that endpoint was injected into each page to be printed. When the test script reported ready, the code would cancel the pending request by changing the script tag's src to an empty script file that returned immediately. This resolved the last item Chrome was waiting on, so the document-ready event fired and triggered the Chrome print.
In this way, I was able to control Chrome's printing of complex authenticated pages from my server.
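For comparison, Puppeteer now exposes this kind of control directly, so the connection-holding trick is not strictly necessary: a predicate equivalent to the test script can be polled with page.waitForFunction, and a massage step can be run with page.evaluate before printing. A hedged sketch, where the readiness check and the DOM tweak are made-up examples:

```typescript
// "Test script" + "massage script" pattern with Puppeteer instead of a held-open request.
// The readiness predicate and the DOM tweak below are illustrative placeholders.
import puppeteer from 'puppeteer';

async function printWhenReady(url: string, outPath: string): Promise<void> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'load' });

    // Test script: poll until the page says it is fully rendered.
    await page.waitForFunction(
      () => document.querySelectorAll('.chart svg').length > 0, // placeholder check
      { polling: 500, timeout: 30000 }
    );

    // Massage script: tweak the DOM before printing, e.g. hide navigation chrome.
    await page.evaluate(() => {
      document.querySelector('nav')?.remove(); // placeholder tweak
    });

    await page.pdf({ path: outPath, printBackground: true });
  } finally {
    await browser.close();
  }
}

printWhenReady('https://example.com/dashboard', 'dashboard.pdf').catch(console.error);
```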
I get this message in JMeter when I run my test plan.
<noscript>
<p>
<strong>Note:</strong> Since your browser does not support JavaScript,
you must press the Continue button once to proceed.
</p>
</noscript>
How do I get around this issue in JMeter? When I go to the link manually in Chrome, the page/charts load fine.
I asked the UI engineer how things work, and they said that when we go to the web page:
The HTTP request returns HTML.
The browser reads the HTML and requests the JS files.
Thanks in advance for any help.
As per the JMeter project's main page:
JMeter is not a browser
JMeter is not a browser, it works at protocol level. As far as web-services and remote services are concerned, JMeter looks like a browser (or rather, multiple browsers); however JMeter does not perform all the actions supported by browsers. In particular, JMeter does not execute the Javascript found in HTML pages. Nor does it render the HTML pages as a browser does (it's possible to view the response as HTML etc., but the timings are not included in any samples, and only one sample in one thread is ever displayed at a time).
Browsers don't do any magic; they just execute HTTP requests and render the responses. If JavaScript is only being used to "draw" something on the page, you should not be interested in it, as that happens solely on the client side.
If JavaScript is used for building e.g. AJAX requests, these are basically "normal" HTTP requests which can be recorded using the HTTP(S) Test Script Recorder and replayed via HTTP Request samplers.
If you cannot successfully replay your script, most likely you're missing an HTTP Cookie Manager and/or need to perform correlation of the dynamic parameter(s).
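To make "correlation" concrete: if the first response contains a dynamic value (a hidden form field, a view state, a CSRF token), the replayed request must extract it and send it back, which is what JMeter post-processors such as the Regular Expression Extractor are for. A rough sketch of the same idea in TypeScript, with made-up URLs and a made-up field name:

```typescript
// Illustration of "correlation": pull a dynamic hidden field out of the first
// response and resend it with the next request. URLs, field name and credentials
// are placeholders. Requires Node.js 18+ for the global fetch API.
async function loginWithCorrelation(): Promise<void> {
  const page = await fetch('https://example.com/login');
  const html = await page.text();

  // Extract the dynamic token the server embedded in the form.
  const match = html.match(/name="__RequestVerificationToken"\s+value="([^"]+)"/);
  if (!match) {
    throw new Error('Token not found - the page structure may have changed');
  }

  const body = new URLSearchParams({
    __RequestVerificationToken: match[1],
    username: 'demo',
    password: 'demo',
  });

  const res = await fetch('https://example.com/login', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body,
  });
  console.log('Login response status:', res.status);
}

loginWithCorrelation().catch(console.error);
```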
I'm looking to write a JS library/toolkit like Kango. Kango allows one to write JS code that executes in all major browsers. Chrome and Firefox have a nice system that allows long-running processes to run in a background page and scripts to run on page load (content scripts), while letting the two communicate via messaging.
Unfortunately, IE doesn't really have a straightforward system like this. Instead, it looks like the best solution is to use a Browser Helper Object to load the background/content scripts somehow.
I'm not exactly sure how to get started.
Is it possible for the BHO to run an invisible IE instance which runs the background script so that the background script is executed in a separate environment? If so, how would content scripts and said background page communicate?
How would I make my JS libraries available to both types of scripts? How would a response from a background script (an XMLHttpRequest object) find its way to a content script?
Thanks in advance.
I'm working on a headless browser based on WebKit (using C++/Qt4) with JavaScript support. The main purpose of this is to be able to generate an HTML snapshot of websites that rely heavily on JavaScript (see Backbone.js or any other JavaScript MVC framework).
I'm aware that there isn't any way of knowing when the page is completely loaded (please see this question), and because of that, after I get the loadFinished signal (docs here) I create a timer and start polling the DOM content (as in checking the content of the DOM every X ms) to see whether there were any changes. If there weren't, I assume that the page was loaded and print the result. Please keep in mind that I already know this is a far-from-perfect solution, but it's the only one I could think of. If you have a better idea, please answer this question.
NOTE: The timer is non-blocking, meaning that everything running inside WebKit shouldn't be affected/blocked/paused in any way.
After testing the headless browser with some pages, everything seems to work fine (or at least as expected). But here is where the heisenbug appears. The headless browser should be called from a PHP script, which should wait (blocking call) for some output and then print it.
On my test machine (Apache 2.3.14, PHP 5.4.6), running the PHP script outputs the desired result, i.e. the headless browser fetches the website, runs the JavaScript and prints what a user would see; but running the same script on the production server fetches the website, runs only some of the JavaScript code and prints that partial result.
The source code of the headless browser and the PHP script I'm using can be found here.
NOTE: The timer (as you can see in the source code of the headless browser) is set to 1 s, but setting a longer time doesn't fix the problem.
NOTE 2: Catching all JavaScript errors doesn't show anything, so it's not because of a missing function, wrong args, or any other type of incorrect code.
I'm testing the headless browser with two websites.
The first one works both on my test machine and on the production server, while the second one works only on my test machine.
I'm more prone to think that this is some weird bug in the JavaScript code of the second website rather than in the headless browser's code, as it generates a perfect HTML snapshot of the first website. But then again, this is a heisenbug, so I'm not really sure what is causing it.
Any ideas/comments will be appreciated. Thank you
Rather than polling for DOM changes, why not watch network requests? This seems like a safer heuristic to use. If there has been no network activity for X ms (and there are no pending requests), then assume the page is fully "loaded".
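The question is about Qt WebKit in C++, but to illustrate the same heuristic in another stack, here is roughly how it looks with Puppeteer, tracking in-flight requests until the network has been quiet for a while. The URL and timings are placeholders:

```typescript
// Network-activity heuristic: consider the page "loaded" once there have been
// no in-flight requests for a quiet period. URL and timings are placeholders.
import puppeteer, { Page } from 'puppeteer';

// Track in-flight requests and the time of the last network activity.
function trackNetwork(page: Page): { inflight: () => number; last: () => number } {
  let inflight = 0;
  let lastActivity = Date.now();
  const bump = (delta: number) => {
    inflight = Math.max(0, inflight + delta);
    lastActivity = Date.now();
  };
  page.on('request', () => bump(1));
  page.on('requestfinished', () => bump(-1));
  page.on('requestfailed', () => bump(-1));
  return { inflight: () => inflight, last: () => lastActivity };
}

async function snapshot(url: string, quietMs = 1000, timeoutMs = 30000): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    const net = trackNetwork(page);
    await page.goto(url, { waitUntil: 'domcontentloaded' });

    // Consider the page loaded once nothing has been in flight for `quietMs`.
    const start = Date.now();
    while (Date.now() - start < timeoutMs) {
      if (net.inflight() === 0 && Date.now() - net.last() >= quietMs) break;
      await new Promise(resolve => setTimeout(resolve, 100));
    }
    return await page.content();
  } finally {
    await browser.close();
  }
}

snapshot('https://example.com/backbone-app')
  .then(html => console.log(html.length))
  .catch(console.error);
```

Newer Puppeteer versions also ship a built-in page.waitForNetworkIdle helper that implements essentially the same idea.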
I have seen this excellent Firefox extension, Screengrab!. It takes a "picture" of the web page and copies it to the clipboard or saves it to a PNG file. I need to do the same, but with a new web page, from a URL I have in JavaScript. I can open the web page in a new window, but then I need to invoke the extension programmatically rather than pressing the control, and have it save the page once it is fully loaded.
Is it possible?
I am pretty certain that it is not possible to access any Firefox add-on through web page content. This could create privacy and/or security issues within the Firefox browser (as the user has never given you permission to access such content on their machine). For this reason, I believe Firefox add-ons run in an entirely different JavaScript context, thereby making this entirely impossible.
However, as Dmitriy's answer states, there are server-side workarounds that can be performed.
It does not look like Screengrab has any JavaScript API.
There is a PHP solution for Saving Web Page as Image.
If you need to do it from JavaScript (from the client side), you can:
Step 1: Create a PHP server-side app that does the trick (see the link) and accepts a JSONP call.
Step 2: Create a client-side page (JavaScript) that sends a JSONP request to that PHP script. See my answer here; it will help you create such a request.
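If the server-side piece does not have to be PHP, the same workaround can be built with headless Chrome. A hedged sketch of the screenshot step using Puppeteer (Node.js/TypeScript), which a small HTTP endpoint could wrap so the client-side JavaScript can call it; the URL and output path are placeholders:

```typescript
// Server-side screenshot of a fully loaded page, as the backend for an endpoint
// that client-side JavaScript can call. URL and output path are placeholders.
import puppeteer from 'puppeteer';

async function screenshotPage(url: string, outPath: string): Promise<void> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 });
    // Wait for the network to go idle so the page is fully rendered before capturing.
    await page.goto(url, { waitUntil: 'networkidle0' });
    await page.screenshot({ path: outPath, fullPage: true });
  } finally {
    await browser.close();
  }
}

screenshotPage('https://example.com', 'page.png').catch(console.error);
```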