I'm working on a headless browser based on WebKit (using C++/Qt4) with JavaScript support. The main purpose is to generate an HTML snapshot of websites that rely heavily on JavaScript (see Backbone.js or any other JavaScript MVC framework).
I'm aware that there isn't any way to know when a page is completely loaded (please see this question), so after I get the loadFinished signal (docs here) I create a timer and start polling the DOM content (i.e. checking its content every X ms) to see whether there were any changes. If there weren't, I assume the page has loaded and print the result. Please keep in mind that I already know this is a far-from-perfect solution, but it's the only one I could think of. If you have any better idea, please answer this question.
NOTE: The timer is non-blocking, meaning that nothing running inside WebKit should be affected/blocked/paused in any way.
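For illustration only, here is a minimal sketch of that polling logic in Python (the real implementation is C++/Qt4, where QWebFrame::toHtml() would supply the DOM snapshot and the non-blocking QTimer replaces the linear loop below; get_dom_html is a hypothetical callable):

```python
import hashlib
import time

def wait_for_stable_dom(get_dom_html, interval_ms=1000, max_wait_s=30):
    """Poll the DOM every interval_ms and return it once two consecutive
    snapshots are identical; give up after max_wait_s seconds."""
    last_hash = None
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        html = get_dom_html()
        current = hashlib.sha1(html.encode("utf-8")).hexdigest()
        if current == last_hash:
            return html          # unchanged since the last poll: assume loaded
        last_hash = current
        time.sleep(interval_ms / 1000.0)
    return get_dom_html()        # deadline hit: return whatever we have
```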
After testing the headless browser with some pages, everything seems to work fine (or at least as expected). But here is where the heisenbug appears. The headless browser is called from a PHP script, which waits (a blocking call) for some output and then prints it.
On my test machine (Apache 2.3.14, PHP 5.4.6), running the PHP script outputs the desired result: the headless browser fetches the website, runs the JavaScript and prints what a user would see. But running the same script on the production server fetches the website, runs only some of the JavaScript code and prints that partial result.
The source code of the headless browser and the PHP script I'm using can be found here.
NOTE: The timer (as you can see in the source code of the headless browser) is set to 1 s, but setting a bigger amount of time doesn't fix the problem.
NOTE 2: Catching all JavaScript errors doesn't show anything, so it's not a missing function, wrong arguments, or any other kind of incorrect code.
I'm testing the headless browser with 2 websites.
This one works both on my test machine and on the production server, while this one works only on my test machine.
I'm more prone to think that this is some weird bug in the JavaScript code of the second website rather than in the code of the headless browser, as it generates a perfect HTML snapshot of the first website; but then again, this is a heisenbug, so I'm not really sure what is causing all this.
Any ideas/comments will be appreciated. Thank you.
Rather than polling for DOM changes, why not watch network requests? This seems like a safer heuristic to use. If there has been no network activity for X ms (and there are no pending requests), then assume the page is fully "loaded".
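A minimal sketch of that heuristic, in Python for brevity (in the asker's C++/Qt4 stack, QNetworkAccessManager's request signals would drive the two callbacks; that mapping is my assumption):

```python
import time

class NetworkIdleWatcher:
    """Counts in-flight requests and reports when the network has been
    quiet for idle_ms milliseconds with nothing pending."""

    def __init__(self, idle_ms=500):
        self.idle_ms = idle_ms
        self.pending = 0
        self.last_activity = time.monotonic()

    def on_request_started(self):   # hook to the browser's request-started event
        self.pending += 1
        self.last_activity = time.monotonic()

    def on_request_finished(self):  # hook to the request-finished/failed event
        self.pending -= 1
        self.last_activity = time.monotonic()

    def is_idle(self):
        quiet_ms = (time.monotonic() - self.last_activity) * 1000
        return self.pending == 0 and quiet_ms >= self.idle_ms
```

The existing timer could then poll is_idle() instead of diffing the DOM, and only print the snapshot once it returns true.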
I want to get the INSPECT ELEMENT data of a website. Let's say Truecaller, so that I can get the name of the person whose mobile number I searched.
But whenever I make a Python script, it gives me the PAGE SOURCE, which does not contain the required information.
Kindly help me. I am a beginner, so kindly excuse any mistakes in the question.
TL;DR: Use Selenium (and PhantomJS)
The view page source option will give you the HTML that was loaded when you made a request for the page (which is most likely what you are getting when you make a request from Python).
Since nowadays a lot of pages load things and modify the DOM after the initial HTML is loaded, you will not get most of the information you want just by looking at that initial response.
To get the inspect element information you will need some sort of web browser to actually go to the page, wait for the information you want to load, and then use it. However, you still want to do this from your Python script.
Enter Selenium, a tool for browser automation (mostly used for testing webpages). You can create a Python script that opens a browser page and executes whatever code you write for it (it can even wait a while and then search for a DOM element added after load!). Your script will still open a visible browser window (which is kind of weird, I would guess).
Enter PhantomJS, a headless browser you can drive to do all your web testing without having to rely on an actual browser UI.
Using Selenium alone you might achieve your goals, but with PhantomJS you can do it in an even cleaner way. Good luck!
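A minimal Python sketch of that workflow, assuming an older Selenium release that still ships the PhantomJS driver (the URL and element ID are placeholders, not Truecaller's real markup):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# PhantomJS support was removed from newer Selenium releases; with Selenium 4+
# use a headless Chrome/Firefox driver instead.
driver = webdriver.PhantomJS()
driver.get("https://example.com/search?q=...")  # placeholder URL

# Wait up to 10 s for a node that JavaScript fills in after the initial load.
result = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "result"))  # hypothetical element ID
)
print(result.text)         # the post-JavaScript content, i.e. what "inspect element" shows
print(driver.page_source)  # the full DOM after scripts have run
driver.quit()
```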
INSPECT ELEMENT and VIEW PAGE SOURCE are not the same.
View source shows you the original HTML source of the page. When you view source from the browser, you get the HTML as it was delivered by the server, not after JavaScript does its thing.
The inspector shows you the DOM as it was interpreted by the browser. This includes, for example, changes made by JavaScript which cannot be seen in the HTML source.
What you see in the element inspector is not the source code anymore; you see a JavaScript-manipulated version.
Instead of trying to execute all the scripts on your own, which may lead to multiple problems like cross-origin security and so on, search the network tab for the actual search request and its parameters, then request the data from there. That is the trick.
It also seems like you need to be logged in to search on the URL you provided, so you may eventually need to adapt cookies/session/headers and so on, just like a request from your browser would.
So what I want to say is: if the data you are looking for is not in the source, better analyse where it is coming from.
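For illustration, a hedged Python sketch of that approach; the endpoint, parameter, and cookie below are placeholders you would read out of your own network tab, not Truecaller's real API:

```python
import requests

session = requests.Session()
# Carry the same headers/cookies a logged-in browser request would carry
# (values are placeholders copied from your own network tab).
session.headers.update({"User-Agent": "Mozilla/5.0"})
session.cookies.set("sessionid", "PASTE_FROM_BROWSER")

resp = session.get(
    "https://www.truecaller.com/api/search",  # hypothetical endpoint
    params={"q": "+15551234567"},             # hypothetical parameter
)
print(resp.text)  # the data the page renders, without executing any JavaScript
```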
The issue this question originates from is the following: I'm using a TiddlyWiki (Classic) SPA on my Android device and usually use it with Firefox and its TiddlyFox extension for saving. For some reasons I'd like to be able to work with (and save) my TWs using other browsers, so I'm testing it with a PHP back-end (my fork of MicroTiddlyServer, whose code is not important here, I believe, plus this PHP server).
In my tests I've noticed that although saving works fine, sometimes (at least when the PHP server is unloaded from memory due to that ugly Android "optimization", which seems not to be configurable) a TW is loaded from cache, and because of that it is loaded as it was before the latest save, not after.
So, what I want is to detect whether the page was loaded in the ordinary way or from the browser cache. Is it possible to check this via JavaScript?
As a worse alternative, I can inject a timestamp via MTS and check it in the TW on load, but I'd like to avoid this complication (it involves both front-end and back-end and adds more TW file manipulation).
With the new Resource Timing Level 2 spec you can use the transferSize property to check whether the page was loaded from cache:
var isCached = performance.getEntriesByType("navigation")[0].transferSize === 0;
Spec: https://www.w3.org/TR/resource-timing/#dom-performanceresourcetiming-transfersize
If you use the remote debugger in Chrome you can see the network requests and determine whether your item is cached or not. Firefox seems to have a remote debugger as well.
I have a SPA using Backbone + RequireJS. The document.ready event fires early enough (I believe), but it takes about 500 ms for my application to boot up, that is, to make its first GET request to my server's API. You can see this in these two images from Firefox and Chrome (Chrome consistently takes a little longer for this operation, from what I can tell):
Chrome browser:
Mozilla browser:
Does it normally take 300-500 ms for the JavaScript to start up in your app once the .js (in this case .js.gz) file is loaded into the runtime? My application is medium-heavy, so 500 ms seems extreme.
In other words, the JS file is requested and loaded at time X, and only at time X + 400 ms does the front-end finally get around to making a request, when it should happen as soon as possible (there is nothing else the front-end is waiting on except running the code).
Is there any good explanation for this?
The 400 ms may be the time the browser spent parsing and executing your script.
It may be helpful to have a look at the Timeline tab in Chrome DevTools, which is much more informative for inspecting what's happening under the hood.
So I'm playing a game online on my laptop; it is a pretty simple HTML5 game, but the optimization is nonexistent. The game, with just a few circles on my screen, is using 85% of my CPU.
Logically, I started profiling the game and trying to figure out whether I could optimize it for my laptop. Now, what I'm wondering is how I could run the game but with a tweaked version of the JS script it is running.
If I try to save the page and run it from my laptop, it of course has CORS issues, so I cannot access the server.
Can I somehow change the script executing in my browser while still staying "inside the webpage", so that I can still make normal XHR requests to the server?
Also, although this is not in the title of the question: can I somehow proxy the XHR requests through my script without breaking the CORS rules?
Why is it so different if I run the same thing from my browser and from a saved HTML file on my desktop? I have the same IP and am doing the same thing, but when run outside the original URL it "feels like" I'm running it from somewhere else. Can I somehow pretend that I'm running "from the webpage" while actually running a modified, saved HTML file?
You could proxy, given there isn't a cross-domain protection mechanism or some sort of login (which complicates things).
What you could very well do is use a browser extension which allows you to add CSS, HTML and JavaScript.
I'm not entirely savvy about extensions, so I'm not sure you can modify existing code, but I'm guessing that if you can add arbitrary JS code you may very well replace the script tag containing the game with a similar, personally modified script based on it. It's worth a try...
Link to getting started with chrome extensions
Update:
If you're set on doing it, proxying amounts to requesting a URL with your application and doing something with the page (HTML) instead of the original source. I assume you want to change the page and serve it to your browser.
With this in mind you will need the following (I don't know C#, so you'll have to google around for libraries and utilities; a language-agnostic sketch follows after the list):
a way to request URLs (see link at bottom)
a way to modify the page: you need a DOM crawler
a way to start said process and serve it to your browser by hitting your own URL, meaning you need some sort of web server
I found the following question specifically on proxying with C#.
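The steps above are language-agnostic, so here is a minimal sketch in Python rather than C# (the game URL and script tag are hypothetical); it fetches the page, swaps the game script for a locally modified copy, and serves the result:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

GAME_URL = "http://example.com/game"          # hypothetical game page
ORIGINAL = '<script src="game.js"></script>'  # tag to replace (hypothetical)
MODIFIED = '<script src="/modified-game.js"></script>'

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Step 1: request the original page.
        html = urlopen(GAME_URL).read().decode("utf-8")
        # Step 2: modify it (a plain string replace here; a real DOM
        # crawler would be more robust).
        html = html.replace(ORIGINAL, MODIFIED)
        # Step 3: serve the modified page to the browser.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(html.encode("utf-8"))

# Browse to http://localhost:8080 to get the tweaked page.
HTTPServer(("localhost", 8080), ProxyHandler).serve_forever()
```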
First I recorded a script against my "rich" Internet application, built with Wicket and JavaScript, and the replay did not go very well.
However, recording in URL mode solved a lot of these issues.
Why is that?
In general, I assume that my script recorded in URL mode captured things like:
web_url("bootstrap-collapse-ver-12312478469.js"
.
.
.
"RecContentType=text/JavaScript",
and these calls to, e.g., the JavaScript that manipulates the web page made the page recognizable to the replay, because the JavaScript was actually executed during replay. In HTML mode this JavaScript was not executed (I don't see such calls in my script after recording), and hence the page did not have the proper state for the replay to recognize it?
Is my assumption correct?
The only client types which execute JavaScript are:
GUI virtual users (QTP operating against full browsers)
Citrix/RDP operating against full browsers
TruClient
What you are seeing in URL mode is the output of the executed JavaScript, captured as a set of explicit requests.