How does google analytics collect its data?


Yes, I know you have to embed the Google Analytics JavaScript into your page.
But how is the collected information submitted to the Google Analytics server?
For example, an AJAX request should not be possible because of the browser's security settings (cross-domain scripting).
Maybe someone has already had a look at the confusing Google JavaScript code?

When an HTML page requests the ga.js file, the HTTP request itself already carries a good deal of data: IP address, referrer, browser, language, and operating system. There is no need to use AJAX for that.
Some data can't be obtained this way, though, so the GA script adds an image to the page with additional parameters in its URL. Take a look at this example:
http://www.google-analytics.com/__utm.gif?utmwv=4.3&utmn=1464271798&utmhn=www.example.com&utmcs=UTF-8&utmsr=1920x1200&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=10.0%20r22&utmdt=Page title&utmhid=1805038256&utmr=0&utmp=/&utmac=cookie value
This is a blank image, sometimes called a tracking pixel, that GA puts into HTML.
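A minimal sketch of the technique described above: encode the values in a query string and request an image with that URL. The function names, the example host, and the parameter values are illustrative assumptions, not Google's actual code.

```javascript
// Build a tracking-pixel URL by appending URL-encoded parameters to a base URL.
// `baseUrl` and `params` are hypothetical inputs chosen for illustration.
function buildPixelUrl(baseUrl, params) {
  var pairs = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(String(params[key]));
  });
  return baseUrl + '?' + pairs.join('&');
}

// Request the image. No response data is read by the script, so the
// same-origin policy does not get in the way. The request is only made
// where Image exists (i.e. in a browser).
function sendPixel(baseUrl, params) {
  var url = buildPixelUrl(baseUrl, params);
  if (typeof Image !== 'undefined') {
    var img = new Image(1, 1);
    img.src = url; // assigning src triggers the GET request
  }
  return url;
}

// Example call (values are made up):
// sendPixel('http://tracker.example.com/pixel.gif', { utmsr: '1920x1200', utmdt: 'Page title' });
```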

There are some good answers here, each of which tends to cover one method or another for sending the data. A valuable reference that covers all of the methods is missing from the answers above, though.
Google calls the different methods of sending data 'transport mechanisms'.
The analytics.js documentation lists the three main transport mechanisms it uses to send data:
This specifies the transport mechanism with which hits will be sent. The options are 'beacon', 'xhr', or 'image'. By default, analytics.js will try to figure out the best method based on the hit size and browser capabilities. If you specify 'beacon' and the user's browser does not support the navigator.sendBeacon method, it will fall back to 'image' or 'xhr' depending on hit size.
One of the common and standard ways to send some of the data to Google (which is shown in Thinker's answer) is by adding the data as GET parameters to a tracking pixel. This would fall under the category which Google calls an 'image' transport.
Secondly, Google can use the 'beacon' transport method if the client's browser supports it. This is often my preferred method because it will attempt to send the information immediately. Or in Google's words:
This is useful in cases where you wish to track an event just before a user navigates away from your site, without delaying the navigation.
The 'xhr' transport mechanism is the third way that Google Analytics can send data back home, and the particular transport mechanism that is used can depend on things such as the size of the hit. (I'm not sure what other factors go into GA deciding the optimal transport mechanism to use)
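The fallback behaviour described in the quoted documentation can be sketched roughly as follows. This is not Google's implementation: the function names and the size threshold are assumptions made for illustration.

```javascript
// Assumed threshold (in characters) above which a hit no longer fits
// comfortably in an image URL. The real limit used by analytics.js may differ.
var MAX_GET_LENGTH = 2000;

// Pick a transport name from the preferred transport, the hit size,
// and the browser capabilities.
function chooseTransport(payload, preferred, capabilities) {
  if (preferred === 'beacon' && capabilities.sendBeacon) {
    return 'beacon';
  }
  // Small hits fit in an image URL; larger ones need a POST body.
  return payload.length <= MAX_GET_LENGTH ? 'image' : 'xhr';
}

// In a browser, the capabilities could be detected like this.
function detectCapabilities() {
  return {
    sendBeacon: typeof navigator !== 'undefined' &&
      typeof navigator.sendBeacon === 'function'
  };
}
```

With this sketch, asking for 'beacon' in a browser without navigator.sendBeacon yields 'image' for a short hit and 'xhr' for a long one.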
In case you are curious how to force GA into using a specific transport mechanism, here is a sample code snippet which forces this event hit to be sent as a 'beacon':
ga('send', 'event', 'click', 'download-me', {transport: 'beacon'});
Hope this helps.
Also, if you are curious about this topic because you'd like to capture and send this data to your own site too, I recommend creating a binding to Google Analytics' send, which allows you to grab the payload and AJAX it to your own server.
ga(function(tracker) {
  // Grab a reference to the default sendHitTask function.
  var originalSendHitTask = tracker.get('sendHitTask');
  // Modify sendHitTask to send a copy of the request to a local server after
  // sending the normal request to www.google-analytics.com/collect.
  tracker.set('sendHitTask', function(model) {
    var payload = model.get('hitPayload');
    originalSendHitTask(model);
    var xhr = new XMLHttpRequest();
    xhr.open('POST', '/index.php?task=mycollect', true);
    xhr.send(payload);
  });
});
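On the receiving side, the copied payload still has to be decoded. The sketch below assumes the payload is a URL-encoded query string (the format of the Measurement Protocol hits analytics.js sends); the function name is hypothetical, and the sample values are made up.

```javascript
// Decode a URL-encoded hit payload into a plain object.
// If a key appears more than once, the last value wins.
function parseHitPayload(payload) {
  var fields = {};
  new URLSearchParams(payload).forEach(function (value, key) {
    fields[key] = value;
  });
  return fields;
}

// Example (made-up hit):
// parseHitPayload('v=1&t=pageview&dp=%2Fhome');
```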

Without looking at the code, I assume their data is collected from the HTTP headers they receive in the asynchronous request.
Remember that most browsers send data such as OS, platform, browser, version, locale, etc... Also they do have the IP so they can guesstimate your location. And I assume they have some sort of clever algorithm to decide whether you are a unique visitor or not.
Time on the site is probably calculated by using an onUnload() event.

The Google Analytics documentation provides detailed information on how the Google Analytics server collects data: http://code.google.com/apis/analytics/docs/concepts/gaConceptsOverview.html
All Google Analytics data is collected, packed into the query string of a request URL, and sent to the Google Analytics server. The HTTP request is made for a GIF image (http://www.google-analytics.com/__utm.gif) and is triggered by the Google Analytics JS.

It's easy enough to tell by using something like Firebug's Net tab.
Ajax isn't needed - since data isn't being fetched from Google. They just encode the information in a query string, and then load a transparent gif using it.

To expand on other very good answers, Google does provide an API to track async "virtual pageviews" which are reported by website authors themselves in their scripts to Google.
_gaq.push(['_trackPageview', 'my_unique_action']);
They provide it so it is possible to track actions that are not part of regular page views and http requests.
Async tracking guide:
http://code.google.com/apis/analytics/docs/tracking/asyncUsageGuide.html#Syntax

Use the httpfox or firebug Firefox extension to figure out what HTTP requests the browser sends and what responses it receives.
I don't know how Google Analytics works, but one possibility is to make the browser download an image: <img src="http://my-analytics.com" width="1" height="1"> (with a single, transparent pixel), and log all the HTTP request headers (e.g. Referer:) on the server side.

//edit: see comment at the bottom
*OK, I found an answer during a discussion with a friend of mine :-)
The information is submitted to Google Analytics in three ways:
The HTTP request can be analyzed using all the information in the HTTP headers.
A cookie is recognized by the Google Analytics server.
An AJAX call is made from within the embedded JavaScript to submit information such as display resolution, Flash Player version, etc.
This information is not transmitted via the HTTP headers.
*This is possible because the AJAX call is made in the context of the embedded JavaScript, so it's not cross-domain scripting. This was an error in reasoning on my part.**

Related

Intercept browser request using chrome extension

I'm working on a Google Chrome extension which gets the page URL and analyzes it. How can I intercept the browser request and serve that request conditionally based on some criteria? I've been searching but couldn't find any material.

That's going to be very tricky, if at all possible.
The closest thing the extensions API provides is the blocking webRequest API. There, you can intercept a request and decide to allow it or block it, but:
You can only do that until the request is sent out. So you can only rely on the URL and maybe the request headers. Even in later events (when it's too late to redirect), at no point does the webRequest API give access to the response itself.
You have to make the decision synchronously, which basically severely limits processing options.
What you could do (very much in theory) is always redirect the request to your own "loading" page, meanwhile trying to replicate the request yourself (near-impossible to fully do, also consider side-effects), analyze the response and then substitute the "loading" page with the real one.
It's going to be either very complicated or impossible to do in complex cases. You're basically trying to implement an intercepting proxy in a Chrome extension - it doesn't really provide the full toolset to do so.
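For the simpler case the answer does consider feasible, deciding synchronously from the URL alone, a blocking webRequest listener might look like the sketch below. It assumes a Manifest V2 extension with the "webRequest" and "webRequestBlocking" permissions; the host names and the decide() helper are hypothetical.

```javascript
// Decide what to do with a request from its URL alone.
// The blocked host and the redirect rule are made-up examples.
function decide(url) {
  var host = new URL(url).hostname;
  if (host === 'blocked.example.com') {
    return { cancel: true };
  }
  if (host === 'old.example.com') {
    return { redirectUrl: url.replace('old.example.com', 'new.example.com') };
  }
  return {}; // let the request through unchanged
}

// Register the listener only when running inside an extension.
if (typeof chrome !== 'undefined' && chrome.webRequest) {
  chrome.webRequest.onBeforeRequest.addListener(
    function (details) { return decide(details.url); },
    { urls: ['<all_urls>'] },
    ['blocking'] // the return value must be available synchronously
  );
}
```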

Can I call the google maps directions API from a web page (as opposed to a server)?

I am a long time programmer (C, Python, FORTRAN), but this is my first foray into javascript and anything web, so I am learning on the fly.
So, the main question up front: Can I use the google maps directions API from a script section of a simple web page on my laptop, or does it need to be called from a server?
I have an API key and I have successfully used parts of the API that are called as functions (Map, Geometry). I am trying to use the google maps directions API, which as I understand it, you must use via a URL and an HTTP GET. Here is a sample URL that my code has constructed:
https://maps.googleapis.com/maps/api/directions/json?origin=45.0491174%2C-93.46037330000001&destination=45.48282401917292%2C-93.46037330000001&key="my key"
If I paste that URL into the address bar, it works. I get a document back with the directions info. If I execute it from inside a script section on a simple web page I am building, the response I get is:
No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'null' is therefore not allowed access.
I did some searching, both in stackoverflow and elsewhere on the web and I came across this:
http://www.html5rocks.com/en/tutorials/cors/
Per that page, I checked to make sure that withCredentials was supported and then I set withCredentials to true. This did not alter the outcome. Obviously, the API works, so I am now wondering if I have to do this from a web server and not from a simple web page to get around the cross-domain limitations. I am hoping to avoid having to set up a server since this is a one-off for my own personal use, but maybe I don't have a choice?
As an aside, does anyone have any insight into why the directions API is called via a URL rather than as a javascript function like many of the others?

For JavaScript, it's better to use the Web -> Maps JavaScript API. This helped me solve the issue without a server.
The problem is that the Web Services -> Directions API, unlike e.g. the Web Services -> Geolocations API, does not offer cross-origin access for JavaScript, such as a server-side Access-Control-Allow-Origin: * response header or JSONP. Maybe this is even a bug on Google's side, because it seems very strange to me that one "Web Services" API server allows cross-origin JavaScript requests and another does not.
See https://stackoverflow.com/a/26198407/3069376
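As a rough sketch of what using the Maps JavaScript API for this looks like: it assumes the API has already been loaded on the page with a script tag (so that google.maps exists), and the helper names are hypothetical. The coordinates in the example call are the ones from the question.

```javascript
// Build the request object passed to DirectionsService.route().
function buildDirectionsRequest(origin, destination) {
  return {
    origin: origin,           // { lat: ..., lng: ... } literal
    destination: destination, // { lat: ..., lng: ... } literal
    travelMode: 'DRIVING'
  };
}

// Ask the Maps JavaScript API for directions. Must run in a page where
// the Maps JavaScript API is loaded; `onDone` is a Node-style callback.
function requestDirections(origin, destination, onDone) {
  var service = new google.maps.DirectionsService();
  service.route(buildDirectionsRequest(origin, destination), function (result, status) {
    if (status === 'OK') {
      onDone(null, result);
    } else {
      onDone(new Error('Directions request failed: ' + status));
    }
  });
}

// Example call:
// requestDirections(
//   { lat: 45.0491174, lng: -93.46037330000001 },
//   { lat: 45.48282401917292, lng: -93.46037330000001 },
//   function (err, result) { console.log(err || result.routes[0]); }
// );
```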

To answer the main question: yes. You can definitely use the Google Maps Directions API inside your web page. To get started quickly and easily, follow this link . Then,
Click on the JAVASCRIPT + HTML version, copy the whole code, paste it into a text editor, and save it as an HTML file.
Start your own local server (like node.js). Don't forget to obtain a Browser API key and set your HTTP referrer (example http://localhost:4567) in the Google Developer Console, or you will get errors.
Run your HTML file on your local server (example http://localhost:4567/myprojfolder/samplewebfile.html) .
You can do this with all Google Maps JavaScript API samples. If you're curious about setting up a node.js server, there are plenty of resources online.

Differentiate Between User Requests and AJAX/Resource Requests

I'm attempting to create an app with Node.js (using http.createServer()) which will be a single page application with requests for data via XMLHttpRequest. To do this I need to be able to differentiate between a user navigating to my domain, and AJAX requests and requests generated by the browser for linked resources.
If the request is from the user I always want to return the index.html page which will handle requesting content but if the request is browser generated or AJAX and is for CSS, Javascript or other linked files I want to serve those files. Is there any way to detect this?
Looking at the request headers for the different file types I saw the referer header appeared when the request for content was generated by the page. I figured that was the solution I was looking for but that header is also set when a user clicks on a link to the page making it useless.
The only other thing which seems to change is the accept header which could sort of work but might not be a catch all solution. Any user requests always seem to have text/html as the preferred return type regardless of which url was entered. I could detect that but I'm pretty sure AJAX requests for html files would also have that accept header which would cause problems.
Is there anything I'm missing here (any headers or properties I can look for)?
Edit: I do not need the solution to protect files and I don't care about users bypassing it with their own requests. My intention is not to hide files or make them secure, but rather to keep any data that is requested within the scope of the app.
For example, if a user navigates to http://example.com/images/someimage.jpg they are instead shown the index.html file which can then show the image in a richer context and include all of the links and functionality to go with it.
TL/DR: I need to detect when someone is trying to access the app to then serve them the index page and have that send them the content they want. I also need to detect when the browser has requested resources (JS, CSS, HTML, images, etc) needed by the app to be able to actually return the resource not the index file.

In terms of the HTTP protocol there is NO difference between a user-generated query and a browser-generated query.
Every query is just... a query.
You can make a query from the command line or with a browser, you can click a link, send some ASCII text via telnet, or ask a proxy to make the query for you; it is never the server's job to identify how the user issued the query.
Consider, for example, a request made by a user to a reverse proxy cache: this query will never reach your server (the response comes from the cache), and the first query used to build that cached response could have been made by a real user or by a browser.
In terms of security, you cannot ensure that users never request data by themselves by detecting whether a query comes from a real human click (search Google for clickjacking if you want to be afraid). Every query that a browser can make can also be replayed by the user, every one; you have no way to prevent that.
Some browser plugins even do pre-fetching: they detect links on the page and make the request before you do it yourself (if it's a GET query).
For AJAX, some libraries like jQuery add an X-Requested-With: XMLHttpRequest header, and most frameworks use it to detect AJAX mode.
But it is more robust to rely on a location policy for that, such as making your AJAX queries under /format/ajax, which can also be extended to other formats (like /format/json, /format/html, or /format/csv).
Spending time on location-policy-based routing is certainly more useful.
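Both signals mentioned above can be combined in a few lines. This is a sketch only: classifyRequest is a hypothetical helper, and `req` is assumed to have the shape Node's http module provides (a `url` string and lower-cased `headers`).

```javascript
// Classify a request by location policy first, then by the
// X-Requested-With header that some AJAX libraries add.
function classifyRequest(req) {
  var match = /^\/format\/(ajax|json|html|csv)(\/|$)/.exec(req.url);
  if (match) {
    return match[1]; // the format named in the URL
  }
  if (req.headers['x-requested-with'] === 'XMLHttpRequest') {
    return 'ajax';
  }
  return 'page'; // anything else is treated as a normal page request
}
```

The URL check comes first because it does not depend on what the client chooses to send in its headers.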
But one thing can make a difference: POST queries are not idempotent. This means the browser will not make a POST query without real user interaction, because a POST query may alter the state of the session or of the server data (JS can make POST queries, though; this is just the default behavior of browsers). The browser will never issue a POST query automatically, so you could build a website where all user interactions are POST queries (via forms, or via some JS that turns link clicks into POST AJAX queries). But I'm not sure that's your real goal.

Not technically an answer to the question but I found a simple solution which does what I want: prefix all app based requests with a subdomain eg. http://data.example.com/. It's then really simple to check the host header for that subdomain: if present send the resource else send the index page.
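A minimal sketch of that host check, assuming the data.example.com subdomain from the example above; the function names are hypothetical, and `req` has the shape Node's http module provides.

```javascript
// Return true when the Host header names the data subdomain.
function isDataRequest(hostHeader) {
  // Drop an optional port and normalise case before comparing.
  var host = String(hostHeader || '').split(':')[0].toLowerCase();
  return host === 'data.example.com';
}

// Decide what to serve: the requested resource or the index page.
function route(req) {
  return isDataRequest(req.headers.host) ? 'resource' : 'index';
}
```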

Send analytic data to different domain without response

Precondition
I own mysite.com
I do not own othersite.com, but I can embed javascript code there
Question
How to send analytic data from othersite.com to mysite.com ?
Expected : othersite.com client -> mysite.com server
Not expected : othersite.com client -> othersite.com server -> mysite.com server
The principle seems to be similar to Google Analytics, but I don't know exactly how that works
I know that it can't be done with AJAX because of the cross-domain problem
How does it change if I own othersite.com ?
How to send analytic data without response ?
For example, Heap Analytics send analytic data without response

The default scenario with Google Analytics (and all other web analytics tools I know) is to transfer data across domains by dynamically creating an image whose source points to the tracking server, with user data (such as a unique ID per user) appended as URL parameters to the image source.
Apart from everything you send via the image source, you also get the data from the HTTP request (IP address, user agent, etc.).
For a simple system you could create a script that stores the url and http data directly to a database before it returns a (1 pixel transparent) image. If you want something scalable you would probably write the data to a log file and use some currently hyped big data technology (hadoop, hive etc) for processing.
Decoupling data collection and processing is a good idea in any case, in that it allows you to more easily swap components of your tracking application for improved versions without affecting the other parts of the system.
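The collection side of such a simple system could be sketched as below for Node.js. The handler and record layout are assumptions for illustration; the hit is handed to a `sink` function (for example a log-file writer) so that processing stays decoupled from collection.

```javascript
// 1x1 transparent GIF, base64-encoded.
var PIXEL = Buffer.from('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7', 'base64');

// Turn a request into a log record: URL parameters plus selected HTTP data.
// If a parameter is repeated, the last value wins.
function toRecord(req) {
  var url = new URL(req.url, 'http://localhost'); // base is only needed for parsing
  var params = {};
  url.searchParams.forEach(function (value, key) { params[key] = value; });
  return {
    time: new Date().toISOString(),
    ip: req.socket ? req.socket.remoteAddress : undefined,
    userAgent: req.headers['user-agent'],
    referer: req.headers['referer'],
    params: params
  };
}

// Collection step: pass one JSON line per hit to `sink`, then answer with the pixel.
function handlePixelRequest(req, res, sink) {
  sink(JSON.stringify(toRecord(req)) + '\n');
  res.writeHead(200, { 'Content-Type': 'image/gif', 'Content-Length': PIXEL.length });
  res.end(PIXEL);
}

// Possible wiring with Node's http module:
// require('http').createServer(function (req, res) {
//   handlePixelRequest(req, res, function (line) { process.stdout.write(line); });
// }).listen(8080);
```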
Sending an image is reliable insofar as it works in any browser without requiring any special configuration (CORS etc.). It is, however, rather easily blocked (users just have to block pixel images or redirect calls to your server via their hosts file).
If the other domain was yours you could
track via ajax
read the server logs directly or pipe them to a dashboard of your choice
If you do not have access physically to the server but the owners let you configure their name servers you could run all incoming http requests through your tracking script before redirecting them to the requested page.
I took a look at Heap Analytics. They send an image request just like the other tools:
https://heapanalytics.com/h?a=236035469&u=4184751431615606&v=2274541888&s=3701858993&b=web&z=2&h=%2F&d=heapanalytics.com&t=Heap%20%7C%20Mobile%20and%20Web%20Analytics&r=https%3A%2F%2Fwww.google.de%2F&k=Screen%20Dimensions&k=1050%20x%201680&k=Window%20Width&k=1973&k=Window%20Height&k=1039&tm=1432884624859
which returns an HTTP 200 response code and a 1-pixel transparent image, so it does not look like they "track without response" after all.

Chrome extension for blocking websites based on database blacklist

We have a database with millions of domain categorizations (storing it client side is not an option) and we want to make a chrome extension to blacklist sites based on how they are categorized in the Mysql database.
The server side stuff is easy, we post the domain, and return the category.
The tricky part is blocking requests based on the categorization. Here are a few potential implementations and why they won't (quite) work.
Idea 1:
Redirect all traffic using Chrome.webRequest to mysite.com/script.php?url=www.theoriginalurl
This script checks the database's category & either redirects them to the theoriginalurl.com or denies the request, redirecting them to www.youGotBlocked...
Have the chrome extension check the http referrer header to make sure that they came from mysite.com (unless the url is mysite.com, then do nothing).
Problems:
It doesn't seem like we can set the referrer header in PHP, so we have no way of knowing that they came from mysite.com. It seems like maybe we should be passing info via a cookie, but I haven't thought of an elegant solution involving cookies.
Idea 2:
Every time Chrome.webRequest fires make an AJAX POST request to mysite.com/categorizeURL.php with the URL to get the category. Block or allow based on the server's response.
Problems:
Either we make the request asynchronous and can't get the response in time (there is no way that we have found to delay the callback until the server responds -- more on that here), or we make the request synchronous, and IT WORKS!!! Except for the fact that if they can't reach our server, their entire browser locks up and they essentially need to refresh the extension to be able to access the internet again.
Other ideas?
Does anyone have other ideas for creating a blacklist via a Chrome extension? I simply refuse to believe that it is not possible.
