How to get raw Google Analytics data using Keen IO

If you can’t measure it, you can’t improve it

— Peter Drucker

For most web startups, performance measurement starts (and sometimes ends) with Google Analytics. There is so much else to do during the initial stages of development that you just solve your analytics problem quickly by pasting some GA code into your template.

And that’s often enough at the beginning. Google Analytics provides data aggregation and visualization in a variety of dimensions, way more than an average startup even needs at first. But, as you grow, you come to learn that you do need more. And eventually, everyone who has been happily using GA hits the same wall: there is no way to get access to raw GA data! Well, there is a way, but it costs $150,000.

Thankfully, we have discovered a workaround for this problem using Keen IO, a custom analytics solution for developers. Read on to see how we did it.

The wall we’ve hit

Uploadcare was launched a few years ago. When I joined the project last fall, it had a substantial amount of stats collected in GA. Although the data was quite useful, it did not answer one very important question:

Where are our paying users coming from?

And OK, sure. By setting up our e-commerce system properly, we could get this answer. But we'd have to wait half a year until enough data was collected, and we couldn't afford to wait that long — there were decisions we needed to make right now, and we needed at least a rough estimate of this data to make them…

In this article I will describe how we resolved this issue by logging raw GA data, collecting visitor IDs, associating the IDs with registered users, and feeding the missing e-commerce transactions back into GA.

Note that you can’t do this process with any other analytics tools available on the market, like KISSMetrics or Segment, unless you are using them from the very beginning. The raw data they provide contains anonymous visitor identifiers, but it is different from GA’s cid, so there is no way to map them to transactions.

First, though, let me start by giving you a brief summary of how Google Analytics works.

GA in action

Before we begin, let me clarify that, by “GA,” I mean the new Universal Analytics from Google. It uses the same principles as its predecessor, the “traditional” GA, but the structure of the data has been changed substantially.

In the simplest case, GA tracking is performed by making two JavaScript calls on every page of your website that a user visits:

ga('create', 'UA-12345678-9', 'auto');
ga('send', 'pageview');

The code fragment above sends a bunch of data to the GA server in the form of a query string. Google's Measurement Protocol documentation is a good reference guide to everything that gets sent, but here are the most important variables:

  • cid — Anonymous client ID, created when the user first visits the website. This value is bound to the browser and is permanently stored in a cookie (see the snippet after this list).
  • dl — Document location; link to the page the user has opened.
  • dr — Document referrer; location from which the user has navigated to the current page.
  • t — Type of tracking call (“pageview”, “event”, etc.)
  • uid — Identifier of the authenticated (or recognized) user in the system.
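
As an aside, since cid is just a first-party cookie value, you can read it yourself. Here is a minimal sketch, assuming the default _ga cookie, whose value looks like GA1.2.198951256.142178325 with the last two segments forming the cid:

// Read the GA client ID from the default "_ga" cookie.
// Assumes the standard "GA1.2.<random>.<timestamp>" value format,
// where the last two segments together form the cid.
function getGaClientId() {
  var match = document.cookie.match(/(?:^|;\s*)_ga=([^;]+)/);
  if (!match) return null;
  var segments = match[1].split('.');
  // e.g. ['GA1', '2', '198951256', '142178325'] -> '198951256.142178325'
  return segments.slice(-2).join('.');
}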

uid is not set automatically — one more JavaScript call is required — so we actually have three calls in total:

ga('create', 'UA-12345678-9', 'auto');
ga('set', '&uid', userIdFromTheDatabase);
ga('send', 'pageview');

The last call initiates a GET request to the Google Analytics server. Here is an example request from my browser to the Uploadcare documentation page (some identifiers are obfuscated):

https://www.google-analytics.com/collect?
v=1&_v=j31&a=1314540999&
t=pageview&_s=1&ul=en-us&de=UTF-8&
dl=https%3A%2F%2Fuploadcare.com%2F&
dr=http%3A%2F%2Fjsbin.com%2Fbeyaz%2F4&
dt=Uploadcare%3A%20File%20Upload%20Widget%2C
%20Cloud%20Storage%20with%20API%2C%20and%20CDN&
sd=24-bit&sr=1366x768&vp=1366x331&je=1&fl=16.0%20r0&
_u=cACADEQZI~&jid=&
cid=198951256.142178325&uid=1234&
tid=UA-12345678-9&z=540214989

This sort of data is sent from every single page, then stored on the GA servers. A normal user, however, only has access to the aggregated visualization of the data, not the data itself.

The solution

We developed a workaround for fixing past analytics data without having direct access to it in GA: we send the missing transactions to Google Analytics, associating each with the correct cid. Since GA already knows the source and medium behind each cid, we can then analyze where each paying user comes from.

I decided to use Keen IO for this task, as it provides the simplest way to collect and analyze any kind of denormalized structured data. It's the right choice when you need a quick, nonstandard solution for working with Big Data.

Step 1: Collect the data

The code used to send data to GA can be found in analytics.js. As you can see, it's minified, so in order to modify the code, we need to unminify it first. There are plenty of online tools to do so, including http://jsbeautifier.org and http://www.dirtymarkup.com.

Now, let’s look in the code to find where the data is actually sent to the GA server. There are a few such fragments in the file, all following the same pattern:

d = window.XDomainRequest;

if (!d) return !1;

d = new d();
d.open('POST', a);
d.onerror = function() {
  c();
};
d.onload = c;

d.send(b);

where a is the URL, and b is a query string with the data. Our aim is to send the same data to Keen IO right after it’s sent to GA, so let’s add a function call:

// ...
d.send(b);
post_keen(a, b);

It isn’t necessary to send a (the GA URL), but it’s usually a good idea to collect any data that doesn’t take up too much space. Besides, it could potentially be useful in the future. :)

Here is the implementation of post_keen:

// Convert a query string like 'a=1&b=2' into a plain object
var QueryStringToJSON = function(pth) {
  var pairs, result;

  pairs = pth.split('&');
  result = {};

  pairs.forEach(function(pair) {
    pair = pair.split('=');
    result[pair[0]] = decodeURIComponent(pair[1] || '');
  });

  return result;
};
window.post_keen = function(url, query) {
  var json;

  json = QueryStringToJSON(query);
  json.analytics_url = url;
  // '${keen.ip}' and '${keen.user_agent}' are dynamic values
  // that Keen IO replaces server-side with the real values
  json.ip_address = '${keen.ip}';
  json.user_agent = '${keen.user_agent}';

  // Add addons for parsing Geo data and User Agent
  json.keen = {
    addons: [
      {
        name: 'keen:ip_to_geo',
        input: {
          ip: 'ip_address'
        },
        output: 'ip_geo_info'
      },
      {
        name: 'keen:ua_parser',
        input: {
          ua_string: 'user_agent'
        },
        output: 'parsed_user_agent'
      }
    ]
  };

  // Conditionally add URL parsers for "dl" and "dr":
  if (json.hasOwnProperty('dl')) {
    json.keen.addons.push({
      name: 'keen:url_parser',
      input: {
        url: 'dl'
      },
      output: 'parsed_dl'
    });
  }

  if (json.hasOwnProperty('dr')) {
    json.keen.addons.push({
      name: 'keen:url_parser',
      input: {
        url: 'dr'
      },
      output: 'parsed_dr'
    });
  }

  // Referrer parser - very useful for
  // traffic source analysis
  if (json.hasOwnProperty('dl') && json.hasOwnProperty('dr')) {
    json.keen.addons.push({
      name: 'keen:referrer_parser',
      input: {
        referrer_url: 'dr',
        page_url: 'dl'
      },
      output: 'referrer_info'
    });
  }

  keen_client.addEvent('google_analytics', json);
};

This sends a copy of the data to Keen IO in JSON format, adding extra information via Keen’s dynamic values and add-ons (ip_address, user_agent, geo data, etc.).
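
Note that post_keen assumes a keen_client instance already exists on the page; it is not defined in analytics.js itself. As a minimal sketch, it could be created with the keen-js tracking library (the project ID and write key are placeholders from your Keen IO project settings):

// Minimal Keen IO client setup (keen-js style API);
// both values below are placeholders.
var keen_client = new Keen({
  projectId: 'YOUR_PROJECT_ID',
  writeKey: 'YOUR_WRITE_KEY'
});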

Step 2: Query the collected data

Keen IO provides simple yet powerful tools for querying raw or aggregated data. You can find a more detailed description of what Keen lets you query in their documentation.

In our case, the mapping between uid and cid had to be queried, which is easy to do with Keen’s select_unique query:

// Build the query
var query = {
  api_key: getApiKey(),
  event_collection: 'google_analytics',
  filters: JSON.stringify([
    {
      property_name: 'uid',
      operator: 'exists',
      property_value: true
    }
  ]),
  target_property: 'cid',
  group_by: 'uid'
};

// Convert the query object to a query string
// (it's better to make a separate function)
var parts = [];
for (var p in query) {
  if (query.hasOwnProperty(p)) {
    if (p.indexOf('[]') < 0) {
      parts.push(encodeURIComponent(p) + '=' + encodeURIComponent(query[p]));
    } else {
      for (var v in query[p]) {
        parts.push(encodeURIComponent(p) + '=' + encodeURIComponent(query[p][v]));
      }
    }
  }
}
var query_string = parts.join('&');

// Build the URL to query
var url = 'https://api.keen.io/3.0/projects/ID/queries/select_unique?' + query_string;

This returns a JSON object (name/value pairs), mapping each uid to the list of cids.
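
To actually run the query, issue a GET request to that URL. Here is a sketch using the Fetch API; the response shape shown in the comment is what Keen documents for grouped queries, but treat it as an assumption and verify it against your own data:

// Execute the select_unique query and build a uid -> [cids] map.
fetch(url)
  .then(function(response) { return response.json(); })
  .then(function(data) {
    // Expected shape for grouped queries:
    // { "result": [ { "uid": "1234", "result": ["cid1", "cid2"] }, ... ] }
    var uidToCids = {};
    data.result.forEach(function(row) {
      uidToCids[row.uid] = row.result;
    });
    console.log(uidToCids);
  });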

Fun fact: 1 in 4 registered users visits the Uploadcare website using more than one device.

Step 3: Run e-commerce transactions

Now that we have mapped the connection between a unique visitor (the cid) and a user ID in our database, we can make an e-commerce tracking call back to Google Analytics for that cid. GA knows where each unique visitor originates from, so we can now get profitability stats for each channel we use here at Uploadcare and see which are most effective.
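
For illustration, here is a minimal sketch of such a call as a Node.js script using GA’s Measurement Protocol. The hit parameters (v, tid, cid, t, ti, tr) come from Google’s Measurement Protocol reference; the property ID, transaction ID, and revenue are placeholders:

// Backfill an e-commerce transaction in GA via the Measurement
// Protocol, attributing it to a recovered client ID (cid).
var https = require('https');
var querystring = require('querystring');

function sendTransaction(cid, transactionId, revenue) {
  var payload = querystring.stringify({
    v: 1,                  // Measurement Protocol version
    tid: 'UA-12345678-9',  // your GA property ID (placeholder)
    cid: cid,              // client ID recovered from Keen IO
    t: 'transaction',      // hit type
    ti: transactionId,     // unique transaction ID (placeholder)
    tr: revenue            // transaction revenue
  });

  var req = https.request({
    host: 'www.google-analytics.com',
    path: '/collect',
    method: 'POST'
  });
  req.on('error', function(err) { console.error(err); });
  req.end(payload);
}

// One call per recovered transaction, for example:
sendTransaction('198951256.142178325', 'ORDER-1', 49.0);

Looping this over the uid-to-cid mapping from Step 2 replays each missing transaction against the visitor who actually paid.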

The result

I admit that this way of restoring the data on paying customers was not as effective for understanding our business as I had hoped, and here’s why:

Once integration is finished, users rarely visit the Uploadcare website, because the system just works for them and rarely requires any maintenance. That’s why only 10% of paying users came back to the website during the last month.

Still, this data was enough to give us a rough idea about the profitability of some of our most valuable channels. I’m sure that for other SaaS services, where the website is part of daily user interaction, this method would be even more useful.

One more unplanned (yet very useful) benefit

Another type of useful information we were able to extract from the raw GA data is the time between a user’s first visits to each page, which helped us put our sales funnel on a timeline.

Funnel timelines are a powerful tool that can help answer a number of questions about the timing of the user experience on your website, including:

  • What is the average time between steps in your funnel?
  • How many minutes (or hours, or days) does it take for half of your users to move from one funnel step to the next?
  • Can users be separated into cohorts based on their funnel timing experience?

The technique to build a funnel timeline is a bit complicated — I could not find an out-of-the-box solution with any of the popular web analytics tools — and I will describe it in a future post.

Fun fact: 60% of users who created a project after registration did so in less than 5 minutes (!) from their first visit to the website. This means that this part of the funnel works really fast! At the same time, however, it shows that we don’t always do a good job persuading the “slow” decision makers, the ones who take time before they decide to sign up. Something for us to work on moving forward!

Final thoughts

Whenever you’re working with analytics, it’s important to set up the collection of your data properly. There is nothing more frustrating than realizing that you weren’t collecting enough data during the past few months, or that you were collecting it on the wrong things. At the same time, too much data can be a burden, too; it’s difficult to decide exactly what needs to be collected.

The same thing applies when you’re collecting raw analytics data — I hope the above techniques help you collect the data most important to your business, without investing too much of your own time. They certainly helped us at Uploadcare.

This article was originally published by David Avs.
