Why is it important to track PDFs?

For one of our clients, a key online goal is to maximise views of their documents, most of which are in PDF format. Like many businesses, they use Google Tag Manager to load tracking scripts such as Google Analytics, which track page views and measure conversions.

The complication was that most of their campaigns placed links in printed publications and in emails to their mailing lists, and these links pointed directly at the PDF files. Because a direct link returns the file itself rather than a web page, the tracking scripts on the website never ran for these downloads, leaving an incomplete picture of how many people were actually accessing the documents. There are many simple solutions that track clicks on links to PDF files, but they only work when the link sits on a page carrying the tracking scripts; on their own they tell us nothing about direct requests for the file.

We explored several solutions to this problem, which are covered in this blog post along with their individual merits and limitations. What worked best for us might not be what is best for your requirements; the right answer for you could be something else entirely.

Solution 1: embedding in an HTML page

The main problem with tracking PDFs is that the client’s tracking codes rely on JavaScript, and a request for a PDF file returns no JavaScript, just the file to be downloaded. One solution is to create an HTML page that contains the JavaScript along with an embedded PDF that takes up 100% of the page height and width. The browser window then looks just as it would when displaying a PDF directly, even though what it has loaded is an HTML page rather than a PDF file.
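
As a rough illustration of the idea, the sketch below builds such a wrapper page as a string. It is written in Python purely for illustration; the function name `build_wrapper_page`, the use of an `<object>` element and the placeholder tracking snippet are all assumptions, not the markup we actually deployed.

```python
from html import escape


def build_wrapper_page(pdf_url: str, title: str, tracking_snippet: str = "") -> str:
    """Return an HTML page that embeds a PDF at full window size.

    tracking_snippet is inserted verbatim, so it must be trusted markup
    (for example, a tag manager container snippet).
    """
    safe_url = escape(pdf_url, quote=True)
    safe_title = escape(title)
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>{safe_title}</title>
<style>
  html, body {{ height: 100%; margin: 0; }}
  object {{ display: block; width: 100%; height: 100%; }}
</style>
{tracking_snippet}
</head>
<body>
<object data="{safe_url}" type="application/pdf">
  <!-- Fallback for browsers that cannot embed PDFs -->
  <p><a href="{safe_url}">Download the document</a></p>
</object>
</body>
</html>"""
```

The link inside the `<object>` element is only shown when the browser cannot render the embedded PDF, which is how the direct-link fallback for older browsers would typically be provided.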

This satisfied all the tracking requirements, as the tracking code snippets were brought in on this page just as they are on every other page, and most users would not notice any difference when reading documents. However, it presents a problem if users try to save or print using the browser menu. The browser will attempt to print an HTML page with a PDF embedded inside it, which behaves very differently from printing a PDF directly. The same happens when saving: the browser saves the HTML page, not the PDF. For security reasons, developers cannot override the browser’s behaviour when users invoke these features.

Ultimately that meant this solution was not appropriate for our needs.

Pros

  • Loads almost as fast as a regular PDF request
  • Can add as much JavaScript tracking to the page as we wish, as it is HTML-based
  • Look and feel of the page seems consistent in most browsers
  • Older browsers can display a direct link to the document as a fallback if they do not support embedding PDFs

Cons

  • The browser’s save and print functions don’t work correctly (NOTE: printing issues can be overcome with a well-written print stylesheet, but saving cannot be overridden)
  • Rendering on mobile devices is unreliable

Solution 2: PDF.js

Another route we explored was pdf.js, made by Mozilla and individual contributors, which would give our PDFs a consistent look and feel across browsers, allow us to embed JavaScript tracking, and provide improved print functionality. However, its browser compatibility was unacceptable for our needs, and pdf.js still cannot overcome the issue with the browser’s File->Save function, which still tries to save the document as an HTML page. Otherwise the pros and cons are the same as for the embedded PDF solution.

Solution 3: Analytics Measurement Protocol

The Google Analytics Measurement Protocol allows us to send data to Google Analytics server-side, without needing JavaScript. Using it, we were able to send all the data to Google Analytics with a small amount of server-side code before returning the PDF file to the user. This meant that the PDF could be viewed natively by any device that accessed it, and we could still track everything we needed.
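
To give a feel for how small that server-side code can be, here is a minimal sketch in Python. It assumes the Universal Analytics version of the Measurement Protocol (a form-encoded POST to the `/collect` endpoint with `v`, `tid`, `cid`, `t`, `dh`, `dp` and `dt` parameters). The function name `build_pageview_request`, the placeholder tracking ID and the random fallback client ID are our own illustrative choices, not the code we ran for the client.

```python
import uuid
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

# Universal Analytics collection endpoint (assumed; other versions differ).
GA_COLLECT_URL = "https://www.google-analytics.com/collect"


def build_pageview_request(
    tracking_id: str,
    host: str,
    path: str,
    title: str,
    client_id: Optional[str] = None,
) -> Request:
    """Build (but do not send) a pageview hit for a requested document."""
    payload = {
        "v": "1",                 # protocol version
        "tid": tracking_id,       # property ID, e.g. "UA-XXXXX-Y"
        # Without a known client ID we can only invent an anonymous one.
        "cid": client_id or str(uuid.uuid4()),
        "t": "pageview",          # hit type
        "dh": host,               # document host name
        "dp": path,               # document path
        "dt": title,              # document title
    }
    body = urlencode(payload).encode("utf-8")
    return Request(GA_COLLECT_URL, data=body, method="POST")
```

A request handler would pass the result to `urllib.request.urlopen` (ideally with a short timeout, so a slow response cannot delay the download) and then return the PDF as normal.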

While this was great and seemed to solve all the issues, there was a problem. Google Analytics was not the only piece of JavaScript tracking we needed to take care of; the site also carries other tracking scripts which are just as important to the client, and unfortunately these do not support a server-side solution. This ruled out the Analytics Measurement Protocol.

Pros

  • No ‘fake page’ HTML trickery to render the PDF. It can be viewed natively by any device.
  • All Google Analytics data can be sent this way

Cons

  • Cannot add other JavaScript tracking snippets
  • New visitors who open the document first and then visit a page on the website are hard to tie together as a single visitor, as the tracking ID is not retained between the two
  • Due to byte serving, multiple requests can be made for a single document, resulting in the same document being tracked several times. Byte serving would need to be turned off, or the duplicate requests compensated for.
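
One way to compensate for byte serving, sketched below in Python, is to count a request only when it asks for the whole file or for a range that starts at the first byte. This is a heuristic rather than something we shipped: the function name `should_track_request` is hypothetical, and a viewer that asks for the start of the file more than once would still be counted more than once.

```python
from typing import Optional


def should_track_request(range_header: Optional[str]) -> bool:
    """Return True if this request should count as a view of the document."""
    if not range_header:
        return True  # plain request for the whole file
    unit, _, ranges = range_header.partition("=")
    if unit.strip().lower() != "bytes":
        return True  # unknown range unit; treat it as a normal request
    # Only look at the first range in a multi-range request.
    first_range = ranges.split(",")[0].strip()
    start, _, _ = first_range.partition("-")
    # "0-1023" and "0-" start at the first byte; "-500" (suffix) does not.
    return start == "0"
```

The argument is the raw value of the HTTP `Range` request header, or `None` when the header is absent.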

Solution 4: interstitial page

Finally, we created an interstitial page, which would always appear before accessing the PDF file. You may have seen this on other sites, when requesting a download for example, where a message perhaps reading something like “If your file does not load immediately, please click here” is displayed, along with a link to the file. This is commonly used for advertising space or, as in our case, to take care of some JavaScript business!

The way this would work is that you request the PDF file you want, and are given a page with some text similar to the example above, which itself redirects you to the PDF once all the JavaScript tracking has been taken care of.

An interesting problem with this solution is that if the URL of the interstitial page differs from the URL of the final page you land on (the PDF), then anyone who shares that final link, or visits it again, bypasses the interstitial page and is not tracked.

So, we must make it impossible to bypass, and to do this we simply redirect from the interstitial page to…itself!

What we can do server-side is check the HTTP Referer header, which will tell us whether this is a fresh request which we need to track, or a request which came from the interstitial page itself.

An example request would work like this:

1. A user makes a request for the PDF

2. The server checks the Referer, and if the Referer does not indicate that the request came from the interstitial page, then we give the user the interstitial page

3. The interstitial page loads, tracks all that it needs, then reloads itself

4. The server checks the Referer again, and sees that the request has come from the interstitial page (and has therefore been tracked), meaning we can return the PDF document

5. The user receives the document they requested, and the document is handled natively by the browser

One could also use a custom header to achieve the same result as the above.
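
The server-side decision in steps 2 and 4 can be sketched as a single function. This is a minimal Python illustration under the assumption that the interstitial page and the PDF share one URL, so a request that came from the interstitial carries a Referer matching the URL being requested. The name `choose_response` and the string return values are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def choose_response(request_url: str, referer: Optional[str]) -> str:
    """Return "pdf" if the request came from the interstitial, else "interstitial"."""
    if not referer:
        # No Referer at all: a fresh request from print, email or a bookmark.
        return "interstitial"
    requested = urlsplit(request_url)
    came_from = urlsplit(referer)
    # Compare scheme, host and path; the query string is deliberately ignored.
    same_page = (
        requested.scheme == came_from.scheme
        and requested.netloc.lower() == came_from.netloc.lower()
        and requested.path == came_from.path
    )
    return "pdf" if same_page else "interstitial"
```

A request handler would call this with the full request URL and the value of the Referer header, then either render the interstitial page or stream the PDF. Because the same URL yields two different responses, any cache in front of the server would presumably need to be told as much, for example with a `Vary: Referer` response header.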

Pros

  • We can add all the JavaScript code we want
  • PDFs are handled natively by whatever device requests them
  • If a user shares the link while viewing the PDF, or visits it again directly, those visits are tracked too

Cons

  • The user has to view the interstitial page for a fraction of a second while they are being tracked before getting the document they want; however, you can add whatever you like to this page to help improve the user experience
  • Pages cannot be cached in a simple manner, as the same URL returns a different output depending on how it is requested

It has been a long journey working out the best way to handle this, and if you have the same problem as we did, you will need to choose whichever solution suits you best – each comes with its own pros and cons! I cannot say whether there is a ‘perfect’ method. Maybe there is, and I’d love to hear any advice on improving this further!

At Box UK we have a strong team of bespoke software consultants with more than two decades of bespoke software development experience. If you’re interested in finding out more about how we can help you, contact us on +44 (0)20 7439 1900 or email info@boxuk.com.