Analytics Poisoning: A Short Review
OVERVIEW
Platforms like Google Analytics are part of the fundamental infrastructure of the modern web. These tools help companies assess what elements of their sites are succeeding or failing with users, and are used to guide everything from narrow product decisions to broader business strategy. Among media companies, content leaderboards and engagement statistics inform editorial decision-making and the types of content that get created on a day-to-day basis.
The reliance on these tools means that “analytics poisoning” -- the malicious distortion of website analytics -- can be a powerful tool for manipulating businesses of all shapes and sizes. Malicious actors seeking to influence the decisions of a media outlet might artificially inflate traffic to certain types of content, seeking to encourage an emphasis on certain beats or raise the standing of specific reporters. Unscrupulous businesses might send waves of faked traffic against parts of a competitor’s app, seeking to distort the rival’s understanding of its users and encourage fruitless feature development.
At IPM, we are interested in building tools that confront the challenge of socio-technical security: areas where existing technical vulnerabilities can be exploited to shape and manipulate social processes. Analytics poisoning is a prototypical case of this. The inability of analytics platforms to distinguish genuine from counterfeit traffic enables an attacker to manipulate the signals that an organization uses to make business and content decisions.
This short report demonstrates the kinds of analyses that we’re figuring out how to automate and build at IPM. We’ll walk through the technical vulnerability, assess the economics of leveraging this vulnerability to manipulate an organization, and discuss potential mitigation strategies.
THE VULNERABILITY
Google Analytics (GA) is the most widely used web analytics tool by a large margin. Originally developed by a company named Urchin, which Google acquired in 2005, GA relies for its core functionality on the successful execution of a JavaScript function that is initiated on the completion of any page load. This tracker implementation has changed very little over the past 15 years.
We conducted a series of experiments to determine how challenging it would be for a third party to manipulate the metrics that Google Analytics shows to a user. To do so, we installed a tracker on our own demonstration website, and sent different types of inauthentic traffic against it.
Automating Traffic
The most straightforward way to send faked traffic against a target website is through Selenium, a popular browser automation suite. Using this software, we can tell a computer to visit certain pages or engage with certain elements on a site according to algorithmic scripts. Our tests make it clear that GA does not discriminate between authentic browsers and Selenium browsers mounted on virtual displays: it registers Selenium-operated visits and authentic visits identically.
It is worth noting that default Selenium browsers are easily separable from legitimate browsers. All browsers carry with them a multitude of settings, states, default capabilities, and version metadata that can be inspected by JavaScript to determine whether certain features offered by a website are appropriate for the current visitor. By interrogating these values, it is possible to identify with high likelihood that a browser is being operated by an automated system like Selenium, and that its traffic is therefore likely to be inauthentic.
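As a rough illustration of this kind of check, the sketch below scores a browser "fingerprint" that is assumed to have been collected by client-side JavaScript and sent to a server. The dictionary keys, the function names, and the threshold are hypothetical and chosen for this example; the `webdriver` key stands in for the standard `navigator.webdriver` property, and the other two signals are commonly cited heuristics rather than anything GA is known to use.

```python
from typing import Any, Dict, List


def automation_signals(fingerprint: Dict[str, Any]) -> List[str]:
    """Return the names of the heuristics that flag this browser as likely automated."""
    reasons: List[str] = []

    # navigator.webdriver is true in browsers controlled through WebDriver.
    if fingerprint.get("webdriver") is True:
        reasons.append("webdriver_flag")

    # Headless Chrome advertises itself in its default user agent string.
    if "HeadlessChrome" in fingerprint.get("user_agent", ""):
        reasons.append("headless_user_agent")

    # Assumed heuristic: an empty navigator.languages list is unusual
    # for a browser operated by a person.
    if not fingerprint.get("languages"):
        reasons.append("no_languages")

    return reasons


def is_likely_automated(fingerprint: Dict[str, Any], threshold: int = 1) -> bool:
    """Flag the visit when at least `threshold` heuristics fire (threshold is illustrative)."""
    return len(automation_signals(fingerprint)) >= threshold
```

A visit reporting `{"webdriver": True, "user_agent": "... HeadlessChrome/120.0", "languages": []}` would trip all three checks, while a typical desktop browser would trip none. Checks this simple only catch default configurations; tooling that overwrites these values would pass them.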
GA does not appear to provide this mitigation. However, even if it did, tools like Selenium Stealth can make it more challenging to detect automated traffic in this way. To get a sense of the costs of faking more believable traffic, we also tested Selenium Stealth against our target website. As expected, GA failed to identify this more sophisticated method of generating faked traffic.
Faking Locations
Browser automation is only one part of the problem for a prospective analytics poisoner. While we determined that Selenium-driven browsers can easily bypass any mitigation techniques GA may provide, this is not an easily scalable avenue of poisoning: an attack would be extremely obvious if all the faked traffic originated from the same location.
There are a few ways of fanning out browser requests across numerous IP addresses, which GA uses to determine where visitors are coming from. We experimented with obfuscating the source of traffic through cloud IPs, VPN clients, and proxy servers, all methods through which an attacker could make traffic appear to come from somewhere it does not.
GA also appears to have no mitigation for traffic flowing through these obfuscated channels. We found that requests sent through all three methods appeared in our Google Analytics dashboard. This potentially allows the attacker not only to artificially drive up engagement on select areas of a site, but also to choose the locations from which the traffic appears to come.
As with Selenium, there are a few mitigation methods that GA could implement to avoid being fooled by this inauthentic traffic. Platforms like Cloudflare use IP reputation as an effective means of weeding out traffic that is likely to be inauthentic. In the same way that many services flag requests from Tor exit nodes or from AWS IPs, Google Analytics could in principle remove a majority of bad-faith traffic through this approach.
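A minimal sketch of an IP reputation filter, using only Python's standard `ipaddress` module, might look like the following. The reputation list, its labels, and the shape of a "hit" record are assumptions made for illustration; the CIDR ranges are reserved documentation ranges used as placeholders, not real cloud or Tor addresses. A production system would load published provider ranges and exit-node lists instead.

```python
import ipaddress
from typing import Dict, List, Optional

# Hypothetical reputation list: label -> CIDR ranges.
# These are RFC 5737 documentation ranges used purely as placeholders.
FLAGGED_RANGES: Dict[str, List[str]] = {
    "cloud_provider": ["192.0.2.0/24"],
    "anonymizing_exit_node": ["198.51.100.0/24"],
}


def reputation_label(ip: str, flagged: Dict[str, List[str]] = FLAGGED_RANGES) -> Optional[str]:
    """Return the label of the first flagged range containing `ip`, or None if it is unflagged."""
    address = ipaddress.ip_address(ip)
    for label, cidrs in flagged.items():
        if any(address in ipaddress.ip_network(cidr) for cidr in cidrs):
            return label
    return None


def filter_hits(hits: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the hits whose source IP does not fall inside a flagged range."""
    return [hit for hit in hits if reputation_label(hit["ip"]) is None]
```

Returning a label rather than a boolean lets a downstream report separate, say, cloud-hosted traffic from anonymized traffic instead of silently discarding both.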
RESULTS
There is a limitless number of potential vulnerabilities lurking out there. The key question is whether a specific attack is practical for a malicious actor to use against a target. Our testing tool, the IPM Vulnerability Engine, allows us to conduct analytics poisoning attacks and accurately measure the cost and time required to generate faked page views through various methods. This gives us a sense of the constraints facing a bad actor seeking to pull off an analytics poisoning campaign.
We find that these attacks are cheap to accomplish and scale with little effort. The table below shows the time and cost required to generate a single page view against a target website. Our testing suggests that it would be possible to generate a million page views from believable-looking browsers (leveraging Selenium Stealth) and through a few spoofed locations (leveraging VPNs) for $200. On close inspection, end users might notice the small variety of IPs used for these hits. To avoid detection by those means, a proxy server approach would generate hits effectively indistinguishable from legitimate traffic at a cost of $17,000.
Each cell shows cost per page view / time per page view / lines of code (LOC).
Traffic Spoofing | Visit Automation: Selenium | Visit Automation: Selenium Stealth |
---|---|---|
Unobfuscated | $0.00004 / 25 seconds / 30 LOC | $0.00004 / 25 seconds / 40 LOC |
VPN | $0.0002 / 35 seconds / 45 LOC | $0.0002 / 35 seconds / 55 LOC |
Proxy Servers | $0.017 / 60 seconds / 45 LOC | $0.017 / 60 seconds / 55 LOC |
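The campaign totals quoted above follow directly from the per-view costs in the table. The short sketch below reproduces that arithmetic; the dictionary keys and function name are our own labels, and because the table lists the same per-view cost for Selenium and Selenium Stealth, costs are keyed by spoofing method only. `Decimal` is used so that the small per-view figures multiply out exactly.

```python
from decimal import Decimal

# Per-view costs in US dollars, taken from the table above.
COST_PER_VIEW = {
    "unobfuscated": Decimal("0.00004"),
    "vpn": Decimal("0.0002"),
    "proxy": Decimal("0.017"),
}


def campaign_cost(method: str, page_views: int) -> Decimal:
    """Total cost in dollars of generating `page_views` hits with the given spoofing method."""
    return COST_PER_VIEW[method] * page_views


if __name__ == "__main__":
    for method in COST_PER_VIEW:
        print(f"{method}: ${campaign_cost(method, 1_000_000):,.2f}")
```

For one million page views this gives $40 unobfuscated, $200 through VPNs, and $17,000 through proxy servers.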
While these weaknesses in GA measurement have been known for some time in the web analytics community (see example and example), we believe that they should be considered in a broader socio-technical context. The widespread use of GA throughout the web -- and the extremely low cost of pulling off analytics poisoning attacks -- suggest that analytics poisoning is a practical vector for bad actors to manipulate organizational decision-making. Our analysis also points to the significant role that changes to GA could play in mitigating this vulnerability: even rudimentary changes to distinguish between real and fake traffic on the platform could increase the cost of these attacks by orders of magnitude.