30 research outputs found
COOKIEGRAPH: Measuring and Countering First-Party Tracking Cookies
Recent privacy protections by browser vendors aim to limit the abuse of
third-party cookies for cross-site tracking. While these countermeasures
against third-party cookies are widely welcomed, there are concerns that they
will result in advertisers and trackers abusing first-party cookies instead. We
provide the first empirical evidence of how first-party cookies are abused by
advertisers and trackers by conducting a differential measurement study on 10K
websites with third-party cookies allowed and blocked. We find that advertisers
and trackers implement cross-site tracking despite third-party cookie blocking
by storing identifiers, based on probabilistic and deterministic attributes, in
first-party cookies. As opposed to third-party cookies, outright first-party
cookie blocking is not practical because it would result in major breakage of
legitimate website functionality.
We propose CookieGraph, a machine learning approach that can accurately and
robustly detect first-party tracking cookies. CookieGraph detects first-party
tracking cookies with 91.06% accuracy, outperforming the state-of-the-art
CookieBlock approach by 10.28%. We show that CookieGraph is fully robust
against cookie name manipulation while CookieBlock's accuracy drops by 15.68%.
We also show that CookieGraph does not cause any major breakage while
CookieBlock causes major breakage on 8% of the websites with SSO logins. Our
deployment of CookieGraph shows that first-party tracking cookies are used on
93.43% of the 10K websites. We also find that the most prevalent first-party
tracking cookies are set by major advertising entities such as Google as well
as many specialized entities such as Criteo.
Actions speak louder than words: Semi-supervised learning for browser fingerprinting detection
As online tracking continues to grow, existing anti-tracking and
fingerprinting detection techniques that require significant manual input must
be augmented. Heuristic approaches to fingerprinting detection are precise but
must be carefully curated. Supervised machine learning techniques proposed for
detecting tracking require manually generated label-sets. Seeking to overcome
these challenges, we present a semi-supervised machine learning approach for
detecting fingerprinting scripts. Our approach is based on the core insight
that fingerprinting scripts have similar patterns of API access when generating
their fingerprints, even though their access patterns may not match exactly.
Using this insight, we group scripts by their JavaScript (JS) execution traces
and apply a semi-supervised approach to detect new fingerprinting scripts. We
detail our methodology and demonstrate its ability to identify the majority of
scripts (94.9%) identified by existing heuristic techniques. We also
show that the approach expands beyond detecting known scripts by surfacing
candidate scripts that are likely to include fingerprinting. Through an
analysis of these candidate scripts we discovered fingerprinting scripts that
were missed by heuristics and for which there are no heuristics. In particular,
we identified over one hundred device-class fingerprinting scripts present on
hundreds of domains. To the best of our knowledge, this is the first time
device-class fingerprinting has been measured in the wild. These successes
illustrate the power of a sparse vector representation and semi-supervised
learning to complement and extend existing tracking detection techniques.
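To illustrate the sparse-vector idea in the abstract above, here is a minimal sketch that is not the authors' implementation. It represents each script's execution trace as a sparse count vector of accessed APIs and surfaces unlabeled scripts whose vector is close to a known fingerprinting script. The script names, the traces, the cosine-similarity measure, and the 0.7 threshold are all assumptions made for illustration; the paper groups scripts by their traces, whereas this sketch uses a simpler nearest-seed comparison.

```python
from collections import Counter
from math import sqrt


def trace_to_vector(trace):
    """Turn a list of accessed API names into a sparse count vector."""
    return Counter(trace)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[name] for name, count in u.items() if name in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def surface_candidates(labeled, unlabeled, threshold=0.7):
    """Return unlabeled script names whose trace resembles a labeled one.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    seeds = [trace_to_vector(trace) for trace in labeled.values()]
    candidates = []
    for name, trace in unlabeled.items():
        vec = trace_to_vector(trace)
        if any(cosine(vec, seed) >= threshold for seed in seeds):
            candidates.append(name)
    return candidates


# Hypothetical traces: each list holds the APIs a script touched.
known_fingerprinters = {
    "fp_a.js": [
        "HTMLCanvasElement.toDataURL",
        "CanvasRenderingContext2D.fillText",
        "Navigator.userAgent",
        "Screen.width",
    ],
}
unknown_scripts = {
    "widget.js": ["Document.createElement", "Node.appendChild"],
    "fp_b.js": [
        "HTMLCanvasElement.toDataURL",
        "CanvasRenderingContext2D.fillText",
        "Navigator.userAgent",
        "Navigator.plugins",
    ],
}

candidates = surface_candidates(known_fingerprinters, unknown_scripts)
print(candidates)
```

In this toy input, `fp_b.js` shares three of its four API accesses with the labeled script, so it is surfaced even though the access patterns do not match exactly, while `widget.js` is not.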
Automated discovery of privacy violations on the web
Online tracking is increasingly invasive and ubiquitous. Tracking protection provided by browsers is often ineffective, while solutions based on voluntary cooperation, such as Do Not Track, haven't had meaningful adoption. Knowledgeable users may turn to anti-tracking tools, but even these more advanced solutions fail to fully protect against the techniques we study.
In this dissertation, we introduce OpenWPM, a platform we developed for flexible and modular web measurement. We've used OpenWPM to run large-scale studies leading to the discovery of numerous privacy violations across the web and in emails. These discoveries have curtailed the adoption of tracking techniques, and have informed policy debates and browser privacy decisions.
In particular, we present novel detection methods and results for persistent tracking techniques, including: device fingerprinting, cookie syncing, and cookie respawning. Our findings include sophisticated fingerprinting techniques never before measured in the wild. We've found that nearly every new API is misused by trackers for fingerprinting. The misuse is often invisible to users and publishers alike, and in many cases was not anticipated by API designers. We take a critical look at how the API design process can be changed to prevent such misuse in the future.
We also explore the industry of trackers which use PII-derived identifiers to track users across devices, and even into the offline world. To measure these techniques, we develop a novel bait technique, which allows us to spoof the presence of PII on a large number of sites. We show how trackers exfiltrate the spoofed PII through the abuse of browser features. We find that PII collection is not limited to the web--the act of viewing an email also leaks PII to trackers. Overall, about 30% of emails leak the recipient's email address to one or more third parties.
Finally, we study the ability of a passive eavesdropper to leverage tracking cookies for mass surveillance. If two web pages embed the same tracker, then the adversary can link visits to those pages from the same user even if the user's IP address varies. We find that the adversary can reconstruct 62-73% of a typical user's browsing history.
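The linking step described in the last paragraph can be sketched as a connected-components problem. This is a minimal illustration under stated assumptions, not the dissertation's method: each observed visit is modeled as a hypothetical `(page_url, source_ip, tracker_ids)` tuple, and two visits are linked whenever they carry the same tracker cookie identifier, regardless of IP address. All URLs, addresses, and identifiers below are made up.

```python
from collections import defaultdict


def link_visits(visits):
    """Group page visits that share at least one tracker cookie ID.

    `visits` is a list of (page_url, source_ip, set_of_tracker_ids) tuples.
    Returns a sorted list of clusters, each a sorted list of page URLs.
    """
    parent = list(range(len(visits)))

    def find(i):
        # Follow parent pointers to the cluster root, halving paths as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    first_seen = {}  # tracker ID -> index of the first visit carrying it
    for idx, (_, _, tracker_ids) in enumerate(visits):
        for tracker_id in tracker_ids:
            if tracker_id in first_seen:
                union(idx, first_seen[tracker_id])
            else:
                first_seen[tracker_id] = idx

    clusters = defaultdict(list)
    for idx, (url, _, _) in enumerate(visits):
        clusters[find(idx)].append(url)
    return sorted(sorted(urls) for urls in clusters.values())


# Hypothetical traffic seen by the eavesdropper; the IP changes between visits.
observed = [
    ("news.example/a", "198.51.100.7", {"trackerX=abc"}),
    ("shop.example/b", "203.0.113.9", {"trackerX=abc", "trackerY=123"}),
    ("blog.example/c", "192.0.2.44", {"trackerY=123"}),
    ("other.example/d", "198.51.100.7", {"trackerX=zzz"}),
]

clusters = link_visits(observed)
print(clusters)
```

The first three visits end up in one cluster because the links are transitive: the second visit shares one identifier with the first and another with the third. The fourth visit stays separate despite reusing the first visit's IP address, since the source IP is never used for linking.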
Online Tracking: A 1-million-site Measurement and Analysis
We present the largest and most detailed measurement of online tracking conducted to date, based on a crawl of the top 1 million websites. We make 15 types of measurements on each site, including stateful (cookie-based) and stateless (fingerprinting-based) tracking, the effect of browser privacy tools, and the exchange of tracking data between different sites ("cookie syncing"). Our findings include multiple sophisticated fingerprinting techniques never before measured in the wild. This measurement is made possible by our open-source web privacy measurement tool, OpenWPM, which uses an automated version of a full-fledged consumer browser. It supports parallelism for speed and scale, automatic recovery from failures of the underlying browser, and comprehensive browser instrumentation. We demonstrate our platform's strength in enabling researchers to rapidly detect, quantify, and characterize emerging online tracking behaviors.
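As a rough illustration of what detecting "cookie syncing" can involve, the sketch below flags cases where an identifier stored in one domain's cookie shows up in a request URL sent to a different domain. It is a simplified heuristic under stated assumptions, not the paper's detection method and not OpenWPM's API: the function name, the input shapes, the minimum identifier length, and all domains and values are hypothetical.

```python
from urllib.parse import urlsplit


def find_sync_events(id_cookies, request_urls, min_len=8):
    """Flag identifier cookie values that appear in requests to other domains.

    `id_cookies` is a list of (owner_domain, cookie_value) pairs.
    `request_urls` is a list of request URLs observed during the page visit.
    Values shorter than `min_len` are skipped as unlikely identifiers
    (an illustrative cutoff, chosen here to avoid matching short strings).
    """
    events = []
    for owner, value in id_cookies:
        if len(value) < min_len:
            continue
        for url in request_urls:
            host = urlsplit(url).hostname or ""
            same_party = host == owner or host.endswith("." + owner)
            if value in url and not same_party:
                events.append((owner, host, value))
    return events


# Hypothetical cookies and requests recorded during one page visit.
cookies = [
    ("tracker-a.example", "u1234567890abcdef"),
    ("tracker-b.example", "en"),
]
requests = [
    "https://sync.tracker-b.example/match?partner_uid=u1234567890abcdef",
    "https://cdn.tracker-a.example/pixel?uid=u1234567890abcdef",
]

events = find_sync_events(cookies, requests)
print(events)
```

Only the first request is flagged: it carries the identifier owned by `tracker-a.example` to a host under `tracker-b.example`. The second request goes to a subdomain of the cookie's own owner, so it is not treated as a sync.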
Battery Status Not Included: Assessing Privacy in Web Standards
The standardization process is core to the development of the open web. Until 2013, the process rarely included privacy review and had no formal privacy requirements. But today the importance of privacy engineering has become apparent to standards bodies such as the W3C as well as to browser vendors. Standards groups now have guidelines for privacy assessments, and are including privacy reviews in many new specifications. However, the standards community does not yet have much practical experience in assessing privacy. In this paper we systematically analyze the W3C Battery Status API to help inform future privacy assessments. We begin by reviewing its evolution: the initial specification, which only cursorily addressed privacy; the discovery of surprising privacy vulnerabilities as well as actual misuse in the wild; and finally the removal of the API from major browser engines, an unprecedented move. Next, we analyze web measurement data from late 2016 and confirm that the majority of scripts used the API for fingerprinting. Finally, we draw lessons from this affair and make recommendations for improving privacy engineering of web standards.
No boundaries: data exfiltration by third parties embedded on web pages
We investigate data exfiltration by third-party scripts directly embedded on web pages. Specifically, we study three attacks: misuse of browsers’ internal login managers, social data exfiltration, and whole-DOM exfiltration. Although the possibility of these attacks was well known, we provide the first empirical evidence based on measurements of 300,000 distinct web pages from 50,000 sites. We extend OpenWPM’s instrumentation to detect and precisely attribute these attacks to specific third-party scripts. Our analysis reveals invasive practices such as inserting invisible login forms to trigger autofilling of saved user credentials, and reading and exfiltrating social network data when the user logs in via Facebook login. Further, we uncovered password, credit card, and health data leaks to third parties due to wholesale collection of the DOM. We discuss the lessons learned from the responses to the initial disclosure of our findings and the fixes that were deployed by websites, browser vendors, third-party libraries, and privacy protection tools.