12 research outputs found
TrackSign-labeled web tracking dataset
Recent studies show that more than 95% of the websites available on the Internet contain at least one so-called web tracking system. These systems specialize in identifying their users by means of a plethora of different methods. Some of them (e.g., cookies) are well known to most Internet users. However, the percentage of websites including more "obscure" and privacy-threatening systems, such as fingerprinting methods that identify a user's computer, is constantly increasing. Detecting those methods on today's Internet is very difficult, as almost any website modifies its content dynamically and minifies its code to speed up loading times. This minification and dynamicity render the website code unreadable by humans. Thus, the research community is constantly looking for new ways to discover unknown web tracking systems running under the hood.
In this paper, we present a new dataset containing tracking information for more than 76 million URLs and 45 million online resources, extracted from 1.5 million popular websites. The tracking labeling process was done using TrackSign, a state-of-the-art web tracking discovery algorithm. The dataset also contains information about online security and the relations between the domains, the loaded URLs, and the online resource behind each URL. This information can be useful for different kinds of experiments, such as locating privacy-threatening resources, identifying security threats, or determining characteristics of the URL network graph. This publication is part of the Spanish I+D+i project TRAINER-A (ref. PID2020-118011GB-C21), funded by MCIN/AEI/10.13039/501100011033. This work is also supported by the Catalan Institution for Research and Advanced Studies (ICREA Academia).
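As an illustration only (not part of the dataset documentation), here is a minimal sketch of how such a labeled URL collection could be explored, assuming a hypothetical CSV export with columns url, domain and is_tracking; the real schema of the published dataset may differ:

```python
import pandas as pd

# Hypothetical file name and column names; the real TrackSign-labeled
# dataset may use a different layout.
df = pd.read_csv("tracksign_labeled_urls.csv",
                 usecols=["url", "domain", "is_tracking"])

# Share of URLs labeled as tracking, overall and per domain.
overall = df["is_tracking"].mean()
per_domain = (df.groupby("domain")["is_tracking"]
                .mean()
                .sort_values(ascending=False))

print(f"Overall tracking ratio: {overall:.2%}")
print(per_domain.head(10))
```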
ASTrack: Automatic detection and removal of web tracking code with minimal functionality loss
Recent advances in web technologies make it more difficult than ever to detect and block web tracking systems. In this work, we propose ASTrack, a novel approach to web tracking detection and removal. ASTrack uses an abstraction of the code structure based on Abstract Syntax Trees to selectively identify web tracking functionality shared across multiple web services. This new methodology allows us to: (i) effectively detect web tracking code even when evasion techniques (e.g., obfuscation, minification, or web packaging) are used; and (ii) safely remove those portions of code related to tracking purposes without affecting the legitimate functionality of the website. Our evaluation with the top 10k most popular Internet domains shows that ASTrack can detect web tracking with high precision (98%), while discovering about 50k tracking code pieces and more than 3,400 new tracking URLs not previously recognized by most popular privacy-preserving tools (e.g., uBlock Origin). Moreover, ASTrack achieved a 36% reduction in functionality loss in comparison with filter lists, one of the safest options available. Using a novel methodology that combines computer vision and manual inspection, we estimate that full functionality is preserved in more than 97% of the websites. This publication is part of the Spanish I+D+i project TRAINER-A (ref. PID2020-118011GB-C21), funded by MCIN/AEI/10.13039/501100011033. This work is also partially supported by the NII internship program.
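To illustrate the idea of structure-based identification, the sketch below hashes only AST node types, so renaming identifiers does not change the fingerprint. It uses Python's built-in ast module as an analogy; ASTrack itself operates on JavaScript code, and this is not its actual implementation:

```python
import ast
import hashlib

def structural_fingerprint(source: str) -> str:
    """Hash only the AST node types, ignoring identifiers and literals,
    so code renaming yields the same fingerprint."""
    tree = ast.parse(source)
    shape = [type(node).__name__ for node in ast.walk(tree)]
    return hashlib.sha1("|".join(shape).encode()).hexdigest()

# Two versions of the same logic with renamed identifiers...
original = "def track(user_id):\n    return send(user_id, 'collector')"
obfuscated = "def a(b):\n    return c(b, 'collector')"

# ...map to the same structural fingerprint.
print(structural_fingerprint(original) == structural_fingerprint(obfuscated))
```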
Detecting and analyzing mouse tracking in the wild
Nowadays, most websites collect personal information about their users in order to identify them and personalize their services. Among the tools used to that end, fingerprinting is one of the most advanced and precise methods, given the huge number of features it can collect and combine to build a robust identifier of the user. Although many fingerprinting techniques have recently been studied in the literature, the use and prevalence of mouse tracking, a method that collects information about the computer pointer, is still unexplored in detail. In this work, we propose a new methodology to detect this tracking method and measure its actual usage on the top 80,000 most popular websites. Our results show that about 1.2% of the analyzed websites use some sort of mouse tracking, including some popular websites within the top-1k ranking. This publication is part of the Spanish I+D+i project TRAINER-A (ref. PID2020-118011GB-C21), funded by MCIN/AEI/10.13039/501100011033. This work is also supported by the Catalan Institution for Research and Advanced Studies (ICREA Academia).
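As a rough illustration of the kind of signal involved (not the paper's methodology), the sketch below flags scripts that both register mouse-movement listeners and contain network-sending calls; the example URL is hypothetical:

```python
import re
import requests

# Simplified heuristic: a script that listens to mouse movement *and*
# sends data somewhere is a candidate mouse-tracking script.
MOUSE_EVENTS = re.compile(r"addEventListener\(\s*['\"](mousemove|mouseover|pointermove)['\"]")
EXFILTRATION = re.compile(r"(sendBeacon|XMLHttpRequest|fetch\()")

def looks_like_mouse_tracking(script_url: str) -> bool:
    code = requests.get(script_url, timeout=10).text
    return bool(MOUSE_EVENTS.search(code)) and bool(EXFILTRATION.search(code))

# Hypothetical example:
# print(looks_like_mouse_tracking("https://example.com/analytics.js"))
```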
Amazon Alexa traffic traces
The number of devices that make up the Internet of Things (IoT) has been increasing every year, including smart speakers such as Amazon Echo devices. These devices have become very popular around the world, with the number of smart speaker users estimated at about 83 million in 2020. However, there has also been great concern about how they can affect the privacy and security of their users [1]. Responding to voice commands requires the devices to continuously listen for the corresponding wake word, with the privacy implications that this entails. Additionally, the interactions that users have with the virtual assistant can reveal private information about them. In this document we publicly share two datasets that can help conduct privacy and security studies of the Amazon Echo Dot smart speaker. The included data contains 300,000 raw PCAP traces covering all the communications between the device and Amazon servers for 100 different voice commands in two different languages. The data can be used to train machine learning algorithms to find patterns that characterize both the voice commands and the people using the device, as well as Alexa as the device generating the traffic.
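A minimal sketch of how one of the raw PCAP traces could be turned into simple traffic features (packet size, direction, inter-arrival time) using scapy, assuming the Echo device's local IP address is known; the file name and address below are placeholders:

```python
from scapy.all import rdpcap, IP

DEVICE_IP = "192.168.1.50"  # assumed local address of the Echo Dot

def packet_features(pcap_path: str):
    """Return (inter-arrival time, size, direction) tuples for one trace."""
    packets = [p for p in rdpcap(pcap_path) if p.haslayer(IP)]
    features = []
    prev_time = None
    for p in packets:
        direction = 1 if p[IP].src == DEVICE_IP else -1  # outgoing vs incoming
        delta = 0.0 if prev_time is None else float(p.time) - prev_time
        features.append((delta, len(p), direction))
        prev_time = float(p.time)
    return features

# features = packet_features("alexa_command_0001.pcap")
```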
ePrivo.eu: An online service for automatic web tracking discovery
Given the pervasiveness of web tracking practices on the Internet, many countries are developing and enforcing new privacy regulations to ensure the rights of their citizens. However, discovering websites that do not comply with those regulations is becoming very challenging, given the dynamic nature of the web and the use of obfuscation techniques. This work presents ePrivo, a new online service that can help Internet users, website owners, and regulators inspect how privacy-friendly a given website is. The system explores all the content of the website, including traffic from third parties and dynamically modified content. The ePrivo service combines different state-of-the-art tracking detection and classification methods, including TrackSign, to discover both previously known and zero-day tracking methods. After 6 months of service, ePrivo detected the largest browsing-history trackers and more than 40k domains setting cookies with a lifespan longer than one year, which is forbidden in some countries. This work was supported in part by the Spanish I+D+i Project TRAINER-A, funded by MCIN/AEI/10.13039/501100011033, under Grant PID2020-118011GB-C21; and in part by the Catalan Institution for Research and Advanced Studies (ICREA Academia).
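As a simplified illustration of the cookie-lifespan check mentioned above (not the ePrivo implementation), the sketch below lists cookies from a site's initial response whose expiration lies more than one year in the future:

```python
import time
import requests

ONE_YEAR = 365 * 24 * 3600

def long_lived_cookies(url: str):
    """Return names of cookies expiring more than one year from now.
    Simplified check: only cookies set on the initial HTTP response."""
    resp = requests.get(url, timeout=15)
    now = time.time()
    return [c.name for c in resp.cookies
            if c.expires is not None and c.expires - now > ONE_YEAR]

# print(long_lived_cookies("https://example.com"))
```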
Demystifying content-blockers: Measuring their impact on performance and quality of experience
With the evolution of the online advertisement and tracking ecosystem, content-blockers have become the reference tool for improving security, privacy and browsing experience when surfing the Internet. It is also commonly believed that using content-blockers to stop unsolicited content decreases the time needed for loading websites. In this work, we perform a large-scale study of the actual improvements of using content-blockers in terms of performance and quality of experience. To measure them, we analyze the page size and loading times of the 100K most popular websites, as well as the most relevant QoE metrics, such as the Speed Index, Time to Interactive or the Cumulative Layout Shift, for the subset of the top 10K of them. Our experiments show that using content-blockers results in small improvements in terms of performance. However, contrary to popular belief, this has a negligible impact in terms of loading time and quality of experience. Moreover, in the case of small and lightweight websites, the overhead introduced by content-blockers can even result in decreased performance. Finally, we evaluate the improvement in terms of QoE based on the Mean Opinion Score (MOS) and find that two of the three studied content-blockers present an overall decrease of between 3% and 5% instead of the expected improvement. This publication is part of the Spanish I+D+i project TRAINER-A (ref. PID2020-118011GB-C21), funded by MCIN/AEI/10.13039/501100011033.
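For illustration, a minimal sketch of measuring page loading time in an automated browser through the Navigation Timing API; this is not the paper's measurement pipeline, and QoE metrics such as the Speed Index require dedicated tooling:

```python
from selenium import webdriver

def page_load_time_ms(url: str) -> int:
    """Load a page in a headless browser and read Navigation Timing."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        timing = driver.execute_script(
            "const t = window.performance.timing;"
            "return t.loadEventEnd - t.navigationStart;")
        return int(timing)
    finally:
        driver.quit()

# Running this with and without a content-blocker extension loaded gives a
# rough view of the loading-time differences discussed above.
# print(page_load_time_ms("https://example.com"))
```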
A novel approach to web tracking detection and removal with minimal functionality loss
Thesis in the form of a compendium of publications. In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of Universitat Politècnica de Catalunya's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink. Thesis with International Doctorate mention.
Web tracking technologies are extensively used to collect huge amounts of personal information from our online activity, including the things we search for, the sites we visit, the people we contact, or the products we buy. Although it is commonly believed that such big data sets are mainly used for targeted advertising, some recent works have revealed that they are actually exploited for many other purposes, including price discrimination, financial assessment, determination of insurance coverage, background scanning, and even identity theft. Contrary to popular belief, such information is not only collected by big Internet players, such as Google and Facebook, but also by more shady and unknown companies called data brokers. Data brokers are pervasive on the current Internet, and their only purpose is to silently collect and aggregate large amounts of personal information. This information is then used to build individual profiles (often of low quality) about us, which are sold to the highest bidder without our explicit knowledge or the option to revise their correctness. The main objective of this thesis is to research new countermeasures that can decrease or completely block web tracking systems running in the background. For this purpose, the thesis has three objectives: (i) develop a new measurement system that can collect information in a big-data environment such as the Internet; (ii) research new methodologies to automatically detect unknown web tracking technologies while minimizing website functionality loss; and (iii) apply the research results obtained in order to create actual tools that can be useful for the different actors concerned about privacy. This manuscript presents a compendium of publications that address all the objectives presented.
In order to handle the first objective, we developed a new framework called Online Resource Mapper, which is able to collect complete online data sets, including millions of websites with all their internal URLs and online resources. In addition, we also study the impact on performance and quality of experience of content blockers, the most popular current privacy-protection technology. The second objective is tackled by means of three different publications presenting new ways to discover unknown web tracking systems. The first work presents an alternative to content blockers that potentially fixes their biggest vulnerability: the lack of adaptation to detect new web tracking URLs not present in the pattern lists. Our solution uses a deep neural network to discover, with 97% accuracy, patterns that can be used to detect new tracking URLs. Going one step further, instead of looking at the URL, we decided to inspect the actual code of the website to discover not only tracking based on URL similarity, but completely new web tracking methods. Our first proposal, called TrackSign, uses the combination of a heuristic code partition model with a novel three-layer network graph in order to discover new web tracking systems. Our method achieves 92% detection accuracy, and it is one of the first approaches to do so in an automatic and generic fashion. Our last publication on the topic presents an evolution of TrackSign, called ASTrack, that addresses its main vulnerability: the false negatives obtained when websites obfuscate their internal resources with code renaming techniques. ASTrack uses the structure of the code instead of the code itself to identify web tracking systems shared by multiple websites. Moreover, ASTrack can exclusively remove the web tracking code while keeping the rest of it intact, which minimizes the functionality loss caused by blocking complete resources. Lastly, in order to address the third objective, we publicly shared the results obtained during our experiments, first in the form of a web tracking data set including about 75 million URLs and 45 million labeled online resources, and secondly in a new ePrivacy observatory called ePrivo.
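As an illustration of the URL-based detection idea described above (not the published deep neural network), here is a minimal sketch that classifies URLs from character n-grams using a toy training set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled URLs (1 = tracking, 0 = benign); real training data would come
# from filter lists and the labeled datasets described above.
urls = [
    "https://tracker.example/pixel.gif?uid=123",
    "https://ads.example/collect?event=view",
    "https://cdn.example/css/site.css",
    "https://www.example.org/articles/today.html",
]
labels = [1, 1, 0, 0]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 5)),  # character n-grams of the URL
    LogisticRegression(max_iter=1000),
)
model.fit(urls, labels)

print(model.predict(["https://metrics.example/beacon?uid=42"]))
```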
TrackSign: guided Web tracking discovery
Current web tracking practices pose a constant threat to the privacy of Internet users. As a result, the research community has recently proposed different tools to combat well-known tracking methods. However, the early detection of new, previously unseen tracking systems is still an open research problem. In this paper, we present TrackSign, a novel approach to discover new web tracking methods. The main idea behind TrackSign is the use of code fingerprinting to identify common pieces of code shared across multiple domains. To detect tracking fingerprints, TrackSign builds a novel 3-mode network graph that captures the relationship between fingerprints, resources and domains. We evaluated TrackSign with the top-100K most popular Internet domains, covering almost 1M web resources from more than 5M HTTP requests. Our results show that our method can detect new web tracking resources with high precision (over 92%). TrackSign was able to detect 30K new trackers, more than 10K new tracking resources and 270K new tracking URLs not yet detected by the most popular blacklists. Finally, we also validate the effectiveness of TrackSign with more than 20 years of historical data from the Internet Archive. This work was supported by the Spanish MINECO under contract TEC2017-90034-C2-1-R (ALLIANCE).
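A toy sketch of the 3-mode graph idea, showing how a known tracking fingerprint relates to the resources that contain it and the domains that load them; this simplification is not the published TrackSign algorithm:

```python
import networkx as nx

G = nx.Graph()

# Three node types: code fingerprints, web resources, and domains (toy example).
G.add_node("fp:abc123", kind="fingerprint", tracking=True)  # known tracking fingerprint
G.add_node("res:https://cdn.example/t.js", kind="resource")
G.add_node("dom:news.example", kind="domain")
G.add_node("dom:shop.example", kind="domain")

# Edges: which resources contain which fingerprints, which domains load which resources.
G.add_edge("fp:abc123", "res:https://cdn.example/t.js")
G.add_edge("res:https://cdn.example/t.js", "dom:news.example")
G.add_edge("res:https://cdn.example/t.js", "dom:shop.example")

# Propagate the tracking label from fingerprints to the resources and domains sharing them.
tracking_resources = {
    r for fp, data in G.nodes(data=True) if data.get("tracking")
    for r in G.neighbors(fp) if G.nodes[r]["kind"] == "resource"
}
affected_domains = {
    d for r in tracking_resources
    for d in G.neighbors(r) if G.nodes[d]["kind"] == "domain"
}
print(tracking_resources, affected_domains)
```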
Demystifying content-blockers: A large-scale study of actual performance gains
With the evolution of the online advertisement and tracking ecosystem, content-filtering has become the reference tool for improving security, privacy and browsing experience when surfing the Internet. It is also commonly believed that using content-blockers to stop unsolicited content decreases the time needed for loading websites. In this work, we perform a large-scale study with the 100K most popular websites on the actual performance improvements of using content-blockers. We focus our study on two relevant metrics for measuring browsing performance: page size and loading time. Our results show that using such tools results in small improvements in terms of page size but, contrary to popular belief, has a negligible impact in terms of loading time. We also find that, in the case of small and lightweight websites, the use of content-blockers can even result in increased loading times. This work was supported by the Spanish MINECO under contract TEC2017-90034-C2-1-R (ALLIANCE).
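For illustration, a minimal sketch of measuring the bytes transferred by a page through the Resource Timing API in an automated browser; this is a simplification, not the paper's toolchain:

```python
from selenium import webdriver

def transferred_bytes(url: str) -> int:
    """Sum transferSize over all resources reported by the Resource Timing API."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return int(driver.execute_script(
            "return performance.getEntriesByType('resource')"
            ".reduce((sum, e) => sum + (e.transferSize || 0), 0);"))
    finally:
        driver.quit()

# Comparing this value with and without a content-blocker extension loaded
# gives a rough picture of the page-size savings discussed above.
# print(transferred_bytes("https://example.com"))
```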
Network measurements for web tracking analysis and detection: A tutorial
Digital society has developed to a point where it is nearly impossible for a user to know what is happening in the background when using the Internet. To understand it, it is necessary to perform network measurements not only at the network layer (e.g., IP, ICMP), but also at the application layer (e.g., HTTP). For example, opening a single website can trigger a cascade of requests to different servers and services to obtain the resources embedded inside it. This process is becoming so complex that, to explore only one website, the number of communications can easily explode from tens to hundreds depending on the website. Inside those communications, there is an ever-increasing portion dedicated to web tracking, a controversial practice from the security and privacy perspective [1]. This work was supported by the Spanish MINECO under contract TEC2017-90034-C2-1-R (ALLIANCE).
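As a concrete example of an application-layer measurement, the sketch below counts third-party requests in an HTTP Archive (HAR) capture of a single page load, such as one exported from the browser's developer tools; the file and domain names are placeholders:

```python
import json
from urllib.parse import urlparse

def third_party_hosts(har_path: str, first_party: str) -> dict:
    """Count requests per third-party host in a HAR capture of one page load."""
    with open(har_path) as f:
        har = json.load(f)
    counts = {}
    for entry in har["log"]["entries"]:
        host = urlparse(entry["request"]["url"]).hostname or ""
        if not host.endswith(first_party):
            counts[host] = counts.get(host, 0) + 1
    return counts

# counts = third_party_hosts("example.com.har", "example.com")
# print(sorted(counts.items(), key=lambda kv: -kv[1])[:10])
```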