808 research outputs found
Evaluation of Anonymized ONS Queries
Electronic Product Code (EPC) is the basis of a pervasive infrastructure for
the automatic identification of objects on supply chain applications (e.g.,
pharmaceutical or military applications). This infrastructure relies on the use
of the (1) Radio Frequency Identification (RFID) technology to tag objects in
motion and (2) distributed services providing information about objects via the
Internet. A lookup service, called the Object Name Service (ONS) and based on
the use of the Domain Name System (DNS), can be publicly accessed by EPC
applications looking for information associated with tagged objects. Privacy
issues may affect corporate infrastructures based on EPC technologies if their
lookup service is not properly protected. A possible solution to mitigate these
issues is the use of online anonymity. We present an experiment that evaluates
the use of Tor (The second generation Onion Router) on a global ONS/DNS setup
with respect to benefits, limitations, and latency.
Comment: 14 pages
Privacy in trajectory micro-data publishing: a survey
We survey the literature on the privacy of trajectory micro-data, i.e.,
spatiotemporal information about the mobility of individuals, whose collection
is becoming increasingly simple and frequent thanks to emerging information and
communication technologies. The focus of our review is on privacy-preserving
data publishing (PPDP), i.e., the publication of databases of trajectory
micro-data that preserve the privacy of the monitored individuals. We classify
and present the literature of attacks against trajectory micro-data, as well as
solutions proposed to date for protecting databases from such attacks. This
paper serves as an introductory reading on a critical subject in an era of
growing awareness about privacy risks connected to digital services, and
provides insights into open problems and future directions for research.
Comment: Accepted for publication at Transactions for Data Privacy
Vector representation of Internet domain names using Word embedding techniques
Word embeddings are a well-known set of techniques widely used in
natural language processing (NLP). This thesis explores the use of word
embeddings in a new scenario. A vector space model (VSM) for Internet
domain names (DNS) is created by taking core ideas from NLP techniques
and applying them to real anonymized DNS query logs from a large
Internet Service Provider (ISP). The main goal is to find semantically
similar domains using only information from DNS queries, without any other
knowledge about the content of those domains.
A set of transformations through a detailed preprocessing pipeline
with eight specific steps is defined to move the original problem to a
problem in the NLP field. Once the preprocessing pipeline is applied and
the DNS log files are transformed to a standard text corpus, we show that
state-of-the-art techniques for word embeddings can be successfully
applied to build what we call a DNS-VSM (a vector space model
for Internet domain names).
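The abstract does not list the eight preprocessing steps, so the following is only a minimal sketch of the general idea of turning DNS query logs into a text corpus. The record layout `(client_id, timestamp, domain)`, the function name `build_corpus`, and the 300-second gap threshold are all assumptions made for illustration, not details taken from the thesis.

```python
from collections import defaultdict


def build_corpus(records, max_gap_seconds=300):
    """Turn (client_id, timestamp, domain) records into 'sentences' of domains.

    Hypothetical simplification: the queries of one anonymized client are
    ordered by time and cut into a new sentence whenever the gap between two
    consecutive queries exceeds max_gap_seconds.
    """
    by_client = defaultdict(list)
    for client_id, timestamp, domain in records:
        # Normalize case and drop the trailing dot of fully qualified names.
        by_client[client_id].append((timestamp, domain.lower().rstrip(".")))

    corpus = []
    for client_id in sorted(by_client):
        sentence, last_ts = [], None
        for ts, domain in sorted(by_client[client_id]):
            if last_ts is not None and ts - last_ts > max_gap_seconds:
                corpus.append(sentence)
                sentence = []
            # Skip immediate repeats of the same domain within a sentence.
            if not sentence or sentence[-1] != domain:
                sentence.append(domain)
            last_ts = ts
        if sentence:
            corpus.append(sentence)
    return corpus
```

Each returned list plays the role of a sentence and each domain the role of a word, which is the form a standard word-embedding trainer expects as input.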
Different word embedding techniques are evaluated in this work:
Word2Vec (with Skip-Gram and CBOW architectures), App2Vec (with a
CBOW architecture and adding time gaps between DNS queries), and
FastText (which includes sub-word information).
The results are compared using various metrics from Information
Retrieval theory, and the quality of the learned vectors is validated against a
third-party source, namely the similar-sites service offered by Alexa Internet,
Inc.
Due to intrinsic characteristics of domain names, we found that FastText is
the best option for building a vector space model for DNS. Furthermore, its
performance (considering the top 3 most similar learned vectors to each
domain) is compared against two baseline methods: Random Guessing
(returning randomly any domain name from the dataset) and Zero Rule
(returning always the same most popular domains), outperforming both of
them considerably.
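To make the two baselines concrete, here is a minimal sketch of how they could be scored on the top 3 returned domains. The function names, the precision-style metric, and the shape of the `reference` mapping (domain to a set of known similar domains) are assumptions for illustration; the thesis may define its metrics differently.

```python
import random
from collections import Counter


def zero_rule(query_counts, k=3):
    """Baseline that always returns the same k most-queried domains."""
    top = [domain for domain, _ in Counter(query_counts).most_common(k)]
    return lambda domain: top


def random_guessing(domains, k=3, seed=0):
    """Baseline that returns k domains drawn at random from the dataset."""
    rng = random.Random(seed)
    pool = sorted(domains)
    return lambda domain: rng.sample(pool, k)


def precision_at_k(recommend, reference, k=3):
    """Mean fraction of the k returned domains found in the reference set."""
    scores = []
    for domain, similar in reference.items():
        returned = recommend(domain)[:k]
        hits = sum(1 for d in returned if d in similar)
        scores.append(hits / k)
    return sum(scores) / len(scores)
```

A learned model can be plugged in as any callable that maps a domain to a ranked list of candidates, so all three approaches are scored by the same function.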
The results presented in this work can be useful in many
engineering activities, with practical application in many areas. Some
examples include website recommendations based on similar sites,
competitive analysis, identification of fraudulent or risky sites,
parental-control systems, UX improvements (based on recommendations,
spell correction, etc.), click-stream analysis, representation and clustering
of users' navigation profiles, and optimization of cache systems in recursive
DNS resolvers, among others.
Finally, as a contribution to the research community a set of vectors
of the DNS-VSM trained on a similar dataset to the one used in this thesis
is released and made available for download through the GitHub page in
[1]. With this we hope that further work and research can be done using
these vectors.
FAIR sharing of health data: a systematic review of applicable solutions
Purpose: Data sharing is essential in health science research. This has also been acknowledged by governments and institutions, which have set up a number of regulations, laws, and initiatives to facilitate it. A large number of initiatives have tried to address data sharing issues. With the development of the FAIR principles, a set of detailed criteria for evaluating the relevance of such solutions is now available. This article intends to help researchers choose a suitable solution for sharing their health data in a FAIR way.
Methods: We conducted a systematic literature review of data sharing platforms adapted to health science research. We selected these platforms through a query on Scopus, PubMed, and Web of Science and filtered them based on specific exclusion criteria. We assessed their relevance by evaluating their implementation of the FAIR principles, ease of use by researchers, ease of implementation by institutions, and suitability for handling Individual Participant Data (IPD).
Results: We categorized the 35 identified solutions as being either online or on-premises software platforms. Interoperability was the main obstacle for the solutions regarding the fulfilment of the FAIR principles. Additionally, we identified which solutions address sharing of IPD and anonymization issues. Vivli and Dataverse were identified as the two most all-round solutions for sharing health science data in a FAIR way.
Conclusions: Although no solution is perfectly adapted to sharing all types of health data, there are work-arounds and interesting solutions to make health research data FAIR.
A systematic overview on methods to protect sensitive data provided for various analyses
In view of the various methodological developments regarding the protection of sensitive data, especially with respect to privacy-preserving computation and federated learning, a conceptual categorization and comparison of methods stemming from different fields is often desired. More concretely, it is important to provide guidance for practice, which lacks an overview of suitable approaches for specific scenarios, whether differential privacy for interactive queries, k-anonymity methods and synthetic data generation for data publishing, or secure federated analysis for multiparty computation without sharing the data itself. Here, we provide an overview based on central criteria describing a context for privacy-preserving data handling, which allows informed decisions in view of the many alternatives. Besides guiding practice, this categorization of concepts and methods is intended as a step towards a comprehensive ontology for anonymization. We emphasize throughout the paper that there is no panacea and that context matters.
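As a small illustration of one of the data-publishing methods named above, the sketch below checks the k-anonymity level of a table: the size of the smallest group of records that share the same quasi-identifier values. The function name, the dictionary-based row layout, and the example columns are assumptions for illustration, not taken from the paper.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing all quasi-identifier values.

    A table is k-anonymous for every k up to the returned value.
    Returns 0 for an empty table.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


rows = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
]
print(k_anonymity(rows, ["age", "zip"]))  # the single 40-49 record stands alone
print(k_anonymity(rows, ["zip"]))         # all three records share one zip prefix
```

Generalizing a quasi-identifier further, for example widening the age bands, is the usual way to raise the value at the cost of less precise data.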
Evaluation of Semantic Web Technologies for Storing Computable Definitions of Electronic Health Records Phenotyping Algorithms
Electronic Health Records are electronic data generated during or as a
byproduct of routine patient care. Structured, semi-structured and unstructured
EHR offer researchers unprecedented phenotypic breadth and depth and have the
potential to accelerate the development of precision medicine approaches at
scale. A main EHR use-case is defining phenotyping algorithms that identify
disease status, onset and severity. Phenotyping algorithms utilize diagnoses,
prescriptions, laboratory tests, symptoms and other elements in order to
identify patients with or without a specific trait. No common standardized,
structured, computable format exists for storing phenotyping algorithms. The
majority of algorithms are stored as human-readable descriptive text documents,
making their translation to code challenging due to their inherent complexity
and hindering their sharing and re-use across the community. In this paper, we
evaluate two key Semantic Web Technologies, the Web Ontology Language and
the Resource Description Framework, for enabling computable representations of
EHR-driven phenotyping algorithms.
Comment: Accepted American Medical Informatics Association Annual Symposium
201
Privacy in trajectory micro-data publishing: a survey
We survey the literature on the privacy of trajectory micro-data, i.e., spatiotemporal information about the mobility of individuals, whose collection is becoming increasingly simple and frequent thanks to emerging information and communication technologies. The focus of our review is on privacy-preserving data publishing (PPDP), i.e., the publication of databases of trajectory micro-data that preserve the privacy of the monitored individuals. We classify and present the literature of attacks against trajectory micro-data, as well as solutions proposed to date for protecting databases from such attacks. This paper serves as an introductory reading on a critical subject in an era of growing awareness about privacy risks connected to digital services, and provides insights into open problems and future directions for research.
Accurate Measures of Vaccination and Concerns of Vaccine Holdouts from Web Search Logs
To design effective vaccine policies, policymakers need detailed data about
who has been vaccinated, who is holding out, and why. However, existing data in
the US are insufficient: reported vaccination rates are often delayed or
missing, and surveys of vaccine hesitancy are limited by high-level questions
and self-report biases. Here, we show how large-scale search engine logs and
machine learning can be leveraged to fill these gaps and provide novel insights
about vaccine intentions and behaviors. First, we develop a vaccine intent
classifier that can accurately detect when a user is seeking the COVID-19
vaccine on search. Our classifier demonstrates strong agreement with CDC
vaccination rates, with correlations above 0.86, and estimates vaccine intent
rates to the level of ZIP codes in real time, allowing us to pinpoint more
granular trends in vaccine seeking across regions, demographics, and time. To
investigate vaccine hesitancy, we use our classifier to identify two groups,
vaccine early adopters and vaccine holdouts. We find that holdouts, compared to
early adopters matched on covariates, are 69% more likely to click on untrusted
news sites. Furthermore, we organize 25,000 vaccine-related URLs into a
hierarchical ontology of vaccine concerns, and we find that holdouts are far
more concerned about vaccine requirements, vaccine development and approval,
and vaccine myths, and even within holdouts, concerns vary significantly across
demographic groups. Finally, we explore the temporal dynamics of vaccine
concerns and vaccine seeking, and find that key indicators emerge when
individuals convert from holding out to preparing to accept the vaccine.
- …