    Detecting Term Relationships to Improve Textual Document Sanitization

    Nowadays, the publication of textual documents provides critical benefits to scientific research and business scenarios where information analysis plays an essential role. Nevertheless, the possible presence of identifying or confidential data in this kind of document motivates the use of measures to sanitize sensitive information before publication, while keeping innocuous data unmodified. Several automatic sanitization mechanisms can be found in the literature; however, most of them evaluate the sensitivity of textual terms by treating them as independent variables. At the same time, some authors have shown that there are important information disclosure risks inherent to the existence of relationships between sanitized and non-sanitized terms. Therefore, neglecting term relationships in document sanitization represents a serious privacy threat. In this paper, we present a general-purpose method to automatically detect semantically related terms that may enable disclosure of sensitive data. The foundations of Information Theory and a corpus as large as the Web are used to assess the degree of relationship between textual terms according to the amount of information they provide about each other. Preliminary evaluation results show that our proposal significantly improves the detection recall of current sanitization schemes, which reduces the disclosure risk.
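    As a rough illustration of assessing term relationships by "the amount of information they provide" about each other, the sketch below uses pointwise mutual information (PMI) computed from occurrence counts. PMI is an assumption chosen here for illustration; the abstract does not name the exact measure. The function names, the counts, and the threshold are all hypothetical, and in practice the counts would come from a large corpus such as Web search hit counts.

```python
import math


def pmi(count_x, count_y, count_xy, total):
    """Pointwise mutual information (in bits) between two terms.

    count_x, count_y: number of documents containing each term
    count_xy: number of documents containing both terms
    total: corpus size (all counts are hypothetical inputs)
    """
    if min(count_x, count_y, count_xy) <= 0:
        return float("-inf")  # never observed together: no evidence of a relation
    p_x = count_x / total
    p_y = count_y / total
    p_xy = count_xy / total
    return math.log2(p_xy / (p_x * p_y))


def flag_related(candidates, count_sensitive, total, threshold):
    """Return candidate terms whose PMI with the sensitive term reaches the threshold.

    candidates maps term -> (count_term, count_joint_with_sensitive_term).
    """
    flagged = []
    for term, (count_term, count_joint) in candidates.items():
        if pmi(count_sensitive, count_term, count_joint, total) >= threshold:
            flagged.append(term)
    return sorted(flagged)


# Made-up counts: one term co-occurs strongly with the sensitive term, one does not.
candidates = {"antiretroviral": (128, 64), "weather": (128, 8)}
print(flag_related(candidates, count_sensitive=64, total=1024, threshold=1.0))
```

    Terms flagged this way would be candidates for sanitization alongside the sensitive term itself, since leaving them in the text could let a reader infer what was removed.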

    Utility-Preserving Anonymization of Textual Documents

    Every day, people post a significant amount of data on the Internet, such as tweets, reviews, photos, and videos. Organizations collecting these types of data use them to extract information in order to improve their services or for commercial purposes. Yet, if the collected data contain sensitive personal information, they cannot be shared with third parties or released publicly without consent or adequate protection of the data subjects. Privacy-preserving mechanisms provide ways to sanitize data so that identities and/or confidential attributes are not disclosed. A great variety of mechanisms have been proposed to anonymize structured databases with numerical and categorical attributes; however, automatically protecting unstructured textual data has received much less attention. In general, textual data anonymization requires first detecting the pieces of text that may disclose sensitive information and then masking those pieces via suppression or generalization. In this work, we leverage several technologies to anonymize textual documents. We first improve state-of-the-art techniques based on sequence labeling. After that, we extend them to align them better with the notion of disclosure risk and with the privacy requirements. Finally, we propose a complete framework based on word embedding models that captures a broader notion of data protection and provides flexible protection driven by privacy requirements. We also leverage ontologies to preserve the utility of the masked text, that is, its semantics and readability. Extensive experimental results show that our methods outperform the state of the art by providing more robust anonymization while reasonably preserving the utility of the protected outcome.
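    The masking step described in this abstract (suppression or generalization of detected pieces of text) can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis's method: the detection step is taken as given (a plain list of terms), the `TAXONOMY` dictionary is a hypothetical stand-in for an ontology, and the `mask` function and `[REDACTED]` placeholder are illustrative names.

```python
import re

# Hypothetical toy taxonomy mapping a term to a more general concept.
# A real system would query an ontology; this dictionary is an assumption.
TAXONOMY = {"HIV": "an infection", "Barcelona": "a city"}


def mask(text, sensitive_terms, taxonomy=TAXONOMY, placeholder="[REDACTED]"):
    """Mask each detected term: generalize it if the taxonomy knows it, else suppress it."""
    if not sensitive_terms:
        return text
    # Try longer terms first so overlapping terms are matched greedily.
    ordered = sorted(sensitive_terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))

    def replace(match):
        term = match.group(0)
        # Generalization preserves some meaning; suppression is the fallback.
        return taxonomy.get(term, placeholder)

    return pattern.sub(replace, text)


detected = ["John Smith", "HIV", "Barcelona"]  # output of a hypothetical detector
print(mask("John Smith was treated for HIV in Barcelona.", detected))
```

    Generalizing rather than suppressing is what lets the masked text keep part of its semantics and readability, which is the utility concern the abstract raises.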

    Misusability Measure Based Sanitization of Big Data for Privacy Preserving MapReduce Programming

    Leakage and misuse of sensitive data is a challenging problem for enterprises. It has become a more serious problem with the advent of cloud computing and big data, because more data is outsourced to public clouds and published for wider visibility. Therefore, Privacy Preserving Data Publishing (PPDP), Privacy Preserving Data Mining (PPDM) and Privacy Preserving Distributed Data Mining (PPDDM) are crucial in the contemporary era. PPDP and PPDM can protect privacy at the data and process levels, respectively. With big data, data privacy has become indispensable because data is stored and processed in semi-trusted environments. In this paper, we propose a comprehensive methodology for effective sanitization of data, based on a misusability measure, to preserve privacy and prevent data leakage and misuse. We follow a hybrid approach that caters to the needs of privacy-preserving MapReduce programming. We propose an algorithm known as the Misusability Measure-Based Privacy Preserving Algorithm (MMPP), which considers the level of misusability before choosing and applying an appropriate sanitization to big data. Our empirical study with Amazon EC2 and EMR revealed that the proposed methodology is useful in realizing privacy-preserving MapReduce programming.

    Exploiting contextual information in attacking set-generalized transactions

    Transactions are records that contain a set of items about individuals. For example, items browsed by a customer when shopping online form a transaction. Today, many activities are carried out on the Internet, resulting in a large amount of transaction data being collected. Such data are often shared and analyzed to improve business and services, but they also contain private information about individuals that must be protected. Techniques have been proposed to sanitize transaction data before their release, and set-based generalization is one such method. In this article, we study how well set-based generalization can protect transactions. We propose methods to attack set-generalized transactions by exploiting contextual information that is available within the released data. Our results show that set-based generalization may not provide adequate protection for transactions: up to 70% of the items added to the transactions during generalization to obfuscate the original data can be detected by our methods with a precision over 80%.

    Analysis and classification of privacy-sensitive content in social media posts

    User-generated content often contains private information, even when it is shared publicly on social media and on the web in general. Although many filtering and natural language approaches for automatically detecting obscenities or hate speech have been proposed, determining whether a shared post contains sensitive information is still an open issue. The problem has been addressed by assuming, for instance, that sensitive content is published anonymously, on anonymous social media platforms or with more restrictive privacy settings, but these assumptions are far from realistic, since the authors of posts often underestimate or overlook their actual exposure to privacy risks. Hence, in this paper, we address the problem of content sensitivity analysis directly, by presenting and characterizing a new annotated corpus of around ten thousand posts, each one annotated as sensitive or non-sensitive by a pool of experts. We characterize our data with respect to the closely related problem of self-disclosure, pointing out the main differences between the two tasks. We also present the results of several deep neural network models that outperform previous naive attempts at classifying social media posts according to their sensitivity, and show that state-of-the-art approaches based on anonymity and lexical analysis do not work in realistic application scenarios.

    Ontology-based Access Control in Open Scenarios: Applications to Social Networks and the Cloud

    Thanks to the advent of the Internet, it is now possible to easily share vast amounts of electronic information and computer resources (which include hardware, computer services, etc.) in open distributed environments. These environments serve as a common platform for heterogeneous users (e.g., companies, individuals, etc.) by hosting customized user applications and systems, providing ubiquitous access to the shared resources and requiring less administrative effort; as a result, they enable users and companies to increase their productivity. Unfortunately, sharing resources in open environments has significantly increased the privacy threats to users. Indeed, shared electronic data may be exploited by third parties, such as Data Brokers, which may aggregate, infer and redistribute (sensitive) personal features, thus potentially impairing the privacy of individuals. One way to mitigate this problem is to control the access of users to potentially sensitive resources. Specifically, access control management regulates access to the shared resources according to the credentials of the users, the type of resource and the privacy preferences of the resource/data owners. The efficient management of access control is crucial in large and dynamic environments such as the ones described above. Moreover, in order to propose a feasible and scalable solution, we need to get rid of the manual management of rules and constraints (on which most available solutions rely), which constitutes a serious burden for users and administrators. Finally, access control management should be intuitive for end users, who usually lack technical expertise and may find access control mechanisms difficult to understand and rigid to apply because of their complex configuration settings.

    De-Identification of French Unstructured Clinical Notes for Machine Learning Tasks

    Unstructured textual data are at the heart of health systems: liaison letters between doctors, operating reports, coding of procedures according to the ICD-10 standard, etc. The details included in these documents make it possible to get to know the patient better, to better manage him or her, to better study the pathologies, and to accurately remunerate the associated medical acts. All this seems to be (at least partially) within reach today thanks to artificial intelligence techniques. However, for obvious reasons of privacy protection, the designers of these AIs do not have the legal right to access these documents as long as they contain identifying data. De-identifying these documents, i.e. detecting and deleting all identifying information present in them, is a legally necessary step for sharing this data between two complementary worlds. Over the last decade, several proposals have been made to de-identify documents, mainly in English. While the detection scores are often high, the substitution methods are often not very robust to attack. In French, very few methods are based on arbitrary detection and/or substitution rules. In this paper, we propose a new comprehensive de-identification method dedicated to French-language medical documents. Both the approach for the detection of identifying elements (based on deep learning) and the approach for their substitution (based on differential privacy) build on the most proven existing approaches. The result is an approach that effectively protects the privacy of the patients at the heart of these medical documents. The whole approach has been evaluated on a French-language medical dataset from a French public hospital, and the results are very encouraging.