96 research outputs found

    Fuzz-classification (p, l)-Angel: An enhanced hybrid artificial intelligence based fuzzy logic for multiple sensitive attributes against privacy breaches

    The inability of traditional privacy-preserving models to protect multiple datasets based on sensitive attributes has prompted researchers to propose models such as SLOMS, SLAMSA, (p, k)-Angelization, and (p, l)-Angelization, but these were found to be insufficient in terms of robust privacy and performance. (p, l)-Angelization was successful against different privacy disclosures, but it was not efficient. To the best of our knowledge, no robust privacy model based on fuzzy logic has been proposed to protect the privacy of sensitive attributes with multiple records. In this paper, we suggest an improved version of (p, l)-Angelization based on a hybrid AI approach and a privacy-preserving approach such as generalization. Fuzz-classification (p, l)-Angel uses artificial intelligence based fuzzy logic for classification and a high-dimensional segmentation technique for segmenting quasi-identifiers and multiple sensitive attributes. We demonstrate the feasibility of the proposed solution by modelling and analyzing privacy violations using High-Level Petri Nets. The results of the experiment demonstrate that the proposed approach produces better results in terms of efficiency and utility.

    Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method

    Introduction: The amount of data generated by original research is growing exponentially. Publicly releasing them is recommended to comply with the Open Science principles. However, data collected from human participants cannot be released as-is without raising privacy concerns. Fully synthetic data represent a promising answer to this challenge. This approach is explored by the French Centre de Recherche en Épidémiologie et Santé des Populations in the form of a synthetic data generation framework based on Classification and Regression Trees and an original distance-based filtering. The goal of this work was to develop a refined version of this framework and to assess its risk-utility profile with empirical and formal tools, including novel ones developed for the purpose of this evaluation. Materials and Methods: Our synthesis framework consists of four successive steps, each of which is designed to prevent specific risks of disclosure. We assessed its performance by applying two or more of these steps to a rich epidemiological dataset. Privacy and utility metrics were computed for each of the resulting synthetic datasets, which were further assessed using machine learning approaches. Results: Computed metrics showed a satisfactory level of protection against attribute disclosure attacks for each synthetic dataset, especially when the full framework was used. Membership disclosure attacks were formally prevented without significantly altering the data. Machine learning approaches showed a low risk of success for simulated singling out and linkability attacks. Distributional and inferential similarity with the original data were high with all datasets. Discussion: This work showed the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. Formal and empirical tools specifically developed for this demonstration are a valuable contribution to this field. Further research should focus on the extension and validation of these tools, in an effort to specify the intrinsic qualities of alternative data synthesis methods. Conclusion: By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative, which seems ripe for full-scale implementation.

    Towards privacy preserving cooperative cloud based intrusion detection systems

    Cloud systems are becoming more sophisticated, dynamic, and vulnerable to attacks. It is therefore becoming increasingly difficult for a single cloud-based Intrusion Detection System (IDS) to detect all attacks, because of limited and incomplete knowledge about attacks and their implications. Recent work on cybersecurity has shown that cooperation among cloud-based IDSs can bring higher detection accuracy in such complex computer systems. Through collaboration, cloud-based IDSs can consult and share knowledge with other IDSs to enhance detection accuracy and achieve mutual benefits. One fundamental barrier within cooperative IDS is the anonymity of the data the IDSs exchange. A malicious IDS can obtain sensitive information from other IDSs by inferring from the observed data. To address this problem, we propose a new framework for achieving a privacy-preserving cooperative cloud-based IDS. Specifically, we design a unified framework that integrates privacy-preserving techniques into machine learning-based IDSs to obtain a privacy-aware cooperative IDS. This allows an IDS to hide private and sensitive information in the shared data while improving or maintaining detection accuracy. The proposed framework has been implemented by considering several machine learning and privacy-preserving techniques. The results suggest that the consulted IDSs can detect intrusions without the need to use the original data. These results (i.e., no significant degradation in accuracy was recorded) can be achieved using newly generated data that are similar to the original data semantically but not synthetically.

    Incremental k-Anonymous microaggregation in large-scale electronic surveys with optimized scheduling

    Improvements in technology have led to enormous volumes of detailed personal information made available for any number of statistical studies. This has stimulated the need for anonymization techniques striving to attain a difficult compromise between the usefulness of the data and the protection of our privacy. k-Anonymous microaggregation permits releasing a dataset where each person remains indistinguishable from k−1 other individuals, through the aggregation of demographic attributes, otherwise a potential culprit for respondent reidentification. Although privacy guarantees are by no means absolute, the elegant simplicity of the k-anonymity criterion and the excellent preservation of information utility of microaggregation algorithms have turned them into widely popular approaches whenever data utility is critical. Unfortunately, high-utility algorithms on large datasets inherently require extensive computation. This work addresses the need of running k-anonymous microaggregation efficiently with mild distortion loss, exploiting the fact that the data may arrive over an extended period of time. Specifically, we propose to split the original dataset into two portions that will be processed subsequently, allowing the first process to start before the entire dataset is received, while leveraging the superlinearity of the microaggregation algorithms involved. A detailed mathematical formulation enables us to calculate the optimal time for the fastest anonymization, as well as for minimum distortion under a given deadline. Two incremental microaggregation algorithms are devised, for which extensive experimentation is reported. The theoretical methodology presented should prove invaluable in numerous data-collection applications, including large-scale electronic surveys in which computation is possible as the data comes in. Peer reviewed. Postprint (published version).
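The k-anonymity property described in this abstract can be illustrated with a minimal sketch. It is not the incremental algorithm proposed in the paper: it is a plain fixed-size microaggregation of a single numeric attribute, and the function name `microaggregate` and the sample `ages` are assumptions made for illustration only.

```python
from statistics import mean


def microaggregate(values, k):
    """Replace each value with the mean of a group of at least k sorted neighbours.

    Illustrative univariate heuristic (an assumption, not the paper's method).
    """
    if k < 1 or len(values) < k:
        raise ValueError("need at least k records")
    # Indices of the records, ordered by attribute value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = len(values) // k
    result = [0.0] * len(values)
    for g in range(n_groups):
        start = g * k
        # The last group absorbs the remainder, so its size is between k and 2k-1.
        end = len(values) if g == n_groups - 1 else start + k
        members = order[start:end]
        centroid = mean(values[i] for i in members)
        for i in members:
            result[i] = centroid
    return result


ages = [23, 25, 31, 34, 35, 52, 58]
anonymized = microaggregate(ages, 3)
```

Every released value is shared by at least k records, and replacing values by group means keeps the overall total unchanged. The distortion the abstract refers to is the gap between each original value and its group mean.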

    Structure preserving estimators to update socio-economic indicators in small areas

    Official statistics are intended to support decision makers by providing reliable information on different population groups, identifying what their needs are and where they are located. This allows, for example, public policies to be better guided and resources to be focused on the population most in need. Statistical information must have certain characteristics to be useful for this purpose: it must be reliable, up to date and disaggregated at different domain levels, e.g., geographically or by sociodemographic groups (Eurostat, 2017). Statistical data producers (e.g., national statistical offices) face great challenges in delivering statistics with these three characteristics, mainly due to lack of resources. Population censuses collect data on demographic, economic and social aspects of all persons in a country, which makes information available for all domains of interest. However, they quickly become outdated since they are carried out only every 10 years, especially in developing countries. Furthermore, administrative data sources in many countries are not of sufficient quality to produce statistics that are reliable and comparable with other relevant sources. In contrast, national surveys are conducted more frequently than censuses and offer the possibility of studying more complex topics. Because of their sample sizes, direct estimates are published only for domains where the estimates reach a specific level of precision. These domains are called planned domains or large areas in this thesis, and the domains in which direct estimates cannot be produced due to lack of sample size or low precision will be called small areas or domains. Small area estimation (SAE) methods have been proposed as a solution to produce reliable estimates in small domains. 
These methods allow improving the precision of direct estimates, as well as providing reliable information in domains where the sample size is zero or where direct estimates cannot be obtained by combining data from censuses and surveys (Rao and Molina, 2015). Thereby, the variables obtained from both data sources are assumed to be highly correlated but the census actually may be outdated. In these cases, structure preservation estimation (SPREE) methods offer a solution when the target indicator is a categorical variable, with at least two categories (for example, the labor market status of an individual can be categorised as: ‘employed’, ‘unemployed’, and ‘out of labor force’). The population counts are arranged in contingency tables: by rows (domains of interest) and columns (the categories of the variable of interest) (Purcell and Kish, 1980). These types of estimators are studied in Part I of this work. In Chapter 1, SPREE methods are applied to produce postcensal population counts for the indicators that make up the ‘health’ dimension of the multidimensional poverty index (MPI) defined by Costa Rica. This case study is also used to illustrate the functionalities of the R spree package. It is a user-friendly tool designed to produce updated point and uncertainty estimates based on three different approaches: SPREE (Purcell and Kish, 1980), generalised SPREE (GSPREE) (Zhang and Chambers, 2004), and multivariate SPREE (MSPREE) (Luna-Hernández, 2016). SPREE-type estimators help to update population counts by preserving the census structure and relying on new and updated totals that are usually provided by recent survey data. 
However, two scenarios can jeopardise the use of standard SPREE methods: a) the indicator of interest is not available in the census data e.g., income or expenditure information to estimate monetary based poverty indicators, and b) the total margins are not reliable, for instance, when changes in the population distribution between areas are not captured correctly by the surveys or when some domains are not selected in the sample. Chapters 2 and 3 offer a solution for these cases, respectively. Chapter 2 presents a two-step procedure that allows obtaining reliable and updated estimates for small areas when the variable of interest is not available in the census. The first step is to obtain the population counts for the census year using a well-known small-area estimation approach: the empirical best prediction (EBP) (Molina and Rao, 2010) method. Then, the result of this procedure is used as input to proceed with the update for postcensal years by implementing the MSPREE (Luna-Hernández, 2016) method. This methodology is applied to the case of local areas in Costa Rica, where incidence of poverty (based on income) is estimated and updated for postcensal years (2012-2017). Chapter 3 deals with the second scenario where the population totals in local areas provided by the survey data are strengthened by including satellite imagery as an auxiliary source. These new margins are used as input in the SPREE procedure. In the case study in this paper, annual updates of the MPI for female-headed households in Senegal are produced. While the use of satellite imagery and other big data sources can improve the reliability of small-area estimates, access to survey data that can be matched with these novel sources is restricted for confidentiality reasons. Therefore, a data dissemination strategy for micro-level survey data is proposed in the paper presented in Part II. 
This strategy aims to help statistical data producers to improve the trade-off between privacy risk and utility of the data that they release for research purposes.
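The basic SPREE idea in this abstract, updating a census contingency table to new row and column totals while preserving its structure, can be sketched with iterative proportional fitting. This is a minimal sketch assuming the classic SPREE setting only; it does not reproduce the R spree package, GSPREE or MSPREE, and the function name `ipf` and all numbers are made up for illustration.

```python
def ipf(table, row_targets, col_targets, iterations=100, tol=1e-9):
    """Rescale a table of counts until its margins match the target totals.

    Rows are domains (areas), columns are categories of the indicator.
    Illustrative sketch; names and defaults are assumptions.
    """
    cells = [[float(c) for c in row] for row in table]
    for _ in range(iterations):
        # Scale every row to its updated area total.
        for i, target in enumerate(row_targets):
            total = sum(cells[i])
            if total > 0:
                cells[i] = [c * target / total for c in cells[i]]
        # Scale every column to its updated category total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            if total > 0:
                for row in cells:
                    row[j] *= target / total
        # Columns match exactly after the step above, so only rows are checked.
        gap = max(abs(sum(row) - t) for row, t in zip(cells, row_targets))
        if gap < tol:
            break
    return cells


# Hypothetical census counts: 2 areas x 3 labour-market categories.
census = [[50, 30, 20], [40, 40, 20]]
# Hypothetical updated totals, e.g. from a recent survey (both sum to 210).
updated = ipf(census, row_targets=[120, 90], col_targets=[100, 70, 40])
```

Row and column rescaling leave the cross-product ratios of the table unchanged, which is the sense in which the census structure is preserved while the margins are updated.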

    Anonimização de Dados em Educação

    Both interest in data privacy and the quantity of data collected are growing. This data, which is collected and stored electronically, contains information related to all aspects of our lives, and is frequently sensitive: financial records, activity in social networks, location traces collected by our mobile phones and even medical records. Consequently, it becomes paramount to ensure the best protection for this data, so that no harm is done to individuals even if the data becomes publicly available. To achieve this, it is necessary to avoid linkage between records in a dataset and a real-world individual. Although some attributes, such as gender and age, cannot identify an individual on their own, their combination with other datasets can lead to unique records in the dataset and a consequent linkage to a real-world individual. With data anonymization, it is possible to ensure, with various degrees of protection, that this linkage is avoided as far as possible. However, this process can reduce data utility. In this work, we explore the terminology and some of the techniques that can be used during the process of data anonymization. Moreover, we show the effects of these techniques on information loss, data utility and re-identification risk when applied to a dataset with personal information collected from college graduates. Finally, once the results are presented, we perform an analysis and comparative discussion of the obtained results.
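The point about unique attribute combinations can be made concrete with a minimal sketch. It assumes records are plain dictionaries and measures only how many records are unique on a chosen set of quasi-identifiers; the function names, the sample `graduates` data and the "20-39" age band are hypothetical and are not taken from the work summarised above.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def uniqueness_report(records, quasi_identifiers):
    """Return the smallest class size (k) and the fraction of unique records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    unique = sum(1 for s in sizes.values() if s == 1)
    return {"k": min(sizes.values()), "unique_fraction": unique / len(records)}


# Hypothetical records; not real data.
graduates = [
    {"gender": "F", "age": 23, "degree": "MSc"},
    {"gender": "F", "age": 23, "degree": "BSc"},
    {"gender": "M", "age": 24, "degree": "BSc"},
    {"gender": "M", "age": 31, "degree": "PhD"},
]
report = uniqueness_report(graduates, ["gender", "age"])

# Generalizing age into one coarse band removes the unique combinations,
# at the cost of losing the exact ages.
generalized = [dict(r, age="20-39") for r in graduates]
report_generalized = uniqueness_report(generalized, ["gender", "age"])
```

Comparing the two reports shows the trade-off the abstract describes: coarser values lower the re-identification risk and also lower the utility of the age attribute.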

    Privacy-Preserving Reengineering of Model-View-Controller Application Architectures Using Linked Data

    When a legacy system’s software architecture cannot be redesigned, implementing additional privacy requirements is often complex, unreliable and costly to maintain. This paper presents a privacy-by-design approach to reengineer web applications as linked data-enabled and implement access control and privacy preservation properties. The method is based on the knowledge of the application architecture, which for the Web of data is commonly designed on the basis of a model-view-controller pattern. Whereas wrapping techniques commonly used to link data of web applications duplicate the security source code, the new approach allows for the controlled disclosure of an application’s data, while preserving non-functional properties such as privacy preservation. The solution has been implemented and compared with existing linked data frameworks in terms of reliability, maintainability and complexity.

    Privacy-Protecting Techniques for Behavioral Data: A Survey

    Our behavior (the way we talk, walk, or think) is unique and can be used as a biometric trait. It also correlates with sensitive attributes like emotions. Hence, techniques to protect individuals' privacy against unwanted inferences are required. To consolidate knowledge in this area, we systematically reviewed applicable anonymization techniques. We taxonomize and compare existing solutions regarding privacy goals, conceptual operation, advantages, and limitations. Our analysis shows that some behavioral traits (e.g., voice) have received much attention, while others (e.g., eye-gaze, brainwaves) are mostly neglected. We also find that the evaluation methodology of behavioral anonymization techniques can be further improved.