
    Assessing Data Usefulness for Failure Analysis in Anonymized System Logs

    System logs are a valuable source of information for analyzing and understanding system behavior in order to improve performance. Such logs contain various types of information, including sensitive information. Information deemed sensitive can either be extracted directly from system log entries, derived by correlating several log entries, or inferred by combining the (non-sensitive) information contained within system logs with other logs and/or additional datasets. The analysis of system logs containing sensitive information compromises data privacy. Therefore, various anonymization techniques, such as generalization and suppression, have been employed over the years by data and computing centers to protect the privacy of their users, their data, and the system as a whole. Privacy-preserving data resulting from anonymization via generalization and suppression may exhibit significantly decreased usefulness, thus hindering the intended analysis of system behavior. Maintaining a balance between data usefulness and privacy preservation therefore remains an open and important challenge. Irreversible encoding of system logs using collision-resistant hashing algorithms, such as SHAKE-128, is a novel approach previously introduced by the authors to mitigate data privacy concerns. The present work studies the applicability of this encoding approach on the system logs of a production high-performance computing system. Moreover, a metric is introduced to assess the data usefulness of the anonymized system logs for detecting and identifying the failures encountered in the system. Comment: 11 pages, 3 figures, submitted to 17th IEEE International Symposium on Parallel and Distributed Computing
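    The hash-based encoding described above can be sketched in a few lines of Python. This is a minimal illustration assuming dictionary-shaped log entries with hypothetical field names (`user`, `hostname`), not the authors' exact pipeline.

```python
import hashlib

def encode_field(value: str, digest_bytes: int = 8) -> str:
    """Irreversibly encode one sensitive field with the SHAKE-128 XOF."""
    return hashlib.shake_128(value.encode("utf-8")).hexdigest(digest_bytes)

def anonymize_entry(entry: dict, sensitive_keys=("user", "hostname")) -> dict:
    """Replace sensitive fields only; non-sensitive fields pass through."""
    return {k: (encode_field(v) if k in sensitive_keys else v)
            for k, v in entry.items()}

entry = {"time": "2018-03-01T12:00:00", "user": "jdoe",
         "hostname": "node042", "msg": "job failed: segfault"}
anonymized = anonymize_entry(entry)
```

    Because the encoding is deterministic, equal inputs map to equal digests, so failure events can still be grouped by the encoded hostname or user — the kind of property a data-usefulness metric for failure analysis would rely on.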

    Randomization Based Privacy Preserving Categorical Data Analysis

    The success of data mining relies on the availability of high-quality data. To ensure quality data mining, effective information sharing between organizations becomes a vital requirement in today's society. Since data mining often involves sensitive information about individuals, the public has expressed deep concern about privacy. Privacy-preserving data mining is the study of eliminating privacy threats while, at the same time, preserving useful information in the released data. This dissertation investigates data utility and privacy of randomization-based models in privacy-preserving data mining for categorical data. For the analysis of data utility in the randomization model, we first investigate the accuracy analysis for association rule mining in market basket data. Then we propose a general framework for theoretical analysis of how the randomization process affects the accuracy of various measures adopted in categorical data analysis. We also examine data utility when randomization mechanisms are not disclosed to data miners in order to achieve better privacy. We investigate how various objective association measures between two variables may be affected by randomization. We then extend the analysis to multiple variables by examining the feasibility of hierarchical loglinear modeling. Our results provide a reference for data miners about what they can and cannot do with certainty upon randomized data directly, without knowledge of the original data distribution and the distortion information. Data privacy and data utility are commonly considered a pair of conflicting requirements in privacy-preserving data mining applications. In this dissertation, we investigate privacy issues in randomization models. In particular, we focus on attribute disclosure under linking attacks in data publishing.
    We propose efficient solutions to determine optimal distortion parameters such that we can maximize utility preservation while still satisfying privacy requirements. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects: reconstructed distributions, accuracy of answering queries, and preservation of correlations. Our empirical results show that randomization incurs significantly smaller utility loss.
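    A minimal sketch of the kind of randomization model discussed above, for a single categorical attribute: each value is reported truthfully with probability p and replaced by a uniformly random category otherwise, and the known distortion is inverted to estimate the original distribution. The parameter names are illustrative; the dissertation's actual distortion matrices may differ.

```python
import random
from collections import Counter

def randomize(value, categories, p_keep=0.7, rng=random):
    """Report the true category with probability p_keep,
    otherwise report a uniformly random category."""
    return value if rng.random() < p_keep else rng.choice(categories)

def estimate_distribution(reports, categories, p_keep=0.7):
    """Invert the known distortion: P(report c) = p*pi_c + (1-p)/k,
    hence pi_c = (observed_c - (1-p)/k) / p."""
    n, k = len(reports), len(categories)
    counts = Counter(reports)
    noise = (1 - p_keep) / k   # probability mass any category gets from noise
    return {c: (counts[c] / n - noise) / p_keep for c in categories}
```

    The inversion step is exactly what becomes impossible when the distortion parameters are withheld from data miners, which is the regime the dissertation analyzes separately.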

    Geolocation with respect to personal privacy for the Allergy Diary app - a MASK study

    Background: Collecting data on the localization of users is a key issue for the MASK (Mobile Airways Sentinel network: the Allergy Diary) app. Data anonymization is a method of sanitization for privacy. The European Commission's Article 29 Working Party stated that geolocation information is personal data. The aim was to assess geolocation using the MASK method and to compare two anonymization methods in the MASK database in order to find an optimal privacy method. Methods: Geolocation was studied for all people who used the Allergy Diary app from December 2015 to November 2017 and who reported medical outcomes. Two different anonymization methods were evaluated: noise addition (randomization) and k-anonymity (generalization). Results: 93,116 days of VAS data were collected from 8535 users, and 54,500 (58.5%) were geolocalized, corresponding to 5428 users. Noise addition was found to be less accurate than k-anonymity for protecting users' privacy with MASK data. Discussion: k-anonymity is an acceptable method for the anonymization of MASK data, and the results can be used for other databases. Peer reviewed
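    The k-anonymity-by-generalization approach can be illustrated by coarsening coordinate precision until every grid cell contains at least k users. This is a toy sketch of the idea, not the exact generalization procedure used in the study.

```python
from collections import Counter

def generalize_locations(points, k=5, start_precision=3):
    """Round (lat, lon) pairs to ever-coarser decimal precision until
    each resulting grid cell contains at least k points (k-anonymity)."""
    for precision in range(start_precision, -1, -1):
        cells = [(round(lat, precision), round(lon, precision))
                 for lat, lon in points]
        if min(Counter(cells).values()) >= k:
            return cells, precision
    return None  # k-anonymity not reachable even at ~1-degree cells
```

    Noise addition would instead perturb each coordinate independently, which the study found to be the less accurate of the two methods for this data.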

    When Enough is Enough: Location Tracking, Mosaic Theory, and Machine Learning

    Since 1967, when it decided Katz v. United States, the Supreme Court has tied the right to be free of unwanted government scrutiny to the concept of reasonable expectations of privacy.[1] An evaluation of reasonable expectations depends, among other factors, upon an assessment of the intrusiveness of government action. Historically, when making such an assessment, the Court has considered police conduct with clear temporal, geographic, or substantive limits. However, in an era where new technologies permit the storage and compilation of vast amounts of personal data, things are becoming more complicated. A school of thought known as “mosaic theory” has stepped into the void, ringing the alarm that our old tools for assessing the intrusiveness of government conduct potentially undervalue privacy rights. Mosaic theorists advocate a cumulative approach to the evaluation of data collection. Under the theory, searches are “analyzed as a collective sequence of steps rather than as individual steps.”[2] The approach is based on the recognition that comprehensive aggregation of even seemingly innocuous data reveals greater insight than consideration of each piece of information in isolation. Over time, discrete units of surveillance data can be processed to create a mosaic of habits, relationships, and much more. Consequently, a Fourth Amendment analysis that focuses only on the government’s collection of discrete units of trivial data fails to appreciate the true harm of long-term surveillance—the composite. In the context of location tracking, the Court has previously suggested that the Fourth Amendment may (at some theoretical threshold) be concerned with the accumulated information revealed by surveillance.[3] Similarly, in the Court’s recent decision in United States v.
Jones, a majority of concurring justices indicated willingness to explore such an approach.[4] However, in general, the Court has rejected any notion that technological enhancement matters to the constitutional treatment of location tracking.[5] Rather, it has found that such surveillance in public spaces, which does not require physical trespass, is equivalent to a human tail and thus not regulated by the Fourth Amendment. In this way, the Court has avoided quantitative analysis of the amendment’s protections. The Court’s reticence is built on the enticingly direct assertion that objectivity under the mosaic theory is impossible. This is true in large part because there has been no rationale yet offered to objectively distinguish relatively short-term monitoring from its counterpart of greater duration. As Justice Scalia recently observed in Jones: “it remains unexplained why a 4-week investigation is ‘surely’ too long.”[6] This article suggests that by combining the lessons of machine learning with the mosaic theory and applying the pairing to the Fourth Amendment we can see the contours of a response. Machine learning makes clear that mosaics can be created. Moreover, there are also important lessons to be learned on when that is the case. Machine learning is the branch of computer science that studies systems that can draw inferences from collections of data, generally by means of mathematical algorithms. In a recent competition called “The Nokia Mobile Data Challenge,”[7] researchers evaluated machine learning’s applicability to GPS and cell phone tower data. From a user’s location history alone, the researchers were able to estimate the user’s gender, marital status, occupation and age.[8] Algorithms developed for the competition were also able to predict a user’s likely future location by observing past location history. 
The prediction of a user’s future location could be even further improved by using the location data of friends and social contacts.[9] Machine learning of the sort on display during the Nokia competition seeks to harness the data deluge of today’s information society by efficiently organizing data, finding statistical regularities and other patterns in it, and making predictions therefrom. Machine learning algorithms are able to deduce information—including information that has no obvious linkage to the input data—that may otherwise have remained private due to the natural limitations of manual and human-driven investigation. Analysts can “train” machine learning programs using one dataset to find similar characteristics in new datasets. When applied to the digital “bread crumbs” of data generated by people, machine learning algorithms can make targeted personal predictions. The greater the number of data points evaluated, the greater the accuracy of the algorithm’s results. In five parts, this article advances the conclusion that the duration of investigations is relevant to their substantive Fourth Amendment treatment because duration affects the accuracy of the predictions. Though it was previously difficult to explain why an investigation of four weeks was substantively different from an investigation of four hours, we now have a better understanding of the value of aggregated data when viewed through a machine learning lens. In some situations, predictions of startling accuracy can be generated with remarkably few data points. Furthermore, in other situations accuracy can increase dramatically above certain thresholds. 
    For example, a 2012 study found the ability to deduce ethnicity moved sideways through five weeks of phone data monitoring, jumped sharply to a new plateau at that point, and then increased sharply again after twenty-eight weeks.[10] More remarkably, the accuracy of identification of a target’s significant other improved dramatically after five days’ worth of data inputs.[11] Experiments like these support the notion of a threshold, a point at which it makes sense to draw a Fourth Amendment line. In order to provide an objective basis for distinguishing between law enforcement activities of differing duration, the results of machine learning algorithms can be combined with notions of privacy metrics, such as k-anonymity or l-diversity. While reasonable minds may dispute the most suitable minimum accuracy threshold, this article makes the case that the collection of data points allowing predictions that exceed selected thresholds should be deemed unreasonable searches in the absence of a warrant.[12] Moreover, any new rules should take into account not only the data being collected but also the foreseeable improvements in the machine learning technology that will ultimately be brought to bear on it; this includes using future algorithms on older data. In 2001, the Supreme Court asked “what limits there are upon the power of technology to shrink the realm of guaranteed privacy.”[13] In this piece, we explore an answer and investigate what lessons there are in the power of technology to protect the realm of guaranteed privacy. After all, as technology takes away, it also gives. The objective understanding of data compilation and analysis that is revealed by machine learning provides important Fourth Amendment insights. We should begin to consider these insights more closely.
    [1] Katz v. United States, 389 U.S. 347, 361 (1967) (Harlan, J., concurring).
    [2] Orin Kerr, The Mosaic Theory of the Fourth Amendment, 111 Mich. L. Rev. 311, 312 (2012).
    [3] United States v. Knotts, 460 U.S. 276, 284 (1983).
    [4] Justice Scalia, writing for the majority, left the question open. United States v. Jones, 132 S. Ct. 945, 954 (2012) (“It may be that achieving the same result [as in traditional surveillance] through electronic means, without an accompanying trespass, is an unconstitutional invasion of privacy, but the present case does not require us to answer that question.”).
    [5] Compare Knotts, 460 U.S. at 276 (rejecting the contention that an electronic beeper should be treated differently than a human tail) and Smith v. Maryland, 442 U.S. 735, 744 (1979) (approving the warrantless use of a pen register in part because the justices were “not inclined to hold that a different constitutional result is required because the telephone company has decided to automate”) with Kyllo v. United States, 533 U.S. 27, 33 (2001) (recognizing that advances in technology affect the degree of privacy secured by the Fourth Amendment).
    [6] United States v. Jones, 132 S. Ct. 945 (2012); see also Kerr, 111 Mich. L. Rev. at 329-330.
    [7] See Nokia Research Center, Mobile Data Challenge 2012 Workshop, http://research.nokia.com/page/12340.
    [8] Sanja Brdar, Dubravko Culibrk & Vladimir Crnojevic, Demographic Attributes Prediction on the Real-World Mobile Data, Nokia Mobile Data Challenge Workshop 2012.
    [9] Manlio de Domenico, Antonio Lima & Mirco Musolesi, Interdependence and Predictability of Human Mobility and Social Interactions, Nokia Mobile Data Challenge Workshop 2012.
    [10] See Yaniv Altshuler, Nadav Aharony, Michael Fire, Yuval Elovici & Alex Pentland, Incremental Learning with Accuracy Prediction of Social and Individual Properties from Mobile-Phone Data, WS3P, IEEE Social Computing (2012), Figure 10.
    [11] Id., Figure 9.
    [12] Admittedly, there are differing views on sources of authority beyond the Constitution that might justify location tracking. See, e.g., Stephanie K. Pell & Christopher Soghoian, Can You See Me Now? Toward Reasonable Standards for Law Enforcement Access to Location Data That Congress Could Enact, 27 Berkeley Tech. L.J. 117 (2012).
    [13] Kyllo, 533 U.S. at 34.
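    The accuracy-versus-data-volume relationship driving the proposed threshold can be demonstrated with a toy simulation: a nearest-distribution classifier guessing a lifestyle profile from location visit frequencies becomes more accurate as more location points are observed. The profiles, location types, and numbers below are invented purely for illustration.

```python
import random

# Hypothetical visit distributions over four location types
# (home, office, gym, cafe) for two invented profiles.
PROFILES = {
    "commuter": [0.4, 0.4, 0.1, 0.1],
    "homebody": [0.8, 0.05, 0.05, 0.1],
}

def sample_history(profile, n_points, rng):
    """Draw n_points location observations from a profile's distribution."""
    return rng.choices(range(4), weights=PROFILES[profile], k=n_points)

def classify(history):
    """Guess the profile whose distribution is closest (L1) to the
    empirical visit frequencies of the observed history."""
    n = len(history)
    freq = [history.count(i) / n for i in range(4)]
    return min(PROFILES, key=lambda p: sum(abs(f - w)
               for f, w in zip(freq, PROFILES[p])))

def accuracy(n_points, trials=500, rng=None):
    """Fraction of simulated users whose profile is guessed correctly."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(trials):
        truth = rng.choice(list(PROFILES))
        hits += classify(sample_history(truth, n_points, rng)) == truth
    return hits / trials
```

    With only a couple of observations the guess is unreliable; with dozens it approaches certainty, which is the intuition behind tying Fourth Amendment treatment to the duration (and thus volume) of location collection.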

    Gender identity inclusion in the workplace: broadening diversity management research and practice through the case of transgender employees in the UK

    Based on 14 in-depth interviews, this paper explores the unique workplace experiences of transgender individuals in the UK employment context. The paper identifies gender identity diversity as a key blind spot in HRM and diversity management research and practice. The findings reveal the range of workplace challenges experienced by transgender employees. Major findings are that discriminatory effects are often occupation- and industry-specific; that transition is a period during which many transgender workers suffer from a lack of proper organisational support; and that expertise deficits exist in supporting and accommodating transgender employees’ needs. In unpacking these experiences, the paper demonstrates the distinctive dimensions of the challenges faced by transgender employees, revealing the need to conceptually expand how we frame diversity and diversity management. Our findings identify the necessity of an emic approach not only to researching diversity but also to devising organisational diversity strategies. The paper provides recommendations for HRM policy and practice in order to develop a more sophisticated approach to achieving inclusion.

    Analysis and improvement of security and privacy techniques for genomic information

    The purpose of this thesis is to review the recent literature on privacy-preserving techniques for genomic information. Based on this analysis, we propose a long-term classification system for the reviewed techniques. We also develop a security improvement proposal for the Beacon system that does not hinder research utility.
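    A Beacon answers yes/no membership queries of the form "does any genome in this dataset carry variant V?", and one family of defenses discussed in the literature randomizes that answer to blunt membership-inference attacks. The sketch below illustrates that general idea; it is not necessarily the specific proposal developed in this thesis, and the variant tuples and flip probability `eps` are illustrative.

```python
import random

def beacon_query(dataset, variant):
    """Plain Beacon semantics: True iff some genome carries the variant.
    Variants are modeled as (chromosome, position, allele) tuples."""
    return variant in dataset

def noisy_beacon_query(dataset, variant, eps=0.1, rng=random):
    """Flip the truthful answer with probability eps, trading a little
    research utility for resistance to membership inference."""
    answer = beacon_query(dataset, variant)
    return (not answer) if rng.random() < eps else answer
```

    The smaller eps is, the more useful the Beacon remains for researchers, but the more confidently an attacker can infer that a queried individual is in the dataset.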