Pattern-sensitive Time-series Anonymization and its Application to Energy-Consumption Data
Time-series anonymization is an important problem. A prominent example of time series is energy-consumption records, which may reveal details of the daily routine of a household. Existing privacy approaches for time series, e.g., from the field of trajectory anonymization, assume that every single value of a time series contains sensitive information, and they severely reduce data quality. In contrast, we consider time series where combinations of tuples represent personal information. We propose (n, l, k)-anonymity, geared to the anonymization of time-series data with minimal information loss, assuming that an adversary may learn a few data points. We propose several heuristics to obtain (n, l, k)-anonymity, and we evaluate our approach with both synthetic and real data. Our experiments confirm that it is sufficient to modify time series only moderately in order to fulfill meaningful privacy requirements.
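The privacy requirement described above can be illustrated with a minimal sketch. This is not the paper's algorithm; the helper names and the simplified reading (an adversary who knows l consecutive values, so every length-l window of values must occur in at least k distinct series) are assumptions for illustration:

```python
from collections import defaultdict

def window_counts(series_list, l):
    """Map each length-l window (tuple of values) to the set of series containing it."""
    occurrences = defaultdict(set)
    for sid, series in enumerate(series_list):
        for i in range(len(series) - l + 1):
            occurrences[tuple(series[i:i + l])].add(sid)
    return occurrences

def is_l_k_anonymous(series_list, l, k):
    """True if every length-l window appears in at least k distinct series."""
    return all(len(ids) >= k for ids in window_counts(series_list, l).values())

# Coarsened readings (e.g. rounded kWh values) shared by several households
data = [[1, 2, 2], [1, 2, 2], [1, 2, 3], [1, 2, 3]]
print(is_l_k_anonymous(data, l=2, k=2))  # → True
```

A series whose windows fail the check would have to be generalized (values coarsened) or partially suppressed, which is where the paper's heuristics trade privacy against information loss.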
Anonymizing datasets with demographics and diagnosis codes in the presence of utility constraints
Publishing data about patients that contain both demographics and diagnosis codes is essential to perform large-scale, low-cost medical studies. However, preserving the privacy and utility of such data is challenging, because it requires: (i) guarding against identity disclosure (re-identification) attacks based on both demographics and diagnosis codes, (ii) ensuring that the anonymized data remain useful in intended analysis tasks, and (iii) minimizing the information loss, incurred by anonymization, to preserve the utility of general analysis tasks that are difficult to determine before data publishing. Existing anonymization approaches are not suitable in this setting, because they cannot satisfy all three requirements. Therefore, in this work, we propose a new approach to deal with this problem. We enforce requirement (i) by applying (k, k^m)-anonymity, a privacy principle that prevents re-identification by attackers who know the demographics of a patient and up to m of their diagnosis codes, where k and m are tunable parameters. To capture requirement (ii), we propose the concept of utility constraints for both demographics and diagnosis codes. Utility constraints limit the amount of generalization and are specified by data owners (e.g., the healthcare institution that performs anonymization). We also capture requirement (iii) by employing well-established information loss measures for demographics and for diagnosis codes. To realize our approach, we develop an algorithm that enforces (k, k^m)-anonymity on a dataset containing both demographics and diagnosis codes, in a way that satisfies the specified utility constraints and with minimal information loss, according to the measures. Our experiments with a large dataset containing more than 200,000 electronic health records show the effectiveness and efficiency of our algorithm.
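The attack model behind (k, k^m)-anonymity can be sketched as a brute-force check. This is an illustrative sketch, not the paper's algorithm, and the record layout and helper name are assumptions: for every record, any attacker knowing its demographics and up to m of its diagnosis codes must match at least k records.

```python
from itertools import combinations

def is_k_km_anonymous(records, k, m):
    """records: list of (demographics_tuple, set_of_diagnosis_codes).
    True if every (demographics, up-to-m-codes) attacker query that is
    consistent with some record matches at least k records."""
    for demo, codes in records:
        for size in range(0, m + 1):
            for combo in combinations(sorted(codes), size):
                matches = sum(1 for d, c in records
                              if d == demo and set(combo) <= c)
                if matches < k:
                    return False
    return True

recs = [(("F", "30-40"), {"C01", "C02"}),
        (("F", "30-40"), {"C01", "C02"}),
        (("F", "30-40"), {"C01", "C02", "C03"})]
# The third patient's extra code C03 makes them unique to an attacker
# who knows C03, so the dataset fails for k=2, m=2.
print(is_k_km_anonymous(recs, k=2, m=2))  # → False
```

An anonymizer would repair such a failure by generalizing demographics or diagnosis codes, which is exactly what the utility constraints in the abstract bound.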
Assessing Data Usefulness for Failure Analysis in Anonymized System Logs
System logs are a valuable source of information for the analysis and
understanding of systems behavior for the purpose of improving their
performance. Such logs contain various types of information, including
sensitive information. Information deemed sensitive can be extracted directly
from individual log entries, obtained by correlating several log entries, or
inferred by combining the (non-sensitive) information contained within system
logs with other logs and/or additional datasets. The analysis of system logs
containing sensitive information compromises data privacy.
Therefore, various anonymization techniques, such as generalization and
suppression have been employed, over the years, by data and computing centers
to protect the privacy of their users, their data, and the system as a whole.
Privacy-preserving data resulting from anonymization via generalization and
suppression may lead to significantly decreased data usefulness, thus,
hindering the intended analysis for understanding the system behavior.
Maintaining a balance between data usefulness and privacy preservation,
therefore, remains an open and important challenge. Irreversible encoding of
system logs using collision-resistant hashing algorithms, such as SHAKE-128, is
a novel approach previously introduced by the authors to mitigate data privacy
concerns. The present work describes a study of the applicability of the
encoding approach from earlier work on the system logs of a production high
performance computing system. Moreover, a metric is introduced to assess the
data usefulness of the anonymized system logs to detect and identify the
failures encountered in the system.
Comment: 11 pages, 3 figures, submitted to 17th IEEE International Symposium on Parallel and Distributed Computing
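The irreversible encoding named in the abstract maps each sensitive log token through SHAKE-128, so equal tokens yield equal digests and cross-entry correlations survive while the original values do not. A minimal sketch using Python's standard `hashlib` (the token-selection rule and field names are assumptions, not the authors' exact scheme):

```python
import hashlib

def encode_token(token, digest_bytes=8):
    """Irreversibly encode a sensitive log token with SHAKE-128.
    Equal tokens map to equal digests, so repeated node/user/job ids
    remain correlatable across anonymized log entries."""
    return hashlib.shake_128(token.encode()).hexdigest(digest_bytes)

line = "2024-05-01 ERROR node042 user=alice job=9912 failed"
anonymized = " ".join(
    encode_token(tok) if tok.startswith(("node", "user=", "job=")) else tok
    for tok in line.split()
)
print(anonymized)
```

Because the encoding is deterministic, failure-analysis queries such as "how many errors did this (encoded) node produce?" still work on the anonymized logs, which is what the proposed usefulness metric assesses.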
Asymptotic Loss in Privacy due to Dependency in Gaussian Traces
The rapid growth of the Internet of Things (IoT) necessitates employing
privacy-preserving techniques to protect users' sensitive information. Even
when user traces are anonymized, statistical matching can be employed to infer
sensitive information. In our previous work, we have established the privacy
requirements for the case that the user traces are instantiations of discrete
random variables and the adversary knows only the structure of the dependency
graph, i.e., whether each pair of users is connected. In this paper, we
consider the case where data traces are instantiations of Gaussian random
variables and the adversary knows not only the structure of the graph but also
the pairwise correlation coefficients. We establish the requirements on
anonymization needed to thwart such statistical matching, and they demonstrate
the significant degree to which knowledge of the pairwise correlation
coefficients further aids the adversary in breaking user anonymity.
Comment: IEEE Wireless Communications and Networking Conference
Sharing Computer Network Logs for Security and Privacy: A Motivation for New Methodologies of Anonymization
Logs are one of the most fundamental resources to any security professional.
It is widely recognized by the government and industry that it is both
beneficial and desirable to share logs for the purpose of security research.
However, the sharing is not happening or not to the degree or magnitude that is
desired. Organizations are reluctant to share logs because of the risk of
exposing sensitive information to potential attackers. We believe this
reluctance remains high because current anonymization techniques are weak and
one-size-fits-all--or better put, one size tries to fit all. We must develop
standards and make anonymization available at varying levels, striking a
balance between privacy and utility. Organizations have different needs and
trust other organizations to different degrees. They must be able to map
multiple anonymization levels with defined risks to the trust levels they share
with (would-be) receivers. It is not until there are industry standards for
multiple levels of anonymization that we will be able to move forward and
achieve the goal of widespread sharing of logs for security researchers.
Comment: 17 pages, 1 figure
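The idea of mapping anonymization levels to trust levels can be sketched for one common log field, IP addresses. The four levels and the helper below are hypothetical illustrations of "varying levels of anonymization", not an existing standard:

```python
import hashlib
import ipaddress

def anonymize_ip(ip, level):
    """Anonymize an IPv4 address at a trust-dependent level (hypothetical scale):
    0: keep as-is              (fully trusted receiver)
    1: keyless SHAKE-128 hash  (equality preserved, address hidden)
    2: truncate to /16 subnet  (coarse topology preserved)
    3: suppress entirely       (untrusted receiver)
    """
    if level == 0:
        return ip
    if level == 1:
        return hashlib.shake_128(ip.encode()).hexdigest(8)
    if level == 2:
        net = ipaddress.ip_network(ip + "/16", strict=False)
        return str(net)
    return "x.x.x.x"

for lvl in range(4):
    print(lvl, anonymize_ip("203.0.113.7", lvl))
```

An organization could then publish which level it applies for each receiver class, giving sharing partners the defined, graded risk the abstract calls for.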