379 research outputs found
On the Impact of Voice Anonymization on Speech-Based COVID-19 Detection
With advances seen in deep learning, voice-based applications are burgeoning,
ranging from personal assistants, affective computing, to remote disease
diagnostics. As the voice contains both linguistic and paralinguistic
information (e.g., vocal pitch, intonation, speech rate, loudness), there is
growing interest in voice anonymization to preserve speaker privacy and
identity. Voice privacy challenges have emerged over the last few years and
focus has been placed on removing speaker identity while keeping linguistic
content intact. For affective computing and disease monitoring applications,
however, the paralinguistic content may be more critical. Unfortunately, the
effects that anonymization may have on these systems are still largely unknown.
In this paper, we fill this gap and focus on one particular health monitoring
application: speech-based COVID-19 diagnosis. We test two popular anonymization
methods and their impact on five different state-of-the-art COVID-19 diagnostic
systems using three public datasets. We validate the effectiveness of the
anonymization methods, compare their computational complexity, and quantify the
impact across different testing scenarios for both within- and across-dataset
conditions. Lastly, we show the benefits of anonymization as a data
augmentation tool to help recover some of the COVID-19 diagnostic accuracy loss
seen with anonymized data.
Comment: 11 pages, 10 figures
Privacy Preserving Sensitive Data Publishing using (k,n,m) Anonymity Approach
The Open Science movement has enabled extensive knowledge sharing by making research publications, software, data and samples available to society and researchers. The demand for data sharing is increasing day by day due to the tremendous knowledge hidden in the digital data generated by humans and machines. However, data cannot be published as is, because information can leak when the published data is linked with other publicly available datasets or combined with background knowledge. Researchers have proposed various anonymization techniques for privacy-preserving publishing of sensitive data. This paper proposes a (k,n,m) anonymity approach for sensitive data publishing that builds on the traditional k-anonymity technique. The selection of quasi-identifiers is automated in this approach using graph-theoretic algorithms and is further enhanced by choosing similar quasi-identifiers based on the derived and composite attributes. The usual method of choosing a single value of ‘k’ is modified in this technique by selecting different values of ‘k’ for the same dataset based on the risk of exposure and the sensitivity rank of the sensitive attributes. The proposed anonymity approach can be used for sensitive big data publishing after applying a few extension mechanisms. Experimental results show that the proposed technique is practical and can be implemented efficiently on a plethora of datasets.
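As background for the abstract above, the following is a minimal sketch of the traditional k-anonymity check that the paper builds on: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k records. It does not implement the paper's (k,n,m) approach; the function names, the sample records, and the choice of `age` and `zip` as quasi-identifiers are all illustrative assumptions.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every equivalence class contains at least k records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(n >= k for n in sizes.values())


def achieved_k(records, quasi_identifiers):
    """Return the largest k for which the table is k-anonymous (0 if empty)."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


# Hypothetical, already-generalized records (age ranges, masked zip codes).
records = [
    {"age": "20-29", "zip": "476**", "diagnosis": "flu"},
    {"age": "20-29", "zip": "476**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "479**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "479**", "diagnosis": "diabetes"},
    {"age": "30-39", "zip": "479**", "diagnosis": "flu"},
]

if __name__ == "__main__":
    print(is_k_anonymous(records, ["age", "zip"], 2))
    print(achieved_k(records, ["age", "zip"]))
```

A scheme that varies k with the sensitivity of an attribute, as the abstract describes, would presumably apply a check of this kind with a different threshold per group of records; how the paper does so is not stated in the abstract.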
Spectral anonymization of data
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 87-96).
Data anonymization is the process of conditioning a dataset such that no sensitive information can be learned about any specific individual, but valid scientific analysis can nevertheless be performed on it. It is not sufficient to simply remove identifying information because the remaining data may be enough to infer the individual source of the record (a reidentification disclosure) or to otherwise learn sensitive information about a person (a predictive disclosure). The only known way to prevent these disclosures is to remove additional information from the dataset. Dozens of anonymization methods have been proposed over the past few decades; most work by perturbing or suppressing variable values. None have been successful at simultaneously providing perfect privacy protection and allowing perfectly accurate scientific analysis. This dissertation makes the new observation that the anonymizing operations do not need to be made in the original basis of the dataset. Operating in a different, judiciously chosen basis can improve privacy protection, analytic utility, and computational efficiency. I use the term 'spectral anonymization' to refer to anonymizing in a spectral basis, such as the basis provided by the data's eigenvectors. Additionally, I propose new measures of reidentification and prediction risk that are more generally applicable and more informative than existing measures. I also propose a measure of analytic utility that assesses the preservation of the multivariate probability distribution. Finally, I propose the demanding reference standard of nonparticipation in the study to define adequate privacy protection.
I give three examples of spectral anonymization in practice. The first example improves basic cell swapping from a weak algorithm to one competitive with state-of-the-art methods merely by a change of basis. The second example demonstrates avoiding the curse of dimensionality in microaggregation. The third describes a powerful algorithm that reduces computational disclosure risk to the same level as that of nonparticipants and preserves at least 4th order interactions in the multivariate distribution. No previously reported algorithm has achieved this combination of results.
By Thomas Anton Lasko. Ph.D.
Privacy-Preserving Graph Machine Learning from Data to Computation: A Survey
In graph machine learning, data collection, sharing, and analysis often
involve multiple parties, each of which may require varying levels of data
security and privacy. To this end, preserving privacy is of great importance in
protecting sensitive information. In the era of big data, the relationships
among data entities have become unprecedentedly complex, and more applications
utilize advanced data structures (i.e., graphs) that can support network
structures and relevant attribute information. To date, many graph-based AI
models have been proposed (e.g., graph neural networks) for various domain
tasks, like computer vision and natural language processing. In this paper, we
focus on reviewing privacy-preserving techniques of graph machine learning. We
systematically review related works from the data to the computational aspects.
We first review methods for generating privacy-preserving graph data. Then we
describe methods for transmitting privacy-preserved information (e.g., graph
model parameters) to realize the optimization-based computation when data
sharing among multiple parties is risky or impossible. In addition to
discussing relevant theoretical methodology and software tools, we also discuss
current challenges and highlight several possible future research opportunities
for privacy-preserving graph machine learning. Finally, we envision a unified
and comprehensive secure graph machine learning system.
Comment: Accepted by SIGKDD Explorations 2023, Volume 25, Issue
Fair Evaluation of Global Network Aligners
Biological network alignment identifies topologically and functionally
conserved regions between networks of different species. It encompasses two
algorithmic steps: node cost function (NCF), which measures similarities
between nodes in different networks, and alignment strategy (AS), which uses
these similarities to rapidly identify high-scoring alignments. Different
methods use both different NCFs and different ASs. Thus, it is unclear whether
the superiority of a method comes from its NCF, its AS, or both. We already
showed on MI-GRAAL and IsoRankN that combining NCF of one method and AS of
another method can lead to a new superior method. Here, we evaluate MI-GRAAL
against newer GHOST to potentially further improve alignment quality. Also, we
approach several important questions that have not been asked systematically
thus far. First, we ask how much of the node similarity information in NCF
should come from sequence data compared to topology data. Existing methods
determine this more or less arbitrarily, which could affect the resulting
alignment(s). Second, when topology is used in NCF, we ask how large the size
of the neighborhoods of the compared nodes should be. Existing methods assume
that larger neighborhood sizes are better.
We find that MI-GRAAL's NCF is superior to GHOST's NCF, while the performance
of the methods' ASs is data-dependent. Thus, the combination of MI-GRAAL's NCF
and GHOST's AS could be a new superior method for certain data. Also, the
amount of sequence information used within NCF does not affect alignment
quality, while the inclusion of topological information is crucial. Finally,
larger neighborhood sizes are preferred, but often, it is the second largest
size that is superior, and using this size would decrease computational
complexity.
Together, our results give several general recommendations for a fair
evaluation of network alignment methods.
Comment: 19 pages, 10 figures. Presented at the 2014 ISMB Conference, July 13-15, Boston, MA
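The question of how much of the node similarity in an NCF should come from sequence versus topology can be pictured as a weighted combination of two scores. The sketch below is an assumption made for illustration, not the formula used by MI-GRAAL or GHOST: the weight `alpha`, the function names, and the node-pair scores are all hypothetical.

```python
def combined_node_similarity(topo_sim, seq_sim, alpha):
    """Blend topological and sequence similarity for one node pair.

    alpha = 1.0 uses topology only; alpha = 0.0 uses sequence only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * topo_sim + (1.0 - alpha) * seq_sim


def combined_similarities(topo, seq, alpha):
    """Apply the blend to every node pair that has a topological score.

    Pairs with no sequence score are treated as having sequence similarity 0.
    """
    return {
        pair: combined_node_similarity(t, seq.get(pair, 0.0), alpha)
        for pair, t in topo.items()
    }


# Hypothetical scores for node pairs (node in network 1, node in network 2).
topo = {("a1", "b1"): 0.8, ("a1", "b2"): 0.3}
seq = {("a1", "b1"): 0.2}

if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, combined_similarities(topo, seq, alpha))
```

Sweeping `alpha` over a grid and aligning with each resulting score set is one way to study the sequence-versus-topology balance the abstract discusses.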