35 research outputs found

    A Simple Algorithm for Estimating Distribution Parameters from n-Dimensional Randomized Binary Responses

    Randomized response is attractive for privacy-preserving data collection because the privacy provided can be quantified by means such as differential privacy. However, recovering and analyzing statistics involving multiple dependent randomized binary attributes can be difficult, posing a significant barrier to use. In this work, we address this problem by identifying and analyzing a family of response randomizers that change each binary attribute independently with the same probability. Modes of Google's Rappor randomizer as well as applications of two well-known classical randomized response methods, Warner's original method and Simmons' unrelated question method, belong to this family. We show that randomizers in this family transform multinomial distribution parameters by an iterated Kronecker product of an invertible and bisymmetric 2 × 2 matrix. This allows us to present a simple and efficient algorithm for obtaining unbiased maximum likelihood parameter estimates for k-way marginals from randomized responses, and to provide theoretical bounds on the statistical efficiency achieved. We also describe the tradeoff between efficiency and differential privacy. Importantly, both the randomization of responses and the estimation algorithm are simple to implement, an aspect critical to technologies for privacy protection and security. Comment: Accepted at Information Security - 21st International Conference, ISC 2018. Adapted to meet article length requirements. Fixed typo; results unchanged.
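    The transform described in the abstract can be illustrated directly. The sketch below is ours, not the paper's code; the flip probability q, the function names, and the example distribution are assumptions for illustration only.

```python
import numpy as np

# Per-attribute randomizer: each binary attribute is flipped independently
# with probability q, governed by the bisymmetric, invertible matrix A.
def flip_matrix(q):
    return np.array([[1.0 - q, q],
                     [q, 1.0 - q]])

def randomize_dist(pi, q, k):
    """Transform the true k-attribute multinomial parameters pi (length 2**k)
    into the distribution of randomized responses via an iterated Kronecker
    product of A, as stated in the abstract."""
    A = flip_matrix(q)
    T = np.array([[1.0]])
    for _ in range(k):
        T = np.kron(T, A)
    return T @ pi

def estimate_dist(observed, q, k):
    """Invert the Kronecker-structured transform: the inverse of a Kronecker
    product is the Kronecker product of the inverses."""
    Ainv = np.linalg.inv(flip_matrix(q))
    Tinv = np.array([[1.0]])
    for _ in range(k):
        Tinv = np.kron(Tinv, Ainv)
    return Tinv @ observed

# With the exact randomized distribution, inversion recovers pi exactly.
pi = np.array([0.4, 0.1, 0.2, 0.3])   # k = 2 attributes, 4 outcomes
obs = randomize_dist(pi, q=0.25, k=2)
est = estimate_dist(obs, q=0.25, k=2)
```

    Because the Kronecker product of invertible matrices is itself invertible and factors per attribute, the estimation step stays simple and efficient even as the number of attributes grows.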

    Parallel Feature Selection Using Only Counts

    Count queries belong to a class of summary statistics routinely used in basket analysis, inventory tracking, and study cohort finding. In this article, we demonstrate how simple count queries can be used to parallelize sequential data mining algorithms. Specifically, we parallelize a published algorithm for finding minimum sets of discriminating features and demonstrate that the parallel speedup is close to the expected optimum.
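    As a rough illustration of the idea, not the published algorithm: candidate features can be scored in parallel using nothing but count queries. The toy data, predicate form, and scoring rule below are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy dataset: binary features f1..f3 with a class label.
DATA = [
    {"cls": 0, "f1": 1, "f2": 0, "f3": 1},
    {"cls": 0, "f1": 1, "f2": 1, "f3": 0},
    {"cls": 1, "f1": 0, "f2": 1, "f3": 1},
    {"cls": 1, "f1": 0, "f2": 0, "f3": 0},
]

def count(**predicate):
    """Count query: number of records matching all attribute=value pairs."""
    return sum(all(r[k] == v for k, v in predicate.items()) for r in DATA)

def score(feature):
    """Score a feature by how unevenly its value 1 splits across the two
    classes, computed from count queries only."""
    n1c0 = count(**{feature: 1, "cls": 0})
    n1c1 = count(**{feature: 1, "cls": 1})
    return abs(n1c0 - n1c1)

# Each worker issues its own count queries; no record-level data is shared.
with ThreadPoolExecutor() as ex:
    scores = dict(zip(["f1", "f2", "f3"], ex.map(score, ["f1", "f2", "f3"])))
```

    Because each worker needs only aggregate counts, the scoring loop parallelizes with no coordination beyond collecting the final scores.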

    Approximation properties of haplotype tagging

    BACKGROUND: Single nucleotide polymorphisms (SNPs) are locations at which the genomic sequences of population members differ. Since these differences are known to follow patterns, disease association studies are facilitated by identifying SNPs that allow the unique identification of such patterns. This process, known as haplotype tagging, is formulated as a combinatorial optimization problem and analyzed in terms of complexity and approximation properties. RESULTS: It is shown that the tagging problem is NP-hard but approximable within 1 + ln((n² − n)/2) for n haplotypes, and not approximable within (1 − ε) ln(n/2) for any ε > 0 unless NP ⊂ DTIME(n^(log log n)). A simple, very easily implementable algorithm that exhibits the above upper bound on solution quality is presented. This algorithm has running time O(… (2m − p + 1)) ≤ O(m(n² − n)/2), where p ≤ min(n, m), for n haplotypes of size m. As we show that the approximation bound is asymptotically tight, the algorithm presented is optimal with respect to this asymptotic bound. CONCLUSION: The haplotype tagging problem is hard, but approachable with a fast, practical, and surprisingly simple algorithm that cannot be significantly improved upon on a single-processor machine. Hence, significant improvement in the computational effort expended can only be expected if the computation is distributed and done in parallel.

    Perceptions of molecular epidemiology studies of HIV among stakeholders

    Background: Advances in viral sequence analysis make it possible to track the spread of infectious pathogens, such as HIV, within a population. When used to study HIV, these analyses (i.e., molecular epidemiology) potentially allow inference of the identity of individual research subjects. Current privacy standards are likely insufficient for this type of public health research. To address this challenge, it will be important to understand how stakeholders feel about the benefits and risks of such research. Design and Methods: To better understand perceived benefits and risks of these research methods, in-depth qualitative interviews were conducted with HIV-infected individuals, individuals at high risk of contracting HIV, and professionals in HIV care and prevention. To gather additional perspectives, attendees at a public lecture on molecular epidemiology were asked to complete an informal questionnaire. Results: Among those interviewed and polled, there was near-unanimous support for using molecular epidemiology to study HIV. Questionnaires showed strong agreement about the benefits of molecular epidemiology, but diverse attitudes regarding its risks. Interviewees acknowledged several risks, including privacy breaches and provocation of anti-gay sentiment. The interviews also demonstrated a possibility that misunderstandings about molecular epidemiology may affect how risks and benefits are evaluated. Conclusions: While nearly all study participants agree that the benefits of HIV molecular epidemiology outweigh the risks, concerns about privacy must be addressed to ensure continued trust in research institutions and willingness to participate in research.

    Applying a decision support system in clinical practice: Results from melanoma diagnosis

    The work reported in this paper investigates the use of a decision-support tool for the diagnosis of pigmented skin lesions in a real-world clinical trial with 511 patients and 3827 lesion evaluations. We analyzed a number of outcomes of the trial, such as a direct comparison of system performance in the laboratory and clinical settings, the performance of physicians using the system compared to a control dermatologist without the system, and the repeatability of system recommendations. The results show that system performance was significantly lower in the real-world setting than in the laboratory setting (c-index of 0.87 vs. 0.94, p = 0.01). Dermatologists using the system achieved a combined sensitivity of 85% and combined specificity of 95%. We also show that the process of acquiring lesion images using digital dermoscopy devices needs to be standardized before sufficiently high repeatability of measurements can be assured.
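    For readers unfamiliar with the reported metrics, sensitivity and specificity come straight from a binary confusion matrix. The helper below is ours, and the counts are illustrative only; the trial itself reports 85% combined sensitivity and 95% combined specificity.

```python
# Sensitivity: fraction of diseased cases correctly flagged (TP / (TP + FN)).
def sensitivity(tp, fn):
    return tp / (tp + fn)

# Specificity: fraction of healthy cases correctly cleared (TN / (TN + FP)).
def specificity(tn, fp):
    return tn / (tn + fp)

# Illustrative counts chosen to reproduce the reported 85% / 95% figures.
sens = sensitivity(tp=17, fn=3)
spec = specificity(tn=95, fp=5)
```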

    The Tension between Anonymity and Privacy

    Privacy in the context of information and data is often defined in terms of anonymity, particularly in regulations such as the GDPR. Operationally, it is appealing to define privacy in terms of computable data properties, as this makes it possible to verify compliance. A well-known example of privacy so defined is k-anonymity. At the same time, uncertainty regarding real-world privacy is increasing with the amount of data collected about us all. We present arguments for why focusing on anonymity or computable properties of data is not very helpful in this regard. In particular, we count exploitable failures of privacy defined in terms of computable properties of n-bit data and conclude that these counterexamples to protection cannot be rare.
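    To make "privacy as a computable property of data" concrete, here is a minimal k-anonymity check, the example the abstract names. The table, quasi-identifier choice, and function name are our illustration.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """A table is k-anonymous if every combination of quasi-identifier
    values occurs in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return all(c >= k for c in counts.values())

rows = [
    {"zip": "91*", "age": "30-40", "dx": "flu"},
    {"zip": "91*", "age": "30-40", "dx": "cold"},
    {"zip": "92*", "age": "20-30", "dx": "flu"},
]
# The ("92*", "20-30") combination occurs only once, so 2-anonymity fails.
ok2 = is_k_anonymous(rows, ["zip", "age"], k=2)
```

    The check is trivially computable, which is exactly the appeal the abstract describes; the paper's point is that such computable properties do not by themselves guarantee real-world privacy.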

    Differential privacy for symmetric log-concave mechanisms

    Adding random noise to database query results is an important tool for achieving privacy. A challenge is to minimize this noise while still meeting privacy requirements. Recently, a necessary and sufficient condition for (ε, δ)-differential privacy for Gaussian noise was published. This condition allows computation of the minimum privacy-preserving scale for this distribution. We extend this work and provide a necessary and sufficient condition for (ε, δ)-differential privacy for all symmetric and log-concave noise densities. Our results allow fine-grained tailoring of the noise distribution to the dimensionality of the query result. We demonstrate that this can yield significantly lower mean squared errors than those incurred by the currently used Laplace and Gaussian mechanisms for the same ε and δ. Comment: AISTATS 2022; v2 corrects typo.
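    For context, these are the two baseline calibrations the abstract compares against, not the paper's tight condition: Laplace noise with scale b = Δ/ε gives pure ε-DP, and the classical (non-tight) Gaussian calibration uses σ = Δ·sqrt(2·ln(1.25/δ))/ε. Function names and parameter values are ours.

```python
import math

def laplace_scale(sensitivity, eps):
    """Laplace mechanism scale for pure eps-DP; Var = 2 * b**2."""
    return sensitivity / eps

def gaussian_sigma(sensitivity, eps, delta):
    """Classical Gaussian mechanism calibration for (eps, delta)-DP.
    This bound is known to be loose; the paper's condition is tight."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

eps, delta, sens = 1.0, 1e-5, 1.0
var_laplace = 2.0 * laplace_scale(sens, eps) ** 2
var_gauss = gaussian_sigma(sens, eps, delta) ** 2
```

    The looseness of the classical Gaussian bound is part of what leaves room for the lower mean squared errors the paper reports.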

    A Note on the Hardness of the k-Ambiguity Problem

    We address the problem of minimal information loss when k-ambiguating data, a problem related to disclosure control in disseminated data. We show that this problem is NP-hard by considering cell suppression as the ambiguation mechanism. Along the way, we prove that the minimum k-union problem (a.k.a. minimum k-coverage, a.k.a. maximum k-intersection), the problem of selecting k sets from a collection of n sets such that the cardinality of their union is minimized, is NP-hard. We also show that if the cardinality of the sets in the collection is bounded by a constant, this restricted problem is in APX.
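    The minimum k-union problem is easy to state even though it is NP-hard: a brute-force solver makes the definition concrete. The example collection below is ours.

```python
from itertools import combinations

def min_k_union(sets, k):
    """Exhaustively choose k of the given sets so their union is smallest.
    Exponential in general; the NP-hardness result above says we cannot
    expect to do fundamentally better."""
    best = None
    for combo in combinations(sets, k):
        u = set().union(*combo)
        if best is None or len(u) < len(best):
            best = u
    return best

sets = [{1, 2}, {2, 3}, {3, 4, 5}, {1, 5}]
u = min_k_union(sets, k=2)  # smallest achievable union has 3 elements
```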