
    Supporting Regularized Logistic Regression Privately and Efficiently

    As one of the most popular statistical and machine learning models, logistic regression with regularization has found wide adoption in biomedicine, the social sciences, information technology, and beyond. These domains often involve data from human subjects that are subject to strict privacy regulations. Increasing concerns over data privacy make it more and more difficult to coordinate and conduct large-scale collaborative studies, which typically rely on cross-institution data sharing and joint analysis. Our work focuses on safeguarding regularized logistic regression, a model widely used across disciplines yet not previously investigated from a data security and privacy perspective. We consider a common use scenario of multi-institution collaborative studies, such as the research consortia or networks widely seen in genetics, epidemiology, and the social sciences. To make our privacy-enhancing solution practical, we demonstrate a non-conventional and computationally efficient method leveraging distributed computing and strong cryptography to provide comprehensive protection over individual-level and summary data. Extensive empirical evaluation on several studies validated the privacy guarantees, efficiency, and scalability of our proposal. We also discuss the practical implications of our solution for large-scale studies and applications from various disciplines, including genetic and biomedical studies, smart grid, and network analysis.
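    For reference, the underlying model — written here with an L2 (ridge) penalty, one common choice of regularizer; the abstract does not specify which penalty the framework uses — minimizes the penalized negative log-likelihood:

    ```latex
    \min_{\beta} \; -\sum_{i=1}^{n} \Big[ y_i \, x_i^{\top}\beta - \log\!\big(1 + e^{x_i^{\top}\beta}\big) \Big] \;+\; \frac{\lambda}{2}\,\|\beta\|_2^2
    ```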

    False Discovery Rate Control for High Dimensional Dependent Data with an Application to Large-Scale Genetic Association Studies

    Large-scale genetic association studies are increasingly utilized for identifying novel susceptibility variants for complex traits, but there is little consensus on analysis methods for such data. The most commonly used methods are single-SNP or haplotype analysis with Bonferroni correction for multiple comparisons. Since the SNPs in typical GWAS are often in linkage disequilibrium (LD), at least locally, Bonferroni correction often leads to conservative error control and therefore lower statistical power. Motivated by the analysis of data from genetic association studies, we consider the problem of false discovery rate (FDR) control under the high dimensional multivariate normal model. Using the compound decision rule framework, we develop an optimal joint oracle procedure and propose a marginal procedure to approximate it. We show that the marginal plug-in procedure is asymptotically optimal under mild conditions. Our results indicate that the multiple testing procedure developed under the independence model is not only valid but also asymptotically optimal for high dimensional multivariate normal data under weak dependence. We evaluate various procedures using simulation studies and demonstrate the application of the proposed procedure to a genome-wide association study of neuroblastoma (NB). The proposed procedure identified a few more genetic variants potentially associated with NB than the standard p-value-based FDR controlling procedure did.
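    As a rough illustration of what a marginal plug-in procedure of this kind can look like, the sketch below implements a generic local-FDR rule (estimate the marginal density, compute per-test local fdr under a theoretical null, reject while the running average stays below the target level). This is not necessarily the authors' exact estimator, and the function name `marginal_lfdr_procedure` is hypothetical.

    ```python
    # Generic marginal (local-FDR) plug-in procedure; illustrative only.
    import numpy as np
    from scipy.stats import norm, gaussian_kde

    def marginal_lfdr_procedure(z, alpha=0.05, pi0=None):
        """Reject hypotheses whose running-average local fdr stays below alpha."""
        f = gaussian_kde(z)(z)              # marginal density estimate at each z
        f0 = norm.pdf(z)                    # theoretical null density
        if pi0 is None:                     # crude null-proportion estimate
            pi0 = min(1.0, np.mean(np.abs(z) < 1.0) / (norm.cdf(1) - norm.cdf(-1)))
        lfdr = np.clip(pi0 * f0 / f, 0.0, 1.0)
        order = np.argsort(lfdr)
        running_avg = np.cumsum(lfdr[order]) / np.arange(1, len(z) + 1)
        k = np.max(np.nonzero(running_avg <= alpha)[0]) + 1 if np.any(running_avg <= alpha) else 0
        reject = np.zeros(len(z), dtype=bool)
        reject[order[:k]] = True
        return reject

    # Simulated example: 10,000 tests, 200 of them non-null
    rng = np.random.default_rng(1)
    z = rng.normal(size=10_000)
    z[:200] += 3.5
    print(marginal_lfdr_procedure(z).sum(), "rejections")
    ```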

    Sample size and power analysis for sparse signal recovery in genome-wide association studies

    Genome-wide association studies have successfully identified hundreds of novel genetic variants associated with many complex human diseases. However, there is a lack of rigorous work on evaluating the statistical power for identifying these variants. In this paper, we consider sparse signal identification in genome-wide association studies and present two analytical frameworks for detailed analysis of the statistical power for detecting and identifying the disease-associated variants. We present an explicit sample size formula for achieving a given false non-discovery rate while controlling the false discovery rate based on an optimal procedure. Sparse genetic variant recovery is also considered, and a boundary condition is established in terms of sparsity and signal strength for almost exact recovery of both disease-associated and non-disease-associated variants. A data-adaptive procedure is proposed to achieve this bound. The analytical results are illustrated with a genome-wide association study of neuroblastoma.
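    The analytical formulas themselves are not reproduced in the abstract, but the quantities they trade off can be illustrated with a small simulation: the empirical false discovery rate and false non-discovery rate of a Benjamini-Hochberg rule as the per-variant sample size grows. This is an illustrative sketch only; the function `simulate_fdr_fnr` and its parameter choices are hypothetical, not the paper's procedure.

    ```python
    # Empirical FDR/FNR of a BH rule at increasing sample sizes; illustrative only.
    import numpy as np
    from scipy.stats import norm

    def simulate_fdr_fnr(n, m=5_000, m1=50, effect=0.25, alpha=0.05, rng=None):
        """m variants, m1 of them carrying a standardized effect; z-tests at sample size n."""
        if rng is None:
            rng = np.random.default_rng(0)
        mu = np.zeros(m)
        mu[:m1] = effect
        z = rng.normal(loc=mu * np.sqrt(n), scale=1.0)   # z-statistics at sample size n
        p = 2 * norm.sf(np.abs(z))
        # Benjamini-Hochberg step-up procedure
        order = np.argsort(p)
        thresh = alpha * np.arange(1, m + 1) / m
        below = np.nonzero(p[order] <= thresh)[0]
        k = below.max() + 1 if below.size else 0
        reject = np.zeros(m, dtype=bool)
        reject[order[:k]] = True
        fdr = reject[m1:].sum() / max(reject.sum(), 1)        # false discoveries / discoveries
        fnr = (~reject[:m1]).sum() / max((~reject).sum(), 1)  # missed signals / non-discoveries
        return fdr, fnr

    for n in (200, 500, 1000, 2000):
        print(n, simulate_fdr_fnr(n))
    ```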

    Computational efficiency on evaluation datasets.


    Overview of our secure framework for regularized logistic regression.

    Each institution (possessing private data) locally computes summary statistics from its own data and submits encrypted aggregates following a strong cryptographic scheme [30]. The Computation Centers securely aggregate the encryptions and conduct model estimation, from which model adjustment feedback is sent back as needed. This iterative process continues until the model converges.
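    A minimal sketch of this iterative flow, with plain summation standing in for the encrypted aggregation of [30]; the function names (`local_gradient`, `collaborative_fit`) and the choice of an L2 penalty are assumptions for illustration, not the paper's implementation.

    ```python
    # Illustrative aggregate-and-update loop for collaborative regularized
    # logistic regression: each site contributes only an aggregate gradient.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def local_gradient(X, y, beta):
        """Gradient of the logistic log-loss on one institution's data."""
        return X.T @ (sigmoid(X @ beta) - y)

    def collaborative_fit(datasets, lam=0.1, lr=0.5, n_iter=1000):
        """Center sums per-site gradients (stand-in for encrypted aggregation),
        adds the L2 penalty, and broadcasts the updated coefficients."""
        p = datasets[0][0].shape[1]
        n_total = sum(X.shape[0] for X, _ in datasets)
        beta = np.zeros(p)
        for _ in range(n_iter):
            grad = sum(local_gradient(X, y, beta) for X, y in datasets)  # aggregate
            grad = grad / n_total + lam * beta                           # regularize
            beta -= lr * grad                                            # adjustment feedback
        return beta

    # Toy run with three simulated "institutions"
    rng = np.random.default_rng(0)
    beta_true = np.array([1.5, -2.0, 0.0, 0.5])
    datasets = []
    for _ in range(3):
        X = rng.normal(size=(200, 4))
        y = rng.binomial(1, sigmoid(X @ beta_true))
        datasets.append((X, y))
    print(collaborative_fit(datasets))
    ```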