6 research outputs found
ODACH: A One-shot Distributed Algorithm for Cox model with heterogeneous multi-center data
We developed a One-shot Distributed Algorithm for the Cox proportional-hazards model to analyze Heterogeneous multi-center time-to-event data (ODACH), circumventing the need to share patient-level information across sites. The algorithm uses a surrogate likelihood function to approximate the site-stratified Cox log-partial likelihood, constructed from patient-level data at a lead site and aggregated information from the other sites, allowing both the baseline hazard functions and the covariate distributions to vary across sites. Simulation studies and an application to a real-world opioid use disorder study showed that ODACH yields estimates close to the pooled estimator, which analyzes patient-level data from all sites directly via a stratified Cox model. Compared to the meta-analysis estimator, the inverse-variance-weighted average of the site-specific estimates, the ODACH estimator is less susceptible to bias, especially when the event is rare. ODACH is thus a valuable privacy-preserving and communication-efficient method for analyzing multi-center time-to-event data.
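The meta-analysis baseline that ODACH is compared against, the inverse-variance-weighted average of site-specific estimates, can be sketched in a few lines. The numbers below are hypothetical site-level log-hazard-ratio estimates, not from the study; this illustrates only the comparator, not the ODACH surrogate likelihood itself.

```python
import numpy as np

def inverse_variance_meta(estimates, variances):
    """Inverse-variance-weighted average of site-specific estimates.

    Each site contributes only a point estimate and its variance, so no
    patient-level data changes hands; the pooled variance is the
    reciprocal of the summed weights.
    """
    w = 1.0 / np.asarray(variances, dtype=float)
    est = np.asarray(estimates, dtype=float)
    pooled = np.sum(w * est) / np.sum(w)
    pooled_var = 1.0 / np.sum(w)
    return pooled, pooled_var

# Hypothetical log-hazard-ratio estimates from three sites.
beta_hat, var_hat = inverse_variance_meta([0.5, 0.7, 0.6], [0.04, 0.09, 0.01])
```

With rare events, these site-level estimates can be individually biased, which is the regime where the abstract reports meta-analysis underperforming ODACH.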
Distributed estimation of spiked eigenvalues in spiked population models
The proliferation of science and technology has led to the prevalence of
voluminous data sets that are distributed across multiple machines.
Conventional statistical methodologies are often infeasible for such
massive data sets due to prohibitive computing time, memory constraints,
communication overhead, and confidentiality considerations. In this paper,
we propose distributed estimators of the spiked eigenvalues in spiked
population models. We derive the consistency and asymptotic normality of
the distributed estimators and provide a statistical error analysis. The
proposed distributed estimation attains the same order of convergence as
estimation from the full sample. Simulation studies and a real data
analysis indicate that the proposed distributed estimation and testing
procedures have excellent properties in terms of estimation accuracy and
stability as well as transmission efficiency.
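A minimal divide-and-conquer sketch of the setting, not necessarily the authors' exact estimator: each machine transmits only its local sample covariance (a p x p summary, never the raw rows), the summaries are averaged, and the top eigenvalues of the average estimate the spikes. Per-block centering is omitted for simplicity since the simulated data are mean zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_covariance(X):
    # Each machine computes and transmits only this p x p summary.
    return X.T @ X / X.shape[0], X.shape[0]

def distributed_top_eigenvalues(blocks, k):
    summaries = [local_covariance(X) for X in blocks]
    n_total = sum(n for _, n in summaries)
    # The sample-size-weighted average of local covariances equals the
    # pooled (uncentered) sample covariance of the full data.
    S = sum(n * C for C, n in summaries) / n_total
    return np.linalg.eigvalsh(S)[::-1][:k]  # largest k eigenvalues

# Spiked population model: two large eigenvalues above a unit bulk.
p = 20
sigma = np.eye(p)
sigma[0, 0], sigma[1, 1] = 8.0, 4.0
blocks = [rng.multivariate_normal(np.zeros(p), sigma, size=500)
          for _ in range(4)]
top2 = distributed_top_eigenvalues(blocks, 2)
```

Averaging local covariances recovers the full-sample estimator exactly here, which is why this naive scheme matches the full-sample order of convergence; the paper's contribution concerns doing this with proper inference and tighter communication budgets.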
Multi-Task Learning with Summary Statistics
Multi-task learning has emerged as a powerful machine learning paradigm for
integrating data from multiple sources, leveraging similarities between tasks
to improve overall model performance. However, the application of multi-task
learning to real-world settings is hindered by data-sharing constraints,
especially in healthcare settings. To address this challenge, we propose a
flexible multi-task learning framework utilizing summary statistics from
various sources. Additionally, we present an adaptive parameter selection
approach based on a variant of Lepski's method, allowing for data-driven tuning
parameter selection when only summary statistics are available. Our systematic
non-asymptotic analysis characterizes the performance of the proposed methods
under various regimes of the sample complexity and overlap. We demonstrate our
theoretical findings and the performance of the method through extensive
simulations. This work offers a more flexible tool for training related models
across various domains, with practical implications in genetic risk prediction
and many other fields.
Comment: NeurIPS 2023, final version
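One way to make "learning from summary statistics" concrete, as a hedged sketch rather than the paper's actual estimator: for linear-model tasks, the Gram matrix X'X and moment vector X'y are sufficient statistics, so related tasks can be pooled without exchanging raw data. The toy below gives all tasks identical coefficients; the paper's framework handles partial task similarity and overlap, which this sketch does not.

```python
import numpy as np

rng = np.random.default_rng(1)

def task_summaries(X, y):
    # Only these summary statistics leave each source.
    return X.T @ X, X.T @ y

def pooled_estimate(summaries, ridge=1e-6):
    """Combine per-task summaries into one estimate (summed normal
    equations with a tiny ridge for numerical stability)."""
    G = sum(g for g, _ in summaries)
    v = sum(m for _, m in summaries)
    return np.linalg.solve(G + ridge * np.eye(G.shape[0]), v)

beta_true = np.array([1.0, -2.0, 0.5])
summaries = []
for _ in range(3):  # three related tasks, identical coefficients here
    X = rng.normal(size=(200, 3))
    y = X @ beta_true + 0.1 * rng.normal(size=200)
    summaries.append(task_summaries(X, y))
beta_hat = pooled_estimate(summaries)
```

The adaptive tuning question the abstract raises (a Lepski-type method) arises precisely because, with only such summaries available, cross-validation on raw data is impossible.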
Learning a high-dimensional classification rule using auxiliary outcomes
Correlated outcomes are common in many practical problems. Based on a
decomposition of estimation bias into two types, within-subspace and
against-subspace, we develop a robust approach to estimating the classification
rule for the outcome of interest with the presence of auxiliary outcomes in
high-dimensional settings. The proposed method includes a pooled estimation
step using all outcomes to gain efficiency, and a subsequent calibration step
using only the outcome of interest to correct both types of biases. We show
that when the pooled estimator has a low estimation error and a sparse
against-subspace bias, the calibrated estimator can achieve a lower
estimation error than when using only the single outcome of interest. An
inference procedure for the calibrated estimator is also provided.
Simulations and a real data analysis demonstrate the superiority of the
proposed method.
Comment: 19 pages, 3 figures
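The two-step structure, pool then calibrate, can be sketched with a linear surrogate for the classification setting. Caveat: in the paper the calibration step is regularized to exploit a sparse against-subspace bias in high dimensions; the dense, unregularized toy below only illustrates the pipeline's shape, and with full-rank unpenalized calibration the two steps collapse to the single-outcome fit.

```python
import numpy as np

rng = np.random.default_rng(2)

def pooled_then_calibrated(X, Y, primary=0):
    # Step 1: pool all outcomes for a low-variance initial estimate
    # (here, the average of the per-outcome least-squares fits).
    betas = np.linalg.lstsq(X, Y, rcond=None)[0]        # shape (p, K)
    beta_pool = betas.mean(axis=1)
    # Step 2: calibrate using only the primary outcome's residuals,
    # correcting the bias the pooling step introduced.
    resid = Y[:, primary] - X @ beta_pool
    delta = np.linalg.lstsq(X, resid, rcond=None)[0]
    return beta_pool + delta

p = 5
beta1 = np.array([1.0, 0.5, 0.0, 0.0, -1.0])            # outcome of interest
beta2 = beta1 + np.array([0.2, 0.0, 0.0, 0.0, 0.0])     # auxiliary outcome
X = rng.normal(size=(400, p))
Y = np.column_stack([X @ beta1, X @ beta2]) + 0.1 * rng.normal(size=(400, 2))
beta_hat = pooled_then_calibrated(X, Y)
```

The auxiliary outcome's coefficient differs from the primary one in a single coordinate, a sparse against-subspace bias of the kind the abstract says the calibration step can correct.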
Heterogeneity-aware Clustered Distributed Learning for Multi-source Data Analysis
In diverse fields ranging from finance to omics, it is increasingly common
for data to be distributed across multiple individual sources (referred to
as "clients" in some studies). Integrating raw data, although powerful, is
often infeasible, for example because of privacy protection considerations.
Distributed learning techniques have been developed to integrate summary
statistics as opposed to raw data. In many of the existing distributed learning
studies, it is assumed that all the clients share the same model. To
accommodate data heterogeneity, some federated learning methods allow for
client-specific models. In this article, we consider the scenario that clients
form clusters, those in the same cluster have the same model, and different
clusters have different models. Further considering the clustering structure
can lead to a better understanding of the "interconnections" among clients
and reduce the number of parameters. To this end, we develop a novel
penalization approach. Specifically, group penalization is imposed for
regularized estimation and selection of important variables, and fusion
penalization is imposed to automatically cluster clients. An effective ADMM
algorithm is developed, and the estimation, selection, and clustering
consistency properties are established under mild conditions. Simulation and
data analysis further demonstrate the practical utility and superiority of the
proposed approach.
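A crude stand-in for the fusion-penalized clustering, under stated simplifications: fit each client separately, then merge clients whose coefficient estimates fall within a threshold. The fusion penalty in the paper drives exactly this kind of merging, but jointly with estimation, with group penalties for variable selection, and solved by ADMM rather than this post-hoc greedy pass.

```python
import numpy as np

rng = np.random.default_rng(3)

def client_estimates(datasets):
    # Per-client least-squares fits; no raw data is combined.
    return [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in datasets]

def threshold_cluster(estimates, tau):
    """Greedy merge: a client joins the first earlier client whose
    estimate lies within distance tau, else starts a new cluster."""
    labels, next_label = [-1] * len(estimates), 0
    for i, b in enumerate(estimates):
        for j in range(i):
            if np.linalg.norm(b - estimates[j]) < tau:
                labels[i] = labels[j]
                break
        if labels[i] == -1:
            labels[i], next_label = next_label, next_label + 1
    return labels

# Four clients in two latent clusters with distinct true models.
beta_a, beta_b = np.array([1.0, 2.0]), np.array([-1.0, 0.5])
datasets = []
for beta in [beta_a, beta_a, beta_b, beta_b]:
    X = rng.normal(size=(150, 2))
    datasets.append((X, X @ beta + 0.1 * rng.normal(size=150)))
labels = threshold_cluster(client_estimates(datasets), tau=0.5)
```

Recovering the cluster structure, as here, is what lets the method share strength within clusters while keeping genuinely different clients in different models.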