6 research outputs found
ODACH: A One-shot Distributed Algorithm for Cox model with heterogeneous multi-center data
We developed a One-shot Distributed Algorithm for the Cox proportional-hazards model to analyze Heterogeneous multi-center time-to-event data (ODACH), circumventing the need to share patient-level information across sites. The algorithm uses a surrogate likelihood function to approximate the site-stratified Cox log-partial likelihood, constructed from patient-level data at a lead site and aggregated information from the other sites, allowing both the baseline hazard functions and the covariate distributions to vary across sites. Simulation studies and an application to a real-world opioid use disorder study showed that ODACH yields estimates close to the pooled estimator, which analyzes patient-level data from all sites directly via a stratified Cox model. Compared to the meta-analysis estimator, the inverse-variance-weighted average of the site-specific estimates, the ODACH estimator is less susceptible to bias, especially when the event is rare. ODACH is thus a valuable privacy-preserving and communication-efficient method for analyzing multi-center time-to-event data.
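The meta-analysis baseline that ODACH is compared against, the inverse-variance-weighted average of site-specific estimates, can be sketched in a few lines. The numbers below are hypothetical site-level log-hazard-ratio estimates, not from the study; this illustrates only the comparator, not the ODACH surrogate likelihood itself.

```python
import numpy as np

def inverse_variance_meta(estimates, variances):
    """Inverse-variance-weighted average of site-specific estimates.

    Each site contributes only a point estimate and its variance, so no
    patient-level data changes hands; the pooled variance is the
    reciprocal of the summed weights.
    """
    w = 1.0 / np.asarray(variances, dtype=float)
    est = np.asarray(estimates, dtype=float)
    pooled = np.sum(w * est) / np.sum(w)
    pooled_var = 1.0 / np.sum(w)
    return pooled, pooled_var

# Hypothetical log-hazard-ratio estimates from three sites.
beta_hat, var_hat = inverse_variance_meta([0.5, 0.7, 0.6], [0.04, 0.09, 0.01])
```

With rare events, these site-level estimates can be individually biased, which is the regime where the abstract reports meta-analysis underperforming ODACH.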
Distributed estimation of spiked eigenvalues in spiked population models
The proliferation of science and technology has led to the prevalence of
voluminous data sets that are distributed across multiple machines.
Conventional statistical methodologies are often infeasible for such
massive data sets due to prohibitive computing time, memory constraints,
communication overhead, and confidentiality considerations. In this paper,
we propose distributed estimators of the spiked eigenvalues in spiked
population models. We derive the consistency and asymptotic normality of
the distributed estimators and provide a statistical error analysis. The
proposed distributed estimation attains the same order of convergence as
estimation from the full sample. Simulation studies and a real data
analysis indicate that the proposed distributed estimation and testing
procedures have excellent properties in terms of estimation accuracy and
stability as well as transmission efficiency.
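A minimal divide-and-conquer sketch of the setting, not necessarily the authors' exact estimator: each machine transmits only its local sample covariance (a p x p summary, never the raw rows), the summaries are averaged, and the top eigenvalues of the average estimate the spikes. Per-block centering is omitted for simplicity since the simulated data are mean zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_covariance(X):
    # Each machine computes and transmits only this p x p summary.
    return X.T @ X / X.shape[0], X.shape[0]

def distributed_top_eigenvalues(blocks, k):
    summaries = [local_covariance(X) for X in blocks]
    n_total = sum(n for _, n in summaries)
    # The sample-size-weighted average of local covariances equals the
    # pooled (uncentered) sample covariance of the full data.
    S = sum(n * C for C, n in summaries) / n_total
    return np.linalg.eigvalsh(S)[::-1][:k]  # largest k eigenvalues

# Spiked population model: two large eigenvalues above a unit bulk.
p = 20
sigma = np.eye(p)
sigma[0, 0], sigma[1, 1] = 8.0, 4.0
blocks = [rng.multivariate_normal(np.zeros(p), sigma, size=500)
          for _ in range(4)]
top2 = distributed_top_eigenvalues(blocks, 2)
```

Averaging local covariances recovers the full-sample estimator exactly here, which is why this naive scheme matches the full-sample order of convergence; the paper's contribution concerns doing this with proper inference and tighter communication budgets.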
Multi-Task Learning with Summary Statistics
Multi-task learning has emerged as a powerful machine learning paradigm for
integrating data from multiple sources, leveraging similarities between tasks
to improve overall model performance. However, the application of multi-task
learning to real-world settings is hindered by data-sharing constraints,
especially in healthcare settings. To address this challenge, we propose a
flexible multi-task learning framework utilizing summary statistics from
various sources. Additionally, we present an adaptive parameter selection
approach based on a variant of Lepski's method, allowing for data-driven tuning
parameter selection when only summary statistics are available. Our systematic
non-asymptotic analysis characterizes the performance of the proposed methods
under various regimes of the sample complexity and overlap. We demonstrate our
theoretical findings and the performance of the method through extensive
simulations. This work offers a more flexible tool for training related models
across various domains, with practical implications in genetic risk prediction
and many other fields.
Comment: NeurIPS 2023, final version
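One way to make "learning from summary statistics" concrete, as a hedged sketch rather than the paper's actual estimator: for linear-model tasks, the Gram matrix X'X and moment vector X'y are sufficient statistics, so related tasks can be pooled without exchanging raw data. The toy below gives all tasks identical coefficients; the paper's framework handles partial task similarity and overlap, which this sketch does not.

```python
import numpy as np

rng = np.random.default_rng(1)

def task_summaries(X, y):
    # Only these summary statistics leave each source.
    return X.T @ X, X.T @ y

def pooled_estimate(summaries, ridge=1e-6):
    """Combine per-task summaries into one estimate (summed normal
    equations with a tiny ridge for numerical stability)."""
    G = sum(g for g, _ in summaries)
    v = sum(m for _, m in summaries)
    return np.linalg.solve(G + ridge * np.eye(G.shape[0]), v)

beta_true = np.array([1.0, -2.0, 0.5])
summaries = []
for _ in range(3):  # three related tasks, identical coefficients here
    X = rng.normal(size=(200, 3))
    y = X @ beta_true + 0.1 * rng.normal(size=200)
    summaries.append(task_summaries(X, y))
beta_hat = pooled_estimate(summaries)
```

The adaptive tuning question the abstract raises (a Lepski-type method) arises precisely because, with only such summaries available, cross-validation on raw data is impossible.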
Learning a high-dimensional classification rule using auxiliary outcomes
Correlated outcomes are common in many practical problems. Based on a
decomposition of estimation bias into two types, within-subspace and
against-subspace, we develop a robust approach to estimating the classification
rule for the outcome of interest with the presence of auxiliary outcomes in
high-dimensional settings. The proposed method includes a pooled estimation
step using all outcomes to gain efficiency, and a subsequent calibration step
using only the outcome of interest to correct both types of biases. We show
that when the pooled estimator has a low estimation error and a sparse
against-subspace bias, the calibrated estimator can achieve a lower
estimation error than when using only the single outcome of interest. An
inference procedure for the calibrated estimator is also provided.
Simulations and a real data analysis demonstrate the superiority of the
proposed method.
Comment: 19 pages, 3 figures
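The two-step structure, pool then calibrate, can be sketched with a linear surrogate for the classification setting. Caveat: in the paper the calibration step is regularized to exploit a sparse against-subspace bias in high dimensions; the dense, unregularized toy below only illustrates the pipeline's shape, and with full-rank unpenalized calibration the two steps collapse to the single-outcome fit.

```python
import numpy as np

rng = np.random.default_rng(2)

def pooled_then_calibrated(X, Y, primary=0):
    # Step 1: pool all outcomes for a low-variance initial estimate
    # (here, the average of the per-outcome least-squares fits).
    betas = np.linalg.lstsq(X, Y, rcond=None)[0]        # shape (p, K)
    beta_pool = betas.mean(axis=1)
    # Step 2: calibrate using only the primary outcome's residuals,
    # correcting the bias the pooling step introduced.
    resid = Y[:, primary] - X @ beta_pool
    delta = np.linalg.lstsq(X, resid, rcond=None)[0]
    return beta_pool + delta

p = 5
beta1 = np.array([1.0, 0.5, 0.0, 0.0, -1.0])            # outcome of interest
beta2 = beta1 + np.array([0.2, 0.0, 0.0, 0.0, 0.0])     # auxiliary outcome
X = rng.normal(size=(400, p))
Y = np.column_stack([X @ beta1, X @ beta2]) + 0.1 * rng.normal(size=(400, 2))
beta_hat = pooled_then_calibrated(X, Y)
```

The auxiliary outcome's coefficient differs from the primary one in a single coordinate, a sparse against-subspace bias of the kind the abstract says the calibration step can correct.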
Heterogeneity-aware Clustered Distributed Learning for Multi-source Data Analysis
In diverse fields ranging from finance to omics, it is increasingly common
for data to be distributed across multiple individual sources (referred to
as "clients" in some studies). Integrating raw data, although powerful, is
often infeasible, for example because of privacy protection considerations.
Distributed learning techniques have been developed to integrate summary
statistics as opposed to raw data. In many of the existing distributed learning
studies, it is assumed that all the clients share the same model. To
accommodate data heterogeneity, some federated learning methods allow for
client-specific models. In this article, we consider the scenario that clients
form clusters, those in the same cluster have the same model, and different
clusters have different models. Further considering the clustering structure
can lead to a better understanding of the "interconnections" among clients
and reduce the number of parameters. To this end, we develop a novel
penalization approach. Specifically, group penalization is imposed for
regularized estimation and selection of important variables, and fusion
penalization is imposed to automatically cluster clients. An effective ADMM
algorithm is developed, and the estimation, selection, and clustering
consistency properties are established under mild conditions. Simulation and
data analysis further demonstrate the practical utility and superiority of the
proposed approach.
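A crude stand-in for the fusion-penalized clustering, under stated simplifications: fit each client separately, then merge clients whose coefficient estimates fall within a threshold. The fusion penalty in the paper drives exactly this kind of merging, but jointly with estimation, with group penalties for variable selection, and solved by ADMM rather than this post-hoc greedy pass.

```python
import numpy as np

rng = np.random.default_rng(3)

def client_estimates(datasets):
    # Per-client least-squares fits; no raw data is combined.
    return [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in datasets]

def threshold_cluster(estimates, tau):
    """Greedy merge: a client joins the first earlier client whose
    estimate lies within distance tau, else starts a new cluster."""
    labels, next_label = [-1] * len(estimates), 0
    for i, b in enumerate(estimates):
        for j in range(i):
            if np.linalg.norm(b - estimates[j]) < tau:
                labels[i] = labels[j]
                break
        if labels[i] == -1:
            labels[i], next_label = next_label, next_label + 1
    return labels

# Four clients in two latent clusters with distinct true models.
beta_a, beta_b = np.array([1.0, 2.0]), np.array([-1.0, 0.5])
datasets = []
for beta in [beta_a, beta_a, beta_b, beta_b]:
    X = rng.normal(size=(150, 2))
    datasets.append((X, X @ beta + 0.1 * rng.normal(size=150)))
labels = threshold_cluster(client_estimates(datasets), tau=0.5)
```

Recovering the cluster structure, as here, is what lets the method share strength within clusters while keeping genuinely different clients in different models.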