
23 research outputs found

Robust and Efficient Semi-supervised Learning for Ising Model

Authors: Liu Molei; Wu Daiqing
Publication date: 25/11/2023

In biomedical studies, it is often desirable to characterize the interactive mode of multiple disease outcomes beyond their marginal risk. The Ising model is one of the most popular choices for this purpose. Nevertheless, the learning efficiency of Ising models can be impeded by the scarcity of accurate disease labels, which is a prominent problem in contemporary studies driven by electronic health records (EHR). Semi-supervised learning (SSL) leverages the large unlabeled sample with auxiliary EHR features to assist learning from the labeled data alone and is a potential solution to this issue. In this paper, we develop a novel SSL method for efficient inference of the Ising model. Our method first models the outcomes against the auxiliary features, then uses this model to project the score function of the supervised estimator onto the EHR features, and incorporates the unlabeled sample to augment the supervised estimator for variance reduction without introducing bias. For the key step of conditional modeling, we propose strategies that can effectively leverage the auxiliary EHR information while maintaining moderate model complexity. In addition, we introduce approaches, including intrinsic efficient updates and ensembling, to overcome the potential misspecification of the conditional model that may cause efficiency loss. Our method is justified by asymptotic theory and shown to outperform existing SSL methods through simulation studies. We also illustrate its utility in a real example about several key phenotypes related to frequent ICU admission on the MIMIC-III data set.

arXiv.org e-Print Archive

Efficient Modeling of Surrogates to Improve Multi-source High-dimensional Biobank Studies

Authors: Cai Tianxi; Guo Zijian; Liu Molei; Liu Yue
Publication date: 01/09/2023

Surrogate variables in electronic health records (EHR) and biobank data play an important role in biomedical studies due to the scarcity or absence of chart-reviewed gold standard labels. We develop a novel approach named SASH for Surrogate-Assisted and data-Shielding High-dimensional integrative regression. It is a semi-supervised approach that efficiently leverages sizable unlabeled samples with error-prone EHR surrogate outcomes from multiple local sites, to improve the learning accuracy of the small gold-labeled data. To facilitate stable and efficient knowledge extraction from the surrogates, our method first obtains a preliminary supervised estimator, and then uses it to assist training a regularized single index model (SIM) for the surrogates. Interestingly, through a chain of convex and properly penalized sparse regressions that approximate the SIM loss with bias-correction, our method avoids the local minima issue of the SIM training, and fully eliminates the impact of the preliminary estimator's large error. In addition, it protects individual-level information through summary-statistics-based data aggregation across the local sites, leveraging a similar idea of bias-corrected approximation for SIM. Through simulation studies, we demonstrate that our method outperforms existing approaches on finite samples. Finally, we apply our method to develop a high dimensional genetic risk model for type II diabetes using large-scale data sets from UK and Mass General Brigham biobanks, where only a small fraction of subjects in one site has been labeled via chart reviewing.

arXiv.org e-Print Archive

Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping

Authors: Cai Tianxi; Liu Molei; Neykov Matey; Zhang Yichi
Publication venue: DigitalCommons@URI
Publication date: 01/01/2022

Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite its potential, EHR is currently underutilized for discovery research due to its major limitation in the lack of precise phenotype information. To overcome such difficulties, recent efforts have been devoted to developing supervised algorithms to accurately predict phenotypes based on relatively small training datasets with gold-standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features, p, is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small, labeled dataset (where both the label Y and the feature set X are observed) and a much larger, weakly-labeled dataset in which the feature set X is accompanied only by a surrogate label S that is available to all patients. Under a working prior assumption that S is related to X only through Y and allowing it to hold approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital.

DigitalCommons@URI

Doubly Robust Augmented Model Accuracy Transfer Inference with High Dimensional Features

Authors: Cai Tianxi; Li Mengyan; Liu Molei; Zhou Doudou
Publication date: 08/11/2022

Due to label scarcity and covariate shift happening frequently in real-world studies, transfer learning has become an essential technique to train models generalizable to some target populations using existing labeled source data. Most existing transfer learning research has been focused on model estimation, while there is a paucity of literature on transfer inference for model accuracy despite its importance. We propose a novel Doubly Robust Augmented Model Accuracy Transfer InferenCe (DRAMATIC) method for point and interval estimation of commonly used classification performance measures in an unlabeled target population using labeled source data. Specifically, DRAMATIC derives and evaluates the risk model for a binary response $Y$ against some low dimensional predictors $\mathbf{A}$ on the target population, leveraging $Y$ from source data only and high dimensional adjustment features $\mathbf{X}$ from both the source and target data. The proposed estimators are doubly robust in the sense that they are $n^{1/2}$-consistent when at least one model is correctly specified and certain model sparsity assumptions hold. Simulation results demonstrate that the point estimates have negligible bias and the confidence intervals derived by DRAMATIC attain satisfactory empirical coverage levels. We further illustrate the utility of our method to transfer the genetic risk prediction model and its accuracy evaluation for type II diabetes across two patient cohorts in Mass General Brigham (MGB) collected using different sampling mechanisms and at different time points.

arXiv.org e-Print Archive


Diversity and scale: Genetic architecture of 2068 traits in the VA Million Veteran Program

Authors: Assimes Themistocles L; Begoli Edmon; Bick Alexander G; Brunette Charles A; Cai Tianxi; Carroll Robert J; Casas Juan P; Cho Kelly; Clifford Royce; Cohen Jeremy; Conery Mitchell; Costa Lauren; Damrauer Scott; Davies Laura; Deak Joseph D; Devineni Poornima; Dochtermann Daniel R; Duvall Scott; Garcon Helene; Gaziano J Michael; Gelernter Joel; Goethert Ian; Grant Struan FA; Guare Lindsay; Heise David A; Ho Yuk-Lam; Honerlaw Jacqueline; Huffman Jennifer E; Hung Adriana; Iyengar Sudha K; Joseph Jacob; Justice Amy; Kember Rachel; Kim Youngdae; Kranzler Henry; Kripke Colleen M; Levey Daniel; Liao Katherine P; Linares Franciel; Liu Molei; Luoh Shiuh-Wen; Madduri Ravi K; Merritt Victoria C; Moser Jennifer; Muralidhar Sumitra; Murray Michael; Nandi Tarak Nath; O'Donnell Christopher J; Overstreet Cassie; Panickan Vidul Ayakulangara; Polimanti Renato; Posner Daniel C; Pyarajan Saiju; Ramoni Rachel; Rodriguez Alex; Roussos Panos; Sangar Rahul; Shakt Gabrielle; Shi Yunling; Sun Yan V; Tipton Ryan; Tourassi Georgia; Tsao Noah; Tsao Philip; Venkatesh Sanan; Verma Anurag; Voight Benjamin F; Voloudakis Georgios; Wang Xuan; Whitbourne Stacey; Zhou Wei
Publication venue: eScholarship, University of California
Publication date: 19/07/2024

One of the justifiable criticisms of human genetic studies is the underrepresentation of participants from diverse populations. Lack of inclusion must be addressed at-scale to identify causal disease factors and understand the genetic causes of health disparities. We present genome-wide associations for 2068 traits from 635,969 participants in the Department of Veterans Affairs Million Veteran Program, a longitudinal study of diverse United States Veterans. Systematic analysis revealed 13,672 genomic risk loci; 1608 were only significant after including non-European populations. Fine-mapping identified causal variants at 6318 signals across 613 traits. One-third (n = 2069) were identified in participants from non-European populations. This reveals a broadly similar genetic architecture across populations, highlights genetic insights gained from underrepresented groups, and presents an extensive atlas of genetic associations.

eScholarship - University of California

Assessing Heterogeneous Risk of Type II Diabetes Associated with Statin Usage: Evidence from Electronic Health Record Data

Authors: Cai Tianxi; Guo Xinzhou; Liu Molei; Wang Jingshen; Wei Waverly; Wu Chong
Publication date: 13/05/2022

There have been increased concerns that the use of statins, one of the most commonly prescribed drugs for treating coronary artery disease, is potentially associated with an increased risk of new-onset type II diabetes (T2D). However, because existing clinical studies with limited sample sizes often suffer from selection bias, there is no robust evidence as to whether, and which, populations are indeed vulnerable to developing T2D after taking statins. In this case study, building on the biobank and electronic health record data in the Partner Health System, we introduce a new data analysis pipeline from a biological perspective and a novel statistical methodology that address the limitations in existing studies to: (i) systematically examine heterogeneous treatment effects of statin use on T2D risk, (ii) uncover which patient subgroup is most vulnerable to T2D after taking statins, and (iii) assess the replicability and statistical significance of the most vulnerable subgroup via bootstrap calibration. Our proposed bootstrap calibration approach delivers asymptotically sharp confidence intervals and debiased estimates for the treatment effect of the most vulnerable subgroup in the presence of possibly high-dimensional covariates. By implementing our proposed approach, we find that females with high T2D genetic risk at baseline are indeed at high risk of developing T2D due to statin use, which provides evidence to support future clinical decisions with respect to statin use.

Comment: 31 pages, 2 figures, 6 tables

arXiv.org e-Print Archive