Scalable Randomized Kernel Methods for Multiview Data Integration and Prediction
We develop scalable randomized kernel methods for jointly associating data
from multiple sources and simultaneously predicting an outcome or classifying a
unit into one of two or more classes. The proposed methods model nonlinear
relationships in multiview data together with predicting a clinical outcome and
are capable of identifying variables or groups of variables that best
contribute to the relationships among the views. We use the idea that random
Fourier bases can approximate shift-invariant kernel functions to construct
nonlinear mappings of each view and we use these mappings and the outcome
variable to learn view-independent low-dimensional representations. Through
simulation studies, we show that the proposed methods outperform several other
linear and nonlinear methods for multiview data integration. When the proposed
methods were applied to gene expression, metabolomics, proteomics, and
lipidomics data pertaining to COVID-19, we identified several molecular
signatures for COVID-19 status and severity. Results from our real data
application and simulations with small sample sizes suggest that the proposed
methods may be useful for small sample size problems. Availability: Our
algorithms are implemented in PyTorch and interfaced in R and will be made
available at: https://github.com/lasandrall/RandMVLearn.
Comment: 24 pages, 5 figures, 4 tables
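The random Fourier idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a Gaussian (RBF) kernel as the shift-invariant kernel and uses only the Python standard library; the names `make_rff_map` and `gaussian_kernel`, the bandwidth `gamma`, and the toy vectors are hypothetical and are not taken from RandMVLearn. It shows only the kernel approximation step, not the authors' multiview model.

```python
import math
import random


def make_rff_map(dim, n_features, gamma, seed=0):
    """Return z(x) such that z(x) . z(y) approximates exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    # Frequencies are drawn from the Gaussian kernel's spectral density, N(0, 2*gamma*I).
    std = math.sqrt(2.0 * gamma)
    W = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_features)]
    # Random phases, uniform on [0, 2*pi].
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def z(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + bj)
                for w, bj in zip(W, b)]

    return z


def gaussian_kernel(x, y, gamma):
    """Exact Gaussian kernel value, used here only as a reference."""
    sq_dist = sum((a - c) ** 2 for a, c in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical toy inputs: two 3-dimensional observations from one view.
z = make_rff_map(dim=3, n_features=5000, gamma=0.5, seed=0)
x, y = [0.2, -0.1, 0.4], [0.0, 0.3, 0.1]
approx = sum(p * q for p, q in zip(z(x), z(y)))
exact = gaussian_kernel(x, y, gamma=0.5)
print(f"approx={approx:.3f} exact={exact:.3f}")
```

The approximation error shrinks roughly as one over the square root of `n_features`, so working with the explicit feature map avoids forming a full n-by-n kernel matrix, which is presumably what makes such methods scale.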
mvlearnR and Shiny App for multiview learning
The package mvlearnR and accompanying Shiny App are intended for integrating
data from multiple sources or views or modalities (e.g. genomics, proteomics,
clinical and demographic data). Most existing software packages for multiview
learning are decentralized and offer limited capabilities, making it difficult
for users to perform comprehensive integrative analysis. The new package wraps
statistical and machine learning methods and graphical tools, providing a
convenient and easy data integration workflow. For users with limited
programming experience, we provide a Shiny Application to facilitate data
integration anywhere and on any device. The methods have the potential to offer
deeper insights into complex disease mechanisms.
Availability and Implementation: mvlearnR is available from the following
GitHub repository: https://github.com/lasandrall/mvlearnR. The web application
is hosted on shinyapps.io and available at:
https://multi-viewlearn.shinyapps.io/MultiView_Modeling
Interpretable Deep Learning Methods for Multiview Learning
Technological advances have enabled the generation of unique and
complementary types of data or views (e.g. genomics, proteomics, metabolomics)
and opened up a new era in multiview learning research with the potential to
lead to new biomedical discoveries. We propose iDeepViewLearn (Interpretable
Deep Learning Method for Multiview Learning) for learning nonlinear
relationships in data from multiple views while achieving feature selection.
iDeepViewLearn combines deep learning flexibility with the statistical benefits
of data and knowledge-driven feature selection, giving interpretable results.
Deep neural networks are used to learn view-independent low-dimensional
embeddings through an optimization problem that minimizes the difference between
observed and reconstructed data, while imposing a regularization penalty on the
reconstructed data. The normalized Laplacian of a graph is used to model
bilateral relationships between variables in each view, thereby encouraging
selection of related variables. iDeepViewLearn is tested on simulated data and
two real-world datasets, including breast cancer-related gene expression and
methylation data. iDeepViewLearn had competitive classification results and
identified genes and CpG sites that differentiated between individuals who died
from breast cancer and those who did not. The results of our real data
application and simulations with small to moderate sample sizes suggest that
iDeepViewLearn may be a useful method for small-sample-size problems compared
to other deep learning methods for multiview learning.
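As a rough illustration of the graph component described in the abstract above, the sketch below builds the normalized Laplacian L = I - D^{-1/2} A D^{-1/2} of a small variable graph and evaluates the quadratic form w^T L w. Using that quadratic form as a smoothness penalty is an assumption made here for illustration; the abstract does not state how iDeepViewLearn applies the Laplacian. The function names and the 3-variable path graph are hypothetical, and every node is assumed to have at least one edge.

```python
import math


def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix.

    Assumes every node has degree > 0 (no isolated variables).
    """
    n = len(adj)
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in adj]
    return [[(1.0 if i == j else 0.0) - inv_sqrt_deg[i] * adj[i][j] * inv_sqrt_deg[j]
             for j in range(n)]
            for i in range(n)]


def laplacian_penalty(weights, lap):
    """Quadratic form w^T L w.

    Equals the sum over edges (i, j) of (w_i/sqrt(d_i) - w_j/sqrt(d_j))^2,
    so it is small when connected variables carry similar degree-scaled weights.
    """
    n = len(weights)
    return sum(weights[i] * lap[i][j] * weights[j]
               for i in range(n) for j in range(n))


# Hypothetical graph on three variables with edges 0-1 and 1-2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
L = normalized_laplacian(adj)

smooth = laplacian_penalty([1.0, math.sqrt(2.0), 1.0], L)  # varies smoothly over the graph
rough = laplacian_penalty([1.0, -math.sqrt(2.0), 1.0], L)  # flips sign across each edge
print(smooth, rough)
```

A penalty of this kind is larger for weight vectors that change abruptly across edges, which is one common way a graph can encourage connected variables to be selected together.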
Derivation of a Protein Risk Score for Cardiovascular Disease Among a Multiracial and Multiethnic HIV+ Cohort
Background: Cardiovascular disease (CVD) risk prediction models underestimate CVD risk in people living with HIV (PLWH). Our goal is to derive a risk score based on protein biomarkers that could be used to predict CVD in PLWH. Methods and Results: In a matched case-control study, we analyzed normalized protein expression data for participants enrolled in 1 of 4 trials conducted by INSIGHT (International Network for Strategic Initiatives in Global HIV Trials). We used dimension reduction, variable selection and resampling methods, and multivariable conditional logistic regression models to determine candidate protein biomarkers and to generate a protein score for predicting CVD in PLWH. We internally validated our findings using bootstrap. A protein score that was derived from 8 proteins (including HGF [hepatocyte growth factor] and interleukin-6) was found to be associated with an increased risk of CVD after adjustment for CVD and HIV factors (odds ratio: 2.17 [95% CI: 1.58-2.99]). The protein score improved CVD prediction when compared with predicting CVD risk using the individual proteins that comprised the protein score. Individuals with a protein score above the median score were 3.10 (95% CI, 1.83-5.41) times more likely to develop CVD than those with a protein score below the median score. Conclusions: A panel of blood biomarkers may help identify PLWH at a high risk for developing CVD. If validated, such a score could be used in conjunction with established factors to identify CVD at-risk individuals who might benefit from aggressive risk reduction, ultimately shedding light on CVD pathogenesis in PLWH.
AI is a viable alternative to high throughput screening: a 318-target study
High throughput screening (HTS) is routinely used to identify bioactive small molecules. This requires physical compounds, which limits coverage of accessible chemical space. Computational approaches combined with vast on-demand chemical libraries can access far greater chemical space, provided that the predictive accuracy is sufficient to identify useful molecules. Through the largest and most diverse virtual HTS campaign reported to date, comprising 318 individual projects, we demonstrate that our AtomNet® convolutional neural network successfully finds novel hits across every major therapeutic area and protein class. We address historical limitations of computational screening by demonstrating success for target proteins without known binders, high-quality X-ray crystal structures, or manual cherry-picking of compounds. We show that the molecules selected by the AtomNet® model are novel drug-like scaffolds rather than minor modifications to known bioactive compounds. Our empirical results suggest that computational methods can substantially replace HTS as the first step of small-molecule drug discovery.
Global burden of 288 causes of death and life expectancy decomposition in 204 countries and territories and 811 subnational locations, 1990–2021: a systematic analysis for the Global Burden of Disease Study 2021
BACKGROUND Regular, detailed reporting on population health by underlying cause of death is fundamental for public health decision making. Cause-specific estimates of mortality and the subsequent effects on life expectancy worldwide are valuable metrics to gauge progress in reducing mortality rates. These estimates are particularly important following large-scale mortality spikes, such as the COVID-19 pandemic. When systematically analysed, mortality rates and life expectancy allow comparisons of the consequences of causes of death globally and over time, providing a nuanced understanding of the effect of these causes on global populations. METHODS The Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) 2021 cause-of-death analysis estimated mortality and years of life lost (YLLs) from 288 causes of death by age-sex-location-year in 204 countries and territories and 811 subnational locations for each year from 1990 until 2021. The analysis used 56 604 data sources, including data from vital registration and verbal autopsy as well as surveys, censuses, surveillance systems, and cancer registries, among others. As with previous GBD rounds, cause-specific death rates for most causes were estimated using the Cause of Death Ensemble model (a modelling tool developed for GBD to assess the out-of-sample predictive validity of different statistical models and covariate permutations and combine those results to produce cause-specific mortality estimates), with alternative strategies adapted to model causes with insufficient data, substantial changes in reporting over the study period, or unusual epidemiology. YLLs were computed as the product of the number of deaths for each cause-age-sex-location-year and the standard life expectancy at each age. As part of the modelling process, uncertainty intervals (UIs) were generated using the 2·5th and 97·5th percentiles from a 1000-draw distribution for each metric.
We decomposed life expectancy by cause of death, location, and year to show cause-specific effects on life expectancy from 1990 to 2021. We also used the coefficient of variation and the fraction of population affected by 90% of deaths to highlight concentrations of mortality. Findings are reported in counts and age-standardised rates. Methodological improvements for cause-of-death estimates in GBD 2021 include the expansion of the under-5-years age group to include four new age groups, enhanced methods to account for stochastic variation of sparse data, and the inclusion of COVID-19 and other pandemic-related mortality, which includes excess mortality associated with the pandemic, excluding COVID-19, lower respiratory infections, measles, malaria, and pertussis. For this analysis, 199 new country-years of vital registration cause-of-death data, 5 country-years of surveillance data, 21 country-years of verbal autopsy data, and 94 country-years of other data types were added to those used in previous GBD rounds. FINDINGS The leading causes of age-standardised deaths globally were the same in 2019 as they were in 1990; in descending order, these were ischaemic heart disease, stroke, chronic obstructive pulmonary disease, and lower respiratory infections. In 2021, however, COVID-19 replaced stroke as the second-leading age-standardised cause of death, with 94·0 deaths (95% UI 89·2-100·0) per 100 000 population. The COVID-19 pandemic shifted the rankings of the leading five causes, lowering stroke to the third-leading and chronic obstructive pulmonary disease to the fourth-leading position. In 2021, the highest age-standardised death rates from COVID-19 occurred in sub-Saharan Africa (271·0 deaths [250·1-290·7] per 100 000 population) and Latin America and the Caribbean (195·4 deaths [182·1-211·4] per 100 000 population).
The lowest age-standardised death rates from COVID-19 were in the high-income super-region (48·1 deaths [47·4-48·8] per 100 000 population) and southeast Asia, east Asia, and Oceania (23·2 deaths [16·3-37·2] per 100 000 population). Globally, life expectancy steadily improved between 1990 and 2019 for 18 of the 22 investigated causes. Decomposition of global and regional life expectancy showed the positive effect that reductions in deaths from enteric infections, lower respiratory infections, stroke, and neonatal deaths, among others, have had on survival over the study period. However, a net reduction of 1·6 years occurred in global life expectancy between 2019 and 2021, primarily due to increased death rates from COVID-19 and other pandemic-related mortality. Life expectancy was highly variable between super-regions over the study period, with southeast Asia, east Asia, and Oceania gaining 8·3 years (6·7-9·9) overall, while having the smallest reduction in life expectancy due to COVID-19 (0·4 years). The largest reduction in life expectancy due to COVID-19 occurred in Latin America and the Caribbean (3·6 years). Additionally, 53 of the 288 causes of death were highly concentrated in locations with less than 50% of the global population as of 2021, and these causes of death have become progressively more concentrated since 1990, when only 44 causes showed this pattern. The concentration phenomenon is discussed heuristically with respect to enteric and lower respiratory infections, malaria, HIV/AIDS, neonatal disorders, tuberculosis, and measles. INTERPRETATION Long-standing gains in life expectancy and reductions in many of the leading causes of death have been disrupted by the COVID-19 pandemic, the adverse effects of which were spread unevenly among populations. Despite the pandemic, there has been continued progress in combatting several notable causes of death, leading to improved global life expectancy over the study period.
Each of the seven GBD super-regions showed an overall improvement between 1990 and 2021, obscuring the negative effect in the years of the pandemic. Additionally, our findings regarding regional variation in causes of death driving increases in life expectancy hold clear policy utility. Analyses of shifting mortality trends reveal that several causes, once widespread globally, are now increasingly concentrated geographically. These changes in mortality concentration, alongside further investigation of changing risks, interventions, and relevant policy, present an important opportunity to deepen our understanding of mortality-reduction strategies. Examining patterns in mortality concentration might reveal areas where successful public health interventions have been implemented. Translating these successes to locations where certain causes of death remain entrenched can inform policies that work to improve life expectancy for people everywhere. FUNDING Bill & Melinda Gates Foundation.
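The two calculations stated in the METHODS text above (YLLs as deaths multiplied by the standard life expectancy at the age of death, and uncertainty intervals taken from the 2·5th and 97·5th percentiles of 1000 draws) can be sketched as follows. All numbers are hypothetical placeholders rather than GBD estimates, the draws are simulated from a normal distribution purely for illustration, and the helper names `yll` and `percentile` are not from the GBD codebase.

```python
import random


def yll(deaths, standard_life_expectancy):
    """Years of life lost: deaths times standard life expectancy at the age of death."""
    return deaths * standard_life_expectancy


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of an already sorted list."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower_idx = int(position)
    upper_idx = min(lower_idx + 1, len(sorted_values) - 1)
    frac = position - lower_idx
    return sorted_values[lower_idx] * (1.0 - frac) + sorted_values[upper_idx] * frac


rng = random.Random(42)

# Hypothetical: 1000 draws of the death count for one cause-age-sex-location-year.
death_draws = [rng.gauss(500.0, 25.0) for _ in range(1000)]
standard_le = 30.0  # hypothetical standard life expectancy (years) at that age

# Propagate each draw through the YLL formula, then summarise the draws.
yll_draws = sorted(yll(d, standard_le) for d in death_draws)
mean_yll = sum(yll_draws) / len(yll_draws)
ui_lower = percentile(yll_draws, 2.5)
ui_upper = percentile(yll_draws, 97.5)
print(f"YLLs: {mean_yll:.0f} (95% UI {ui_lower:.0f}-{ui_upper:.0f})")
```

Summarising per-draw results, rather than applying a formula to a single point estimate, lets uncertainty in the inputs carry through to every derived metric.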
Incorporating biological information in sparse principal component analysis with application to genomic data
Background: Sparse principal component analysis (PCA) is a popular tool for dimensionality reduction, pattern recognition, and visualization of high dimensional data. It has been recognized that complex biological mechanisms occur through concerted relationships of multiple genes working in networks that are often represented by graphs. Recent work has shown that incorporating such biological information improves feature selection and prediction performance in regression analysis, but there has been limited work on extending this approach to PCA. In this article, we propose two new sparse PCA methods called Fused and Grouped sparse PCA that enable incorporation of prior biological information in variable selection. Results: Our simulation studies suggest that, compared to existing sparse PCA methods, the proposed methods achieve higher sensitivity and specificity when the graph structure is correctly specified, and are fairly robust to misspecified graph structures. Application to a glioblastoma gene expression dataset identified pathways that are suggested in the literature to be related to glioblastoma. Conclusions: The proposed sparse PCA methods Fused and Grouped sparse PCA can effectively incorporate prior biological information in variable selection, leading to improved feature selection and more interpretable principal component loadings and potentially providing insights on molecular underpinnings of complex diseases.