Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
Methods for Using Biomarker Information in Randomized Clinical Trials
Advances in high-throughput biological technologies have led to large numbers of potentially predictive biomarkers becoming routinely measured in modern clinical trials. Biomarkers which influence treatment efficacy may be used to find subgroups of patients who are most likely to benefit from a new treatment. Consequently, there is a growing interest in better approaches to identify biomarker signatures and utilize the biomarker information in clinical trials.
The first focus of this thesis is on developing methods for detecting biomarker-treatment interactions in large-scale trials. Traditional interaction analysis, using regression models to test biomarker-treatment interactions one biomarker at a time, may suffer from poor power when there is a large multiple-testing burden. I adapt recently proposed two-stage interaction-detection procedures for application in randomized clinical trials. I propose two new stage 1 multivariate screening strategies using lasso and ridge regressions to account for correlations among biomarkers. For these new multivariate screening strategies, I prove the asymptotic between-stage independence required for family-wise error rate control. Simulation and real-data results are presented which demonstrate greater power of the new strategies compared with previously existing approaches.
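The two-stage idea above can be sketched as follows. This is a simplified illustration on synthetic data, not the thesis's exact procedure: stage 1 screens biomarker main effects with a lasso fit (scikit-learn, with an arbitrary penalty), and stage 2 tests biomarker-treatment interactions only for the screened biomarkers, with Bonferroni correction over the reduced set.

```python
import numpy as np
import scipy.stats as st
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 500, 200
X = rng.standard_normal((n, p))                  # biomarkers (synthetic)
t = rng.integers(0, 2, n)                        # randomized treatment arm
y = 1.5 * X[:, 0] * t + rng.standard_normal(n)   # one true interaction, biomarker 0

# Stage 1: multivariate screening of biomarker effects via lasso
stage1 = Lasso(alpha=0.05).fit(X, y)
screened = np.flatnonzero(stage1.coef_)

# Stage 2: test biomarker-treatment interactions only for the screened
# biomarkers, with Bonferroni correction over the (much smaller) screened set
alpha = 0.05 / max(len(screened), 1)
hits = []
for j in screened:
    Z = np.column_stack([np.ones(n), X[:, j], t, X[:, j] * t])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (n - Z.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(Z.T @ Z)[3, 3])
    pval = 2 * st.norm.sf(abs(beta[3] / se))     # Wald test of the interaction
    if pval < alpha:
        hits.append(j)
```

The power gain comes from dividing the significance level by the small number of screened biomarkers rather than by all p of them; the between-stage independence proved in the thesis is what makes such a procedure valid for family-wise error rate control.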
The second focus of this thesis is on developing methods for utilizing biomarker information during the course of a randomized clinical trial to improve the informativeness of results. Under the adaptive signature design (ASD) framework, I propose two new classifiers that more efficiently leverage biomarker signatures to select a subgroup of patients who are most likely to benefit from the new treatment. I provide analytical arguments and demonstrate through simulations that these two proposed classification criteria can provide at least as good, and sometimes significantly greater power than the originally proposed ASD classifier.
Third, I focus on an important issue in the statistical analysis of interactions for binary outcomes, which is pertinent to both topics above. Testing for biomarker-treatment interactions with logistic regression can suffer from an elevated number of type I errors due to the asymptotic bias of the interaction regression coefficient under model misspecification. I analyze this problem in the randomized clinical trial setting and propose two new de-biasing procedures, which can offer improved family-wise error rate control in various simulated scenarios.
Finally, I summarize the main contributions from the work above, discuss some practical limitations as well as their real-world value, and prioritize future directions of research building upon the work in this thesis.
Medical Research Council, grant ID: MR/R502303/
Optimization Algorithms for Computational Systems Biology
Computational systems biology aims to integrate biology and computational methods to gain a better understanding of biological phenomena. It often requires the assistance of global optimization to adequately tune its tools. This review presents three powerful methodologies for global optimization that fit the requirements of most computational systems biology applications, such as model tuning and biomarker identification. We include the multi-start approach for least-squares methods, mostly applied to fitting experimental data. We illustrate Markov chain Monte Carlo methods, stochastic techniques applied here to fitting experimental data when a model involves stochastic equations or simulations. Finally, we present genetic algorithms, heuristic nature-inspired methods applied in a broad range of optimization problems, including those in systems biology.
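As a concrete illustration of the first methodology, here is a minimal multi-start least-squares sketch using SciPy. The exponential-decay model and the synthetic data are invented for the example; the review's actual applications are not reproduced.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
t = np.linspace(0, 5, 40)
true = np.array([2.0, 1.3])                      # true amplitude and decay rate
data = true[0] * np.exp(-true[1] * t) + 0.02 * rng.standard_normal(t.size)

def residuals(theta):
    # residuals of a simple exponential-decay model against the data
    return theta[0] * np.exp(-theta[1] * t) - data

# Multi-start: run a local least-squares solver from several random initial
# guesses and keep the best fit, reducing the risk of a poor local optimum.
starts = rng.uniform(0.1, 5.0, size=(10, 2))
best = min((least_squares(residuals, s) for s in starts), key=lambda r: r.cost)
```

Keeping the lowest-cost local solution over many restarts is the simplest way to turn a local solver into a heuristic global one; more starts trade computation for a better chance of reaching the global optimum.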
Graph-Theoretical Tools for the Analysis of Complex Networks
We are currently experiencing an explosive growth in data collection technology that threatens to dwarf the commensurate gains in computational power predicted by Moore’s Law. At the same time, researchers across numerous domain sciences are finding success using network models to represent their data. Graph algorithms are then applied to study the topological structure and tease out latent relationships between variables. Unfortunately, the problems of interest, such as finding dense subgraphs, are often the most difficult to solve from a computational point of view. Together, these issues motivate the need for novel algorithmic techniques in the study of graphs derived from large, complex data sources. This dissertation describes the development and application of graph-theoretic tools for the study of complex networks. Algorithms are presented that leverage efficient, exact solutions to difficult combinatorial problems for epigenetic biomarker detection and disease subtyping based on gene expression signatures. Extensive testing on publicly available data is presented supporting the efficacy of these approaches. To address efficient algorithm design, a study of the two core tenets of fixed-parameter tractability (branching and kernelization) is considered in the context of a parallel implementation of vertex cover. Results of testing on a wide variety of graphs derived from both real and synthetic data are presented. The relative success of kernelization versus branching is shown to depend largely on the degree distribution of the graph. Throughout, an emphasis is placed upon the practicality of the resulting implementations to advance the limits of effective computation.
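The two tenets named above, kernelization and branching, can be illustrated for vertex cover in a few lines. This is a toy sketch, not the dissertation's parallel implementation: a single high-degree kernelization rule followed by the classic O(2^k) edge-endpoint branching.

```python
from collections import Counter

def kernelize(edges, k):
    """High-degree rule: a vertex with degree > k must be in any size-<=k cover,
    so take it into the cover, delete its edges, and decrement k."""
    changed = True
    while changed:
        changed = False
        deg = Counter()
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
        for v, d in deg.items():
            if d > k:
                edges = {e for e in edges if v not in e}
                k -= 1
                changed = True
                break
    return edges, k

def vertex_cover_branch(edges, k):
    """Decide whether the graph has a vertex cover of size <= k by branching:
    every cover must contain an endpoint of every edge, so pick an edge and
    recurse on including either endpoint."""
    if not edges:
        return True
    if k <= 0:
        return False
    u, v = next(iter(edges))
    for pick in (u, v):
        rest = {e for e in edges if pick not in e}
        if vertex_cover_branch(rest, k - 1):
            return True
    return False

# Triangle plus a pendant edge: admits a cover of size 2 but not size 1.
edges = {(1, 2), (2, 3), (1, 3), (3, 4)}
```

Running the kernel first shrinks the instance (for a star graph with k = 1, the high-degree rule alone solves it); the branching search then only pays its exponential cost on the reduced kernel, which is the essence of fixed-parameter tractability.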
Potential Alzheimer's Disease Plasma Biomarkers
In this series of studies, we examined the potential of a variety of blood-based plasma biomarkers for the identification of Alzheimer's disease (AD) progression and cognitive decline. With the end goal of studying these biomarkers via mixture modeling, we began with a literature review of the methodology. An examination of the biomarkers with demographics and other health factors found evidence of minimal risk of confounding along the causal pathway from biomarkers to cognitive performance. Further study examined the usefulness of linear combinations of biomarkers, achieved via partial least squares (PLS) analysis, as predictors of various cognitive assessment scores and clinical cognitive diagnosis. The identified biomarker linear combinations were not effective at predicting cognitive outcomes. The final study of our biomarkers utilized mixture modeling through the extension of group-based trajectory modeling (GBTM). We modeled five biomarkers, covering a range of functions within the body, to identify distinct trajectories over time. Final models showed statistically significant differences in baseline risk factors and cognitive assessments between developmental trajectories of the biomarker outcomes. This course of study has added valuable information to the field of plasma biomarker research in relation to Alzheimer's disease and cognitive decline.
Multi-modality machine learning predicting Parkinson's disease
Personalized medicine promises individualized disease prediction and treatment. The convergence of machine learning (ML) and available multimodal data is key moving forward. We build upon previous work to deliver multimodal predictions of Parkinson's disease (PD) risk and systematically develop a model using GenoML, an automated ML package, to make improved multi-omic predictions of PD, validated in an external cohort. We investigated top features, constructed hypothesis-free disease-relevant networks, and investigated drug-gene interactions. We performed automated ML on multimodal data from the Parkinson's Progression Markers Initiative (PPMI). After selecting the best performing algorithm, all PPMI data were used to tune the selected model. The model was validated in the Parkinson's Disease Biomarker Program (PDBP) dataset. Our initial model showed an area under the curve (AUC) of 89.72% for the diagnosis of PD. The tuned model was then tested for validation on external data (PDBP, AUC 85.03%). Optimizing thresholds for classification increased the diagnosis prediction accuracy and other metrics. Finally, networks were built to identify gene communities specific to PD. Combining data modalities outperforms the single-biomarker paradigm. UPSIT and PRS contributed most to the predictive power of the model, but their accuracy is supplemented by many smaller-effect transcripts and risk SNPs. Our model is best suited to identifying large groups of individuals to monitor within a health registry or biobank to prioritize for further testing. This approach allows complex predictive models to be reproducible and accessible to the community, with the package, code, and results publicly available.
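The threshold-optimization step mentioned above can be sketched as follows. This is a generic illustration with synthetic data and a plain logistic model, not the GenoML pipeline itself: instead of the default 0.5 probability cutoff, it picks the ROC threshold maximizing Youden's J (sensitivity + specificity - 1), a common choice when classes are imbalanced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, balanced_accuracy_score

rng = np.random.default_rng(3)
n = 600
X = rng.standard_normal((n, 4))                               # synthetic features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n) > 0.8).astype(int)

clf = LogisticRegression().fit(X, y)
prob = clf.predict_proba(X)[:, 1]

# Scan the ROC curve and keep the threshold with maximal Youden's J,
# i.e. the largest gap between true-positive and false-positive rates.
fpr, tpr, thresholds = roc_curve(y, prob)
best_t = thresholds[np.argmax(tpr - fpr)]

tuned = balanced_accuracy_score(y, (prob >= best_t).astype(int))
default = balanced_accuracy_score(y, (prob >= 0.5).astype(int))
```

Because balanced accuracy equals (1 + tpr - fpr) / 2, maximizing Youden's J is exactly maximizing balanced accuracy over candidate thresholds; in practice the threshold should be chosen on held-out data to avoid optimism.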