Search CORE

15,836 research outputs found

Machine Learning and Integrative Analysis of Biomedical Big Data.

Author: Choi Howard
Chung Neo Christopher
Mirza Bilal
Ping Peipei
Wang Jie
Wang Wei
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

Multidisciplinary Digital Publishing Institute

Ezid

Directory of Open Access Journals

eScholarship - University of California

Recommended from our members

Statistical methods for the study of etiologic heterogeneity

Author: Zabor Emily Craig
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2019
Field of study

Traditionally, cancer epidemiologists have investigated the causes of disease under the premise that patients with a certain site of disease can be treated as a single entity. Then risk factors associated with the disease are identified through case-control or cohort studies for the disease as a whole. However, with the rise of molecular and genomic profiling, in recent years biologic subtypes have increasingly been identified. Once subtypes are known, it is natural to ask the question of whether they share a common etiology, or in fact arise from distinct sets of risk factors, a concept known as etiologic heterogeneity. This dissertation seeks to evaluate methods for the study of etiologic heterogeneity in the context of cancer research and with a focus on methods for case-control studies. First, a number of existing regression-based methods for the study of etiologic heterogeneity in the context of pre-defined subtypes are compared using a data example and simulation studies. This work found that a standard polytomous logistic regression approach performs at least as well as more complex methods, and is easy to implement in standard software. Next, simulation studies investigate the statistical properties of an approach that combines the search for the most etiologically distinct subtype solution from high dimensional tumor marker data with estimation of risk factor effects. The method performs well when appropriate up-front selection of tumor markers is performed, even when there is confounding structure or high-dimensional noise. And finally, an application to a breast cancer case-control study demonstrates the usefulness of the novel clustering approach to identify a more risk heterogeneous class solution in breast cancer based on a panel of gene expression data and known risk factors

Columbia University Academic Commons

A Distance-Based Test of Association Between Paired Heterogeneous Genomic Data

Author: Curry Edward
Minas Christopher
Montana Giovanni
Publication venue
Publication date: 27/03/2013
Field of study

Due to rapid technological advances, a wide range of different measurements can be obtained from a given biological sample including single nucleotide polymorphisms, copy number variation, gene expression levels, DNA methylation and proteomic profiles. Each of these distinct measurements provides the means to characterize a certain aspect of biological diversity, and a fundamental problem of broad interest concerns the discovery of shared patterns of variation across different data types. Such data types are heterogeneous in the sense that they represent measurements taken at very different scales or described by very different data structures. We propose a distance-based statistical test, the generalized RV (GRV) test, to assess whether there is a common and non-random pattern of variability between paired biological measurements obtained from the same random sample. The measurements enter the test through distance measures which can be chosen to capture particular aspects of the data. An approximate null distribution is proposed to compute p-values in closed-form and without the need to perform costly Monte Carlo permutation procedures. Compared to the classical Mantel test for association between distance matrices, the GRV test has been found to be more powerful in a number of simulation settings. We also report on an application of the GRV test to detect biological pathways in which genetic variability is associated to variation in gene expression levels in ovarian cancer samples, and present results obtained from two independent cohorts

arXiv.org e-Print Archive

Crossref

King's Research Portal

A statistical framework for joint eQTL analysis in multiple tissues

Author: Flutre Timothée
Pritchard Jonathan
Stephens Matthew
Wen Xiaoquan
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 19/12/2012
Field of study

Mapping expression Quantitative Trait Loci (eQTLs) represents a powerful and widely-adopted approach to identifying putative regulatory variants and linking them to specific genes. Up to now eQTL studies have been conducted in a relatively narrow range of tissues or cell types. However, understanding the biology of organismal phenotypes will involve understanding regulation in multiple tissues, and ongoing studies are collecting eQTL data in dozens of cell types. Here we present a statistical framework for powerfully detecting eQTLs in multiple tissues or cell types (or, more generally, multiple subgroups). The framework explicitly models the potential for each eQTL to be active in some tissues and inactive in others. By modeling the sharing of active eQTLs among tissues this framework increases power to detect eQTLs that are present in more than one tissue compared with "tissue-by-tissue" analyses that examine each tissue separately. Conversely, by modeling the inactivity of eQTLs in some tissues, the framework allows the proportion of eQTLs shared across different tissues to be formally estimated as parameters of a model, addressing the difficulties of accounting for incomplete power when comparing overlaps of eQTLs identified by tissue-by-tissue analyses. Applying our framework to re-analyze data from transformed B cells, T cells and fibroblasts we find that it substantially increases power compared with tissue-by-tissue analysis, identifying 63% more genes with eQTLs (at FDR=0.05). Further the results suggest that, in contrast to previous analyses of the same data, the majority of eQTLs detectable in these data are shared among all three tissues.Comment: Summitted to PLoS Genetic

arXiv.org e-Print Archive

Directory of Open Access Journals

PubMed Central

ProdInra

FigShare