MULTI-DIMENSIONAL ANALYSIS APPROACHES FOR HETEROGENEOUS SINGLE-CELL DATA
Improvements in experimental techniques have led to an explosion of information in biology research. The increasing number of measurements brings challenges in analyzing the resulting data, as well as opportunities to gain deeper insights into biological systems. Conventional average-based methods are unfit to analyze high-dimensional datasets since they fail to take full advantage of such rich information. More importantly, they are not able to capture the heterogeneity that is prevalent in biological systems. Sophisticated algorithms that are able to utilize all available measurements simultaneously are hence emerging rapidly. These algorithms excel at making full use of information within datasets and revealing detailed heterogeneity.
However, existing algorithms have several important disadvantages. First, specific knowledge of statistics or machine learning is required to appropriately interpret and tune the parameters of these algorithms. Lacking that knowledge may result in misuse and misinterpretation. Second, using all measurements with equal weighting runs the risk of noise contamination. In addition, information overload has become more common in biology research, with a large volume of irrelevant measurements. Third, regardless of the quality of measurements, analysis methods that simultaneously use a large number of measurements need to avoid the “curse of dimensionality”, which warns that distance estimation and nearest neighbor estimation lose their meaning in high-dimensional space. However, most current sophisticated algorithms involve distance estimation and/or nearest neighbor estimation.
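The “curse of dimensionality” mentioned above can be illustrated with a small numerical experiment. The sketch below is not taken from the dissertation; it assumes uniformly random points in a unit cube and uses a hypothetical helper, `relative_contrast`, to measure how much farther the farthest point is from a query than the nearest one. As the number of dimensions grows, this contrast tends to shrink, which is why nearest-neighbor estimates become hard to interpret.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points uniform random points in [0, 1]^dim.

    Illustrative helper (an assumption of this sketch, not the author's method).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # Print the contrast for increasing dimensionality.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

In low dimensions the nearest point is typically many times closer than the farthest one; in very high dimensions all points end up at nearly the same distance from the query, so "nearest" carries little information.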
In this dissertation, my goal is to build analysis methods that are complex enough to capture heterogeneity and at the same time output results in a format that is easy to interpret and familiar to biologists and medical researchers. I tackle the dimension reduction problem not by finding a single best subspace but by dividing the measurements into multiple subspaces and examining them one by one. I demonstrate my methods with three types of datasets: image-based high-throughput screening data, flow cytometry data, and mass cytometry data. From each dataset, I was able to discover new biological insights as well as re-validate well-established findings with my methods.
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Data-driven modelling of biological multi-scale processes
Biological processes involve a variety of spatial and temporal scales. A
holistic understanding of many biological processes therefore requires
multi-scale models which capture the relevant properties on all these scales.
In this manuscript we review mathematical modelling approaches used to describe
the individual spatial scales and how they are integrated into holistic models.
We discuss the relation between spatial and temporal scales and its implications
for multi-scale modelling. Based upon this overview of
state-of-the-art modelling approaches, we formulate key challenges in
mathematical and computational modelling of biological multi-scale and
multi-physics processes. In particular, we consider the availability of
analysis tools for multi-scale models and model-based multi-scale data
integration. We provide a compact review of methods for model-based data
integration and model-based hypothesis testing. Furthermore, novel approaches
and recent trends are discussed, including computation time reduction using
reduced order and surrogate models, which contribute to the solution of
inference problems. We conclude the manuscript by providing a few ideas for the
development of tailored multi-scale inference methods.
Comment: This manuscript will appear in the Journal of Coupled Systems and
Multiscale Dynamics (American Scientific Publishers)
Understanding Health and Disease with Multidimensional Single-Cell Methods
Current efforts in the biomedical sciences and related interdisciplinary
fields are focused on gaining a molecular understanding of health and disease,
which is a problem of daunting complexity that spans many orders of magnitude
in characteristic length scales, from small molecules that regulate cell
function to cell ensembles that form tissues and organs working together as an
organism. In order to uncover the molecular nature of the emergent properties
of a cell, it is essential to measure multiple cell components simultaneously
in the same cell. In turn, cell heterogeneity requires multiple cells to be
measured in order to understand health and disease in the organism. This review
summarizes current efforts towards a data-driven framework that leverages
single-cell technologies to build robust signatures of healthy and diseased
phenotypes. While some approaches focus on multicolor flow cytometry data and
other methods are designed to analyze high-content image-based screens, we
emphasize the so-called Supercell/SVM paradigm (recently developed by the
authors of this review and collaborators) as a unified framework that captures
mesoscopic-scale emergence to build reliable phenotypes. Beyond their specific
contributions to basic and translational biomedical research, these efforts
illustrate, from a larger perspective, the powerful synergy that might be
achieved from bringing together methods and ideas from statistical physics,
data mining, and mathematics to solve the most pressing problems currently
facing the life sciences.
Comment: 25 pages, 7 figures; revised version with minor changes. To appear in
J. Phys.: Cond. Mat