From Cellular Characteristics to Disease Diagnosis: Uncovering Phenotypes with Supercells
Cell heterogeneity and the inherent complexity due to the interplay of multiple molecular processes within the cell pose difficult challenges for current single-cell biology. We introduce an approach that identifies a disease phenotype from multiparameter single-cell measurements, based on the concept of "supercell statistics", a single-cell-based averaging procedure followed by a machine learning classification scheme. We are able to assess the optimal tradeoff between the number of single cells averaged and the number of measurements needed to capture phenotypic differences between healthy and diseased patients, as well as between different diseases that are otherwise difficult to diagnose. We apply our approach to two kinds of single-cell datasets, addressing the diagnosis of a premature aging disorder using images of cell nuclei, as well as the phenotypes of two non-infectious uveitides (the ocular manifestations of Behçet's disease and sarcoidosis) based on multicolor flow cytometry. In the former case, one nuclear shape measurement taken over a group of 30 cells is sufficient to classify samples as healthy or diseased, in agreement with usual laboratory practice. In the latter, our method identifies a minimal set of 5 markers that accurately predict Behçet's disease and sarcoidosis. This is the first time that a quantitative phenotypic distinction between these two diseases has been achieved. To obtain this clear phenotypic signature, about one hundred CD8+ T cells need to be measured. Although the molecular markers identified have been reported to be important players in autoimmune disorders, this is the first report showing that CD8+ T cells can be used to distinguish two systemic inflammatory diseases.
Beyond these specific cases, the approach proposed here is applicable to datasets generated by other kinds of state-of-the-art and forthcoming single-cell technologies, such as multidimensional mass cytometry, single-cell gene expression, and single-cell full-genome sequencing techniques.
Authors: Julian Marcelo Candia (University of Maryland, United States; Instituto de Física de Líquidos y Sistemas Biológicos, CONICET – Universidad Nacional de La Plata, Argentina); Ryan Maunu, Meghan Driscoll (University of Maryland, United States); Angélique Biancotto, Pradeep Dagur, J. Philip McCoy Jr., H. Nida Sen, Lai Wei (National Institutes of Health, United States); Amos Maritan (Università di Padova, Italy); Kan Cao (University of Maryland, United States); Robert B. Nussenblatt (National Institutes of Health, United States); Jayanth R. Banavar, Wolfgang Losert (University of Maryland, United States)
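The supercell idea above can be sketched in a few lines: average randomly drawn groups of single cells into "supercell" feature vectors, then train a classifier on those averages. This is a minimal illustration with synthetic data; the group size, marker count, and classifier settings are assumptions, not the paper's actual configuration.

```python
# Minimal sketch of "supercell statistics": average randomly sampled groups of
# cells, then classify the averaged vectors. All data here are synthetic.
import numpy as np
from sklearn.svm import SVC

def make_supercells(cells, group_size, n_supercells, rng):
    """Average `group_size` randomly sampled cells into one supercell vector."""
    idx = rng.integers(0, len(cells), size=(n_supercells, group_size))
    return cells[idx].mean(axis=1)

rng = np.random.default_rng(0)
# Synthetic single-cell data: 500 cells x 5 markers, small per-class mean shift
healthy = rng.normal(0.0, 1.0, size=(500, 5))
diseased = rng.normal(0.4, 1.0, size=(500, 5))

X = np.vstack([make_supercells(healthy, 30, 100, rng),
               make_supercells(diseased, 30, 100, rng)])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="linear").fit(X, y)
print(clf.score(X, y))
```

Averaging over 30 cells shrinks within-class noise by roughly a factor of sqrt(30), which is why a shift that is invisible at the single-cell level becomes linearly separable at the supercell level.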
Understanding Health and Disease with Multidimensional Single-Cell Methods
Current efforts in the biomedical sciences and related interdisciplinary
fields are focused on gaining a molecular understanding of health and disease,
which is a problem of daunting complexity that spans many orders of magnitude
in characteristic length scales, from small molecules that regulate cell
function to cell ensembles that form tissues and organs working together as an
organism. In order to uncover the molecular nature of the emergent properties
of a cell, it is essential to measure multiple cell components simultaneously
in the same cell. In turn, cell heterogeneity requires multiple cells to be
measured in order to understand health and disease in the organism. This review
summarizes current efforts towards a data-driven framework that leverages
single-cell technologies to build robust signatures of healthy and diseased
phenotypes. While some approaches focus on multicolor flow cytometry data and
other methods are designed to analyze high-content image-based screens, we
emphasize the so-called Supercell/SVM paradigm (recently developed by the
authors of this review and collaborators) as a unified framework that captures
mesoscopic-scale emergence to build reliable phenotypes. Beyond their specific
contributions to basic and translational biomedical research, these efforts
illustrate, from a larger perspective, the powerful synergy that might be
achieved from bringing together methods and ideas from statistical physics,
data mining, and mathematics to solve the most pressing problems currently
facing the life sciences.Comment: 25 pages, 7 figures; revised version with minor changes. To appear in
J. Phys.: Cond. Mat
Special issue on bio-ontologies and phenotypes
The bio-ontologies and phenotypes special issue includes eight papers selected from the 11 presented at the Bio-Ontologies SIG (Special Interest Group) and the Phenotype Day at the ISMB (Intelligent Systems for Molecular Biology) conference in Boston in 2014. The selected papers span a wide range of topics, including the automated re-use and update of ontologies, quality assessment of ontological resources, and the systematic description of phenotype variation by manual, semi-automatic, and fully automatic means.
Understanding Learned Models by Identifying Important Features at the Right Resolution
In many application domains, it is important to characterize how complex
learned models make their decisions across the distribution of instances. One
way to do this is to identify the features and interactions among them that
contribute to a model's predictive accuracy. We present a model-agnostic
approach to this task that makes the following specific contributions. Our
approach (i) tests feature groups, in addition to base features, and tries to
determine the level of resolution at which important features can be
determined, (ii) uses hypothesis testing to rigorously assess the effect of
each feature on the model's loss, (iii) employs a hierarchical approach to
control the false discovery rate when testing feature groups and individual
base features for importance, and (iv) uses hypothesis testing to identify
important interactions among features and feature groups. We evaluate our
approach by analyzing random forest and LSTM neural network models learned in
two challenging biomedical applications.
Comment: First two authors contributed equally to this work. Accepted for presentation at the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19).
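One building block of the approach above, testing whole feature groups rather than single features, can be illustrated with group-level permutation importance: jointly permute a group's columns and measure the increase in the model's loss. The grouping, model, and permutation count below are illustrative assumptions, not the paper's exact procedure (which adds hypothesis testing and FDR control).

```python
# Hedged sketch of group-level permutation importance: the loss increase when a
# whole feature group is jointly permuted. Data and groups are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def group_importance(model, X, y, group, n_perm, rng):
    """Mean increase in squared-error loss when the columns in `group`
    are jointly permuted, averaged over `n_perm` permutations."""
    base = np.mean((model.predict(X) - y) ** 2)
    deltas = []
    for _ in range(n_perm):
        Xp = X.copy()
        Xp[:, group] = Xp[rng.permutation(len(X))][:, group]
        deltas.append(np.mean((model.predict(Xp) - y) ** 2) - base)
    return float(np.mean(deltas))

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=300)  # only the first group matters

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# Test coarse groups first; an important group can then be drilled into.
for group in ([0, 1], [2, 3], [4, 5]):
    print(group, round(group_importance(model, X, y, group, 20, rng), 3))
```

Testing coarse groups first and refining only the important ones is what lets a hierarchical procedure control the false discovery rate while still finding the right resolution.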
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.
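The simplest integration strategy, often called early integration, concatenates scaled feature blocks from each modality and relies on a sparse model to cope with the curse of dimensionality. The following is a minimal sketch under that assumption; the modality names and sizes are synthetic placeholders, not from the review.

```python
# Illustrative "early" multi-omics integration: scale each modality's block,
# concatenate, and fit an L1-penalized model to handle p >> n. Synthetic data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 120
genome = rng.normal(size=(n, 200))    # stand-in for genomic features
proteome = rng.normal(size=(n, 80))   # stand-in for proteomic features
y = (genome[:, 0] + proteome[:, 0] > 0).astype(int)

# Per-block scaling keeps one modality from dominating purely by units.
X = np.hstack([StandardScaler().fit_transform(genome),
               StandardScaler().fit_transform(proteome)])

# The L1 penalty zeroes out most coefficients, mitigating dimensionality.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print(int((clf.coef_ != 0).sum()), "features kept of", X.shape[1])
```

More elaborate schemes (intermediate and late integration) learn per-modality representations first, but the per-block scaling step above addresses data heterogeneity in the same spirit.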
Prediction of inherited genomic susceptibility to 20 common cancer types by a supervised machine-learning method.
Prevention and early intervention are the most effective ways of avoiding or minimizing psychological, physical, and financial suffering from cancer. However, such proactive action requires the ability to predict the individual's susceptibility to cancer with a measure of probability. Of the triad of cancer-causing factors (inherited genomic susceptibility, environmental factors, and lifestyle factors), the inherited genomic component may be derivable from the recent public availability of a large body of whole-genome variation data. However, genome-wide association studies have so far shown limited success in predicting the inherited susceptibility to common cancers. We present here a multiple classification approach for predicting individuals' inherited genomic susceptibility to acquire the most likely phenotype among a panel of 20 major common cancer types plus 1 "healthy" type by application of a supervised machine-learning method under competing conditions among the cohorts of the 21 types. This approach suggests that, depending on the phenotypes of 5,919 individuals of "white" ethnic population in this study, (i) the portion of the cohort of a cancer type who acquired the observed type due to mostly inherited genomic susceptibility factors ranges from about 33 to 88% (or its corollary: the portion due to mostly environmental and lifestyle factors ranges from 12 to 67%), and (ii) on an individual level, the method also predicts individuals' inherited genomic susceptibility to acquire the other types ranked with associated probabilities. These probabilities may provide practical information for individuals, health professionals, and health policymakers related to prevention and/or early intervention of cancer.
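The "competing cohorts" setup amounts to one multiclass model over all phenotype classes, whose per-class probabilities can be ranked for each individual. This sketch uses 5 toy classes as a stand-in for the 21 types and synthetic genotype-like features; it illustrates the ranking mechanism only, not the paper's actual model.

```python
# Sketch of multiclass susceptibility ranking: fit one classifier over all
# classes, then rank each individual's predicted class probabilities.
# Classes, features, and data are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_classes, n_feats = 5, 40          # toy stand-in for 21 types x genomic variants
centers = rng.normal(0, 0.6, size=(n_classes, n_feats))
X = np.vstack([c + rng.normal(size=(60, n_feats)) for c in centers])
y = np.repeat(np.arange(n_classes), 60)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:1])[0]
ranking = np.argsort(proba)[::-1]   # classes ranked by predicted susceptibility
print(ranking, np.round(proba[ranking], 2))
```

Ranking `predict_proba` output, rather than taking only the argmax, is what yields the per-individual susceptibility ordering across the remaining types described in point (ii).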
Selection of important variables by statistical learning in genome-wide association analysis
Genetic analysis of complex diseases demands novel analytical methods to interpret data collected on thousands of variables by genome-wide association studies. The complexity of such analysis is multiplied when one has to consider interaction effects, be they among the genetic variations (G × G) or with environmental risk factors (G × E). Several statistical learning methods seem quite promising in this context. Herein we consider applications of two such methods, random forests and Bayesian networks, to the simulated dataset for Genetic Analysis Workshop 16 Problem 3. Our evaluation study showed that an iterative search based on the random forest approach has the potential to select important variables, while Bayesian networks can capture some of the underlying causal relationships.
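One common form of iterative random-forest search is to fit, keep the top-ranked fraction of variables, and refit on the survivors until a small set remains. The halving schedule, stopping rule, and toy genotype data below are illustrative assumptions, not the workshop paper's exact protocol.

```python
# Hedged sketch of iterative variable selection with a random forest:
# repeatedly fit, keep the top half of features by importance, refit.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.integers(0, 3, size=(400, 100)).astype(float)  # toy SNP genotypes 0/1/2
y = ((X[:, 0] + X[:, 1] + X[:, 0] * X[:, 2]) > 3).astype(int)  # incl. a GxG term

kept = np.arange(X.shape[1])
while len(kept) > 10:
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[:, kept], y)
    order = np.argsort(rf.feature_importances_)[::-1]
    kept = kept[order[: max(10, len(kept) // 2)]]      # halve the candidate set

print(sorted(kept))  # informative variables should rank above the noise columns
```

Because tree splits condition on earlier splits, the forest's importances can pick up the G × G interaction term even though it has no purely marginal effect, which is the motivation for this method in the interaction setting.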
Deep learning for health outcome prediction
Modern medical data contains rich information that allows us to make new types of inferences to predict health outcomes. However, the complexity of modern medical data has rendered many classical analysis approaches insufficient.
Machine learning with deep neural networks enables computational models to process raw data and learn useful representations with multiple levels of abstraction.
In this thesis, I present novel deep learning methods for health outcome prediction from brain MRI and genomic data.
I show that a deep neural network can learn a biomarker from structural brain MRI and that this biomarker provides a useful measure for investigating brain and systemic health, can augment neuroradiological research and potentially serve as a decision-support tool in clinical environments. I also develop two tensor methods for deep neural networks: the first, tensor dropout, for improving the robustness of deep neural networks, and the second, Kronecker machines, for combining multiple sources of data to improve prediction accuracy. Finally, I present a novel deep learning method for predicting polygenic risk scores from genome sequences by leveraging both local and global interactions between genetic variants.
These contributions demonstrate the benefits of using deep learning for health outcome prediction in both research and clinical settings.
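For background on the robustness technique the thesis generalizes: tensor dropout extends standard dropout, in which activations are randomly zeroed during training and rescaled so the expected magnitude is preserved. The sketch below shows plain (inverted) dropout in NumPy, not the thesis's tensor variant.

```python
# Standard (inverted) dropout on an activation matrix, shown as background to
# the tensor dropout mentioned above. Plain NumPy, no framework assumed.
import numpy as np

def dropout(a, p, rng, train=True):
    """Zero each activation with probability p; rescale survivors by 1/(1-p)
    so the expected activation magnitude is unchanged."""
    if not train or p == 0.0:
        return a
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

rng = np.random.default_rng(5)
a = np.ones((4, 6))
out = dropout(a, 0.5, rng)
print(out.mean())  # close to 1 on average, thanks to the 1/(1-p) rescaling
```

At inference (`train=False`) the input passes through unchanged, which is why the rescaling is done at training time rather than at test time.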
The context-dependence of mutations: a linkage of formalisms
Defining the extent of epistasis - the non-independence of the effects of
mutations - is essential for understanding the relationship of genotype,
phenotype, and fitness in biological systems. The applications cover many areas
of biological research, including biochemistry, genomics, protein and systems
engineering, medicine, and evolutionary biology. However, the quantitative
definitions of epistasis vary among fields, and its analysis beyond just
pairwise effects remains obscure in general. Here, we show that different
definitions of epistasis are versions of a single mathematical formalism - the
weighted Walsh-Hadamard transform. We argue that one of these definitions, background-averaged epistasis, is the most informative when the goal is to uncover the general epistatic structure of a biological system, a description that can differ substantially from the local epistatic structure of specific model systems. Key issues are the choice of effective ensembles for averaging and how to contend in practice with the vast combinatorial complexity of mutations. In this regard, we discuss possible approaches for optimally learning the epistatic structure of biological systems.
Comment: 6 pages, 3 figures, supplementary information
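The Walsh-Hadamard view can be made concrete for n biallelic loci: transform the length-2**n vector of phenotypes over all genotypes, and the resulting coefficients separate additive effects from interaction terms of each order. This minimal sketch uses the plain (unweighted) transform; the weighting and ensemble-averaging choices discussed above are omitted.

```python
# Minimal Walsh-Hadamard decomposition of a genotype-phenotype map for n loci.
# The unweighted transform is shown; weighted variants are not sketched here.
import numpy as np

def walsh_hadamard(y):
    """Normalized WHT of a length-2**n vector via the recursive butterfly."""
    y = np.asarray(y, dtype=float).copy()
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / len(y)

# Two loci, genotypes ordered 00, 01, 10, 11:
# phenotype = 1.0*mut1 + 0.5*mut2 + 0.25*(both mutated), so the pairwise
# epistatic interaction is 0.25.
phen = np.array([0.0, 0.5, 1.0, 1.75])
coeffs = walsh_hadamard(phen)
print(coeffs)  # the last coefficient is nonzero only if the loci interact
```

With the interaction removed (phenotype exactly additive), the order-2 coefficient vanishes, which is the sense in which the transform isolates epistasis of each order.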