319 research outputs found
A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis
Background: Clustering is one of the best-known activities in scientific investigation and an object of research in many disciplines, ranging from statistics to computer science. Following Handl et al., it can be summarized as a three-step process: (1) choice of a distance function; (2) choice of a clustering algorithm; (3) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention if inferences made from cluster analysis are to be of some relevance to biomedical research. Results: A procedure is proposed for assessing the discriminative ability of a distance function, that is, its ability to capture structure in a dataset. It is based on the introduction of a new external validation index, referred to as the Balanced Misclassification Index (BMI), and of a nontrivial modification of the well-known Receiver Operating Curve (ROC), which we refer to as Corrected ROC (CROC). The main results are: (a) a quantitative and qualitative method to describe the intrinsic separation ability of a distance; (b) a quantitative method to assess the performance of a clustering algorithm in conjunction with the intrinsic separation ability of a distance function. The proposed procedure is more informative than those available in the literature because of the tools it adopts. Indeed, the first tool makes it possible to map distances and clustering solutions as graphical objects on a plane, and gives information about the bias of the clustering algorithm with respect to a distance. The second tool is a new external validity index that performs similarly to the state of the art but with more flexibility, allowing for a broader spectrum of applications. 
In fact, it not only quantifies the merit of each clustering solution but also quantifies the agglomerative or divisive errors due to the algorithm. Conclusions: The new methodology has been used to experimentally study three popular distance functions, namely, Euclidean distance d2, Pearson correlation dr and mutual information dMI. Based on the results of the experiments, the Euclidean and Pearson correlation distances have a good intrinsic discrimination ability. Conversely, the mutual information distance does not seem to offer the same flexibility and versatility as the other two distances. Apparently, that is due to well-known problems in its estimation, since a dataset must have a substantial number of features for the estimate to be reliable. Nevertheless, taking this into account, together with the results presented in Priness et al., one receives an indication that dMI may be superior to the other distances considered in this study only in conjunction with clustering algorithms specifically designed for its use. In addition, the K-means, Average Link, and Complete Link clustering algorithms are in most cases able to improve the discriminative ability of the distances considered in this study with respect to clustering. The methodology has a range of applicability that goes well beyond microarray data since it is independent of the nature of the input data. The only requirement is that the input data have the format of a "feature matrix". In particular, it can be used to cluster ChIP-seq data
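As an illustration of two of the distance functions named in the abstract above, the following is a minimal sketch only. It assumes that rows of a feature matrix are compared as NumPy vectors and that the Pearson-based distance is taken as 1 − r, which is one common convention and may differ from the paper's definition of dr. The function names are hypothetical. Mutual information is left out because, as the abstract notes, its estimation is itself problematic.

```python
import numpy as np


def euclidean_distance(x, y):
    """Euclidean (L2) distance between two feature vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.linalg.norm(x - y))


def pearson_distance(x, y):
    """Correlation-based distance, assumed here to be 1 - r.

    The value is 0 for perfectly correlated vectors and 2 for
    perfectly anti-correlated ones.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
    return float(1.0 - r)


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [2.0, 4.0, 6.0]  # same profile shape as a, different scale
    print(euclidean_distance(a, b))
    print(pearson_distance(a, b))
```

The example pair shows why the choice of distance matters: the two vectors are far apart in Euclidean terms but essentially identical under the correlation-based distance, which ignores scale.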
The Area Under the ROC Curve as a Measure of Clustering Quality
The Area Under the Receiver Operating Characteristics (ROC) Curve,
referred to as AUC, is a well-known performance measure in the supervised
learning domain. Due to its compelling features, it has been employed in a
number of studies to evaluate and compare the performance of different
classifiers. In this work, we explore AUC as a performance measure in the
unsupervised learning domain, more specifically, in the context of cluster
analysis. In particular, we elaborate on the use of AUC as an internal/relative
measure of clustering quality, which we refer to as Area Under the Curve for
Clustering (AUCC). We show that the AUCC of a given candidate clustering
solution has an expected value under a null model of random clustering
solutions, regardless of the size of the dataset and, more importantly,
regardless of the number or the (im)balance of clusters under evaluation. In
addition, we elaborate on the fact that, in the context of internal/relative
clustering validation as we consider, AUCC is actually a linear transformation
of the Gamma criterion from Baker and Hubert (1975), for which we also formally
derive a theoretical expected value for chance clusterings. We also discuss the
computational complexity of these criteria and show that, while an ordinary
implementation of Gamma can be computationally prohibitive and impractical for
most real applications of cluster analysis, its equivalence with AUCC actually
unveils a much more efficient algorithmic procedure. Our theoretical findings
are supported by experimental results. These results show that, in addition to
an effective and robust quantitative evaluation provided by AUCC, visual
inspection of the ROC curves themselves can be useful to further assess a
candidate clustering solution from a broader, qualitative perspective as well. Comment: 37 pages, 5 figures, submitted for publication
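The idea behind AUCC can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: every pair of objects is treated as one instance, "same cluster" is the positive class, and the negated pairwise distance is the score, so that closer pairs rank higher. The function names, the use of Euclidean distance, and the rank-based (Mann-Whitney) AUC computation are all assumptions made here.

```python
import numpy as np


def average_ranks(values):
    """1-based ranks in ascending order, with ties given their average rank."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)      # last rank occupied by each distinct value
    starts = ends - counts + 1    # first rank occupied by each distinct value
    return ((starts + ends) / 2.0)[inverse]


def aucc(X, labels):
    """AUC over object pairs: do same-cluster pairs have smaller distances?"""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(X), k=1)         # all unordered pairs
    dist = np.linalg.norm(X[i] - X[j], axis=1)  # Euclidean distance (assumption)
    same = labels[i] == labels[j]               # positive class: same cluster
    n_pos = int(same.sum())
    n_neg = int(same.size - n_pos)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both same-cluster and different-cluster pairs")
    ranks = average_ranks(-dist)                # higher score = closer pair
    u = ranks[same].sum() - n_pos * (n_pos + 1) / 2.0  # Mann-Whitney U statistic
    return float(u / (n_pos * n_neg))


if __name__ == "__main__":
    X = [[0, 0], [0, 1], [10, 10], [10, 11]]
    print(aucc(X, [0, 0, 1, 1]))  # partition matching the two tight groups
    print(aucc(X, [0, 1, 0, 1]))  # partition cutting across them
```

When there are no tied distances, the Gamma criterion equals 2 × AUCC − 1, which is consistent with the linear relationship stated in the abstract. Note that this sketch enumerates and ranks all pairs directly and is not meant to reproduce any efficiency result from the paper.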
Classification of tissues and disease subtypes using whole-genome signatures
Development and application of microarray technology in biological research has led to compilation of expression and sequence data on a genome-wide scale. Given the volume of data produced and the complexity of gene regulatory mechanisms, it can be difficult to extract meaningful biological information. Classification can be used to reduce the complexity through the detection of genes, genetic loci or conditions that share common attributes and the identification of gene expression patterns or genotypes associated with phenotype. In the study of cancer, supervised classification has been applied to identify gene expression biomarkers of different disease states. Clinically validated biomarkers are valuable indicators for diagnosis and guiding therapeutic strategy. We developed an iterative machine learning algorithm to compare the predictive value of biomarker sets chosen by supervised classification against sets selected randomly from known disease-related genes. Both supervised classification and feature selection based on prior knowledge resulted in discriminative classification of molecular phenotypes in breast cancer and lymphoma. Compilation of gene expression data has led to the identification of genes with bimodal, or switch-like, expression patterns. We used unsupervised, supervised and model-based classification methods to investigate the biological relevance of bimodal expression patterns and to evaluate their potential for class discovery and prediction. Both model-based and supervised classification resulted in the accurate classification of samples by tissue phenotype or infectious disease. Functional enrichment analysis indicates switch-like genes are involved in tissue-specific or immune response functions. Taken together, this evidence supports the assertion that bimodal expression patterns are biologically relevant. 
Clinical relevance of bimodal expression patterns was investigated in an association study of genotypes of families affected by autism. A subset of neural-specific switch-like genes was used to identify candidate gene regions which may contain genetic variants associated with autism risk. A two-stage family-based association test detected an autism susceptibility locus in the q26 region of chromosome 10. The coding region of the fibroblast growth factor receptor 2 (FGFR2) gene is 80 kilobases downstream from the identified locus. Altered expression of FGFR2 may be a contributing genetic factor in development of autism. Identification of the susceptibility locus provides motivation for novel hypotheses concerning the molecular basis of autism. In addition, we provide a method for integration of gene expression and genotype data that may lead to the identification of disease-related polymorphisms in other disorders. Ph.D., Biomedical Engineering -- Drexel University, 200
Computational methods for breast cancer diagnosis, prognosis, and treatment prediction
The research presented here develops a robust reliability algorithm for identifying reliable protein interactions, which can be combined with a gene expression dataset to improve algorithm performance, as well as novel algorithms for breast cancer diagnosis, prognosis and treatment prediction, which take existing issues into account in order to provide a fair estimation of their performance
Graphical Models for Multivariate Time-Series
Gaussian graphical models have received much attention in recent years, due
to their flexibility and expressive power. In particular, much interest has
been devoted to graphical models for temporal data, or dynamical graphical
models, to understand the relation of variables evolving in time. While powerful
in modelling complex systems, such models suffer from computational
issues both in terms of convergence rates and memory requirements, and may
fail to detect temporal patterns in case the information on the system is partial.
This thesis comprises two main contributions in the context of dynamical
graphical models, tackling these two aspects: the need for reliable and fast
optimisation methods and for greater modelling power, so that the model can be
retrieved in practical applications. The first contribution consists in a
forward-backward splitting (FBS) procedure for Gaussian graphical modelling
of multivariate time-series which relies on recent theoretical studies ensuring
global convergence under mild assumptions. Indeed, such FBS-based implementation
achieves, with fast convergence rates, optimal results with respect
to ground truth and standard methods for dynamical network inference. The
second main contribution focuses on the problem of latent factors that influence
the system while remaining hidden or unobservable. This thesis proposes the novel
latent variable time-varying graphical lasso method, which is able to take into
account both temporal dynamics in the data and latent factors influencing
the system. This is fundamental for the practical use of graphical models,
where the information on the data is partial. Indeed, extensive validation of
the method on both synthetic and real applications shows the effectiveness of
considering latent factors to deal with incomplete information
Methodological contributions to the challenges and opportunities of high dimensional clustering in the context of single-cell data
Single-cell sequencing makes it possible to measure the gene expression of each individual cell, in contrast to bulk sequencing, which yields only average gene expression. This procedure provides access to read counts for each single cell and allows the development of methods that automatically allocate single cells to cell types. The determination of cell types is decisive for the analysis of diseases and for understanding human health based on the genetic profile of single cells. Cell types are commonly allocated using clustering procedures that have been developed explicitly for single-cell data. For that purpose, single-cell consensus clustering (SC3), proposed by Kiselev et al. (Nat Methods 14(5):483-486, 2017), is among the leading clustering methods in this context and is also of relevance for the following contributions.
This PhD thesis aims at the development of appropriate analysis techniques for the clustering of high-dimensional single-cell data and their reliable validation. It also provides a simulation framework for investigating the influence of distorted measurements of single cells on clustering performance. We further incorporate cluster indices as informative weights into the regularized regression, which allows a soft filtering of variables
Computational meta'omics for microbial community studies
Complex microbial communities are an integral part of the Earth's ecosystem and of our bodies in health and disease. In the last two decades, culture-independent approaches have provided new insights into their structure and function, with the exponentially decreasing cost of high-throughput sequencing resulting in broadly available tools for microbial surveys. However, the field remains far from reaching a technological plateau, as both computational techniques and nucleotide sequencing platforms for microbial genomic and transcriptional content continue to improve. Current microbiome analyses are thus starting to adopt multiple and complementary meta'omic approaches, leading to unprecedented opportunities to comprehensively and accurately characterize microbial communities and their interactions with their environments and hosts. This diversity of available assays, analysis methods, and public data is in turn beginning to enable microbiome-based predictive and modeling tools. We thus review here the technological and computational meta'omics approaches that are already available, those that are under active development, their success in biological discovery, and several outstanding challenges
Multi-scale molecular descriptions of human heart failure using single cell, spatial, and bulk transcriptomics
Molecular descriptions of human disease have relied on transcriptomics, the genome-wide measurement of gene expression. In recent years the emergence of capture-based technologies has enabled the transcriptomic profiling of single cells both from dissociated and intact tissues, providing a spatial and cell type specific context that complements the catalog of gene expression changes reported from bulk technologies. In the context of cardiovascular disease, these technologies open the opportunity to study the inter- and intra-cellular mechanisms that regulate myocardial remodeling. In this thesis I present comprehensive descriptions of the transcriptional changes in acute and chronic human heart failure using bulk, single cell, and spatial technologies. First, I describe the creation of the Reference of the Heart Failure Transcriptome, a resource built from the meta-analysis of 16 independent studies of human heart failure transcriptomics. Then, I report the first spatial and single cell atlas of human myocardial infarction, and propose a computational strategy to identify compositional, organizational, and molecular tissue differences across distinct time points and physiological zones of damaged myocardium. Finally, I outline a methodology for the multicellular analysis of single cell data that allows for a better understanding of tissue responses and cell type coordination events in cardiovascular disease and that links the knowledge of independent studies at multiple scales. Overall my work demonstrates the importance of the generation of reliable molecular references of disease across scales
Role of network topology based methods in discovering novel gene-phenotype associations
The cell is governed by the complex interactions among various types of biomolecules. Coupled with environmental factors, variations in DNA can cause alterations in normal gene function and lead to a disease condition. Often, such disease phenotypes involve coordinated dysregulation of multiple genes that implicate inter-connected pathways. Towards a better understanding and characterization of mechanisms underlying human diseases, here, I present GUILD, a network-based disease-gene prioritization framework. GUILD associates genes with diseases using the global topology of the protein-protein interaction network and an initial set of genes known to be implicated in the disease. Furthermore, I investigate the mechanistic relationships between disease-genes and explain the robustness emerging from these relationships. I also introduce GUILDify, an online and user-friendly tool which prioritizes genes for their association to any user-provided phenotype. Finally, I describe current state-of-the-art systems-biology approaches where network modeling has helped extend our view on diseases such as cancer