319 research outputs found

    The Area Under the ROC Curve as a Measure of Clustering Quality

    The Area Under the Receiver Operating Characteristics (ROC) Curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically, in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we elaborate on the fact that, in the context of internal/relative clustering validation as we consider, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC actually unveils a much more efficient algorithmic procedure. Our theoretical findings are supported by experimental results. These results show that, in addition to an effective and robust quantitative evaluation provided by AUCC, visual inspection of the ROC curves themselves can be useful to further assess a candidate clustering solution from a broader, qualitative perspective as well. Comment: 37 pages, 5 figures, submitted for publication
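    The idea in this abstract can be illustrated with a minimal brute-force sketch. It is not the authors' efficient procedure. It assumes that each pair of objects is scored by negative Euclidean distance, that a pair counts as "positive" when both objects share a cluster label, and that the linear relation takes the form Gamma = 2 * AUCC - 1, which holds when no within-cluster distance ties with a between-cluster distance. The function names `aucc` and `gamma_index` are hypothetical.

```python
from itertools import combinations
import math


def _pair_scores(points, labels):
    """Split pairwise similarities into within-cluster and between-cluster lists."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        similarity = -math.dist(points[i], points[j])  # closer pairs score higher
        (within if labels[i] == labels[j] else between).append(similarity)
    if not within or not between:
        raise ValueError("need at least one within- and one between-cluster pair")
    return within, between


def aucc(points, labels):
    """AUC of 'same cluster?' predicted from pairwise similarity (ties count 1/2)."""
    within, between = _pair_scores(points, labels)
    wins = sum((w > b) + 0.5 * (w == b) for w in within for b in between)
    return wins / (len(within) * len(between))


def gamma_index(points, labels):
    """Gamma as (concordant - discordant) / (concordant + discordant)."""
    within, between = _pair_scores(points, labels)
    concordant = sum(w > b for w in within for b in between)
    discordant = sum(w < b for w in within for b in between)
    return (concordant - discordant) / (concordant + discordant)


if __name__ == "__main__":
    # Toy 1-D data whose pairwise distances are all distinct (no ties).
    pts = [(0.0,), (1.0,), (3.0,), (7.0,)]
    lab = [0, 0, 1, 1]
    print("AUCC :", aucc(pts, lab))
    print("Gamma:", gamma_index(pts, lab))
```

    This direct comparison of every within-cluster pair against every between-cluster pair is quadratic in the number of pairs, so it is only practical for toy inputs. It is meant to make the definition concrete, not to be used at scale.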

    Classification of tissues and disease subtypes using whole-genome signatures

    Development and application of microarray technology in biological research has led to compilation of expression and sequence data on a genome-wide scale. Given the volume of data produced and the complexity of gene regulatory mechanisms, it can be difficult to extract meaningful biological information. Classification can be used to reduce the complexity through the detection of genes, genetic loci or conditions that share common attributes and the identification of gene expression patterns or genotypes associated with phenotype. In the study of cancer, supervised classification has been applied to identify gene expression biomarkers of different disease states. Clinically validated biomarkers are valuable indicators for diagnosis and guiding therapeutic strategy. We developed an iterative machine learning algorithm to compare the predictive value of biomarker sets chosen by supervised classification against sets selected randomly from known disease-related genes. Both supervised classification and feature selection based on prior knowledge resulted in discriminative classification of molecular phenotypes in breast cancer and lymphoma. Compilation of gene expression data has led to the identification of genes with bimodal, or switch-like, expression patterns. We used unsupervised, supervised and model-based classification methods to investigate the biological relevance of bimodal expression patterns and to evaluate their potential for class discovery and prediction. Both model-based and supervised classification resulted in the accurate classification of samples by tissue phenotype or infectious disease. Functional enrichment analysis indicates switch-like genes are involved in tissue-specific or immune response functions. Taken together, this evidence supports the assertion that bimodal expression patterns are biologically relevant. 
Clinical relevance of bimodal expression patterns was investigated in an association study of genotypes of families affected by autism. A subset of neural-specific switch-like genes was used to identify candidate gene regions which may contain genetic variants associated with autism risk. A two-stage family-based association test detected an autism susceptibility locus in the q26 region of chromosome 10. The coding region of the fibroblast growth factor receptor 2 (FGFR2) gene is 80 kilobases downstream from the identified locus. Altered expression of FGFR2 may be a contributing genetic factor in development of autism. Identification of the susceptibility locus provides motivation for novel hypotheses concerning the molecular basis of autism. In addition, we provide a method for integration of gene expression and genotype data that may lead to the identification of disease-related polymorphisms in other disorders. Ph.D., Biomedical Engineering -- Drexel University, 200

    Computational methods for breast cancer diagnosis, prognosis, and treatment prediction

    The research presented here develops a robust algorithm for identifying reliable protein interactions, which can be combined with a gene expression dataset to improve algorithm performance. It also develops novel algorithms for breast cancer diagnosis, prognosis, and treatment prediction that take existing issues into account in order to provide a fair estimate of their performance.

    Graphical Models for Multivariate Time-Series

    Gaussian graphical models have received much attention in recent years, due to their flexibility and expressive power. In particular, much interest has been devoted to graphical models for temporal data, or dynamical graphical models, to understand the relations among variables evolving in time. While powerful in modelling complex systems, such models suffer from computational issues both in terms of convergence rates and memory requirements, and may fail to detect temporal patterns when the information on the system is partial. This thesis comprises two main contributions in the context of dynamical graphical models, tackling these two aspects: the need for reliable and fast optimisation methods, and the need for greater modelling power so that the model can be retrieved in practical applications. The first contribution consists of a forward-backward splitting (FBS) procedure for Gaussian graphical modelling of multivariate time-series, which relies on recent theoretical studies ensuring global convergence under mild assumptions. Indeed, such an FBS-based implementation achieves, with fast convergence rates, optimal results with respect to ground truth and standard methods for dynamical network inference. The second main contribution focuses on the problem of latent factors that influence the system while remaining hidden or unobservable. This thesis proposes the novel latent variable time-varying graphical lasso method, which is able to take into account both temporal dynamics in the data and latent factors influencing the system. This is fundamental for the practical use of graphical models, where the information on the data is partial. Indeed, extensive validation of the method on both synthetic and real applications shows the effectiveness of considering latent factors to deal with incomplete information.

    Methodological contributions to the challenges and opportunities of high dimensional clustering in the context of single-cell data

    Single-cell sequencing makes it possible to measure the gene expression of each individual cell, in contrast to bulk sequencing, which yields only average gene expression. This procedure provides access to read counts for each single cell and allows the development of methods that automatically allocate single cells to cell types. The determination of cell types is decisive for the analysis of diseases and for understanding human health based on the genetic profile of single cells. Cell types are commonly allocated using clustering procedures that have been developed explicitly for single-cell data. Single-cell consensus clustering (SC3), proposed by Kiselev et al. (Nat Methods 14(5):483-486, 2017), is among the leading clustering methods in this context and is also relevant to the following contributions. This PhD thesis aims at the development of appropriate analysis techniques for the clustering of high-dimensional single-cell data and their reliable validation. It also provides a simulation framework for investigating the influence of distorted single-cell measurements on clustering performance. We further incorporate cluster indices as informative weights into regularized regression, which allows a soft filtering of variables.

    Multi-scale molecular descriptions of human heart failure using single cell, spatial, and bulk transcriptomics

    Molecular descriptions of human disease have relied on transcriptomics, the genome-wide measurement of gene expression. In recent years, the emergence of capture-based technologies has enabled the transcriptomic profiling of single cells from both dissociated and intact tissues, providing a spatial and cell-type-specific context that complements the catalog of gene expression changes reported from bulk technologies. In the context of cardiovascular disease, these technologies open the opportunity to study the inter- and intracellular mechanisms that regulate myocardial remodeling. In this thesis I present comprehensive descriptions of the transcriptional changes in acute and chronic human heart failure using bulk, single cell, and spatial technologies. First, I describe the creation of the Reference of the Heart Failure Transcriptome, a resource built from the meta-analysis of 16 independent studies of human heart failure transcriptomics. Then, I report the first spatial and single cell atlas of human myocardial infarction, and propose a computational strategy to identify compositional, organizational, and molecular tissue differences across distinct time points and physiological zones of damaged myocardium. Finally, I outline a methodology for the multicellular analysis of single cell data that allows for a better understanding of tissue responses and cell type coordination events in cardiovascular disease and that links the knowledge of independent studies at multiple scales. Overall, my work demonstrates the importance of generating reliable molecular references of disease across scales.

    Role of network topology based methods in discovering novel gene-phenotype associations

    The cell is governed by the complex interactions among various types of biomolecules. Coupled with environmental factors, variations in DNA can cause alterations in normal gene function and lead to a disease condition. Often, such disease phenotypes involve coordinated dysregulation of multiple genes that implicate inter-connected pathways. Towards a better understanding and characterization of mechanisms underlying human diseases, here, I present GUILD, a network-based disease-gene prioritization framework. GUILD associates genes with diseases using the global topology of the protein-protein interaction network and an initial set of genes known to be implicated in the disease. Furthermore, I investigate the mechanistic relationships between disease-genes and explain the robustness emerging from these relationships. I also introduce GUILDify, an online and user-friendly tool which prioritizes genes for their association to any user-provided phenotype. Finally, I describe current state-of-the-art systems-biology approaches where network modeling has helped extend our view of diseases such as cancer.