
    Defining an informativeness metric for clustering gene expression data

    Motivation: Unsupervised ‘cluster’ analysis is an invaluable tool for exploratory microarray data analysis, as it organizes the data into groups of genes or samples whose elements share common patterns. Once the data are clustered, finding the optimal number of informative subgroups within a dataset is a problem that, while important for understanding the underlying phenotypes, has no robust, widely accepted solution.
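
    To make the model-selection problem concrete, the sketch below scans candidate cluster counts for an expression matrix and scores each partition. This is a minimal illustration only: the silhouette score is used as a stand-in criterion, not the informativeness metric the paper defines, and the expression matrix is synthetic.

```python
# Hypothetical sketch: scan candidate cluster counts for a gene-expression
# matrix and score each partition. The silhouette score is a stand-in for
# the paper's informativeness metric; the data are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 20))   # 200 genes x 20 samples (synthetic)

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(expr)
    scores[k] = silhouette_score(expr, labels)

best_k = max(scores, key=scores.get)
print(f"most informative number of clusters (by silhouette): {best_k}")
```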

    attract: A Method for Identifying Core Pathways That Define Cellular Phenotypes

    attract is a knowledge-driven analytical approach for identifying and annotating the gene-sets that best discriminate between cell phenotypes. attract finds distinguishing patterns within pathways, decomposes pathways into meta-genes representative of these patterns, and then generates synexpression groups of highly correlated genes from the entire transcriptome dataset. attract can be applied to a wide range of biological systems; it is freely available as a Bioconductor package and has been incorporated into the MeV software system.
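
    A rough sketch of the final step described above: forming a synexpression group of genes that follow a pathway meta-gene. The meta-gene here is taken to be the first right singular vector of the pathway's expression submatrix, and the data, pathway indices, and correlation cutoff are illustrative assumptions, not the attract package's actual implementation or defaults.

```python
# Illustrative sketch of building a synexpression group around a pathway
# meta-gene; data, pathway membership and the cutoff are made up.
import numpy as np

rng = np.random.default_rng(1)
expr = rng.normal(size=(500, 30))          # 500 genes x 30 samples (synthetic)
pathway_genes = np.arange(20)              # indices of a hypothetical pathway

# Meta-gene: first right singular vector of the centered pathway submatrix,
# i.e. the dominant expression pattern of the pathway across samples.
sub = expr[pathway_genes]
sub = sub - sub.mean(axis=1, keepdims=True)
_, _, vt = np.linalg.svd(sub, full_matrices=False)
meta_gene = vt[0]                          # one value per sample

# Synexpression group: every gene whose profile correlates strongly with the
# meta-gene pattern (the 0.5 cutoff is an arbitrary illustrative threshold).
corr = np.array([np.corrcoef(gene, meta_gene)[0, 1] for gene in expr])
synexpression_group = np.where(np.abs(corr) > 0.5)[0]
print(f"{synexpression_group.size} genes follow the meta-gene pattern")
```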

    A framework for list representation, enabling list stabilization through incorporation of gene exchangeabilities

    Analysis of multivariate data sets from, e.g., microarray studies frequently results in lists of genes that are associated with some response of interest. The biological interpretation is often complicated by the statistical instability of the obtained gene lists with respect to sampling variations, which may partly be due to functional redundancy among genes, i.e., multiple genes can play exchangeable roles in the cell. In this paper we use the concept of exchangeability of random variables to model this functional redundancy and thereby account for the instability attributable to sampling variations. We present a flexible framework that incorporates exchangeability into the representation of lists. The proposed framework supports straightforward, robust comparison between any two lists. It can also be used to generate new, more stable gene rankings that incorporate more information from the experimental data. Using a microarray data set from lung cancer patients, we show that the proposed method provides gene rankings that are more robust to sampling variations than those of existing methods, without compromising biological significance.
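
    To make the stability issue concrete, the sketch below compares the top of two gene rankings at the level of exchangeability groups rather than individual genes, so that swapping two functionally redundant genes does not count as a disagreement. The groups, gene names, and Jaccard-style overlap measure are illustrative assumptions, not the list-representation framework the paper defines.

```python
# Illustrative sketch: compare the top-k of two gene rankings after mapping
# genes to exchangeability groups, so redundant genes that swap ranks are not
# penalized. Groups and rankings are made-up examples.
def group_overlap(rank_a, rank_b, groups, k):
    """Jaccard overlap of (group-mapped) top-k entries of two ranked lists."""
    top_a = {groups.get(g, g) for g in rank_a[:k]}
    top_b = {groups.get(g, g) for g in rank_b[:k]}
    return len(top_a & top_b) / len(top_a | top_b)

# Two rankings from hypothetical resampled datasets.
run1 = ["EGFR", "KRAS", "TP53", "MYC", "ALK"]
run2 = ["KRAS", "ERBB2", "TP53", "ALK", "CCND1"]

# Genes assumed to play exchangeable roles map to the same group label.
groups = {"EGFR": "ERBB-signaling", "ERBB2": "ERBB-signaling",
          "MYC": "MYC-targets", "CCND1": "cell-cycle"}

print("gene-level overlap :", group_overlap(run1, run2, {}, k=5))
print("group-level overlap:", group_overlap(run1, run2, groups, k=5))
```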

    A roadmap towards breast cancer therapies supported by explainable artificial intelligence

    In recent years, personalized medicine has gained increasing importance, especially in the design of oncological therapies, and the development of patient-profiling strategies promises substantial rewards. In this work, we present an explainable artificial intelligence (XAI) framework based on an adaptive dimensional reduction which (i) identifies the most important clinical features for profiling oncological patients and (ii), based on these features, determines the profile, i.e., the cluster, a patient belongs to. For this purpose, we collected a cohort of 267 breast cancer patients. The adopted dimensional reduction method determines the relevant subspace in which distances among patients are used by a hierarchical clustering procedure to identify the corresponding optimal categories. Our results demonstrate that the molecular subtype is the most important feature for clustering. We then assessed the robustness of current therapies and guidelines; our findings show a striking correspondence between the patient profiles determined in an unsupervised way and either molecular subtypes or therapies chosen according to guidelines, which guarantees the interpretability that characterizes explainable machine learning approaches. Accordingly, our work suggests the possibility of designing data-driven therapies that emphasize the differences observed among patients.
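
    A minimal sketch of the two-step pipeline described above: reduce the clinical feature space, then hierarchically cluster patients in the reduced subspace. PCA stands in for the paper's adaptive dimensional reduction, the interpretability readout is a simple loading inspection, and the data, feature names, and number of clusters are all assumptions.

```python
# Minimal sketch: dimensionality reduction followed by hierarchical clustering
# of patients. PCA stands in for the paper's adaptive reduction; the data and
# the choice of 4 clusters are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
X = rng.normal(size=(267, 12))        # 267 patients x 12 clinical features (synthetic)
feature_names = [f"clinical_feature_{i}" for i in range(12)]

# Step 1: find the relevant subspace.
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(Z)
X_red = pca.transform(Z)

# Step 2: hierarchical clustering in the reduced space defines patient profiles.
profiles = AgglomerativeClustering(n_clusters=4).fit_predict(X_red)

# Crude interpretability readout: which feature dominates the leading component.
top = feature_names[int(np.argmax(np.abs(pca.components_[0])))]
print("most influential feature on the leading component:", top)
print("patients per profile:", np.bincount(profiles))
```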

    Exploiting Semantics from Widely Available Ontologies to Aid the Model Building Process

    This dissertation addresses the changing needs of data science and analytics: making it easier to produce accurate models and opening up opportunities for novices to make sense of existing data. This work aims to incorporate the semantics of data into classical machine learning problems, as one way to tame the deluge of data. The increased availability of data and the existence of easy-to-use procedures for regression and classification in commodity software allow anyone to search for correlations amongst a large set of variables with scant regard for their meaning. Consequently, people tend to use data indiscriminately, leading to the practice of data dredging. It is easy to use sophisticated tools to produce specious models that generalize poorly and may lead to wrong conclusions. Despite much effort having been placed on advancing learning algorithms, current tools do little to shield people from using data in a semantically lax fashion. By examining the entire model building process and supplying semantic information derived from high-level knowledge in the form of an ontology, the machine can assist in exercising discretion to help the model builder avoid the pitfalls of data dredging. This work introduces a metric, called conceptual distance, to incorporate semantic information into the model building process. The conceptual distance is shown to be practically computable from large-scale existing ontologies. This metric is exploited in feature selection to enable a machine to take the semantics of features into consideration when choosing them to build a model. Experiments with ontologies and real-world datasets show that this metric selects feature subsets with performance comparable to traditional data-driven measures, despite using only the labels of the features, not the associated measurements. Further, a new end-to-end model building process is developed that uses the conceptual distance as a guideline to explore an ontological structure and retrieve relevant features automatically, making it convenient for a novice to build a semantically pertinent model. Experiments show that the proposed model building process can help a user produce a model with performance comparable to one built by a domain expert. This work offers a tool to help the common man battle the hazard of data dredging that comes from the indiscriminate use of data, yielding models that generalize better and are easier to interpret, leading to better decisions.
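
    A toy sketch of the core idea above: treat an ontology as a graph, use path length between concept labels as a "conceptual distance", and prefer candidate features whose labels sit close to the target concept. The ontology fragment, concept names, and plain shortest-path distance are illustrative assumptions, not the dissertation's actual metric or its large-scale computation.

```python
# Toy sketch: path length in a small ontology graph as a stand-in for the
# dissertation's conceptual distance, used to rank candidate features by
# closeness to the prediction target. Ontology and labels are made up.
import networkx as nx

onto = nx.Graph()
onto.add_edges_from([
    ("disease", "heart_disease"), ("disease", "diabetes"),
    ("heart_disease", "diabetes"),
    ("heart_disease", "blood_pressure"), ("heart_disease", "cholesterol"),
    ("diabetes", "blood_glucose"),
    ("lifestyle", "shoe_size"), ("lifestyle", "favorite_color"),
    ("disease", "lifestyle"),
])

def conceptual_distance(a, b, graph=onto):
    """Shortest-path length between two concept labels (inf if disconnected)."""
    try:
        return nx.shortest_path_length(graph, a, b)
    except nx.NetworkXNoPath:
        return float("inf")

target = "heart_disease"
candidates = ["cholesterol", "blood_glucose", "shoe_size", "favorite_color"]
ranked = sorted(candidates, key=lambda f: conceptual_distance(f, target))
print(ranked)   # semantically closer features come first
```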

    Unconventional machine learning of genome-wide human cancer data

    Recent advances in high-throughput genomic technologies, coupled with exponential increases in computer processing and memory, have allowed us to interrogate the complex aberrant molecular underpinnings of human disease from a genome-wide perspective. While the deluge of genomic information is expected to increase, a bottleneck in conventional high-performance computing is rapidly approaching. Inspired in part by recent advances in physical quantum processors, we evaluated several unconventional machine learning (ML) strategies on actual human tumor data. Here we show, for the first time, the efficacy of multiple annealing-based ML algorithms for classification of high-dimensional, multi-omics human cancer data from The Cancer Genome Atlas. To assess algorithm performance, we compared these classifiers to a variety of standard ML methods. Our results indicate the feasibility of using annealing-based ML to provide competitive classification of human cancer types and associated molecular subtypes, with superior performance on smaller training datasets, providing compelling empirical evidence for the potential future application of unconventional computing architectures in the biomedical sciences.
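
    The annealing-based classifiers in this study run on specialized hardware and are not reproduced here; the sketch below only shows the kind of baseline comparison the abstract mentions, several standard classifiers evaluated as the training set shrinks, on synthetic stand-in data.

```python
# Baseline-comparison sketch: standard classifiers evaluated at shrinking
# training sizes, mirroring the comparison described in the abstract. Data are
# synthetic; the annealing-based methods themselves are not implemented here.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=200, n_informative=20,
                           n_classes=2, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

models = {"logreg": LogisticRegression(max_iter=2000),
          "random_forest": RandomForestClassifier(random_state=0),
          "svm": SVC()}

for n_train in (400, 100, 40):      # progressively smaller training sets
    for name, model in models.items():
        model.fit(X_pool[:n_train], y_pool[:n_train])
        acc = model.score(X_test, y_test)
        print(f"n_train={n_train:<4} {name:<14} accuracy={acc:.2f}")
```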

    A complex network approach reveals pivotal sub-structure of genes linked to Schizophrenia

    Research on brain disorders with a strong genetic component and complex heritability, like schizophrenia and autism, has promoted the development of brain transcriptomics. This research field deals with understanding how gene-gene interactions impact the risk for heritable brain disorders. With this perspective, we developed a novel data-driven strategy for characterizing genetic modules, i.e., clusters (also called communities) of strongly interacting genes, with the aim of uncovering a pivotal module of genes and gaining biological insight into them. Our approach combined network topological properties, to highlight the presence of a pivotal community, with information theory, to assess the informativeness of partitions. The Shannon entropy of the network, based on the average betweenness of the nodes, is adopted for this purpose. We analyzed the publicly available BrainCloud dataset, containing post-mortem gene expression data, and focused on the Dopamine Receptor D2, encoded by the DRD2 gene. To parse the DRD2 community into sub-structures, we applied and compared four different community detection algorithms. A pivotal DRD2 module emerged for all procedures applied, representing a considerable reduction compared with the initial network size. A Dice index of 80% for the detected community confirmed the stability of the results over a wide range of tested parameters. The detected community was also the most informative, as it optimized the Shannon entropy. Lastly, we verified that the DRD2 community was more strongly connected to its neighborhood than any randomly selected community and than the Weighted Gene Coexpression Network Analysis (WGCNA) module, commonly considered the standard approach for these studies.
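
    A compact sketch of the scoring idea described above: compute node betweenness, aggregate it per detected community, and take the Shannon entropy of the resulting distribution to compare partitions. A random toy graph and greedy modularity stand in for the gene co-expression network and the four detection algorithms actually compared, and this entropy definition is one plausible reading of "entropy based on average betweenness", not the paper's exact formula.

```python
# Sketch: Shannon entropy of a partition computed from per-community average
# betweenness. Toy graph and greedy modularity communities stand in for the
# paper's gene network and its four detection algorithms.
import math
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.erdos_renyi_graph(n=120, p=0.06, seed=3)          # toy network
betweenness = nx.betweenness_centrality(G)
communities = greedy_modularity_communities(G)

# Average betweenness per community, normalized into a probability distribution.
avg_b = [sum(betweenness[v] for v in c) / len(c) for c in communities]
total = sum(avg_b)
p = [b / total for b in avg_b if b > 0]

entropy = -sum(pi * math.log2(pi) for pi in p)
print(f"{len(communities)} communities, partition entropy = {entropy:.3f} bits")
```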

    Algorithmic and Statistical Perspectives on Large-Scale Data Analysis

    In recent years, ideas from statistics and scientific computing have begun to interact in increasingly sophisticated and fruitful ways with ideas from computer science and the theory of algorithms to aid in the development of improved worst-case algorithms that are useful for large-scale scientific and Internet data analysis problems. In this chapter, I will describe two recent examples---one having to do with selecting good columns or features from a (DNA Single Nucleotide Polymorphism) data matrix, and the other having to do with selecting good clusters or communities from a data graph (representing a social or information network)---that drew on ideas from both areas and that may serve as a model for exploiting complementary algorithmic and statistical perspectives in order to solve applied large-scale data analysis problems.
    Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors, "Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
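
    Column selection of the kind mentioned above is commonly done with statistical leverage scores computed from the top singular vectors of the data matrix. The sketch below is a minimal, self-contained version of that idea on a random matrix standing in for a SNP matrix; it is not the specific algorithm analyzed in the chapter.

```python
# Minimal sketch of leverage-score column selection: score each column of a
# data matrix by its leverage in the top-k right singular subspace and sample
# columns proportionally. Random data stand in for a SNP matrix.
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(100, 1000))              # samples x SNP-like columns (synthetic)
k, n_select = 5, 20

# Leverage scores: squared entries of the top-k right singular vectors,
# summed per column.
_, _, vt = np.linalg.svd(A, full_matrices=False)
leverage = (vt[:k] ** 2).sum(axis=0)          # one score per column
probs = leverage / leverage.sum()

chosen = rng.choice(A.shape[1], size=n_select, replace=False, p=probs)
print("selected column indices:", np.sort(chosen))
```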