Defining an informativeness metric for clustering gene expression data
Motivation: Unsupervised ‘cluster’ analysis is an invaluable tool for exploratory microarray data analysis, as it organizes the data into groups of genes or samples whose elements share common patterns. Once the data are clustered, finding the optimal number of informative subgroups within a dataset remains a problem that, while important for understanding the underlying phenotypes, has no robust, widely accepted solution.
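The problem this abstract describes — picking the number of informative clusters — can be illustrated with a minimal sketch (not the paper's own metric; a generic within-cluster sum-of-squares "elbow" heuristic on toy 1-D data is assumed here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy "expression" values: three well-separated groups (assumed, illustrative)
data = np.concatenate([rng.normal(mu, 0.3, 50) for mu in (0.0, 5.0, 10.0)])

def kmeans_1d(x, k, iters=50):
    """Trivial 1-D k-means; enough to score candidate cluster counts."""
    centers = np.linspace(x.min(), x.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

def within_ss(x, labels, centers):
    """Within-cluster sum of squares: lower means tighter clusters."""
    return float(sum(((x[labels == j] - centers[j]) ** 2).sum()
                     for j in np.unique(labels)))

scores = {}
for k in range(1, 6):
    labels, centers = kmeans_1d(data, k)
    scores[k] = within_ss(data, labels, centers)

# "elbow": the k giving the largest relative drop in within-cluster scatter
drops = {k: scores[k - 1] / scores[k] for k in range(2, 6)}
best_k = max(drops, key=drops.get)
```

On this toy data the largest drop occurs when the three true groups are first separated, so `best_k` recovers 3; real informativeness metrics must cope with far noisier structure.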
attract: A Method for Identifying Core Pathways That Define Cellular Phenotypes
attract is a knowledge-driven analytical approach for identifying and annotating the gene sets that best discriminate between cell phenotypes. attract finds distinguishing patterns within pathways, decomposes pathways into meta-genes representative of these patterns, and then generates synexpression groups of highly correlated genes from the entire transcriptome dataset. attract can be applied to a wide range of biological systems; it is freely available as a Bioconductor package and has been incorporated into the MeV software system.
A framework for list representation, enabling list stabilization through incorporation of gene exchangeabilities
Analysis of multivariate data sets from, e.g., microarray studies frequently
results in lists of genes which are associated with some response of interest.
The biological interpretation is often complicated by the statistical
instability of the obtained gene lists with respect to sampling variations,
which may partly be due to the functional redundancy among genes, implying that
multiple genes can play exchangeable roles in the cell. In this paper we use
the concept of exchangeability of random variables to model this functional
redundancy and thereby account for the instability attributable to sampling
variations. We present a flexible framework to incorporate the exchangeability
into the representation of lists. The proposed framework supports
straightforward robust comparison between any two lists. It can also be used to
generate new, more stable gene rankings incorporating more information from the
experimental data. Using a microarray data set from lung cancer patients we
show that the proposed method provides more robust gene rankings than existing
methods with respect to sampling variations, without compromising the
biological significance.
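The instability of gene lists under sampling variation that this abstract addresses can be made concrete with a small sketch (not the authors' exchangeability framework; a plain bootstrap-overlap stability measure on synthetic data is assumed instead):

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 200
# toy expression matrix: genes 0-9 truly differential between two groups (assumed)
group = np.array([0] * 15 + [1] * 15)
X = rng.normal(0, 1, (n_genes, 30))
X[:10, group == 1] += 2.0

def top_k(X, group, k=10):
    """Rank genes by absolute mean difference; return the top-k set."""
    diff = X[:, group == 1].mean(axis=1) - X[:, group == 0].mean(axis=1)
    return set(np.argsort(-np.abs(diff))[:k])

def jaccard(a, b):
    return len(a & b) / len(a | b)

# stability: average pairwise Jaccard overlap of top-k lists across resamples
lists = []
for _ in range(20):
    idx = np.concatenate([rng.choice(np.where(group == g)[0], 15, replace=True)
                          for g in (0, 1)])
    lists.append(top_k(X[:, idx], group[idx]))
stability = float(np.mean([jaccard(a, b)
                           for i, a in enumerate(lists) for b in lists[i + 1:]]))
```

A stability near 1 means the ranked list barely changes under resampling; on noisy data with redundant genes it drops sharply, which is the phenomenon the exchangeability framework models.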
A roadmap towards breast cancer therapies supported by explainable artificial intelligence
In recent years, personalized medicine has gained increasing importance, especially in the design of oncological therapies. In particular, the development of patient-profiling strategies suggests the possibility of promising rewards. In this work, we present an explainable artificial intelligence (XAI) framework based on an adaptive dimensional reduction which (i) outlines the most important clinical features for oncological patient profiling and (ii), based on these features, determines the profile, i.e., the cluster a patient belongs to. For these purposes, we collected a cohort of 267 breast cancer patients. The adopted dimensional reduction method determines the relevant subspace in which distances among patients are used by a hierarchical clustering procedure to identify the corresponding optimal categories. Our results demonstrate that the molecular subtype is the most important feature for clustering. We then assessed the robustness of current therapies and guidelines; our findings show a striking correspondence between the patient profiles determined in an unsupervised way and either molecular subtypes or therapies chosen according to guidelines, which guarantees the interpretability characteristic of explainable approaches to machine learning. Accordingly, our work suggests the possibility of designing data-driven therapies that emphasize the differences observed among patients.
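The pipeline described — reduce dimensionality, then hierarchically cluster patients in the reduced space — can be sketched minimally (a plain PCA projection stands in for the paper's adaptive reduction, and the two synthetic "patient profiles" are assumed for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# toy patient feature matrix: two latent profiles of 20 patients each (assumed)
X = np.vstack([rng.normal(0, 1, (20, 8)), rng.normal(4, 1, (20, 8))])

# crude dimensional reduction: project onto the top two principal components
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                      # 2-D patient representation

# hierarchical (Ward) clustering in the reduced space, cut into two profiles
labels = fcluster(linkage(Z, method="ward"), t=2, criterion="maxclust")
```

Here the two recovered clusters coincide with the two planted profiles; in the paper's setting the interesting question is which clinical features drive the relevant subspace, which is where the explainability component enters.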
Exploiting Semantics from Widely Available Ontologies to Aid the Model Building Process
This dissertation attempts to address the changing needs of data science and analytics: making it easier to produce accurate models and opening up opportunities for novices to make sense of existing data. This work aims to incorporate the semantics of data in addressing classical machine learning problems, which is one way to tame the deluge of data. The increased availability of data and the existence of easy-to-use procedures for regression and classification in commodity software allow anyone to search for correlations among a large set of variables with scant regard for their meaning. Consequently, people tend to use data indiscriminately, leading to the practice of data dredging. It is easy to use sophisticated tools to produce specious models, which generalize poorly and may lead to wrong conclusions. Despite much effort devoted to advancing learning algorithms, current tools do little to shield people from using data in a semantically lax fashion. By examining the entire model building process and supplying semantic information derived from high-level knowledge in the form of an ontology, the machine can assist in exercising discretion to help the model builder avoid the pitfalls of data dredging. This work introduces a metric, called conceptual distance, to incorporate semantic information into the model building process. The conceptual distance is shown to be practically computable from large-scale existing ontologies. This metric is exploited in feature selection to enable a machine to take the semantics of features into consideration when choosing them to build a model. Experiments with ontologies and real-world datasets show that this metric selects feature subsets with performance comparable to traditional data-driven measures, despite using only the labels of features, not the associated measurements.
Further, a new end-to-end model building process is developed by using the conceptual distance as a guideline to explore an ontological structure and retrieve relevant features automatically, making it convenient for a novice to build a semantically pertinent model. Experiments show that the proposed model building process can help a user to produce a model with performance comparable to that built by a domain expert. This work offers a tool to help the common man battle the hazard of data dredging that comes from the indiscriminate use of data.
The tool results in models that generalize better and are easier to interpret, leading to better decisions.
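The core idea of a "conceptual distance" between feature labels can be sketched as a shortest-path length over an ontology graph (the tiny ontology, its concept names, and the target concept below are all hypothetical, chosen only to make the mechanism concrete; the dissertation computes this over large-scale existing ontologies):

```python
from collections import deque

# tiny hypothetical ontology: undirected edges between concept labels (assumed)
ontology = {
    "measurement": ["vital_sign", "lab_value"],
    "vital_sign": ["measurement", "heart_rate", "blood_pressure"],
    "lab_value": ["measurement", "glucose"],
    "heart_rate": ["vital_sign"],
    "blood_pressure": ["vital_sign"],
    "glucose": ["lab_value"],
    "zip_code": ["demographic"],
    "demographic": ["zip_code"],
}

def conceptual_distance(a, b):
    """Shortest-path length between two concept labels (inf if disconnected)."""
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            return d
        for nb in ontology.get(node, []):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, d + 1))
    return float("inf")

# rank candidate feature labels by semantic closeness to a target concept
target = "vital_sign"
candidates = ["glucose", "heart_rate", "zip_code"]
ranked = sorted(candidates, key=lambda c: conceptual_distance(c, target))
```

Features semantically near the target concept rank first, while an unrelated label like the hypothetical `zip_code` falls to the bottom — the mechanism by which semantics can veto a spurious but well-correlated feature.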
Unconventional machine learning of genome-wide human cancer data
Recent advances in high-throughput genomic technologies coupled with
exponential increases in computer processing and memory have allowed us to
interrogate the complex aberrant molecular underpinnings of human disease from
a genome-wide perspective. While the deluge of genomic information is expected
to increase, a bottleneck in conventional high-performance computing is rapidly
approaching. Inspired in part by recent advances in physical quantum
processors, we evaluated several unconventional machine learning (ML)
strategies on actual human tumor data. Here we show for the first time the
efficacy of multiple annealing-based ML algorithms for classification of
high-dimensional, multi-omics human cancer data from the Cancer Genome Atlas.
To assess algorithm performance, we compared these classifiers to a variety of
standard ML methods. Our results indicate the feasibility of using
annealing-based ML to provide competitive classification of human cancer types
and associated molecular subtypes and superior performance with smaller
training datasets, thus providing compelling empirical evidence for the
potential future application of unconventional computing architectures in the
biomedical sciences.
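The annealing family of methods mentioned here can be illustrated classically (this is plain simulated annealing on a toy one-feature classifier, assumed purely for illustration; the paper evaluates physical quantum-annealing hardware on real multi-omics data, which this sketch does not reproduce):

```python
import math
import random

random.seed(0)
# toy 1-D "expression" feature with two classes (assumed, illustrative)
data = [(random.gauss(0, 1), 0) for _ in range(100)] + \
       [(random.gauss(3, 1), 1) for _ in range(100)]

def error(threshold):
    """Misclassification rate of the rule: predict class 1 when x > threshold."""
    return sum((x > threshold) != bool(y) for x, y in data) / len(data)

# classical simulated annealing over the decision threshold
state, temp = 10.0, 1.0
for _ in range(2000):
    cand = state + random.gauss(0, 0.5)
    delta = error(cand) - error(state)
    # accept improvements always; accept worse moves with Boltzmann probability
    if delta < 0 or random.random() < math.exp(-delta / temp):
        state = cand
    temp *= 0.995            # geometric cooling schedule
```

The annealer starts from a poor threshold and cools into one near the optimum between the two class means; annealing-based classifiers apply the same energy-minimization idea to much higher-dimensional weight configurations.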
A complex network approach reveals pivotal sub-structure of genes linked to Schizophrenia
Research on brain disorders with a strong genetic component and complex heritability,
like schizophrenia and autism, has promoted the development of brain transcriptomics.
This research field deals with the deep understanding of how gene-gene interactions
affect risk for heritable brain disorders. With this perspective, we developed a novel
data-driven strategy for characterizing genetic modules, i.e., clusters (also called
communities) of strongly interacting genes. The aim is to uncover a pivotal module of
genes and gain biological insight into it. Our approach combined network
topological properties, to highlight the presence of a pivotal community, with
information theory, to assess the informativeness of partitions; the Shannon entropy of
the complex network, based on the average betweenness of its nodes, is adopted for this
purpose. We analyzed the publicly available BrainCloud dataset, containing
post-mortem gene expression data, and we focused on the Dopamine Receptor D2,
encoded by the DRD2 gene. To parse the DRD2 community into sub-structures, we
applied and compared four different community detection algorithms. A pivotal DRD2
module emerged for all procedures applied, and it represented a considerable reduction
compared with the initial network size. A Dice index of 80% for the detected
community confirmed the stability of the results over a wide range of tested parameters.
The detected community was also the most informative, as it represented an
optimization of the Shannon entropy. Lastly, we verified that DRD2 was more strongly
connected to its neighborhood than any other randomly selected community, and more
than the Weighted Gene Coexpression Network Analysis (WGCNA) module,
commonly considered the standard approach for these studies.
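The combination described — community detection plus a betweenness-weighted Shannon entropy of the resulting partition — can be sketched on a stand-in graph (the karate-club social network replaces the gene co-expression network, and this single greedy-modularity pass replaces the four algorithms the paper compares; both substitutions are assumptions for illustration):

```python
import math
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# illustrative graph standing in for a gene co-expression network (assumed)
G = nx.karate_club_graph()

# community detection (one of several algorithms one could compare)
communities = list(greedy_modularity_communities(G))

# Shannon entropy of the partition, weighting nodes by betweenness centrality
bc = nx.betweenness_centrality(G)
total = sum(bc.values())
probs = [sum(bc[n] for n in c) / total for c in communities]
entropy = -sum(p * math.log(p) for p in probs if p > 0)
```

A partition whose betweenness mass is spread across communities scores higher entropy; comparing this quantity across candidate partitions is one way to operationalize "most informative" as the abstract uses the term.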
Algorithmic and Statistical Perspectives on Large-Scale Data Analysis
In recent years, ideas from statistics and scientific computing have begun to
interact in increasingly sophisticated and fruitful ways with ideas from
computer science and the theory of algorithms to aid in the development of
improved worst-case algorithms that are useful for large-scale scientific and
Internet data analysis problems. In this chapter, I will describe two recent
examples---one having to do with selecting good columns or features from a (DNA
Single Nucleotide Polymorphism) data matrix, and the other having to do with
selecting good clusters or communities from a data graph (representing a social
or information network)---that drew on ideas from both areas and that may serve
as a model for exploiting complementary algorithmic and statistical
perspectives in order to solve applied large-scale data analysis problems.Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors,
"Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
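The first example mentioned — selecting good columns from a data matrix — is commonly done with leverage scores computed from the top singular vectors; a minimal sketch on synthetic data (the planted "informative" columns are an assumption for illustration, not the chapter's SNP data):

```python
import numpy as np

rng = np.random.default_rng(3)
# toy data matrix: 2 strong "informative" columns plus 18 near-noise columns (assumed)
n, d, k = 100, 20, 2
basis = rng.normal(0, 1, (n, k))
A = np.hstack([basis, 0.01 * rng.normal(0, 1, (n, d - k))])

# leverage scores: squared mass of each column in the top-k right singular vectors
_, _, Vt = np.linalg.svd(A, full_matrices=False)
leverage = (Vt[:k] ** 2).sum(axis=0)          # one score per column
top_cols = np.argsort(-leverage)[:k]
```

Sampling columns with probability proportional to these scores is the statistical ingredient that, combined with algorithmic guarantees, yields the improved column-selection methods the chapter surveys.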