Defining an informativeness metric for clustering gene expression data
Motivation: Unsupervised ‘cluster’ analysis is an invaluable tool for exploratory microarray data analysis, as it organizes the data into groups of genes or samples whose elements share common patterns. Once the data are clustered, finding the optimal number of informative subgroups within a dataset remains a problem that, while important for understanding the underlying phenotypes, has no robust, widely accepted solution.
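The problem this abstract describes — picking the number of informative clusters — can be illustrated with a minimal sketch (not the paper's own metric; a generic within-cluster sum-of-squares "elbow" heuristic on toy 1-D data is assumed here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy "expression" values: three well-separated groups (assumed, illustrative)
data = np.concatenate([rng.normal(mu, 0.3, 50) for mu in (0.0, 5.0, 10.0)])

def kmeans_1d(x, k, iters=50):
    """Trivial 1-D k-means; enough to score candidate cluster counts."""
    centers = np.linspace(x.min(), x.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

def within_ss(x, labels, centers):
    """Within-cluster sum of squares: lower means tighter clusters."""
    return float(sum(((x[labels == j] - centers[j]) ** 2).sum()
                     for j in np.unique(labels)))

scores = {}
for k in range(1, 6):
    labels, centers = kmeans_1d(data, k)
    scores[k] = within_ss(data, labels, centers)

# "elbow": the k giving the largest relative drop in within-cluster scatter
drops = {k: scores[k - 1] / scores[k] for k in range(2, 6)}
best_k = max(drops, key=drops.get)
```

On this toy data the largest drop occurs when the three true groups are first separated, so `best_k` recovers 3; real informativeness metrics must cope with far noisier structure.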
attract: A Method for Identifying Core Pathways That Define Cellular Phenotypes
attract is a knowledge-driven analytical approach for identifying and annotating the gene sets that best discriminate between cell phenotypes. attract finds distinguishing patterns within pathways, decomposes pathways into meta-genes representative of these patterns, and then generates synexpression groups of highly correlated genes from the entire transcriptome dataset. attract can be applied to a wide range of biological systems; it is freely available as a Bioconductor package and has been incorporated into the MeV software system.
A framework for list representation, enabling list stabilization through incorporation of gene exchangeabilities
Analysis of multivariate data sets from, e.g., microarray studies frequently
results in lists of genes which are associated with some response of interest.
The biological interpretation is often complicated by the statistical
instability of the obtained gene lists with respect to sampling variations,
which may partly be due to the functional redundancy among genes, implying that
multiple genes can play exchangeable roles in the cell. In this paper we use
the concept of exchangeability of random variables to model this functional
redundancy and thereby account for the instability attributable to sampling
variations. We present a flexible framework to incorporate the exchangeability
into the representation of lists. The proposed framework supports
straightforward robust comparison between any two lists. It can also be used to
generate new, more stable gene rankings incorporating more information from the
experimental data. Using a microarray data set from lung cancer patients we
show that the proposed method provides more robust gene rankings than existing
methods with respect to sampling variations, without compromising the
biological significance.
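The instability of gene lists under sampling variation that this abstract addresses can be made concrete with a small sketch (not the authors' exchangeability framework; a plain bootstrap-overlap stability measure on synthetic data is assumed instead):

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 200
# toy expression matrix: genes 0-9 truly differential between two groups (assumed)
group = np.array([0] * 15 + [1] * 15)
X = rng.normal(0, 1, (n_genes, 30))
X[:10, group == 1] += 2.0

def top_k(X, group, k=10):
    """Rank genes by absolute mean difference; return the top-k set."""
    diff = X[:, group == 1].mean(axis=1) - X[:, group == 0].mean(axis=1)
    return set(np.argsort(-np.abs(diff))[:k])

def jaccard(a, b):
    return len(a & b) / len(a | b)

# stability: average pairwise Jaccard overlap of top-k lists across resamples
lists = []
for _ in range(20):
    idx = np.concatenate([rng.choice(np.where(group == g)[0], 15, replace=True)
                          for g in (0, 1)])
    lists.append(top_k(X[:, idx], group[idx]))
stability = float(np.mean([jaccard(a, b)
                           for i, a in enumerate(lists) for b in lists[i + 1:]]))
```

A stability near 1 means the ranked list barely changes under resampling; on noisy data with redundant genes it drops sharply, which is the phenomenon the exchangeability framework models.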
A roadmap towards breast cancer therapies supported by explainable artificial intelligence
In recent years, personalized medicine has gained increasing importance, especially in the design of oncological therapies. In particular, the development of patient-profiling strategies suggests the possibility of promising rewards. In this work, we present an explainable artificial intelligence (XAI) framework based on an adaptive dimensional reduction which (i) outlines the most important clinical features for oncological patient profiling and (ii), based on these features, determines the profile, i.e., the cluster a patient belongs to. For these purposes, we collected a cohort of 267 breast cancer patients. The adopted dimensional reduction method determines the relevant subspace in which distances among patients are used by a hierarchical clustering procedure to identify the corresponding optimal categories. Our results demonstrate that the molecular subtype is the most important feature for clustering. We then assessed the robustness of current therapies and guidelines; our findings show a striking correspondence between the patient profiles determined in an unsupervised way and either molecular subtypes or therapies chosen according to guidelines, which guarantees the interpretability characteristic of explainable approaches to machine learning. Accordingly, our work suggests the possibility of designing data-driven therapies that emphasize the differences observed among patients.
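The pipeline described — reduce dimensionality, then hierarchically cluster patients in the reduced space — can be sketched minimally (a plain PCA projection stands in for the paper's adaptive reduction, and the two synthetic "patient profiles" are assumed for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# toy patient feature matrix: two latent profiles of 20 patients each (assumed)
X = np.vstack([rng.normal(0, 1, (20, 8)), rng.normal(4, 1, (20, 8))])

# crude dimensional reduction: project onto the top two principal components
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                      # 2-D patient representation

# hierarchical (Ward) clustering in the reduced space, cut into two profiles
labels = fcluster(linkage(Z, method="ward"), t=2, criterion="maxclust")
```

Here the two recovered clusters coincide with the two planted profiles; in the paper's setting the interesting question is which clinical features drive the relevant subspace, which is where the explainability component enters.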
Exploiting Semantics from Widely Available Ontologies to Aid the Model Building Process
This dissertation attempts to address the changing needs of data science and analytics: making it easier to produce accurate models and opening up opportunities for novices to make sense of existing data. This work aims to incorporate the semantics of data in addressing classical machine learning problems, which is one way to tame the deluge of data. The increased availability of data and the existence of easy-to-use procedures for regression and classification in commodity software allow anyone to search for correlations among a large set of variables with scant regard for their meaning. Consequently, people tend to use data indiscriminately, leading to the practice of data dredging. It is easy to use sophisticated tools to produce specious models, which generalize poorly and may lead to wrong conclusions. Despite much effort devoted to advancing learning algorithms, current tools do little to shield people from using data in a semantically lax fashion. By examining the entire model building process and supplying semantic information derived from high-level knowledge in the form of an ontology, the machine can assist in exercising discretion to help the model builder avoid the pitfalls of data dredging. This work introduces a metric, called conceptual distance, to incorporate semantic information into the model building process. The conceptual distance is shown to be practically computable from large-scale existing ontologies. This metric is exploited in feature selection to enable a machine to take the semantics of features into consideration when choosing them to build a model. Experiments with ontologies and real-world datasets show that this metric selects feature subsets with performance comparable to traditional data-driven measures, despite using only the labels of features, not the associated measurements.
Further, a new end-to-end model building process is developed by using the conceptual distance as a guideline to explore an ontological structure and retrieve relevant features automatically, making it convenient for a novice to build a semantically pertinent model. Experiments show that the proposed model building process can help a user to produce a model with performance comparable to that built by a domain expert. This work offers a tool to help the common man battle the hazard of data dredging that comes from the indiscriminate use of data.
The tool results in models that generalize better and are easier to interpret, leading to better decisions.
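The core idea of a "conceptual distance" between feature labels can be sketched as a shortest-path length over an ontology graph (the tiny ontology, its concept names, and the target concept below are all hypothetical, chosen only to make the mechanism concrete; the dissertation computes this over large-scale existing ontologies):

```python
from collections import deque

# tiny hypothetical ontology: undirected edges between concept labels (assumed)
ontology = {
    "measurement": ["vital_sign", "lab_value"],
    "vital_sign": ["measurement", "heart_rate", "blood_pressure"],
    "lab_value": ["measurement", "glucose"],
    "heart_rate": ["vital_sign"],
    "blood_pressure": ["vital_sign"],
    "glucose": ["lab_value"],
    "zip_code": ["demographic"],
    "demographic": ["zip_code"],
}

def conceptual_distance(a, b):
    """Shortest-path length between two concept labels (inf if disconnected)."""
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            return d
        for nb in ontology.get(node, []):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, d + 1))
    return float("inf")

# rank candidate feature labels by semantic closeness to a target concept
target = "vital_sign"
candidates = ["glucose", "heart_rate", "zip_code"]
ranked = sorted(candidates, key=lambda c: conceptual_distance(c, target))
```

Features semantically near the target concept rank first, while an unrelated label like the hypothetical `zip_code` falls to the bottom — the mechanism by which semantics can veto a spurious but well-correlated feature.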
Unconventional machine learning of genome-wide human cancer data
Recent advances in high-throughput genomic technologies coupled with
exponential increases in computer processing and memory have allowed us to
interrogate the complex aberrant molecular underpinnings of human disease from
a genome-wide perspective. While the deluge of genomic information is expected
to increase, a bottleneck in conventional high-performance computing is rapidly
approaching. Inspired in part by recent advances in physical quantum
processors, we evaluated several unconventional machine learning (ML)
strategies on actual human tumor data. Here we show for the first time the
efficacy of multiple annealing-based ML algorithms for classification of
high-dimensional, multi-omics human cancer data from the Cancer Genome Atlas.
To assess algorithm performance, we compared these classifiers to a variety of
standard ML methods. Our results indicate the feasibility of using
annealing-based ML to provide competitive classification of human cancer types
and associated molecular subtypes and superior performance with smaller
training datasets, thus providing compelling empirical evidence for the
potential future application of unconventional computing architectures in the
biomedical sciences.
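The annealing family of methods mentioned here can be illustrated classically (this is plain simulated annealing on a toy one-feature classifier, assumed purely for illustration; the paper evaluates physical quantum-annealing hardware on real multi-omics data, which this sketch does not reproduce):

```python
import math
import random

random.seed(0)
# toy 1-D "expression" feature with two classes (assumed, illustrative)
data = [(random.gauss(0, 1), 0) for _ in range(100)] + \
       [(random.gauss(3, 1), 1) for _ in range(100)]

def error(threshold):
    """Misclassification rate of the rule: predict class 1 when x > threshold."""
    return sum((x > threshold) != bool(y) for x, y in data) / len(data)

# classical simulated annealing over the decision threshold
state, temp = 10.0, 1.0
for _ in range(2000):
    cand = state + random.gauss(0, 0.5)
    delta = error(cand) - error(state)
    # accept improvements always; accept worse moves with Boltzmann probability
    if delta < 0 or random.random() < math.exp(-delta / temp):
        state = cand
    temp *= 0.995            # geometric cooling schedule
```

The annealer starts from a poor threshold and cools into one near the optimum between the two class means; annealing-based classifiers apply the same energy-minimization idea to much higher-dimensional weight configurations.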
A complex network approach reveals pivotal sub-structure of genes linked to Schizophrenia
Research on brain disorders with a strong genetic component and complex heritability,
like schizophrenia and autism, has promoted the development of brain transcriptomics.
This research field deals with the deep understanding of how gene-gene interactions
affect risk for heritable brain disorders. With this perspective, we developed a novel
data-driven strategy for characterizing genetic modules, i.e., clusters (also called
communities) of strongly interacting genes. The aim is to uncover a pivotal module of
genes and gain biological insight into it. Our approach combined network
topological properties, to highlight the presence of a pivotal community, with
information theory, to assess the informativeness of partitions; the Shannon entropy of
the complex network, based on the average betweenness of its nodes, is adopted for this
purpose. We analyzed the publicly available BrainCloud dataset, containing
post-mortem gene expression data, and we focused on the Dopamine Receptor D2,
encoded by the DRD2 gene. To parse the DRD2 community into sub-structures, we
applied and compared four different community detection algorithms. A pivotal DRD2
module emerged for all procedures applied, and it represented a considerable reduction
compared with the initial network size. A Dice index of 80% for the detected
community confirmed the stability of the results over a wide range of tested parameters.
The detected community was also the most informative, as it represented an
optimization of the Shannon entropy. Lastly, we verified that DRD2 was more strongly
connected to its neighborhood than any other randomly selected community, and more
than the Weighted Gene Coexpression Network Analysis (WGCNA) module,
commonly considered the standard approach for these studies.
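The combination described — community detection plus a betweenness-weighted Shannon entropy of the resulting partition — can be sketched on a stand-in graph (the karate-club social network replaces the gene co-expression network, and this single greedy-modularity pass replaces the four algorithms the paper compares; both substitutions are assumptions for illustration):

```python
import math
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# illustrative graph standing in for a gene co-expression network (assumed)
G = nx.karate_club_graph()

# community detection (one of several algorithms one could compare)
communities = list(greedy_modularity_communities(G))

# Shannon entropy of the partition, weighting nodes by betweenness centrality
bc = nx.betweenness_centrality(G)
total = sum(bc.values())
probs = [sum(bc[n] for n in c) / total for c in communities]
entropy = -sum(p * math.log(p) for p in probs if p > 0)
```

A partition whose betweenness mass is spread across communities scores higher entropy; comparing this quantity across candidate partitions is one way to operationalize "most informative" as the abstract uses the term.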
Algorithmic and Statistical Perspectives on Large-Scale Data Analysis
In recent years, ideas from statistics and scientific computing have begun to
interact in increasingly sophisticated and fruitful ways with ideas from
computer science and the theory of algorithms to aid in the development of
improved worst-case algorithms that are useful for large-scale scientific and
Internet data analysis problems. In this chapter, I will describe two recent
examples---one having to do with selecting good columns or features from a (DNA
Single Nucleotide Polymorphism) data matrix, and the other having to do with
selecting good clusters or communities from a data graph (representing a social
or information network)---that drew on ideas from both areas and that may serve
as a model for exploiting complementary algorithmic and statistical
perspectives in order to solve applied large-scale data analysis problems.Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors,
"Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
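The first example mentioned — selecting good columns from a data matrix — is commonly done with leverage scores computed from the top singular vectors; a minimal sketch on synthetic data (the planted "informative" columns are an assumption for illustration, not the chapter's SNP data):

```python
import numpy as np

rng = np.random.default_rng(3)
# toy data matrix: 2 strong "informative" columns plus 18 near-noise columns (assumed)
n, d, k = 100, 20, 2
basis = rng.normal(0, 1, (n, k))
A = np.hstack([basis, 0.01 * rng.normal(0, 1, (n, d - k))])

# leverage scores: squared mass of each column in the top-k right singular vectors
_, _, Vt = np.linalg.svd(A, full_matrices=False)
leverage = (Vt[:k] ** 2).sum(axis=0)          # one score per column
top_cols = np.argsort(-leverage)[:k]
```

Sampling columns with probability proportional to these scores is the statistical ingredient that, combined with algorithmic guarantees, yields the improved column-selection methods the chapter surveys.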