141 research outputs found

    Techniques for clustering gene expression data

    Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications that recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.

    Learning predictive models from temporal three-way data using triclustering: applications in clinical data analysis

    Master's thesis, Data Science, Universidade de Lisboa, Faculdade de Ciências, 2020. Triclustering extends biclustering to three-dimensional space, aiming to find coherent subspaces in three-way data (sets of objects described by subsets of features in a subset of contexts). When the context is time, the need to learn interesting temporal patterns and use them to learn effective and interpretable predictive models calls for new methodologies for three-way data analysis. In this work, we propose two approaches to learn predictive models from three-way data: 1) a triclustering-based classifier (considering only temporal data) and 2) a mixture of biclustering (with static data) and triclustering (with temporal data). In the first approach, we find the best triclustering parameters to uncover the best triclusters (sets of objects with a coherent pattern along a set of time points) and then use these patterns as features in a state-of-the-art classifier. In the case of temporal data, we propose to couple the classifier with a temporal triclustering approach. With this aim, we devised a temporally constrained triclustering algorithm, termed TCtriCluster, to mine time-contiguous triclusters. In the second approach, we extended the triclustering-based classifier with a biclustering task, in which biclusters are discovered in static data (data that do not change over time) and integrated with the triclusters to improve performance and model explainability. The proposed methodologies were used to predict the need for non-invasive ventilation (NIV) in patients with Amyotrophic Lateral Sclerosis (ALS). In this case study, we learnt a general prognostic model from all patients' data, and specialized models after stratifying patients into Slow, Neutral and Fast progressors. Our results show that, besides achieving results comparable to and sometimes better than those of a high-performing random forest classifier, our predictive models enhance prediction with the benefits of model interpretability. Indeed, by using triclusters (and biclusters) as predictors, we promote the use of highly interpretable disease progression patterns. Furthermore, when used for prognostic prediction in ALS, our interpretable predictive models unravelled clinically relevant and group-specific disease progression patterns, helping clinicians to understand the high heterogeneity of ALS progression. The results further show that the temporal restriction improves the effectiveness of the predictive models.
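
    The notion of a time-contiguous tricluster mentioned in this abstract can be illustrated with a minimal sketch. This is not the TCtriCluster algorithm itself; it only shows the contiguity constraint on a tricluster's time dimension, assuming time points are given as integer indices. The function names are hypothetical.

```python
def is_time_contiguous(time_points):
    """Return True when the time indices form one unbroken run.

    Assumption: time points are integer indices (0, 1, 2, ...).
    """
    ordered = sorted(set(time_points))
    return bool(ordered) and ordered[-1] - ordered[0] + 1 == len(ordered)


def contiguous_runs(time_points):
    """Split a set of time indices into maximal contiguous runs."""
    runs = []
    for t in sorted(set(time_points)):
        if runs and t == runs[-1][-1] + 1:
            runs[-1].append(t)  # extends the current run
        else:
            runs.append([t])    # gap found: start a new run
    return runs


if __name__ == "__main__":
    print(is_time_contiguous([2, 3, 4]))
    print(is_time_contiguous([1, 3, 4]))
    print(contiguous_runs([1, 3, 4, 7]))
```

    Under this constraint, a candidate pattern whose time points contain a gap would either be rejected or split into its contiguous runs; which of the two the thesis does is not stated in the abstract.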

    Asterias: a parallelized web-based suite for the analysis of expression and aCGH data

    Asterias (http://www.asterias.info) is an integrated collection of freely accessible web tools for the analysis of gene expression and aCGH data. Most of the tools use parallel computing (via MPI). Most of our applications allow the user to obtain additional information for user-selected genes by using clickable links in tables and/or figures. Our tools include: normalization of expression and aCGH data; converting between different types of gene/clone and protein identifiers; filtering and imputation; finding differentially expressed genes related to patient class and survival data; searching for models of class prediction; using random forests to search for minimal models for class prediction or for large subsets of genes with predictive capacity; searching for molecular signatures and predictive genes with survival data; detecting regions of genomic DNA gain or loss. The capability to send results between different applications, access to additional functional information, and parallelized computation make our suite unique and exploit features only available to web-based applications. Comment: web-based application; 3 figures.

    biclustermd: An R Package for Biclustering with Missing Values

    Biclustering is a statistical learning technique that attempts to find homogeneous partitions of the rows and columns of a data matrix. For example, movie ratings might be biclustered to group both raters and movies. biclust is a current R package that allows users to apply a variety of biclustering algorithms. However, its algorithms do not allow the data matrix to have missing values. We provide a new R package, biclustermd, which allows users to perform biclustering on numeric data even in the presence of missing values.
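
    To make the idea of a row/column partition with missing values concrete, here is a small Python sketch. It is not the biclustermd algorithm or its API; it only assumes that a row partition and a column partition are already given, and computes the mean of the observed entries in each block while skipping missing ones (represented as NaN). The function name and the toy ratings are hypothetical.

```python
import math


def block_means(matrix, row_labels, col_labels):
    """Mean of the observed (non-NaN) entries in each (row group, column group) block."""
    sums, counts = {}, {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if math.isnan(value):
                continue  # missing entries contribute nothing
            key = (row_labels[i], col_labels[j])
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


nan = float("nan")
# Toy data: 3 raters (rows) x 4 movies (columns), with some ratings missing.
ratings = [
    [5.0, 4.0, nan, 1.0],
    [4.0, nan, 1.0, 2.0],
    [1.0, 2.0, 5.0, nan],
]
# Hypothetical partition: raters {0, 1} vs {2}; movies {0, 1} vs {2, 3}.
means = block_means(ratings, row_labels=[0, 0, 1], col_labels=[0, 0, 1, 1])

if __name__ == "__main__":
    for block, mean in sorted(means.items()):
        print(block, round(mean, 3))
```

    A biclustering procedure would search over such partitions for one whose blocks are homogeneous; the sketch shows only how a block summary can be computed without first imputing the missing cells.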

    Construction of gene regulatory networks using biclustering and bayesian networks

    Background: Understanding gene interactions in complex living systems can be seen as the ultimate goal of the systems biology revolution. Hence, to elucidate disease ontology fully and to reduce the cost of drug development, gene regulatory networks (GRNs) have to be constructed. During the last decade, many GRN inference algorithms based on genome-wide data have been developed to unravel the complexity of gene regulation. Time series transcriptomic data measured by genome-wide DNA microarrays are traditionally used for GRN modelling. One of the major problems with microarrays is that a dataset consists of relatively few time points with respect to the large number of genes. Dimensionality is one of the interesting problems in GRN modelling. Results: In this paper, we develop a biclustering function enrichment analysis toolbox (BicAT-plus) to study the effect of biclustering in reducing data dimensions. The network generated from our system was validated via available interaction databases and was compared with previous methods. The results revealed the performance of our proposed method. Conclusions: Because of the sparse nature of GRNs, the results of biclustering techniques differ significantly from those of previous methods.

    Biclustering electronic health records to unravel disease presentation patterns

    Master's thesis, Data Science, Universidade de Lisboa, Faculdade de Ciências, 2019. Amyotrophic Lateral Sclerosis (ALS) is a heterogeneous neurodegenerative disease with highly variable presentation patterns. Given the heterogeneous nature of ALS patients, and aiming at a better prognosis, clinicians usually estimate disease progression at diagnosis using the rate of decay computed from the Revised ALS Functional Rating Scale (ALSFRS-R). In this context, Machine Learning models able to unravel the complexity of disease presentation patterns are paramount for understanding the disease, with the goal of improved patient care and longer survival times. Furthermore, explainable models are vital, since clinicians must be able to understand the reasoning behind a given model's result before making a decision that can impact a patient's life. We therefore aim to unravel disease presentation patterns by proposing a new Data Mining approach called Discriminative Meta-features Discovery (DMD), which combines Biclustering, Biclustering-based Classification and Class Association Rule Mining. These patterns (called Meta-features) are composed of discriminative subsets of features together with their values, making it possible to distinguish and characterize subgroups of patients with similar disease presentation patterns. The Electronic Health Record (EHR) data used in this work come from the JPND ONWebDUALS (ONTology-based Web Database for Understanding Amyotrophic Lateral Sclerosis) dataset, which comprises standardized questionnaire answers regarding risk factors, genetic mutations, clinical features and survival information from a cohort of patients and controls from ENCALS (European Network to Cure ALS), a consortium of several European countries, including Portugal. In this work, the proposed methodology was applied to the Portuguese ONWebDUALS EHR data to find disease presentation patterns that 1) distinguish ALS patients from their controls and 2) characterize groups of ALS patients with different progression rates (categorized into Slow, Neutral and Fast groups). No clear pattern emerged from the experiments performed for the first task. However, in the second task, the patterns found for each of the three progression groups were recognized and validated by expert ALS clinicians as relevant characteristics of slow, neutral and fast progressing patients. These results suggest that our generic Biclustering approach is a promising way to unravel disease presentation patterns and could be applied to similar problems and other diseases.

    Relating gene expression data on two-component systems to functional annotations in Escherichia coli

    Background: Obtaining physiological insights from microarray experiments requires computational techniques that relate gene expression data to functional information. Traditionally, this has been done in two consecutive steps. The first step identifies important genes through clustering or statistical techniques, while the second step assigns biological functions to the identified groups. Recently, techniques have been developed that identify such relationships in a single step. Results: We have developed an algorithm that relates patterns of gene expression in a set of microarray experiments to functional groups in one step. Our only assumption is that patterns co-occur frequently. The effectiveness of the algorithm is demonstrated as part of a study of regulation by two-component systems in Escherichia coli. The significance of the relationships between expression data and functional annotations is evaluated based on density histograms that are constructed using product similarity among expression vectors. We present a biological analysis of three of the resulting functional groups of proteins, develop hypotheses for further biological studies, and test one of these hypotheses experimentally. A comparison with other algorithms and a different data set is presented. Conclusion: Our new algorithm is able to find interesting and biologically meaningful relationships, not found by other algorithms, in previously analyzed data sets. Scaling of the algorithm to large data sets can be achieved based on a theoretical model.