25 research outputs found

    Biclustering on expression data: A review

    Biclustering has become a popular technique for the study of gene expression data, especially for discovering functionally related gene sets under different subsets of experimental conditions. Most biclustering approaches use a measure or cost function that determines the quality of biclusters. In such cases, developing both a suitable heuristic and a good measure for guiding the search is essential for discovering interesting biclusters in an expression matrix. Nevertheless, not all existing biclustering approaches base their search on evaluation measures for biclusters. There exists a diverse set of biclustering tools that follow different strategies and algorithmic concepts to guide the search towards meaningful results. In this paper we present an extensive survey of biclustering approaches, classifying them into two categories according to whether or not they use evaluation metrics within the search method: biclustering algorithms based on evaluation measures and non-metric-based biclustering algorithms. In both cases, the algorithms are further classified according to the type of meta-heuristic on which they are based. Funding: Ministerio de Economía y Competitividad TIN2011-2895

    DNA Microarray Data Analysis: A New Survey on Biclustering

    Some subsets of genes behave similarly under certain subsets of conditions, in which case we say that they are coexpressed, but behave independently under other subsets of conditions. Discovering such coexpression can help uncover genomic knowledge such as gene networks or gene interactions. It is therefore important to cluster genes and conditions simultaneously in order to identify clusters of genes that are coexpressed under clusters of conditions. This type of clustering is called biclustering. Biclustering is an NP-hard problem. Consequently, heuristic algorithms are typically used to approximate it by finding suboptimal solutions. In this paper, we present a new survey on biclustering of gene expression data, also called microarray data.

    Unsupervised Algorithms for Microarray Sample Stratification

    The amount of data made available by microarrays gives researchers the opportunity to delve into the complexity of biological systems. However, the noisy and extremely high-dimensional nature of this kind of data poses significant challenges. Microarrays allow for the parallel measurement of thousands of molecular objects spanning different layers of interactions. A wide range of analytical techniques has been proposed to discover hidden patterns in such data. Here, we describe the basic methodologies for analyzing microarray datasets, focusing on the task of (sub)group discovery. Peer reviewed.

    A biclustering algorithm based on a Bicluster Enumeration Tree: application to DNA microarray data

    Background: In a number of domains, like in DNA microarray data analysis, we need to cluster simultaneously rows (genes) and columns (conditions) of a data matrix to identify groups of rows coherent with groups of columns. This kind of clustering is called biclustering. Biclustering algorithms are extensively used in DNA microarray data analysis. More effective biclustering algorithms are highly desirable and needed. Methods: We introduce BiMine, a new enumeration algorithm for biclustering of DNA microarray data. The proposed algorithm is based on three original features. First, BiMine relies on a new evaluation function called Average Spearman's rho (ASR). Second, BiMine uses a new tree structure, called Bicluster Enumeration Tree (BET), to represent the different biclusters discovered during the enumeration process. Third, to avoid the combinatorial explosion of the search tree, BiMine introduces a parametric rule that allows the enumeration process to cut tree branches that cannot lead to good biclusters. Results: The performance of the proposed algorithm is assessed using both synthetic and real DNA microarray data. The experimental results show that BiMine competes well with several other biclustering methods. Moreover, we test the biological significance using a gene annotation web-tool to show that our proposed method is able to produce biologically relevant biclusters. The software is available upon request from the authors to academic users.
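
The abstract above names an evaluation function based on Spearman's rho but does not define it. As a rough illustration only, the sketch below scores a candidate bicluster (a submatrix) by the mean pairwise Spearman correlation among its rows and among its columns, and takes the larger of the two. The function names, the use of the maximum, and the no-ties assumption are choices made for this example, not BiMine's published definition.

```python
import numpy as np

def rank_rows(m):
    # Ordinal rank of each value within its row (0 = smallest).
    # Assumption: no tied values within a row.
    return np.argsort(np.argsort(m, axis=1), axis=1).astype(float)

def mean_pairwise_spearman(m):
    # Spearman's rho is the Pearson correlation of the ranks.
    ranks = rank_rows(np.asarray(m, dtype=float))
    corr = np.corrcoef(ranks)
    upper = np.triu_indices(corr.shape[0], k=1)  # each pair of rows once
    return float(corr[upper].mean())

def row_column_coherence(sub):
    # Hypothetical score: the better of row-wise and column-wise coherence.
    sub = np.asarray(sub, dtype=float)
    return max(mean_pairwise_spearman(sub), mean_pairwise_spearman(sub.T))

# Three "genes" that all increase across four "conditions".
candidate = [[1, 2, 3, 4],
             [2, 4, 6, 9],
             [0, 1, 5, 7]]
score = row_column_coherence(candidate)
```

A score near 1 means the rows (or columns) rise and fall together even when their absolute levels differ, which is the kind of coherence a rank-based measure is meant to capture.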

    User-Specific Bicluster-based Collaborative Filtering

    Master's thesis, Data Science, Universidade de Lisboa, Faculdade de Ciências, 2020. Collaborative Filtering is one of the most popular and successful approaches for Recommender Systems. However, some challenges limit the effectiveness of Collaborative Filtering approaches when dealing with recommendation data, mainly due to the vast amounts of data and their sparse nature. In order to improve the scalability and performance of Collaborative Filtering approaches, several authors have proposed successful approaches combining Collaborative Filtering with clustering techniques. In this work, we study the effectiveness of biclustering, an advanced clustering technique that groups rows and columns simultaneously, in Collaborative Filtering. When applied to the classic U-I interaction matrices, biclustering considers the duality relations between users and items, creating clusters of users who are similar under a particular group of items. We propose USBCF, a novel biclustering-based Collaborative Filtering approach that creates user-specific models to improve the scalability of traditional CF approaches. Using a real-world dataset, we conduct a set of experiments to objectively evaluate the performance of the proposed approach, comparing it against baseline and state-of-the-art Collaborative Filtering methods. Our results show that the proposed approach overcomes the main limitation of the previously proposed state-of-the-art biclustering-based Collaborative Filtering (BBCF) approach, which can only output predictions for a small subset of the system's users and items (lack of coverage). Moreover, USBCF produces rating predictions with quality comparable to the state-of-the-art approaches.

    Genomic integrative analysis to improve fusion transcript detection, liquid association and biclustering

    More data provide more possibilities. The growing volume of genomic data provides new perspectives for understanding complex biological problems. Many algorithms have been developed for single-study analysis; however, their results are unstable for small sample sizes or are overwhelmed by study-specific signals. Taking advantage of high-throughput genomic data from multiple cohorts, in this dissertation we detect novel fusion transcripts, explore complex gene regulation, and discover disease subtypes within an integrative analysis framework. In the first project, we evaluated 15 fusion transcript detection tools for paired-end RNA-seq data. Though no single method clearly outperformed the others, several top tools were selected according to their F-measures. We further developed a fusion meta-caller algorithm that combines the top methods to re-prioritize candidate fusion transcripts. The results showed that our meta-caller balances precision and recall better than any single fusion detection tool. In the second project, we extended liquid association to two meta-analytic frameworks (MetaLA and MetaMLA). Liquid association is the dynamic gene-gene correlation that depends on the expression level of a third gene. MetaLA and MetaMLA provided stronger detection signals and more consistent and stable results than single-study analysis. When we applied our method to five yeast datasets related to environmental changes, genes in the top triplets were highly enriched in fundamental biological processes corresponding to environmental changes. In the third project, we extended the plaid model from single-study analysis to multiple cohorts for bicluster detection. Our meta-biclustering algorithm discovers biclusters with higher Jaccard accuracy under large noise and small sample sizes. We also introduced the concept of the gap statistic for pruning parameter estimation.
In addition, biclusters detected from five breast cancer mRNA expression cohorts select genes highly associated with many breast cancer related pathways and split samples into groups with significantly different survival behaviors. In conclusion, we improved fusion transcript detection, liquid association analysis, and bicluster discovery through integrative analysis frameworks. These results provide strong evidence of gene fusion structural variation, three-way gene regulation, and disease subtypes, and thus ultimately contribute to a better understanding of complex disease mechanisms.
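
The abstract above reports "Jaccard accuracy" for recovered biclusters without saying how it is computed. A minimal sketch of one common way to compare a detected bicluster with a known one is given below: treat each bicluster as the set of matrix cells it covers and take the Jaccard index of the two sets. The helper names and the cell-level comparison are assumptions for illustration, not necessarily the dissertation's procedure.

```python
def bicluster_cells(rows, cols):
    # A bicluster covers every (row, column) cell in the cross product.
    return {(r, c) for r in rows for c in cols}

def jaccard(a, b):
    # Jaccard index: |intersection| / |union|.
    # Defined as 0.0 here when both sets are empty (a convention for this sketch).
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical planted bicluster versus a detected one sharing two of three rows.
truth = bicluster_cells(rows={0, 1, 2}, cols={0, 1})
found = bicluster_cells(rows={1, 2, 3}, cols={0, 1})
overlap = jaccard(truth, found)  # 4 shared cells out of 8 covered cells
```

A value of 1 means the detected bicluster matches the planted one exactly, and values near 0 mean the two barely overlap.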

    Pathway and Network Analysis of Transcriptomic and Genomic Data

    Department of Biological Sciences. The development of high-throughput technologies has enabled the production of omics data and facilitated the systemic analysis of biomolecules in cells. In addition, thanks to the vast amount of knowledge in molecular biology accumulated over decades, numerous biological pathways have been categorized as gene-sets. Using these omics data and pre-defined gene-sets, pathway analysis identifies genes that are collectively altered at the gene-set level under a phenotype. It aids the biological interpretation of the phenotype and finds phenotype-related genes that are not detected by single-gene approaches. High-throughput technologies have also contributed to the construction of various biological networks, such as protein-protein interaction (PPI) networks, metabolic and cell signaling networks, gene-regulatory networks, and gene co-expression networks. Using these networks, we can visualize the relationships among gene-set members, find hub genes, or infer new biological regulatory modules. This thesis describes three approaches to enhance the performance of pathway and/or network analysis of transcriptomic and genomic data. First, a simple but effective method that improves gene-permuting gene-set enrichment analysis (GSEA) of RNA-sequencing data is presented; it is especially useful for data with few replicates. By taking the absolute statistic, it greatly reduced the false positive rate caused by inter-gene correlation within gene-sets and improved the overall discriminatory ability of gene-permuting GSEA. Next, a powerful competitive gene-set analysis tool for GWAS summary data, named GSA-SNP2, is introduced. The z-score method applied with an adjusted gene score greatly improved sensitivity compared to existing competitive gene-set analysis methods while exhibiting decent false positive control. The performance was validated using both simulated and real data.
In addition, GSA-SNP2 visualizes protein interaction networks within and across the significant pathways so that the user can prioritize the core subnetworks for further mechanistic study. Finally, a novel approach is introduced for predicting condition-specific miRNA target networks by biclustering a large collection of mRNA fold-change data for sequence-specific targets. The bicluster targets exhibited on average 17.0% (median 19.4%) improved gain in certainty (sensitivity + specificity). The net gain was further increased up to 32.0% (median 33.2%) by filtering them using functional network information. The analysis of cancer-related biclusters revealed that the PI3K/Akt signaling pathway is strongly enriched in targets of a few miRNAs in breast cancer and diffuse large B-cell lymphoma. Among them, five independent prognostic miRNAs were identified, and repressions of bicluster targets and pathway activity by mir-29 were experimentally validated. The BiMIR database provides a useful resource to search for miRNA regulation modules for 459 human miRNAs.

    Transcriptome-based predictive modeling approaches in Arabidopsis thaliana

    Forestogram: Biclustering Visualization Framework with Applications in Public Transport and Bioinformatics

    SUMMARY: In many data analysis problems, the data are expressed as a matrix with subjects in rows and attributes in columns. Traditional clustering methods group the subjects (rows) according to similarity criteria between those subjects. The goal is to form groups of subjects (rows) that share a certain degree of resemblance. The resulting groups ensure that subjects share similarities in their attributes (columns), but there is no guarantee about what happens at the level of the attributes (columns). In some applications, a simultaneous grouping of the rows and columns of the data matrix, called biclustering, may be desired. To this end, we design and develop a new framework called Forestogram, which computes this simultaneous grouping of rows and columns (biclusters) hierarchically. Grouping rows and columns simultaneously in a hierarchical manner can help practitioners better understand how the groups evolve, with interesting theoretical properties. Forestogram, the proposed computation and visualization tool, can be seen as a 3D extension of the dendrogram with an extended orthogonal merge. Each bicluster consists of a group of rows (or subjects) that exhibits a pattern strongly correlated with the corresponding group of columns (or attributes). However, instead of performing two-way clustering independently on each side, we propose a hierarchical biclustering algorithm that takes rows and columns at the same time to determine the biclusters. In addition, we develop a model-based information criterion that provides an estimated number of biclusters across a set of hierarchical configurations within the forestogram under mild assumptions.
We study the suggested framework from two different applied perspectives, one in public transit and the other in bioinformatics. First, we study the behavior of public transit users from two distinct kinds of information, temporal data and spatial coordinates, collected from the users' smart card transaction data. In many cities around the world, public transit companies use a smart card system to manage fare collection. Analyzing this information provides a comprehensive view of the user's influence in the interactive public transit network. In this regard, the analysis of temporal data describing the time of entry into the public transit network is considered the most important component of the data collected from smart cards. Classical distance-based clustering techniques are not appropriate for analyzing temporal data. A new intuitive projection is suggested to preserve the pattern of time-stamped data. It is introduced into the suggested method to discover the temporal behavioral pattern of the users. This projection preserves the temporal distance between any arbitrary pair of time-stamped data points with a meaningful visualization. Consequently, this information is fed into a hierarchical clustering algorithm as a data segmentation method to discover the pattern of the users. Then, the time of use is taken into account as a latent variable to make the Euclidean metric appropriate for extracting the spatial pattern through our forestogram.
As a second application, the forestogram is tested on a multiomics dataset combined from different biological measurements to study how the health status of the patients and the corresponding biological modalities evolve hierarchically over the term of pregnancy, in each bicluster. The maintenance of pregnancy relies on a finely tuned balance between tolerance to the fetal allograft and protective mechanisms against invading pathogens. Despite the well-established impact of development during the first months of pregnancy on long-term outcomes, the interactions between the various biological mechanisms that govern the progression of pregnancy have not been studied in detail. Demonstrating the chronology of these adaptations to term pregnancy provides the framework for future studies examining the deviations implicated in pregnancy-related pathologies, including preterm birth and preeclampsia. We perform a multiomics analysis of 51 samples from 17 pregnant women who delivered at term. The datasets include measurements of the immunome, transcriptome, microbiome, proteome, and metabolome from samples obtained simultaneously from the same patients. Multivariate predictive modeling using the Elastic Net algorithm is used to measure the ability of each dataset to predict gestational age. Using stacked generalization, these datasets are combined into a single model. This model not only significantly increases predictive power by combining all datasets but also reveals new interactions between different biological modalities.
In addition, our suggested forestogram is another guideline, along with gestational age at the time of sampling, that provides an unsupervised model to show how much supervised information is needed for each trimester to effectively characterize the pregnancy-induced changes in the microbiome, transcriptome, genome, exposome, and immunome responses.
ABSTRACT: In many statistical modeling problems, data are expressed in a matrix with subjects in rows and attributes in columns. In this regard, simultaneous grouping of rows and columns, known as biclustering of the data matrix, is desired. We design and develop a new framework called Forestogram, with the aim of fast computation and hierarchical illustration of biclusters. Often in practical data analysis, we deal with a two-dimensional object known as the data matrix, where observations are expressed as samples (or subjects) in rows and attributes (or features) in columns. Thus, simultaneous grouping of rows and columns in a hierarchical manner helps practitioners better understand how clusters evolve. Forestogram, a novel computational and visualization tool, could be thought of as a 3D expansion of the dendrogram with an extended orthogonal merge. Each bicluster consists of a group of rows (or samples) that unfolds a highly correlated schema with the corresponding group of columns (or attributes). However, instead of performing two-way clustering independently on each side, we propose a hierarchical biclustering algorithm which takes rows and columns at the same time to determine the biclusters. Furthermore, we develop a model-based information criterion which provides an estimated number of biclusters through a set of hierarchical configurations within the forestogram under mild assumptions. We study the suggested framework from two different applied perspectives, one in the public transit domain and the other in bioinformatics.
First, we investigate the behavior of public transit users based on two distinct kinds of information, temporal data and spatial coordinates, gathered from smart cards. In many cities worldwide, public transit companies use a smart card system to manage fare collection. Analysis of this information provides comprehensive insight into the user's influence in the interactive public transit network. In this regard, the analysis of temporal data describing the time of entry into the public transit network is considered the most substantial component of the data gathered from the smart cards. Classical distance-based techniques are not always suitable for analyzing such time series data. A novel projection with an intuitive visual map from a higher dimension into a three-dimensional clock-like space is suggested to reveal the underlying temporal pattern of public transit users. This projection retains the temporal distance between any arbitrary pair of time-stamped data points with a meaningful visualization. Consequently, this information is fed into a hierarchical clustering algorithm as a method of data segmentation to discover the pattern of users. Then, the time of usage is taken into account as a latent variable to make the Euclidean metric appropriate for extracting the spatial pattern through our forestogram. As a second application, the forestogram is tested on a multiomics dataset combined from different biological measurements to study how patients and corresponding biological modalities evolve hierarchically in each bicluster over the term of pregnancy. The maintenance of pregnancy relies on a finely tuned balance between tolerance to the fetal allograft and protective mechanisms against invading pathogens. Despite the well-established impact of development during the early months of pregnancy on long-term outcomes, the interactions between the various biological mechanisms that govern the progression of pregnancy have not been studied in detail.
Demonstrating the chronology of these adaptations to term pregnancy provides the framework for future studies examining deviations implicated in pregnancy-related pathologies, including preterm birth and preeclampsia. We perform a multiomics analysis of 51 samples from 17 pregnant women who delivered at term. The datasets include measurements from the immunome, transcriptome, microbiome, proteome, and metabolome of samples obtained simultaneously from the same patients. Multivariate predictive modeling using the Elastic Net algorithm is used to measure the ability of each dataset to predict gestational age. Using stacked generalization, these datasets are combined into a single model. This model not only significantly increases the predictive power by combining all datasets but also reveals novel interactions between different biological modalities. Furthermore, our suggested forestogram is another guideline, along with the gestational age at the time of sampling, that provides an unsupervised model to show how much supervised information is necessary for each trimester to effectively characterize the pregnancy-induced changes in microbiome, transcriptome, genome, exposome, and immunome responses.
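
The abstract above notes that ordinary distances handle time-of-day data poorly and proposes a clock-like projection that preserves temporal distance. Its three-dimensional construction is not specified here, so the sketch below shows only the underlying idea in a simplified two-dimensional form: place each time of day on a unit circle so that times just before and just after midnight end up close together. The function names and the 24-hour period are assumptions for this example.

```python
import math

def clock_point(hour):
    # Map a time of day (in hours, fractional allowed) to a point on the unit circle.
    angle = 2 * math.pi * (hour % 24) / 24
    return (math.cos(angle), math.sin(angle))

def clock_distance(h1, h2):
    # Euclidean (chord) distance between the two projected points.
    return math.dist(clock_point(h1), clock_point(h2))

# 23:30 and 00:30 are one hour apart, but 23 hours apart on a naive number line.
late_vs_early = clock_distance(23.5, 0.5)
midnight_vs_noon = clock_distance(0.0, 12.0)
```

After this projection, a standard Euclidean clustering routine treats 23:30 and 00:30 as neighbors, which a raw hour value would not allow.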