97 research outputs found

    EDISA: extracting biclusters from multiple time-series of gene expression profiles

    Get PDF
    Background: Cells dynamically adapt their gene expression patterns in response to various stimuli. This response is orchestrated into a number of gene expression modules consisting of co-regulated genes. A growing pool of publicly available microarray datasets allows the identification of modules by monitoring expression changes over time. These time-series datasets can be searched for gene expression modules with any of the many clustering methods published to date. For an integrative analysis, several time-series datasets can be joined into a three-dimensional gene-condition-time dataset, to which standard clustering or biclustering methods are, however, not applicable. We therefore devise a probabilistic clustering algorithm for gene-condition-time datasets. Results: In this work, we present the EDISA (Extended Dimension Iterative Signature Algorithm), a novel probabilistic clustering approach for 3D gene-condition-time datasets. Based on mathematical definitions of gene expression modules, the EDISA samples initial modules from the dataset, which are then refined by removing genes and conditions until they comply with the module definition. A subsequent extension step ensures gene and condition maximality. We applied the algorithm to a synthetic dataset and were able to successfully recover the implanted modules over a range of background noise intensities. Analysis of microarray datasets has led us to define three biologically relevant module types: 1) modules with independent response profiles, which we found to be the most prevalent; these comprise genes that are co-regulated under several conditions, yet with a different response pattern under each condition. 2) Coherent modules with similar responses under all conditions, which also occurred frequently and were often contained within modules of the first type. 3) A third module type, covering a response specific to a single condition, which was detected only rarely. All of these modules are essentially different types of biclusters. Conclusion: We successfully applied the EDISA to different 3D datasets. While previous studies were mostly aimed at detecting coherent modules only, our results show that coherent responses are often part of a more general module type with independent response profiles under different conditions. Our approach thus allows for a more comprehensive view of the gene expression response. After subsequent analysis of the resulting modules, the EDISA helped to shed light on the global organization of transcriptional control. An implementation of the algorithm is available at http://www-ra.informatik.uni-tuebingen.de/software/IAGEN/.
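    The iterative refinement step at the heart of this kind of signature algorithm can be illustrated with a minimal sketch. This is not the published EDISA: the function name and correlation threshold are illustrative, and the real algorithm operates on the full 3D gene-condition-time array with separate gene and condition scores.

```python
import numpy as np

def refine_module(data, genes, corr_threshold=0.8, max_iter=50):
    """Iteratively shrink a candidate gene set until every member's
    time profile correlates with the module's mean profile above the
    threshold.  `data` has shape (n_genes, n_timepoints), i.e. one
    condition slice of a gene-condition-time array."""
    genes = set(genes)
    for _ in range(max_iter):
        profile = data[sorted(genes)].mean(axis=0)   # module consensus profile
        keep = {g for g in genes
                if np.corrcoef(data[g], profile)[0, 1] >= corr_threshold}
        if keep == genes or not keep:                # converged (or emptied)
            break
        genes = keep
    return sorted(genes)
```

Starting from a random sample of genes, repeated shrinking of this kind leaves only a coherently expressed core; a subsequent extension step would then re-add every gene that fits the converged profile.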

    Development of Biclustering Techniques for Gene Expression Data Modeling and Mining

    Get PDF
    The next-generation sequencing technologies can generate large-scale biological data with higher resolution, better accuracy, and lower technical variation than the array-based counterparts. RNA sequencing (RNA-Seq) can generate genome-scale gene expression data in biological samples at a given moment, facilitating a better understanding of cell functions at the genetic and cellular levels. The abundance of gene expression datasets provides an opportunity to identify genes with similar expression patterns across multiple conditions, i.e., co-expression gene modules (CEMs). Genome-scale identification of CEMs can be modeled and solved by biclustering, a two-dimensional data mining technique that allows simultaneous clustering of the rows and columns of a gene expression matrix. Compared with traditional clustering, which targets global patterns, biclustering can predict local patterns. This unique feature makes biclustering very useful when applied to big gene expression data, since genes that participate in a cellular process are only active under specific conditions and are thus usually co-expressed under a subset of all conditions. The combination of biclustering and large-scale gene expression data holds promising potential for condition-specific functional pathway/network analysis. However, existing biclustering tools do not perform satisfactorily on high-resolution RNA-Seq data, largely due to the lack of (i) a consideration of the high sparsity of RNA-Seq data, especially scRNA-Seq data, and (ii) an understanding of the underlying transcriptional regulation signals of the observed gene expression values. QUBIC2, a novel biclustering algorithm, is designed for large-scale bulk RNA-Seq and single-cell RNA-Seq (scRNA-Seq) data analysis. Critical novelties of the algorithm include (i) a truncated model to handle the unreliable quantification of genes with low or moderate expression; (ii) a Gaussian mixture distribution and an information-divergence objective function to capture shared transcriptional regulation signals among a set of genes; (iii) a dual strategy to expand the core biclusters, aiming to rescue dropouts from the background; and (iv) a statistical framework to evaluate the significance of all identified biclusters. Method validation on comprehensive datasets suggests that QUBIC2 has superior performance in functional module detection and cell type classification. Applications to temporal and spatial data demonstrated that QUBIC2 can derive meaningful biological information from scRNA-Seq data. Also presented in this dissertation is QUBICR, an R package characterized by an average efficiency improvement of 82% over the source C code of QUBIC. It provides a set of comprehensive functions to facilitate biclustering-based biological studies, including discretization of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of identified biclusters, and co-expression network elucidation. In the end, a systematic summary is provided regarding the primary applications of biclustering to biological data and more advanced applications to biomedical data. It will assist researchers in analyzing their big data effectively and generating valuable biological knowledge and novel insights with higher efficiency.
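    The discretization step that the QUBIC family builds on, turning each gene's expression vector into a qualitative representation, can be sketched as follows. This is a simplified illustration only: the actual algorithms rank ties, support more discretization levels, and, in QUBIC2's case, fit a truncated/mixture model rather than plain quantiles.

```python
import numpy as np

def discretize(row, q=0.25):
    """Qualitative discretization of one gene's expression vector in
    the spirit of QUBIC: the lowest q fraction of values becomes -1
    (repressed), the highest q fraction becomes +1 (activated), and
    the rest becomes 0 (unchanged)."""
    lo, hi = np.quantile(row, [q, 1 - q])
    out = np.zeros_like(row, dtype=int)
    out[row <= lo] = -1
    out[row >= hi] = 1
    return out
```

Once every gene is reduced to such integer states, biclusters correspond to submatrices whose rows agree on their states over a shared set of columns, which can be searched combinatorially.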

    Data Mining Using the Crossing Minimization Paradigm

    Get PDF
    Our ability and capacity to generate, record and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is going down. The recorded data is not perfect, as noise gets introduced into it from different sources; some basic forms of noise are incorrectly recorded values and missing values. The formal study of discovering useful hidden information in data is called Data Mining. Because of the size and complexity of the problem, practical data mining problems are best attempted using automatic means. Data Mining can be categorized into two types: supervised learning (classification) and unsupervised learning (clustering). Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis or a local view, biclustering (also called co-clustering or two-way clustering) is required, involving the simultaneous clustering of the records and the attributes. In this dissertation, a novel fast and white-noise-tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering and discovers overlapping biclusters. The CM paradigm has traditionally been used for decades in graph drawing and VLSI (Very Large Scale Integration) circuit design to reduce wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques on simulated noisy data as well as real data from agriculture, biology and other domains. Two other interesting and hard problems also addressed in this dissertation are (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) the Bandwidth Minimization (BWM) problem for sparse matrices. The proposed CM technique is demonstrated to provide very convincing results on these problems using real public-domain data. Pakistan is the fourth largest supplier of cotton in the world. An apparent anomaly has been observed during 1989-97 between cotton yield and pesticide consumption in Pakistan, showing unexpected periods of negative correlation. By applying the indigenous CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of the anomaly is presented in this thesis.
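    The core operation behind the CM paradigm, reordering one layer of a bipartite graph so that edges cross as little as possible, is commonly realized with the barycenter heuristic; a minimal sketch follows (the thesis's exact CM variant may differ). Reducing crossings pulls densely interconnected rows and columns next to each other, which is what exposes the biclusters.

```python
def barycenter_order(adj, fixed_order):
    """One sweep of the barycenter heuristic: reorder the free layer of
    a bipartite graph by the mean position of each vertex's neighbours
    in the fixed layer.  Alternating sweeps over the two layers drive
    the number of edge crossings down."""
    pos = {v: i for i, v in enumerate(fixed_order)}

    def bc(v):
        nbrs = adj[v]
        return sum(pos[u] for u in nbrs) / len(nbrs) if nbrs else 0.0

    return sorted(adj, key=bc)
```

In a clustering setting, the rows of the data matrix form one layer, the attributes the other, and above-threshold matrix entries form the edges; after a few alternating sweeps, biclusters appear as dense blocks along the diagonal.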

    A review of estimation of distribution algorithms in bioinformatics

    Get PDF
    Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics applications. Genetic algorithms, the most well-known and representative evolutionary search technique, account for the major part of such applications. Estimation of distribution algorithms (EDAs) offer a novel evolutionary paradigm that constitutes a natural and attractive alternative to genetic algorithms. They make use of a probabilistic model, learnt from the promising solutions, to guide the search process. In this paper, we set out a basic taxonomy of EDA techniques, underlining the nature and complexity of the probabilistic model of each EDA variant. We review a set of innovative works that make use of EDA techniques to solve challenging bioinformatics problems, emphasizing the EDA paradigm's potential for further research in this domain.
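    The EDA principle described here, learning a probabilistic model from promising solutions and sampling the next population from it, is easiest to see in the simplest variant, the Univariate Marginal Distribution Algorithm (UMDA). Below is a minimal sketch on the OneMax toy problem; all parameter values are illustrative.

```python
import random

def umda_onemax(n_bits=20, pop=60, elite=20, gens=40, seed=1):
    """UMDA on OneMax: each generation, estimate per-bit marginal
    probabilities from the best `elite` solutions and sample the next
    population from them.  More complex EDA variants replace the
    independent marginals with models of variable dependencies."""
    rng = random.Random(seed)
    p = [0.5] * n_bits                        # initial marginal model
    best = None
    for _ in range(gens):
        popn = [[1 if rng.random() < pi else 0 for pi in p]
                for _ in range(pop)]
        popn.sort(key=sum, reverse=True)      # OneMax fitness = number of ones
        sel = popn[:elite]
        # re-estimate marginals from the selected (promising) solutions
        p = [sum(x[i] for x in sel) / elite for i in range(n_bits)]
        # clamp to avoid premature fixation of any bit
        p = [min(max(pi, 0.05), 0.95) for pi in p]
        if best is None or sum(popn[0]) > sum(best):
            best = popn[0]
    return best
```

The selection-then-model-estimation step replaces the crossover and mutation operators of a genetic algorithm, which is exactly the distinction the taxonomy in this review is built around.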

    Correlation Clustering

    Get PDF
    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The core step of the KDD process is the application of a Data Mining algorithm in order to produce a particular enumeration of patterns and relationships in large databases. Clustering is one of the major data mining techniques and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized and the similarity of objects from different clusters is minimized. This can serve to group customers with similar interests, or to group genes with related functionalities. High-dimensional feature spaces currently pose a particular challenge for clustering techniques. Thanks to modern facilities of data collection, real data sets usually contain many features. These features are often noisy or exhibit correlations among each other. However, since these effects vary in relevance across different parts of the data set, irrelevant features cannot be discarded in advance; the selection of relevant features must therefore be integrated into the data mining technique. For about ten years, specialized clustering approaches have been developed to cope with problems in high-dimensional data better than classic clustering approaches do. Often, however, problems of very different nature are not distinguished from one another. A main objective of this thesis is therefore a systematic classification of the diverse approaches developed in recent years according to their task definition, their basic strategy, and their algorithmic approach. We discern as main categories the search for clusters (i) w.r.t. closeness of objects in axis-parallel subspaces, (ii) w.r.t. common behavior (patterns) of objects in axis-parallel subspaces, and (iii) w.r.t. closeness of objects in arbitrarily oriented subspaces (so-called correlation clusters).
    For the third category, the remaining parts of the thesis describe novel approaches. A first approach is the adaptation of density-based clustering to the problem of correlation clustering. The starting point here is the first density-based approach in this field, the algorithm 4C. Subsequently, enhancements and variations of this approach are discussed that allow for more robust, more efficient, or more effective behavior, or that even find hierarchies of correlation clusters and the corresponding subspaces. The density-based approach to correlation clustering, however, is fundamentally unable to solve some issues, since it requires an analysis of local neighborhoods, which is problematic in high-dimensional data. Therefore, a novel method is proposed that tackles the correlation clustering problem with a global approach. Finally, a method is proposed to derive models for correlation clusters, allowing for an interpretation of the clusters and facilitating more thorough analysis in the corresponding domain science. Possible applications of these models are proposed and discussed.
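    The notion of clusters in arbitrarily oriented subspaces can be made concrete with a small sketch in the spirit of PCA-based approaches such as 4C. This is a simplified illustration, not 4C itself, and the variance-fraction threshold `delta` is an assumption.

```python
import numpy as np

def correlation_dimension(points, delta=0.1):
    """Estimate the correlation dimensionality of a point set:
    eigen-decompose its covariance matrix and count the eigenvalues
    that carry more than a `delta` fraction of the total variance.
    Points lying near a line embedded in 3D yield dimension 1, i.e. a
    one-dimensional correlation cluster in an arbitrary orientation."""
    cov = np.cov(np.asarray(points, dtype=float).T)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending
    return int(np.sum(eigvals / eigvals.sum() > delta))
```

Density-based correlation clustering applies a test of this kind to local neighborhoods; the global methods developed later in the thesis avoid the neighborhood computation that becomes unreliable in high dimensions.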

    Bi-(N-) cluster editing and its biomedical applications

    Get PDF
    The extremely fast advances in wet-lab techniques lead to an exponential growth of heterogeneous and unstructured biological data, posing a great challenge to data integration in today's systems biology. The traditional clustering approach, although widely used to divide data into groups sharing common features, is less powerful for the analysis of heterogeneous data from n different sources (n ≥ 2). The co-clustering approach has been widely used for combined analyses of multiple networks to address the challenge of heterogeneity. In this thesis, novel methods for the co-clustering of large-scale heterogeneous data sets are presented in the software package n-CluE: one exact algorithm and two heuristic algorithms based on the model of bi-/n-cluster editing, modeling the input as n-partite graphs and solving the clustering problem with various strategies. In the first part of the thesis, the complexity and the fixed-parameter tractability of an extended bicluster editing model with relaxed constraints, the ?-bicluster editing model, are investigated, and its NP-hardness is proven. Based on the results of this analysis, three strategies within the n-CluE software package are then established and discussed, together with evaluations of performance and systematic comparisons against other algorithms of the same type for solving the bi-/n-cluster editing problem. To demonstrate the practical impact, three real-world analyses using n-CluE are performed, including (a) prediction of novel genotype-phenotype associations by clustering data from genome-wide association studies; (b) a comparison between n-CluE and eight other biclustering tools on Gene Expression Omnibus (GEO) microarray data sets; (c) drug repositioning predictions by co-clustering on drug, gene and disease networks.
    The outstanding performance of n-CluE in these real-world applications demonstrates its strength and flexibility in integrating heterogeneous data and extracting biologically relevant information in bioinformatic analyses.
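    The bicluster editing model underlying n-CluE asks for the cheapest set of edge insertions and deletions that turns a bipartite graph into a disjoint union of bicliques. The cost of one candidate co-clustering can be sketched as follows; this is an illustrative brute-force cost function, not the n-CluE implementation, and the variable names are assumptions.

```python
def bicluster_editing_cost(edges, row_cluster, col_cluster):
    """Editing cost of a given co-clustering of a bipartite graph:
    every edge crossing two clusters must be deleted, and every
    missing within-cluster row/column pair must be inserted.
    Bicluster editing asks for the partition minimising this cost
    (an NP-hard problem, as shown in the thesis for a relaxed variant)."""
    edges = set(edges)
    deletions = sum(1 for r, c in edges
                    if row_cluster[r] != col_cluster[c])
    insertions = sum(1 for r in row_cluster for c in col_cluster
                     if row_cluster[r] == col_cluster[c]
                     and (r, c) not in edges)
    return deletions + insertions
```

An exact solver searches over partitions for the minimum of this cost; the heuristics trade that guarantee for scalability, and the n-partite generalization scores edges between every pair of the n layers the same way.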

    Forestogram: Biclustering Visualization Framework with Applications in Public Transport and Bioinformatics

    Get PDF
    In many statistical modeling problems, data are expressed in a matrix with subjects in rows and attributes in columns. In some applications, a simultaneous grouping of rows and columns, known as biclustering of the data matrix, is desired. We design and develop a new framework called Forestogram for the fast computation and hierarchical visualization of biclusters. Often in practical data analysis, we deal with a two-dimensional object known as the data matrix, where observations are expressed as samples (or subjects) in rows and attributes (or features) in columns. Simultaneous grouping of rows and columns in a hierarchical manner thus helps practitioners better understand how clusters evolve, and comes with interesting theoretical properties. Forestogram, a novel computational and visualization tool, can be thought of as a 3D expansion of the dendrogram with an extended orthogonal merge. Each bicluster consists of a group of rows (or samples) that exhibits a highly correlated pattern with its corresponding group of columns (or attributes). However, instead of performing two-way clustering independently on each side, we propose a hierarchical biclustering algorithm that takes rows and columns into account at the same time to determine the biclusters. Furthermore, we develop a model-based information criterion that provides an estimated number of biclusters across a set of hierarchical configurations within the forestogram under mild assumptions. We study the suggested framework from two different applied perspectives, one in the public transit domain, the other in bioinformatics.
    First, we investigate users' behavior in public transit based on two distinct sources of information: temporal data and spatial coordinates gathered from smart card transactions. In many cities worldwide, public transit companies use smart card systems to manage fare collection. Analysis of this information provides comprehensive insight into users' influence in the interactive public transit network. In this regard, the analysis of temporal data describing the time of entry into the public transit network is considered the most substantial component of the data gathered from the smart cards. Classical distance-based techniques are not always suitable for analyzing such time-stamped data. A novel projection with an intuitive visual map from a higher dimension into a three-dimensional clock-like space is suggested to reveal the underlying temporal pattern of public transit users. This projection retains the temporal distance between any arbitrary pair of time-stamped data points with meaningful visualization. Consequently, this information is fed into a hierarchical clustering algorithm as a method of data segmentation to discover the pattern of users. Then, the time of usage is taken into account as a latent variable to make the Euclidean metric appropriate for extracting the spatial pattern through our forestogram. As a second application, the forestogram is tested on a multiomics dataset combining different biological measurements to study how patients and the corresponding biological modalities evolve hierarchically within each bicluster over the term of pregnancy. The maintenance of pregnancy relies on a finely-tuned balance between tolerance to the fetal allograft and protective mechanisms against invading pathogens. Despite the well-established impact of development during the early months of pregnancy on long-term outcomes, the interactions between the various biological mechanisms that govern the progression of pregnancy have not been studied in detail.
Demonstrating the chronology of these adaptations to term pregnancy provides the framework for future studies examining deviations implicated in pregnancy-related pathologies, including preterm birth and preeclampsia. We perform a multiomics analysis of 51 samples from 17 pregnant women delivering at term. The datasets include measurements from the immunome, transcriptome, microbiome, proteome, and metabolome of samples obtained simultaneously from the same patients. Multivariate predictive modeling using the Elastic Net algorithm is used to measure the ability of each dataset to predict gestational age. Using stacked generalization, these datasets are combined into a single model. This model not only significantly increases the predictive power by combining all datasets, but also reveals novel interactions between different biological modalities. Furthermore, our suggested forestogram provides another guideline, alongside the gestational age at the time of sampling: an unsupervised model showing how much supervised information is necessary in each trimester to effectively characterize the pregnancy-induced changes in microbiome, transcriptome, genome, exposome, and immunome responses.
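The per-omic Elastic Net models plus stacked generalization described above can be sketched with scikit-learn. The data here is synthetic stand-in noise, not the study's measurements, and the two-block layout is our simplification; the pattern shown (out-of-fold level-0 predictions feeding a level-1 combiner) is the standard way to stack without leakage.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 51  # matches the study's sample count; everything else is synthetic
gestational_age = rng.uniform(8, 40, n)  # outcome in weeks

# Synthetic stand-ins for two omics blocks correlated with the outcome.
omics = {
    "immunome": gestational_age[:, None] + rng.normal(0, 5, (n, 30)),
    "metabolome": gestational_age[:, None] + rng.normal(0, 8, (n, 20)),
}

# Level 0: one Elastic Net per dataset; out-of-fold predictions
# (cross_val_predict) keep the level-1 model from seeing its own
# training fits, which would otherwise leak.
level0_preds = np.column_stack([
    cross_val_predict(ElasticNetCV(cv=5), X, gestational_age, cv=5)
    for X in omics.values()
])

# Level 1 (stacked generalization): combine the per-omic predictions.
stacker = Ridge().fit(level0_preds, gestational_age)
```

The level-1 coefficients also indicate how much each modality contributes to the combined prediction, which is one way such a model surfaces interactions between datasets.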

    Biclustering: Methods, Software and Application

    Over the past 10 years, biclustering has become popular not only in the field of biological data analysis but also in other applications with high-dimensional two-way datasets. This technique clusters both rows and columns simultaneously, as opposed to clustering only rows or only columns. Biclustering retrieves subgroups of objects that are similar in one subgroup of variables and different in the remaining variables. This dissertation focuses on improving and advancing biclustering methods. Since most existing methods are extremely sensitive to variations in parameters and data, we developed an ensemble method to overcome these limitations. More stable and reliable biclusters can be retrieved in two ways: either by running algorithms with different parameter settings or by running them on sub- or bootstrap samples of the data, and combining the results. To this end, we designed a software package containing a collection of bicluster algorithms for different clustering tasks and data scales, developed several new ways of visualizing bicluster solutions, and adapted traditional cluster validation indices (e.g. the Jaccard index) to the bicluster framework. Finally, we applied biclustering to marketing data. Well-established algorithms were adjusted to slightly different data situations, and a new method specially adapted to ordinal data was developed. In order to test this method on artificial data, we generated correlated ordinal random values. This dissertation introduces two methods for generating such values given a probability vector and a correlation structure. All the methods outlined in this dissertation are freely available in the R packages biclust and orddata. Numerous examples in this work illustrate how to use the methods and software.
    Over the past 10 years, biclustering has become increasingly popular, above all in the field of biological data analysis, but also in all areas with high-dimensional data. Biclustering refers to the simultaneous clustering of two-way data in order to find subsets of objects that behave similarly with respect to subsets of variables. This thesis deals with the further development and optimization of biclustering methods. In addition to the development of a software package for computing, post-processing, and graphically displaying bicluster results, an ensemble method for bicluster algorithms was developed. Since most algorithms are very sensitive to small changes in the starting parameters, more robust results can be obtained this way. The new method also includes combining bicluster results obtained on subsample and bootstrap samples. To validate the results, existing measures from traditional clustering (e.g. the Jaccard index) were adapted to biclustering, and new graphical tools for interpreting the results were developed. A further part of the thesis deals with the application of bicluster algorithms to data from the marketing domain. For this purpose, existing algorithms had to be modified, and a new algorithm specifically for ordinal data was developed. To enable testing these methods on artificial data, the thesis also includes a procedure for drawing ordinal random numbers with given probabilities and correlation structure. The methods presented in this thesis are publicly available through the two R packages biclust and orddata. Their usability is demonstrated by numerous examples in this work.
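Adapting the Jaccard index to biclusters, as the dissertation describes, is commonly done by treating each bicluster as the set of matrix cells it covers (its row set crossed with its column set) and comparing those cell sets. A minimal sketch of that idea; the function name and representation are our assumptions, not the biclust package's API:

```python
def bicluster_jaccard(b1, b2):
    """Jaccard index between two biclusters, each given as a
    (row_indices, col_indices) pair. A bicluster is treated as the set
    of matrix cells it covers, i.e. the Cartesian product rows x cols:
    |cells1 & cells2| / |cells1 | cells2|."""
    cells1 = {(r, c) for r in b1[0] for c in b1[1]}
    cells2 = {(r, c) for r in b2[0] for c in b2[1]}
    union = len(cells1 | cells2)
    return len(cells1 & cells2) / union if union else 0.0
```

In an ensemble setting, such a similarity can be used to match and merge biclusters found across parameter settings or bootstrap samples, keeping only those that recur with high overlap.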
    • 
