48 research outputs found

    Biclustering fMRI time series

    Master's thesis, Data Science, Universidade de Lisboa, Faculdade de Ciências, 2020. The effectiveness of biclustering, the simultaneous clustering of both rows and columns of a data matrix, has been shown primarily in gene expression data analysis. Several researchers recognize its potential in other research areas. Nevertheless, the last two decades have seen many biclustering algorithms targeting gene expression data and a lack of consistent studies exploring the capacities of biclustering outside this traditional application domain. Following hints that suggest potential for biclustering on spatiotemporal data, particularly in the neurosciences, this thesis explores the capacity of biclustering to extract knowledge from fMRI time series. It proposes a methodology to evaluate the feasibility of using biclustering algorithms to study the fMRI signal, considering both synthetic and real-world fMRI datasets. In the absence of ground truth against which to compare bicluster solutions, we used internal evaluation metrics. Results comparing different search strategies showed the superiority of exhaustive approaches, which obtained the most homogeneous biclusters. However, their high computational cost is still a challenge, and further work is needed for the efficient use of biclustering in fMRI data analysis. We also propose a new methodology for analyzing biclusters that applies pattern mining algorithms to determine the most frequent patterns present in the generated biclustering solutions. A bicluster is nothing more than a hyperlink in a hypergraph, so extracting frequent patterns in a biclustering solution amounts to extracting the most significant hyperlinks. As a first approach, this makes it possible to understand relationships between regions of the brain and to draw temporal profiles that traditional correlation studies cannot detect. Additionally, the process of generating biclusters filters out uninteresting links, potentially allowing hypergraphs to be generated efficiently. The final question is what we can do with this knowledge. Knowing the relationships between brain regions is the central objective of the neurosciences. Understanding the connections between brain regions across several subjects allows one to draw profiles; for this case, we propose a methodology to extrapolate biclusters to three-dimensional data and perform triclustering. Additionally, understanding the links between brain zones can help identify diseases such as schizophrenia, dementia, or Alzheimer's. This work pinpoints avenues for the use of biclustering in spatiotemporal data analysis, in particular in neuroscience applications. The proposed evaluation methodology showed evidence of the effectiveness of biclustering in finding local patterns in fMRI data, although further work on scalability is needed to promote its application in real scenarios.
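
The abstract above refers to internal evaluation metrics and to "homogeneous" biclusters without naming a specific measure. As an illustration only, the sketch below computes the mean squared residue, a widely used internal homogeneity score for biclusters; whether the thesis uses this particular metric is an assumption, and the function name and toy matrix (rows as brain regions, columns as time points) are hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the bicluster (rows x cols) of a 2-D list.

    A score of 0 means the submatrix follows a perfect additive pattern
    (row effect + column effect); larger values mean less homogeneity.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])

    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows

    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical toy data: 4 regions x 4 time points.
# Regions 0-2 share an additive trend over time points 0-2.
data = [
    [1, 2, 3, 9],
    [2, 3, 4, 0],
    [5, 6, 7, 4],
    [0, 9, 1, 5],
]

coherent = mean_squared_residue(data, [0, 1, 2], [0, 1, 2])
whole = mean_squared_residue(data, [0, 1, 2, 3], [0, 1, 2, 3])
print(coherent, whole)
```

The planted 3 x 3 block scores (numerically) zero, while the full matrix scores much higher, which is the sense in which an internal metric can rank bicluster solutions without ground truth.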

    Analyzing Activity of the Human Brain During Decision Making

    The orbitofrontal cortex (OFC) is a region at the front of the brain whose function is not fully understood. However, it has been clearly implicated in decision making, as shown by many neuroimaging studies over the last decades. Recent work by Saez et al. [1] found evidence that high-frequency activity (HFA) in the OFC, between 70 and 200 Hz, is directly related to behavioral responses during decision-making tasks. In particular, Saez et al. showed that some modulations of HFA correlate with human choice and outcome in a simple betting game. They conducted their analysis with univariate linear regression, predicting HFA values from one task-related parameter at a time to find electrodes that encode decision-making information. This thesis focuses on extending the results and analyses of Saez et al. by applying multivariate methods to discover complex signals and important patterns in the neural data. To this end, canonical correlation analysis and biclustering were applied to 600 different datasets to find evidence of patterns in electrode responses and of complicated combinations of behavioral responses encoded in human brain signals. In addition, machine learning methods were used to analyze the patients' behavioral tendencies towards risk-taking in a gambling task and to predict task-related events such as winning, losing, and gambling from the neural data. Moderate to good performance was achieved with most methods, but in-depth analysis is still necessary to gain a full understanding of how activity in the orbitofrontal cortex gives rise to human behavior in decision-making tasks.

    Statistical Techniques for Exploratory Analysis of Structured Three-Way and Dynamic Network Data.

    In this thesis, I develop different techniques for the pattern extraction and visual exploration of a collection of data matrices. Specifically, I present methods to help home in on and visualize an underlying structure and its evolution over ordered (e.g., time) or unordered (e.g., experimental conditions) index sets. The first part of the thesis introduces a biclustering technique for such three-dimensional data arrays. This technique is capable of discovering potentially overlapping groups of samples and variables that evolve similarly with respect to a subset of conditions. To facilitate and enhance visual exploration, I introduce a framework that utilizes kernel smoothing to guide the estimation of bicluster responses over the array. In the second part of the thesis, I introduce two matrix factorization models. The first is a data integration model that decomposes the data into two factors: a basis common to all data matrices, and a coefficient matrix that varies for each data matrix. The second model is meant for visual clustering of nodes in dynamic network data, which often contains complex evolving structure. Hence, this approach is more flexible and additionally lets the basis evolve for each matrix in the array. Both models utilize a regularization within the framework of non-negative matrix factorization to encourage local smoothness of the basis and coefficient matrices, which improves interpretability and highlights the structural patterns underlying the data, while mitigating noise effects. I also address computational aspects of applying regularized non-negative matrix factorization models to large data arrays by presenting multiple algorithms, including an approximation algorithm based on alternating least squares. PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/99838/1/smankad_1.pd

    A Survey of Adaptive Resonance Theory Neural Network Models for Engineering Applications

    This survey samples from the ever-growing family of adaptive resonance theory (ART) neural network models used to perform the three primary machine learning modalities, namely, unsupervised, supervised and reinforcement learning. It comprises a representative list from classic to modern ART models, thereby painting a general picture of the architectures developed by researchers over the past 30 years. The learning dynamics of these ART models are briefly described, and their distinctive characteristics such as code representation, long-term memory and corresponding geometric interpretation are discussed. Useful engineering properties of ART (speed, configurability, explainability, parallelization and hardware implementation) are examined along with current challenges. Finally, a compilation of online software libraries is provided. It is expected that this overview will be helpful to new and seasoned ART researchers

    G-Tric: enhancing triclustering evaluation using three-way synthetic datasets with ground truth

    Master's thesis, Data Science, Universidade de Lisboa, Faculdade de Ciências, 2020. Three-dimensional datasets, or three-way data, have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data without a known ground truth, which limits the assessments. In this context, we propose a synthetic data generator, G-Tric, which allows the creation of synthetic datasets with configurable properties and the possibility of planting triclusters. The generator is prepared to create datasets resembling real three-way data from the biomedical and social data domains, with the additional advantage of providing the ground truth (triclustering solution) as output. G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimension, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the number of missing values, noise, and errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics when comparing solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches. Besides reviewing the current state of the art in triclustering approaches, comparison studies, and evaluation metrics, this work analyzes how the lack of frameworks for generating synthetic data influences existing evaluation methodologies, limiting the scope of performance insights that can be extracted from each algorithm, and it exemplifies how the decisions made in these evaluations can affect the quality and validity of the results. As an alternative, a different methodology that takes advantage of synthetic data with ground truth is presented. This approach, combined with a proposed extension to an existing extrinsic clustering measure, makes it possible to assess solution quality from new perspectives.

    Forestogram: Biclustering Visualization Framework with Applications in Public Transport and Bioinformatics

    In many statistical modeling problems, data are expressed in a matrix with subjects in rows and attributes in columns. Traditional clustering methods group the subjects (rows) according to similarity criteria, with no guarantee about what happens at the level of the attributes (columns). In some applications, a simultaneous grouping of rows and columns, known as biclustering of the data matrix, is therefore desired. We design and develop a new framework called Forestogram, aimed at the fast computation and hierarchical illustration of biclusters. Simultaneous grouping of rows and columns in a hierarchical manner helps practitioners better understand how clusters evolve. Forestogram, a novel computational and visualization tool, can be thought of as a 3D expansion of the dendrogram, with an extended orthogonal merge. Each bicluster consists of a group of rows (or samples) that exhibits a highly correlated pattern with its corresponding group of columns (or attributes). However, instead of performing two-way clustering independently on each side, we propose a hierarchical biclustering algorithm that takes rows and columns into account at the same time to determine the biclusters. Furthermore, we develop a model-based information criterion that provides an estimated number of biclusters across a set of hierarchical configurations within the forestogram under mild assumptions. We study the suggested framework from two applied perspectives, one in the public transit domain and the other in bioinformatics. First, we investigate users' behavior in public transit based on two distinct kinds of information, temporal data and spatial coordinates, gathered from smart cards. In many cities worldwide, public transit companies use smart card systems to manage fare collection. Analysis of this information provides comprehensive insight into users' influence on the interactive public transit network. In this regard, the temporal data describing the time of entry into the public transit network are considered the most substantial component of the data gathered from smart cards. Classical distance-based techniques are not always suitable for analyzing such time series data. A novel projection, with an intuitive visual map from a higher dimension into a three-dimensional clock-like space, is suggested to reveal the underlying temporal pattern of public transit users. This projection retains the temporal distance between any arbitrary pair of time-stamped data points while providing a meaningful visualization. This information is then fed into a hierarchical clustering algorithm as a data segmentation method to discover users' patterns. Next, the time of usage is taken into account as a latent variable to make the Euclidean metric appropriate for extracting the spatial pattern through our forestogram. As a second application, the forestogram is tested on a multiomics dataset combined from different biological measurements to study how patients and the corresponding biological modalities evolve hierarchically in each bicluster over the term of pregnancy. The maintenance of pregnancy relies on a finely tuned balance between tolerance to the fetal allograft and protective mechanisms against invading pathogens. Despite the well-established impact of development during the early months of pregnancy on long-term outcomes, the interactions between the various biological mechanisms that govern the progression of pregnancy have not been studied in detail. Demonstrating the chronology of these adaptations to term pregnancy provides the framework for future studies examining deviations implicated in pregnancy-related pathologies, including preterm birth and preeclampsia. We perform a multiomics analysis of 51 samples from 17 pregnant women who delivered at term. The datasets include measurements of the immunome, transcriptome, microbiome, proteome, and metabolome of samples obtained simultaneously from the same patients. Multivariate predictive modeling using the Elastic Net algorithm is used to measure the ability of each dataset to predict gestational age. Using stacked generalization, these datasets are combined into a single model. This model not only significantly increases predictive power by combining all datasets, but also reveals novel interactions between different biological modalities. Furthermore, our suggested forestogram is another guideline, alongside gestational age at the time of sampling, that provides an unsupervised model showing how much supervised information is necessary in each trimester to effectively characterize the pregnancy-induced changes in microbiome, transcriptome, genome, exposome, and immunome responses.
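
The Forestogram abstract notes that classical distance-based techniques are not always suitable for time-stamped data and proposes a clock-like projection. The thesis's actual projection (into a three-dimensional space) is not given in this listing; the sketch below is a deliberately simplified two-dimensional analogue, with hypothetical function names, that maps the hour of day onto a unit circle so that times just before and just after midnight end up close together.

```python
import math


def clock_projection(hour):
    """Map an hour of day in [0, 24) to a point on the unit circle.

    Simplified 2-D stand-in for a clock-like projection (assumption:
    this is not the projection defined in the thesis).
    """
    angle = 2 * math.pi * (hour % 24) / 24
    return (math.cos(angle), math.sin(angle))


def chord_distance(hour_a, hour_b):
    """Euclidean distance between two times after projection."""
    xa, ya = clock_projection(hour_a)
    xb, yb = clock_projection(hour_b)
    return math.hypot(xa - xb, ya - yb)


# On the raw hour axis, 23:30 and 00:30 are 23 hours apart;
# after projection they are as close as 11:30 and 12:30.
across_midnight = chord_distance(23.5, 0.5)
across_noon = chord_distance(11.5, 12.5)
print(across_midnight, across_noon)
```

With this kind of embedding, an ordinary Euclidean metric respects the cyclic nature of time of day, which is the property a distance-based hierarchical clustering step would need.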

    Matrix Reordering Methods for Table and Network Visualization

    This survey describes algorithms for reordering visual matrices of tabular data and adjacency matrices of networks. Its goal is to provide a comprehensive list of reordering algorithms published in different fields such as statistics, bioinformatics, and graph theory. While several of these algorithms are described in publications and others are available in software libraries and programs, there is little awareness of what is done across all fields. Our survey aims to describe these reordering algorithms in a unified manner so that a wide audience can understand their differences and subtleties. We organize this corpus in a consistent manner, independently of the application or research field. We also provide practical guidance on how to select appropriate algorithms depending on the structure and size of the matrix to reorder, and point to implementations when available.

    Vertical integration of multiple high-dimensional datasets

    Research in genomics and related fields now often requires the analysis of multi-block data, in which multiple high-dimensional types of data are available for a common set of objects. We introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of multi-block datasets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across datatypes, low-rank approximations for structured variation individual to each datatype, and residual noise. JIVE quantifies the amount of joint variation between datatypes, reduces the dimensionality of the data, and allows for the visual exploration of joint and individual structure. JIVE is an extension of Principal Components Analysis and has clear advantages over popular two-block methods such as Canonical Correlation and Partial Least Squares. Research in a number of fields also requires the analysis of multi-way data. Multi-way data take the form of a three (or higher) dimensional array. We compare several existing factorization methods for multi-way data, and we show that these methods belong to the same unified framework. The final portion of this dissertation concerns biclustering. We introduce an approach to biclustering a binary data matrix, and discuss the application of biclustering to classification problems.
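
The three-term JIVE decomposition described in this abstract can be written compactly as follows. The notation is an assumption made for illustration (the listing gives no formulas): $X_i$ is the data block for datatype $i$ out of $k$, with the common set of objects as columns; $J_i$ is its joint-structure term, $A_i$ its individual-structure term, and $E_i$ the residual noise.

```latex
X_i = J_i + A_i + E_i, \qquad i = 1, \dots, k,
\qquad
\operatorname{rank}\!\begin{pmatrix} J_1 \\ \vdots \\ J_k \end{pmatrix} = r,
\qquad
\operatorname{rank}(A_i) = r_i .
```

The joint terms share a single low rank $r$ when stacked across datatypes, whereas each individual term has its own low rank $r_i$; this is what separates variation common to all blocks from variation specific to one block.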