80 research outputs found

    BROCCOLI: overlapping and outlier-robust biclustering through proximal stochastic gradient descent

    Get PDF
    Matrix tri-factorization subject to binary constraints is a versatile and powerful framework for the simultaneous clustering of observations and features, also known as biclustering. Applications for biclustering encompass the clustering of high-dimensional data and explorative data mining, where the selection of the most important features is relevant. Unfortunately, due to the lack of suitable methods for the optimization subject to binary constraints, the powerful framework of biclustering is typically constrained to clusterings which partition the set of observations or features. As a result, overlap between clusters cannot be modelled and every item, even outliers in the data, have to be assigned to exactly one cluster. In this paper we propose Broccoli, an optimization scheme for matrix factorization subject to binary constraints, which is based on the theoretically well-founded optimization scheme of proximal stochastic gradient descent. Thereby, we do not impose any restrictions on the obtained clusters. Our experimental evaluation, performed on both synthetic and real-world data, and against 6 competitor algorithms, show reliable and competitive performance, even in presence of a high amount of noise in the data. Moreover, a qualitative analysis of the identified clusters shows that Broccoli may provide meaningful and interpretable clustering structures

    Deep Learning based Recommender System: A Survey and New Perspectives

    Full text link
    With the ever-growing volume of online information, recommender systems have been an effective strategy to overcome such information overload. The utility of recommender systems cannot be overstated, given its widespread adoption in many web applications, along with its potential impact to ameliorate many problems related to over-choice. In recent years, deep learning has garnered considerable interest in many research fields such as computer vision and natural language processing, owing not only to stellar performance but also the attractive property of learning feature representations from scratch. The influence of deep learning is also pervasive, recently demonstrating its effectiveness when applied to information retrieval and recommender systems research. Evidently, the field of deep learning in recommender system is flourishing. This article aims to provide a comprehensive review of recent research efforts on deep learning based recommender systems. More concretely, we provide and devise a taxonomy of deep learning based recommendation models, along with providing a comprehensive summary of the state-of-the-art. Finally, we expand on current trends and provide new perspectives pertaining to this new exciting development of the field.Comment: The paper has been accepted by ACM Computing Surveys. https://doi.acm.org/10.1145/328502

    Graphical models for biclustering and information retrieval in gene expression data

    Get PDF
    The cell coordinates its biological response to the environment partly via the selective synthesis of thousands of unique RNA and protein molecules. Understanding the molecular biology of the cell is thus essential to the advancement of areas such as health care, agriculture, and energy production, but requires the ability to simultaneously acquire information about thousands of molecules in a sample. Recent high-throughput measurement technologies address this concern. While being useful, they generate a high volume of data and bring in methodological challenges, effectively shifting the bottleneck in molecular biology research from data acquisition to data analysis. In particular, an important challenge is the genome-wide analysis of how RNA is transcribed under different conditions, organisms, and tissues, a process known as gene expression. When developing computational methods for biological data analysis tasks, probabilistic frameworks constitute promising approaches due to their flexibility, soundness, and ability to handle noisy data. In this thesis, the contributions are in the development of probabilistic methods for two relevant tasks in genome-wide gene expression analysis, namely biclustering and information retrieval. Biclustering concerns the simultaneous grouping of objects, e.g., genes, and conditions. The first contribution is the development of a Bayesian extension to an existing biclustering model. The second contribution is a novel probabilistic method that allows deriving a hierarchical organization of microarrays in a gene expression data set and at the same time indicate the genes that characterize the hierarchy. Finally, the third contribution is a general probabilistic biclustering framework that easily lends itself to different data types and model assumptions. Information retrieval in gene expression data is needed because of the increasing amount of available data stored in public databases. Two probabilistic methods for information retrieval are proposed. The models are used in a series of biological case studies that show how the proposed approaches have the potential to accelerate biological research by jointly analyzing data from different studies. In particular, several connections between biological conditions found by the models either correspond to existing biological knowledge or were used in a confirmatory follow-up study to obtain novel biological findings

    The State of the Art in Multilayer Network Visualization

    Get PDF
    Modelling relationships between entities in real-world systems with a simple graph is a standard approach. However, reality is better embraced as several interdependent subsystems (or layers). Recently the concept of a multilayer network model has emerged from the field of complex systems. This model can be applied to a wide range of real-world datasets. Examples of multilayer networks can be found in the domains of life sciences, sociology, digital humanities and more. Within the domain of graph visualization there are many systems which visualize datasets having many characteristics of multilayer graphs. This report provides a state of the art and a structured analysis of contemporary multilayer network visualization, not only for researchers in visualization, but also for those who aim to visualize multilayer networks in the domain of complex systems, as well as those developing systems across application domains. We have explored the visualization literature to survey visualization techniques suitable for multilayer graph visualization, as well as tools, tasks, and analytic techniques from within application domains. This report also identifies the outstanding challenges for multilayer graph visualization and suggests future research directions for addressing them

    The State of the Art in Multilayer Network Visualization

    Get PDF
    Modelling relationship between entities in real-world systems with a simple graph is a standard approach. However, realityis better embraced as several interdependent subsystems (or layers). Recently, the concept of a multilayer network model hasemerged from the field of complex systems. This model can be applied to a wide range of real-world data sets. Examples ofmultilayer networks can be found in the domains of life sciences, sociology, digital humanities and more. Within the domainof graph visualization, there are many systems which visualize data sets having many characteristics of multilayer graphs.This report provides a state of the art and a structured analysis of contemporary multilayer network visualization, not only forresearchers in visualization, but also for those who aim to visualize multilayer networks in the domain of complex systems, as wellas those developing systems across application domains. We have explored the visualization literature to survey visualizationtechniques suitable for multilayer graph visualization, as well as tools, tasks and analytic techniques from within applicationdomains. This report also identifies the outstanding challenges for multilayer graph visualization and suggests future researchdirections for addressing them

    Neuroengineering of Clustering Algorithms

    Get PDF
    Cluster analysis can be broadly divided into multivariate data visualization, clustering algorithms, and cluster validation. This dissertation contributes neural network-based techniques to perform all three unsupervised learning tasks. Particularly, the first paper provides a comprehensive review on adaptive resonance theory (ART) models for engineering applications and provides context for the four subsequent papers. These papers are devoted to enhancements of ART-based clustering algorithms from (a) a practical perspective by exploiting the visual assessment of cluster tendency (VAT) sorting algorithm as a preprocessor for ART offline training, thus mitigating ordering effects; and (b) an engineering perspective by designing a family of multi-criteria ART models: dual vigilance fuzzy ART and distributed dual vigilance fuzzy ART (both of which are capable of detecting complex cluster structures), merge ART (aggregates partitions and lessens ordering effects in online learning), and cluster validity index vigilance in fuzzy ART (features a robust vigilance parameter selection and alleviates ordering effects in offline learning). The sixth paper consists of enhancements to data visualization using self-organizing maps (SOMs) by depicting in the reduced dimension and topology-preserving SOM grid information-theoretic similarity measures between neighboring neurons. This visualization\u27s parameters are estimated using samples selected via a single-linkage procedure, thereby generating heatmaps that portray more homogeneous within-cluster similarities and crisper between-cluster boundaries. The seventh paper presents incremental cluster validity indices (iCVIs) realized by (a) incorporating existing formulations of online computations for clusters\u27 descriptors, or (b) modifying an existing ART-based model and incrementally updating local density counts between prototypes. Moreover, this last paper provides the first comprehensive comparison of iCVIs in the computational intelligence literature --Abstract, page iv

    Summarisation of weighted networks

    Get PDF
    Networks often contain implicit structure. We introduce novel problems and methods that look for structure in networks, by grouping nodes into supernodes and edges to superedges, and then make this structure visible to the user in a smaller generalised network. This task of finding generalisations of nodes and edges is formulated as network Summarisation'. We propose models and algorithms for networks that have weights on edges, on nodes or on both, and study three new variants of the network summarisation problem. In edge-based weighted network summarisation, the summarised network should preserve edge weights as well as possible. A wider class of settings is considered in path-based weighted network summarisation, where the resulting summarised network should preserve longer range connectivities between nodes. Node-based weighted network summarisation in turn allows weights also on nodes and summarisation aims to preserve more information related to high weight nodes. We study theoretical properties of these problems and show them to be NP-hard. We propose a range of heuristic generalisation algorithms with different trade-offs between complexity and quality of the result. Comprehensive experiments on real data show that weighted networks can be summarised efficiently with relatively little error.Peer reviewe

    Forestogram: Biclustering Visualization Framework with Applications in Public Transport and Bioinformatics

    Get PDF
    RÉSUMÉ : Dans de nombreux problèmes d’analyse de données, les données sont exprimées dans une matrice avec les sujets en ligne et les attributs en colonne. Les méthodes de segmentations traditionnelles visent à regrouper les sujets (lignes), selon des critères de similitude entre ces sujets. Le but est de constituer des groupes de sujets (lignes) qui partagent un certain degré de ressemblance. Les groupes obtenus permettent de garantir que les sujets partagent des similitudes dans leurs attributs (colonnes), il n’y a cependant aucune garantie sur ce qui se passe au niveau des attributs (les colonnes). Dans certaines applications, un regroupement simultané des lignes et des colonnes appelé biclustering de la matrice de données peut être souhaité. Pour cela, nous concevons et développons un nouveau cadre appelé Forestogram, qui permet le calcul de ce regroupement simultané des lignes et des colonnes (biclusters)dans un mode hiérarchique. Le regroupement simultané des lignes et des colonnes de manière hiérarchique peut aider les praticiens à mieux comprendre comment les groupes évoluent avec des propriétés théoriques intéressantes. Forestogram, le nouvel outil de calcul et de visualisation proposé, pourrait être considéré comme une extension 3D du dendrogramme, avec une fusion orthogonale étendue. Chaque bicluster est constitué d’un groupe de lignes (ou de sujets) qui déplie un schéma fortement corrélé avec le groupe de colonnes (ou attributs) correspondantes. Cependant, au lieu d’effectuer un clustering bidirectionnel indépendamment de chaque côté, nous proposons un algorithme de biclustering hiérarchique qui prend les lignes et les colonnes en même temps pour déterminer les biclusters. De plus, nous développons un critère d’information basé sur un modèle qui fournit un nombre estimé de biclusters à travers un ensemble de configurations hiérarchiques au sein du forestogramme sous des hypothèses légères. Nous étudions le cadre suggéré dans deux perspectives appliquées différentes, l’une dans le domaine du transport en commun, l’autre dans le domaine de la bioinformatique. En premier lieu, nous étudions le comportement des usagers dans le transport en commun à partir de deux informations distinctes, les données temporelles et les coordonnées spatiales recueillies à partir des données de transaction de la carte à puce des usagers. Dans de nombreuses villes, les sociétés de transport en commun du monde entier utilisent un système de carte à puce pour gérer la perception des tarifs. L’analyse de cette information fournit un aperçu complet de l’influence de l’utilisateur dans le réseau de transport en commun interactif. À cet égard, l’analyse des données temporelles, décrivant l’heure d’entrée dans le réseau de transport en commun est considérée comme la composante la plus importante des données recueillies à partir des cartes à puce. Les techniques classiques de segmentation, basées sur la distance, ne sont pas appropriées pour analyser les données temporelles. Une nouvelle projection intuitive est suggérée pour conserver le modèle de données horodatées. Ceci est introduit dans la méthode suggérée pour découvrir le modèle temporel comportemental des utilisateurs. Cette projection conserve la distance temporelle entre toute paire arbitraire de données horodatées avec une visualisation significative. Par conséquent, cette information est introduite dans un algorithme de classification hiérarchique en tant que méthode de segmentation de données pour découvrir le modèle des utilisateurs. Ensuite, l’heure d’utilisation est prise en compte comme une variable latente pour rendre la métrique euclidienne appropriée dans l’extraction du motif spatial à travers notre forestogramme. Comme deuxième application, le forestogramme est testé sur un ensemble de données multiomiques combinées à partir de différentes mesures biologiques pour étudier comment l’état de santé des patientes et les modalités biologiques correspondantes évoluent hiérarchiquement au cours du terme de la grossesse, dans chaque bicluster. Le maintien de la grossesse repose sur un équilibre finement équilibré entre la tolérance à l’allogreffe foetale et la protection mécanismes contre les agents pathogènes envahissants. Malgré l’impact bien établi du développement pendant les premiers mois de la grossesse sur les résultats à long terme, les interactions entre les divers mécanismes biologiques qui régissent la progression de la grossesse n’ont pas été étudiées en détail. Démontrer la chronologie de ces adaptations à la grossesse à terme fournit le cadre pour de futures études examinant les déviations impliquées dans les pathologies liées à la grossesse, y compris la naissance prématurée et la prééclampsie. Nous effectuons une analyse multi-physique de 51 échantillons de 17 femmes enceintes, livrant à terme. Les ensembles de données comprennent des mesures de l’immunome, du transcriptome, du microbiome, du protéome et du métabolome d’échantillons obtenus simultanément chez les mêmes patients. La modélisation prédictive multivariée utilisant l’algorithme Elastic Net est utilisée pour mesurer la capacité de chaque ensemble de données à prédire l’âge gestationnel. En utilisant la généralisation empilée, ces ensembles de données sont combinés en un seul modèle. Ce modèle augmente non seulement significativement le pouvoir prédictif en combinant tous les ensembles de données, mais révèle également de nouvelles interactions entre différentes modalités biologiques. En outre, notre forestogramme suggéré est une autre ligne directrice avec l’âge gestationnel au moment de l’échantillonnage qui fournit un modèle non supervisé pour montrer combien d’informations supervisées sont nécessaires pour chaque trimestre pour caractériser les changements induits par la grossesse dans Microbiome, Transcriptome, Génome, Exposome et Immunome réponses efficacement.----------ABSTRACT : In many statistical modeling problems data are expressed in a matrix with subjects in row and attributes in column. In this regard, simultaneous grouping of rows and columns known as biclustering of the data matrix is desired. We design and develop a new framework called Forestogram, with the aim of fast computational and hierarchical illustration of biclusters. Often in practical data analysis, we deal with a two-dimensional object known as the data matrix, where observations are expressed as samples (or subjects) in rows, and attributes (or features) in columns. Thus, simultaneous grouping of rows and columns in a hierarchical manner helps practitioners better understanding how clusters evolve. Forestogram, a novel computational and visualization tool, could be thought of as a 3D expansion of dendrogram, with extended orthogonal merge. Each bicluster consists of group of rows (or samples) that unfolds a highly-correlated schema with their corresponding group of columns (or attributes). However, instead of performing two-way clustering independently on each side, we propose a hierarchical biclustering algorithm which takes rows and columns at the same time to determine the biclusters. Furthermore, we develop a model-based information criterion which provides an estimated number of biclusters through a set of hierarchical configurations within the forestogram under mild assumptions. We study the suggested framework in two different applied perspectives, one in public transit domain, another one in bioinformatics field. First, we investigate the users’ behavior in public transit based on two distinct information, temporal data and spatial coordinates gathered from smart card. In many cities, worldwide public transit companies use smart card system to manage fare collection. Analysis of this information provides a comprehensive insight of user’s influence in the interactive public transit network. In this regard, analysis of temporal data, describing the time of entering to the public transit network is considered as the most substantial component of the data gathered from the smart cards. Classical distance-based techniques are not always suitable to analyze this time series data. A novel projection with intuitive visual map from higher dimension into a three-dimensional clock-like space is suggested to reveal the underlying temporal pattern of public transit users. This projection retains the temporal distance between any arbitrary pair of time-stamped data with meaningful visualization. Consequently, this information is fed into a hierarchical clustering algorithm as a method of data segmentation to discover the pattern of users. Then, the time of the usage is taken as a latent variable into account to make the Euclidean metric appropriate for extracting the spatial pattern through our forestogram. As a second application, forestogram is tested on a multiomics dataset combined from different biological measurements to study how patients and corresponding biological modalities evolve hierarchically in each bicluster over the term of pregnancy. The maintenance of pregnancy relies on a finely-tuned balance between tolerance to the fetal allograft and protective mechanisms against invading pathogens. Despite the well-established impact of development during the early months of pregnancy on long-term outcomes, the interactions between various biological mechanisms that govern the progression of pregnancy have not been studied in details. Demonstrating the chronology of these adaptations to term pregnancy provides the framework for future studies examining deviations implicated in pregnancy-related pathologies including preterm birth and preeclampsia. We perform a multiomics analysis of 51 samples from 17 pregnant women, delivering at term. The datasets include measurements from the immunome, transcriptome, microbiome, proteome, and metabolome of samples obtained simultaneously from the same patients. Multivariate predictive modeling using the Elastic Net algorithm is used to measure the ability of each dataset to predict gestational age. Using stacked generalization, these datasets are combined into a single model. This model not only significantly increases the predictive power by combining all datasets, but also reveals novel interactions between different biological modalities. Furthermore, our suggested forestogram is another guideline along with the gestational age at time of sampling that provides an unsupervised model to show how much supervised information is necessary for each trimester to characterize the pregnancy-induced changes in Microbiome, Transcriptome, Genome, Exposome, and Immunome responses effectively
    corecore