10,172 research outputs found

    Discovering transcriptional modules by Bayesian data integration

    Motivation: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets. Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs.

    A temporal switch model for estimating transcriptional activity in gene expression

    Motivation: The analysis and mechanistic modelling of time series gene expression data provided by techniques such as microarrays, NanoString, reverse transcription–polymerase chain reaction and advanced sequencing are invaluable for developing an understanding of the variation in key biological processes. We address this by proposing the estimation of a flexible dynamic model, which decouples temporal synthesis and degradation of mRNA and, hence, allows for transcriptional activity to switch between different states. Results: The model is flexible enough to capture a variety of observed transcriptional dynamics, including oscillatory behaviour, in a way that is compatible with the demands imposed by the quality, time-resolution and quantity of the data. We show that the timing and number of switch events in transcriptional activity can be estimated alongside individual gene mRNA stability with the help of a Bayesian reversible jump Markov chain Monte Carlo algorithm. To demonstrate the methodology, we focus on modelling the wild-type behaviour of a selection of 200 circadian genes of the model plant Arabidopsis thaliana. The results support the idea that using a mechanistic model to identify transcriptional switch points is likely to strongly contribute to efforts in elucidating and understanding key biological processes, such as transcription and degradation.
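
The idea of decoupling synthesis from degradation can be illustrated with a small sketch. It assumes, as an illustration only, that the synthesis rate is piecewise constant between switch times and that degradation is first order, so that dm/dt = tau_k - delta * m on each segment. The function name, parameter names and the closed-form segment update are choices made for this sketch; the abstract does not give the model's exact form, and the reversible jump MCMC inference step is not shown.

```python
import math


def mrna_level(t, switch_times, synthesis_rates, degradation_rate, m0):
    """mRNA level at time t >= 0 under dm/dt = tau_k - delta * m.

    Assumptions of this sketch (not taken from the abstract):
    - switch_times is sorted, strictly positive, and marks where the
      synthesis rate changes;
    - synthesis_rates has one more entry than switch_times;
    - degradation_rate (delta) is constant and positive.
    """
    if len(synthesis_rates) != len(switch_times) + 1:
        raise ValueError("need exactly one synthesis rate per segment")
    level = m0
    start = 0.0
    boundaries = list(switch_times) + [math.inf]
    for tau, end in zip(synthesis_rates, boundaries):
        # Exact solution of the linear ODE over this segment.
        elapsed = min(t, end) - start
        steady_state = tau / degradation_rate
        level = steady_state + (level - steady_state) * math.exp(
            -degradation_rate * elapsed
        )
        if t <= end:
            return level
        start = end
    return level
```

For example, `mrna_level(7.0, [5.0], [2.0, 0.0], 0.5, 1.0)` evaluates a gene that is transcribed at rate 2.0 until time 5 and then switched off, after which the level decays exponentially at the degradation rate.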

    Survey of data mining approaches to user modeling for adaptive hypermedia

    The ability of an adaptive hypermedia system to create tailored environments depends mainly on the amount and accuracy of information stored in each user model. Some of the difficulties that user modeling faces are the amount of data available to create user models, the adequacy of the data, the noise within that data, and the necessity of capturing the imprecise nature of human behavior. Data mining and machine learning techniques have the ability to handle large amounts of data and to process uncertainty. These characteristics make these techniques suitable for automatic generation of user models that simulate human decision making. This paper surveys different data mining techniques that can be used to efficiently and accurately capture user behavior. The paper also presents guidelines that show which techniques may be used more efficiently according to the task implemented by the application.

    Bayesian correlated clustering to integrate multiple datasets

    Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets. Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDI’s performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques – as well as to non-integrative approaches – demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods.
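
To make the notion of an agreement parameter concrete, here is a minimal sketch for two datasets. It assumes, purely for illustration, that the joint prior over a gene's two component labels is proportional to the product of the per-dataset mixture weights, up-weighted by a factor (1 + phi) when the labels coincide. The abstract does not state the functional form, so `joint_allocation_prior`, `agreement_probability` and `phi` are hypothetical names for this sketch, not the MDI software's API.

```python
def joint_allocation_prior(weights_a, weights_b, phi):
    """Normalised table P[j][k] proportional to
    weights_a[j] * weights_b[k] * (1 + phi * [j == k]).

    phi >= 0 is the assumed agreement parameter: phi = 0 makes the two
    allocations independent, larger phi favours matching labels.
    """
    table = [
        [wa * wb * (1.0 + phi * (j == k)) for k, wb in enumerate(weights_b)]
        for j, wa in enumerate(weights_a)
    ]
    total = sum(sum(row) for row in table)
    return [[value / total for value in row] for row in table]


def agreement_probability(table):
    """Prior probability that both datasets assign the same label."""
    shared = min(len(table), len(table[0]))
    return sum(table[j][j] for j in range(shared))
```

With uniform weights over two components, the prior probability of agreement is 0.5 when `phi` is 0 and rises to 0.8 when `phi` is 3, which shows how a single parameter can express how strongly two datasets share clustering structure.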

    Dirichlet belief networks for topic structure learning

    Recently, considerable research effort has been devoted to developing deep architectures for topic models to learn topic structures. Although several deep models have been proposed to learn better topic proportions of documents, how to leverage the benefits of deep structures for learning word distributions of topics has not yet been rigorously studied. Here we propose a new multi-layer generative process on word distributions of topics, where each layer consists of a set of topics and each topic is drawn from a mixture of the topics of the layer above. As the topics in all layers can be directly interpreted by words, the proposed model is able to discover interpretable topic hierarchies. As a self-contained module, our model can be flexibly adapted to different kinds of topic models to improve their modelling accuracy and interpretability. Extensive experiments on text corpora demonstrate the advantages of the proposed model. Comment: accepted in NIPS 201
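
The layered generative process described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact construction: top-layer topics are drawn from a flat Dirichlet, each lower-layer topic is drawn from a Dirichlet centred on a random convex combination of its parent topics, and the `scale` concentration, the small smoothing constant and all function names are choices made for this sketch.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalising independent gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [draw / total for draw in draws]


def sample_topic_hierarchy(vocab_size, layer_sizes, scale=50.0, seed=0):
    """Return a list of layers; each layer is a list of word distributions.

    layer_sizes[0] is the number of topics in the top layer. The mixing
    weights and `scale` are assumptions of this sketch.
    """
    rng = random.Random(seed)
    layers = [
        [sample_dirichlet([1.0] * vocab_size, rng) for _ in range(layer_sizes[0])]
    ]
    for size in layer_sizes[1:]:
        parents = layers[-1]
        layer = []
        for _ in range(size):
            # Mix the parent topics, then draw a child topic around the mixture.
            weights = sample_dirichlet([1.0] * len(parents), rng)
            mean = [
                sum(w * parent[v] for w, parent in zip(weights, parents))
                for v in range(vocab_size)
            ]
            layer.append(sample_dirichlet([scale * m + 1e-3 for m in mean], rng))
        layers.append(layer)
    return layers
```

Because every topic in every layer is itself a distribution over the vocabulary, each one can be inspected through its highest-probability words, which is the interpretability property the abstract emphasises.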

    On Practical Machine Learning and Data Analysis

    This thesis discusses and addresses some of the difficulties associated with practical machine learning and data analysis. Introducing data-driven methods in, e.g., industrial and business applications can lead to large gains in productivity and efficiency, but the cost and complexity are often overwhelming. Creating machine learning applications in practice often involves a large amount of manual labour, which often needs to be performed by an experienced analyst without significant experience with the application area. We will here discuss some of the hurdles faced in a typical analysis project and suggest measures and methods to simplify the process. One of the most important issues when applying machine learning methods to complex data, such as industrial applications, is that the processes generating the data are modelled in an appropriate way. Relevant aspects have to be formalised and represented in a way that allows us to perform our calculations in an efficient manner. We present a statistical modelling framework, Hierarchical Graph Mixtures, based on a combination of graphical models and mixture models. It allows us to create consistent, expressive statistical models that simplify the modelling of complex systems. Using a Bayesian approach, we allow for encoding of prior knowledge and make the models applicable in situations when relatively little data are available. Detecting structures in data, such as clusters and dependency structure, is very important both for understanding an application area and for specifying the structure of e.g. a hierarchical graph mixture. We will discuss how this structure can be extracted for sequential data. By using the inherent dependency structure of sequential data we construct an information theoretical measure of correlation that does not suffer from the problems most common correlation measures have with this type of data.
    In many diagnosis situations it is desirable to perform a classification in an iterative and interactive manner. The matter is often complicated by very limited amounts of knowledge and examples when a new system to be diagnosed is initially brought into use. We describe how to create an incremental classification system based on a statistical model that is trained from empirical data, and show how the limited available background information can still be used initially for a functioning diagnosis system. To minimise the effort with which results are achieved within data analysis projects, we need to address not only the models used, but also the methodology and applications that can help simplify the process. We present a methodology for data preparation and a software library intended for rapid analysis, prototyping, and deployment. Finally, we will study a few example applications, presenting tasks within classification, prediction and anomaly detection. The examples include demand prediction for supply chain management, approximating complex simulators for increased speed in parameter optimisation, and fraud detection and classification within a media-on-demand system.

    Automated user modeling for personalized digital libraries

    Digital libraries (DL) have become one of the most typical ways of accessing any kind of digitalized information. Due to this key role, users welcome any improvements on the services they receive from digital libraries. One trend used to improve digital services is personalization. Up to now, the most common approach for personalization in digital libraries has been user-driven. Nevertheless, the design of efficient personalized services has to be done, at least in part, in an automatic way. In this context, machine learning techniques automate the process of constructing user models. This paper proposes a new approach to construct digital libraries that satisfy users' need for information: Adaptive Digital Libraries, libraries that automatically learn user preferences and goals and personalize their interaction using this information.