128 research outputs found

    KIRMES: kernel-based identification of regulatory modules in euchromatic sequences

    Get PDF
    Motivation: Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in promoter regions of potential TF target genes. It is typically approached by position weight matrix-based motif identification algorithms using Gibbs sampling, or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail when it comes to the identification of combinations of several degenerate binding sites, as those often found in cis-regulatory modules

    Relating the thermodynamic arrow of time to the causal arrow

    Full text link
    Consider a Hamiltonian system that consists of a slow subsystem S and a fast subsystem F. The autonomous dynamics of S is driven by an effective Hamiltonian, but its thermodynamics is unexpected. We show that a well-defined thermodynamic arrow of time (second law) emerges for S whenever there is a well-defined causal arrow from S to F and the back-action is negligible. This is because the back-action of F on S is described by a non-globally Hamiltonian Born-Oppenheimer term that violates the Liouville theorem, and makes the second law inapplicable to S. If S and F are mixing, under the causal arrow condition they are described by microcanonic distributions P(S) and P(S|F). Their structure supports a causal inference principle proposed recently in machine learning.Comment: 10 page

    Statistical M-Estimation and Consistency in Large Deformable Models for Image Warping

    Get PDF
    The problem of defining appropriate distances between shapes or images and modeling the variability of natural images by group transformations is at the heart of modern image analysis. A current trend is the study of probabilistic and statistical aspects of deformation models, and the development of consistent statistical procedure for the estimation of template images. In this paper, we consider a set of images randomly warped from a mean template which has to be recovered. For this, we define an appropriate statistical parametric model to generate random diffeomorphic deformations in two-dimensions. Then, we focus on the problem of estimating the mean pattern when the images are observed with noise. This problem is challenging both from a theoretical and a practical point of view. M-estimation theory enables us to build an estimator defined as a minimizer of a well-tailored empirical criterion. We prove the convergence of this estimator and propose a gradient descent algorithm to compute this M-estimator in practice. Simulations of template extraction and an application to image clustering and classification are also provided

    ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases.</p> <p>Results</p> <p>We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which allows to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases.</p> <p>Conclusions</p> <p>ProDiGe implements a new machine learning paradigm for gene prioritization, which could help the identification of new disease genes. It is freely available at <url>http://cbio.ensmp.fr/prodige</url>.</p

    Kernel Spectral Clustering and applications

    Full text link
    In this chapter we review the main literature related to kernel spectral clustering (KSC), an approach to clustering cast within a kernel-based optimization setting. KSC represents a least-squares support vector machine based formulation of spectral clustering described by a weighted kernel PCA objective. Just as in the classifier case, the binary clustering model is expressed by a hyperplane in a high dimensional space induced by a kernel. In addition, the multi-way clustering can be obtained by combining a set of binary decision functions via an Error Correcting Output Codes (ECOC) encoding scheme. Because of its model-based nature, the KSC method encompasses three main steps: training, validation, testing. In the validation stage model selection is performed to obtain tuning parameters, like the number of clusters present in the data. This is a major advantage compared to classical spectral clustering where the determination of the clustering parameters is unclear and relies on heuristics. Once a KSC model is trained on a small subset of the entire data, it is able to generalize well to unseen test points. Beyond the basic formulation, sparse KSC algorithms based on the Incomplete Cholesky Decomposition (ICD) and L0L_0, L1,L0+L1L_1, L_0 + L_1, Group Lasso regularization are reviewed. In that respect, we show how it is possible to handle large scale data. Also, two possible ways to perform hierarchical clustering and a soft clustering method are presented. Finally, real-world applications such as image segmentation, power load time-series clustering, document clustering and big data learning are considered.Comment: chapter contribution to the book "Unsupervised Learning Algorithms

    Multiple Imputation Ensembles (MIE) for dealing with missing data

    Get PDF
    Missing data is a significant issue in many real-world datasets, yet there are no robust methods for dealing with it appropriately. In this paper, we propose a robust approach to dealing with missing data in classification problems: Multiple Imputation Ensembles (MIE). Our method integrates two approaches: multiple imputation and ensemble methods and compares two types of ensembles: bagging and stacking. We also propose a robust experimental set-up using 20 benchmark datasets from the UCI machine learning repository. For each dataset, we introduce increasing amounts of data Missing Completely at Random. Firstly, we use a number of single/multiple imputation methods to recover the missing values and then ensemble a number of different classifiers built on the imputed data. We assess the quality of the imputation by using dissimilarity measures. We also evaluate the MIE performance by comparing classification accuracy on the complete and imputed data. Furthermore, we use the accuracy of simple imputation as a benchmark for comparison. We find that our proposed approach combining multiple imputation with ensemble techniques outperform others, particularly as missing data increases

    Knowledge-based energy functions for computational studies of proteins

    Full text link
    This chapter discusses theoretical framework and methods for developing knowledge-based potential functions essential for protein structure prediction, protein-protein interaction, and protein sequence design. We discuss in some details about the Miyazawa-Jernigan contact statistical potential, distance-dependent statistical potentials, as well as geometric statistical potentials. We also describe a geometric model for developing both linear and non-linear potential functions by optimization. Applications of knowledge-based potential functions in protein-decoy discrimination, in protein-protein interactions, and in protein design are then described. Several issues of knowledge-based potential functions are finally discussed.Comment: 57 pages, 6 figures. To be published in a book by Springe

    Segmentation of Multi-Isotope Imaging Mass Spectrometry Data for Semi-Automatic Detection of Regions of Interest

    Get PDF
    Multi-isotope imaging mass spectrometry (MIMS) associates secondary ion mass spectrometry (SIMS) with detection of several atomic masses, the use of stable isotopes as labels, and affiliated quantitative image-analysis software. By associating image and measure, MIMS allows one to obtain quantitative information about biological processes in sub-cellular domains. MIMS can be applied to a wide range of biomedical problems, in particular metabolism and cell fate [1], [2], [3]. In order to obtain morphologically pertinent data from MIMS images, we have to define regions of interest (ROIs). ROIs are drawn by hand, a tedious and time-consuming process. We have developed and successfully applied a support vector machine (SVM) for segmentation of MIMS images that allows fast, semi-automatic boundary detection of regions of interests. Using the SVM, high-quality ROIs (as compared to an expert's manual delineation) were obtained for 2 types of images derived from unrelated data sets. This automation simplifies, accelerates and improves the post-processing analysis of MIMS images. This approach has been integrated into “Open MIMS,” an ImageJ-plugin for comprehensive analysis of MIMS images that is available online at http://www.nrims.hms.harvard.edu/NRIMS_ImageJ.php

    Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Visualization of DNA microarray data in two or three dimensional spaces is an important exploratory analysis step in order to detect quality issues or to generate new hypotheses. Principal Component Analysis (PCA) is a widely used linear method to define the mapping between the high-dimensional data and its low-dimensional representation. During the last decade, many new nonlinear methods for dimension reduction have been proposed, but it is still unclear how well these methods capture the underlying structure of microarray gene expression data. In this study, we assessed the performance of the PCA approach and of six nonlinear dimension reduction methods, namely Kernel PCA, Locally Linear Embedding, Isomap, Diffusion Maps, Laplacian Eigenmaps and Maximum Variance Unfolding, in terms of visualization of microarray data.</p> <p>Results</p> <p>A systematic benchmark, consisting of Support Vector Machine classification, cluster validation and noise evaluations was applied to ten microarray and several simulated datasets. Significant differences between PCA and most of the nonlinear methods were observed in two and three dimensional target spaces. With an increasing number of dimensions and an increasing number of differentially expressed genes, all methods showed similar performance. PCA and Diffusion Maps responded less sensitive to noise than the other nonlinear methods.</p> <p>Conclusions</p> <p>Locally Linear Embedding and Isomap showed a superior performance on all datasets. In very low-dimensional representations and with few differentially expressed genes, these two methods preserve more of the underlying structure of the data than PCA, and thus are favorable alternatives for the visualization of microarray data.</p
    corecore