
    Iterative Random Forests to detect predictive and stable high-order interactions

    Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF), Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.
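
The stability criterion quoted in the abstract (an interaction counts as stable when it is returned in more than half of the bootstrap replicates) can be illustrated with a minimal sketch. This is not the iRF algorithm itself: the function name `stable_interactions`, the `threshold` parameter, and the toy replicate outputs are all assumptions made for illustration, and the step that produces candidate interactions per replicate is taken as given.

```python
from collections import Counter


def stable_interactions(replicates, threshold=0.5):
    """Return interactions found in more than `threshold` of the replicates.

    `replicates` is a list with one entry per bootstrap replicate; each entry
    is an iterable of interactions, and each interaction is a frozenset of
    feature names. (Hypothetical data layout, chosen for this sketch.)
    """
    counts = Counter()
    for found in replicates:
        # Count each interaction at most once per replicate.
        counts.update(set(found))
    n = len(replicates)
    # Map each stable interaction to the fraction of replicates recovering it.
    return {inter: c / n for inter, c in counts.items() if c / n > threshold}


# Toy replicate outputs with placeholder feature names (not real results).
replicates = [
    {frozenset({"A", "B"})},
    {frozenset({"A", "B"}), frozenset({"B", "C"})},
    {frozenset({"A", "C"})},
    {frozenset({"A", "B"})},
]
stable = stable_interactions(replicates)
```

Representing an interaction as a `frozenset` makes it hashable and order-independent, so the pair {A, B} is counted once regardless of the order in which its features were reported.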

    Random walks on complex trees

    We study the properties of random walks on complex trees. We observe that the absence of loops is reflected in physical observables showing large differences with respect to their looped counterparts. First, both the vertex discovery rate and the mean topological displacement from the origin present a considerable slowing down in the tree case. Second, the mean first passage time (MFPT) displays a logarithmic degree dependence, in contrast to the inverse degree shape exhibited in looped networks. This deviation can be ascribed to the dominance of source-target topological distance in trees. To show this, we study the distance dependence of a symmetrized MFPT and derive its logarithmic profile, obtaining good agreement with simulation results. These unique properties shed light on the recently reported anomalies observed in diffusive dynamical systems on trees.
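
The mean first passage time mentioned in the abstract can be estimated by direct simulation. The sketch below is a generic Monte Carlo estimate for a simple unbiased random walk on a small tree, not the authors' analysis; the function name `mean_first_passage_time`, the adjacency-list layout, and the three-node path used as input are assumptions for illustration.

```python
import random


def mean_first_passage_time(adj, source, target, trials=10000, seed=0):
    """Estimate the mean number of steps a simple random walk started at
    `source` needs to reach `target` for the first time.

    `adj` maps each node to a list of its neighbours. The graph must be
    connected, otherwise a walk may never terminate.
    """
    rng = random.Random(seed)
    total_steps = 0
    for _ in range(trials):
        node = source
        steps = 0
        while node != target:
            # Move to a uniformly chosen neighbour.
            node = rng.choice(adj[node])
            steps += 1
        total_steps += steps
    return total_steps / trials


# A path on three nodes (the smallest tree with an internal node): 0 - 1 - 2.
path = {0: [1], 1: [0, 2], 2: [1]}
mfpt = mean_first_passage_time(path, source=0, target=2, trials=20000)
```

For a path with n edges, the exact hitting time from one end to the other is n squared, so the estimate for this input should be close to 4.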

    Metrics for Graph Comparison: A Practitioner's Guide

    Comparison of graph structure is a ubiquitous task in data analysis and machine learning, with diverse applications in fields such as neuroscience, cyber security, social network analysis, and bioinformatics, among others. Discovery and comparison of structures such as modular communities, rich clubs, hubs, and trees in data in these fields yields insight into the generative mechanisms and functional properties of the graph. Often, two graphs are compared via a pairwise distance measure, with a small distance indicating structural similarity and vice versa. Common choices include spectral distances (also known as λ distances) and distances based on node affinities. However, there has as yet been no comparative study of the efficacy of these distance measures in discerning between common graph topologies and different structural scales. In this work, we compare commonly used graph metrics and distance measures, and demonstrate their ability to discern between common topological features found in both random graph models and empirical datasets. We put forward a multi-scale picture of graph structure, in which the effect of global and local structure upon the distance measures is considered. We make recommendations on the applicability of different distance measures to empirical graph data problems based on this multi-scale view. Finally, we introduce the Python library NetComp, which implements the graph distances used in this work.

    Subgroup identification in individual patient data meta-analysis using model-based recursive partitioning

    Model-based recursive partitioning (MOB) can be used to identify subgroups with differing treatment effects. The detection rate of treatment-by-covariate interactions and the accuracy of identified subgroups using MOB depend strongly on the sample size. Using data from multiple randomized controlled clinical trials can overcome the problem of too small samples. However, naively pooling data from multiple trials may result in the identification of spurious subgroups, as differences in study design, subject selection and other sources of between-trial heterogeneity are ignored. In order to account for between-trial heterogeneity in individual participant data (IPD) meta-analysis, random-effect models are frequently used. Commonly, heterogeneity in the treatment effect is modelled using random effects, whereas heterogeneity in the baseline risks is modelled by either fixed effects or random effects. In this article, we propose metaMOB, a procedure using the generalized mixed-effects model tree (GLMM tree) algorithm for subgroup identification in IPD meta-analysis. Although the application of metaMOB is potentially wider, e.g. randomized experiments with participants in social sciences or preclinical experiments in life sciences, we focus on randomized controlled clinical trials. In a simulation study, metaMOB outperformed GLMM trees assuming a random intercept only and model-based recursive partitioning (MOB), whose algorithm is the basis for GLMM trees, with respect to the false discovery rates, accuracy of identified subgroups and accuracy of estimated treatment effect. The most robust and therefore most promising method is metaMOB with fixed effects for modelling the between-trial heterogeneity in the baseline risks.

    Interpretable Categorization of Heterogeneous Time Series Data

    Understanding heterogeneous multivariate time series data is important in many applications ranging from smart homes to aviation. Learning models of heterogeneous multivariate time series that are also human-interpretable is challenging and not adequately addressed by the existing literature. We propose grammar-based decision trees (GBDTs) and an algorithm for learning them. GBDTs extend decision trees with a grammar framework. Logical expressions derived from a context-free grammar are used for branching in place of simple thresholds on attributes. The added expressivity enables support for a wide range of data types while retaining the interpretability of decision trees. In particular, when a grammar based on temporal logic is used, we show that GBDTs can be used for the interpretable classification of high-dimensional and heterogeneous time series data. Furthermore, we show how GBDTs can also be used for categorization, which is a combination of clustering and generating interpretable explanations for each cluster. We apply GBDTs to analyze the classic Australian Sign Language dataset as well as data on near mid-air collisions (NMACs). The NMAC data comes from aircraft simulations used in the development of the next-generation Airborne Collision Avoidance System (ACAS X). Comment: 9 pages, 5 figures, 2 tables, SIAM International Conference on Data Mining (SDM) 201

    Motif counting beyond five nodes

    Counting graphlets is a well-studied problem in graph mining and social network analysis. Recently, several papers explored very simple and natural algorithms based on Monte Carlo sampling of Markov Chains (MC), and reported encouraging results. We show, perhaps surprisingly, that such algorithms are outperformed by color coding (CC) [2], a sophisticated algorithmic technique that we extend to the case of graphlet sampling and for which we prove strong statistical guarantees. Our computational experiments on graphs with millions of nodes show CC to be more accurate than MC; furthermore, we formally show that the mixing time of the MC approach is too high in general, even when the input graph has high conductance. All this comes at a price, however. While MC is very efficient in terms of space, CC’s memory requirements become demanding when the size of the input graph and that of the graphlets grow. And yet, our experiments show that CC can push the limits of the state-of-the-art, both in terms of the size of the input graph and of that of the graphlets.
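
To give a feel for the color coding idea named in the abstract, here is a minimal sketch of the classic color-coding estimator for one simple pattern, paths on k nodes. It is not the paper's graphlet sampling extension: the function names `count_colorful_paths` and `estimate_path_count`, the adjacency-list layout, and the triangle input are assumptions for illustration. Each node gets one of k random colors, a dynamic program counts the paths whose nodes all have distinct colors, and the count is rescaled by k^k / k!, the inverse of the probability that a fixed k-node path is colorful.

```python
import random
from math import factorial


def count_colorful_paths(adj, colors, k):
    """Count undirected simple paths on k nodes whose colors are all distinct."""
    # dp maps (end_node, frozenset_of_used_colors) -> number of directed
    # colorful paths. Distinct colors imply distinct nodes, so the paths
    # counted here are automatically simple.
    dp = {(v, frozenset([colors[v]])): 1 for v in adj}
    for _ in range(k - 1):
        nxt = {}
        for (v, used), n in dp.items():
            for u in adj[v]:
                if colors[u] not in used:
                    key = (u, used | {colors[u]})
                    nxt[key] = nxt.get(key, 0) + n
        dp = nxt
    total = sum(dp.values())
    # Each undirected path with k > 1 nodes was counted once per direction.
    return total // 2 if k > 1 else total


def estimate_path_count(adj, k, trials=1000, seed=0):
    """Average the rescaled colorful-path count over random colorings."""
    rng = random.Random(seed)
    scale = k ** k / factorial(k)
    total = 0.0
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        total += count_colorful_paths(adj, colors, k) * scale
    return total / trials


# A triangle contains exactly three paths on three nodes (one per middle node).
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
estimate = estimate_path_count(triangle, k=3, trials=6000)
```

The table `dp` is keyed by color subsets, so its size grows with both the number of nodes and 2^k, which is consistent with the abstract's remark that memory becomes the limiting factor as graphs and graphlets grow.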