Iterative Random Forests to detect predictive and stable high-order interactions
Genomics has revolutionized biology, enabling the interrogation of whole
transcriptomes, genome-wide binding sites for proteins, and many other
molecular processes. However, individual genomic assays measure elements that
interact in vivo as components of larger molecular machines. Understanding how
these high-order interactions drive gene expression presents a substantial
statistical challenge. Building on Random Forests (RF), Random Intersection
Trees (RITs), and through extensive, biologically inspired simulations, we
developed the iterative Random Forest algorithm (iRF). iRF trains a
feature-weighted ensemble of decision trees to detect stable, high-order
interactions with the same order of computational cost as RF. We demonstrate the
utility of iRF for high-order interaction discovery in two prediction problems:
enhancer activity in the early Drosophila embryo and alternative splicing of
primary transcripts in human-derived cell lines. In Drosophila, among the 20
pairwise transcription factor interactions iRF identifies as stable (returned
in more than half of bootstrap replicates), 80% have been previously reported
as physical interactions. Moreover, novel third-order interactions, e.g.
between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order
relationships that are candidates for follow-up experiments. In human-derived
cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated
splicing regulation, and identified novel 5th and 6th order interactions,
indicative of multi-valent nucleosomes with specific roles in splicing
regulation. By decoupling the order of interactions from the computational cost
of identification, iRF opens new avenues of inquiry into the molecular
mechanisms underlying genome biology.
Random walks on complex trees
We study the properties of random walks on complex trees. We observe that the absence of loops is reflected in physical observables showing large differences with respect to their looped counterparts. First, both the vertex discovery rate and the mean topological displacement from the origin present a considerable slowing down in the tree case. Second, the mean first passage time (MFPT) displays a logarithmic degree dependence, in contrast to the inverse degree shape exhibited in looped networks. This deviation can be ascribed to the dominance of source-target topological distance in trees. To show this, we study the distance dependence of a symmetrized MFPT and derive its logarithmic profile, obtaining good agreement with simulation results. These unique properties shed light on the recently reported anomalies observed in diffusive dynamical systems on trees.
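The mean first passage time discussed in this abstract can be illustrated with a small Monte Carlo estimate. The following is only a minimal sketch under stated assumptions: the tree is a random recursive tree (each new node attaches to a uniformly chosen earlier node) and the walk is unbiased. Neither choice is taken from the paper, and all function names are hypothetical.

```python
import random

def random_recursive_tree(n, rng):
    """Grow a tree on n nodes: node i attaches to a uniformly chosen earlier node.
    (Assumed tree model, chosen for illustration only.)"""
    adj = {0: []}
    for i in range(1, n):
        parent = rng.randrange(i)
        adj[i] = [parent]
        adj[parent].append(i)
    return adj

def first_passage_time(adj, source, target, rng):
    """Number of steps an unbiased walk takes to reach target for the first time."""
    node, steps = source, 0
    while node != target:
        node = rng.choice(adj[node])  # move to a uniformly chosen neighbour
        steps += 1
    return steps

def mean_first_passage_time(adj, source, target, runs, rng):
    """Average the first passage time over independent walks."""
    total = sum(first_passage_time(adj, source, target, rng) for _ in range(runs))
    return total / runs

rng = random.Random(0)
tree = random_recursive_tree(50, rng)
mfpt = mean_first_passage_time(tree, source=49, target=0, runs=200, rng=rng)
```

Because a tree has a single path between any two nodes, every walk from the source must traverse that path, which is one way to see why source-target distance weighs so heavily on the result.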
Metrics for Graph Comparison: A Practitioner's Guide
Comparison of graph structure is a ubiquitous task in data analysis and
machine learning, with diverse applications in fields such as neuroscience,
cyber security, social network analysis, and bioinformatics, among others.
Discovery and comparison of structures such as modular communities, rich clubs,
hubs, and trees in data in these fields yields insight into the generative
mechanisms and functional properties of the graph.
Often, two graphs are compared via a pairwise distance measure, with a small
distance indicating structural similarity and vice versa. Common choices
include spectral distances (also known as λ distances) and distances
based on node affinities. However, there has as yet been no comparative study
of the efficacy of these distance measures in discerning between common graph
topologies and different structural scales.
In this work, we compare commonly used graph metrics and distance measures,
and demonstrate their ability to discern between common topological features
found in both random graph models and empirical datasets. We put forward a
multi-scale picture of graph structure, in which the effect of global and local
structure upon the distance measures is considered. We make recommendations on
the applicability of different distance measures to empirical graph data
problems based on this multi-scale view. Finally, we introduce the Python
library NetComp, which implements the graph distances used in this work.
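As an illustration of a spectral distance between two graphs, the sketch below compares the sorted eigenvalues of two adjacency matrices. It assumes NumPy, undirected graphs of equal size, and a Euclidean norm on the spectra. It does not use NetComp's API, and the function name and the `k` parameter are hypothetical.

```python
import numpy as np

def adjacency_spectral_distance(a, b, k=None):
    """Euclidean distance between the sorted adjacency spectra of two graphs.

    a, b: symmetric adjacency matrices of the same shape (assumption).
    k:    optionally compare only the k largest eigenvalues.
    """
    ev_a = np.sort(np.linalg.eigvalsh(a))[::-1]  # descending order
    ev_b = np.sort(np.linalg.eigvalsh(b))[::-1]
    if k is not None:
        ev_a, ev_b = ev_a[:k], ev_b[:k]
    return float(np.linalg.norm(ev_a - ev_b))

# Path graph and cycle graph on four nodes.
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
cycle = path.copy()
cycle[0, 3] = cycle[3, 0] = 1.0

# Relabelling the nodes leaves the spectrum unchanged.
perm = [2, 0, 3, 1]
relabelled = path[np.ix_(perm, perm)]

d_path_cycle = adjacency_spectral_distance(path, cycle)
d_relabelled = adjacency_spectral_distance(path, relabelled)
```

The spectrum depends only on graph structure, so relabelled copies of the same graph are at distance zero up to floating-point error. Non-isomorphic graphs can also share a spectrum, so a zero distance does not prove that two graphs are the same.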
Subgroup identification in individual patient data meta-analysis using model-based recursive partitioning
Model-based recursive partitioning (MOB) can be used to identify subgroups
with differing treatment effects. The detection rate of treatment-by-covariate
interactions and the accuracy of identified subgroups using MOB depend strongly
on the sample size. Using data from multiple randomized controlled clinical
trials can overcome the problem of too small samples. However, naively pooling
data from multiple trials may result in the identification of spurious
subgroups as differences in study design, subject selection and other sources
of between-trial heterogeneity are ignored. In order to account for
between-trial heterogeneity in individual participant data (IPD) meta-analysis
random-effect models are frequently used. Commonly, heterogeneity in the
treatment effect is modelled using random effects whereas heterogeneity in the
baseline risks is modelled by either fixed effects or random effects. In this
article, we propose metaMOB, a procedure using the generalized mixed-effects
model tree (GLMM tree) algorithm for subgroup identification in IPD
meta-analysis. Although the application of metaMOB is potentially wider, e.g.
randomized experiments with participants in social sciences or preclinical
experiments in life sciences, we focus on randomized controlled clinical
trials. In a simulation study, metaMOB outperformed GLMM trees assuming a
random intercept only and model-based recursive partitioning (MOB), whose
algorithm is the basis for GLMM trees, with respect to the false discovery
rates, accuracy of identified subgroups and accuracy of estimated treatment
effect. The most robust and therefore most promising method is metaMOB with
fixed effects for modelling the between-trial heterogeneity in the baseline
risks.
Interpretable Categorization of Heterogeneous Time Series Data
Understanding heterogeneous multivariate time series data is important in
many applications ranging from smart homes to aviation. Learning models of
heterogeneous multivariate time series that are also human-interpretable is
challenging and not adequately addressed by the existing literature. We propose
grammar-based decision trees (GBDTs) and an algorithm for learning them. GBDTs
extend decision trees with a grammar framework. Logical expressions derived
from a context-free grammar are used for branching in place of simple
thresholds on attributes. The added expressivity enables support for a wide
range of data types while retaining the interpretability of decision trees. In
particular, when a grammar based on temporal logic is used, we show that GBDTs
can be used for the interpretable classification of high-dimensional and
heterogeneous time series data. Furthermore, we show how GBDTs can also be used
for categorization, which is a combination of clustering and generating
interpretable explanations for each cluster. We apply GBDTs to analyze the
classic Australian Sign Language dataset as well as data on near mid-air
collisions (NMACs). The NMAC data comes from aircraft simulations used in the
development of the next-generation Airborne Collision Avoidance System (ACAS
X).
Comment: 9 pages, 5 figures, 2 tables, SIAM International Conference on Data
Mining (SDM) 201
Motif counting beyond five nodes
Counting graphlets is a well-studied problem in graph mining and social network analysis. Recently, several papers explored very simple and natural algorithms based on Monte Carlo sampling of Markov Chains (MC), and reported encouraging results. We show, perhaps surprisingly, that such algorithms are outperformed by color coding (CC) [2], a sophisticated algorithmic technique that we extend to the case of graphlet sampling and for which we prove strong statistical guarantees. Our computational experiments on graphs with millions of nodes show CC to be more accurate than MC; furthermore, we formally show that the mixing time of the MC approach is too high in general, even when the input graph has high conductance. All this comes at a price, however. While MC is very efficient in terms of space, CC’s memory requirements become demanding when the size of the input graph and that of the graphlets grow. And yet, our experiments show that CC can push the limits of the state-of-the-art, both in terms of the size of the input graph and of that of the graphlets.
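The idea behind color coding can be sketched on its simplest case, counting simple paths on k nodes. This is not the graphlet-sampling extension described in the abstract. It is a minimal illustration of the classic technique, assuming k ≥ 2, an undirected graph stored as an adjacency dict, and hypothetical function names.

```python
import random
from math import factorial

def count_colorful_paths(adj, colors, k):
    """Count simple k-node paths whose nodes all carry distinct colors (k >= 2).

    Dynamic programme over states (end node, bitmask of colors used).
    Distinct colors imply distinct nodes, so no explicit visited set is needed.
    """
    table = {(v, 1 << colors[v]): 1 for v in adj}
    for _ in range(k - 1):
        nxt = {}
        for (v, mask), cnt in table.items():
            for u in adj[v]:
                bit = 1 << colors[u]
                if mask & bit == 0:  # color not used yet on this path
                    key = (u, mask | bit)
                    nxt[key] = nxt.get(key, 0) + cnt
        table = nxt
    # Each undirected path is discovered once from each endpoint.
    return sum(table.values()) // 2

def estimate_k_paths(adj, k, trials, rng):
    """Estimate the number of k-node paths from random k-colorings.

    A fixed path is colorful with probability k!/k^k, so dividing the
    average colorful count by that probability gives an unbiased estimate.
    """
    p_colorful = factorial(k) / k**k
    total = 0
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        total += count_colorful_paths(adj, colors, k)
    return total / trials / p_colorful

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
exact_colorful = count_colorful_paths(triangle, {0: 0, 1: 1, 2: 2}, 3)
estimate = estimate_k_paths(triangle, 3, trials=2000, rng=random.Random(0))
```

The table holds one entry per (node, color subset) pair, so its size grows with both the number of nodes and 2^k. That growth is consistent with the memory cost the abstract attributes to CC.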