9 research outputs found

    Two optimal strategies for active learning of causal models from interventions

    From observational data alone, a causal DAG is in general identifiable only up to Markov equivalence. Interventional data generally improves identifiability; however, the gain of an intervention strongly depends on the intervention target, i.e., the intervened variables. We present active learning strategies that calculate optimal interventions for two different learning goals. The first is a greedy approach using single-vertex interventions that maximizes the number of edges that can be oriented after each intervention. The second yields, in polynomial time, a minimum set of targets of arbitrary size that guarantees full identifiability. This second approach proves a conjecture of Eberhardt.
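    The greedy strategy can be illustrated with a deliberately simplified sketch. Assumption not in the abstract: here an intervention on a vertex v is taken to orient exactly the edges incident to v, ignoring the additional orientations that Meek's rules would propagate, so this is a toy version of the first strategy, not the paper's algorithm. Function names are illustrative.

```python
# Toy sketch of greedy single-vertex intervention selection on an
# undirected skeleton. Simplification (not the paper's full method):
# intervening on v is assumed to orient exactly the edges incident to v,
# and the extra orientations from Meek's rules are ignored.

def greedy_interventions(edges):
    """Greedily pick intervention targets until every skeleton edge
    is oriented; each target covers its incident unoriented edges."""
    unoriented = set(frozenset(e) for e in edges)
    targets = []
    while unoriented:
        # choose the vertex incident to the most still-unoriented edges
        vertices = {v for e in unoriented for v in e}
        best = max(vertices, key=lambda v: sum(1 for e in unoriented if v in e))
        targets.append(best)
        unoriented = {e for e in unoriented if best not in e}
    return targets

# A star graph is fully oriented by a single intervention at the hub:
print(greedy_interventions([("hub", "a"), ("hub", "b"), ("hub", "c")]))
# -> ['hub']
```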

    Experiment Selection for Causal Discovery

    Randomized controlled experiments are often described as the most reliable tool available to scientists for discovering causal relationships among quantities of interest. However, it is often unclear how many and which different experiments are needed to identify the full (possibly cyclic) causal structure among some given (possibly causally insufficient) set of variables. Recent results in the causal discovery literature have explored various identifiability criteria that depend on the assumptions one is able to make about the underlying causal process, but these criteria are not directly constructive for selecting the optimal set of experiments. Fortunately, many of the needed constructions already exist in the combinatorics literature, albeit under terminology unfamiliar to most of the causal discovery community. In this paper we translate the theoretical results and apply them to the concrete problem of experiment selection. For a variety of settings we give explicit constructions of the optimal set of experiments and adapt some of the general combinatorics results to answer questions relating to the problem of experiment selection.
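    A minimal sketch of the kind of combinatorial construction the paper draws on: a separating system of experiments over n variables. Under the simplifying assumption (stated here, not taken from the abstract) that it suffices for every pair of variables to be split by some experiment, one variable intervened on and the other not, indexing experiments by the bits of each variable's index achieves this with only ceil(log2 n) experiments. Function names are illustrative, not from the paper.

```python
import math

# Experiment i intervenes on every variable whose index has bit i set,
# so any two distinct variables are split (one intervened, one not) by
# at least one experiment: a "separating system" in combinatorics terms.

def separating_experiments(n):
    bits = max(1, math.ceil(math.log2(n)))
    return [{v for v in range(n) if (v >> i) & 1} for i in range(bits)]

def separates_all_pairs(n, experiments):
    # check that every pair (a, b) is split by at least one experiment
    return all(
        any((a in e) != (b in e) for e in experiments)
        for a in range(n) for b in range(a + 1, n)
    )

exps = separating_experiments(8)
print(len(exps))                     # 3 experiments suffice for 8 variables
print(separates_all_pairs(8, exps))  # True
```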

    Methods for Reconstructing Networks with Incomplete Information.

    Network representations of complex systems are widespread, and reconstructing unknown networks from data has been intensively researched across the statistical and broader scientific communities. Two challenges in network reconstruction are having insufficient data to illuminate the full structure of the network and needing to combine information from different data sources. Addressing these challenges, this thesis contributes methodology for network reconstruction in three respects. First, we consider sequentially choosing interventions to discover structure in directed networks, focusing on learning a partial order over the nodes. This focus leads to a new model for intervention data under which nodal variables depend on the lengths of paths separating them from intervention targets rather than on parent sets. Taking a Bayesian approach, we present partial-order-based priors and develop a novel Markov chain Monte Carlo (MCMC) method for computing posterior expectations over directed acyclic graphs. The utility of the MCMC approach comes from designing new proposals for the Metropolis algorithm that move locally among partial orders while independently sampling graphs from each partial order. The resulting Markov chains mix rapidly and are ergodic. We also adapt an existing strategy for active structure learning, develop an efficient Monte Carlo procedure for estimating the resulting decision function, and evaluate the proposed methods numerically using simulations and benchmark datasets. We next study penalized likelihood methods using incomplete order information as arising from intervention data. To make the notion of incomplete information precise, we introduce and formally define incomplete partial orders, which subsume the important special case of a known total ordering of the nodes. This special case lies along an information lattice, and we study the reconstruction performance of penalized likelihood methods at different points along this lattice. Finally, we present a method for ranking a network's potential edges using time-course data. The novelty is our development of a nonparametric gradient-matching procedure and a related summary statistic for measuring the strength of relationships among components in dynamic systems. Simulation studies demonstrate that, given sufficient signal, using this procedure to move from linear to additive approximations leads to improved rankings of potential edges.
    PhD thesis, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113316/1/jbhender_1.pd
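    As a hedged sketch of the gradient-matching idea, the toy below estimates derivatives by finite differences and scores a candidate edge i -> j by the correlation between x_i and dx_j/dt. This corresponds to the "linear approximation" end of the spectrum the thesis discusses; its nonparametric/additive procedure and summary statistic are not reproduced here, and all names are illustrative.

```python
# Toy gradient matching for ranking edges from time-course data.
# Simplifications: finite-difference derivatives and a plain Pearson
# correlation as the relationship score.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx > 0 and vy > 0 else 0.0

def rank_edges(series, dt=1.0):
    """series: dict node -> observations at equally spaced times.
    Returns candidate edges (i, j) sorted by |corr(x_i, dx_j/dt)|."""
    grads = {j: [(x[t + 1] - x[t]) / dt for t in range(len(x) - 1)]
             for j, x in series.items()}
    scores = {(i, j): abs(pearson(series[i][:-1], grads[j]))
              for i in series for j in series if i != j}
    return sorted(scores, key=scores.get, reverse=True)

# Toy system in which b's rate of change tracks a:
a = [0, 1, 2, 3, 4, 5]
b = [0, 0, 1, 3, 6, 10]   # db/dt roughly equals a
print(rank_edges({"a": a, "b": b})[0])   # -> ('a', 'b')
```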

    Inference of gene regulatory networks from non-independently and identically distributed transcriptomic data

    This thesis investigates the inference of high-dimensional Gaussian graphical models from non-independently and identically distributed transcriptomic data, with the objective of recovering gene regulatory networks. In the high-dimensional setting, data heterogeneity paves the way to structured regularizers that improve the quality of the estimator. We first consider heterogeneity at the network level, building on the assumption that biological networks are organized, which leads to the definition of a weighted l1 regularization. Modelling heterogeneity at the observation level, we provide a consistency analysis of a recent block-sparse regularizer, the cooperative-Lasso, designed to combine observations from distinct but related datasets. Finally, we address the crucial question of uncertainty, deriving homogeneity tests for high-dimensional linear regression.
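    The weighted-l1 idea, letting the penalty vary per edge to encode assumed network organization (e.g. suspected hub genes penalized less), can be caricatured as follows. This is only a toy with illustrative names: a real estimator would maximize a weighted-l1-penalized Gaussian log-likelihood in graphical-lasso style, whereas the sketch just soft-thresholds an empirical association matrix entry-wise with entry-specific weights.

```python
# Toy stand-in for weighted-l1 penalized network estimation: entries of
# an empirical association matrix are soft-thresholded with per-edge
# penalty weights, so low-weight (trusted) edges survive more easily.

def soft_threshold(x, lam):
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def weighted_l1_graph(assoc, weights, lam=0.3):
    """assoc[i][j]: empirical association between genes i and j;
    weights[i][j]: relative penalty (smaller -> edge kept more easily)."""
    p = len(assoc)
    return [[soft_threshold(assoc[i][j], lam * weights[i][j]) if i != j else 0.0
             for j in range(p)] for i in range(p)]

assoc = [[1.0, 0.5, 0.4],
         [0.5, 1.0, 0.2],
         [0.4, 0.2, 1.0]]
# Suppose gene 0 is a suspected hub: halve the penalty on its edges.
weights = [[1.0, 0.5, 0.5],
           [0.5, 1.0, 1.0],
           [0.5, 1.0, 1.0]]
est = weighted_l1_graph(assoc, weights)
print(est[0][2] > 0, est[1][2] == 0.0)   # hub edge kept, weak edge dropped
```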

    Learning the structure of Bayesian networks: application to genetical-genomics data

    Structure learning of gene regulatory networks is a complex task, owing both to the large number of variables involved (several thousand) and to the small number of available samples (a few hundred). Among the proposed approaches, we use the Bayesian network framework: learning a regulatory network then amounts to learning the structure of a Bayesian network in which each variable represents a gene and each arc a regulation between genes. In the first part of this thesis, we are interested in learning the structure of generic Bayesian networks through local search. We explore the search space more efficiently thanks to a new stochastic search algorithm (SGS), a new local operator (SWAP), and an extension of the classical operators that temporarily relaxes the acyclicity constraint imposed on Bayesian networks. The second part focuses on learning gene regulatory networks. We propose a model in the Bayesian network framework that takes two kinds of information into account. The first, commonly used, is gene expression levels. The second, more original, is the presence of mutations on the DNA sequence that can explain variations in expression. The use of these combined data, called genetical genomics, aims to improve the quality of the reconstruction. Our proposals proved effective on simulated genetical-genomics data and allowed us to reconstruct a regulation network from data observed on the plant Arabidopsis thaliana.
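    A hedged sketch of local search over DAG structures: generating the legal neighborhood of a DAG under the classical add/delete/reverse operators with an acyclicity check. The SGS algorithm, the SWAP operator, the temporary relaxation of acyclicity, and the scoring function from the thesis are all omitted; names here are illustrative.

```python
# Neighborhood generation for score-based local search over DAGs.
# An operator is legal only if the resulting graph stays acyclic.

def has_path(arcs, src, dst):
    """Depth-first search: is there a directed path src -> dst?"""
    stack, seen = [src], set()
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        if v in seen:
            continue
        seen.add(v)
        stack.extend(b for (a, b) in arcs if a == v)
    return False

def neighbors(nodes, arcs):
    """All DAGs one classical operator (add/delete/reverse) away."""
    out = []
    for a in nodes:
        for b in nodes:
            # adding a->b is safe iff there is no existing path b->a
            if a != b and (a, b) not in arcs and not has_path(arcs, b, a):
                out.append(("add", arcs | {(a, b)}))
    for (a, b) in arcs:
        out.append(("delete", arcs - {(a, b)}))
        rest = arcs - {(a, b)}
        # reversing a->b is safe iff no other path a->b remains
        if not has_path(rest, a, b):
            out.append(("reverse", rest | {(b, a)}))
    return out

moves = neighbors({"x", "y", "z"}, {("x", "y"), ("y", "z")})
# Adding z->x would close a cycle, so it is never proposed:
print(all(("z", "x") not in g for op, g in moves if op == "add"))  # True
```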