21 research outputs found

    Network reconstruction and causal analysis in systems biology

    No full text
    The inference of causality is an everyday-life question that spans a broad range of domains in which interventions or time-series acquisition may be impracticable, if not unethical. Yet, elucidating causal relationships in real-life complex systems can be convoluted when relying solely on observational data.
    I report here a novel network reconstruction method, which combines constraint-based and Bayesian frameworks to reliably reconstruct networks despite the inherent sampling noise of finite observational datasets. The approach is based on an information-theoretic result tracing the existence of colliders in graphical models back to negative conditional 3-point information between observed variables. This makes it possible to confidently ascertain structural independencies in causal graphs, based on the ranking of their most likely contributing nodes with (significantly) positive conditional 3-point information. Dispensable edges of a complete undirected graph are progressively pruned by iteratively subtracting the most likely positive conditional 3-point information from the 2-point (mutual) information between each pair of nodes. The resulting skeleton is then partially directed by orienting and propagating edge directions based on the sign and magnitude of the conditional 3-point information of unshielded triples. This new approach outperforms constraint-based and Bayesian inference methods on a range of benchmark networks and provides promising predictions when applied to the reconstruction of complex biological systems, such as hematopoietic regulatory subnetworks, zebrafish neural networks, mutational pathways, or the interplay of genomic properties in the evolution of vertebrates.
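    As a minimal illustration of the sign rule underlying this approach (a sketch of the principle, not the thesis's actual implementation), the snippet below estimates the 3-point information I(X;Y;Z) = I(X;Y) - I(X;Y|Z) from discrete samples of a toy collider X -> Z <- Y, and checks that it comes out negative:

    ```python
    import numpy as np

    def mutual_info(x, y):
        """Empirical mutual information I(X;Y) in nats from two discrete arrays."""
        mi = 0.0
        for a in np.unique(x):
            for b in np.unique(y):
                pxy = np.mean((x == a) & (y == b))
                px, py = np.mean(x == a), np.mean(y == b)
                if pxy > 0:
                    mi += pxy * np.log(pxy / (px * py))
        return mi

    def cond_mutual_info(x, y, z):
        """I(X;Y|Z) = sum over z of p(z) * I(X;Y | Z=z)."""
        return sum(np.mean(z == c) * mutual_info(x[z == c], y[z == c])
                   for c in np.unique(z))

    rng = np.random.default_rng(0)
    x = rng.integers(0, 2, 5000)   # independent binary cause
    y = rng.integers(0, 2, 5000)   # independent binary cause
    z = x ^ y                      # collider: Z depends on both X and Y

    # 3-point information: negative for a collider, positive for chains/forks.
    i3 = mutual_info(x, y) - cond_mutual_info(x, y, z)
    print(i3 < 0)
    ```

    Here I(X;Y) is near zero while conditioning on the common effect Z induces dependence between X and Y, so the 3-point information is negative, which is the signature the method uses to orient v-structures.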

    Reconstruction of functional networks and causal analysis in systems biology

    No full text
    The inference of causality is an everyday-life question that spans a broad range of domains in which interventions or time-series acquisition may be impracticable, if not unethical. Yet, elucidating causal relationships in real-life complex systems can be convoluted when relying solely on observational data. I report here a novel network reconstruction method, which combines constraint-based and Bayesian frameworks to reliably reconstruct networks despite the inherent sampling noise of finite observational datasets. The approach is based on an information-theoretic result tracing the existence of colliders in graphical models back to negative conditional 3-point information between observed variables. This makes it possible to confidently ascertain structural independencies in causal graphs, based on the ranking of their most likely contributing nodes with (significantly) positive conditional 3-point information. Dispensable edges of a complete undirected graph are progressively pruned by iteratively subtracting the most likely positive conditional 3-point information from the 2-point (mutual) information between each pair of nodes. The resulting skeleton is then partially directed by orienting and propagating edge directions based on the sign and magnitude of the conditional 3-point information of unshielded triples. This new approach outperforms constraint-based and Bayesian inference methods on a range of benchmark networks and provides promising predictions when applied to the reconstruction of complex biological systems, such as hematopoietic regulatory subnetworks, zebrafish neural networks, mutational pathways, or the interplay of genomic properties in the evolution of vertebrates.

    Regularized Dual-PPMI Co-clustering for Text Data

    No full text
    Co-clustering of document-term matrices has proved to be more effective than one-sided clustering. By their nature, text data are also generally unbalanced and directional. Recently, the von Mises-Fisher (vMF) mixture model was proposed to handle unbalanced data while harnessing the directional nature of text. In this paper we propose a novel co-clustering approach based on a matrix formulation of vMF model-based co-clustering. This formulation leads to a flexible method for text co-clustering that can easily incorporate both word-word semantic relationships and document-document similarities. By contrast with existing methods, which generally use an additive incorporation of similarities, we propose a dual multiplicative regularization that better encapsulates the underlying text data structure. Extensive evaluations on various real-world text datasets demonstrate the superior performance of our proposed approach over baseline and competitive methods, both in terms of clustering results and co-cluster topic coherence.

    A survey on machine learning methods for churn prediction

    No full text
    The diversity and specificities of today's businesses have given rise to a wide range of prediction techniques. In particular, churn prediction is a major economic concern for many companies. The purpose of this study is to draw general guidelines from a benchmark of supervised machine learning techniques in association with widely used data sampling approaches on publicly available datasets in the context of churn prediction. Choosing a priori the most appropriate sampling method as well as the most suitable classification model is not trivial, as it strongly depends on the intrinsic characteristics of the data. In this paper we study the behavior of eleven supervised and semi-supervised learning methods and seven sampling approaches on sixteen diverse and publicly available churn-like datasets. Our evaluations, reported in terms of the Area Under the Curve (AUC) metric, explore the influence of sampling approaches and data characteristics on the performance of the studied learning methods. In addition, we propose the Nemenyi test and Correspondence Analysis as means of comparing and visualizing the association between classification algorithms, sampling methods and datasets. Most importantly, our experiments lead to a practical recommendation for a prediction pipeline based on an ensemble approach. Our proposal can be successfully applied to a wide range of churn-like datasets.
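    Since all evaluations in this survey are reported as AUC, it may help to recall how that metric is computed from raw classifier scores. The following self-contained sketch (not code from the paper) implements AUC via the rank-sum (Mann-Whitney) statistic:

    ```python
    import numpy as np

    def auc(y_true, scores):
        """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
        y_true = np.asarray(y_true)
        scores = np.asarray(scores, dtype=float)
        order = np.argsort(scores)
        ranks = np.empty(len(scores))
        ranks[order] = np.arange(1, len(scores) + 1)
        for s in np.unique(scores):       # average ranks for tied scores
            tied = scores == s
            ranks[tied] = ranks[tied].mean()
        n_pos = y_true.sum()
        n_neg = len(y_true) - n_pos
        # Rank sum of positives, shifted and scaled to the [0, 1] AUC range.
        return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
    ```

    An AUC of 0.5 corresponds to random scoring and 1.0 to a perfect ranking of churners above non-churners, which is why the metric is robust to the class imbalance typical of churn datasets.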

    Regularized bi-directional co-clustering

    No full text
    The simultaneous clustering of documents and words, known as co-clustering, has proved to be more effective than one-sided clustering in dealing with sparse high-dimensional datasets. By their nature, text data are also generally unbalanced and directional. Recently, the von Mises-Fisher (vMF) mixture model was proposed to handle unbalanced data while harnessing the directional nature of text. In this paper, we propose a general co-clustering framework based on a matrix formulation of vMF model-based co-clustering. This formulation leads to a flexible framework for text co-clustering that can easily incorporate both word-word semantic relationships and document-document similarities. By contrast with existing methods, which generally use an additive incorporation of similarities, we propose a bi-directional multiplicative regularization that better encapsulates the underlying text data structure. Extensive evaluations on various real-world text datasets demonstrate the superior performance of our proposed approach over baseline and competitive methods, both in terms of clustering results and co-cluster topic coherence.
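    The contrast between additive and multiplicative incorporation of similarities can be illustrated on a toy document-term matrix. The sketch below is only a plausible reading of the general idea, assuming hypothetical row-stochastic similarity matrices S_d (document-document) and S_w (word-word); the paper's actual model and updates differ:

    ```python
    import numpy as np

    # Toy document-term matrix: 4 documents x 5 terms, two latent topics.
    X = np.array([[2., 1., 0., 0., 0.],
                  [1., 2., 1., 0., 0.],
                  [0., 0., 1., 2., 1.],
                  [0., 0., 0., 1., 2.]])

    def row_normalize(S):
        """Row-stochastic normalization so similarities act as smoothing weights."""
        return S / S.sum(axis=1, keepdims=True)

    # Hypothetical similarities: self-similarity plus immediate neighbors.
    S_d = row_normalize(np.eye(4) + 0.5 * np.eye(4, k=1) + 0.5 * np.eye(4, k=-1))
    S_w = row_normalize(np.eye(5) + 0.5 * np.eye(5, k=1) + 0.5 * np.eye(5, k=-1))

    # Bi-directional multiplicative regularization: smooth documents on the
    # left and words on the right, rather than adding similarity penalty terms.
    X_reg = S_d @ X @ S_w.T

    # L2-normalize rows, reflecting the directional (unit-sphere) vMF setting.
    X_reg /= np.linalg.norm(X_reg, axis=1, keepdims=True)
    print(np.round(X_reg, 3))
    ```

    Smoothing the data multiplicatively on both sides propagates similarity structure directly into the representation that the vMF co-clustering operates on, instead of trading it off against the clustering objective through an additive penalty.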