872 research outputs found

    Information-theoretic causal inference of lexical flow

    This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. 
This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible to enhance the framework, e.g. with confidence values for each directionality decision.

    Integration and visualisation of clinical-omics datasets for medical knowledge discovery

    In recent decades, the rise of various omics fields has flooded life sciences with unprecedented amounts of high-throughput data, which have transformed the way biomedical research is conducted. This trend will only intensify in the coming decades, as the cost of data acquisition will continue to decrease. Therefore, there is a pressing need to find novel ways to turn this ocean of raw data into waves of information and finally distil those into drops of translational medical knowledge. This is particularly challenging because of the incredible richness of these datasets, the humbling complexity of biological systems and the growing abundance of clinical metadata, which makes the integration of disparate data sources even more difficult. Data integration has proven to be a promising avenue for knowledge discovery in biomedical research. Multi-omics studies allow us to examine a biological problem through different lenses using more than one analytical platform. These studies not only present tremendous opportunities for the deep and systematic understanding of health and disease, but they also pose new statistical and computational challenges. The work presented in this thesis aims to alleviate this problem with a novel pipeline for omics data integration. Modern omics datasets are extremely feature rich and in multi-omics studies this complexity is compounded by a second or even third dataset. However, many of these features might be completely irrelevant to the studied biological problem or redundant in the context of others. Therefore, in this thesis, clinical metadata driven feature selection is proposed as a viable option for narrowing down the focus of analyses in biomedical research. Our visual cortex has been fine-tuned through millions of years to become an outstanding pattern recognition machine. 
To leverage this incredible resource of the human brain, we need to develop advanced visualisation software that enables researchers to explore these vast biological datasets through illuminating charts and interactivity. Accordingly, a substantial portion of this PhD was dedicated to implementing truly novel visualisation methods for multi-omics studies.

    Reverse-engineering biological networks from large data sets

    Much of contemporary systems biology owes its success to the abstraction of a network, the idea that diverse kinds of molecular, cellular, and organismal species and interactions can be modeled as relational nodes and edges in a graph of dependencies. Since the advent of high-throughput data-acquisition technologies in fields such as genomics, metabolomics, and neuroscience, the automated inference and reconstruction of such interaction networks directly from large sets of activation data, commonly known as reverse-engineering, has become a routine procedure. Whereas early attempts at network reverse-engineering focused predominantly on producing maps of system architectures with minimal predictive modeling, reconstructions now play instrumental roles in answering questions about the statistics and dynamics of the underlying systems they represent. Many of these predictions have clinical relevance, suggesting novel paradigms for drug discovery and disease treatment. While other reviews focus predominantly on the details and effectiveness of individual network inference algorithms, here we examine the emerging field as a whole. We first summarize several key application areas in which inferred networks have made successful predictions. We then outline the two major classes of reverse-engineering methodologies, emphasizing that the type of prediction that one aims to make dictates the algorithms one should employ. We conclude by discussing whether recent breakthroughs justify the computational costs of large-scale reverse-engineering sufficiently to admit it as a mainstay in the quantitative analysis of living systems.
    Affiliations: Natale, Joseph J. (University of Emory, United States); Hofmann, David (University of Emory, United States); Hernández Lahme, Damián Gabriel (Consejo Nacional de Investigaciones Científicas y Técnicas, Centro Científico Tecnológico Conicet - Patagonia Norte, Argentina; Comisión Nacional de Energía Atómica, Gerencia del Area de Investigación y Aplicaciones No Nucleares, Gerencia de Física (Centro Atómico Bariloche), Argentina; University of Emory, United States); Nemenman, Ilya (University of Emory, United States)

    Pacific Symposium on Biocomputing 2023

    The Pacific Symposium on Biocomputing (PSB) 2023 is an international, multidisciplinary conference for the presentation and discussion of current research in the theory and application of computational methods in problems of biological significance. Presentations are rigorously peer reviewed and are published in an archival proceedings volume. PSB 2023 will be held on January 3-7, 2023 in Kohala Coast, Hawaii. Tutorials and workshops will be offered prior to the start of the conference. PSB 2023 will bring together top researchers from the US, the Asian Pacific nations, and around the world to exchange research results and address open issues in all aspects of computational biology. It is a forum for the presentation of work in databases, algorithms, interfaces, visualization, modeling, and other computational methods, as applied to biological problems, with emphasis on applications in data-rich areas of molecular biology. The PSB has been designed to be responsive to the need for critical mass in sub-disciplines within biocomputing. For that reason, it is the only meeting whose sessions are defined dynamically each year in response to specific proposals. PSB sessions are organized by leaders of research in biocomputing's 'hot topics.' In this way, the meeting provides an early forum for serious examination of emerging methods and approaches in this rapidly changing field.

    Improving Outcomes in Machine Learning and Data-Driven Learning Systems using Structural Causal Models

    The field of causal inference has experienced rapid growth and development in recent years. Its significance in addressing a diverse array of problems and its relevance across various research and application domains are increasingly being acknowledged. However, the current state-of-the-art approaches to causal inference have not yet gained widespread adoption in mainstream data science practices. This research endeavor begins by seeking to motivate enthusiasm for contemporary approaches to causal investigation utilizing observational data. It explores the existing applications and potential future prospects for employing causal inference methods to enhance desired outcomes in data-driven learning applications across various domains, with a particular focus on their relevance in artificial intelligence (AI). Following this motivation, this dissertation proceeds to offer a broad review of fundamental concepts, theoretical frameworks, methodological advancements, and existing techniques pertaining to causal inference. The research advances by investigating the problem of data-driven root cause analysis through the lens of causal structure modeling. Data-driven approaches to root cause analysis (RCA) have received attention recently due to their ability to exploit increasing data availability for more effective root cause identification in complex processes. Advancements in the field of causal inference enable unbiased causal investigations using observational data. This study proposes a data-driven RCA method and a time-to-event (TTE) data simulation procedure built on the structural causal model (SCM) framework. A novel causality-based method is introduced for learning a representation of root cause mechanisms, termed in this work as root cause graphs (RCGs), from observational TTE data. Three case scenarios are used to generate TTE datasets for evaluating the proposed method. 
The utility of the proposed RCG recovery method is demonstrated by using recovered RCGs to guide the estimation of root cause treatment effects. In the presence of mediation, RCG-guided models produce superior estimates of root cause total effects compared to models that adjust for all covariates. The author delves into the subject of integrating causal inference and machine learning. Incorporating causal inference into machine learning offers many benefits including enhancing model interpretability and robustness to changes in data distributions. This work considers the task of feature selection for prediction model development in the context of potentially changing environments. First, a filter feature selection approach that improves on the select k-best method and prioritizes causal features is introduced and compared to the standard select k-best algorithm. Second, a causal feature selection algorithm which adapts to covariate shifts in the target domain is proposed for domain adaptation. Causal approaches to feature selection are demonstrated to be capable of yielding optimal prediction performance when modeling assumptions are met. Additionally, they can mitigate the degrading effects of some forms of dataset shifts on prediction performance.
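
For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of a plain select k-best filter. It is not the dissertation's causal variant. The scoring function (absolute Pearson correlation with the target), the function names `pearson` and `select_k_best`, and the toy data are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_k_best(rows, target, k):
    """Return the sorted indices of the k features with the highest score.

    Assumption: the score is |Pearson correlation| between the feature
    column and the target; other univariate scores could be swapped in.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append(abs(pearson(column, target)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy data: feature 0 tracks the target, feature 1 is weakly
# related, feature 2 is constant and therefore uninformative.
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 3.0, 0.3],
    [3.0, 6.0, 0.3],
    [4.0, 2.0, 0.3],
]
target = [2.0, 4.0, 6.0, 8.0]

selected = select_k_best(rows, target, k=2)
```

A filter of this kind ranks features purely by association with the target, which is why the abstract contrasts it with an approach that prioritizes causal features: strongly associated features are not necessarily stable under dataset shift.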