
    Uncertainty Quantification for Numerical Models with Two Regions of Solution

    Complex numerical models and simulators are essential for representing real-life physical systems, allowing us to make predictions and better understand the systems themselves. For certain models, the outputs behave very differently for some input parameters than for others, so we end up with distinct bounded regions in the input space. The aim of this thesis is to develop methods of uncertainty quantification for such models. Emulators act as `black box' functions that statistically represent the relationships between complex simulator inputs and outputs. It is important not to assume continuity across the output space, as there may be discontinuities between the distinct regions; therefore a single Gaussian process (GP) emulator cannot be used for the entire model. Further, model outputs can take any form and can be either qualitative or quantitative. For example, the computer code for a complex model may fail to run for certain input values; the output data would then correspond to the binary outcomes `runs' or `fails to run'. Classification methods can be used to split the input space into separate regions according to their associated outputs. Existing classification methods include logistic regression, which models the probability of being classified into one of two regions. However, to make classification predictions we often draw from independent Bernoulli distributions (0 represents one region and 1 the other); the independence of these draws discards any distance relationship between inputs, which can result in many misclassifications. The first section of this thesis presents a new method for classification in which the model outputs are given distinct classifying labels, which are modelled using a latent Gaussian process. The latent variable is estimated using MCMC sampling, a unique likelihood and distinct prior specifications.
The classifier is then verified by calculating a misclassification rate across the input space. By modelling the labels with a latent GP, the major problems associated with logistic regression are avoided. The novel method is applied to a range of examples, including a motivating example which models the hormones associated with the reproductive system in mammals; the two labelled outputs are high and low rates of reproduction. The remainder of this thesis develops a correlated Bernoulli process to solve the independent-drawing problems found when using logistic regression. When simulating chains or fields of 0’s and 1’s, it is hard to control the ‘stickiness’ of like symbols. Presented here is a novel approach to a correlated Bernoulli process that creates chains of 0’s and 1’s in which like symbols cluster together. The structure is borrowed from de Bruijn graphs: directed graphs in which, given a set of symbols, V, and a ‘word’ length, m, the nodes consist of all possible length-m sequences over V. De Bruijn graphs are a generalisation of Markov chains, where the ‘word’ length controls the number of previous states that each individual state depends on; this increases correlation over a wider area. A de Bruijn process is defined, along with its run-length properties and inference. Ways of extending the process to higher dimensions are also presented.
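The clustering idea above can be illustrated with a toy simulation. This is only a sketch, not the thesis's de Bruijn process: here the probability of emitting a 1 depends on the current length-m word (so the chain walks the order-m de Bruijn graph over {0, 1}), with a smoothing parameter `alpha` of my own invention controlling how strongly like symbols attract.

```python
import random

def sticky_chain(n, m, alpha, seed=0):
    """Simulate a binary chain of length n where the next symbol depends
    on the last m symbols (a walk on the order-m de Bruijn graph over
    {0, 1}).  P(next = 1) is the fraction of ones in the current word,
    smoothed toward 1/2 by alpha; small alpha makes runs 'stickier'."""
    rng = random.Random(seed)
    state = tuple(rng.choice((0, 1)) for _ in range(m))  # starting word
    chain = list(state)
    for _ in range(n - m):
        p_one = (1 - alpha) * (sum(state) / m) + alpha * 0.5
        nxt = 1 if rng.random() < p_one else 0
        chain.append(nxt)
        state = state[1:] + (nxt,)  # slide to the adjacent de Bruijn node
    return chain
```

With alpha near 0 the chain almost always repeats its recent majority symbol, producing long runs; alpha = 1 recovers independent fair Bernoulli draws.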

    Analyzing the differences between reads and contigs when performing a taxonomic assignment comparison in metagenomics

    Metagenomics is an inherently complex field in which one of the primary goals is to determine the composition of organisms present in an environmental sample. To this end, diverse tools have been developed that are based on the similarity-search results obtained from comparing a set of sequences against a database. However, issues remain in reaching this goal, such as dealing with genomic variants and detecting repeated sequences that could belong to different species in a mixture with uneven and unknown representation of organisms. Hence the question arises of whether analyzing a sample with reads provides a better understanding of the metagenome than analyzing it with contigs. Assembly yields larger genomic fragments but bears the risk of producing chimeric contigs; reads, on the other hand, are shorter and therefore harder to assess for statistical significance, but there is a larger number of them. Consequently, we have developed a workflow to assess and compare the quality of each of these alternatives. Synthetic read datasets belonging to previously identified organisms are generated in order to validate the results. Afterwards, we assemble these reads into a set of contigs and perform a taxonomic analysis on both datasets. The tools we have developed demonstrate that analyzing with reads provides a more trustworthy representation of the species in a sample than contigs, especially in cases with high genomic variability. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.

    Bidirectional LSTM-CRF for Clinical Concept Extraction

    Automated extraction of concepts from patient clinical records is an essential facilitator of clinical research. For this reason, the 2010 i2b2/VA Natural Language Processing Challenges for Clinical Records introduced a concept extraction task aimed at identifying and classifying concepts into predefined categories (i.e., treatments, tests and problems). State-of-the-art concept extraction approaches rely heavily on handcrafted features and domain-specific resources which are hard to collect and define. For this reason, this paper proposes an alternative, streamlined approach: a recurrent neural network (a bidirectional LSTM with CRF decoding) initialized with general-purpose, off-the-shelf word embeddings. The experimental results achieved on the 2010 i2b2/VA reference corpora using the proposed framework outperform all recent methods and rank close to the best submission from the original 2010 i2b2/VA challenge. Comment: This paper, "Bidirectional LSTM-CRF for Clinical Concept Extraction", was accepted for short-paper presentation at the Clinical Natural Language Processing Workshop at COLING 2016, Osaka, Japan, December 11, 201
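The CRF decoding step this architecture relies on can be sketched as a plain Viterbi pass over per-token tag scores. This is only an illustration of the decoding, not the paper's implementation: in the actual model the bidirectional LSTM would produce the emission scores and the transition scores would be learned.

```python
import numpy as np

def viterbi(emissions, transitions):
    """CRF decoding: return the highest-scoring tag sequence given
    per-token emission scores (T x K) and tag-transition scores (K x K).
    Scores are unnormalized log-potentials, as in a linear-chain CRF."""
    T, K = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        # cand[i, j] = score of being in tag i at t-1, then tag j at t
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):        # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Strong diagonal transition scores make the decoder prefer to stay in the same tag, which is how a CRF layer discourages invalid label sequences that independent per-token classification would allow.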

    TIGRA: A targeted iterative graph routing assembler for breakpoint assembly

    Recent progress in next-generation sequencing has greatly facilitated the study of genomic structural variation. Unlike single nucleotide variants and small indels, many structural variants have not been completely characterized at nucleotide resolution. Deriving the complete sequences underlying such breakpoints is crucial not only for accurate discovery but also for the functional characterization of altered alleles. However, our current ability to determine such breakpoint sequences is limited by the challenges of aligning and assembling short reads. To address this issue, we developed a targeted iterative graph routing assembler, TIGRA, which implements a set of novel data analysis routines to achieve effective breakpoint assembly from next-generation sequencing data. In our assessment using data from the 1000 Genomes Project, TIGRA was able to accurately assemble the majority of deletion and mobile element insertion breakpoints, with a substantially better success rate and accuracy than other algorithms. TIGRA has been applied in the 1000 Genomes Project and other projects and is freely available for academic use.

    On the representation of gliders in Rule 54 by de Bruijn and cycle diagrams

    Rule 54, in Wolfram’s notation, is one of the elementary one-dimensional cellular automata with complex behaviour. The automaton supports gliders, glider guns and other non-trivial long transients. We show how to characterize gliders in Rule 54 by diagram representations, namely de Bruijn and cycle diagrams, offering a way to present each glider in Rule 54 with its particular characteristics. This allows a compact encoding of initial conditions which can be used to implement non-trivial collision-based computing in one-dimensional cellular automata.
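The Wolfram rule numbering mentioned above is concrete enough to sketch: each cell's next state is the bit of the rule number indexed by the three-cell neighbourhood, read as a binary number. A minimal update step with periodic boundaries:

```python
def eca_step(row, rule=54):
    """One synchronous update of an elementary cellular automaton with
    periodic boundaries.  The neighbourhood (left, centre, right) is read
    as a 3-bit index into the Wolfram rule number."""
    n = len(row)
    return [(rule >> (4 * row[(i - 1) % n] + 2 * row[i] + row[(i + 1) % n])) & 1
            for i in range(n)]
```

Iterating `eca_step` from a suitable initial row produces the space-time patterns in which Rule 54's gliders appear; the de Bruijn diagrams in the paper characterize exactly which neighbourhood sequences sustain such patterns.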

    On Binary de Bruijn Sequences from LFSRs with Arbitrary Characteristic Polynomials

    We propose a construction of de Bruijn sequences by the cycle joining method from linear feedback shift registers (LFSRs) with arbitrary characteristic polynomial f(x). We study in detail the cycle structure of the set Ω(f(x)) that contains all sequences produced by a specific LFSR on distinct inputs and provide a fast way to find a state of each cycle. This leads to an efficient algorithm to find all conjugate pairs between any two cycles, yielding the adjacency graph. The approach is practical for generating a large class of de Bruijn sequences up to order n ≈ 20. Many previously proposed constructions of de Bruijn sequences are shown to be special cases of our construction.
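For context, the simplest classical construction of a binary de Bruijn sequence is the greedy "prefer-one" rule. This is not the paper's cycle-joining construction, just an illustration of what a de Bruijn sequence of order n is: a cyclic binary sequence of length 2^n containing every length-n word exactly once.

```python
def de_bruijn_prefer_one(n):
    """Greedy 'prefer-one' construction of a binary de Bruijn sequence of
    order n: start from n zeros and repeatedly append a 1 unless the
    resulting length-n word has been seen, else a 0, else stop."""
    seq = [0] * n
    seen = {tuple(seq)}
    while True:
        for bit in (1, 0):
            word = tuple(seq[-(n - 1):] + [bit]) if n > 1 else (bit,)
            if word not in seen:
                seen.add(word)
                seq.append(bit)
                break
        else:
            break  # neither extension is new: the sequence is complete
    return seq[: 2 ** n]  # cyclic de Bruijn sequence of length 2^n
```

The greedy rule handles only f(x) = x^n in spirit; the cycle-joining method the paper studies works from the cycle structure of an arbitrary LFSR and can reach many more sequences.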

    What is the difference between the breakpoint graph and the de Bruijn graph?

    The breakpoint graph and the de Bruijn graph are two key data structures in the studies of genome rearrangements and genome assembly. However, the classical breakpoint graphs are defined on two genomes (represented as sequences of synteny blocks), while the classical de Bruijn graphs are defined on a single genome (represented as DNA strings). Thus, the connection between these two graph models is not explicit. We generalize the notions of both the breakpoint graph and the de Bruijn graph, and make it transparent that the breakpoint graph and the de Bruijn graph are mathematically equivalent. The explicit description of the connection between these important data structures provides a bridge between two previously separated bioinformatics communities studying genome rearrangements and genome assembly.
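The classical de Bruijn graph of a single DNA string mentioned above can be built in a few lines: every k-mer of the string contributes one edge, from its (k-1)-mer prefix to its (k-1)-mer suffix.

```python
from collections import defaultdict

def de_bruijn_graph(dna, k):
    """Classical de Bruijn graph of one DNA string: nodes are (k-1)-mers,
    and each k-mer adds a directed edge prefix -> suffix (with
    multiplicity, so repeated k-mers yield parallel edges)."""
    edges = defaultdict(list)
    for i in range(len(dna) - k + 1):
        kmer = dna[i:i + k]
        edges[kmer[:-1]].append(kmer[1:])
    return dict(edges)
```

An Eulerian path in this graph spells the original string back out, which is why the structure underlies genome assembly; the paper's generalization is what lets the same object be compared with the breakpoint graph of rearrangement studies.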