33 research outputs found

    Computational annotation of eukaryotic gene structures: algorithms development and software systems

    An important foundation for the advancement of both basic and applied biological science is correct annotation of protein-coding gene repertoires in model organisms. Accurate automated annotation of eukaryotic gene structures remains a challenging, open-ended and critical problem for modern computational biology. The use of extrinsic (homology) information has been shown to be a quite successful strategy for this task, though it is not a perfect solution, for a variety of reasons. More recently, gene prediction methods leveraging information present in syntenic genomic sequences have become favorable, though these, too, have limitations. Identifying genes by inspection of genomic sequence alone thoroughly tests our theoretical understanding of the gene recognition process as it occurs in vivo, and where we encounter failure, excellent opportunities for meaningful research are revealed. Therefore, the continued development of methods not reliant on homology information---the so-called ab initio gene prediction methods---should help to more rapidly achieve a comprehensive understanding of gene content, at least in our model organisms. This thesis explores the development of novel algorithms in an attempt to advance the current state of the art in gene prediction, with particular emphasis on ab initio approaches. The work has been conducted with an eye towards contributing open source, well-documented, and extensible software systems implementing the methods, and towards generating novel biological knowledge, with respect to plant taxa in particular.

    Decision Tree-based Syntactic Language Modeling

    Statistical Language Modeling is an integral part of many natural language processing applications, such as Automatic Speech Recognition (ASR) and Machine Translation. N-gram language models dominate the field, despite having an extremely shallow view of language---a Markov chain of words. In this thesis, we develop and evaluate a joint language model that incorporates syntactic and lexical information in an effort to "put language back into language modeling." Our main goal is to demonstrate that such a model is not only effective but can be made scalable and tractable. We utilize decision trees to tackle the problem of sparse parameter estimation, which is exacerbated by the use of syntactic information jointly with word context. While decision trees have been previously applied to language modeling, there has been little analysis of factors affecting decision tree induction and probability estimation for language modeling. In this thesis, we analyze several aspects that affect decision tree-based language modeling, with an emphasis on syntactic language modeling. We then propose improvements to the decision tree induction algorithm based on our analysis, as well as methods for constructing forest models---models consisting of multiple decision trees. Finally, we evaluate the impact of our syntactic language model on large-scale Speech Recognition and Machine Translation tasks. We also address a number of engineering problems associated with the joint syntactic language model in order to make it tractable. In particular, we propose a novel decoding algorithm that exploits the decision tree structure to eliminate unnecessary computation. We also propose and evaluate an approximation of our syntactic model by word n-grams---an approximation that makes it possible to incorporate our model directly into the CDEC Machine Translation decoder rather than using the model for rescoring hypotheses produced using an n-gram model.

    META-LEARNING NEURAL MACHINE TRANSLATION CURRICULA

    Curriculum learning hypothesizes that presenting training samples in a meaningful order to machine learners during training helps improve model quality and convergence rate. In this dissertation, we explore this framework for learning in the context of Neural Machine Translation (NMT). NMT systems are typically trained on a large amount of heterogeneous data and have the potential to benefit greatly from curriculum learning in terms of both speed and quality. We concern ourselves with three primary questions in our investigation: (i) how do we design a task- and/or dataset-specific curriculum for NMT training? (ii) can we leverage human intuition about learning in this design or can we learn the curriculum itself? (iii) how do we featurize training samples (e.g., easy versus hard) so that they can be effectively slotted into a curriculum? We begin by empirically exploring various hand-designed curricula and their effect on translation performance and speed of training NMT systems. We show that these curricula, most of which are based on human intuition, can improve NMT training speed but are highly sensitive to hyperparameter settings. Next, instead of using a hand-designed curriculum, we meta-learn a curriculum for the task of learning from noisy translation samples using reinforcement learning. We demonstrate that this learned curriculum significantly outperforms a random-curriculum baseline and matches the strongest hand-designed curriculum. We then extend this approach to the task of multi-lingual NMT with an emphasis on accumulating knowledge and learning from multiple training runs. Again, we show that this technique can match the strongest baseline obtained via expensive fine-grained grid search for the (learned) hyperparameters.
We conclude with an extension that requires no prior knowledge of sample relevance to the task and uses sample features instead, thereby jointly learning the relevance of each training sample to the task and the appropriate curriculum. We show that this technique outperforms the state-of-the-art results on a noisy filtering task.

    Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision

    Contemporary approaches to natural language processing are predominantly based on statistical machine learning from large amounts of text, which has been manually annotated with the linguistic structure of interest. However, such complete supervision is currently only available for the world's major languages, in a limited number of domains and for a limited range of tasks. As an alternative, this dissertation considers methods for linguistic structure prediction that can make use of incomplete and cross-lingual supervision, with the prospect of making linguistic processing tools more widely available at a lower cost. An overarching theme of this work is the use of structured discriminative latent variable models for learning with indirect and ambiguous supervision; as instantiated, these models admit rich model features while retaining efficient learning and inference properties. The first contribution to this end is a latent-variable model for fine-grained sentiment analysis with coarse-grained indirect supervision. The second is a model for cross-lingual word-cluster induction and the application thereof to cross-lingual model transfer. The third is a method for adapting multi-source discriminative cross-lingual transfer models to target languages, by means of typologically informed selective parameter sharing. The fourth is an ambiguity-aware self- and ensemble-training algorithm, which is applied to target language adaptation and relexicalization of delexicalized cross-lingual transfer parsers. The fifth is a set of sequence-labeling models that combine constraints at the level of tokens and types, and an instantiation of these models for part-of-speech tagging with incomplete cross-lingual and crowdsourced supervision. In addition to these contributions, comprehensive overviews are provided of structured prediction with no or incomplete supervision, as well as of learning in the multilingual and cross-lingual settings. 
Through careful empirical evaluation, it is established that the proposed methods can be used to create substantially more accurate tools for linguistic processing, compared both to unsupervised methods and to recently proposed cross-lingual methods. The empirical support for this claim is particularly strong in the latter case; our models for syntactic dependency parsing and part-of-speech tagging achieve the hitherto best published results for a large number of target languages, in the setting where no annotated training data is available in the target language.

    Stream-based statistical machine translation

    We investigate a new approach for SMT system training within the streaming model of computation. We develop and test incrementally retrainable models which, given an incoming stream of new data, can efficiently incorporate the stream data online. A naive approach using a stream would use an unbounded amount of space. Instead, our online SMT system can incorporate information from unbounded incoming streams and maintain constant space and time. Crucially, we are able to match (or even exceed) translation performance of comparable systems which are batch retrained and use unbounded space. Our approach is particularly suited for situations when there are arbitrarily large amounts of new training material and we wish to incorporate it efficiently and in small space. The novel contributions of this thesis are: 1. An online, randomised language model that can model unbounded input streams in constant space and time. 2. An incrementally retrainable translation model for both phrase-based and grammar-based systems. The model presented is efficient enough to incorporate novel parallel text at the single sentence level. 3. Strategies for updating our stream-based language model and translation model which demonstrate how such components can be successfully used in a streaming translation setting. This operates both within a single streaming environment and also in the novel situation of having to translate multiple streams. 4. Demonstration that recent data from the stream is beneficial to translation performance. Our stream-based SMT system is efficient for tackling massive volumes of new training data and offers up new ways of thinking about translating web data and dealing with other natural language streams.

    Highly Parallel Geometric Characterization and Visualization of Volumetric Data Sets

    Volumetric 3D data sets are being generated in many different application areas. Some examples are CAT scans and MRI data, 3D models of protein molecules represented by implicit surfaces, multi-dimensional numeric simulations of plasma turbulence, and stacks of confocal microscopy images of cells. The size of these data sets has been increasing, requiring the speed of analysis and visualization techniques to also increase to keep up. Recent advances in processor technology have stopped increasing clock speed and instead begun increasing parallelism, resulting in multi-core CPUs and many-core GPUs. To take advantage of these new parallel architectures, algorithms must be explicitly written to exploit parallelism. In this thesis we describe several algorithms and techniques for volumetric data set analysis and visualization that are amenable to these modern parallel architectures. We first discuss modeling volumetric data with Gaussian Radial Basis Functions (RBFs). RBF representation of a data set has several advantages, including lossy compression, analytic differentiability, and analytic application of Gaussian blur. We also describe a parallel volume rendering algorithm that can create images of the data directly from the RBF representation. Next we discuss a parallel, stochastic algorithm for measuring the surface area of volumetric representations of molecules. The algorithm is suitable for implementation on a GPU and is also progressive, allowing it to return a rough answer almost immediately and refine the answer over time to the desired level of accuracy. After this we discuss the concept of Confluent Visualization, which allows the visualization of the interaction between a pair of volumetric data sets. The interaction is visualized through volume rendering, which is well suited to implementation on parallel architectures.
Finally we discuss a parallel, stochastic algorithm for classifying stem cells as having been grown on a surface that induces differentiation or on a surface that does not induce differentiation. The algorithm takes as input 3D volumetric models of the cells generated from confocal microscopy. It builds on our algorithm for surface area measurement and, like that algorithm, is suitable for implementation on a GPU and is progressive.

    Energy-precision tradeoffs in the graphics pipeline

    The energy consumption of a graphics processing unit (GPU) is an important factor in its design, whether for a server, desktop, or mobile device. Mobile products, such as smart phones, tablets, and laptop computers, rely on batteries to function; the lower the power demand on these batteries, the longer they will last before needing to be recharged. GPUs used in servers and desktops, while not dependent on a battery for operation, are still limited by the efficiency of power supplies and heat dissipation techniques. In this dissertation, I propose to lower the energy consumption of GPUs by reducing the precision of floating-point arithmetic in the graphics pipeline and the data sent and stored on- and off-chip. The key idea behind this work is twofold: energy can be saved through a systematic and targeted reduction in the number of bits 1) computed and 2) communicated. Reducing the number of bits computed will necessarily reduce either the precision or range of a floating-point number. I focus on saving energy by way of reducing precision, which can exploit the over-provisioning of bits in many stages of the graphics pipeline. Reducing the number of bits communicated takes several forms. First, I propose enhancements to existing compression schemes for off-chip buffers to save bandwidth. I also suggest a simple extension that exploits unused bits in reduced-precision data undergoing compression. Finally, I present techniques for saving energy in on-chip communication of reduced-precision data. By designing and simulating variable-precision arithmetic circuits with promising energy versus precision characteristics and tradeoffs, I have developed an energy model for GPUs. Using this model and my techniques, I have shown that significant savings (up to 70% in computation in the vertex and pixel shader stages) are possible by reducing the precision of the arithmetic.
Further, my compression approaches have enabled improvements of 1.26x over past work, and a general-purpose compressor design has achieved bandwidth savings of 34%, 87%, and 65% for color, depth, and geometry data, respectively, which is competitive with past work. Lastly, an initial exploration in signal gating unused lines in on-chip buses has suggested savings of 13-48% for the tested applications' traffic from a multiprocessor's register file to its L1 cache.
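
To make the idea of trading precision for fewer computed bits concrete, here is a minimal Python sketch. It is not the dissertation's method (which designs and simulates variable-precision arithmetic circuits); it only emulates, in software, what happens to a single-precision value when low-order mantissa bits are dropped. The function name `truncate_mantissa` and the choice of truncation (rather than rounding) are assumptions made for illustration.

```python
import struct


def truncate_mantissa(x: float, kept_bits: int) -> float:
    """Emulate reduced precision by zeroing low-order float32 mantissa bits.

    Hypothetical helper for illustration: keeps the sign, the 8 exponent
    bits, and the top `kept_bits` of the 23 stored mantissa bits.
    """
    if not 0 <= kept_bits <= 23:
        raise ValueError("kept_bits must be between 0 and 23")
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    dropped = 23 - kept_bits
    # Mask that clears the `dropped` least significant mantissa bits.
    mask = (0xFFFFFFFF << dropped) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]


if __name__ == "__main__":
    value = 3.14159
    for kept in (23, 16, 10, 4):
        approx = truncate_mantissa(value, kept)
        print(f"{kept:2d} mantissa bits: {approx!r} (error {abs(approx - value):.2e})")
```

Because the exponent field is left untouched, the representable range is unchanged and only precision is lost, which matches the precision-versus-range distinction drawn in the abstract.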

    PaLM: Scaling Language Modeling with Pathways

    Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.

    Exploiting Latent Features of Text and Graphs

    As the size and scope of online data continue to grow, new machine learning techniques become necessary to best capitalize on the wealth of available information. However, the models that help convert data into knowledge require nontrivial processes to make sense of large collections of text and massive online graphs. In both scenarios, modern machine learning pipelines produce embeddings --- semantically rich vectors of latent features --- to convert human constructs for machine understanding. In this dissertation we focus on information available within biomedical science, including human-written abstracts of scientific papers, as well as machine-generated graphs of biomedical entity relationships. We present the Moliere system, and our method for identifying new discoveries through the use of natural language processing and graph mining algorithms. We propose heuristically based ranking criteria to augment Moliere, and leverage this ranking to identify a new gene-treatment target for HIV-associated Neurodegenerative Disorders. We additionally focus on the latent features of graphs, and propose a new bipartite graph embedding technique. Using our graph embedding, we advance the state of the art in hypergraph partitioning quality. Having newfound intuition of graph embeddings, we present Agatha, a deep-learning approach to hypothesis generation. This system learns data-driven ranking criteria derived from the embeddings of our large proposed biomedical semantic graph. To produce human-readable results, we additionally propose CBAG, a technique for conditional biomedical abstract generation.