373 research outputs found

    Prefix Codes for Power Laws with Countable Support

    In prefix coding over an infinite alphabet, methods that consider specific distributions generally consider those that decline more quickly than a power law (e.g., Golomb coding). Particular power-law distributions, however, model many random variables encountered in practice. For such random variables, compression performance is judged via estimates of expected bits per input symbol. This correspondence introduces a family of prefix codes with an eye towards near-optimal coding of known distributions. Compression performance is precisely estimated for well-known probability distributions using these codes and using previously known prefix codes. One application of these near-optimal codes is an improved representation of rational numbers. Comment: 5 pages, 2 tables, submitted to Transactions on Information Theory.
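
    As a rough illustration of what "expected bits per input symbol" means for a power law, the sketch below uses the Elias gamma code, a well-known prefix code for positive integers. It is not the code family introduced in the paper. The exponent `s`, the truncation point `n_max`, and both function names are assumptions made only for this example.

```python
def elias_gamma(n: int) -> str:
    """Return the Elias gamma codeword of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("n must be >= 1")
    binary = bin(n)[2:]                       # binary expansion, leading 1 included
    return "0" * (len(binary) - 1) + binary   # unary length prefix, then the bits


def expected_bits(s: float, n_max: int) -> float:
    """Expected codeword length under p(n) proportional to n**-s, truncated at n_max.

    Truncation is an approximation: the true distribution has countable support.
    """
    weights = [n ** -s for n in range(1, n_max + 1)]
    total = sum(weights)
    weighted_length = sum(
        w * len(elias_gamma(n)) for n, w in enumerate(weights, start=1)
    )
    return weighted_length / total


if __name__ == "__main__":
    for n in (1, 2, 5, 8):
        print(n, elias_gamma(n))
    print(expected_bits(2.0, 1000))
```

    Comparing this figure with the entropy of the same truncated distribution would indicate how far such a code is from optimal for the chosen exponent.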

    Robust Transmission of Unbounded Strings Using Fibonacci Representations

    Universal codes of the natural numbers

    A code of the natural numbers is a uniquely-decodable binary code of the natural numbers with non-decreasing codeword lengths, which satisfies Kraft's inequality tightly. We define a natural partial order on the set of codes, and show how to construct effectively a code better than a given sequence of codes, in a certain precise sense. As an application, we prove that the existence of a scale of codes (a well-ordered set of codes which contains a code better than any given code) is independent of ZFC. Comment: 11 pages.
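
    To make "satisfies Kraft's inequality tightly" concrete, here is a short worked check for two familiar codes. It assumes the natural numbers are indexed from 1 and writes the codeword length of $n$ as $\ell(n)$; both conventions are choices made for this illustration, not taken from the paper.

```latex
% Tight Kraft inequality: the codeword lengths use up the whole budget.
\sum_{n \ge 1} 2^{-\ell(n)} = 1

% Unary code, \ell(n) = n:
\sum_{n \ge 1} 2^{-n} = 1

% Elias gamma code, \ell(n) = 2\lfloor \log_2 n \rfloor + 1.
% Group the 2^k integers with \lfloor \log_2 n \rfloor = k:
\sum_{n \ge 1} 2^{-(2\lfloor \log_2 n \rfloor + 1)}
  = \sum_{k \ge 0} 2^{k} \cdot 2^{-(2k+1)}
  = \sum_{k \ge 0} 2^{-(k+1)}
  = 1
```

    Both codes also have non-decreasing codeword lengths, so they meet the two length conditions in the definition above.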

    Developing an Online Corpus of Formosan Languages

    Information technologies have now matured to the point of enabling researchers to create a repository of language resources, especially for languages facing the crisis of endangerment. The development of an online platform of corpora, made possible by recent advances in data storage, character encoding and web technology, has profound consequences for the accessibility, quantity, quality and interoperability of linguistic field data. This is of particular significance for Formosan languages in Taiwan, many of which are on the verge of extinction. In response to this growing problem, the NTU Corpus of Formosan Languages was established to document, and thus preserve, valuable linguistic data as well as relevant ethnological and cultural information. This paper introduces some of the theoretical bases behind this initiative, as well as the procedures, transcription conventions, database normalization, in-house system and three special features involved in the creation of this corpus.

    Off-line compression by greedy textual substitution

    Investigation of Sequential Machine Design Techniques for Implementation of a TRAC Scanning Algorithm

    This report demonstrates design techniques for translating a given scanning algorithm into a hardwired pre-processor. The language to be pre-processed is TRAC (Text Reckoning and Compiling), devised by Mooers and Deutsch. The major drawback of the current implementation of TRAC is speed: the software overhead required for string manipulation and for execution of the input scanning algorithm is the major degrading factor. A TRAC machine consisting of a hardwired pre-processor that scans the input and produces formatted data for a stack-oriented evaluator is proposed. The control machine for the pre-processor's input scanning algorithm is designed using various sequential machine design techniques. The one-hot code and the minimum state variable design represent the two extremes presented.
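
    The two extremes named above differ in how many state variables (flip-flops) they spend per state. The sketch below compares them for a controller with a hypothetical number of states; the state count and all function names are assumptions for illustration and are not taken from the report.

```python
def flip_flops_needed(num_states: int) -> dict:
    """State-variable count for the two extreme state assignments."""
    if num_states < 1:
        raise ValueError("num_states must be >= 1")
    return {
        "one_hot": num_states,                            # one flip-flop per state
        "minimum": max(1, (num_states - 1).bit_length()), # ceil(log2(num_states))
    }


def one_hot_code(state: int, num_states: int) -> str:
    """One-hot assignment: exactly one bit is set for each state."""
    return format(1 << state, f"0{num_states}b")


def binary_code(state: int, num_states: int) -> str:
    """Minimum state variable assignment: plain binary state number."""
    width = max(1, (num_states - 1).bit_length())
    return format(state, f"0{width}b")


if __name__ == "__main__":
    n = 6  # hypothetical number of controller states
    print(flip_flops_needed(n))
    for s in range(n):
        print(s, one_hot_code(s, n), binary_code(s, n))
```

    One-hot assignments typically trade extra flip-flops for simpler next-state logic, while the minimum assignment does the opposite.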

    Corpus-Based Machine Translation: A Study Case for the e-Government of Costa Rica

    This research paper surveys state-of-the-art technologies in machine translation. Following an overview of the architecture and mechanisms underpinning phrase-based statistical (PB-SMT) and neural (NMT) systems, we focus on a specific use case that tests the translator's ability to make the most of these technologies, particularly the capacity of PB-SMT. The use case urges the translator to draw on the best practices in his or her toolbox to improve the translated output by means of data preparation, training, assessment and refinement of the engines.

    Incorporating Punctuation Into the Sentence Grammar: A Lexicalized Tree Adjoining Grammar Perspective

    Punctuation helps us to structure, and thus to understand, texts. Many uses of punctuation straddle the line between syntax and discourse, because they serve to combine multiple propositions within a single orthographic sentence. They allow us to insert discourse-level relations at the level of a single sentence. Just as people make use of information from punctuation in processing what they read, computers can use information from punctuation in processing texts automatically. Most current natural language processing systems fail to take punctuation into account at all, losing a valuable source of information about the text. Those that do mostly do so in a superficial way, again failing to fully exploit the information conveyed by punctuation. To be able to make use of such information in a computational system, we must first characterize its uses and find a suitable representation for encoding them. The work here focuses on extending a syntactic grammar to handle phenomena occurring within a single sentence which have punctuation as an integral component. Punctuation marks are treated as full-fledged lexical items in a Lexicalized Tree Adjoining Grammar, which is an extremely well-suited formalism for encoding punctuation in the sentence grammar. Each mark anchors its own elementary trees and imposes constraints on the surrounding lexical items. I have analyzed data representing a wide variety of constructions, and added treatments of them to the large English grammar which is part of the XTAG system. The advantages of using LTAG are that its elementary units are structured trees of a suitable size for stating the constraints we are interested in, and the derivation histories it produces contain information the discourse grammar will need about which elementary units have been used and how they have been combined.
    I also consider in detail a few particularly interesting constructions where the sentence and discourse grammars meet: appositives, reported speech and uses of parentheses. My results confirm that punctuation can be used in analyzing sentences to increase the coverage of the grammar, reduce the ambiguity of certain word sequences and facilitate discourse-level processing of the texts.