107 research outputs found

    Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors

    Full text link
    DNA as a data storage medium has several advantages, including far greater data density compared to electronic media. We propose that schemes for data storage in the DNA of living organisms may benefit from studying the reconstruction problem, which is applicable whenever multiple reads of noisy data are available. This strategy is uniquely suited to the medium, which inherently replicates stored data in multiple distinct copies, each accumulating its own mutations. We consider noise introduced solely by uniform tandem-duplication, and utilize the relation to constant-weight integer codes in the Manhattan metric. By bounding the intersection of the cross-polytope with hyperplanes, we prove the existence of reconstruction codes with greater capacity than known error-correcting codes, a capacity we can determine analytically for any set of parameters. Comment: 11 pages, 2 figures, LaTeX; version accepted for publication.
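    The channel model is easy to simulate. Below is a minimal sketch of uniform tandem duplication, in which every error copies a fixed-length-k substring and inserts the copy immediately after the original, together with the multiple-read setting the reconstruction problem assumes. The function names and parameters are illustrative assumptions, not the authors' code.

```python
# A minimal sketch of the uniform tandem-duplication channel: each error
# copies a length-k substring and inserts the copy immediately after the
# original. Illustrative only; not the authors' construction.
import random

def tandem_duplicate(seq: str, k: int) -> str:
    """Apply one uniform tandem-duplication of length k at a random position."""
    i = random.randrange(len(seq) - k + 1)    # start of the duplicated block
    return seq[:i + k] + seq[i:i + k] + seq[i + k:]

def noisy_reads(seq: str, k: int, t: int, reads: int) -> list[str]:
    """Simulate multiple independent reads, each hit by t duplication errors."""
    out = []
    for _ in range(reads):
        r = seq
        for _ in range(t):
            r = tandem_duplicate(r, k)
        out.append(r)
    return out

# Example: three independent reads of a DNA string, two length-3 errors each.
print(noisy_reads("ACGTACGGTC", k=3, t=2, reads=3))
```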

    Testing a Random Number Generator: formal properties and automotive application

    Get PDF
    This thesis analyzes a validation method for random number generators (RNGs), which are used to guarantee the security of modern automotive systems. The first chapter gives an overview of the communication architecture of modern vehicles, built around electronic control units (ECUs): the main access points to a car are listed, together with possible types of hacking; the use of random numbers in cryptography is then described, with particular reference to the cryptography used in vehicles. The second chapter presents the probability background needed to approach the statistical tests used for validation and reviews the main theoretical approaches to the problem of randomness. The two central chapters describe the probabilistic and entropic methods used to analyze the real data in the tests, and then describe and study the 15 statistical tests proposed by the National Institute of Standards and Technology (NIST). After the first tests, which are based on very simple properties of random sequences, more sophisticated tests are presented, based on the Fourier transform (to test for periodic behavior), on entropy (closely connected with the compressibility of the sequence), or on random paths. Two further tests make it possible to assess the proper functioning of the generator itself, and not only of the individual generated sequences. Finally, the fifth chapter is devoted to the implementation of the tests in order to test the TRNG of the ECUs.
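    As a concrete illustration of the battery described in the central chapters, here is a minimal sketch (not the thesis code) of the simplest of the 15 NIST SP 800-22 tests, the frequency (monobit) test, which checks whether the proportion of ones is consistent with a fair coin:

```python
# Sketch of the NIST SP 800-22 frequency (monobit) test: map bits to +/-1,
# normalize the partial sum, and compute a two-sided tail probability.
import math
import random

def monobit_test(bits: list[int]) -> float:
    """Return the p-value of the NIST frequency (monobit) test."""
    n = len(bits)
    s = sum(2 * b - 1 for b in bits)          # map 0/1 to -1/+1 and sum
    s_obs = abs(s) / math.sqrt(n)             # normalized partial sum
    return math.erfc(s_obs / math.sqrt(2))    # two-sided tail probability

# Example: a sequence passes at significance level 0.01 if p >= 0.01.
sample = [random.getrandbits(1) for _ in range(10_000)]
print(monobit_test(sample) >= 0.01)
```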

    Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System

    Full text link
    Neural language models are increasingly deployed into APIs and websites that allow a user to pass in a prompt and receive generated text. Many of these systems do not reveal their generation parameters. In this paper, we present methods to reverse-engineer the decoding method used to generate text (i.e., top-k or nucleus sampling). Our ability to discover which decoding strategy was used has implications for detecting generated text. Additionally, the process of discovering the decoding strategy can reveal biases caused by selecting decoding settings which severely truncate a model's predicted distributions. We perform our attack on several families of open-source language models, as well as on production systems (e.g., ChatGPT). Comment: 6 pages, 4 figures, 3 tables, plus a 5-page appendix. Accepted to INLG 2023.
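    The core probing idea can be sketched as follows: repeatedly sample one-token continuations of the same prompt from the blackbox system and inspect the support of the empirical distribution. Under top-k sampling the number of distinct tokens is capped at k for every prompt, whereas under nucleus (top-p) sampling the support size varies with the prompt. The `sample_next_token` interface below is a hypothetical stand-in, not a real API, and the paper's actual attack is more refined than this sketch.

```python
# Hedged sketch: distinguish top-k from nucleus sampling by comparing the
# empirical next-token support size across prompts. `sample_next_token` is a
# hypothetical blackbox interface assumed for illustration.
from collections import Counter

def probe_support(sample_next_token, prompt: str, n_samples: int = 2000) -> Counter:
    """Empirical next-token distribution for one prompt."""
    return Counter(sample_next_token(prompt) for _ in range(n_samples))

def looks_like_top_k(support_sizes: list[int]) -> bool:
    """If every probed prompt yields the same support size, top-k is likely."""
    return len(set(support_sizes)) == 1

# Usage (with some blackbox sampler `api`):
# sizes = [len(probe_support(api, p)) for p in ["The", "2 + 2 =", "Once upon"]]
# if looks_like_top_k(sizes): print("top-k sampling with k =", sizes[0])
```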

    On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts

    Get PDF
    The article presents a new interpretation for Zipf's law in natural language which relies on two areas of information theory. We reformulate the problem of grammar-based compression and investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: if an $n$-letter long text describes $n^\beta$ independent facts in a random but consistent way, then the text contains at least $n^\beta/\log n$ different words. In the formal statement, two specific postulates are adopted. Firstly, the words are understood as the nonterminal symbols of the shortest grammar-based encoding of the text. Secondly, the texts are assumed to be emitted by a nonergodic source, with the described facts being binary IID variables that are asymptotically predictable in a shift-invariant way. The proof of the formal proposition applies several new tools: a construction of universal grammar-based codes for which the differences of code lengths can be bounded easily, ergodic decomposition theorems for mutual information between the past and future of a stationary process, and a lemma that bounds the differences of a sublinear function. The linguistic relevance of the presented modeling assumptions, theorems, definitions, and examples is discussed in parallel. While searching for concrete processes to which our proposition can be applied, we introduce several instances of strongly nonergodic processes. In particular, we define the subclass of accessible description processes, which formalizes the notion of texts that describe facts in a self-contained way.
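    To make the notion of "words as nonterminals" concrete, here is a minimal sketch of a grammar-based code in the Re-Pair style: repeatedly replace the most frequent adjacent pair of symbols with a fresh nonterminal. The paper works with shortest grammar-based encodings, so this greedy compressor only illustrates the idea and is not the construction used in the proofs.

```python
# Re-Pair-style grammar-based compression (illustrative sketch): each new
# nonterminal plays the role of a "word" in the paper's vocabulary count.
from collections import Counter

def repair(text: str):
    seq = list(text)
    rules = {}                                 # nonterminal -> (left, right)
    next_id = 0
    while len(seq) > 1:
        pair, freq = Counter(zip(seq, seq[1:])).most_common(1)[0]
        if freq < 2:                           # no repeated pair left
            break
        nt = f"N{next_id}"
        next_id += 1
        rules[nt] = pair
        # rewrite the sequence, replacing non-overlapping occurrences
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

start, rules = repair("abracadabra abracadabra")
print(len(rules), "nonterminals:", rules)
```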

    Using Visual Salience in Empirical Game Theory

    Get PDF
    Coordination games often have salient “focal points”. In games where choices are locations in images, we test for the effect of salience, predicted a priori using a neuroscience-based algorithm. Concentration of salience is correlated with the rate of matching when players are trying to match (r = .64). In hider-seeker games, all players choose salient locations more often, creating a “seeker’s advantage” (seekers win 9% of games). The salience-choice relations are explained by a salience-enhanced cognitive hierarchy model. The novel prediction that time pressure increases the seeker’s advantage, by biasing choices toward salience, is confirmed. Other links to salience in economics are suggested.
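    The shape of such a model can be sketched as follows: level-0 players choose locations in proportion to visual salience, and level-k players best-respond to a Poisson mixture of lower levels (seekers try to match hiders, hiders try to avoid seekers). All parameter values and the pure best responses below are illustrative assumptions, not the paper's estimated model.

```python
# Hedged sketch of a salience-enhanced cognitive hierarchy model for a
# hider-seeker game over discrete locations. Illustrative assumptions only.
import math

def one_hot(i: int, n: int) -> list[float]:
    return [1.0 if j == i else 0.0 for j in range(n)]

def poisson_weights(tau: float, k: int) -> list[float]:
    """Normalized Poisson(tau) beliefs over levels 0..k-1."""
    w = [math.exp(-tau) * tau**j / math.factorial(j) for j in range(k)]
    s = sum(w)
    return [x / s for x in w]

def ch_choice_probs(salience: list[float], levels: int = 3, tau: float = 1.5):
    n = len(salience)
    lvl0 = [s / sum(salience) for s in salience]   # salience-driven level 0
    hiders, seekers = [lvl0], [lvl0]
    for k in range(1, levels + 1):
        w = poisson_weights(tau, k)
        # beliefs: mixture over all lower-level opponents
        opp_h = [sum(w[j] * hiders[j][i] for j in range(k)) for i in range(n)]
        opp_s = [sum(w[j] * seekers[j][i] for j in range(k)) for i in range(n)]
        seekers.append(one_hot(max(range(n), key=lambda i: opp_h[i]), n))  # seek the likeliest hiding spot
        hiders.append(one_hot(min(range(n), key=lambda i: opp_s[i]), n))   # hide where seekers look least
    return hiders[-1], seekers[-1]

# Four locations, one visually salient: seekers concentrate on it first.
print(ch_choice_probs([0.6, 0.2, 0.1, 0.1]))
```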

    Tokenization and the Noiseless Channel

    Full text link
    Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum possible entropy of the token distribution. Yet, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency tokens and very short codes to high-frequency tokens. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high- or very low-frequency tokens. In machine translation, we find that across multiple tokenizers, the Rényi entropy with $\alpha = 2.5$ has a very strong correlation with BLEU: 0.78, in comparison to just −0.32 for compressed length. Comment: ACL 2023.
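    The efficiency measure itself is easy to compute from a tokenized corpus. The sketch below estimates the Rényi entropy of the unigram token distribution, $H_\alpha = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$, and normalizes it by the maximum entropy; normalizing by the observed vocabulary size is a simplification assumed here, not necessarily the paper's exact normalization.

```python
# Sketch of a Renyi-efficiency score for a tokenizer's output; alpha = 2.5 is
# the value the paper reports as most predictive of BLEU.
import math
from collections import Counter

def renyi_efficiency(tokens: list[str], alpha: float = 2.5) -> float:
    counts = Counter(tokens)
    n = len(tokens)
    probs = [c / n for c in counts.values()]
    h_alpha = math.log(sum(p**alpha for p in probs)) / (1 - alpha)  # Renyi entropy
    h_max = math.log(len(counts))   # simplification: uniform over observed vocab
    return h_alpha / h_max

print(renyi_efficiency("the cat sat on the mat the end".split()))
```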
    • 

    corecore