107 research outputs found
Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors
DNA as a data storage medium has several advantages, including far greater
data density compared to electronic media. We propose that schemes for data
storage in the DNA of living organisms may benefit from studying the
reconstruction problem, which is applicable whenever multiple reads of noisy
data are available. This strategy is uniquely suited to the medium, which
inherently replicates stored data in multiple distinct ways, caused by
mutations. We consider noise introduced solely by uniform tandem-duplication,
and utilize the relation to constant-weight integer codes in the Manhattan
metric. By bounding the intersection of the cross-polytope with hyperplanes, we
prove the existence of reconstruction codes with greater capacity than known
error-correcting codes, which we can determine analytically for any set of
parameters.
Comment: 11 pages, 2 figures, LaTeX; version accepted for publication
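To make the noise model concrete, here is a minimal sketch of a uniform tandem-duplication error: a fixed-length window of the sequence is copied and inserted immediately after itself. The function name and the toy DNA string are illustrative, not from the paper.

```python
def tandem_duplicate(seq: str, start: int, k: int) -> str:
    """Apply one uniform tandem duplication of fixed length k:
    insert a copy of seq[start:start+k] immediately after it."""
    if start < 0 or start + k > len(seq):
        raise ValueError("duplication window out of range")
    block = seq[start:start + k]
    return seq[:start + k] + block + seq[start + k:]

# One length-3 duplication of "CGT" inside a short DNA string.
print(tandem_duplicate("ACGTAC", 1, 3))  # ACGTCGTAC
```

Repeated applications of this operation at different positions generate the whole error ball the reconstruction codes must disambiguate.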
Testing a Random Number Generator: formal properties and automotive application
This thesis analyzes a validation method for random number generators (RNGs), which are used to guarantee the security of modern automotive systems.
The first chapter gives an overview of the communication architecture of modern vehicles, built around electronic control units (ECUs): it surveys the main access points to a car, together with possible types of hacking, and then describes the use of random numbers in cryptography, with particular reference to the cryptography used in vehicles.
The second chapter covers the probability background needed for the statistical tests used in validation, and reviews the main theoretical approaches to the problem of randomness.
The two central chapters describe the probabilistic and entropic methods used to analyze the real data fed into the tests. The 15 statistical tests proposed by the National Institute of Standards and Technology (NIST) are then described and studied. After the first tests, based on very simple properties of random sequences, more sophisticated tests are introduced, based on the Fourier transform (to detect periodic behavior), on entropy (closely connected to the compressibility of the sequence), or on random paths. Two further tests assess the proper functioning of the generator itself, and not only of the individual generated sequences.
Finally, the fifth chapter is devoted to implementing the tests in order to test the TRNG of the control units
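The simplest of the NIST suite's tests, the frequency (monobit) test, can be sketched in a few lines: it checks whether the balance of ones and zeros in a bit string is consistent with a fair coin, returning a p-value via the complementary error function.

```python
import math

def monobit_pvalue(bits: str) -> float:
    """NIST SP 800-22 frequency (monobit) test: p-value for the
    hypothesis that ones and zeros are equally likely."""
    n = len(bits)
    s = sum(1 if b == "1" else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

# A perfectly balanced sequence is not rejected (p = 1.0);
# a constant sequence is rejected at any reasonable level.
print(monobit_pvalue("0101010101"))  # 1.0
print(monobit_pvalue("1" * 100))
```

The more sophisticated tests mentioned above (spectral, entropy-based, random-walk) follow the same pattern: a test statistic over the sequence, mapped to a p-value under the randomness hypothesis.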
Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System
Neural language models are increasingly deployed into APIs and websites that
allow a user to pass in a prompt and receive generated text. Many of these
systems do not reveal generation parameters. In this paper, we present methods
to reverse-engineer the decoding method used to generate text (i.e., top- or
nucleus sampling). Our ability to discover which decoding strategy was used has
implications for detecting generated text. Additionally, the process of
discovering the decoding strategy can reveal biases caused by selecting
decoding settings which severely truncate a model's predicted distributions. We
perform our attack on several families of open-source language models, as well
as on production systems (e.g., ChatGPT).
Comment: 6 pages, 4 figures, 3 tables; 5-page appendix. Accepted to INLG 202
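The distinction being reverse-engineered can be illustrated with a toy sketch of how the two decoding strategies truncate a next-token distribution: top-k keeps a fixed number of tokens, while nucleus (top-p) keeps the smallest set reaching cumulative probability p. The function names and the example distribution are hypothetical, not from the paper.

```python
def top_k_support(probs, k):
    """Token indices kept by top-k sampling: the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def nucleus_support(probs, p):
    """Token indices kept by nucleus (top-p) sampling: the smallest
    high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = set(), 0.0
    for i in order:
        kept.add(i)
        total += probs[i]
        if total >= p:
            break
    return kept

# The two strategies can keep different supports for the same
# distribution, which is the signal a blackbox attack can probe for.
probs = [0.5, 0.3, 0.1, 0.06, 0.04]
print(top_k_support(probs, 2))       # {0, 1}
print(nucleus_support(probs, 0.85))  # {0, 1, 2}
```

Tokens outside the kept support never appear in samples, so which tokens are observably impossible reveals the truncation rule and its settings.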
On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts
The article presents a new interpretation for Zipf's law in
natural language which relies on two areas of information
theory. We reformulate the problem of grammar-based compression
and investigate properties of strongly nonergodic stationary
processes. The motivation for the joint discussion is to prove a
proposition with a simple informal statement: if an [...]-letter
long text describes [...] independent facts in a random but
consistent way, then the text contains at least [...]
different words.
In the formal statement, two specific postulates are
adopted. Firstly, the words are understood as the nonterminal
symbols of the shortest grammar-based encoding of the
text. Secondly, the texts are assumed to be emitted by a
nonergodic source, with the described facts being binary IID
variables that are asymptotically predictable in a
shift-invariant way.
The proof of the formal proposition applies several new tools.
These are: a construction of universal grammar-based codes for
which the differences of code lengths can be bounded easily,
ergodic decomposition theorems for mutual information between the
past and future of a stationary process, and a lemma that bounds
differences of a sublinear function.
The linguistic relevance of presented modeling assumptions,
theorems, definitions, and examples is discussed in
parallel. While searching for concrete processes to which our
proposition can be applied, we introduce several instances of
strongly nonergodic processes. In particular, we define the
subclass of accessible description processes, which formalizes
the notion of texts that describe facts in a self-contained way.
Using Visual Salience in Empirical Game Theory
Coordination games often have salient "focal points". In games where choices are locations in images, we test for the effect of salience, predicted a priori using a neuroscience-based algorithm. Concentration of salience is correlated with the rate of matching when players are trying to match (r = .64). In hider-seeker games, all players choose salient locations more often, creating a "seeker's advantage" (seekers win 9% of games). Salience-choice relations are explained by a salience-enhanced cognitive hierarchy model. The novel prediction that time pressure will increase the seeker's advantage, by biasing choices toward salience, is confirmed. Other links to salience in economics are suggested
Tokenization and the Noiseless Channel
Subword tokenization is a key part of many NLP pipelines. However, little is
known about why some tokenizer and hyperparameter combinations lead to better
downstream model performance than others. We propose that good tokenizers lead
to \emph{efficient} channel usage, where the channel is the means by which some
input is conveyed to the model and efficiency can be quantified in
information-theoretic terms as the ratio of the Shannon entropy to the maximum
possible entropy of the token distribution. Yet, an optimal encoding according
to Shannon entropy assigns extremely long codes to low-frequency tokens and
very short codes to high-frequency tokens. Defining efficiency in terms of
R\'enyi entropy, on the other hand, penalizes distributions with either very
high- or very low-frequency tokens. In machine translation, we find that
across multiple tokenizers, the R\'enyi entropy has a much stronger
correlation with \textsc{Bleu} than compressed length does.
Comment: ACL 202
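The efficiency measure described above can be sketched directly: the Rényi entropy of the token frequency distribution, normalized by the maximum possible entropy log |V|. The function name, the order alpha = 2.5, and the toy counts are illustrative assumptions, not taken from the paper.

```python
import math

def renyi_efficiency(counts, alpha=2.5):
    """Rényi entropy of order alpha of the token distribution,
    normalized by the maximum possible entropy log|V|.
    alpha -> 1 recovers the Shannon-entropy version."""
    n = sum(counts)
    ps = [c / n for c in counts if c > 0]
    if alpha == 1.0:
        h = -sum(p * math.log(p) for p in ps)
    else:
        h = math.log(sum(p ** alpha for p in ps)) / (1.0 - alpha)
    return h / math.log(len(ps))

# A uniform token distribution is maximally efficient (ratio 1.0);
# a heavily skewed one scores much lower.
print(renyi_efficiency([10, 10, 10, 10]))  # 1.0
print(renyi_efficiency([97, 1, 1, 1]))
```

Unlike the Shannon version, the power `p ** alpha` with alpha > 1 down-weights the long tail of rare tokens, which is why this variant penalizes vocabularies dominated by very rare or very frequent tokens.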
- …