107 research outputs found
Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors
DNA as a data storage medium has several advantages, including far greater
data density compared to electronic media. We propose that schemes for data
storage in the DNA of living organisms may benefit from studying the
reconstruction problem, which is applicable whenever multiple reads of noisy
data are available. This strategy is uniquely suited to the medium, which
inherently replicates stored data in multiple distinct ways, caused by
mutations. We consider noise introduced solely by uniform tandem-duplication,
and utilize the relation to constant-weight integer codes in the Manhattan
metric. By bounding the intersection of the cross-polytope with hyperplanes, we
prove the existence of reconstruction codes with greater capacity than known
error-correcting codes, which we can determine analytically for any set of
parameters.
Comment: 11 pages, 2 figures, LaTeX; version accepted for publication
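To make the noise model concrete, here is a minimal sketch of a uniform tandem-duplication error: a fixed-length window of the sequence is copied and inserted immediately after itself. The function name and the toy DNA string are illustrative, not from the paper.

```python
def tandem_duplicate(seq: str, start: int, k: int) -> str:
    """Apply one uniform tandem duplication of fixed length k:
    insert a copy of seq[start:start+k] immediately after it."""
    if start < 0 or start + k > len(seq):
        raise ValueError("duplication window out of range")
    block = seq[start:start + k]
    return seq[:start + k] + block + seq[start + k:]

# One length-3 duplication of "CGT" inside a short DNA string.
print(tandem_duplicate("ACGTAC", 1, 3))  # ACGTCGTAC
```

Repeated applications of this operation at different positions generate the whole error ball the reconstruction codes must disambiguate.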
Testing a Random Number Generator: formal properties and automotive application
This thesis analyzes a validation method for random number generators (RNGs), which are used to guarantee the security of modern automotive systems.
The first chapter gives an overview of the communication architecture of modern vehicles, built around electronic control units (ECUs): it surveys the main access points to a car, together with possible types of hacking, and then describes the use of random numbers in cryptography, with particular reference to the cryptography used in vehicles.
The second chapter covers the probability background needed for the statistical tests used in validation, and reviews the main theoretical approaches to the problem of randomness.
The two central chapters describe the probabilistic and entropic methods used to analyze the real data fed into the tests. The 15 statistical tests proposed by the National Institute of Standards and Technology (NIST) are then described and studied. After the first tests, based on very simple properties of random sequences, more sophisticated tests are introduced, based on the Fourier transform (to detect periodic behavior), on entropy (closely connected to the compressibility of the sequence), or on random paths. Two further tests assess the proper functioning of the generator itself, and not only of the individual generated sequences.
Finally, the fifth chapter is devoted to implementing the tests in order to test the TRNG of the control units
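The simplest of the NIST suite's tests, the frequency (monobit) test, can be sketched in a few lines: it checks whether the balance of ones and zeros in a bit string is consistent with a fair coin, returning a p-value via the complementary error function.

```python
import math

def monobit_pvalue(bits: str) -> float:
    """NIST SP 800-22 frequency (monobit) test: p-value for the
    hypothesis that ones and zeros are equally likely."""
    n = len(bits)
    s = sum(1 if b == "1" else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

# A perfectly balanced sequence is not rejected (p = 1.0);
# a constant sequence is rejected at any reasonable level.
print(monobit_pvalue("0101010101"))  # 1.0
print(monobit_pvalue("1" * 100))
```

The more sophisticated tests mentioned above (spectral, entropy-based, random-walk) follow the same pattern: a test statistic over the sequence, mapped to a p-value under the randomness hypothesis.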
Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System
Neural language models are increasingly deployed into APIs and websites that
allow a user to pass in a prompt and receive generated text. Many of these
systems do not reveal generation parameters. In this paper, we present methods
to reverse-engineer the decoding method used to generate text (i.e., top- or
nucleus sampling). Our ability to discover which decoding strategy was used has
implications for detecting generated text. Additionally, the process of
discovering the decoding strategy can reveal biases caused by selecting
decoding settings which severely truncate a model's predicted distributions. We
perform our attack on several families of open-source language models, as well
as on production systems (e.g., ChatGPT).
Comment: 6 pages, 4 figures, 3 tables; 5-page appendix. Accepted to INLG 202
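The distinction being reverse-engineered can be illustrated with a toy sketch of how the two decoding strategies truncate a next-token distribution: top-k keeps a fixed number of tokens, while nucleus (top-p) keeps the smallest set reaching cumulative probability p. The function names and the example distribution are hypothetical, not from the paper.

```python
def top_k_support(probs, k):
    """Token indices kept by top-k sampling: the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def nucleus_support(probs, p):
    """Token indices kept by nucleus (top-p) sampling: the smallest
    high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = set(), 0.0
    for i in order:
        kept.add(i)
        total += probs[i]
        if total >= p:
            break
    return kept

# The two strategies can keep different supports for the same
# distribution, which is the signal a blackbox attack can probe for.
probs = [0.5, 0.3, 0.1, 0.06, 0.04]
print(top_k_support(probs, 2))       # {0, 1}
print(nucleus_support(probs, 0.85))  # {0, 1, 2}
```

Tokens outside the kept support never appear in samples, so which tokens are observably impossible reveals the truncation rule and its settings.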
On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts
The article presents a new interpretation for Zipf's law in
natural language which relies on two areas of information
theory. We reformulate the problem of grammar-based compression
and investigate properties of strongly nonergodic stationary
processes. The motivation for the joint discussion is to prove a
proposition with a simple informal statement: if an [...]-letter
long text describes [...] independent facts in a random but
consistent way, then the text contains at least [...]
different words.
In the formal statement, two specific postulates are
adopted. Firstly, the words are understood as the nonterminal
symbols of the shortest grammar-based encoding of the
text. Secondly, the texts are assumed to be emitted by a
nonergodic source, with the described facts being binary IID
variables that are asymptotically predictable in a
shift-invariant way.
The proof of the formal proposition applies several new tools.
These are: a construction of universal grammar-based codes for
which the differences of code lengths can be bounded easily,
ergodic decomposition theorems for mutual information between the
past and future of a stationary process, and a lemma that bounds
differences of a sublinear function.
The linguistic relevance of presented modeling assumptions,
theorems, definitions, and examples is discussed in
parallel. While searching for concrete processes to which our
proposition can be applied, we introduce several instances of
strongly nonergodic processes. In particular, we define the
subclass of accessible description processes, which formalizes
the notion of texts that describe facts in a self-contained way.
Using Visual Salience in Empirical Game Theory
Coordination games often have salient "focal points". In games where choices are locations in images, we test for the effect of salience, predicted a priori using a neuroscience-based algorithm. Concentration of salience is correlated with the rate of matching when players are trying to match (r = .64). In hider-seeker games, all players choose salient locations more often, creating a "seeker's advantage" (seekers win 9% of games). Salience-choice relations are explained by a salience-enhanced cognitive hierarchy model. The novel prediction that time pressure will increase the seeker's advantage, by biasing choices toward salience, is confirmed. Other links to salience in economics are suggested
Tokenization and the Noiseless Channel
Subword tokenization is a key part of many NLP pipelines. However, little is
known about why some tokenizer and hyperparameter combinations lead to better
downstream model performance than others. We propose that good tokenizers lead
to \emph{efficient} channel usage, where the channel is the means by which some
input is conveyed to the model and efficiency can be quantified in
information-theoretic terms as the ratio of the Shannon entropy to the maximum
possible entropy of the token distribution. Yet, an optimal encoding according
to Shannon entropy assigns extremely long codes to low-frequency tokens and
very short codes to high-frequency tokens. Defining efficiency in terms of
R\'enyi entropy, on the other hand, penalizes distributions with either very
high- or very low-frequency tokens. In machine translation, we find that
across multiple tokenizers, the R\'enyi entropy has a much stronger
correlation with \textsc{Bleu} than compressed length does.
Comment: ACL 202
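The efficiency measure described above can be sketched directly: the Rényi entropy of the token frequency distribution, normalized by the maximum possible entropy log |V|. The function name, the order alpha = 2.5, and the toy counts are illustrative assumptions, not taken from the paper.

```python
import math

def renyi_efficiency(counts, alpha=2.5):
    """Rényi entropy of order alpha of the token distribution,
    normalized by the maximum possible entropy log|V|.
    alpha -> 1 recovers the Shannon-entropy version."""
    n = sum(counts)
    ps = [c / n for c in counts if c > 0]
    if alpha == 1.0:
        h = -sum(p * math.log(p) for p in ps)
    else:
        h = math.log(sum(p ** alpha for p in ps)) / (1.0 - alpha)
    return h / math.log(len(ps))

# A uniform token distribution is maximally efficient (ratio 1.0);
# a heavily skewed one scores much lower.
print(renyi_efficiency([10, 10, 10, 10]))  # 1.0
print(renyi_efficiency([97, 1, 1, 1]))
```

Unlike the Shannon version, the power `p ** alpha` with alpha > 1 down-weights the long tail of rare tokens, which is why this variant penalizes vocabularies dominated by very rare or very frequent tokens.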
- …