Information Theory and the Length Distribution of all Discrete Systems
We begin with the extraordinary observation that the length distribution of
80 million proteins in UniProt, the Universal Protein Resource, measured in
amino acids, is qualitatively identical to the length distribution of large
collections of computer functions measured in programming language tokens, at
all scales. That two such disparate discrete systems share important structural
properties suggests that yet other apparently unrelated discrete systems might
share the same properties, and certainly invites an explanation.
We demonstrate that this is inevitable for all discrete systems of components
built from tokens or symbols. Departing from existing work by embedding the
Conservation of Hartley-Shannon information (CoHSI) in a classical statistical
mechanics framework, we identify two kinds of discrete system, heterogeneous
and homogeneous. Heterogeneous systems contain components built from a unique
alphabet of tokens and yield an implicit CoHSI distribution with a sharp
unimodal peak asymptoting to a power-law. Homogeneous systems contain
components each built from just one kind of token unique to that component and
yield a CoHSI distribution corresponding to Zipf's law.
This theory is applied to heterogeneous systems (proteome, computer
software, music); homogeneous systems (language texts, abundance of the
elements); and to systems in which both heterogeneous and homogeneous behaviour
co-exist (word frequencies and word length frequencies in language texts). In
each case, the predictions of the theory are tested and supported to high
levels of statistical significance. We also show that in the same heterogeneous
system, different but consistent alphabets must be related by a power-law. We
demonstrate this on a large body of music by excluding and including note
duration in the definition of the unique alphabet of notes.
Comment: 70 pages, 53 figures, inc. 30 pages of Appendices
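To make the measurement concrete, here is a minimal Python sketch of how one might compute the length distribution of a discrete system and estimate its power-law tail. The synthetic lengths data and the percentile cut-off are illustrative assumptions, not the paper's pipeline.

    # Minimal sketch: empirical length distribution of a discrete system,
    # e.g. protein lengths in amino acids or function lengths in tokens.
    # `lengths` is synthetic illustrative data, not from the paper.
    import numpy as np

    lengths = np.random.lognormal(mean=5.0, sigma=0.7, size=100_000).astype(int)
    lengths = lengths[lengths > 0]

    # Frequency of each component length; the CoHSI prediction is a sharp
    # unimodal peak whose tail asymptotes to a power-law, i.e. an
    # approximately straight line at large lengths on log-log axes.
    values, counts = np.unique(lengths, return_counts=True)
    freq = counts / counts.sum()

    # Crude tail-exponent estimate: slope of log-frequency against
    # log-length over the upper decile of the distribution.
    tail = values > np.percentile(lengths, 90)
    slope, _ = np.polyfit(np.log(values[tail]), np.log(freq[tail]), 1)
    print(f"estimated tail exponent: {slope:.2f}")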
Exact scaling in the expansion-modification system
This work is devoted to the study of the scaling, and the consequent
power-law behavior, of the correlation function in a mutation-replication model
known as the expansion-modification system. The latter is a biologically
inspired random substitution model of genome evolution, defined on a binary
alphabet and depending on a parameter interpreted as a mutation
probability. We prove that the time-evolution of this system is such that any
initial measure converges towards a unique stationary one exhibiting decay of
correlations not slower than a power-law. We then prove, for a significant
range of mutation probabilities, that the decay of correlations indeed follows
a power-law with scaling exponent smoothly depending on the mutation
probability. Finally we put forward an argument which allows us to give a
closed expression for the corresponding scaling exponent for all the values of
the mutation probability. Such a scaling exponent turns out to be a piecewise
smooth function of the parameter.
Comment: 22 pages, 2 figures
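For intuition, here is a minimal simulation sketch of the expansion-modification dynamics as commonly defined (each symbol flips with probability p, otherwise it is duplicated); the correlation estimator and the parameter values are illustrative choices, not the paper's proof apparatus.

    # Toy simulation of the expansion-modification system on {0, 1}.
    import numpy as np

    def step(seq, p, rng):
        out = []
        for s in seq:
            if rng.random() < p:
                out.append(1 - s)          # modification: flip the bit
            else:
                out.extend((s, s))         # expansion: duplicate the bit
        return out

    def correlation(seq, k):
        # Normalized autocorrelation at lag k.
        x = np.asarray(seq, dtype=float)
        x = x - x.mean()
        return float(np.mean(x[:-k] * x[k:]) / np.var(x))

    rng = np.random.default_rng(0)
    seq = [0]
    while len(seq) < 1_000_000:            # iterate until the sequence is long
        seq = step(seq, p=0.1, rng=rng)

    # A power-law decay shows as a straight line of log|C(k)| against log k.
    for k in (1, 10, 100, 1000):
        print(k, correlation(seq, k))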
Using Neural Generative Models to Release Synthetic Twitter Corpora with Reduced Stylometric Identifiability of Users
We present a method for generating synthetic versions of Twitter data using
neural generative models. The goal is to protect individuals in the source data
from stylometric re-identification attacks while still releasing data that
carries research value. Specifically, we generate tweet corpora that maintain
user-level word distributions by augmenting the neural language models with
user-specific components. We compare our approach to two standard text data
protection methods: redaction and iterative translation. We evaluate the three
methods on measures of risk and utility. We define risk following the
stylometric models of re-identification, and we define utility based on two
general word distribution measures and two common text analysis research tasks.
We find that the neural models significantly lower risk relative to previous
methods, with little cost to utility. We also demonstrate that the neural models
allow data providers to actively control the risk-utility trade-off through
model tuning parameters. This work presents promising results for a new tool
that addresses the problem of privacy in free text and allows social media data
to be shared in a way that respects privacy and is ethically responsible.
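A minimal PyTorch sketch of the core idea of augmenting a neural language model with a user-specific component, so that generated text preserves user-level word distributions; the sizes and names are illustrative, not the authors' architecture.

    import torch
    import torch.nn as nn

    class UserConditionedLM(nn.Module):
        def __init__(self, vocab_size, n_users, dim=128):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, dim)
            self.usr = nn.Embedding(n_users, dim)   # user-specific component
            self.rnn = nn.GRU(2 * dim, dim, batch_first=True)
            self.out = nn.Linear(dim, vocab_size)

        def forward(self, tokens, user_ids):
            t = self.tok(tokens)                             # (B, T, d)
            u = self.usr(user_ids)[:, None, :].expand_as(t)  # broadcast user vector
            h, _ = self.rnn(torch.cat([t, u], dim=-1))
            return self.out(h)                               # next-token logits

    model = UserConditionedLM(vocab_size=30_000, n_users=5_000)
    logits = model(torch.randint(0, 30_000, (4, 20)), torch.randint(0, 5_000, (4,)))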
A Fast Algorithm Finding the Shortest Reset Words
In this paper we present a new fast algorithm finding minimal reset words for
finite synchronizing automata. The problem is known to be computationally hard,
and our algorithm is exponential. Yet, it is faster than the algorithms used so
far and it works well in practice. The main idea is to use a bidirectional BFS
and radix (Patricia) tries to store and compare the resulting subsets. We give both
theoretical and practical arguments showing that the branching factor is
reduced efficiently. As a practical test we perform an experimental study of
the length of the shortest reset word for random automata with n states and 2
input letters. We follow Skvortsov and Tipikin, who have performed such a study
using a SAT solver for automata with a smaller number of states. With our
algorithm we are able to consider a much larger sample of automata with many
more states. In particular, we obtain a new, more precise estimate of the
expected length of the shortest reset word.
Comment: COCOON 2013. The final publication is available at
http://link.springer.com/chapter/10.1007%2F978-3-642-38768-5_1
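For orientation, here is a plain single-direction version of the underlying subset search; the paper's contribution is to accelerate this with a bidirectional BFS and Patricia tries, so this sketch only conveys the search space, not the algorithm itself.

    # Breadth-first search over subsets of states: the shortest word whose
    # image of the full state set is a singleton is a shortest reset word.
    # The automaton is given as delta[letter][state] -> state.
    from collections import deque

    def shortest_reset_word(n_states, delta):
        start = frozenset(range(n_states))
        seen = {start: ""}
        queue = deque([start])
        while queue:
            subset = queue.popleft()
            if len(subset) == 1:                  # all states synchronized
                return seen[subset]
            for a, d in enumerate(delta):
                image = frozenset(d[q] for q in subset)
                if image not in seen:
                    seen[image] = seen[subset] + str(a)
                    queue.append(image)
        return None                               # automaton not synchronizing

    # Cerny automaton with 4 states: letter 0 is a cyclic shift,
    # letter 1 merges state 3 into state 0.
    delta = [[1, 2, 3, 0], [0, 1, 2, 0]]
    print(shortest_reset_word(4, delta))          # a word of length (4-1)^2 = 9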
Learning Latent Representations for Speech Generation and Transformation
The ability to model a generative process and learn a latent representation
for speech in an unsupervised fashion will be crucial for processing vast
quantities of unlabelled speech data. Recently, deep probabilistic generative
models such as Variational Autoencoders (VAEs) have achieved tremendous success
in modeling natural images. In this paper, we apply a convolutional VAE to
model the generative process of natural speech. We derive latent space
arithmetic operations to disentangle learned latent representations. We
demonstrate the capability of our model to modify the phonetic content or the
speaker identity for speech segments using the derived operations, without the
need for parallel supervisory data.
Comment: Accepted to Interspeech 2017
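A minimal sketch of the latent-space arithmetic the abstract describes, with encode and decode standing in for a trained convolutional VAE; these names are assumptions, not the paper's API.

    import numpy as np

    def attribute_shift(encode, segments_a, segments_b):
        # Mean latent difference between two groups of segments, e.g. two
        # phones or two speakers, captures the attribute direction.
        za = np.mean([encode(s) for s in segments_a], axis=0)
        zb = np.mean([encode(s) for s in segments_b], axis=0)
        return zb - za

    def transform(encode, decode, segment, shift, alpha=1.0):
        # Move the segment's latent code along the attribute direction and
        # decode; alpha scales the strength of the modification.
        return decode(encode(segment) + alpha * shift)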
1-way quantum finite automata: strengths, weaknesses and generalizations
We study 1-way quantum finite automata (QFAs). First, we compare them with
their classical counterparts. We show that, if an automaton is required to give
the correct answer with a large probability (over 0.98), then the power of
1-way QFAs is equal to the power of 1-way reversible automata. However, quantum
automata giving the correct answer with smaller probabilities are more powerful
than reversible automata.
Second, we show that 1-way QFAs can be very space-efficient. Namely, we
construct a 1-way QFA which is exponentially smaller than any equivalent
classical (even randomized) finite automaton. This construction may be useful
for the design of other space-efficient quantum algorithms.
Third, we consider several generalizations of 1-way QFAs. Here, our goal is
to find a model which is more powerful than 1-way QFAs keeping the quantum part
as simple as possible.
Comment: 23 pages LaTeX, 1 figure, to appear at FOCS'98
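For concreteness, a toy numpy simulation of the simplest 1-way QFA variant (measure-once: one unitary per input letter, then a single measurement against the accepting states); the automaton below is an illustrative example, not a construction from the paper.

    import numpy as np

    def qfa_accept_prob(unitaries, accepting, word, start=0):
        dim = next(iter(unitaries.values())).shape[0]
        state = np.zeros(dim, dtype=complex)
        state[start] = 1.0
        for a in word:
            state = unitaries[a] @ state          # unitary evolution per letter
        # Probability mass on the accepting states after the final measurement.
        return float(sum(abs(state[q]) ** 2 for q in accepting))

    theta = np.pi / 4                             # rotation angle per 'a'
    U = {"a": np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])}
    # Accepts 'a'*k with probability cos^2(k*pi/4): certainty when k = 0 mod 4.
    for k in range(5):
        print(k, qfa_accept_prob(U, accepting=[0], word="a" * k))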
Catching Attention with Automatic Pull Quote Selection
Pull quotes are an effective component of a captivating news article. These
spans of text are selected from an article and provided with more salient
presentation, with the aim of attracting readers with intriguing phrases and
making the article more visually interesting. In this paper, we introduce the
novel task of automatic pull quote selection, construct a dataset, and
benchmark the performance of a number of approaches ranging from hand-crafted
features to state-of-the-art sentence embeddings to cross-task models. We show
that pre-trained Sentence-BERT embeddings outperform all other approaches,
although the benefit over n-gram models is marginal. By closely examining the
results of simple models, we also uncover many unexpected properties of pull
quotes that should serve as inspiration for future approaches. We believe the
benefits of exploring this problem further are clear: pull quotes have been
found to increase enjoyment and readability, shape reader perceptions, and
facilitate learning.
Comment: 14 pages (11 + 3 for refs), 3 figures, 6 tables
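A hedged sketch of the strongest benchmarked approach: score an article's sentences with a classifier over pre-trained Sentence-BERT embeddings and surface the top candidates. The training data (sentences labelled as pull quotes or not) and the checkpoint name are assumptions, not necessarily the paper's setup.

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    # A standard sentence-transformers checkpoint; assumed, not the paper's.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def train_scorer(sentences, labels):
        # labels: 1 if the sentence was used as a pull quote, else 0.
        return LogisticRegression(max_iter=1000).fit(encoder.encode(sentences), labels)

    def top_pull_quotes(scorer, article_sentences, k=3):
        probs = scorer.predict_proba(encoder.encode(article_sentences))[:, 1]
        ranked = sorted(zip(probs, article_sentences), reverse=True)
        return [s for _, s in ranked[:k]]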
Testing statistical laws in complex systems
The availability of large datasets requires an improved view on statistical
laws in complex systems, such as Zipf's law of word frequencies, the
Gutenberg-Richter law of earthquake magnitudes, or scale-free degree
distribution in networks. In this paper we discuss how the statistical analysis
of these laws is affected by correlations present in the observations, the
typical scenario for data from complex systems. We first show how standard
maximum-likelihood recipes lead to false rejections of statistical laws in the
presence of correlations. We then propose a conservative method (based on
shuffling and under-sampling the data) to test statistical laws and find that
accounting for correlations leads to smaller rejection rates and larger
confidence intervals on estimated parameters.
Comment: 5-page paper + 7-page supplementary material
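A minimal sketch of the conservative recipe the abstract proposes: repeatedly under-sample the observations, fit the law on each reduced sample, and take the spread as the confidence interval. The continuous power-law MLE and the sampling fraction are simplifications, not the paper's exact procedure.

    import numpy as np

    def powerlaw_mle(x, xmin):
        # Hill estimator for a continuous power-law tail above xmin.
        x = x[x >= xmin]
        return 1.0 + len(x) / np.sum(np.log(x / xmin))

    def undersampled_ci(data, xmin, frac=0.1, n_rep=200, seed=0):
        # Fitting many small sub-samples weakens correlations and yields a
        # wider, more honest interval than a single full-sample fit.
        rng = np.random.default_rng(seed)
        k = max(1, int(frac * len(data)))
        est = [powerlaw_mle(rng.choice(data, size=k, replace=False), xmin)
               for _ in range(n_rep)]
        return np.percentile(est, [2.5, 97.5])

    data = np.random.pareto(1.5, 50_000) + 1.0    # synthetic power-law sample
    print(undersampled_ci(data, xmin=1.0))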
Compositional generalization in a deep seq2seq model by separating syntax and semantics
Standard methods in deep learning for natural language processing fail to
capture the compositional structure of human language that allows for
systematic generalization outside of the training distribution. However, human
learners readily generalize in this way, e.g. by applying known grammatical
rules to novel words. Inspired by work in neuroscience suggesting separate
brain systems for syntactic and semantic processing, we implement a
modification to standard approaches in neural machine translation, imposing an
analogous separation. The novel model, which we call Syntactic Attention,
substantially outperforms standard methods in deep learning on the SCAN
dataset, a compositional generalization task, without any hand-engineered
features or additional supervision. Our work suggests that separating syntactic
from semantic learning may be a useful heuristic for capturing compositional
structure.
Comment: 18 pages, 15 figures, preprint version of submission to NeurIPS 2019, under review
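A minimal PyTorch sketch of the separation idea: the attention weights (alignment, "syntax") are computed from an order-aware stream, while the output content ("semantics") comes from a separate stream that never sees word order. The dimensions and the single-step decoder are simplifications, not the authors' full model.

    import torch
    import torch.nn as nn

    class SyntacticAttention(nn.Module):
        def __init__(self, vocab_in, vocab_out, dim=64):
            super().__init__()
            self.sem = nn.Embedding(vocab_in, dim)       # semantics: word -> meaning
            self.syn_emb = nn.Embedding(vocab_in, dim)   # separate syntactic embedding
            self.syn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
            self.query = nn.Parameter(torch.randn(2 * dim))
            self.out = nn.Linear(dim, vocab_out)

        def forward(self, tokens):
            keys, _ = self.syn(self.syn_emb(tokens))         # (B, T, 2d): order-aware
            attn = torch.softmax(keys @ self.query, dim=-1)  # alignment from syntax only
            context = (attn[..., None] * self.sem(tokens)).sum(1)  # content from semantics
            return self.out(context)                          # one decoder step

    model = SyntacticAttention(vocab_in=20, vocab_out=10)
    print(model(torch.randint(0, 20, (2, 7))).shape)          # torch.Size([2, 10])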
Sequence Training of DNN Acoustic Models With Natural Gradient
Deep Neural Network (DNN) acoustic models often use discriminative sequence
training that optimises an objective function that better approximates the word
error rate (WER) than frame-based training. Sequence training is normally
implemented using Stochastic Gradient Descent (SGD) or Hessian Free (HF)
training. This paper proposes an alternative batch style optimisation framework
that employs a Natural Gradient (NG) approach to traverse through the parameter
space. By correcting the gradient according to the local curvature of the
KL-divergence, the NG optimisation process converges more quickly than HF.
Furthermore, the proposed NG approach can be applied to any sequence
discriminative training criterion. The efficacy of the NG method is shown using
experiments on a Multi-Genre Broadcast (MGB) transcription task that
demonstrates both the computational efficiency and the accuracy of the
resulting DNN models.
Comment: In Proceedings of IEEE ASRU 2017
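A toy numpy sketch of the natural-gradient correction itself: precondition the averaged gradient with a damped inverse Fisher matrix, the local curvature of the KL-divergence, instead of following the raw gradient. The empirical outer-product Fisher approximation and the damping constant are assumptions, not the paper's exact formulation.

    import numpy as np

    def natural_gradient_step(theta, grads, lr=0.1, damping=1e-3):
        # grads: (N, d) per-example gradients evaluated at theta.
        g = grads.mean(axis=0)
        fisher = grads.T @ grads / len(grads)            # empirical Fisher matrix
        fisher += damping * np.eye(len(theta))           # damping for invertibility
        return theta - lr * np.linalg.solve(fisher, g)   # step along F^{-1} g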