
    Information Theory and the Length Distribution of all Discrete Systems

    We begin with the extraordinary observation that the length distribution of 80 million proteins in UniProt, the Universal Protein Resource, measured in amino acids, is qualitatively identical to the length distribution of large collections of computer functions measured in programming language tokens, at all scales. That two such disparate discrete systems share important structural properties suggests that yet other apparently unrelated discrete systems might share the same properties, and certainly invites an explanation. We demonstrate that this is inevitable for all discrete systems of components built from tokens or symbols. Departing from existing work by embedding the Conservation of Hartley-Shannon information (CoHSI) in a classical statistical mechanics framework, we identify two kinds of discrete system, heterogeneous and homogeneous. Heterogeneous systems contain components built from a unique alphabet of tokens and yield an implicit CoHSI distribution with a sharp unimodal peak asymptoting to a power-law. Homogeneous systems contain components each built from just one kind of token unique to that component and yield a CoHSI distribution corresponding to Zipf's law. This theory is applied to heterogeneous systems (proteome, computer software, music); homogeneous systems (language texts, abundance of the elements); and to systems in which both heterogeneous and homogeneous behaviour co-exist (word frequencies and word length frequencies in language texts). In each case, the predictions of the theory are tested and supported to high levels of statistical significance. We also show that in the same heterogeneous system, different but consistent alphabets must be related by a power-law. We demonstrate this on a large body of music by excluding and including note duration in the definition of the unique alphabet of notes.
    Comment: 70 pages, 53 figures, inc. 30 pages of Appendices
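
    A minimal illustration of the kind of comparison described above, under simple assumptions (not the paper's CoHSI machinery): compute the length distribution of a collection of components and log-transform it, so that a power-law tail shows up as an approximately straight line. The sample lengths below are hypothetical stand-ins for real protein or function lengths.

```python
import collections
import math

def length_distribution(component_lengths):
    """Return (length, frequency) pairs sorted by length."""
    return sorted(collections.Counter(component_lengths).items())

def log_log_points(distribution):
    """Log-transform so a power-law tail, f(L) ~ L**-a, appears linear."""
    return [(math.log(length), math.log(freq)) for length, freq in distribution]

if __name__ == "__main__":
    # Hypothetical stand-ins; real inputs would be protein lengths in amino
    # acids or function lengths in programming-language tokens.
    proteins = [120, 250, 250, 300, 300, 300, 450, 800, 1200]
    functions = [12, 25, 25, 30, 30, 30, 45, 80, 120]
    print(log_log_points(length_distribution(proteins)))
    print(log_log_points(length_distribution(functions)))
```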

    Exact scaling in the expansion-modification system

    This work is devoted to the study of the scaling, and the consequent power-law behavior, of the correlation function in a mutation-replication model known as the expansion-modification system. The latter is a biology-inspired random substitution model for genome evolution, which is defined on a binary alphabet and depends on a parameter interpreted as a mutation probability. We prove that the time evolution of this system is such that any initial measure converges towards a unique stationary one exhibiting decay of correlations not slower than a power-law. We then prove, for a significant range of mutation probabilities, that the decay of correlations indeed follows a power-law with a scaling exponent depending smoothly on the mutation probability. Finally, we put forward an argument which allows us to give a closed expression for the corresponding scaling exponent for all values of the mutation probability. This scaling exponent turns out to be a piecewise smooth function of the parameter.
    Comment: 22 pages, 2 figures
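
    A minimal simulation sketch of an expansion-modification dynamic on a binary alphabet, assuming the classic rule (with the mutation probability a symbol is flipped, otherwise it is duplicated); the exact substitution used in the paper may differ.

```python
import random

def step(sequence, p, rng=random):
    """One generation: flip a symbol with probability p, otherwise duplicate it."""
    out = []
    for s in sequence:
        if rng.random() < p:
            out.append(1 - s)      # modification
        else:
            out.extend((s, s))     # expansion
    return out

def correlation(sequence, lag):
    """Empirical two-point correlation of the sequence at a given lag."""
    n = len(sequence) - lag
    mean = sum(sequence) / len(sequence)
    return sum((sequence[i] - mean) * (sequence[i + lag] - mean) for i in range(n)) / n

if __name__ == "__main__":
    seq = [random.randint(0, 1) for _ in range(16)]
    for _ in range(12):            # let the sequence grow for several generations
        seq = step(seq, p=0.1)
    print(len(seq), [round(correlation(seq, k), 4) for k in (1, 2, 4, 8, 16)])
```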

    Using Neural Generative Models to Release Synthetic Twitter Corpora with Reduced Stylometric Identifiability of Users

    We present a method for generating synthetic versions of Twitter data using neural generative models. The goal is to protect individuals in the source data from stylometric re-identification attacks while still releasing data that carries research value. Specifically, we generate tweet corpora that maintain user-level word distributions by augmenting the neural language models with user-specific components. We compare our approach to two standard text data protection methods: redaction and iterative translation. We evaluate the three methods on measures of risk and utility. We define risk following the stylometric models of re-identification, and we define utility based on two general word distribution measures and two common text analysis research tasks. We find that neural models are able to significantly lower risk over previous methods with little cost to utility. We also demonstrate that the neural models allow data providers to actively control the risk-utility trade-off through model tuning parameters. This work presents promising results for a new tool that addresses the problem of privacy in free text and enables sharing social media data in a way that respects privacy and is ethically responsible.
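
    A minimal sketch of the kind of user-specific augmentation described above (an assumed architecture, not necessarily the authors'): a recurrent language model whose input at every step is the word embedding concatenated with a learned user embedding, so that sampled corpora can preserve user-level word distributions.

```python
import torch
import torch.nn as nn

class UserConditionedLM(nn.Module):
    """LSTM language model conditioned on a per-user embedding."""

    def __init__(self, vocab_size, n_users, word_dim=128, user_dim=32, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.user_emb = nn.Embedding(n_users, user_dim)
        self.rnn = nn.LSTM(word_dim + user_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, user_ids):
        # tokens: (batch, seq_len); user_ids: (batch,)
        w = self.word_emb(tokens)
        u = self.user_emb(user_ids).unsqueeze(1).expand(-1, tokens.size(1), -1)
        h, _ = self.rnn(torch.cat([w, u], dim=-1))
        return self.out(h)  # next-token logits

# The released corpus would be sampled token by token from this model per user;
# the risk-utility trade-off is then controlled through its tuning parameters.
```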

    A Fast Algorithm Finding the Shortest Reset Words

    In this paper we present a new fast algorithm finding minimal reset words for finite synchronizing automata. The problem is known to be computationally hard, and our algorithm is exponential. Yet, it is faster than the algorithms used so far and it works well in practice. The main idea is to use a bidirectional BFS and radix (Patricia) tries to store and compare the resulting subsets. We give both theoretical and practical arguments showing that the branching factor is reduced efficiently. As a practical test we perform an experimental study of the length of the shortest reset word for random automata with n states and 2 input letters. We follow Skvortsov and Tipikin, who performed such a study using a SAT solver and considering automata up to n = 100 states. With our algorithm we are able to consider a much larger sample of automata with up to n = 300 states. In particular, we obtain a new, more precise estimation of the expected length of the shortest reset word, approximately 2.5√(n−5).
    Comment: COCOON 2013. The final publication is available at http://link.springer.com/chapter/10.1007%2F978-3-642-38768-5_1
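
    For reference, a plain forward breadth-first search over subsets of states: the shortest reset word is the shortest word mapping the full state set to a singleton. The paper's contribution is to speed this idea up with a bidirectional BFS and radix (Patricia) tries; the sketch below shows only the textbook baseline.

```python
from collections import deque

def shortest_reset_word(transitions):
    """transitions[letter][state] -> next state; returns a shortest reset word or None."""
    n = len(next(iter(transitions.values())))
    start = frozenset(range(n))
    seen = {start: ""}
    queue = deque([start])
    while queue:
        subset = queue.popleft()
        if len(subset) == 1:
            return seen[subset]
        for letter, delta in transitions.items():
            image = frozenset(delta[q] for q in subset)
            if image not in seen:
                seen[image] = seen[subset] + letter
                queue.append(image)
    return None  # the automaton is not synchronizing

# Example: the Cerny automaton on 4 states, whose shortest reset word has length 9.
cerny4 = {"a": [1, 2, 3, 0], "b": [1, 1, 2, 3]}
print(shortest_reset_word(cerny4))
```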

    Learning Latent Representations for Speech Generation and Transformation

    An ability to model a generative process and learn a latent representation for speech in an unsupervised fashion will be crucial to process vast quantities of unlabelled speech data. Recently, deep probabilistic generative models such as Variational Autoencoders (VAEs) have achieved tremendous success in modeling natural images. In this paper, we apply a convolutional VAE to model the generative process of natural speech. We derive latent space arithmetic operations to disentangle learned latent representations. We demonstrate the capability of our model to modify the phonetic content or the speaker identity for speech segments using the derived operations, without the need for parallel supervisory data.
    Comment: Accepted to Interspeech 201
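
    A minimal sketch of latent space arithmetic of the sort described above, assuming a trained encoder/decoder pair is available (the functions named here are hypothetical): an attribute direction is estimated as the difference of mean latent codes of two groups of segments and added to a new segment's code before decoding.

```python
import numpy as np

def attribute_vector(encode, segments_with, segments_without):
    """Mean latent difference between segments with and without the attribute."""
    z_with = np.mean([encode(s) for s in segments_with], axis=0)
    z_without = np.mean([encode(s) for s in segments_without], axis=0)
    return z_with - z_without

def transform(encode, decode, segment, attr_vec, scale=1.0):
    """Shift a segment's latent code along the attribute direction and decode."""
    return decode(encode(segment) + scale * attr_vec)

# Usage (hypothetical encode/decode from a trained convolutional VAE):
#   v = attribute_vector(encode, female_segments, male_segments)
#   converted = transform(encode, decode, segment, v)
```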

    1-way quantum finite automata: strengths, weaknesses and generalizations

    We study 1-way quantum finite automata (QFAs). First, we compare them with their classical counterparts. We show that, if an automaton is required to give the correct answer with a large probability (over 0.98), then the power of 1-way QFAs is equal to the power of 1-way reversible automata. However, quantum automata giving the correct answer with smaller probabilities are more powerful than reversible automata. Second, we show that 1-way QFAs can be very space-efficient. Namely, we construct a 1-way QFA which is exponentially smaller than any equivalent classical (even randomized) finite automaton. This construction may be useful for the design of other space-efficient quantum algorithms. Third, we consider several generalizations of 1-way QFAs. Here, our goal is to find a model which is more powerful than 1-way QFAs while keeping the quantum part as simple as possible.
    Comment: 23 pages LaTeX, 1 figure, to appear at FOCS'9
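
    A toy illustration of how a 1-way QFA consumes its input, using the simpler measure-once variant (the paper studies more general models): one unitary is applied per input symbol, and the acceptance probability is the mass on accepting basis states after the whole word has been read.

```python
import numpy as np

def accept_probability(unitaries, accepting, word, n_states):
    """unitaries: dict symbol -> (n x n) unitary; accepting: set of state indices."""
    psi = np.zeros(n_states, dtype=complex)
    psi[0] = 1.0                           # start in basis state |0>
    for symbol in word:
        psi = unitaries[symbol] @ psi
    return float(sum(abs(psi[q]) ** 2 for q in accepting))

# Example: a 2-state QFA whose single-letter unitary is a rotation by pi/4;
# after two letters the rotation reaches pi/2 and the word is accepted
# with probability (numerically) 1.
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
print(accept_probability({"a": rot}, {1}, "aa", 2))
```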

    Catching Attention with Automatic Pull Quote Selection

    Pull quotes are an effective component of a captivating news article. These spans of text are selected from an article and given a more salient presentation, with the aim of attracting readers with intriguing phrases and making the article more visually interesting. In this paper, we introduce the novel task of automatic pull quote selection, construct a dataset, and benchmark the performance of a number of approaches ranging from hand-crafted features to state-of-the-art sentence embeddings to cross-task models. We show that pre-trained Sentence-BERT embeddings outperform all other approaches; however, the benefit over n-gram models is marginal. By closely examining the results of simple models, we also uncover many unexpected properties of pull quotes that should serve as inspiration for future approaches. We believe the benefits of exploring this problem further are clear: pull quotes have been found to increase enjoyment and readability, shape reader perceptions, and facilitate learning.
    Comment: 14 pages (11 + 3 for refs), 3 figures, 6 tables
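
    A hedged sketch of the strongest baseline family named above: score candidate sentences with a classifier over pre-trained Sentence-BERT embeddings and surface the top-scoring ones as pull-quote candidates. The particular checkpoint and classifier below are assumptions, not the paper's setup.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

def rank_pull_quotes(train_sents, train_labels, article_sents, top_k=3):
    """Rank an article's sentences by predicted probability of being a pull quote."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(model.encode(train_sents), train_labels)        # label 1 = used as pull quote
    scores = clf.predict_proba(model.encode(article_sents))[:, 1]
    ranked = sorted(zip(scores, article_sents), reverse=True)
    return [sent for _, sent in ranked[:top_k]]
```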

    Testing statistical laws in complex systems

    The availability of large datasets requires an improved view on statistical laws in complex systems, such as Zipf's law of word frequencies, the Gutenberg-Richter law of earthquake magnitudes, or scale-free degree distributions in networks. In this paper we discuss how the statistical analysis of these laws is affected by correlations present in the observations, the typical scenario for data from complex systems. We first show how standard maximum-likelihood recipes lead to false rejections of statistical laws in the presence of correlations. We then propose a conservative method (based on shuffling and under-sampling the data) to test statistical laws, and find that accounting for correlations leads to smaller rejection rates and larger confidence intervals on estimated parameters.
    Comment: 5-page paper + 7-page supplementary material
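
    A minimal sketch of the conservative idea (details assumed here, not taken from the paper): rather than fitting a power law to all correlated observations at once, shuffle the data and fit on small under-sampled subsets, so that correlations between nearby observations do not produce overconfident estimates.

```python
import math
import random

def mle_exponent(samples, xmin=1.0):
    """Continuous-approximation MLE for a power-law tail exponent (Hill-type)."""
    tail = [x for x in samples if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

def undersampled_fits(samples, n_subsets=20, subset_size=200, xmin=1.0, rng=random):
    data = list(samples)
    rng.shuffle(data)                     # break ordering-induced correlations
    fits = []
    for _ in range(n_subsets):
        subset = rng.sample(data, min(subset_size, len(data)))
        fits.append(mle_exponent(subset, xmin))
    return min(fits), max(fits)           # a conservative range for the exponent

if __name__ == "__main__":
    # Hypothetical data: draws from a Pareto distribution whose density exponent is 2.5.
    draws = [random.paretovariate(1.5) for _ in range(5000)]
    print(undersampled_fits(draws))
```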

    Compositional generalization in a deep seq2seq model by separating syntax and semantics

    Standard methods in deep learning for natural language processing fail to capture the compositional structure of human language that allows for systematic generalization outside of the training distribution. However, human learners readily generalize in this way, e.g., by applying known grammatical rules to novel words. Inspired by work in neuroscience suggesting separate brain systems for syntactic and semantic processing, we implement a modification to standard approaches in neural machine translation, imposing an analogous separation. The novel model, which we call Syntactic Attention, substantially outperforms standard methods in deep learning on the SCAN dataset, a compositional generalization task, without any hand-engineered features or additional supervision. Our work suggests that separating syntactic from semantic learning may be a useful heuristic for capturing compositional structure.
    Comment: 18 pages, 15 figures, preprint version of submission to NeurIPS 2019, under review
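
    A simplified sketch of the separation described above (a condensation for illustration, not the released model): attention weights are computed from a contextual "syntactic" encoding of the input, while the vectors passed to the output layer are context-free "semantic" word embeddings. The decoder is reduced here to one output per source position to keep the example short.

```python
import torch
import torch.nn as nn

class SyntacticAttentionSketch(nn.Module):
    """Attention driven by a syntax stream, readout from a semantics stream."""

    def __init__(self, vocab_in, vocab_out, sem_dim=64, syn_dim=64):
        super().__init__()
        self.semantic = nn.Embedding(vocab_in, sem_dim)           # context-free
        self.syntactic = nn.LSTM(sem_dim, syn_dim, batch_first=True,
                                 bidirectional=True)
        self.query = nn.Linear(2 * syn_dim, 2 * syn_dim)
        self.out = nn.Linear(sem_dim, vocab_out)

    def forward(self, src):
        sem = self.semantic(src)                                  # (B, T, sem_dim)
        syn, _ = self.syntactic(sem)                              # (B, T, 2*syn_dim)
        # attention weights come only from the syntactic stream
        scores = torch.matmul(self.query(syn), syn.transpose(1, 2))
        attn = torch.softmax(scores, dim=-1)                      # (B, T, T)
        context = torch.matmul(attn, sem)                         # semantics only
        return self.out(context)                                  # (B, T, vocab_out)
```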

    Sequence Training of DNN Acoustic Models With Natural Gradient

    Deep Neural Network (DNN) acoustic models often use discriminative sequence training that optimises an objective function that better approximates the word error rate (WER) than frame-based training. Sequence training is normally implemented using Stochastic Gradient Descent (SGD) or Hessian Free (HF) training. This paper proposes an alternative batch-style optimisation framework that employs a Natural Gradient (NG) approach to traverse the parameter space. By correcting the gradient according to the local curvature of the KL-divergence, the NG optimisation process converges more quickly than HF. Furthermore, the proposed NG approach can be applied to any sequence discriminative training criterion. The efficacy of the NG method is shown using experiments on a Multi-Genre Broadcast (MGB) transcription task that demonstrates both the computational efficiency and the accuracy of the resulting DNN models.
    Comment: In Proceedings of IEEE ASRU 201
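
    A small worked sketch of the natural gradient step referred to above: the ordinary gradient is corrected by the inverse of the Fisher information matrix, which measures the local curvature of the KL-divergence. The toy dense-matrix version below is for illustration only; sequence training of DNNs relies on structured approximations of the Fisher matrix.

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr=0.1, damping=1e-3):
    """theta, grad: (d,) arrays; fisher: (d, d) Fisher information estimate."""
    f = fisher + damping * np.eye(len(theta))    # damping keeps F invertible
    return theta - lr * np.linalg.solve(f, grad)

# Toy usage: a steep, highly curved direction gets a smaller step than plain
# SGD would take, and a flat direction gets a larger one.
theta = np.zeros(2)
grad = np.array([1.0, 1.0])
fisher = np.diag([10.0, 0.1])
print(natural_gradient_step(theta, grad, fisher))
```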