
    Stochastic model for the vocabulary growth in natural languages

    We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core-words, which have higher frequency and do not affect the probability that a new word is used; and (ii) the remaining, virtually infinite number of noncore-words, which have lower frequency and, once used, reduce the probability that a new word is used in the future. Our model relies on a careful analysis of the Google n-gram database of books published over the last centuries, and its main consequence is the generalization of Zipf's and Heaps' laws to two scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend only on the language and not on the database. From the point of view of our model, the main change on historical time scales is the composition of the specific words included in the finite list of core-words, which we observe to decay exponentially in time at a rate of approximately 30 words per year for English.
    Comment: corrected typos and errors in reference list; 10 pages text, 15 pages supplemental material; to appear in Physical Review
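
The vocabulary-growth curve that Heaps' law describes can be measured directly from a token stream. The sketch below is a generic illustration, not the authors' model: it counts distinct words N(M) after M tokens and estimates a power-law exponent by least squares on the log-log curve. The function names and the `start`/`stop` window are assumptions introduced here; fitting one window below and one above a hand-picked crossover is one simple way to look for two scaling regimes.

```python
import math


def vocabulary_growth(tokens):
    """Return N(M) for M = 1..len(tokens): distinct words among the first M tokens."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


def fit_power_law_exponent(growth, start=1, stop=None):
    """Least-squares slope of log N(M) against log M for M in [start, stop].

    The window must contain at least two values of M.
    """
    stop = len(growth) if stop is None else stop
    xs = [math.log(m) for m in range(start, stop + 1)]
    ys = [math.log(growth[m - 1]) for m in range(start, stop + 1)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the log".split()
    curve = vocabulary_growth(text)
    # One exponent for the whole curve; pass start/stop to fit a single regime.
    print(curve, fit_power_law_exponent(curve))
```

An exponent close to 1 means almost every token is new; a smaller exponent means vocabulary grows sublinearly with database size.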

    Ranked Enumeration of MSO Logic on Words

    In recent years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are processed on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order in which the solutions are given, and this order usually depends on the techniques used and not on the relevance for the user. In this paper we study the enumeration of monadic second-order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that allows one to express MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework, which subsumes, for instance, document spanners and adds ranking to them. The main technical result of the paper is an algorithm for efficiently enumerating all the solutions of formulae in increasing order of cost, namely, with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm lies in its use of functional data structures, in particular an extension of functional Brodal queues suited to the ranked enumeration of MSO on words.
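
To make the "preprocess, then enumerate in increasing cost" contract concrete, here is a naive baseline using Python's `heapq`. It is not the paper's algorithm: it materialises every solution first, so its preprocessing is linear in the number of solutions rather than in the size of the data. The query (`spans_a_to_b`) and the cost (span length) are hypothetical and chosen only for illustration.

```python
import heapq


def ranked_enumerate(solutions, cost):
    """Yield (cost, solution) pairs in non-decreasing order of cost."""
    # Preprocessing: build a binary heap in time linear in the number of solutions.
    # The index i breaks cost ties, so solutions themselves are never compared.
    heap = [(cost(s), i, s) for i, s in enumerate(solutions)]
    heapq.heapify(heap)
    # Enumeration: each pop costs logarithmic time in the heap size.
    while heap:
        c, _, s = heapq.heappop(heap)
        yield c, s


def spans_a_to_b(word):
    """Hypothetical query: spans (i, j) where word[i:j] starts with 'a' and ends with 'b'."""
    for i, ch in enumerate(word):
        if ch != "a":
            continue
        for j in range(i + 1, len(word) + 1):
            if word[j - 1] == "b":
                yield (i, j)


if __name__ == "__main__":
    for c, span in ranked_enumerate(spans_a_to_b("abab"), lambda s: s[1] - s[0]):
        print(c, span)
```

Because the generator is lazy, a caller can stop after the first few cheapest solutions without paying for the remaining pops.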

    Long-run patterns in the discovery of the adjacent possible

    The notion of the "adjacent possible" has been advanced to theorize the generation of novelty across many different research domains. This study is an attempt to examine in what way the notion can be made empirically useful for innovation studies. A theoretical framework is constructed based on the notion of innovation as a search process of recombining knowledge to discover the "adjacent possible". The framework makes testable predictions about the rate of innovation, the distribution of innovations across organizations, and the rate of diversification of product portfolios. The empirical section examines how well this framework predicts long-run patterns of new product introductions in Sweden, 1908-2016, and examines the long-run evolution of the product space of Swedish organizations. The results suggest that, remarkably, the rate of innovation depends linearly on cumulative innovations, which explains advantages of incumbent firms but excludes the emergence of "winner takes all" distributions. The results also suggest that the rate of development of new types of products follows "Heaps' law", where the share of new product types within organizations declines over time. The topology of the Swedish product space carries information about future product diversifications, suggesting that the adjacent possible is not altogether "unprestatable".

    Empirical Bayes estimators of structural distribution of words in Lithuanian texts

    The Lithuanian language is highly inflected, has free word order, and has other features which distinguish it from other languages. This raises the problem of testing whether findings established for other languages are valid for Lithuanian. In the paper, an empirical study of a collection of Lithuanian texts is performed. It is supposed that the authors of the texts are the basic elements of the population under study and that its heterogeneity stems from the heterogeneity of the authors' preferences and choices. An attempt is made to estimate structural distributions of words in a collection of texts by different authors, using a simple statistical model and an empirical Bayes approach.

    Markovian dynamics of concurrent systems

    Monoid actions of trace monoids over finite sets are powerful models of concurrent systems; for instance, they encompass the class of 1-safe Petri nets. We characterise Markov measures attached to concurrent systems by finitely many parameters with suitable normalisation conditions. These conditions involve polynomials related to the combinatorics of the monoid and of the monoid action. These parameters generalise to concurrent systems the coefficients of the transition matrix of a Markov chain. A natural problem is the existence of the uniform measure for every concurrent system. We prove this existence under an irreducibility condition. The uniform measure of a concurrent system is characterised by a real number, the characteristic root of the action, and a function of pairs of states, the Parry cocycle. A new combinatorial inversion formula allows one to identify a polynomial of which the characteristic root is the smallest positive root. Examples based on simple combinatorial tilings are studied.
    Comment: 35 pages, 6 figures, 33 references

    "Cultural additivity" and how the values and norms of Confucianism, Buddhism, and Taoism co-exist, interact, and influence Vietnamese society: A Bayesian analysis of long-standing folktales, using R and Stan

    Every year, the Vietnamese people reportedly burned about 50,000 tons of joss papers, which took the form of not only bank notes, but also iPhones, cars, clothes, and even housekeepers, in the hope of pleasing the dead. The practice was mistakenly attributed to traditional Buddhist teachings but in fact originated from China, which most Vietnamese were not aware of. In other aspects of life, there were many similar examples of Vietnamese being ready and comfortable to add new norms, values, and beliefs, even contradictory ones, to their culture. This phenomenon, dubbed "cultural additivity", prompted us to study the co-existence, interaction, and influences among core values and norms of the Three Teachings (Confucianism, Buddhism, and Taoism) as shown through Vietnamese folktales. By applying Bayesian logistic regression, we evaluated the likelihood that the key message of a story was dominated by a religion (dependent variables), as affected by the appearance of values and anti-values pertaining to the Three Teachings in the story (independent variables).
    Comment: 8 figures, 35 pages

    Special features of RAD Sequencing data: implications for genotyping

    Restriction site-associated DNA Sequencing (RAD-Seq) is an economical and efficient method for SNP discovery and genotyping. As with other sequencing-by-synthesis methods, RAD-Seq produces stochastic count data and requires sensitive analysis to develop or genotype markers accurately. We show that there are several sources of bias specific to RAD-Seq that are not explicitly addressed by current genotyping tools, namely restriction fragment bias, restriction site heterozygosity and PCR GC content bias. We explore the performance of existing analysis tools given these biases and discuss approaches to limiting or handling biases in RAD-Seq data. While these biases need to be taken seriously, we believe RAD loci affected by them can be excluded or processed with relative ease in most cases and that most RAD loci will be accurately genotyped by existing tools.
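
Genotyping from stochastic read counts is often framed as comparing likelihoods under a sampling model. The sketch below uses a textbook binomial model for a biallelic diploid site. It is an assumption made here for illustration, not a tool or method named in the abstract, and the function names and default `error_rate` are hypothetical. It assumes both alleles of a heterozygote are sampled with probability 0.5, which is the kind of assumption that allele-specific biases would violate.

```python
import math


def genotype_log_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Log-likelihood of each diploid genotype under a binomial read-sampling model.

    error_rate must lie strictly between 0 and 1.
    """
    n = ref_reads + alt_reads
    log_choose = math.log(math.comb(n, alt_reads))
    # Probability that a single read shows the alternate allele under each genotype.
    p_alt = {"ref/ref": error_rate, "ref/alt": 0.5, "alt/alt": 1.0 - error_rate}
    return {
        genotype: log_choose + alt_reads * math.log(p) + ref_reads * math.log(1.0 - p)
        for genotype, p in p_alt.items()
    }


def call_genotype(ref_reads, alt_reads, error_rate=0.01):
    """Return the maximum-likelihood genotype for the observed read counts."""
    log_likelihoods = genotype_log_likelihoods(ref_reads, alt_reads, error_rate)
    return max(log_likelihoods, key=log_likelihoods.get)


if __name__ == "__main__":
    for counts in [(10, 0), (5, 5), (0, 8)]:
        print(counts, call_genotype(*counts))
```

If one allele of a heterozygote is under-sampled, the observed counts drift toward a homozygous pattern and a model like this one can miscall the site, which is why such loci may need to be filtered or modelled separately.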