Stochastic model for the vocabulary growth in natural languages
We propose a stochastic model for the number of different words in a given
database which incorporates the dependence on the database size and historical
changes. The main feature of our model is the existence of two different
classes of words: (i) a finite number of core-words which have higher frequency
and do not affect the probability of a new word to be used; and (ii) the
remaining virtually infinite number of noncore-words which have lower frequency
and once used reduce the probability of a new word to be used in the future.
Our model relies on a careful analysis of the google-ngram database of books
published in the last centuries and its main consequence is the generalization
of Zipf's and Heaps' law to two scaling regimes. We confirm that these
generalizations yield the best simple description of the data among generic
descriptive models and that the two free parameters depend only on the language
but not on the database. From the point of view of our model the main change on
historical time scales is the composition of the specific words included in the
finite list of core-words, which we observe to decay exponentially in time with
a rate of approximately 30 words per year for English.
Comment: corrected typos and errors in reference list; 10 pages text, 15 pages
supplemental material; to appear in Physical Review
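As a rough illustration of the quantity that Heaps' law describes, the sketch below counts the number of distinct words as a function of the number of words read. It is a minimal sketch, not the stochastic model of the paper: the function name `vocabulary_growth` and the toy sentence are assumptions made for illustration, and no core-word or noncore-word distinction is modelled.

```python
from typing import Iterable, List, Tuple


def vocabulary_growth(tokens: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (words read so far, distinct words so far) after each token."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)  # adding a word already seen leaves the set unchanged
        curve.append((n, len(seen)))
    return curve


# Toy input (hypothetical); a real study would use a large corpus.
text = "the cat sat on the mat and the dog sat on the cat"
curve = vocabulary_growth(text.split())
print(curve[-1])  # (13, 7): 13 words read, 7 of them distinct
```

On a large corpus, plotting the second column against the first on log-log axes is the usual way to inspect the scaling regimes the abstract refers to.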
Ranked Enumeration of MSO Logic on Words
In recent years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are treated on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order in which the solutions are given, and this order usually depends on the techniques used and not on the relevance for the user.
In this paper we study the enumeration of monadic second order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that allows one to express MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework, which subsumes, for instance, document spanners and adds ranking to them. The main technical result of the paper is an algorithm for enumerating all the solutions of formulae in increasing order of cost efficiently, namely, with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm is based on using functional data structures, in particular, by extending functional Brodal queues to suit the ranked enumeration of MSO on words.
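To give a feel for what a functional (persistent) priority queue offers for ranked enumeration, here is a minimal sketch using a skew heap. It is a much simpler structure than the Brodal queues mentioned in the abstract, and it is not the paper's algorithm: it makes no claim about linear preprocessing or logarithmic worst-case delay. All names (`merge`, `insert`, `pop_min`, `enumerate_ranked`) are hypothetical.

```python
from typing import Any, Iterator, Optional, Tuple

# A heap is either None (empty) or an immutable tuple (cost, item, left, right).
Heap = Optional[tuple]


def merge(a: Heap, b: Heap) -> Heap:
    """Merge two skew heaps into a new heap; neither input is modified."""
    if a is None:
        return b
    if b is None:
        return a
    if b[0] < a[0]:
        a, b = b, a  # keep the smaller cost at the root
    cost, item, left, right = a
    # Skew-heap step: merge into the right child, then swap the children.
    return (cost, item, merge(b, right), left)


def insert(heap: Heap, cost: float, item: Any) -> Heap:
    """Return a new heap that also contains (cost, item)."""
    return merge(heap, (cost, item, None, None))


def pop_min(heap: tuple) -> Tuple[Tuple[float, Any], Heap]:
    """Return the cheapest entry and a new heap holding the remaining entries."""
    cost, item, left, right = heap
    return (cost, item), merge(left, right)


def enumerate_ranked(heap: Heap) -> Iterator[Tuple[float, Any]]:
    """Yield entries in increasing order of cost."""
    while heap is not None:
        entry, heap = pop_min(heap)
        yield entry


base = None
for cost, item in [(3, "c"), (1, "a"), (2, "b")]:
    base = insert(base, cost, item)
extended = insert(base, 0, "z")  # `base` remains valid and unchanged

print(list(enumerate_ranked(base)))      # [(1, 'a'), (2, 'b'), (3, 'c')]
print(list(enumerate_ranked(extended)))  # [(0, 'z'), (1, 'a'), (2, 'b'), (3, 'c')]
```

Because every operation returns a new heap and shares structure with the old one, several partial enumerations can extend a common queue without copying it, which is the property that makes functional data structures attractive in this setting.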
Long-run patterns in the discovery of the adjacent possible
The notion of the "adjacent possible" has been advanced to theorize the
generation of novelty across many different research domains. This study is an
attempt to examine in what way the notion can be made empirically useful for
innovation studies. A theoretical framework is constructed based on the notion of
innovation as a search process of recombining knowledge to discover the "adjacent
possible". The framework makes testable predictions about the rate of
innovation, the distribution of innovations across organizations, and the rate
of diversification of product portfolios. The empirical section examines how
well this framework predicts long-run patterns of new product introductions in
Sweden, 1908-2016 and examines the long-run evolution of the product space of
Swedish organizations. The results suggest that, remarkably, the rate of
innovation depends linearly on cumulative innovations, which explains
advantages of incumbent firms, but excludes the emergence of "winner takes all"
distributions. The results also suggest that the rate of development of new
types of products follows "Heaps' law", where the share of new product types
within organizations declines over time. The topology of the Swedish product
space carries information about future product diversifications, suggesting
that the adjacent possible is not altogether "unprestatable".
Empirical Bayes estimators of structural distribution of words in Lithuanian texts
The Lithuanian language is highly inflected, has free word order, and has other features which distinguish it from other languages. This raises the problem of testing whether findings established for other languages are valid for Lithuanian. In the paper, an empirical study of a collection of Lithuanian texts is performed. It is supposed that the authors of the texts are the basic elements of the population under study and that its heterogeneity stems from the heterogeneity of the authors' preferences and choices. An attempt is made to estimate structural distributions of words in a collection of texts by different authors, using a simple statistical model and an empirical Bayes approach.
Markovian dynamics of concurrent systems
Monoid actions of trace monoids over finite sets are powerful models of
concurrent systems---for instance they encompass the class of 1-safe Petri
nets. We characterise Markov measures attached to concurrent systems by
finitely many parameters with suitable normalisation conditions. These
conditions involve polynomials related to the combinatorics of the monoid and
of the monoid action. These parameters generalise to concurrent systems the
coefficients of the transition matrix of a Markov chain.
A natural problem is the existence of the uniform measure for every
concurrent system. We prove this existence under an irreducibility condition.
The uniform measure of a concurrent system is characterised by a real number,
the characteristic root of the action, and a function of pairs of states, the
Parry cocycle. A new combinatorial inversion formula allows one to identify a
polynomial of which the characteristic root is the smallest positive root.
Examples based on simple combinatorial tilings are studied.
Comment: 35 pages, 6 figures, 33 references
"Cultural additivity" and how the values and norms of Confucianism, Buddhism, and Taoism co-exist, interact, and influence Vietnamese society: A Bayesian analysis of long-standing folktales, using R and Stan
Every year, the Vietnamese people reportedly burned about 50,000 tons of joss
papers, which took the form of not only bank notes, but iPhones, cars, clothes,
even housekeepers, in hope of pleasing the dead. The practice was mistakenly
attributed to traditional Buddhist teachings but originated in fact from China,
which most Vietnamese were not aware of. In other aspects of life, there were
many similar examples of Vietnamese so ready and comfortable with adding new
norms, values, and beliefs, even contradictory ones, to their culture. This
phenomenon, dubbed "cultural additivity", prompted us to study the
co-existence, interaction, and influences among core values and norms of the
Three Teachings--Confucianism, Buddhism, and Taoism--as shown through
Vietnamese folktales. By applying Bayesian logistic regression, we evaluated
the possibility of whether the key message of a story was dominated by a
religion (dependent variables), as affected by the appearance of values and
anti-values pertaining to the Three Teachings in the story (independent
variables).
Comment: 8 figures, 35 pages
Proceedings of the Workshop on Algorithmic Aspects of Advanced Programming Languages: WAAAPL'99: Paris, France, September 30, 1999
The first Workshop on Algorithmic Aspects of Advanced Programming Languages was held on September 30, 1999, in Paris, France, in conjunction with the PLI'99 conferences and workshops. The choice of programming language has a huge effect on the algorithms and data structures that are to be implemented in that language. Traditionally, algorithms and data structures have been studied in the context of imperative languages. This workshop considers the algorithmic implications of choosing an advanced functional or logic programming language instead. A total of eight papers were selected for presentation at the workshop, together with an invited lecture by Robert Harper. We would like to thank Didier Rémy, general chair of PLI'99, for his assistance in organizing this workshop.
Special features of RAD Sequencing data: implications for genotyping
Restriction site-associated DNA Sequencing (RAD-Seq) is an economical and efficient method for SNP discovery and genotyping. As with other sequencing-by-synthesis methods, RAD-Seq produces stochastic count data and requires sensitive analysis to develop or genotype markers accurately. We show that there are several sources of bias specific to RAD-Seq that are not explicitly addressed by current genotyping tools, namely restriction fragment bias, restriction site heterozygosity and PCR GC content bias. We explore the performance of existing analysis tools given these biases and discuss approaches to limiting or handling biases in RAD-Seq data. While these biases need to be taken seriously, we believe RAD loci affected by them can be excluded or processed with relative ease in most cases and that most RAD loci will be accurately genotyped by existing tools.