
841 research outputs found

Stochastic model for the vocabulary growth in natural languages

Author: Altmann Eduardo G.
Gerlach Martin
Publication venue: 'American Physical Society (APS)'
Publication date: 04/04/2013

We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core-words which have higher frequency and do not affect the probability of a new word being used; and (ii) the remaining virtually infinite number of noncore-words which have lower frequency and, once used, reduce the probability of a new word being used in the future. Our model relies on a careful analysis of the Google Ngram database of books published in the last centuries, and its main consequence is the generalization of Zipf's and Heaps' laws to two scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend only on the language but not on the database. From the point of view of our model, the main change on historical time scales is the composition of the specific words included in the finite list of core-words, which we observe to decay exponentially in time with a rate of approximately 30 words per year for English.

Comment: corrected typos and errors in reference list; 10 pages text, 15 pages supplemental material; to appear in Physical Review
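For readers unfamiliar with Heaps' law: the quantity this abstract models is the vocabulary growth curve, that is, the number of distinct words N observed after reading the first M tokens of a text. The sketch below only measures that curve on a toy sentence. It is not the authors' stochastic model, and the function name `vocabulary_growth` and the sample text are illustrative assumptions.

```python
from typing import Iterable, List


def vocabulary_growth(tokens: Iterable[str]) -> List[int]:
    """Return N(M): the number of distinct words seen after the first M tokens."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)           # a repeated word leaves the set unchanged
        growth.append(len(seen))  # vocabulary size after this token
    return growth


# Toy input (hypothetical); a real study would use a large corpus.
text = "the cat sat on the mat and the dog sat on the cat"
curve = vocabulary_growth(text.split())
print(curve)
```

The curve flattens as more tokens repeat earlier words, which is the sublinear growth that Heaps' law describes for real corpora.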

arXiv.org e-Print Archive

Directory of Open Access Journals

MPG.PuRe

Ranked Enumeration of MSO Logic on Words

Author: Bourhis Pierre
Grez Alejandro
Jachiet Louis
Riveros Cristian
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 24th International Conference on Database Theory (ICDT 2021)
Publication date: 01/01/2021

In recent years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are treated on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order in which the solutions are given, and this order usually depends on the techniques used and not on the relevance for the user. In this paper we study the enumeration of monadic second-order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that allows expressing MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework, which subsumes, for instance, document spanners and adds ranking to them. The main technical result of the paper is an algorithm for efficiently enumerating all the solutions of formulae in increasing order of cost, namely, with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm lies in using functional data structures, in particular, extending functional Brodal queues to suit the ranked enumeration of MSO on words.

INRIA a CCSD electronic archive server

HAL Descartes

Dagstuhl Research Online Publication Server

Hal-Diderot

Long-run patterns in the discovery of the adjacent possible

Author: Taalbi Josef
Publication date: 18/12/2023

The notion of the "adjacent possible" has been advanced to theorize the generation of novelty across many different research domains. This study is an attempt to examine in what way the notion can be made empirically useful for innovation studies. A theoretical framework is constructed based on the notion of innovation as a search process of recombining knowledge to discover the "adjacent possible". The framework makes testable predictions about the rate of innovation, the distribution of innovations across organizations, and the rate of diversification of product portfolios. The empirical section examines how well this framework predicts long-run patterns of new product introductions in Sweden, 1908-2016, and examines the long-run evolution of the product space of Swedish organizations. The results suggest that, remarkably, the rate of innovation depends linearly on cumulative innovations, which explains advantages of incumbent firms, but excludes the emergence of "winner takes all" distributions. The results also suggest that the rate of development of new types of products follows "Heaps' law", where the share of new product types within organizations declines over time. The topology of the Swedish product space carries information about future product diversifications, suggesting that the adjacent possible is not altogether "unprestatable".

arXiv.org e-Print Archive

Empirical Bayes estimators of structural distribution of words in Lithuanian texts

Author: Piaseckienė Karolina
Radavičius Marijus
Publication venue: 'Vilnius University Press'
Publication date: 30/10/2014

The Lithuanian language is highly inflected, has free word order, and has other features which distinguish it from other languages. This raises the problem of testing whether findings established for other languages are valid for Lithuanian. In the paper, an empirical study of a collection of Lithuanian texts is performed. It is supposed that the authors of the texts are the basic elements of the population under study and that its heterogeneity stems from the heterogeneity of the authors' preferences and choices. An attempt is made to estimate structural distributions of words in a collection of texts by different authors, using a simple statistical model and an empirical Bayes approach.

Nonlinear Analysis: Modelling and Control

Markovian dynamics of concurrent systems

Author: Abbes Samy
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

Monoid actions of trace monoids over finite sets are powerful models of concurrent systems; for instance, they encompass the class of 1-safe Petri nets. We characterise Markov measures attached to concurrent systems by finitely many parameters with suitable normalisation conditions. These conditions involve polynomials related to the combinatorics of the monoid and of the monoid action. These parameters generalise to concurrent systems the coefficients of the transition matrix of a Markov chain. A natural problem is the existence of the uniform measure for every concurrent system. We prove this existence under an irreducibility condition. The uniform measure of a concurrent system is characterised by a real number, the characteristic root of the action, and a function of pairs of states, the Parry cocycle. A new combinatorial inversion formula allows us to identify a polynomial of which the characteristic root is the smallest positive root. Examples based on simple combinatorial tilings are studied.

Comment: 35 pages, 6 figures, 33 references

arXiv.org e-Print Archive

Hal-Diderot

"Cultural additivity" and how the values and norms of Confucianism, Buddhism, and Taoism co-exist, interact, and influence Vietnamese society: A Bayesian analysis of long-standing folktales, using R and Stan

Author: Cuong Nghiem Phu Kien
Ho Manh-Toan
Ho Manh-Tung
Khiem Bui Quang
La Viet-Phuong
Napier Nancy K.
Nguyen Hong-Kong T.
Nguyen Viet-Ha
Pham Hiep-Hung
Van Nhue Dam
Vuong Quan-Hoang
Vuong Thu-Trang
Publication date: 05/03/2018

Every year, the Vietnamese people reportedly burned about 50,000 tons of joss papers, which took the form of not only bank notes, but iPhones, cars, clothes, even housekeepers, in hope of pleasing the dead. The practice was mistakenly attributed to traditional Buddhist teachings but in fact originated from China, which most Vietnamese were not aware of. In other aspects of life, there were many similar examples of Vietnamese being ready and comfortable with adding new norms, values, and beliefs, even contradictory ones, to their culture. This phenomenon, dubbed "cultural additivity", prompted us to study the co-existence, interaction, and influences among core values and norms of the Three Teachings (Confucianism, Buddhism, and Taoism) as shown through Vietnamese folktales. By applying Bayesian logistic regression, we evaluated whether the key message of a story was dominated by a religion (dependent variables), as affected by the appearance of values and anti-values pertaining to the Three Teachings in the story (independent variables).

Comment: 8 figures, 35 pages

arXiv.org e-Print Archive

DI-fusion


Proceedings of the Workshop on Algorithmic Aspects of Advanced Programming Languages: WAAAPL'99: Paris, France, September 30, 1999

Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/1999

The first Workshop on Algorithmic Aspects of Advanced Programming Languages was held on September 30, 1999, in Paris, France, in conjunction with the PLI'99 conferences and workshops. The choice of programming language has a huge effect on the algorithms and data structures that are to be implemented in that language. Traditionally, algorithms and data structures have been studied in the context of imperative languages. This workshop considers the algorithmic implications of choosing an advanced functional or logic programming language instead. A total of eight papers were selected for presentation at the workshop, together with an invited lecture by Robert Harper. We would like to thank Didier Rémy, general chair of PLI'99, for his assistance in organizing this workshop.

Columbia University Academic Commons

Special features of RAD Sequencing data: implications for genotyping

Author: Blaxter Mark L
Cezard Timothée
Davey John W
Eland Cathlene
Fuentes-Utrilla Pablo
Gharbi Karim
Publication venue: 'Wiley'
Publication date: 01/01/2012

Restriction site-associated DNA Sequencing (RAD-Seq) is an economical and efficient method for SNP discovery and genotyping. As with other sequencing-by-synthesis methods, RAD-Seq produces stochastic count data and requires sensitive analysis to develop or genotype markers accurately. We show that there are several sources of bias specific to RAD-Seq that are not explicitly addressed by current genotyping tools, namely restriction fragment bias, restriction site heterozygosity and PCR GC content bias. We explore the performance of existing analysis tools given these biases and discuss approaches to limiting or handling biases in RAD-Seq data. While these biases need to be taken seriously, we believe RAD loci affected by them can be excluded or processed with relative ease in most cases and that most RAD loci will be accurately genotyped by existing tools

CiteSeerX

Crossref

PubMed Central

Edinburgh Research Explorer