27 research outputs found

    Understanding Zipf's law of word frequencies through sample-space collapse in sentence formation

    The formation of sentences is a highly structured and history-dependent process. The probability of using a specific word in a sentence strongly depends on the 'history' of word usage earlier in that sentence. We study a simple history-dependent model of text generation assuming that the sample-space of word usage reduces along sentence formation, on average. We first show that the model explains the approximate Zipf law found in word frequencies as a direct consequence of sample-space reduction. We then empirically quantify the amount of sample-space reduction in the sentences of ten famous English books, by analysis of corresponding word-transition tables that capture which words can follow any given word in a text. We find a highly nested structure in these transition tables and show that this 'nestedness' is tightly related to the power law exponents of the observed word frequency distributions. With the proposed model it is possible to understand that the nestedness of a text can be the origin of the actual scaling exponent, and that deviations from the exact Zipf law can be understood by variations of the degree of nestedness on a book-by-book basis. On a theoretical level we are able to show that in the case of weak nesting, Zipf's law breaks down in a fast transition. Unlike previous attempts to understand Zipf's law in language, the sample-space reducing model is not based on assumptions of multiplicative, preferential, or self-organised critical mechanisms behind language formation, but simply uses the empirically quantifiable parameter 'nestedness' to understand the statistics of word frequencies. Comment: 7 pages, 4 figures. Accepted for publication in the Journal of the Royal Society Interface.

    Heaps' law, statistics of shared components and temporal patterns from a sample-space-reducing process

    Zipf's law is a hallmark of several complex systems with a modular structure, such as books composed of words or genomes composed of genes. In these component systems, Zipf's law describes the empirical power law distribution of component frequencies. Stochastic processes based on a sample-space-reducing (SSR) mechanism, in which the number of accessible states reduces as the system evolves, have been recently proposed as a simple explanation for the ubiquitous emergence of this law. However, many complex component systems are characterized by other statistical patterns beyond Zipf's law, such as a sublinear growth of the component vocabulary with the system size, known as Heaps' law, and specific statistics of shared components. This work shows, with analytical calculations and simulations, that these statistical properties can emerge jointly from an SSR mechanism, thus making it an appropriate parameter-poor representation for component systems. Several alternative (and equally simple) models, for example based on the preferential attachment mechanism, can also reproduce Heaps' and Zipf's laws, suggesting that additional statistical properties should be taken into account to select the most likely generative process for a specific system. Along this line, we will show that the temporal component distribution predicted by the SSR model is markedly different from the one emerging from the popular rich-gets-richer mechanism. A comparison with empirical data from natural language indicates that the SSR process can be chosen as a better candidate model for text generation based on this statistical property. Finally, a limitation of the SSR model in reproducing the empirical "burstiness" of word appearances in texts will be pointed out, thus indicating a possible direction for extensions of the basic SSR process. Comment: 14 pages, 4 figures.
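
The SSR mechanism described in the abstract above can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not the authors' code: it assumes the simplest variant of the process, in which a run starts at the highest state and jumps uniformly at random to any strictly lower state until it reaches state 1. The function name `ssr_visits` and its parameters are hypothetical. In this variant the probability that a run visits state i is 1/i, which is a Zipf-like rank-frequency relation.

```python
import random
from collections import Counter


def ssr_visits(n_states, n_runs, seed=0):
    """Count how often each state is visited over many SSR runs.

    Assumed variant: every run restarts at state n_states and jumps
    uniformly to a strictly lower state until it reaches state 1.
    The starting state itself is not counted as a visit.
    """
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(n_runs):
        state = n_states
        while state > 1:
            # The accessible sample space shrinks to {1, ..., state - 1}.
            state = rng.randint(1, state - 1)
            visits[state] += 1
    return visits


if __name__ == "__main__":
    n_runs = 20000
    visits = ssr_visits(100, n_runs)
    # Compare the empirical visiting frequency with the 1/i prediction.
    for i in (1, 2, 5, 10, 50):
        print(i, visits[i] / n_runs, 1 / i)
```

Restarting every run from the top state is a simplifying choice; variants with noise or other restart rules change the exponent and are not covered by this sketch.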

    Annotation persistence over dynamic documents

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 2005. Includes bibliographical references (p. 212-216). Annotations, as a routine practice of actively engaging with reading materials, are heavily used in the paper world to augment the usefulness of documents. By annotation, we mean a large variety of creative manipulations by which the otherwise passive reader becomes actively involved in a document. Annotations in digital form possess many benefits paper annotations do not enjoy, such as annotation searching, annotation multi-referencing, and annotation sharing. The digital form also introduces challenges to the process of annotation. This study looks at one of them, annotation persistence over dynamic documents. With the development of annotation software, users now have the opportunity to annotate documents which they don't own, or to which they don't have write permission. In annotation software, annotations are normally created and saved independently of the document. The owners of the documents being annotated may have no knowledge of the fact that third parties are annotating their documents' contents. When document contents are modified, annotation software faces a difficult situation where annotations need to be reattached. Reattaching annotations in a revised version of a document is a crucial component in annotation system design. Annotation persistence over document versions is a complicated and challenging problem, as documents can go through various changes between versions. In this thesis, we treat annotation persistence over dynamic documents as a specialized information retrieval problem. We then design a scheme to reposition annotations between versions by three mechanisms: the meta-structure information match, the keywords match, and the content semantics match. Content semantics matching is the determining factor in our annotation persistence scheme design. Latent Semantic Analysis, an innovative information retrieval model, is used to extract and compare document semantics. Two editions of an introductory computer science textbook are used to evaluate the annotation persistence scheme proposed in this study. The evaluation provides substantial evidence that the annotation persistence scheme proposed in this thesis is able to make the right decisions on repositioning annotations based on the degree of modification, i.e. to reattach annotations if modifications are light, and to orphan annotations if modifications are heavy. By Shaomin Wang. Ph.D.

    Speech and neural network dynamics


    Intelligent Systems

    This book is dedicated to intelligent systems of broad-spectrum application, such as personal and social biosafety or the use of intelligent sensory micro-nanosystems such as "e-nose", "e-tongue" and "e-eye". In addition, effective information acquisition, knowledge management and improved knowledge transfer in any media, as well as modeling of information content using meta- and hyper-heuristics and semantic reasoning, all benefit from the systems covered in this book. Intelligent systems can also be applied in education and in generating intelligent distributed eLearning architecture, as well as in a large number of technical fields, such as industrial design, manufacturing and utilization, e.g., in precision agriculture, cartography, electric power distribution systems, intelligent building management systems, drilling operations, etc. Furthermore, decision making using fuzzy logic models, computational recognition of comprehension uncertainty and the joint synthesis of goals and means of intelligent behavior biosystems, as well as diagnostic and human support in the healthcare environment, have also been made easier.

    The evolution of language: Proceedings of the Joint Conference on Language Evolution (JCoLE)


    Learning Functional Prepositions

    In first language acquisition, what does it mean for a grammatical category to have been acquired, and what are the mechanisms by which children learn functional categories in general? In the context of prepositions (Ps), if the lexical/functional divide cuts through the P category, as has been suggested in the theoretical literature, then constructivist accounts of language acquisition would predict that children develop adult-like competence with the more abstract units, functional Ps, at a slower rate compared to their acquisition of lexical Ps. Nativists instead assume that the features of functional P are made available by Universal Grammar (UG), and are mapped as quickly as, if not faster than, the semantic features of their lexical counterparts. Conversely, if Ps are either all lexical or all functional, on both accounts of acquisition we should observe few differences in learning. Three empirical studies of the development of P were conducted via computer analysis of the English and Spanish sub-corpora of the CHILDES database. Study 1 analyzed errors in child usage of Ps, finding almost no errors of commission in either language, but that the English learners lag in their production of functional Ps relative to lexical Ps. That no such delay was found in the Spanish data suggests that the English pattern is not universal. Studies 2 and 3 applied novel measures of phrasal (P head + nominal complement) productivity to the data. Study 2 examined prepositional phrases (PPs) whose head-complement pairs appeared in both child and adult speech, while Study 3 considered PPs produced by children that never occurred in adult speech. In both studies the productivity of functional Ps for English children developed faster than that of lexical Ps. In Spanish there were few differences, suggesting that children had already mastered both orders of Ps early in acquisition.
    These empirical results suggest that at least in English P is indeed a split category, and that children acquire the syntax of the functional subset very quickly, committing almost no errors. The UG position is thus supported. Next, the dissertation investigates a 'soft nativist' acquisition strategy that combines the distributional analysis of input, minimal a priori knowledge of the possible co-occurrence of morphosyntactic features associated with functional elements, and linguistic knowledge that is presumably acquired via the experience of pragmatic, communicative situations. The output of the analysis consists of a mapping of morphemes to the feature bundles of nominative pronouns for English and Spanish, plus specific claims about the sort of knowledge required from experience. The acquisition model is then extended to adpositions, to examine what, if anything, distributional analysis can tell us about the functional sequences of PPs. The results confirm the theoretical position according to which spatiotemporal Ps are lexical in character, rooting their own extended projections, and that functional Ps express an aspectual sequence in the functional superstructure of the PP.

    The major transitions in the evolution of language
