Understanding Zipf's law of word frequencies through sample-space collapse in sentence formation
The formation of sentences is a highly structured and history-dependent
process. The probability of using a specific word in a sentence strongly
depends on the 'history' of word-usage earlier in that sentence. We study a
simple history-dependent model of text generation assuming that the
sample-space of word usage reduces along sentence formation, on average. We
first show that the model explains the approximate Zipf law found in word
frequencies as a direct consequence of sample-space reduction. We then
empirically quantify the amount of sample-space reduction in the sentences of
ten famous English books, by analysis of corresponding word-transition tables
that capture which words can follow any given word in a text. We find a highly
nested structure in these transition tables and show that this 'nestedness' is
tightly related to the power law exponents of the observed word frequency
distributions. With the proposed model it is possible to understand that the
nestedness of a text can be the origin of the actual scaling exponent, and that
deviations from the exact Zipf law can be understood by variations of the
degree of nestedness on a book-by-book basis. On a theoretical level we are
able to show that in the case of weak nesting, Zipf's law breaks down in a fast
transition. Unlike previous attempts to understand Zipf's law in language, the
sample-space-reducing model is not based on assumptions of multiplicative,
preferential, or self-organised critical mechanisms behind language formation,
but simply uses the empirically quantifiable parameter 'nestedness' to
understand the statistics of word frequencies.
Comment: 7 pages, 4 figures. Accepted for publication in the Journal of the Royal Society Interface
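The sample-space-reducing process described in this abstract is simple to simulate. The sketch below is a minimal illustration, not taken from the paper (the state count, number of runs, and function names are assumptions): each cascade starts from a uniform draw over N states, and every subsequent draw is uniform over the states strictly below the current one, so the visit frequency of state i comes out proportional to 1/i, i.e. Zipf's law with exponent 1.

```python
import random
from collections import Counter

def ssr_run(n_states, rng):
    """One cascade of a sample-space-reducing (SSR) process:
    start from a uniform draw over {1..N}; each subsequent draw
    is uniform over the states strictly below the current one,
    until state 1 is reached."""
    current = rng.randint(1, n_states)
    visits = [current]
    while current > 1:
        current = rng.randint(1, current - 1)
        visits.append(current)
    return visits

def ssr_frequencies(n_states=100, n_runs=50_000, seed=1):
    """Relative visit frequencies of each state over many cascades."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        counts.update(ssr_run(n_states, rng))
    total = sum(counts.values())
    return {state: c / total for state, c in counts.items()}

freqs = ssr_frequencies()
# Zipf scaling p(i) ~ 1/i implies, e.g., freqs[1] / freqs[10] close to 10.
```

Comparing the frequency of state 1 against states 2 and 10 gives a quick empirical check of the 1/i scaling that the abstract attributes to sample-space reduction.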
Heaps' law, statistics of shared components and temporal patterns from a sample-space-reducing process
Zipf's law is a hallmark of several complex systems with a modular structure,
such as books composed of words or genomes composed of genes. In these
component systems, Zipf's law describes the empirical power law distribution of
component frequencies. Stochastic processes based on a sample-space-reducing
(SSR) mechanism, in which the number of accessible states reduces as the system
evolves, have been recently proposed as a simple explanation for the ubiquitous
emergence of this law. However, many complex component systems are
characterized by other statistical patterns beyond Zipf's law, such as a
sublinear growth of the component vocabulary with the system size, known as
Heaps' law, and specific statistics of shared components. This work shows,
with analytical calculations and simulations, that these statistical properties
can emerge jointly from an SSR mechanism, thus making it an appropriate
parameter-poor representation for component systems. Several alternative (and
equally simple) models, for example based on the preferential attachment
mechanism, can also reproduce Heaps' and Zipf's laws, suggesting that
additional statistical properties should be taken into account to select the
most-likely generative process for a specific system. Along this line, we will
show that the temporal component distribution predicted by the SSR model is
markedly different from the one emerging from the popular rich-gets-richer
mechanism. A comparison with empirical data from natural language indicates
that the SSR process can be chosen as a better candidate model for text
generation based on this statistical property. Finally, a limitation of the SSR
model in reproducing the empirical "burstiness" of word appearances in texts
will be pointed out, thus indicating a possible direction for extensions of the
basic SSR process.
Comment: 14 pages, 4 figures
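The sublinear vocabulary growth discussed above can be read directly off a simulated SSR token stream. The sketch below is illustrative only (state count, text length, and seed are assumptions): it chains SSR cascades into one stream and records the number of distinct "word types" seen after each token, i.e. the Heaps' curve.

```python
import random

def ssr_text(n_states, n_tokens, rng):
    """Generate a token stream by chaining SSR cascades: each cascade
    walks down from a uniform start until state 1, then the process
    restarts from the full sample space."""
    tokens = []
    current = rng.randint(1, n_states)
    while len(tokens) < n_tokens:
        tokens.append(current)
        if current == 1:
            current = rng.randint(1, n_states)   # restart a new cascade
        else:
            current = rng.randint(1, current - 1)  # shrink the sample space
    return tokens

def vocabulary_growth(tokens):
    """Number of distinct types after each prefix length (Heaps' curve)."""
    seen, growth = set(), []
    for t in tokens:
        seen.add(t)
        growth.append(len(seen))
    return growth

rng = random.Random(7)
tokens = ssr_text(n_states=10_000, n_tokens=20_000, rng=rng)
growth = vocabulary_growth(tokens)
# Sublinearity: doubling the text length less than doubles the vocabulary.
```

Comparing the vocabulary size at the full length against the half length shows the sublinear (Heaps-like) growth the abstract refers to, without fitting an exponent.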
Annotation persistence over dynamic documents
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 2005. Includes bibliographical references (p. 212-216).

Annotations, as a routine practice of actively engaging with reading materials, are heavily used in the paper world to augment the usefulness of documents. By annotation, we include a large variety of creative manipulations by which the otherwise passive reader becomes actively involved in a document. Annotations in digital form possess many benefits paper annotations do not enjoy, such as annotation searching, annotation multi-referencing, and annotation sharing. The digital form also introduces challenges to the process of annotation. This study looks at one of them: annotation persistence over dynamic documents.

With the development of annotation software, users now have the opportunity to annotate documents which they don't own, or to which they don't have write permission. In annotation software, annotations are normally created and saved independently of the document. The owners of the documents being annotated may have no knowledge of the fact that third parties are annotating their documents' contents. When document contents are modified, annotation software faces a difficult situation where annotations need to be reattached. Reattaching annotations in a revised version of a document is a crucial component in annotation system design. Annotation persistence over document versions is a complicated and challenging problem, as documents can go through various changes between versions.

In this thesis, we treat annotation persistence over dynamic documents as a specialized information retrieval problem. We then design a scheme to reposition annotations between versions by three mechanisms: meta-structure information matching, keyword matching, and content semantics matching. Content semantics matching is the determining factor in our annotation persistence scheme design.
Latent Semantic Analysis, an innovative information retrieval model, is used to extract and compare document semantics. Two editions of an introductory computer science textbook are used to evaluate the annotation persistence scheme proposed in this study. The evaluation provides substantial evidence that the proposed scheme is able to make the right decisions on repositioning annotations based on their degree of modification, i.e. to reattach annotations if modifications are light, and to orphan annotations if modifications are heavy.

by Shaomin Wang. Ph.D.
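The thesis combines three matching mechanisms, with LSA-based semantics matching as the deciding one. As a hedged illustration of the repositioning decision only, the sketch below substitutes a plain term-frequency cosine for the LSA step; the function names, the threshold value, and the exact orphaning rule are assumptions made for illustration, not the thesis's actual design.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector for a passage."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def reattach(anchor_text, candidates, threshold=0.5):
    """Pick the passage in the revised document most similar to the
    annotated anchor; orphan the annotation (return None) if no
    candidate clears the similarity threshold."""
    anchor = term_vector(anchor_text)
    best_i, best_s = None, 0.0
    for i, cand in enumerate(candidates):
        s = cosine(anchor, term_vector(cand))
        if s > best_s:
            best_i, best_s = i, s
    return best_i if best_s >= threshold else None
```

A lightly edited passage keeps a high cosine with its original and is reattached, while a heavily rewritten one falls below the threshold and the annotation is orphaned, mirroring the light-versus-heavy modification behavior reported in the evaluation.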
Intelligent Systems
This book is dedicated to intelligent systems of broad-spectrum application, such as personal and social biosafety, or the use of intelligent sensory micro-nanosystems such as "e-nose", "e-tongue" and "e-eye". In addition, effectively acquiring information, knowledge management and improved knowledge transfer in any medium, as well as modeling information content using meta- and hyper-heuristics and semantic reasoning, all benefit from the systems covered in this book. Intelligent systems can also be applied in education, in generating intelligent distributed eLearning architectures, and in a large number of technical fields, such as industrial design, manufacturing and utilization, e.g., in precision agriculture, cartography, electric power distribution systems, intelligent building management systems, drilling operations, etc. Furthermore, decision making using fuzzy logic models, computational recognition of comprehension uncertainty, the joint synthesis of goals and means of intelligent behavior in biosystems, as well as diagnostics and human support in the healthcare environment, have also been made easier.
Learning Functional Prepositions
In first language acquisition, what does it mean for a grammatical category to have been acquired, and what are the mechanisms by which children learn functional categories in general? In the context of prepositions (Ps), if the lexical/functional divide cuts through the P category, as has been suggested in the theoretical literature, then constructivist accounts of language acquisition would predict that children develop adult-like competence with the more abstract units, functional Ps, at a slower rate compared to their acquisition of lexical Ps. Nativists instead assume that the features of functional P are made available by Universal Grammar (UG), and are mapped as quickly as, if not faster than, the semantic features of their lexical counterparts. Conversely, if Ps are either all lexical or all functional, on both accounts of acquisition we should observe few differences in learning.
Three empirical studies of the development of P were conducted via computer analysis of the English and Spanish sub-corpora of the CHILDES database. Study 1 analyzed errors in child usage of Ps, finding almost no errors of commission in either language, but that the English learners lag in their production of functional Ps relative to lexical Ps. That no such delay was found in the Spanish data suggests that the English pattern is not universal. Studies 2 and 3 applied novel measures of phrasal (P head + nominal complement) productivity to the data. Study 2 examined prepositional phrases (PPs) whose head-complement pairs appeared in both child and adult speech, while Study 3 considered PPs produced by children that never occurred in adult speech. In both studies the productivity of functional Ps for English children developed faster than that of lexical Ps. In Spanish there were few differences, suggesting that children had already mastered both orders of Ps early in acquisition. These empirical results suggest that, at least in English, P is indeed a split category, and that children acquire the syntax of the functional subset very quickly, committing almost no errors. The UG position is thus supported.
Next, the dissertation investigates a 'soft nativist' acquisition strategy that combines distributional analysis of the input, minimal a priori knowledge of the possible co-occurrence of morphosyntactic features associated with functional elements, and linguistic knowledge that is presumably acquired via the experience of pragmatic, communicative situations. The output of the analysis consists of a mapping of morphemes to the feature bundles of nominative pronouns for English and Spanish, plus specific claims about the sort of knowledge required from experience.
The acquisition model is then extended to adpositions, to examine what, if anything, distributional analysis can tell us about the functional sequences of PPs. The results confirm the theoretical position according to which spatiotemporal Ps are lexical in character, rooting their own extended projections, and that functional Ps express an aspectual sequence in the functional superstructure of the PP.