Identifying Content and Function Words in Non-Annotated Corpora
Every corpus of natural language text exhibits tendencies that arise from general properties of language, such as the principle of least effort. One of these phenomena is the typical distribution of frequency classes: a relatively small number of word types covers the bulk of the text, while a large part of the vocabulary occurs only once. The latter types are called singletons or hapax legomena.
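
The frequency distribution described above can be illustrated with a minimal sketch. It is not the authors' method: the function name `frequency_profile`, the regex-based tokenizer, and the toy sentence are all assumptions made for illustration. The sketch counts word types, lists the types that occur exactly once, and reports the share of tokens covered by the most frequent types.

```python
import re
from collections import Counter


def frequency_profile(text, top_n=2):
    """Summarize type frequencies in a raw, non-annotated text.

    Hypothetical helper for illustration only. Tokenization is a naive
    assumption: lowercase runs of ASCII letters.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)

    # Hapax legomena (singletons): types that occur exactly once.
    hapaxes = sorted(word for word, count in counts.items() if count == 1)

    # Share of all tokens covered by the top_n most frequent types.
    top = counts.most_common(top_n)
    coverage = sum(count for _, count in top) / len(tokens) if tokens else 0.0

    return {
        "tokens": len(tokens),
        "types": len(counts),
        "hapaxes": hapaxes,
        "coverage": coverage,
    }


# Toy input, invented for this example.
sample = "the cat saw the dog and the dog saw the bird"
profile = frequency_profile(sample)
print(profile)
```

Even in this toy sentence, two of the six types account for more than half of the tokens, while half of the types are hapax legomena. A real corpus would need a proper tokenizer and far more text before the distribution becomes meaningful.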