Complex systems approach to natural language
The review summarizes the main methodological concepts used in studying
natural language from the perspective of complexity science and documents their
applicability in identifying both universal and system-specific features of
language in its written representation. Three main complexity-related research
trends in quantitative linguistics are covered. The first part addresses the
issue of word frequencies in texts and demonstrates that taking punctuation
into account restores the scaling of Zipf's law, which is otherwise often
violated for the most frequent words. The second part introduces methods
inspired by time series analysis, used in studying various kinds of
correlations in written texts. The related time series are generated on the
basis of text partition into sentences or into phrases between consecutive
punctuation marks. It turns out that these series develop features often found
in signals generated by complex systems, like long-range correlations or
(multi)fractal structures. Moreover, it appears that the distances between
punctuation marks comply with the discrete variant of the Weibull distribution.
In the third part, the application of the network formalism to natural language
is reviewed, particularly in the context of the so-called word-adjacency
networks. Parameters characterizing the topology of such networks can be used
for classification of texts, for example, from a stylometric perspective. The
network approach can also be applied to represent the organization of word
associations. The structure of word-association networks turns out to be
significantly different from that observed in random networks, revealing
genuine properties of language. Finally, punctuation seems to have a
significant impact not only on the language's information-carrying ability but
also on its key statistical properties; hence it is recommended to consider
punctuation marks on a par with words.
Comment: 113 pages, 49 figures
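As an illustration of two quantities the abstract mentions, the following minimal sketch counts punctuation marks as tokens in a rank-frequency list and measures the distances, in words, between consecutive punctuation marks. It is not the authors' procedure: the tokenizer, the set of marks in `PUNCTUATION`, and the function names are assumptions made for this example only.

```python
import re
from collections import Counter

# Assumption: this small set of marks stands in for "punctuation";
# the review may use a different inventory.
PUNCTUATION = set(".,;:!?")


def tokenize(text):
    """Split text into lowercase words and single punctuation marks."""
    return re.findall(r"[a-z']+|[.,;:!?]", text.lower())


def rank_frequency(tokens):
    """Return (rank, token, count) triples sorted by decreasing count.

    Punctuation marks are ranked together with words.
    """
    counts = Counter(tokens)
    return [
        (rank, tok, n)
        for rank, (tok, n) in enumerate(counts.most_common(), start=1)
    ]


def punctuation_distances(tokens):
    """Number of words preceding each punctuation mark since the previous one.

    Words after the last punctuation mark are ignored.
    """
    distances = []
    run = 0
    for tok in tokens:
        if tok in PUNCTUATION:
            distances.append(run)
            run = 0
        else:
            run += 1
    return distances


sample = "The cat sat, and the dog ran. The end."
tokens = tokenize(sample)
ranks = rank_frequency(tokens)
distances = punctuation_distances(tokens)
```

On a real corpus, the `ranks` list could be plotted on log-log axes to inspect the Zipf-type scaling, and the histogram of `distances` could be compared with a discrete Weibull fit; both steps are left out here.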