Speech tested for Zipfian fit using rigorous statistical techniques
Zipf’s law describes the relationship between the frequencies of words in a corpus and their rank. Its most basic form is a simple series, indicating that the frequency of a word is inversely proportional to its rank: 1/2, 1/3, 1/4, ...

The past two decades have seen the emergence of usage-based and cognitive approaches to language study. A key observation of these approaches, along with the importance of frequency, is that speech differs in substantial and structural ways from writing. Yet, except for a few older analyses performed on very small corpora, most studies of Zipf’s law have been done on written corpora. Further, the judgement of Zipfianness in much of this work is based on loose and informal criteria.

In recent years, sophisticated statistical techniques for curve fitting have been developed in the mathematics and physics literature. These include the use of the Kolmogorov-Smirnov statistic, along with maximum likelihood estimation, to generate p-values, and the use of the complementary error function for normal distributions. The latter helps determine whether a corpus that fails a Zipfian fit might be better described by another distribution.

In this paper, we will:
- Show, using rigorous statistical techniques, that three corpora of recorded speech follow a power law distribution: Buckeye, Santa Barbara, MiCase
- Describe preliminary results showing that the techniques outlined in this paper may be useful in the diagnosis of conditions that can include disordered speech.
- Explain how to do the analyses described in this paper.
- Explain how to download and use the R/Python code we have written and packaged as the Zipf Tool Ki
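The fitting steps named in this abstract (maximum likelihood estimation and the Kolmogorov-Smirnov statistic) can be illustrated with a minimal standard-library sketch. This is not the authors' packaged code: the function names `rank_frequency`, `hurwitz_zeta`, `fit_alpha`, and `ks_distance` are hypothetical, the exponent is estimated with the approximate discrete estimator described in the power-law fitting literature (reported there to be accurate mainly for larger `xmin`), and the variable being fitted is assumed to be the per-word occurrence count.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    return [count for _, count in Counter(tokens).most_common()]


def hurwitz_zeta(alpha, q, terms=1000):
    """Approximate sum over k >= 0 of (k + q)^(-alpha), for alpha > 1.

    Sums the first `terms` terms directly and adds an Euler-Maclaurin
    tail correction; a stdlib stand-in for a library zeta function.
    """
    head = sum((k + q) ** -alpha for k in range(terms))
    n = terms + q
    return head + n ** (1 - alpha) / (alpha - 1) + 0.5 * n ** -alpha


def fit_alpha(counts, xmin=1):
    """Approximate maximum likelihood estimate of the power-law exponent.

    Assumption: uses alpha = 1 + n / sum(ln(x / (xmin - 0.5))),
    an approximation to the exact discrete estimator.
    """
    tail = [x for x in counts if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


def ks_distance(counts, alpha, xmin=1):
    """Largest gap between the empirical CDF and the fitted discrete power law."""
    tail = [x for x in counts if x >= xmin]
    n = len(tail)
    norm = hurwitz_zeta(alpha, xmin)
    by_value = Counter(tail)
    seen = 0
    worst = 0.0
    for x in sorted(by_value):
        # Gap just below x, before the empirical CDF steps up.
        below = abs(seen / n - (1.0 - hurwitz_zeta(alpha, x) / norm))
        seen += by_value[x]
        # Gap at x itself: model P(X <= x) = 1 - zeta(alpha, x + 1) / zeta(alpha, xmin).
        at = abs(seen / n - (1.0 - hurwitz_zeta(alpha, x + 1) / norm))
        worst = max(worst, below, at)
    return worst
```

A typical use would be `counts = rank_frequency(tokens)`, then `alpha = fit_alpha(counts)` and `ks_distance(counts, alpha)`. Turning the distance into a p-value additionally requires comparing it against distances obtained from synthetic samples drawn from the fitted model, which this sketch omits.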
A First Look at Zipf's Law and the Speech of Children with Autism Spectrum Disorder
Zipf’s law asserts that words form a power law distribution: word frequency is inversely proportional to rank. Relatively recent cognitive and usage-based linguistics argue that speech differs structurally from writing. Except for a few older analyses performed on tiny corpora, studies of Zipf’s law prior to 2021 have been done on written corpora and use informal methods to determine Zipfianness.
We argue that recent work indicating that transcribed speech forms a Zipfian distribution can be extended to the speech of traditionally developing children. Further, we show that the transcribed speech of children with a clinical diagnosis of autism spectrum disorder is non-Zipfian. These judgements are made using the formal statistical techniques developed in Clauset et al. (2009). They include the Kolmogorov-Smirnov statistic for goodness of fit and the likelihood ratio test to rule out other distributions.
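The likelihood ratio comparison mentioned above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure as implemented by the authors: the alternative distribution is taken to be a discrete exponential (geometric) law purely for simplicity, the names `hurwitz_zeta` and `likelihood_ratio_test` are hypothetical, and the exponent `alpha` is assumed to have been estimated beforehand. The p-value uses the complementary error function, following the normal approximation for the log-likelihood ratio.

```python
import math


def hurwitz_zeta(alpha, q, terms=1000):
    """Approximate sum over k >= 0 of (k + q)^(-alpha), for alpha > 1."""
    head = sum((k + q) ** -alpha for k in range(terms))
    n = terms + q
    return head + n ** (1 - alpha) / (alpha - 1) + 0.5 * n ** -alpha


def likelihood_ratio_test(counts, alpha, xmin=1):
    """Compare a discrete power law with a geometric alternative.

    Returns (ratio, p_value). A positive ratio favours the power law,
    a negative one the geometric law; a small p_value suggests the
    sign of the ratio is unlikely to be a chance fluctuation.
    """
    tail = [x for x in counts if x >= xmin]
    n = len(tail)
    mean_excess = sum(x - xmin for x in tail) / n
    if mean_excess == 0:
        raise ValueError("need at least two distinct values in the tail")

    # Maximum likelihood fit of the geometric law p(x) = (1 - q) * q^(x - xmin).
    q = mean_excess / (1.0 + mean_excess)
    log_norm = math.log(hurwitz_zeta(alpha, xmin))

    # Pointwise log-likelihood differences: power law minus geometric.
    diffs = [
        (-alpha * math.log(x) - log_norm)
        - (math.log(1.0 - q) + (x - xmin) * math.log(q))
        for x in tail
    ]
    ratio = sum(diffs)
    mean = ratio / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    if sigma == 0:
        raise ValueError("log-likelihood differences have zero variance")

    # Normal approximation: p = erfc(|R| / (sqrt(2 n) * sigma)).
    p_value = math.erfc(abs(ratio) / (math.sqrt(2 * n) * sigma))
    return ratio, p_value
```

In practice the same comparison would be repeated against other candidate distributions, such as the lognormal, before a corpus is judged Zipfian or non-Zipfian.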
Clauset, A., Shalizi, C. R., and Newman, M. E. J. (2009). Power-Law Distributions in Empirical Data. SIAM Review, 51(4), 661–703.