7 research outputs found

    Lognormals, power laws and double power laws in the distribution of frequencies of harmonic codewords from classical music

    Zipf's law is a paradigm describing the importance of different elements in communication systems, especially in linguistics. Although language already has a complex hierarchical structure, music has in some sense an even more complex one, due to its multidimensional character (melody, harmony, rhythm, timbre, etc.). Thus, the relevance of Zipf's law in music is still an open question. Using discrete codewords representing harmonic content obtained from a large-scale analysis of classical composers, we show that a nearly universal Zipf-like law holds at a qualitative level. However, in an in-depth quantitative analysis, where we introduce the double power-law distribution as a new player in the classical debate between the superiority of Zipf's (power) law and that of the lognormal distribution, we conclude not only that universality does not hold, but also that no single probability distribution best describes the usage of the different codewords by each composer.

    A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics

    The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10^9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.

    Some topics in high-dimensional robust inference and graphical modeling

    2021 Summer. Includes bibliographical references. In this dissertation, we focus on large-scale robust inference and high-dimensional graphical modeling. Specifically, we study three problems: a large-scale inference method based on tail-robust regression, model specification tests for the dependence structure of Gaussian Markov random fields, and robust Gaussian graph estimation. First, we consider the problem of simultaneously testing a large number of general linear hypotheses, encompassing covariate-effect analysis, analysis of variance, and model comparisons. The new challenge that comes with the overwhelmingly large number of tests is the ubiquitous presence of heavy-tailed and/or highly skewed measurement noise, which is the main reason conventional least-squares-based methods fail. The new testing procedure is built on data-adaptive Huber regression and a new covariance estimator of the regression estimate. Under mild conditions, we show that the proposed methods produce consistent estimates of the false discovery proportion. Extensive numerical experiments, along with an empirical study on quantitative linguistics, demonstrate the advantage of our proposal over many state-of-the-art methods when the data are generated from heavy-tailed and/or skewed distributions. In the next chapter, we focus on Gaussian Markov random fields (GMRFs) and, by utilizing the connection between GMRFs and precision matrices, we propose an easily implemented procedure to assess the spatial structures modeled by GMRFs based on spatio-temporal observations. The new procedure is flexible enough to assess a variety of structures, including isotropic and directional dependence as well as the Matérn class. A comprehensive simulation study has been conducted to demonstrate the finite-sample performance of the procedure.
Motivated by efforts to model the spread of flu across the United States, we also apply our method to the Google Flu Trends data and report some interesting epidemiological findings. Finally, we propose a high-dimensional precision matrix estimation method via nodewise distributionally robust regressions. The distributionally robust regression with an ambiguity set defined by a Wasserstein-2 ball has a computationally tractable dual formulation, which is linked to square-root regressions. We propose an iterative algorithm that has a substantial advantage in terms of computation time. Extensive numerical experiments study the performance of the proposed method under various precision matrix structures and contamination models.

    Zipf extensions and their applications for modeling the degree sequences of real networks

    The Zipf distribution, also known as the discrete Pareto distribution, attracts considerable attention because it helps describe skewed data from many natural as well as man-made systems. Under the Zipf distribution, the frequency of a given value is a power function of its size. Consequently, when plotting the frequencies versus the size in log-log scale for data following this distribution, one obtains a straight line. Nevertheless, for many data sets the linearity is only observed in the tail, and when this happens the Zipf distribution is fitted only to values larger than a given threshold. This procedure implies a loss of information, and unless one is only interested in the tail of the distribution, more flexible alternative distributions are needed. The work conducted in this thesis revolves around four bi-parametric extensions of the Zipf distribution. The first two belong to the class of Random Stopped Extreme distributions. The third extension is the result of applying the concept of Poisson-Stopped-Sum to the Zipf distribution, and the last one is obtained by adding a parameter to the probability generating function of the Zipf distribution. An interesting characteristic of three of the models presented is that they allow for a parameter interpretation that gives some insight into the mechanism that generates the data. In order to analyze the performance of these models, we have fitted the degree sequences of real networks from different areas, such as social networks, protein interaction networks, and collaboration networks. The fits obtained have been compared with those obtained with other bi-parametric models, such as the Zipf-Mandelbrot, the discrete Weibull, and the negative binomial.
To facilitate the use of the models presented, they have been implemented in the zipfextR package available in the Comprehensive R Archive Network. Estadística i Investigació Operativ
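
The straight-line property stated in the abstract above can be checked numerically. The following is a minimal sketch, not the thesis's method or the zipfextR API: it assumes a Zipf distribution truncated at a hypothetical maximum value `k_max` (the untruncated Zipf is normalized by the Riemann zeta function instead), and the helper names `zipf_pmf` and `loglog_slope` are illustrative only.

```python
import math


def zipf_pmf(alpha, k_max):
    """Truncated Zipf pmf (assumption): P(k) proportional to k**(-alpha), k = 1..k_max."""
    weights = [k ** (-alpha) for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def loglog_slope(pmf):
    """Ordinary least-squares slope of log P(k) against log k."""
    xs = [math.log(k) for k in range(1, len(pmf) + 1)]
    ys = [math.log(p) for p in pmf]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical parameter values chosen only for illustration.
pmf = zipf_pmf(alpha=2.5, k_max=1000)
slope = loglog_slope(pmf)
print(f"fitted log-log slope: {slope:.6f}")
```

Because log P(k) = -alpha log k - log(normalizing constant), the fitted slope recovers -alpha up to floating-point error. Real degree sequences typically bend away from this line at small values, which is the situation the abstract gives as the motivation for the bi-parametric extensions.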