30 research outputs found

    A joint text mining-rank size investigation of the rhetoric structures of the US Presidents’ speeches

    © 2019 Elsevier Ltd. This work presents a text mining context and its use for an in-depth analysis of the messages delivered by politicians. Specifically, we carry out an expert systems-based exploration of the rhetoric dynamics of a large collection of US Presidents’ speeches, ranging from Washington to Trump. In particular, speeches are viewed as complex expert systems whose structures can be effectively analyzed through rank-size laws. The methodological contribution of the paper is twofold. First, we develop a text mining-based procedure for constructing the dataset, using a web scraping routine on the Miller Center website, the repository that collects the speeches. Second, we explore the implicit structure of the discourse data by applying a rank-size procedure to the individual speeches, with the words of each speech ranked by frequency. The scientific significance of combining text mining and rank-size approaches lies in its flexibility and generality, which make it reproducible across a wide set of expert systems and text mining contexts. The usefulness of the proposed method and of the speech analysis is demonstrated by the findings themselves. In terms of impact, the analysis supports conclusions of a social, political and linguistic nature on how 45 United States Presidents delivered political messages from April 30, 1789 to February 28, 2017. The analysis shows some remarkable regularities, not only within a given speech but also across different speeches. Moreover, from a purely methodological perspective, the contribution suggests possible ways of generating a linguistic decision-making algorithm.
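The rank-size step described in this abstract, ranking the words of a speech by frequency, can be illustrated with a minimal sketch. This is not the authors' code: the function name `rank_size`, the tokenization rule, and the toy sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def rank_size(text):
    """Return (rank, word, frequency) tuples, most frequent word first.

    Hypothetical helper: the tokenization (lowercase letters and
    apostrophes only) is an assumption, not the paper's procedure.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending frequency; break ties alphabetically so the
    # ranking is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

# Toy input, not taken from any real speech.
sample = "the people and the government and the people"
table = rank_size(sample)
for rank, word, freq in table:
    print(rank, word, freq)
```

A rank-size law would then be fitted to the (rank, frequency) pairs of each speech; the fitting step is omitted here because the abstract does not specify the functional form used.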

    Zipf extensions and their applications for modeling the degree sequences of real networks

    The Zipf distribution, also known as the discrete Pareto distribution, attracts considerable attention because it helps describe skewed data from many natural as well as man-made systems. Under the Zipf distribution, the frequency of a given value is a power function of its size. Consequently, when the frequencies are plotted against the size on a log-log scale for data following this distribution, one obtains a straight line. For many data sets, however, the linearity is observed only in the tail, and in that case the Zipf is fitted only to values larger than a given threshold. This procedure implies a loss of information, and unless one is interested only in the tail of the distribution, more flexible alternative distributions are needed. The work conducted in this thesis revolves around four bi-parametric extensions of the Zipf distribution. The first two belong to the class of Random Stopped Extreme distributions. The third extension results from applying the concept of Poisson-Stopped-Sum to the Zipf distribution, and the last one is obtained by adding a parameter to the probability generating function of the Zipf. An interesting characteristic of three of the models presented is that they allow for a parameter interpretation that gives some insight into the mechanism that generates the data. To analyze the performance of these models, we have fitted the degree sequences of real networks from different areas, such as social networks, protein interaction networks and collaboration networks. The fits obtained have been compared with those of other bi-parametric models, such as the Zipf-Mandelbrot, the discrete Weibull and the negative binomial. To facilitate the use of the models presented, they have been implemented in the zipfextR package, available in the Comprehensive R Archive Network.
    Estadística i Investigació Operativa
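The straight-line property mentioned in this abstract can be checked numerically. The sketch below is illustrative only and does not use the zipfextR package: it assumes a Zipf distribution truncated to the support 1..n_max (the untruncated Zipf normalizes by the Riemann zeta function instead), and the names `zipf_pmf` and `loglog_slope` are hypothetical.

```python
import math

def zipf_pmf(alpha, n_max):
    """Truncated Zipf probabilities p(k) proportional to k**(-alpha), k = 1..n_max.

    Assumption: finite support, so the normalizing constant is a plain sum.
    """
    weights = [k ** (-alpha) for k in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def loglog_slope(probs):
    """Least-squares slope of log p(k) against log k."""
    xs = [math.log(k) for k in range(1, len(probs) + 1)]
    ys = [math.log(p) for p in probs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

probs = zipf_pmf(alpha=2.5, n_max=1000)
slope = loglog_slope(probs)
print(slope)  # approximately -alpha, up to floating-point error
```

Because log p(k) = -alpha * log k - log(total), the points lie exactly on a line with slope -alpha. Real degree sequences typically deviate from this line for small values, which is the motivation the abstract gives for the two-parameter extensions.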

    A rank-size approach to the analysis of socio-economics data

    This thesis investigates two important phenomena, one natural and one human. The first concerns earthquakes, while the second relates to the content of the official speeches of US Presidents. For the first, our aim is to define an indicator of the economic damage caused by earthquakes, proposing an index calibrated on a long series of magnitudes recorded over long periods of time. For the presidential speeches, we want to quantify their impact on the financial market; in particular, we study their effect on the “Standard and Poor’s 500” index. Our main objective is to contribute to economic policy choices by taking these phenomena into account and analyzing them with a different and innovative approach. The analysis presented in this thesis is developed by means of econophysics tools closely connected to “rank-size” analysis. This analysis uses a set of functions that are particularly useful for exploring the properties of large datasets, even when the data are distributed over time and have error bands that are not perfectly defined because of particular sampling conditions. The chapters on earthquakes, as well as those devoted to the speeches of US Presidents, present and discuss the results of nonlinear regressions used to estimate the coefficients of various “rank-size” laws. These estimates have been processed so as to reach conclusions of economic relevance. The most robust results were achieved thanks to the extraordinary ability of “rank-size” laws to interpret the data. In assessing the economic impact of the presidential speeches, an additional analysis was carried out by evaluating different distances between time series, in particular between the time series of words semantically related to economics spoken by US Presidents throughout history and the time series of the volume, prices and returns of the “Standard & Poor's 500” index. For this analysis we employed a probabilistic approach as well as a purely topological one. Specifically, we measured the entropy of the time series and compared the conclusions by evaluating the differences among several vector distance measures.

    Networks, complexity and internet regulation: scale-free law

    No description supplied

    The multi-line problem : an investigation of the relationships among the sales of the lines of multi-line products.

    This thesis considers the analysis of sales patterns of multi-line products, that is, products sold in different sizes, different colours, or different patterns of what is basically the same commodity. Part 1 discusses the incidence of such products in present-day capitalist markets. The reasons for the increasing numbers of these products are outlined, and the effects of their existence on manufacturing and distribution facilities are qualitatively described. Part 2 is concerned with the problem of developing a satisfactory quantitative description of the interrelationship, in sales units, between the lines of a product. Two empirical relationships are derived, which in view of their generality for the data so far analysed, I have had the temerity to call laws. Some applications of these are given and their relationship to other social science Part 3 describes my attempts to relate these empirical laws to more basic causes. Owing to the complexity of the factors involved, no great depth of analysis has been achieved. A model in terms of aggregated factors is given for one of the laws, and the other law is not negated by the model. I am grateful to the Colonial Sugar Refining Company for permission to use certain data, and for general assistance. My thanks also go to other firms, most of whom prefer anonymity, for supplying data.

    Learning and the structure of citation networks

    The distribution of citations received by scientific publications can be approximated by a power law, a finding that has been explained by “cumulative advantage”. This paper argues that socially embedded learning is a plausible mechanism behind this cumulative advantage. A model assuming that scientists face a time trade-off between learning and writing papers, that they learn the papers known by their peers, and that they cite papers they know, generates a power law distribution of popularity and a shifted power law for the distribution of citations received. Both distributions flatten when relatively more time is spent on learning. The predicted exponent for the distribution of citations is independent of the average in- (or out-) degree, contrary to an untested prediction of the reference model (Price, 1976). Using publicly available citation networks, the share of time devoted to learning (as opposed to producing) is estimated at around two thirds.