150 research outputs found

    Equilibrium (Zipf) and dynamic (Grassberger-Procaccia) method based analyses of human texts. A comparison of natural (English) and artificial (Esperanto) languages

    A comparison of two English texts by Lewis Carroll, one (Alice in Wonderland) also translated into Esperanto, the other (Through the Looking-Glass), is discussed in order to observe whether natural and artificial languages significantly differ from each other. One-dimensional, time-series-like signals are constructed using only word frequencies (FTS) or word lengths (LTS). The data is studied through (i) a Zipf method for sorting out correlations in the FTS and (ii) a Grassberger-Procaccia (GP) technique based method for finding correlations in the LTS. Features are compared: different power laws are observed with characteristic exponents for the ranking properties and for the phase space attractor dimensionality. The Zipf exponent can take values much less than unity (ca. 0.50 or 0.30) depending on how a sentence is defined. This non-universality is conjectured to be a measure of the author's style. Moreover, the attractor dimension r is a simple function of the so-called phase space dimension n, i.e., r = n^λ, with λ = 0.79. Such an exponent is also conjectured to be a measure of the author's creativity. However, even though there are quantitative differences between the original English text and its Esperanto translation, the qualitative differences are very minute, indicating in this case a translation that, along our analysis lines, respects the content of the author's writing relatively well. Comment: 22 pages, 87 references, 5 tables, 8 figures
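The Grassberger-Procaccia estimate applied to the length time series (LTS) amounts to embedding the series in an n-dimensional phase space, computing the correlation sum C(ε) (the fraction of point pairs closer than ε), and reading the attractor dimension off the slope of log C(ε) versus log ε. A minimal NumPy sketch of that estimator, not the authors' code; the function name and radii grid are illustrative:

```python
import numpy as np

def correlation_dimension(series, n, radii):
    """Grassberger-Procaccia correlation dimension of a 1-D series
    embedded in an n-dimensional phase space.

    series : 1-D array (e.g. the lengths of consecutive words)
    n      : embedding (phase space) dimension
    radii  : radii at which to evaluate the correlation sum
    """
    # Time-delay embedding: each point is n consecutive values.
    m = len(series) - n + 1
    points = np.array([series[i:i + n] for i in range(m)], dtype=float)

    # Pairwise distances between distinct embedded points (i < j).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    pair_dists = dists[np.triu_indices(m, k=1)]

    # Correlation sum C(eps) = fraction of pairs closer than eps.
    c = np.array([(pair_dists < eps).mean() for eps in radii])

    # The dimension is the slope of log C(eps) vs log eps in the
    # scaling region; here a simple least-squares fit over all radii.
    mask = c > 0
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(c[mask]), 1)
    return slope
```

On points lying along a line the estimate comes out close to 1, as expected for a one-dimensional attractor; on a real LTS one would fit only over the scaling region of radii.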

    Maximum likelihood estimation for constrained parameters of multinomial distributions - Application to Zipf-Mandelbrot models

    A numerical maximum likelihood (ML) estimation procedure is developed for the constrained parameters of multinomial distributions. The main difficulty involved in computing the likelihood function is the precise and fast determination of the multinomial coefficients. For this, the coefficients are rewritten as a telescopic product. The presented method is applied to the ML estimation of the Zipf–Mandelbrot (ZM) distribution, which provides a true model in many real-life cases. The examples discussed arise from ecological and medical observations. Based on the estimates, the hypothesis that the data is ZM distributed is tested using a chi-square test. The computer code of the presented procedure is available on request from the author
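The telescoping idea can be illustrated directly: the multinomial coefficient N!/(k₁!…k_m!) equals a product of binomial coefficients over the cumulative sums of the counts, and for large counts the log-gamma function gives a stable log-likelihood. A sketch under these standard identities; the ZM helper and its parameter names are illustrative, not the paper's code:

```python
from math import comb, lgamma, log
from itertools import accumulate

def multinomial_telescopic(counts):
    """Multinomial coefficient N!/(k_1!...k_m!) written as a
    telescopic product of binomials over cumulative sums."""
    result = 1
    for s, k in zip(accumulate(counts), counts):
        result *= comb(s, k)
    return result

def log_multinomial(counts):
    """Same coefficient in log space via log-gamma, stable for
    the large counts typical of word-frequency data."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)

def zipf_mandelbrot_loglik(counts, a, b):
    """Log-likelihood of rank counts under a Zipf-Mandelbrot model
    p(r) proportional to 1/(r + b)^a, ranks r = 1..m (hypothetical helper)."""
    m = len(counts)
    weights = [(r + b) ** (-a) for r in range(1, m + 1)]
    z = sum(weights)
    return log_multinomial(counts) + sum(
        k * (log(w) - log(z)) for k, w in zip(counts, weights))
```

An ML fit would then maximize `zipf_mandelbrot_loglik` over (a, b) with a standard numerical optimizer, subject to the paper's parameter constraints.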

    Theoretical results on a weightless neural classifier and application to computational linguistics

    WiSARD is an n-tuple classifier, historically used in pattern-recognition tasks on black-and-white images. Unfortunately, it was rarely used in other tasks, owing to its inability to cope with large volumes of data, since it is sensitive to the learned content. Recently, the bleaching technique was conceived as an improvement to the n-tuple classifier architecture, as a means of curbing WiSARD's sensitivity. Since then, the range of applications built with this learning system has grown. Because it routinely involves very large corpora, multilingual part-of-speech tagging fits into this group of applications. This thesis improves mWANN-Tagger, a weightless part-of-speech tagger proposed in 2012. The text shows that research on multilingual tagging with WiSARD was intensified through the use of quantitative linguistics, and that a universal parameter configuration was found for mWANN-Tagger. Analyses and experiments with the Universal Dependencies (UD) treebanks show that mWANN-Tagger has the potential to outperform state-of-the-art taggers given a better word representation. This thesis also aims to assess the advantages of bleaching over the traditional model through the framework of VC theory. The VC dimensions of both were calculated, establishing that an n-tuple classifier, whether plain WiSARD or with bleaching, that has N memories addressed by binary n-tuples has a VC dimension of exactly N(2^n − 1) + 1. A parallel was then drawn between the two models, from which it was deduced that the bleaching technique is an improvement to the n-tuple method that causes no loss to its learning capacity
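The n-tuple architecture and the bleaching threshold can be sketched compactly: each class holds N RAM nodes addressed by n-bit slices of the binary input; training increments the addressed counters, and classification counts how many RAMs reach the bleaching threshold (a threshold of 1 recovers plain WiSARD). A toy sketch, not mWANN-Tagger's implementation; the sequential bit-to-tuple mapping is a simplification of the usual random mapping:

```python
from collections import defaultdict

class WiSARD:
    """Minimal n-tuple (WiSARD) classifier with bleaching."""

    def __init__(self, input_len, n):
        assert input_len % n == 0
        self.n = n
        self.num_rams = input_len // n   # N RAM nodes per discriminator
        self.rams = {}                   # class label -> list of RAM dicts

    def _addresses(self, bits):
        # Split the binary input into consecutive n-bit addresses.
        return [tuple(bits[i:i + self.n])
                for i in range(0, len(bits), self.n)]

    def train(self, bits, label):
        rams = self.rams.setdefault(
            label, [defaultdict(int) for _ in range(self.num_rams)])
        for ram, addr in zip(rams, self._addresses(bits)):
            ram[addr] += 1               # bleaching keeps access counts

    def classify(self, bits, threshold=1):
        # Score = number of RAMs whose counter at this address reaches
        # the bleaching threshold; threshold=1 is plain WiSARD.
        addrs = self._addresses(bits)
        scores = {
            label: sum(ram[a] >= threshold for ram, a in zip(rams, addrs))
            for label, rams in self.rams.items()
        }
        return max(scores, key=scores.get)
```

Raising the threshold discards rarely reinforced addresses, which is exactly the sensitivity-curbing effect the bleaching technique was introduced for.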

    Fake news and propaganda: Trump’s democratic America and Hitler’s national socialist (Nazi) Germany

    This paper features an analysis of President Trump’s two State of the Union addresses, which are analysed by means of various data mining techniques, including sentiment analysis. The intention is to explore the contents and sentiments of the messages contained, the degree to which they differ, and their potential implications for the national mood and state of the economy. We also apply Zipf and Mandelbrot’s power law to assess the degree to which they differ from common language patterns. To provide a contrast and some parallel context, analyses are also undertaken of President Obama’s last State of the Union address and Hitler’s 1933 Berlin Proclamation. The structure of these four political addresses is remarkably similar. The three US Presidential speeches are more positive emotionally than is Hitler’s relatively shorter address, which is characterised by a prevalence of negative emotions. Hitler’s speech deviates the most from common speech, but all three appear to target their audiences by use of non-complex speech. However, it should be said that the economic circumstances in contemporary America and Germany in the 1930s are vastly different

    A joint text mining-rank size investigation of the rhetoric structures of the US Presidents’ speeches

    This work presents a text mining context and its use for a deep analysis of the messages delivered by politicians. Specifically, we deal with an expert systems-based exploration of the rhetoric dynamics of a large collection of US Presidents’ speeches, ranging from Washington to Trump. In particular, speeches are viewed as complex expert systems whose structures can be effectively analyzed through rank-size laws. The methodological contribution of the paper is twofold. First, we develop a text mining-based procedure for the construction of the dataset by using a web scraping routine on the Miller Center website – the repository site collecting the speeches. Second, we explore the implicit structure of the discourse data by implementing a rank-size procedure over the individual speeches, with the words of each speech ranked by frequency. The scientific significance of the proposed combination of text-mining and rank-size approaches lies in its flexibility and generality, which make it reproducible across a wide set of expert systems and text mining contexts. The usefulness of the proposed method and of the speeches analysis is demonstrated by the findings themselves. Indeed, in terms of impact, interesting conclusions of a social, political and linguistic nature can be drawn on how 45 United States Presidents, from April 30, 1789 till February 28, 2017, delivered political messages. The proposed analysis shows some remarkable regularities, not only inside a given speech, but also among different speeches. Moreover, under a purely methodological perspective, the presented contribution suggests possible ways of generating a linguistic decision-making algorithm
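The rank-size step described here reduces to sorting word frequencies in decreasing order and fitting a power law f(r) ~ C/r^α on log-log axes. A minimal sketch of that procedure, not the paper's code; the ordinary least-squares fit is one common choice:

```python
import math
from collections import Counter

def rank_size(text):
    """Rank-size sequence: word frequencies sorted in decreasing order."""
    freqs = Counter(text.lower().split())
    return sorted(freqs.values(), reverse=True)

def zipf_exponent(sizes):
    """Least-squares slope of log(size) vs log(rank); under a
    rank-size law f(r) ~ C / r^alpha the slope is -alpha."""
    xs = [math.log(r) for r in range(1, len(sizes) + 1)]
    ys = [math.log(s) for s in sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return -cov / var
```

Comparing the fitted α across speeches is what makes regularities between different Presidents' addresses visible.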

    Words by the tail : assessing lexical diversity in scholarly titles using frequency-rank distribution tail fits

    This research assesses the evolution of lexical diversity in scholarly titles using a new indicator based on Zipfian frequency-rank distribution tail fits. At the operational level, while both head and tail fits of Zipfian word distributions are more independent of corpus size than other lexical diversity indicators, the latter neatly outperforms the former in that regard. This benchmark-setting performance of Zipfian distribution tails proves extremely handy in distinguishing actual patterns in lexical diversity from the statistical noise generated by other indicators due to corpus size fluctuations. From an empirical perspective, analysis of Web of Science (WoS) article titles from 1975 to 2014 shows that the lexical concentration of scholarly titles in Natural Sciences & Engineering (NSE) and Social Sciences & Humanities (SSH) articles increases by a little less than 8% over the whole period. With the exception of the lexically concentrated Mathematics, Earth & Space, and Physics, NSE article titles all increased in lexical concentration, suggesting a probable convergence of concentration levels in the near future. As regards SSH disciplines, aggregation effects observed at the disciplinary group level suggest that, behind the stable concentration levels of SSH disciplines, a cross-disciplinary homogenization of the highest word frequency ranks may be at work. Overall, these trends suggest a progressive standardization of title wording in scientific article titles, as article titles get written using an increasingly restricted and cross-disciplinary set of words

    How many words are there?

    The commonsensical assumption that any language has only finitely many words is shown to be false by a combination of formal and empirical arguments. Zipf's Law and related formulas are investigated and a more complex model is offered