150 research outputs found

    Equilibrium (Zipf) and dynamic (Grassberger-Procaccia) method based analyses of human texts. A comparison of natural (English) and artificial (Esperanto) languages

    A comparison of two English texts by Lewis Carroll, one (Alice in Wonderland) also translated into Esperanto, the other (Through the Looking-Glass), is discussed in order to observe whether natural and artificial languages significantly differ from each other. One-dimensional, time-series-like signals are constructed using only word frequencies (FTS) or word lengths (LTS). The data is studied through (i) a Zipf method for sorting out correlations in the FTS and (ii) a Grassberger-Procaccia (GP) technique based method for finding correlations in the LTS. Features are compared: different power laws are observed with characteristic exponents for the ranking properties and for the phase space attractor dimensionality. The Zipf exponent can take values much less than unity (ca. 0.50 or 0.30) depending on how a sentence is defined. This non-universality is conjectured to be a measure of the author's style. Moreover, the attractor dimension r is a simple function of the so-called phase space dimension n, i.e., r = n^λ, with λ = 0.79. Such an exponent is also conjectured to be a measure of the author's creativity. However, even though there are quantitative differences between the original English text and its Esperanto translation, the qualitative differences are very minute, indicating in this case a translation that, along our analysis lines, respects the content of the author's writing relatively well. Comment: 22 pages, 87 references, 5 tables, 8 figures
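The Grassberger-Procaccia estimate applied to the length time series (LTS) amounts to embedding the series in an n-dimensional phase space, computing the correlation sum C(ε) (the fraction of point pairs closer than ε), and reading the attractor dimension off the slope of log C(ε) versus log ε. A minimal NumPy sketch of that estimator, not the authors' code; the function name and radii grid are illustrative:

```python
import numpy as np

def correlation_dimension(series, n, radii):
    """Grassberger-Procaccia correlation dimension of a 1-D series
    embedded in an n-dimensional phase space.

    series : 1-D array (e.g. the lengths of consecutive words)
    n      : embedding (phase space) dimension
    radii  : radii at which to evaluate the correlation sum
    """
    # Time-delay embedding: each point is n consecutive values.
    m = len(series) - n + 1
    points = np.array([series[i:i + n] for i in range(m)], dtype=float)

    # Pairwise distances between distinct embedded points (i < j).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    pair_dists = dists[np.triu_indices(m, k=1)]

    # Correlation sum C(eps) = fraction of pairs closer than eps.
    c = np.array([(pair_dists < eps).mean() for eps in radii])

    # The dimension is the slope of log C(eps) vs log eps in the
    # scaling region; here a simple least-squares fit over all radii.
    mask = c > 0
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(c[mask]), 1)
    return slope
```

On points lying along a line the estimate comes out close to 1, as expected for a one-dimensional attractor; on a real LTS one would fit only over the scaling region of radii.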

    Maximum likelihood estimation for constrained parameters of multinomial distributions - Application to Zipf-Mandelbrot models

    A numerical maximum likelihood (ML) estimation procedure is developed for the constrained parameters of multinomial distributions. The main difficulty involved in computing the likelihood function is the precise and fast determination of the multinomial coefficients. For this, the coefficients are rewritten as a telescopic product. The presented method is applied to the ML estimation of the Zipf–Mandelbrot (ZM) distribution, which provides a true model in many real-life cases. The examples discussed arise from ecological and medical observations. Based on the estimates, the hypothesis that the data is ZM distributed is tested using a chi-square test. The computer code of the presented procedure is available on request from the author
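The telescoping idea can be illustrated directly: the multinomial coefficient N!/(k₁!…k_m!) equals a product of binomial coefficients over the cumulative sums of the counts, and for large counts the log-gamma function gives a stable log-likelihood. A sketch under these standard identities; the ZM helper and its parameter names are illustrative, not the paper's code:

```python
from math import comb, lgamma, log
from itertools import accumulate

def multinomial_telescopic(counts):
    """Multinomial coefficient N!/(k_1!...k_m!) written as a
    telescopic product of binomials over cumulative sums."""
    result = 1
    for s, k in zip(accumulate(counts), counts):
        result *= comb(s, k)
    return result

def log_multinomial(counts):
    """Same coefficient in log space via log-gamma, stable for
    the large counts typical of word-frequency data."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)

def zipf_mandelbrot_loglik(counts, a, b):
    """Log-likelihood of rank counts under a Zipf-Mandelbrot model
    p(r) proportional to 1/(r + b)^a, ranks r = 1..m (hypothetical helper)."""
    m = len(counts)
    weights = [(r + b) ** (-a) for r in range(1, m + 1)]
    z = sum(weights)
    return log_multinomial(counts) + sum(
        k * (log(w) - log(z)) for k, w in zip(counts, weights))
```

An ML fit would then maximize `zipf_mandelbrot_loglik` over (a, b) with a standard numerical optimizer, subject to the paper's parameter constraints.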

    Theoretical results on a weightless neural classifier and application to computational linguistics

    WiSARD is an n-tuple classifier, historically used in pattern-recognition tasks on black-and-white images. Unfortunately, it was rarely used in other tasks, owing to its inability to cope with large volumes of data, since it is sensitive to the learned content. Recently, the bleaching technique was conceived as an improvement to the n-tuple classifier architecture, as a means of curbing WiSARD's sensitivity. Since then, the range of applications built with this learning system has grown. Because it routinely involves very large corpora, multilingual part-of-speech tagging fits into this group of applications. This thesis improves mWANN-Tagger, a weightless part-of-speech tagger proposed in 2012. The text shows that research on multilingual tagging with WiSARD was intensified through the use of quantitative linguistics, and that a universal parameter configuration was found for mWANN-Tagger. Analyses and experiments with the Universal Dependencies (UD) treebanks show that mWANN-Tagger has the potential to outperform state-of-the-art taggers given a better word representation. This thesis also aims to assess the advantages of bleaching over the traditional model through the framework of VC theory. The VC dimensions of both were calculated, establishing that an n-tuple classifier, whether plain WiSARD or with bleaching, that has N memories addressed by binary n-tuples has a VC dimension of exactly N(2^n − 1) + 1. A parallel was then drawn between the two models, from which it was deduced that the bleaching technique is an improvement to the n-tuple method that causes no loss to its learning capacity
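The n-tuple architecture and the bleaching threshold can be sketched compactly: each class holds N RAM nodes addressed by n-bit slices of the binary input; training increments the addressed counters, and classification counts how many RAMs reach the bleaching threshold (a threshold of 1 recovers plain WiSARD). A toy sketch, not mWANN-Tagger's implementation; the sequential bit-to-tuple mapping is a simplification of the usual random mapping:

```python
from collections import defaultdict

class WiSARD:
    """Minimal n-tuple (WiSARD) classifier with bleaching."""

    def __init__(self, input_len, n):
        assert input_len % n == 0
        self.n = n
        self.num_rams = input_len // n   # N RAM nodes per discriminator
        self.rams = {}                   # class label -> list of RAM dicts

    def _addresses(self, bits):
        # Split the binary input into consecutive n-bit addresses.
        return [tuple(bits[i:i + self.n])
                for i in range(0, len(bits), self.n)]

    def train(self, bits, label):
        rams = self.rams.setdefault(
            label, [defaultdict(int) for _ in range(self.num_rams)])
        for ram, addr in zip(rams, self._addresses(bits)):
            ram[addr] += 1               # bleaching keeps access counts

    def classify(self, bits, threshold=1):
        # Score = number of RAMs whose counter at this address reaches
        # the bleaching threshold; threshold=1 is plain WiSARD.
        addrs = self._addresses(bits)
        scores = {
            label: sum(ram[a] >= threshold for ram, a in zip(rams, addrs))
            for label, rams in self.rams.items()
        }
        return max(scores, key=scores.get)
```

Raising the threshold discards rarely reinforced addresses, which is exactly the sensitivity-curbing effect the bleaching technique was introduced for.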

    Fake news and propaganda: Trump’s democratic America and Hitler’s national socialist (Nazi) Germany

    This paper features an analysis of President Trump’s two State of the Union addresses, which are analysed by means of various data mining techniques, including sentiment analysis. The intention is to explore the contents and sentiments of the messages contained, the degree to which they differ, and their potential implications for the national mood and state of the economy. We also apply Zipf and Mandelbrot’s power law to assess the degree to which they differ from common language patterns. To provide a contrast and some parallel context, analyses are also undertaken of President Obama’s last State of the Union address and Hitler’s 1933 Berlin Proclamation. The structure of these four political addresses is remarkably similar. The three US Presidential speeches are more positive emotionally than is Hitler’s relatively shorter address, which is characterised by a prevalence of negative emotions. Hitler’s speech deviates the most from common speech, but all three appear to target their audiences by use of non-complex speech. However, it should be said that the economic circumstances in contemporary America and Germany in the 1930s are vastly different

    A joint text mining-rank size investigation of the rhetoric structures of the US Presidents’ speeches

    This work presents a text mining context and its use for a deep analysis of the messages delivered by politicians. Specifically, we deal with an expert systems-based exploration of the rhetoric dynamics of a large collection of US Presidents’ speeches, ranging from Washington to Trump. In particular, speeches are viewed as complex expert systems whose structures can be effectively analyzed through rank-size laws. The methodological contribution of the paper is twofold. First, we develop a text mining-based procedure for the construction of the dataset by using a web scraping routine on the Miller Center website – the repository site collecting the speeches. Second, we explore the implicit structure of the discourse data by implementing a rank-size procedure over the individual speeches, with the words of each speech ranked by frequency. The scientific significance of the proposed combination of text-mining and rank-size approaches lies in its flexibility and generality, which make it reproducible across a wide set of expert systems and text mining contexts. The usefulness of the proposed method and of the speeches analysis is demonstrated by the findings themselves. Indeed, in terms of impact, interesting conclusions of a social, political and linguistic nature can be drawn on how 45 United States Presidents, from April 30, 1789 till February 28, 2017, delivered political messages. The proposed analysis shows some remarkable regularities, not only inside a given speech, but also among different speeches. Moreover, under a purely methodological perspective, the presented contribution suggests possible ways of generating a linguistic decision-making algorithm
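The rank-size step described here reduces to sorting word frequencies in decreasing order and fitting a power law f(r) ~ C/r^α on log-log axes. A minimal sketch of that procedure, not the paper's code; the ordinary least-squares fit is one common choice:

```python
import math
from collections import Counter

def rank_size(text):
    """Rank-size sequence: word frequencies sorted in decreasing order."""
    freqs = Counter(text.lower().split())
    return sorted(freqs.values(), reverse=True)

def zipf_exponent(sizes):
    """Least-squares slope of log(size) vs log(rank); under a
    rank-size law f(r) ~ C / r^alpha the slope is -alpha."""
    xs = [math.log(r) for r in range(1, len(sizes) + 1)]
    ys = [math.log(s) for s in sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return -cov / var
```

Comparing the fitted α across speeches is what makes regularities between different Presidents' addresses visible.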

    Words by the tail : assessing lexical diversity in scholarly titles using frequency-rank distribution tail fits

    This research assesses the evolution of lexical diversity in scholarly titles using a new indicator based on Zipfian frequency-rank distribution tail fits. At the operational level, while both head and tail fits of Zipfian word distributions are more independent of corpus size than other lexical diversity indicators, the latter neatly outperforms the former in that regard. This benchmark-setting performance of Zipfian distribution tails proves extremely handy in distinguishing actual patterns in lexical diversity from the statistical noise generated by other indicators due to corpus size fluctuations. From an empirical perspective, analysis of Web of Science (WoS) article titles from 1975 to 2014 shows that the lexical concentration of scholarly titles in Natural Sciences & Engineering (NSE) and Social Sciences & Humanities (SSH) articles increases by a little less than 8% over the whole period. With the exception of the lexically concentrated Mathematics, Earth & Space, and Physics, NSE article titles all increased in lexical concentration, suggesting a probable convergence of concentration levels in the near future. As regards SSH disciplines, aggregation effects observed at the disciplinary group level suggest that, behind the stable concentration levels of SSH disciplines, a cross-disciplinary homogenization of the highest word frequency ranks may be at work. Overall, these trends suggest a progressive standardization of title wording in scientific article titles, as article titles get written using an increasingly restricted and cross-disciplinary set of words

    How many words are there?

    The commonsensical assumption that any language has only finitely many words is shown to be false by a combination of formal and empirical arguments. Zipf's Law and related formulas are investigated and a more complex model is offered