Automated words stability and languages phylogeny
The idea of measuring distance between languages seems to have its roots in
the work of the French explorer Dumont D'Urville (D'Urville 1832). He collected
comparative word lists of various languages during his voyages aboard the
Astrolabe from 1826 to 1829 and, in his work about the geographical division of
the Pacific, he proposed a method to measure the degree of relation among
languages. The method used by modern glottochronology, developed by Morris
Swadesh in the 1950s (Swadesh 1952), measures distances from the percentage of
shared cognates, which are words with a common historical origin. Recently, we
proposed a new automated method which uses the normalized Levenshtein distance
between words with the same meaning and averages it over the words contained in a
list. Another classical problem in glottochronology is the study of the
stability of words corresponding to different meanings. Words, in fact, evolve
because of lexical changes, borrowings and replacement at a rate which is not
the same for all of them. The speed of lexical evolution is different for
different meanings and it is probably related to the frequency of use of the
associated words (Pagel et al. 2007). This problem is tackled here by an
automated methodology only based on normalized Levenshtein distance.
Comment: XI International Conference "Cognitive Modeling in Linguistics-2009"
Constanca, Romania, September, 7-14, 200
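The averaging step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the Levenshtein distance is normalized, so dividing by the length of the longer word is an assumption, and the function names and the two-word toy lists are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Assumption: normalize by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance(words_a: dict, words_b: dict) -> float:
    """Average normalized distance over the meanings present in both lists."""
    shared = [meaning for meaning in words_a if meaning in words_b]
    if not shared:
        raise ValueError("the two word lists share no meanings")
    total = sum(normalized_levenshtein(words_a[m], words_b[m]) for m in shared)
    return total / len(shared)


# Toy word lists keyed by meaning (illustrative only, not the paper's data).
italian = {"water": "acqua", "night": "notte"}
spanish = {"water": "agua", "night": "noche"}
print(language_distance(italian, spanish))
```

Keying each list by meaning keeps the comparison restricted to words with the same meaning, as the abstract requires.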
Population Size and Rates of Language Change
Previous empirical studies of population size and language change have produced equivocal results. We therefore address the question with a new set of lexical data from nearly one-half of the world’s languages. We first show that relative population sizes of modern languages can be extrapolated to ancestral languages, albeit with diminishing accuracy, up to several thousand years into the past. We then test for an effect of population against the null hypothesis that the ultrametric inequality is satisfied by lexical distance among triples of related languages. The test shows mainly negligible effects of population, the exception being an apparently faster rate of change in the larger of two closely related variants. A possible explanation for the exception may be the influence on emerging standard (or cross-regional) variants from speakers who shift from different dialects to the standard. Our results strongly indicate that the sizes of speaker populations do not in and of themselves determine rates of language change. Comparison of this empirical finding with previously published computer simulations suggests that the most plausible model for language change is one in which changes propagate on a local level in a type of network in which the individuals have different degrees of connectivity.
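The ultrametric inequality used as the null hypothesis above requires that, for any triple of languages, the two largest pairwise distances are equal. A minimal sketch of that check is given below; it is not the authors' statistical test, and the function names, the tolerance and the distance values are hypothetical.

```python
def ultrametric_gap(d_ab: float, d_ac: float, d_bc: float) -> float:
    """Difference between the two largest distances in a triple.

    The gap is zero exactly when the triple satisfies the ultrametric
    inequality d(x, z) <= max(d(x, y), d(y, z)) for every ordering.
    """
    _, middle, largest = sorted((d_ab, d_ac, d_bc))
    return largest - middle


def is_ultrametric(d_ab: float, d_ac: float, d_bc: float,
                   tol: float = 1e-9) -> bool:
    """True when the gap is within a (hypothetical) tolerance."""
    return ultrametric_gap(d_ab, d_ac, d_bc) <= tol


# A and B are close relatives; C is the more distant language.
print(is_ultrametric(0.2, 0.7, 0.7))   # both are equally far from C
print(ultrametric_gap(0.2, 0.6, 0.7))  # B has drifted further from C than A
```

Under this reading, a consistently nonzero gap pointing towards the larger-population member of a closely related pair would correspond to the faster rate of change the abstract reports as the exception.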
Probing Multilingual BERT for Genetic and Typological Signals
We probe the layers in multilingual BERT (mBERT) for phylogenetic and
geographic language signals across 100 languages and compute language distances
based on the mBERT representations. We 1) employ the language distances to
infer and evaluate language trees, finding that they are close to the reference
family tree in terms of quartet tree distance, 2) perform distance matrix
regression analysis, finding that the language distances can be best explained
by phylogenetic and worst by structural factors and 3) present a novel measure
of diachronic meaning stability (based on cross-lingual
representation variability) which correlates significantly with published
ranked lists based on linguistic approaches. Our results contribute to the
nascent field of typological interpretability of cross-lingual text
representations.
Comment: COLING 202
Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs
Conversational participants tend to immediately and unconsciously adapt to
each other's language styles: a speaker will even adjust the number of articles
and other function words in their next utterance in response to the number in
their partner's immediately preceding utterance. This striking level of
coordination is thought to have arisen as a way to achieve social goals, such
as gaining approval or emphasizing difference in status. But has the adaptation
mechanism become so deeply embedded in the language-generation process as to
become a reflex? We argue that fictional dialogs offer a way to study this
question, since authors create the conversations but don't receive the social
benefits (rather, the imagined characters do). Indeed, we find significant
coordination across many families of function words in our large movie-script
corpus. We also report suggestive preliminary findings on the effects of gender
and other features; e.g., surprisingly, for articles, on average, characters
adapt more to females than to males.
Comment: data available at http://www.cs.cornell.edu/~cristian/movie
Local search: A guide for the information retrieval practitioner
There are a number of combinatorial optimisation problems in information retrieval in which the use of local search methods is worthwhile. The purpose of this paper is to show how local search can be used to solve some well-known tasks in information retrieval (IR), how previous research in the field is piecemeal, bereft of a structure and methodologically flawed, and to suggest more rigorous ways of applying local search methods to solve IR problems. We provide a query-based taxonomy for analysing the use of local search in IR tasks and an overview of issues such as fitness functions, statistical significance and test collections when conducting experiments on combinatorial optimisation problems. The paper gives a guide on the pitfalls and problems for IR practitioners who wish to use local search to solve their research issues, and gives practical advice on the use of such methods. The query-based taxonomy is a novel structure which can be used by the IR practitioner in order to examine the use of local search in IR.
Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter
Individual happiness is a fundamental societal metric. Normally measured
through self-report, happiness has often been indirectly characterized and
overshadowed by more readily quantifiable economic indicators such as gross
domestic product. Here, we examine expressions made on the online, global
microblog and social networking service Twitter, uncovering and explaining
temporal variations in happiness and information levels over timescales ranging
from hours to years. Our data set comprises over 46 billion words contained in
nearly 4.6 billion expressions posted over a 33 month span by over 63 million
unique users. In measuring happiness, we use a real-time, remote-sensing,
non-invasive, text-based approach: a kind of hedonometer. In building our
metric, made available with this paper, we conducted a survey to obtain
happiness evaluations of over 10,000 individual words, representing a tenfold
size improvement over similar existing word sets. Rather than being ad hoc, our
word list is chosen solely by frequency of usage and we show how a highly
robust metric can be constructed and defended.
Comment: 27 pages, 17 figures, 3 tables. Supplementary Information: 1 table,
52 figure
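A text-based happiness measure built from per-word evaluations can be sketched as a frequency-weighted average of word scores. The abstract does not spell out the formula, so the weighting below is an assumption, and the function name, the whitespace tokenization and the three toy scores are hypothetical rather than taken from the paper's word list.

```python
from collections import Counter
from typing import Optional


def average_happiness(text: str, scores: dict) -> Optional[float]:
    """Frequency-weighted mean score of the scored words in a text.

    Words without a score are ignored; returns None when no word is scored.
    """
    counts = Counter(w for w in text.lower().split() if w in scores)
    total = sum(counts.values())
    if total == 0:
        return None
    return sum(scores[w] * n for w, n in counts.items()) / total


# Hypothetical scores on an arbitrary scale (not the paper's survey values).
toy_scores = {"happy": 8.0, "sad": 2.0, "the": 5.0}
print(average_happiness("the happy happy sad", toy_scores))  # 5.75
```

Returning None for texts with no scored words avoids reporting a misleading neutral value when there is nothing to measure.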