A Kernel Independence Test for Geographical Language Variation
Quantifying the degree of spatial dependence for linguistic variables is a
key task for analyzing dialectal variation. However, existing approaches have
important drawbacks. First, they are based on parametric models of dependence,
which limits their power in cases where the underlying parametric assumptions
are violated. Second, they are not applicable to all types of linguistic data:
some approaches apply only to frequencies, others to boolean indicators of
whether a linguistic variable is present. We present a new method for measuring
geographical language variation, which solves both of these problems. Our
approach builds on Reproducing Kernel Hilbert space (RKHS) representations for
nonparametric statistics, and takes the form of a test statistic that is
computed from pairs of individual geotagged observations without aggregation
into predefined geographical bins. We compare this test with prior work using
synthetic data as well as a diverse set of real datasets: a corpus of Dutch
tweets, a Dutch syntactic atlas, and a dataset of letters to the editor in
North American newspapers. Our proposed test is shown to support robust
inferences across a broad range of scenarios and types of data.
Comment: In submission. 26 page
Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence (WELD) as a Quantitative Measure of Language Distance
We introduce a new measure of distance between languages based on word
embedding, called word embedding language divergence (WELD). WELD is defined as
the divergence between the unified similarity distributions of words between languages.
Using such a measure, we perform language comparison for fifty natural
languages and twelve genetic languages. Our natural language dataset is a
collection of sentence-aligned parallel corpora from bible translations for
fifty languages spanning a variety of language families. Although we use
parallel corpora, which guarantee the same content in all languages,
interestingly, in many cases languages within the same family cluster together.
In addition to natural languages, we perform language comparison for the coding
regions in the genomes of 12 different organisms (4 plants, 6 animals, and 2
human subjects). Our result confirms a significant high-level difference in the
genetic language model of humans/animals versus plants. The proposed method is
a step toward defining a quantitative measure of similarity between languages,
with applications in language classification, genre identification, dialect
identification, and evaluation of translations.
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Peer reviewed
Language classification from bilingual word embedding graphs
We study the role of the second language in bilingual word embeddings in
monolingual semantic evaluation tasks. We find strongly and weakly positive
correlations between downstream task performance and second language
similarity to the target language. Additionally, we show how bilingual word
embeddings can be employed for the task of semantic language classification and
that joint semantic spaces vary in meaningful ways across second languages. Our
results support the hypothesis that semantic language similarity is influenced
by both structural similarity and geography/contact.
Comment: To be published at Coling 201
Holistic corpus-based dialectology
This paper is concerned with sketching future directions for corpus-based dialectology. We advocate a holistic approach to the study of geographically conditioned linguistic variability, and we present a suitable methodology, 'corpus-based dialectometry', in exactly this spirit. Specifically, we argue that in order to live up to the potential of the corpus-based method, practitioners need to (i) abandon their exclusive focus on individual linguistic features in favor of the study of feature aggregates, (ii) draw on computationally advanced multivariate analysis techniques (such as multidimensional scaling, cluster analysis, and principal component analysis), and (iii) aid interpretation of empirical results by marshalling state-of-the-art data visualization techniques. To exemplify this line of analysis, we present a case study which explores joint frequency variability of 57 morphosyntax features in 34 dialects all over Great Britain.