Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource
Word embeddings have recently attracted growing interest as a result
of strong performance gains on a variety of tasks. However, most of this
research also underlined the importance of benchmark datasets, and the
difficulty of constructing these for a variety of language-specific tasks.
Still, many of the datasets used in these tasks could prove to be fruitful
linguistic resources, allowing for unique insights into language use and
variability. In this paper we demonstrate the performance of multiple types of
embeddings, created with both count-based and prediction-based architectures on a
variety of corpora, in two language-specific tasks: relation evaluation and
dialect identification. For the latter, we compare unsupervised methods with a
traditional, hand-crafted dictionary. With this research, we provide the
embeddings themselves and the relation evaluation task benchmark for use in
further research, and we demonstrate how the benchmarked embeddings prove a useful
unsupervised linguistic resource, effectively used in a downstream task.
Comment: in LREC 201
MCE 2018: The 1st Multi-target Speaker Detection and Identification Challenge Evaluation
The Multi-target Challenge aims to assess how well current speech technology
is able to determine whether or not a recorded utterance was spoken by one of a
large number of blacklisted speakers. It is a form of multi-target speaker
detection based on real-world telephone conversations. Data recordings are
generated from call center customer-agent conversations. The task is to measure
how accurately one can detect 1) whether a test recording is spoken by a
blacklisted speaker, and 2) which specific blacklisted speaker was talking.
This paper outlines the challenge and provides its baselines, results, and
discussions.
Comment: http://mce.csail.mit.edu . arXiv admin note: text overlap with
arXiv:1807.0666
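The two-part task described in this abstract (is the speaker on the blacklist, and if so, which one) can be illustrated with a minimal sketch. It assumes that some upstream system has already produced one similarity score per blacklisted speaker for a test recording. The function name `detect_blacklisted`, the score dictionary, and the threshold are hypothetical and are not taken from the challenge baselines.

```python
def detect_blacklisted(scores, threshold):
    """Decide whether a test recording matches any blacklisted speaker.

    scores: dict mapping a blacklisted speaker ID to a similarity score
            for this recording (hypothetical input; higher means closer).
    threshold: minimum score for declaring a blacklist match (assumed).

    Returns (is_blacklisted, speaker_id); speaker_id is None on rejection.
    """
    # Part 2 of the task: pick the best-matching blacklisted speaker.
    best_speaker = max(scores, key=scores.get)
    # Part 1 of the task: accept only if the best score clears the threshold.
    if scores[best_speaker] >= threshold:
        return True, best_speaker
    return False, None


# Usage with made-up scores for two blacklisted speakers.
example_scores = {"spk1": 0.2, "spk2": 0.9}
decision = detect_blacklisted(example_scores, threshold=0.5)
```

Taking the maximum score first and thresholding second is only one simple decision rule; other ways of combining the per-speaker scores are possible.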
Transductive Learning with String Kernels for Cross-Domain Text Classification
For many text classification tasks, there is a major problem posed by the
lack of labeled data in a target domain. Although classifiers for a target
domain can be trained on labeled text data from a related source domain, the
accuracy of such classifiers is usually lower in the cross-domain setting.
Recently, string kernels have obtained state-of-the-art results in various text
classification tasks such as native language identification or automatic essay
scoring. Moreover, classifiers based on string kernels have been found to be
robust to the distribution gap between different domains. In this paper, we
formally describe an algorithm composed of two simple yet effective
transductive learning approaches to further improve the results of string
kernels in cross-domain settings. By adapting string kernels to the test set
without using the ground-truth test labels, we report significantly better
accuracy rates in cross-domain English polarity classification.
Comment: Accepted at ICONIP 2018. arXiv admin note: substantial text overlap
with arXiv:1808.0840
Dialectometric analysis of language variation in Twitter
In the last few years, microblogging platforms such as Twitter have given
rise to a deluge of textual data that can be used for the analysis of informal
communication between millions of individuals. In this work, we propose an
information-theoretic approach to geographic language variation using a corpus
based on Twitter. We test our models with tens of concepts and their associated
keywords detected in Spanish tweets geolocated in Spain. We employ
dialectometric measures (cosine similarity and Jensen-Shannon divergence) to
quantify the linguistic distance on the lexical level between cells created in
a uniform grid over the map. This can be done for a single concept or in the
general case taking into account an average of the considered variants. The
latter permits an analysis of the dialects that naturally emerge from the data.
Interestingly, our results reveal the existence of two dialect macrovarieties.
The first group includes region-specific speech used in small towns and
rural areas whereas the second cluster encompasses cities that tend to use a
more uniform variety. Since the results obtained with the two different metrics
qualitatively agree, our work suggests that social media corpora can be
efficiently used for dialectometric analyses.
Comment: 10 pages, 7 figures, 1 table. Accepted to VarDial 201
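The two measures named in this abstract, cosine similarity and Jensen-Shannon divergence, can be sketched for a pair of grid cells as follows. This is a minimal illustration, assuming each cell is summarized by the relative frequencies of the keyword variants for one concept. The counts are invented, and the abstract does not say how the authors build these vectors or how they turn similarity into a distance.

```python
import math


def normalize(counts):
    """Turn raw keyword counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def cosine_similarity(p, q):
    """Cosine of the angle between two frequency vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence; with log base 2 it lies between 0 and 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical counts of three variants of one concept in two grid cells.
cell_a = normalize([30, 10, 0])
cell_b = normalize([5, 5, 30])

similarity = cosine_similarity(cell_a, cell_b)
divergence = jensen_shannon_divergence(cell_a, cell_b)
```

Because the mixture `m` is nonzero wherever either input is nonzero, the Jensen-Shannon divergence stays finite even when a variant is absent from one cell, which is convenient for sparse counts.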
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.
Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201