Small vs. Big Data in Language Research: Challenges and Opportunities
<p>Mobile communication tools and platforms provide various opportunities for users to interact over social media. With recent developments in computational research and machine learning, it has become possible to analyze large amounts of language-related data automatically and quickly. However, these tools are not readily available for data in all languages, and there are also challenges in handling social media data. Even when these issues are resolved, asking the right research question of the right set and amount of data becomes crucially important. Both qualitative and quantitative methods have attracted respected researchers in language-related areas of research. When tackling similar research problems, there is a need for both top-down and bottom-up data-based approaches to reach a solution. Sometimes this solution is hidden in an in-depth analysis of a small data set, and sometimes it is revealed only through analyzing and experimenting with large amounts of data. In most cases, however, the findings from small data sets need to be linked to understand the bigger picture revealed through patterns in large sets. Having worked with both small and large language-related data in various forms, I will compare the pros and cons of working with both types of data across media and contexts and share my own experiences, with highlights and lowlights.</p>
Analyzing language change in syntax and multiword expressions: A case study of Turkish spoken in the Netherlands
Detecting machine-translated subtitles in large parallel corpora
Parallel corpora extracted from online repositories of movie and TV subtitles are employed in a wide range of NLP applications, from language modelling to machine translation and dialogue systems. However, the subtitles uploaded to such repositories exhibit varying levels of quality. A particularly difficult problem stems from the fact that a substantial number of these subtitles are not written by human subtitlers but are simply generated through the use of online translation engines. This paper investigates whether these machine-generated subtitles can be detected automatically using a combination of linguistic and extra-linguistic features. We show that a feedforward neural network trained on a small dataset of subtitles can detect machine-generated subtitles with an F1-score of 0.64. Furthermore, applying this detection model to an unlabelled sample of subtitles allows us to provide a statistical estimate of the proportion of subtitles in the full corpus that are machine-translated (or are at least of very low quality).
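The reported F1-score is the harmonic mean of precision and recall. As a hedged illustration of how a score of 0.64 can arise (the confusion counts below are invented for the example, not taken from the paper):

```python
# Illustrative only: computing an F1-score for a binary detector of
# machine-generated subtitles from a toy confusion matrix.

def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 64 machine-generated subtitles correctly flagged,
# 36 missed (false negatives), 36 human subtitles wrongly flagged.
print(round(f1_score(tp=64, fp=36, fn=36), 2))  # 0.64
```

With equal precision and recall (both 0.64 here), the F1-score coincides with that shared value.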
How "open" are the conversations with open-domain chatbots? A proposal for Speech Event based evaluation
Open-domain chatbots are supposed to converse freely with humans without
being restricted to a topic, task or domain. However, the boundaries and/or
contents of open-domain conversations are not clear. To clarify the boundaries
of "openness", we conduct two studies: First, we classify the types of "speech
events" encountered in a chatbot evaluation data set (i.e., Meena by Google)
and find that these conversations mainly cover the "small talk" category and
exclude the other speech event categories encountered in real life human-human
communication. Second, we conduct a small-scale pilot study to generate online
conversations covering a wider range of speech event categories between two
humans vs. a human and a state-of-the-art chatbot (i.e., Blender by Facebook).
A human evaluation of these generated conversations indicates a preference for
human-human conversations, since the human-chatbot conversations lack coherence
in most speech event categories. Based on these results, we suggest (a) using
the term "small talk" instead of "open-domain" for the current chatbots which
are not that "open" in terms of conversational abilities yet, and (b) revising
the evaluation methods to test the chatbot conversations against other speech
events.
Predicting dialect variation in immigrant contexts using light verb constructions
Languages spoken by immigrants change due to contact with the local languages. Capturing these changes is problematic for current language technologies, which are typically developed for speakers of the standard dialect only. Even when dialectal variants are available for such technologies, we still need to predict which dialect is being used. In this study, we distinguish between the immigrant and the standard dialect of Turkish by focusing on Light Verb Constructions. We experiment with a number of grammatical and contextual features, achieving over 84% accuracy (56% baseline).
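As a hedged sketch of the general idea (not the paper's actual features or classifier; the noun/light-verb lexicon below is invented for illustration), a minimal rule-based baseline could flag a sentence as dialectal when a light verb appears with a noun that the standard dialect pairs with a different light verb:

```python
# Toy sketch: predict dialect from a single noun + light-verb pairing.
# The "standard" lexicon here is a tiny invented example, not real data.

STANDARD_LVC = {
    "telefon": "etmek",     # standard pairing: telefon etmek
    "yardim": "etmek",      # standard pairing: yardim etmek
    "alisveris": "yapmak",  # standard pairing: alisveris yapmak
}

def predict_dialect(noun, light_verb):
    """Return 'standard' if the pairing matches the standard lexicon,
    'immigrant' if it deviates; unseen nouns default to 'standard'."""
    expected = STANDARD_LVC.get(noun)
    if expected is None or expected == light_verb:
        return "standard"
    return "immigrant"

print(predict_dialect("telefon", "yapmak"))  # immigrant
print(predict_dialect("telefon", "etmek"))   # standard
```

A real system would of course combine many such grammatical and contextual features in a trained classifier rather than a single lookup.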
The Open-domain Paradox for Chatbots: Common Ground as the Basis for Human-like Dialogue
There is a surge in interest in the development of open-domain chatbots,
driven by the recent advancements of large language models. The "openness" of
the dialogue is expected to be maximized by providing minimal information to
the users about the common ground they can expect, including the presumed joint
activity. However, evidence suggests that the effect is the opposite. Asking
users to "just chat about anything" results in a very narrow form of dialogue,
which we refer to as the "open-domain paradox". In this position paper, we
explain this paradox through the theory of common ground as the basis for
human-like communication. Furthermore, we question the assumptions behind
open-domain chatbots and identify paths forward for enabling common ground in
human-computer dialogue.
Comment: Accepted at SIGDIAL 202
Investigating Reproducibility at Interspeech Conferences: A Longitudinal and Comparative Perspective
Reproducibility is a key aspect for scientific advancement across
disciplines, and reducing barriers for open science is a focus area for the
theme of Interspeech 2023. Availability of source code is one of the indicators
that facilitates reproducibility. However, less is known about the rates of
reproducibility at Interspeech conferences in comparison to other conferences
in the field. In order to fill this gap, we have surveyed 27,717 papers at
seven conferences across speech and language processing disciplines. We find
that, despite accepting a similar number of papers to the other conferences,
Interspeech has up to 40% lower source-code availability. In addition to
reporting the difficulties we encountered during our research, we also
provide recommendations and possible directions for increasing reproducibility
in future studies.
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation
Multilingualism is widespread around the world and code-switching (CSW) is a
common practice among different language pairs/tuples across locations and
regions. However, there is still not much progress in building successful CSW
systems, despite the recent advances in Massive Multilingual Language Models
(MMLMs). We investigate the reasons behind this setback through a critical
study about the existing CSW data sets (68) across language pairs in terms of
the collection and preparation (e.g. transcription and annotation) stages. This
in-depth analysis reveals that (a) most CSW data involves English, ignoring
other language pairs/tuples, and (b) there are flaws in terms of
representativeness in the data collection and preparation stages due to
ignoring location-based, socio-demographic and register variation in CSW. In
addition, a lack of clarity about the data selection and filtering stages
obscures the representativeness of CSW data sets. We conclude by providing a
short checklist to improve representativeness for forthcoming studies
involving CSW data collection and preparation.
Comment: Accepted for EMNLP'23 Findings (to appear in the EMNLP'23 Proceedings)
Modeling the use of graffiti style features to signal social relations within a multi-domain learning paradigm
In this paper, we present a series of experiments in which we analyze the usage of graffiti style features for signaling personal gang identification in a large, online street gangs forum, with an accuracy as high as 83% at the gang alliance level and 72% for the specific gang. We then build on that result in predicting how members of different gangs signal the relationship between their gangs within threads where they are interacting with one another, with a predictive accuracy as high as 66% at this thread composition prediction task. Our work demonstrates how graffiti style features signal social identity both in terms of personal group affiliation and between-group alliances and oppositions. When we predict thread composition by modeling identity and relationship simultaneously using a multi-domain learning framework paired with a rich feature representation, we achieve significantly higher predictive accuracy than state-of-the-art baselines using either in isolation.