An Empirical Study of Topic Transition in Dialogue
Transitioning between topics is a natural component of human-human dialog.
Although topic transition has been studied in dialogue for decades, only a
handful of corpus-based studies have investigated the
subtleties of topic transitions. Thus, this study annotates 215 conversations
from the Switchboard corpus and investigates how variables such as length,
number of topic transitions, the share of topic transitions by participant, and
turns per topic are related. This work presents an empirical study on topic
transition in the Switchboard corpus, followed by modelling topic transition with a
precision of 83% on the in-domain (id) test set and 82% on 10 out-of-domain (ood)
test set. It is envisioned that this work will help in emulating human-human-like
topic transition in open-domain dialog systems.
Comment: 5 pages, 4 figures, 3 table
Topic Segmentation in the Wild: Towards Segmentation of Semi-structured & Unstructured Chats
Breaking down a document or a conversation into multiple contiguous segments
based on its semantic structure is an important and challenging problem in NLP,
which can assist many downstream tasks. However, current works on topic
segmentation often focus on segmentation of structured texts. In this paper, we
comprehensively analyze the generalization capabilities of state-of-the-art
topic segmentation models on unstructured texts. We find that: (a) Current
strategies of pre-training on a large corpus of structured text such as
Wiki-727K do not help in transferability to unstructured texts. (b) Training
from scratch with only a relatively small-sized dataset of the target
unstructured domain improves the segmentation results by a significant margin.
Comment: NeurIPS 2022 : ENLS
From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions
Text segmentation is a fundamental task in natural language processing, where
documents are split into contiguous sections. However, prior research in this
area has been constrained by limited datasets, which are either small in scale,
synthesized, or only contain well-structured documents. In this paper, we
address these limitations by introducing a novel benchmark YTSeg focusing on
spoken content that is inherently more unstructured and both topically and
structurally diverse. As part of this work, we introduce an efficient
hierarchical segmentation model, MiniSeg, that outperforms state-of-the-art
baselines. Lastly, we expand the notion of text segmentation to a more
practical "smart chaptering" task that involves the segmentation of
unstructured content, the generation of meaningful segment titles, and a
potential real-time application of the models.
Comment: Accepted to EACL 202
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.
Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201