Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.
Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Argumentation Mining in User-Generated Web Discourse
The goal of argumentation mining, an evolving research field in computational
linguistics, is to design methods capable of analyzing people's argumentation.
In this article, we go beyond the state of the art in several ways. (i) We deal
with actual Web data and take up the challenges given by the variety of
registers, multiple domains, and unrestricted noisy user-generated Web
discourse. (ii) We bridge the gap between normative argumentation theories and
argumentation phenomena encountered in actual data by adapting an argumentation
model tested in an extensive annotation study. (iii) We create a new gold
standard corpus (90k tokens in 340 documents) and experiment with several
machine learning methods to identify argument components. We offer the data,
source code, and annotation guidelines to the community under free licenses.
Our findings show that argumentation mining in user-generated Web discourse is
a feasible but challenging task.
Comment: Cite as: Habernal, I. & Gurevych, I. (2017). Argumentation Mining in
User-Generated Web Discourse. Computational Linguistics 43(1), pp. 125-17
A systematic literature review on Wikidata
To review the current status of research on Wikidata and, in particular, of articles that either describe applications of Wikidata or provide empirical evidence, in order to uncover the topics of interest, the fields that are benefiting from its applications, and which researchers and institutions are leading the work.
Ontology: A Linked Data Hub for Mathematics
In this paper, we present an ontology of mathematical knowledge concepts that
covers a wide range of the fields of mathematics and introduces a balanced
representation between comprehensive and sensible models. We demonstrate the
applications of this representation in information extraction, semantic search,
and education. We argue that the ontology can be a core of future integration
of math-aware data sets in the Web of Data and, therefore, provide mappings
onto relevant datasets, such as DBpedia and ScienceWISE.
Comment: 15 pages, 6 images, 1 table, Knowledge Engineering and the Semantic
Web - 5th International Conferenc
VEWS: A Wikipedia Vandal Early Warning System
We study the problem of detecting vandals on Wikipedia before any human or
known vandalism detection system flags them as potential vandals, so that
such users can be presented early to Wikipedia administrators. We leverage
multiple classical ML approaches, but develop 3 novel sets of features. Our
Wikipedia Vandal Behavior (WVB) approach uses a novel set of user editing
patterns as features to classify some users as vandals. Our Wikipedia
Transition Probability Matrix (WTPM) approach uses a set of features derived
from a transition probability matrix and then reduces it via a neural net
auto-encoder to classify some users as vandals. The VEWS approach merges the
previous two approaches. Without using any information (e.g. reverts) provided
by other users, these algorithms each have over 85% classification accuracy.
Moreover, when temporal recency is considered, accuracy goes to almost 90%. We
carry out detailed experiments on a new data set we have created consisting of
about 33K Wikipedia users (including both a black list and a white list of
editors) and containing 770K edits. We describe specific behaviors that
distinguish between vandals and non-vandals. We show that VEWS beats ClueBot NG
and STiki, the best known algorithms today for vandalism detection. Moreover,
VEWS detects far more vandals than ClueBot NG and on average, detects them 2.39
edits before ClueBot NG when both detect the vandal. However, we show that the
combination of VEWS and ClueBot NG can give a fully automated vandal early
warning system with even higher accuracy.
Comment: To appear in Proceedings of the 21st ACM SIGKDD Conference of
Knowledge Discovery and Data Mining (KDD 2015
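The WTPM idea mentioned in the abstract above, deriving features from a transition probability matrix over a user's edits, can be illustrated with a minimal sketch. The state labels (`new_page`, `same_page`, `other_page`), the function names, and the sample edit sequence are hypothetical and chosen only for illustration; they are not the feature set used by the VEWS authors, and the auto-encoder reduction step is omitted.

```python
from collections import Counter

def transition_matrix(actions, states):
    """Row-normalised first-order transition probabilities between consecutive actions."""
    # Count each (previous action, next action) pair in the sequence.
    counts = Counter(zip(actions, actions[1:]))
    matrix = []
    for src in states:
        row_total = sum(counts[(src, dst)] for dst in states)
        # Rows for states that never occur as a source stay all-zero.
        row = [counts[(src, dst)] / row_total if row_total else 0.0
               for dst in states]
        matrix.append(row)
    return matrix

def flatten(matrix):
    """Turn the matrix into a flat feature vector for a classifier."""
    return [p for row in matrix for p in row]

# Hypothetical edit-type labels for one user's consecutive edits.
states = ["new_page", "same_page", "other_page"]
edits = ["new_page", "same_page", "same_page", "other_page", "same_page"]

features = flatten(transition_matrix(edits, states))
print(features)
```

Each user yields one fixed-length vector of this kind, which could then be passed to any standard classifier or to a dimensionality-reduction step.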
Automatic and Human-AI Interactive Text Generation
In this tutorial, we focus on text-to-text generation, a class of natural
language generation (NLG) tasks that take a piece of text as input and then
generate a revision that is improved according to some specific criteria
(e.g., readability or linguistic styles), while largely retaining the original
meaning and the length of the text. This includes many useful applications,
such as text simplification, paraphrase generation, style transfer, etc. In
contrast to text summarization and open-ended text completion (e.g., story),
the text-to-text generation tasks we discuss in this tutorial are more
constrained in terms of semantic consistency and targeted language styles. This
level of control makes these tasks ideal testbeds for studying the ability of
models to generate text that is both semantically adequate and stylistically
appropriate. Moreover, these tasks are interesting from a technical standpoint,
as they require complex combinations of lexical and syntactical
transformations, stylistic control, and adherence to factual knowledge -- all
at once. With a special focus on text simplification and revision, this
tutorial aims to provide an overview of the state-of-the-art natural language
generation research from four major aspects -- Data, Models, Human-AI
Collaboration, and Evaluation -- and to discuss and showcase a few significant
and recent advances: (1) the use of non-retrogressive approaches; (2) the shift
from fine-tuning to prompting with large language models; (3) the development
of new learnable metrics and fine-grained human evaluation frameworks; (4) a
growing body of studies and datasets on non-English languages; (5) the rise of
HCI+NLP+Accessibility interdisciplinary research to create real-world writing
assistant systems.
Comment: To appear at ACL 2024, Tutoria