38 research outputs found
Investigating Rumor News Using Agreement-Aware Search
Recent years have witnessed a widespread increase of rumor news generated by
humans and machines. Therefore, tools for investigating rumor news have become
an urgent necessity. One useful function of such tools is to show how a
specific topic or event is represented by presenting different points of view
from multiple sources.
In this paper, we propose Maester, a novel agreement-aware search framework
for investigating rumor news. Given an investigative question, Maester
retrieves articles related to the question, then assigns the top articles to
the agree, disagree, and discuss categories and displays them to users.
Splitting the results into these three categories gives the user a holistic
view of the investigative question. We build Maester based on the following
two key
observations: (1) relatedness can commonly be determined by keywords and
entities occurring in both questions and articles, and (2) the level of
agreement between the investigative question and the related news article can
often be decided by a few key sentences. Accordingly, we use gradient boosting
tree models with keyword/entity matching features for relatedness detection,
and leverage a recurrent neural network to infer the level of agreement. Our
experiments on the Fake News Challenge (FNC) dataset demonstrate up to an order
of magnitude improvement of Maester over the original FNC winning solution for
agreement-aware search.
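Observation (1), that relatedness can be determined from keywords and entities shared between question and article, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function and feature names are hypothetical, the stopword list is a toy stand-in for real keyword/entity extraction, and in the actual system features like these would feed a gradient boosting tree classifier rather than being used directly.

```python
import re

# Toy stopword list; real keyword/entity extraction is far richer.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "are", "to", "and", "by"}

def keywords(text):
    """Lowercase, tokenize, and drop stopwords -- a crude stand-in for
    the keyword/entity extraction the framework relies on."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def overlap_features(question, article):
    """Keyword-matching features of the kind a gradient boosting tree
    model could consume for relatedness detection."""
    q, a = keywords(question), keywords(article)
    shared = q & a
    return {
        "n_shared": len(shared),
        "jaccard": len(shared) / len(q | a) if q | a else 0.0,
        "question_coverage": len(shared) / len(q) if q else 0.0,
    }

feats = overlap_features(
    "Did the mayor cancel the downtown festival?",
    "The mayor announced the downtown festival will go ahead as planned.",
)
```

High overlap values would mark the article as related, after which the agreement model decides whether it agrees, disagrees, or merely discusses.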
TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World
To facilitate the research on intelligent and human-like chatbots with
multi-modal context, we introduce a new video-based multi-modal dialogue
dataset, called TikTalk. We collect 38K videos from a popular video-sharing
platform, along with 367K conversations posted by users beneath them. Users
engage in spontaneous conversations based on their multi-modal experiences from
watching videos, which helps recreate real-world chitchat context. Compared to
previous multi-modal dialogue datasets, the richer context types in TikTalk
lead to more diverse conversations, but also increase the difficulty in
capturing human interests from intricate multi-modal information to generate
personalized responses. Moreover, external knowledge is more frequently evoked
in our dataset. These facts reveal new challenges for multi-modal dialogue
models. We quantitatively demonstrate the characteristics of TikTalk, propose a
video-based multi-modal chitchat task, and evaluate several dialogue baselines.
Experimental results indicate that the models incorporating large language
models (LLMs) can generate more diverse responses, while the model utilizing
knowledge graphs to introduce external knowledge performs the best overall.
Furthermore, no existing model can solve all the above challenges well. There
is still much room for future improvement, even for LLMs with visual
extensions. Our dataset is available at
\url{https://ruc-aimind.github.io/projects/TikTalk/}.
Comment: Accepted to ACM Multimedia 202
An Empirical Study of Offensive Language in Online Interactions
In the past decade, usage of social media platforms has increased significantly. People use these platforms to connect with friends and family and to share information, news, and opinions. Platforms such as Facebook and Twitter are often used to propagate offensive and hateful content online. The open nature and anonymity of the internet fuel aggressive and inflamed conversations. Companies and federal institutions are striving to make social media cleaner, more welcoming, and unbiased. In this study, we first explore the underlying topics in popular offensive language datasets using statistical and neural topic modeling. The current state-of-the-art models for aggression detection present only a toxicity score based on the entire post. Content moderators often have to deal with lengthy texts without any word-level indicators. We propose a neural transformer approach for detecting the tokens that make a particular post aggressive. The pre-trained BERT model has achieved state-of-the-art results in various natural language processing tasks. However, the model is trained on general-purpose corpora and lacks aggressive social media linguistic features. We propose fBERT, a BERT model retrained on over a million offensive tweets from the SOLID dataset. We demonstrate the effectiveness and portability of fBERT over BERT in various shared offensive language detection tasks. We further propose a new multi-task aggression detection (MAD) framework for post- and token-level aggression detection using neural transformers. The experiments confirm the effectiveness of the multi-task learning model over individual models, particularly when the amount of training data is limited.
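The word-level indicators the study argues moderators need can be illustrated with a short sketch. This is hypothetical code, not the MAD framework: the per-token scores are hard-coded stand-ins for what a transformer such as fBERT would output, and `flag_aggressive_tokens` is an invented helper name.

```python
def flag_aggressive_tokens(tokens, scores, threshold=0.5):
    """Given per-token aggression scores (here assumed to come from a
    token-level transformer classifier), return the tokens at or above
    the threshold -- the word-level highlights a moderator would see
    instead of a single post-level toxicity score."""
    return [tok for tok, s in zip(tokens, scores) if s >= threshold]

tokens = ["you", "are", "a", "complete", "idiot"]
scores = [0.02, 0.01, 0.05, 0.40, 0.97]  # hypothetical model outputs
flagged = flag_aggressive_tokens(tokens, scores)
```

Lowering the threshold trades precision for recall in what gets highlighted, a knob a moderation interface could expose.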
Grammatical Error Correction: A Survey of the State of the Art
Grammatical Error Correction (GEC) is the task of automatically detecting and
correcting errors in text. The task not only includes the correction of
grammatical errors, such as missing prepositions and mismatched subject-verb
agreement, but also orthographic and semantic errors, such as misspellings and
word choice errors respectively. The field has seen significant progress in the
last decade, motivated in part by a series of five shared tasks, which drove
the development of rule-based methods, statistical classifiers, statistical
machine translation, and finally neural machine translation systems which
represent the current dominant state of the art. In this survey paper, we
condense the field into a single article and first outline some of the
linguistic challenges of the task, introduce the most popular datasets that are
available to researchers (for both English and other languages), and summarise
the various methods and techniques that have been developed with a particular
focus on artificial error generation. We next describe the many different
approaches to evaluation as well as concerns surrounding metric reliability,
especially in relation to subjective human judgements, before concluding with
an overview of recent progress and suggestions for future work and remaining
challenges. We hope that this survey will serve as a comprehensive resource
for researchers who are new to the field or who want to be kept apprised of
recent developments.
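Artificial error generation, one focus of the survey, can be illustrated with a toy rule: corrupt a correct sentence to produce a synthetic (erroneous, corrected) training pair. This sketch is an assumption-laden illustration rather than any surveyed system: it handles only one error type (preposition deletion), and the function name and preposition list are made up.

```python
import random

# Small preposition list; real systems cover broader error taxonomies.
PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with"}

def inject_preposition_drop(sentence, rng):
    """Create a (corrupted, correct) pair by deleting one preposition,
    mimicking the 'missing preposition' error class for synthetic
    GEC training data."""
    tokens = sentence.split()
    candidates = [i for i, t in enumerate(tokens) if t.lower() in PREPOSITIONS]
    if not candidates:
        return sentence, sentence  # nothing to corrupt
    i = rng.choice(candidates)
    return " ".join(tokens[:i] + tokens[i + 1:]), sentence

bad, good = inject_preposition_drop(
    "She waited at the station for an hour", random.Random(0)
)
```

Pairs like `(bad, good)` can then augment scarce human-annotated corpora when training correction models.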
The Web of False Information: Rumors, Fake News, Hoaxes, Clickbait, and Various Other Shenanigans
A new era of Information Warfare has arrived. Various actors, including
state-sponsored ones, are weaponizing information on Online Social Networks to
run false information campaigns with targeted manipulation of public opinion on
specific topics. These false information campaigns can have dire consequences
for the public, altering their opinions and actions, especially with respect to
critical world events like major elections. Evidently, the problem of false
information on the Web is a crucial one, and needs increased public awareness,
as well as immediate attention from law enforcement agencies, public
institutions, and in particular, the research community. In this paper, we make
a step in this direction by providing a typology of the Web's false information
ecosystem, comprising various types of false information, actors, and their
motives. We report a comprehensive overview of existing research on the false
information ecosystem by identifying several lines of work: 1) how the public
perceives false information; 2) understanding the propagation of false
information; 3) detecting and containing false information on the Web; and 4)
false information on the political stage. In this work, we pay particular
attention to political false information because: 1) it can have dire
consequences for the community (e.g., when election results are swayed), and
2) previous work shows that this type of false information propagates faster
and further than other types of false information. Finally, for each of these
lines
of work, we report several future research directions that can help us better
understand and mitigate the emerging problem of false information dissemination
on the Web.
Sources of Noise in Dialogue and How to Deal with Them
Training dialogue systems often entails dealing with noisy training examples
and unexpected user inputs. Despite their prevalence, there is currently no
accurate survey of dialogue noise, nor a clear sense of the impact of each
noise type on task performance. This paper addresses this gap by first
constructing a taxonomy of noise encountered by dialogue systems. In addition,
we run a series of experiments to show how different models behave when
subjected to varying levels and types of noise. Our results reveal
that models are quite robust to label errors commonly tackled by existing
denoising algorithms, but that performance suffers from dialogue-specific
noise. Driven by these observations, we design a data cleaning algorithm
specialized for conversational settings and apply it as a proof-of-concept for
targeted dialogue denoising.
Comment: 23 pages, 6 figures, 5 tables. Accepted at SIGDIAL 202
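The idea of targeted, dialogue-specific cleaning can be sketched as follows. This is a hypothetical illustration, not the paper's algorithm: it drops only two easily detected noise types (empty turns and verbatim consecutive repeats), and `clean_dialogue` is an invented name.

```python
def clean_dialogue(turns):
    """A minimal cleaning pass in the spirit of targeted dialogue
    denoising: remove empty turns and exact consecutive duplicates,
    two kinds of conversation-specific noise that post-hoc label
    denoisers would not catch."""
    cleaned = []
    for speaker, utterance in turns:
        utterance = utterance.strip()
        if not utterance:
            continue  # empty turn: logging or annotation noise
        if cleaned and cleaned[-1] == (speaker, utterance):
            continue  # verbatim repeat of the previous turn
        cleaned.append((speaker, utterance))
    return cleaned

noisy = [("user", "hi"), ("user", "hi"), ("bot", "  "), ("bot", "hello!")]
cleaned = clean_dialogue(noisy)
```

A real pipeline would add rules (or learned filters) per noise type in the taxonomy, applying each only where the experiments show it hurts task performance.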