850 research outputs found
Understanding Chat Messages for Sticker Recommendation in Messaging Apps
Stickers are popularly used in messaging apps such as Hike to visually
express a nuanced range of thoughts and utterances to convey exaggerated
emotions. However, discovering the right sticker from a large and ever
expanding pool of stickers while chatting can be cumbersome. In this paper, we
describe a system for recommending stickers in real time as the user is typing
based on the context of the conversation. We decompose the sticker
recommendation (SR) problem into two steps. First, we predict the message that
the user is likely to send in the chat. Second, we substitute the predicted
message with an appropriate sticker. Majority of Hike's messages are in the
form of text which is transliterated from users' native language to the Roman
script. This leads to numerous orthographic variations of the same message and
makes accurate message prediction challenging. To address this issue, we learn
dense representations of chat messages employing character level convolution
network in an unsupervised manner. We use them to cluster the messages that
have the same meaning. In the subsequent steps, we predict the message cluster
instead of the message. Our approach does not depend on human labelled data
(except for validation), leading to fully automatic updation and tuning
pipeline for the underlying models. We also propose a novel hybrid message
prediction model, which can run with low latency on low-end phones that have
severe computational limitations. Our described system has been deployed for
more than months and is being used by millions of users along with hundreds
of thousands of expressive stickers
MojiTalk: Generating Emotional Responses at Scale
Generating emotional language is a key step towards building empathetic
natural language processing agents. However, a major challenge for this line of
research is the lack of large-scale labeled training data, and previous studies
are limited to only small sets of human annotated sentiment labels.
Additionally, explicitly controlling the emotion and sentiment of generated
text is also difficult. In this paper, we take a more radical approach: we
exploit the idea of leveraging Twitter data that are naturally labeled with
emojis. More specifically, we collect a large corpus of Twitter conversations
that include emojis in the response, and assume the emojis convey the
underlying emotions of the sentence. We then introduce a reinforced conditional
variational encoder approach to train a deep generative model on these
conversations, which allows us to use emojis to control the emotion of the
generated text. Experimentally, we show in our quantitative and qualitative
analyses that the proposed models can successfully generate high-quality
abstractive conversation responses in accordance with designated emotions
An Exploratory Study of COVID-19 Misinformation on Twitter
During the COVID-19 pandemic, social media has become a home ground for
misinformation. To tackle this infodemic, scientific oversight, as well as a
better understanding by practitioners in crisis management, is needed. We have
conducted an exploratory study into the propagation, authors and content of
misinformation on Twitter around the topic of COVID-19 in order to gain early
insights. We have collected all tweets mentioned in the verdicts of
fact-checked claims related to COVID-19 by over 92 professional fact-checking
organisations between January and mid-July 2020 and share this corpus with the
community. This resulted in 1 500 tweets relating to 1 274 false and 276
partially false claims, respectively. Exploratory analysis of author accounts
revealed that the verified twitter handle(including Organisation/celebrity) are
also involved in either creating (new tweets) or spreading (retweet) the
misinformation. Additionally, we found that false claims propagate faster than
partially false claims. Compare to a background corpus of COVID-19 tweets,
tweets with misinformation are more often concerned with discrediting other
information on social media. Authors use less tentative language and appear to
be more driven by concerns of potential harm to others. Our results enable us
to suggest gaps in the current scientific coverage of the topic as well as
propose actions for authorities and social media users to counter
misinformation.Comment: 20 pages, nine figures, four tables. Submitted for peer review,
revision
TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations
We present TwHIN-BERT, a multilingual language model trained on in-domain
data from the popular social network Twitter. TwHIN-BERT differs from prior
pre-trained language models as it is trained with not only text-based
self-supervision, but also with a social objective based on the rich social
engagements within a Twitter heterogeneous information network (TwHIN). Our
model is trained on 7 billion tweets covering over 100 distinct languages
providing a valuable representation to model short, noisy, user-generated text.
We evaluate our model on a variety of multilingual social recommendation and
semantic understanding tasks and demonstrate significant metric improvement
over established pre-trained language models. We will freely open-source
TwHIN-BERT and our curated hashtag prediction and social engagement benchmark
datasets to the research community
- …