
    Understanding Chat Messages for Sticker Recommendation in Messaging Apps

    Stickers are popularly used in messaging apps such as Hike to visually express a nuanced range of thoughts and utterances and to convey exaggerated emotions. However, discovering the right sticker from a large and ever-expanding pool of stickers while chatting can be cumbersome. In this paper, we describe a system for recommending stickers in real time as the user is typing, based on the context of the conversation. We decompose the sticker recommendation (SR) problem into two steps. First, we predict the message that the user is likely to send in the chat. Second, we substitute the predicted message with an appropriate sticker. The majority of Hike's messages are text transliterated from users' native languages into the Roman script. This leads to numerous orthographic variations of the same message and makes accurate message prediction challenging. To address this issue, we learn dense representations of chat messages with a character-level convolutional network trained in an unsupervised manner, and use them to cluster messages that have the same meaning. In the subsequent steps, we predict the message cluster instead of the message. Our approach does not depend on human-labelled data (except for validation), leading to a fully automatic update and tuning pipeline for the underlying models. We also propose a novel hybrid message prediction model that can run with low latency on low-end phones with severe computational limitations. The described system has been deployed for more than 66 months and is used by millions of users along with hundreds of thousands of expressive stickers.
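    The clustering step lends itself to a short illustration. The sketch below, in PyTorch with scikit-learn, shows the general shape of a character-level convolutional encoder that maps Romanised chat messages to dense vectors, which are then clustered; the hyperparameters, byte-level character ids, and use of k-means are illustrative assumptions rather than the paper's actual configuration, and the encoder here is untrained.

```python
# Minimal sketch: character-level convolutional message encoder followed by
# clustering of message embeddings. Hyperparameters (vocab size, embedding
# dim, kernel widths) are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class CharCNNEncoder(nn.Module):
    def __init__(self, n_chars=128, emb_dim=32, n_filters=64,
                 kernel_sizes=(2, 3, 4), out_dim=128):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k, padding=k // 2) for k in kernel_sizes
        )
        self.proj = nn.Linear(n_filters * len(kernel_sizes), out_dim)

    def forward(self, char_ids):                   # (batch, max_len)
        x = self.emb(char_ids).transpose(1, 2)     # (batch, emb_dim, max_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.proj(torch.cat(pooled, dim=1))  # (batch, out_dim)

def encode(messages, encoder, max_len=64):
    """Crude byte-to-id mapping from Romanised messages to dense vectors."""
    ids = torch.zeros(len(messages), max_len, dtype=torch.long)
    for i, msg in enumerate(messages):
        for j, byte in enumerate(msg.encode("utf-8")[:max_len]):
            ids[i, j] = byte % 128
    with torch.no_grad():
        return encoder(ids)

encoder = CharCNNEncoder()  # note: untrained; the paper trains it unsupervised first
messages = ["good morning", "gud mrng", "gd morning", "happy birthday", "hpy bday"]
vectors = encode(messages, encoder).numpy()
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
print(list(zip(messages, clusters)))  # with training, orthographic variants share a cluster
```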

    MojiTalk: Generating Emotional Responses at Scale

    Generating emotional language is a key step towards building empathetic natural language processing agents. However, a major challenge for this line of research is the lack of large-scale labeled training data, and previous studies are limited to only small sets of human-annotated sentiment labels. Additionally, explicitly controlling the emotion and sentiment of generated text is also difficult. In this paper, we take a more radical approach: we exploit the idea of leveraging Twitter data that are naturally labeled with emojis. More specifically, we collect a large corpus of Twitter conversations that include emojis in the response, and assume the emojis convey the underlying emotions of the sentence. We then introduce a reinforced conditional variational encoder approach to train a deep generative model on these conversations, which allows us to use emojis to control the emotion of the generated text. Experimentally, we show in our quantitative and qualitative analyses that the proposed models can successfully generate high-quality abstractive conversation responses in accordance with designated emotions.
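    To make the conditioning mechanism concrete, the sketch below shows a minimal conditional-VAE skeleton in which an emoji label conditions both the recognition network and the generator. It is a simplification under stated assumptions: the actual MojiTalk model uses sequence encoders and decoders over conversations plus a reinforcement-learning stage, whereas here plain MLPs operate on fixed-size stand-in vectors.

```python
# Minimal conditional-VAE skeleton in the spirit of emoji-conditioned response
# generation. All dimensions and the MLP encoder/decoder are illustrative
# assumptions; the point is how the emoji label enters both q(z|x, emoji)
# and p(x|z, emoji).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    def __init__(self, x_dim=256, n_emojis=64, emoji_dim=16, z_dim=32, hidden=128):
        super().__init__()
        self.emoji_emb = nn.Embedding(n_emojis, emoji_dim)
        self.enc = nn.Linear(x_dim + emoji_dim, hidden)
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim + emoji_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim)
        )

    def forward(self, x, emoji_ids):
        e = self.emoji_emb(emoji_ids)                        # emoji condition
        h = torch.relu(self.enc(torch.cat([x, e], dim=1)))   # recognition network q(z | x, emoji)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        x_hat = self.dec(torch.cat([z, e], dim=1))           # generator p(x | z, emoji)
        return x_hat, mu, logvar

def elbo_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

model = ConditionalVAE()
x = torch.randn(8, 256)                 # stand-in response representations
emoji_ids = torch.randint(0, 64, (8,))  # designated emoji labels
x_hat, mu, logvar = model(x, emoji_ids)
print(elbo_loss(x, x_hat, mu, logvar).item())
```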

    An Exploratory Study of COVID-19 Misinformation on Twitter

    During the COVID-19 pandemic, social media has become a home ground for misinformation. To tackle this infodemic, scientific oversight, as well as a better understanding by practitioners in crisis management, is needed. We have conducted an exploratory study into the propagation, authors and content of misinformation on Twitter around the topic of COVID-19 in order to gain early insights. We have collected all tweets mentioned in the verdicts of fact-checked claims related to COVID-19 by over 92 professional fact-checking organisations between January and mid-July 2020 and share this corpus with the community. This resulted in 1,500 tweets relating to 1,274 false and 276 partially false claims, respectively. Exploratory analysis of author accounts revealed that verified Twitter handles (including organisations and celebrities) are also involved in either creating (new tweets) or spreading (retweets) the misinformation. Additionally, we found that false claims propagate faster than partially false claims. Compared to a background corpus of COVID-19 tweets, tweets with misinformation are more often concerned with discrediting other information on social media. Their authors use less tentative language and appear to be more driven by concerns of potential harm to others. Our results enable us to suggest gaps in the current scientific coverage of the topic as well as propose actions for authorities and social media users to counter misinformation.

    TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations

    We present TwHIN-BERT, a multilingual language model trained on in-domain data from the popular social network Twitter. TwHIN-BERT differs from prior pre-trained language models in that it is trained not only with text-based self-supervision but also with a social objective based on the rich social engagements within a Twitter heterogeneous information network (TwHIN). Our model is trained on 7 billion tweets covering over 100 distinct languages, providing a valuable representation for modeling short, noisy, user-generated text. We evaluate our model on a variety of multilingual social recommendation and semantic understanding tasks and demonstrate significant metric improvements over established pre-trained language models. We will freely open-source TwHIN-BERT and our curated hashtag prediction and social engagement benchmark datasets to the research community.
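    The two-part training signal described above can be sketched as a weighted sum of a masked-language-modelling loss and an in-batch contrastive loss over socially linked tweet pairs. The pairing scheme, temperature, and weighting in the sketch below are assumptions for illustration, not the released TwHIN-BERT training recipe.

```python
# Illustrative combination of a text self-supervision term (MLM loss) with a
# contrastive "social" loss that pulls together tweets sharing an engagement
# link in the TwHIN. All constants here are assumptions.
import torch
import torch.nn.functional as F

def social_contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """In-batch contrastive loss: row i of `positive_emb` is a tweet socially
    linked to row i of `anchor_emb`; other rows in the batch act as negatives."""
    a = F.normalize(anchor_emb, dim=1)
    p = F.normalize(positive_emb, dim=1)
    logits = a @ p.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))     # matching pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

def joint_loss(mlm_loss, anchor_emb, positive_emb, social_weight=1.0):
    """Combine the masked-language-modelling term with the social objective."""
    return mlm_loss + social_weight * social_contrastive_loss(anchor_emb, positive_emb)

# Toy check with random vectors standing in for pooled tweet representations.
anchor = torch.randn(16, 768)
positive = torch.randn(16, 768)
print(joint_loss(torch.tensor(2.3), anchor, positive).item())
```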