Multi-task Pairwise Neural Ranking for Hashtag Segmentation
Hashtags are often employed on social media and beyond to add metadata to a
textual utterance with the goal of increasing discoverability, aiding search,
or providing additional semantics. However, the semantic content of hashtags is
not straightforward to infer as these represent ad-hoc conventions which
frequently include multiple words joined together and can include abbreviations
and unorthodox spellings. We build a dataset of 12,594 hashtags split into
individual segments and propose a set of approaches for hashtag segmentation by
framing it as a pairwise ranking problem between candidate segmentations. Our
novel neural approaches demonstrate 24.6% error reduction in hashtag
segmentation accuracy compared to the current state-of-the-art method. Finally,
we demonstrate that a deeper understanding of hashtag semantics obtained
through segmentation is useful for downstream applications such as sentiment
analysis, for which we achieved a 2.6% increase in average recall on the
SemEval 2017 sentiment analysis dataset. Comment: 12 pages, ACL 201
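The pairwise-ranking framing above can be illustrated with a toy sketch: enumerate candidate segmentations of a hashtag, compare candidates pairwise under a scoring function, and return the candidate that wins the most comparisons. This is an assumption-laden stand-in; the paper uses a neural pairwise ranker, while the scorer, vocabulary, and counts below are invented for illustration.

```python
# Toy sketch of hashtag segmentation as pairwise ranking.
# The frequency-based scorer is a hypothetical stand-in for the
# neural pairwise ranker described in the abstract.
from itertools import combinations

VOCAB = {"bieber": 5, "blast": 4, "bie": 1, "berblast": 1}  # toy word counts

def candidates(hashtag):
    """All ways to split the string into contiguous segments."""
    if not hashtag:
        return [[]]
    out = []
    for i in range(1, len(hashtag) + 1):
        head, rest = hashtag[:i], hashtag[i:]
        out.extend([head] + tail for tail in candidates(rest))
    return out

def score(seg):
    # Reward segmentations whose parts are frequent vocabulary words,
    # with a small penalty per segment to discourage over-splitting.
    return sum(VOCAB.get(w, 0) for w in seg) - len(seg)

def rank(hashtag):
    cands = candidates(hashtag)
    wins = {i: 0 for i in range(len(cands))}
    for i, j in combinations(range(len(cands)), 2):
        if score(cands[i]) >= score(cands[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return cands[max(wins, key=wins.get)]

print(rank("bieberblast"))  # -> ['bieber', 'blast']
```

Exhaustive candidate enumeration is exponential in hashtag length, so a real system would prune candidates (e.g., with a dictionary-driven lattice) before ranking.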
#Bieber + #Blast = #BieberBlast: Early Prediction of Popular Hashtag Compounds
Compounding of natural language units is a very common phenomenon. In this
paper, we show, for the first time, that Twitter hashtags, which could be
considered correlates of such linguistic units, undergo compounding. We
identify reasons for this compounding and propose a prediction model that can
identify with 77.07% accuracy whether a pair of compounding hashtags will
become popular in the near future (i.e., 2 months after compounding). At
longer horizons of T = 6 and 10 months, the accuracies are 77.52% and 79.13% respectively. This
technique has strong implications for trending hashtag recommendation, since
newly formed hashtag compounds can be recommended early, even before the
compounding has taken place. Further, humans can predict compounds with an
overall accuracy of only 48.7% (treated as baseline). Notably, while humans can
discriminate the relatively easier cases, the automatic framework is successful
in classifying the relatively harder cases. Comment: 14 pages, 4 figures, 9 tables, published in Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2016).
Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities
Breaking domain names such as openresearch into component words open and
research is important for applications like Text-to-Speech synthesis and web
search. We link this problem to the classic problem of Chinese word
segmentation and show the effectiveness of a tagging model based on Recurrent
Neural Networks (RNNs) using characters as input. To compensate for the lack of
training data, we propose a pre-training method on concatenated entity names in
a large knowledge database. Pre-training improves the model by 33% and brings
the sequence accuracy to 85%.
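The character-tagging formulation described above can be shown with a small round-trip sketch: each character receives a boundary tag ("B" begins a word, "I" continues one), and decoding the tag sequence recovers the segmentation. In the paper an RNN predicts these tags from characters; here the tags are given, since the point is only the encoding scheme, and the tag names are an assumption.

```python
# Sketch of character-level segmentation tagging: "B" marks the first
# character of a word, "I" marks a continuation. An RNN would predict
# the tags; here they are constructed to show the encode/decode round trip.

def tags_from_segments(words):
    """Encode a segmentation as one tag per character."""
    return ["B" + "I" * (len(w) - 1) for w in words]

def decode(chars, tags):
    """Recover the word list from characters and their boundary tags."""
    words, cur = [], ""
    for ch, t in zip(chars, tags):
        if t == "B" and cur:
            words.append(cur)
            cur = ""
        cur += ch
    if cur:
        words.append(cur)
    return words

tags = "".join(tags_from_segments(["open", "research"]))  # "BIIIBIIIIIII"
print(decode("openresearch", tags))  # -> ['open', 'research']
```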
Using Linguistic Features to Estimate Suicide Probability of Chinese Microblog Users
If people at high risk of suicide can be identified through social media
such as microblogs, it becomes possible to implement an active intervention
system to save their lives. Based on this motivation, the current study
administered the Suicide Probability Scale (SPS) to 1,041 users of Sina Weibo, a
leading microblog service provider in China. Two NLP (Natural Language
Processing) methods, the Chinese edition of Linguistic Inquiry and Word Count
(LIWC) lexicon and Latent Dirichlet Allocation (LDA), are used to extract
linguistic features from the Sina Weibo data. We trained machine-learning
models on these two types of features to estimate suicide probability. The
experiment results
indicate that LDA can find topics that relate to suicide probability, and
improve prediction performance. Our study adds value by predicting the
suicide probability of social network users from their behavior.
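A minimal sketch of the pipeline this abstract describes: LDA topic proportions extracted from microblog text feed a regression model that estimates an SPS-style score. The use of scikit-learn, the toy posts, and the toy target scores are all assumptions; the paper does not specify its toolkit or learning algorithm.

```python
# Hypothetical sketch: bag-of-words -> LDA topic features -> regression
# on Suicide Probability Scale scores. All data below is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

posts = [
    "feeling hopeless and alone tonight",
    "great dinner with friends today",
    "no reason to keep going anymore",
    "excited about the new job offer",
]
sps_scores = [0.8, 0.1, 0.9, 0.05]  # toy SPS targets in [0, 1]

model = make_pipeline(
    CountVectorizer(),                                          # word counts
    LatentDirichletAllocation(n_components=2, random_state=0),  # topic features
    Ridge(alpha=1.0),                                           # score regression
)
model.fit(posts, sps_scores)
pred = model.predict(["so alone and hopeless"])[0]
print(round(float(pred), 3))
```

In practice the LIWC lexicon counts mentioned in the abstract would be concatenated with the topic proportions before the regressor.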
Treebanking user-generated content: A proposal for a unified representation in Universal Dependencies
The paper discusses the main linguistic phenomena of user-generated texts found on the web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given, on the one hand, the increasing number of treebanks featuring user-generated content and, on the other, its somewhat inconsistent treatment across these resources, the aim of this paper is twofold: (1) to provide a short though comprehensive overview of such treebanks, based on the available literature, along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, a principle that has always been in the spirit of UD.