Theme-weighted Ranking of Keywords from Text Documents using Phrase Embeddings
Keyword extraction is a fundamental task in natural language processing that
facilitates mapping of documents to a concise set of representative single and
multi-word phrases. Keywords from text documents are primarily extracted using
supervised and unsupervised approaches. In this paper, we present an
unsupervised technique that combines a theme-weighted personalized
PageRank algorithm with neural phrase embeddings for extracting and ranking
keywords. We also introduce an efficient way of processing text documents and
training phrase embeddings using existing techniques. We share an evaluation
dataset derived from an existing dataset that is used for choosing the
underlying embedding model. The evaluations for ranked keyword extraction are
performed on two benchmark datasets, one comprising short abstracts (Inspec) and
one comprising long scientific papers (SemEval 2010), and the technique is shown
to produce better results than state-of-the-art systems.
Comment: preprint for paper accepted in Proceedings of 1st IEEE International
Conference on Multimedia Information Processing and Retrieva
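The ranking idea named in the abstract can be illustrated with a minimal personalized PageRank sketch. This is not the authors' implementation: the function name, the toy phrase graph, and the `theme_weights` values (assumed here to be some per-phrase relevance score, for example a similarity between a phrase embedding and a document-level theme vector) are all hypothetical.

```python
def personalized_pagerank(graph, personalization, damping=0.85,
                          iterations=200, tol=1e-10):
    """Rank nodes of a weighted graph with a biased teleport distribution.

    graph: dict mapping node -> dict of neighbor -> edge weight.
    personalization: dict mapping node -> non-negative weight (hypothetical
    theme weights in this sketch); it is normalized to sum to 1.
    """
    nodes = list(graph)
    total = sum(personalization.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("personalization weights must sum to a positive value")
    teleport = {n: personalization.get(n, 0.0) / total for n in nodes}
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    scores = dict(teleport)

    for _ in range(iterations):
        # Every node first receives its share of the teleport mass.
        new_scores = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            if out_weight[n] == 0:
                # Dangling node: hand its mass back via the teleport distribution.
                for m in nodes:
                    new_scores[m] += damping * scores[n] * teleport[m]
            else:
                for m, w in graph[n].items():
                    new_scores[m] += damping * scores[n] * w / out_weight[n]
        delta = sum(abs(new_scores[n] - scores[n]) for n in nodes)
        scores = new_scores
        if delta < tol:
            break
    return scores


# Toy co-occurrence graph over candidate phrases (illustrative data only).
graph = {
    "keyword extraction": {"phrase embeddings": 1.0, "pagerank": 1.0},
    "phrase embeddings": {"keyword extraction": 1.0, "pagerank": 1.0},
    "pagerank": {"keyword extraction": 1.0, "phrase embeddings": 1.0},
}
# Hypothetical theme weights: higher means closer to the document's theme.
theme_weights = {"keyword extraction": 0.8, "phrase embeddings": 0.1, "pagerank": 0.1}

scores = personalized_pagerank(graph, theme_weights)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph all three phrases are structurally equivalent, so any difference in the final ranking comes from the teleport bias alone, which is the effect a theme weighting is meant to have.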
A Multimodal Approach to Predict Social Media Popularity
Multiple modalities represent different aspects by which information is
conveyed by a data source. Modern day social media platforms are one of the
primary sources of multimodal data, where users use different modes of
expression by posting textual as well as multimedia content such as images and
videos for sharing information. Multimodal information embedded in such posts
could be useful in predicting their popularity. To the best of our knowledge,
no such multimodal dataset exists for predicting the popularity of social media
photos. In this work, we propose a multimodal dataset consisting of content,
context, and social information for popularity prediction. Specifically, we
augment the SMP-T1 dataset for social media prediction from the ACM Multimedia
Grand Challenge 2017 with image content, titles, descriptions, and tags. Next,
we propose a multimodal approach which exploits visual features (i.e., content
information), textual features (i.e., contextual information), and social
features (e.g., average views and group counts) to predict popularity of social
media photos in terms of view counts. Experimental results confirm that although
our multimodal approach uses half of the training dataset from SMP-T1, it
achieves performance comparable to that of the state of the art.
Comment: Preprint version for paper accepted in Proceedings of 1st IEEE
International Conference on Multimedia Information Processing and Retrieva
#MeTooMA: Multi-Aspect Annotations of Tweets Related to the MeToo Movement
In this paper, we present a dataset containing 9,973 tweets related to the
MeToo movement that were manually annotated for five different linguistic
aspects: relevance, stance, hate speech, sarcasm, and dialogue acts. We present
a detailed account of the data collection and annotation processes. The
annotations have a very high inter-annotator agreement (0.79 to 0.93 k-alpha)
due to the domain expertise of the annotators and clear annotation
instructions. We analyze the data in terms of geographical distribution, label
correlations, and keywords. Lastly, we present some potential use cases of this
dataset. We expect this dataset would be of great interest to psycholinguists,
socio-linguists, and computational linguists to study the discursive space of
digitally mobilized social movements on sensitive issues like sexual
harassment.
Comment: Preprint of paper accepted at ICWSM 202
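The agreement figures quoted in the abstract can be made concrete with a small sketch of how such a score is computed. It assumes that "k-alpha" refers to Krippendorff's alpha and that the labels are treated as nominal categories; neither assumption is confirmed by the abstract, and the function name and toy labels below are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that different
    annotators gave to one item, with None marking a missing annotation.
    """
    # Coincidence matrix: each ordered pair of labels from different
    # annotators of the same item contributes 1 / (m - 1), where m is the
    # number of labels that item received.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - observed / expected


# Toy example: two annotators labelling four tweets (illustrative data only).
toy_units = [
    ["support", "support"],
    ["oppose", "oppose"],
    ["support", "oppose"],
    ["neutral", "neutral"],
]
alpha = krippendorff_alpha_nominal(toy_units)
```

A value of 1 indicates perfect agreement, a value near 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.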