Detecting Sarcasm in Multimodal Social Platforms
Sarcasm is a peculiar form of sentiment expression, where the surface
sentiment differs from the implied sentiment. The detection of sarcasm in
social media platforms has been applied in the past mainly to textual
utterances where lexical indicators (such as interjections and intensifiers),
linguistic markers, and contextual information (such as user profiles, or past
conversations) were used to detect the sarcastic tone. However, modern social
media platforms allow users to create multimodal messages where audiovisual content
is integrated with the text, making the analysis of a mode in isolation
partial. In our work, we first study the relationship between the textual and
visual aspects in multimodal posts from three major social media platforms,
i.e., Instagram, Tumblr and Twitter, and we run a crowdsourcing task to
quantify the extent to which images are perceived as necessary by human
annotators. Moreover, we propose two different computational frameworks to
detect sarcasm that integrate the textual and visual modalities. The first
approach exploits visual semantics trained on an external dataset, and
concatenates the semantic features with state-of-the-art textual features. The
second method adapts a visual neural network initialized with parameters
trained on ImageNet to multimodal sarcastic posts. Results show the positive
effect of combining modalities for the detection of sarcasm across platforms
and methods.
Comment: 10 pages, 3 figures, final version published in the Proceedings of
ACM Multimedia 201
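The first approach in the abstract above, concatenating visual semantic features with textual features, can be illustrated with a minimal early-fusion sketch. The function names, the toy vectors, and the per-modality L2 normalization are assumptions made for illustration; the abstract does not say how the features are scaled or which classifier consumes them.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed step, so neither modality dominates)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def early_fusion(text_features, visual_features):
    """Concatenate the normalized textual and visual vectors into one vector."""
    return l2_normalize(text_features) + l2_normalize(visual_features)


# Toy feature vectors (hypothetical values, not taken from the paper).
text_vec = [3.0, 4.0]
visual_vec = [0.0, 2.0, 0.0]

fused = early_fusion(text_vec, visual_vec)
print(len(fused), fused)
```

The fused vector would then be passed to any standard classifier; that choice is left open here.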
Text stylometry for chat bot identification and intelligence estimation.
Authorship identification is a technique for identifying the author of an unclaimed document by finding traits that match those of the original author. It has great potential for applications in forensics. It can also be used to identify chat bots, a form of intelligent software created to mimic human conversation, by their unique style. The online criminal community is using chat bots as a new way to steal private information and commit fraud and identity theft, so identifying chat bots by their style is becoming essential to counter online criminal activity. Researchers have recognized the need to advance the understanding of chat bots and to design programs that prevent criminal activities, whether identity theft or even a terrorist threat. The more research advances chat bots’ ability to perceive humans, the greater the research community’s duty to confront those threats. This research went further by studying whether chat bots exhibit behavioral drift. Stylometric analysis of text has been the goal of many researchers, who have experimented with many features and combinations of features. A novel feature, Term Frequency-Inverse Token Frequency (TF-ITF), is proposed; it builds on Term Frequency-Inverse Document Frequency (TFIDF) and is computed on byte-level n-grams. Initial experiments on collected data demonstrated the feasibility of this approach. Additional versions of the feature were created and tested for authorship identification. Results demonstrated that the feature successfully identified the authors of texts, and additional experiments showed that it is language independent: it successfully identified the authors of a German text. Furthermore, the feature was used to measure text similarity at the book level and the paragraph level.
Finally, a selective combination of features was used to classify texts ranging from kindergarten level to scientific research and novels. The feature combination measured the Quality of Writing (QoW) and the complexity of the text, a first step toward the future goal of correlating these measures with the author’s IQ.
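To make the idea of a TF-IDF-style weight over byte-level n-grams more concrete, here is a minimal sketch. It implements ordinary TF-IDF on byte trigrams, not the authors' TF-ITF feature, whose exact definition is not given in the abstract. The function names, the n-gram length of 3, and the `log(N / df) + 1` weighting are all assumptions.

```python
import math
from collections import Counter


def byte_ngrams(text, n=3):
    """Split the UTF-8 encoding of a text into overlapping byte n-grams."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def tfidf_vectors(texts, n=3):
    """Return one sparse {ngram: weight} dict per text (plain TF-IDF, assumed form)."""
    counts = [Counter(byte_ngrams(t, n)) for t in texts]
    doc_freq = Counter(g for c in counts for g in c)  # documents containing each n-gram
    num_docs = len(texts)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for very short texts
        vectors.append({
            g: (k / total) * (math.log(num_docs / doc_freq[g]) + 1.0)
            for g, k in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy samples (hypothetical); working on bytes needs no language-specific tokenizer.
samples = [
    "Hello, how are you today?",
    "Hello, how are you doing?",
    "Wie geht es dir heute über alles?",
]
vecs = tfidf_vectors(samples)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because the n-grams are taken over raw bytes, the same code runs unchanged on the German sample, which is one plausible reading of why such a feature can be language independent.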
On stopwords, filtering and data sparsity for sentiment analysis of Twitter
Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure to reduce the noise of textual data is to remove stopwords, either by using pre-compiled stopword lists or by using more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier’s feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing the infrequent terms that appear only once in the corpus, appears to be the optimal method for maintaining high classification performance while reducing data sparsity and substantially shrinking the feature space.
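The dynamic stopword method described above, removing terms that appear only once in the corpus, can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the function names, the whitespace tokenization, the toy tweets, and the sparsity definition (fraction of zero cells in the document-term matrix) are assumptions.

```python
from collections import Counter


def singleton_terms(tokenized_docs):
    """Return the set of terms that occur exactly once in the whole corpus."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {term for term, n in counts.items() if n == 1}


def remove_terms(tokenized_docs, stoplist):
    """Drop every token that is in the stoplist."""
    return [[tok for tok in doc if tok not in stoplist] for doc in tokenized_docs]


def vocabulary(tokenized_docs):
    """Feature space: the set of distinct terms in the corpus."""
    return {tok for doc in tokenized_docs for tok in doc}


def sparsity(tokenized_docs):
    """Fraction of zero cells in the document-term matrix (assumed definition)."""
    vocab = vocabulary(tokenized_docs)
    if not vocab or not tokenized_docs:
        return 0.0
    nonzero = sum(len(set(doc)) for doc in tokenized_docs)
    return 1.0 - nonzero / (len(tokenized_docs) * len(vocab))


# Toy corpus (hypothetical tweets, whitespace-tokenized).
tweets = [
    "gr8 day at the beach".split(),
    "the beach was gr8".split(),
    "ugh monday at the office".split(),
]

stoplist = singleton_terms(tweets)
filtered = remove_terms(tweets, stoplist)
print(len(vocabulary(tweets)), "->", len(vocabulary(filtered)))
print(sparsity(tweets), "->", sparsity(filtered))
```

On this toy corpus both the feature space and the sparsity shrink after filtering, which mirrors the effect the abstract reports; real datasets would need a proper tweet tokenizer.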
Linguistic Geometries for Unsupervised Dimensionality Reduction
Text documents are complex high dimensional objects. To effectively visualize
such data it is important to reduce its dimensionality and visualize the low
dimensional embedding as a 2-D or 3-D scatter plot. In this paper we explore
dimensionality reduction methods that draw upon domain knowledge in order to
achieve a better low dimensional embedding and visualization of documents. We
consider the use of geometries specified manually by an expert, geometries
derived automatically from corpus statistics, and geometries computed from
linguistic resources.
Comment: 13 pages, 15 figures