A Comprehensive Empirical Evaluation of Existing Word Embedding Approaches
Vector-based word representations help countless Natural Language Processing
(NLP) tasks capture the language's semantic and syntactic regularities. In this
paper, we present the characteristics of existing word embedding approaches and
analyze them with regard to many classification tasks. We categorize the
methods into two main groups. Traditional approaches mostly use matrix
factorization to produce word representations, and they are not able to capture
the semantic and syntactic regularities of the language very well. On the other
hand, Neural-network-based approaches can capture sophisticated regularities of
the language and preserve the word relationships in the generated word
representations. We report experimental results on multiple classification
tasks and highlight the scenarios where one approach performs better than the
rest. Comment: 28 pages, 3 figures and 10 tables
A social graph based text mining framework for chat log investigation
This paper presents a unified social-graph-based text mining framework to identify digital evidence from chat log data. It considers both users' conversation and interaction data in group chats to discover overlapping user interests and social ties. The proposed framework applies an n-gram technique together with a customized hyperlink-induced topic search (HITS) algorithm to identify key-terms representing users' interests, key-users, and key-sessions. We propose a social graph generation technique to model users' interactions, where a tie (edge) between a pair of users (nodes) is established only if they participate in at least one common group-chat session, and weights are assigned to the ties based on the degree of overlap in the users' interests and interactions. Finally, we present three possible cyber-crime investigation scenarios and a user-group identification method for each of them. We present experimental results on a data set comprising 1100 chat logs of 11,143 chat sessions spanning 29 months, from January 2010 to May 2012. The results suggest that the proposed framework is able to identify key-terms, key-users, key-sessions, and user-groups from chat log data, all of which are crucial for cyber-crime investigation. Although the chat logs used here were recovered from a single computer, in a real scenario logs are very likely to be collected from multiple computers. In that case, the logs from multiple computers can be combined to generate a richer social graph. However, our experiments show that the objectives can be achieved even with logs recovered from a single computer by using group-chat data to draw relationships between every pair of users.
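The edge rule described in this abstract (a tie exists only when two users share at least one group-chat session, with a weight reflecting overlap in interests and interactions) can be illustrated with a small sketch. The weighting formula below, shared-session count scaled by the Jaccard overlap of key-term sets, is an assumption made for illustration and is not the paper's formula. The function name `build_social_graph` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_social_graph(sessions, interests):
    """Build an undirected weighted graph of users.

    sessions:  dict mapping session id -> set of participating users
    interests: dict mapping user -> set of key terms (hypothetical input)
    Returns a dict mapping a sorted (user, user) pair to an edge weight.
    """
    # Count how many sessions each pair of users has in common.
    shared = defaultdict(int)
    for participants in sessions.values():
        for u, v in combinations(sorted(participants), 2):
            shared[(u, v)] += 1

    graph = {}
    for (u, v), count in shared.items():
        a, b = interests.get(u, set()), interests.get(v, set())
        union = a | b
        jaccard = len(a & b) / len(union) if union else 0.0
        # Assumed weighting: interaction count boosted by interest overlap.
        graph[(u, v)] = count * (1.0 + jaccard)
    return graph


sessions = {
    "s1": {"alice", "bob"},
    "s2": {"alice", "bob", "carol"},
    "s3": {"dave"},  # a session with one user creates no tie
}
interests = {"alice": {"x", "y"}, "bob": {"y", "z"}, "carol": {"y"}}
graph = build_social_graph(sessions, interests)
```

Users who never share a session (here `dave`) get no edge at all, which matches the stated condition for creating a tie.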
Ranking radically influential web forum users
The growing popularity of online social media is leading to its widespread use among the online community for various purposes. In the recent past, it has been found that the web is also being used as a tool by radical or extremist groups and users to carry out harmful acts with concealed agendas and to promote ideologies in a sophisticated manner. Some web forums are predominantly used for open discussions on critical issues influenced by radical thoughts. Influential users dominate and influence newly joined, innocent users through their radical thoughts. This paper presents an application of collocation theory to identify radically influential users in web forums. The radicalness of a user is captured by a measure based on the degree of match of the commented posts with a threat list. Eleven different collocation metrics are formulated to identify the association among users, and they are finally embedded in a customized PageRank algorithm to generate a ranked list of radically influential users. The experiments are conducted on a standard data set provided for a challenge at the ISI-KDD'12 workshop to find radical and infectious threads, members, postings, ideas, and ideologies. Experimental results show that our proposed method outperforms the existing UserRank algorithm. We also found that collocation theory is more effective for this kind of ranking problem than the textual and temporal similarity-based measures studied earlier.
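To make the ranking step concrete, here is a minimal sketch of a weighted PageRank computed by power iteration. It assumes the association scores between users have already been reduced to one non-negative weight per directed edge; the paper's eleven collocation metrics and its specific customization are not reproduced here. The name `weighted_pagerank` and the sample weights are hypothetical.

```python
def weighted_pagerank(weights, damping=0.85, iterations=100):
    """Rank nodes of a directed graph whose edges carry non-negative weights.

    weights: dict mapping (source, target) -> association score (assumed input)
    Returns a dict mapping node -> rank; ranks sum to 1.
    """
    nodes = sorted({node for edge in weights for node in edge})
    n = len(nodes)

    # Total outgoing weight per node, used to normalize each transition.
    out_total = {node: 0.0 for node in nodes}
    for (src, _), w in weights.items():
        out_total[src] += w

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing weight is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_total[node] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (src, dst), w in weights.items():
            if w > 0.0:
                new_rank[dst] += damping * rank[src] * w / out_total[src]
        rank = new_rank
    return rank


# Hypothetical association scores between three forum users.
weights = {("a", "c"): 2.0, ("b", "c"): 1.0, ("c", "a"): 1.0}
ranks = weighted_pagerank(weights)
ranking = sorted(ranks, key=ranks.get, reverse=True)
```

Sorting users by their rank gives the ranked list; a user with no incoming association (here `b`) receives only the baseline share.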
Graph-based learning model for detection of SMS spam on smart phones
Short Message Service (SMS) has been increasingly exploited through spam propagation schemes in recent years. This paper presents a new method for graph-based learning and classification of spam SMS on mobile devices and smart phones. Our approach models the content and patterns of SMS syntax as a directed weighted graph by exploiting the modern composition style of messages. The graph attributes are then used to classify spam messages in real time using the KL-divergence measure. Experimental results on two real-world datasets show that our proposed method achieves high detection accuracy with a lower false alarm rate in detecting spam messages. Moreover, our approach requires relatively little memory and processing power, making it suitable for deployment on resource-constrained mobile devices and smart phones. status: published
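As a rough illustration of classifying with KL divergence over graph attributes, the sketch below treats each token-to-next-token transition as a weighted directed edge and assigns a message to the class whose edge distribution it diverges from least. The choice of edges, the add-one smoothing, and the decision rule are all assumptions for illustration; the abstract does not specify them. The function names and the tiny training sets are hypothetical.

```python
import math
from collections import Counter


def edge_counts(messages):
    """Count directed token -> next-token edges over a list of messages."""
    counts = Counter()
    for text in messages:
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def kl_divergence(p_counts, q_counts, edges, alpha=1.0):
    """KL(P || Q) over a shared edge set, with additive smoothing alpha."""
    p_total = sum(p_counts.values()) + alpha * len(edges)
    q_total = sum(q_counts.values()) + alpha * len(edges)
    total = 0.0
    for edge in edges:
        p = (p_counts[edge] + alpha) / p_total
        q = (q_counts[edge] + alpha) / q_total
        total += p * math.log(p / q)
    return total


def classify(message, spam_graph, ham_graph):
    """Label a message by the class graph with the smaller divergence."""
    msg = edge_counts([message])
    edges = set(msg) | set(spam_graph) | set(ham_graph)
    d_spam = kl_divergence(msg, spam_graph, edges)
    d_ham = kl_divergence(msg, ham_graph, edges)
    return "spam" if d_spam < d_ham else "ham"


# Hypothetical training messages for each class.
spam_graph = edge_counts(["win a free prize now", "claim your free prize now"])
ham_graph = edge_counts(["see you at lunch today", "are you at home today"])
label_a = classify("free prize now", spam_graph, ham_graph)
label_b = classify("see you at lunch", spam_graph, ham_graph)
```

Because the class graphs are just edge counters, they are small and cheap to query, in keeping with the abstract's emphasis on limited memory and processing power.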