POISED: Spotting Twitter Spam Off the Beaten Paths
Cybercriminals have found in online social networks a propitious medium to
spread spam and malicious content. Existing techniques for detecting spam
include predicting the trustworthiness of accounts and analyzing the content of
these messages. However, advanced attackers can still successfully evade these
defenses.
Online social networks bring people who have personal connections or share
common interests to form communities. In this paper, we first show that users
within a networked community share some topics of interest. Moreover, content
shared on these social networks tends to propagate according to the interests of
people. Dissemination paths may emerge where some communities post similar
messages, based on the interests of those communities. Spam and other malicious
content, on the other hand, follow different spreading patterns.
In this paper, we follow this insight and present POISED, a system that
leverages the differences in propagation between benign and malicious messages
on social networks to identify spam and other unwanted content. We test our
system on a dataset of 1.3M tweets collected from 64K users, and we show that
our approach is effective in detecting malicious messages, reaching 91%
precision and 93% recall. We also show that POISED's detection is more
comprehensive than previous systems, by comparing it to three state-of-the-art
spam detection systems that have been proposed by the research community in the
past. POISED significantly outperforms each of these systems. Moreover, through
simulations, we show how POISED is effective in the early detection of spam
messages and how it is resilient against two well-known adversarial machine
learning attacks.
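The precision and recall figures quoted in the abstract above follow the standard definitions for a binary classifier. A minimal sketch of how such figures are computed is given here, assuming labels are encoded as 1 for spam and 0 for benign; the function name `precision_recall` and the toy labels are illustrative only and do not come from the POISED paper or its dataset.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = spam, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam flagged as spam
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that was missed
    # Guard against division by zero when nothing is flagged or nothing is spam.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels, for illustration only.
truth = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 1]
print(precision_recall(truth, predicted))
```

Precision is the fraction of flagged messages that really are spam; recall is the fraction of spam messages that get flagged.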
Investigation of the use of navigation tools in web-based learning: A data mining approach
Web-based learning is widespread in educational settings. The popularity of Web-based learning is in great measure because of its flexibility. Multiple navigation tools provide some of this flexibility. Different navigation tools offer different functions. Therefore, it is important to understand how the navigation tools are used by learners with different backgrounds, knowledge, and skills. This article presents two empirical studies in which data-mining approaches were used to analyze learners' navigation behavior. The results indicate that prior knowledge and subject content are two potential factors influencing the use of navigation tools. In addition, the lack of appropriate use of navigation tools may adversely influence learning performance. The results have been integrated into a model that can help designers develop Web-based learning programs and other Web-based applications that can be tailored to learners' needs.
Automatic document classification of biological literature
Background: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.
Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test-set (Reuters 21578) compared to previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
Conclusions: We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
Mutual information based clustering of market basket data for profiling users
The attraction and commercial success of web sites depend heavily on the additional value visitors may find. Here, individual, automatically obtained and maintained user profiles are the key to user satisfaction. This contribution shows, using the example of a cooking information site, how user profiles might be obtained from the category information provided by cooking recipes. It is shown that metrical distance functions and standard clustering procedures lead to erroneous results. Instead, we propose a new mutual information based clustering approach and outline its implications for the example of user profiling.
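The abstract above does not spell out the clustering procedure, so the following is only a sketch of the underlying quantity: the mutual information between two discrete variables, estimated from co-occurrence counts. It assumes the data is a list of (user group, recipe category) pairs; the function name `mutual_information` and the toy data are hypothetical, and this is not the authors' clustering algorithm.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Estimate I(X; Y) in bits from a list of observed (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)               # counts of each (x, y) combination
    px = Counter(x for x, _ in pairs)    # marginal counts of x
    py = Counter(y for _, y in pairs)    # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: group "a" only views soups, group "b" only cakes.
dependent = [("a", "soup"), ("a", "soup"), ("b", "cake"), ("b", "cake")]
# Here the category viewed carries no information about the group.
independent = [("a", "soup"), ("a", "cake"), ("b", "soup"), ("b", "cake")]
print(mutual_information(dependent), mutual_information(independent))
```

A high value means that knowing the group tells you a lot about which categories are viewed, which is the kind of dependence a mutual information based grouping would try to preserve.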