Self-Learning Classifier for Internet traffic
Network visibility is a critical part of traffic engineering, network management, and security. Recently, unsupervised algorithms have been envisioned as a viable alternative for automatically identifying classes of traffic. However, the accuracy achieved so far is not high enough for traffic classification in practical scenarios. In this paper, we propose SeLeCT, a Self-Learning Classifier for Internet traffic. It uses unsupervised algorithms along with an adaptive learning approach to let classes of traffic emerge automatically, so that they can be identified and (easily) labeled. SeLeCT automatically groups flows into pure (or homogeneous) clusters by alternating simple clustering phases with filtering phases that remove outliers. SeLeCT uses an adaptive learning approach to boost its ability to spot new protocols and applications. Finally, SeLeCT also simplifies label assignment (which still requires some manual intervention) so that proper class labels can be easily discovered. We evaluate the performance of SeLeCT using traffic traces collected in different years from various ISPs located on 3 different continents. Our experiments show that SeLeCT achieves overall accuracy close to 98%. Compared with state-of-the-art classifiers, the biggest advantage of SeLeCT is its ability to help discover new protocols and applications in an almost automated fashion.
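The filtering phase mentioned in the abstract can be pictured with a small sketch. The abstract does not state which criterion SeLeCT uses to separate outliers from the core of a cluster, so the rule below (keep the flows that share the cluster's most common value of one categorical attribute) is an assumption made purely for illustration. The names `filter_cluster`, `server_port`, and `min_share` are hypothetical and do not come from the paper.

```python
from collections import Counter

def filter_cluster(flows, attribute, min_share=0.5):
    """Split one cluster into (kept, outliers).

    Assumption (not from the paper): a cluster is "pure" with respect to a
    single categorical attribute, and flows that do not carry the dominant
    value of that attribute are treated as outliers.
    """
    counts = Counter(flow[attribute] for flow in flows)
    if not counts:
        return [], []
    dominant_value, dominant_count = counts.most_common(1)[0]
    # If no value dominates, hand the whole cluster back for re-clustering.
    if dominant_count / len(flows) < min_share:
        return [], list(flows)
    kept = [f for f in flows if f[attribute] == dominant_value]
    outliers = [f for f in flows if f[attribute] != dominant_value]
    return kept, outliers

# Toy cluster: three flows to port 80 and one stray flow to port 443.
toy_cluster = [
    {"id": 1, "server_port": 80},
    {"id": 2, "server_port": 80},
    {"id": 3, "server_port": 443},
    {"id": 4, "server_port": 80},
]
kept, outliers = filter_cluster(toy_cluster, "server_port")
print(len(kept), len(outliers))
```

In an alternating scheme, the outliers returned here would be fed into the next clustering phase rather than discarded.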
Unsupervised User Stance Detection on Twitter
We present a highly effective unsupervised framework for detecting the stance
of prolific Twitter users with respect to controversial topics. In particular,
we use dimensionality reduction to project users onto a low-dimensional space,
followed by clustering, which allows us to find core users that are
representative of the different stances. Our framework has three major
advantages over pre-existing methods, which are based on supervised or
semi-supervised classification. First, we do not require any prior labeling of
users: instead, we create clusters, which are much easier to label manually
afterwards, e.g., in a matter of seconds or minutes instead of hours. Second,
there is no need for domain- or topic-level knowledge either to specify the
relevant stances (labels) or to conduct the actual labeling. Third, our
framework is robust in the face of data skewness, e.g., when some users or some
stances have greater representation in the data. We experiment with different
combinations of user similarity features, dataset sizes, dimensionality
reduction methods, and clustering algorithms to ascertain the most effective
and most computationally efficient combinations across three different datasets
(in English and Turkish). We further verified our results on additional tweet
sets covering six different controversial topics. Our best combination in terms
of effectiveness and efficiency uses retweeted accounts as features, UMAP for
dimensionality reduction, and Mean Shift for clustering, and yields a small
number of high-quality user clusters, typically just 2-3, with more than 98%
purity. The resulting user clusters can be used to train downstream
classifiers. Moreover, our framework is robust to variations in the
hyper-parameter values and also with respect to random initialization.
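The purity figure quoted above can be made concrete with a short sketch. It uses the standard definition of cluster purity: each cluster is credited with its most frequent true label, and purity is the share of all items covered by those majority labels. The abstract does not say exactly how the authors computed it, so treat this as an illustration under that assumption. The function name `cluster_purity` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, true_labels):
    """Return the fraction of items that match their cluster's majority label."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("purity is undefined for an empty clustering")
    # Count how often each true label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts[cluster][label] += 1
    # Sum the size of the majority label over all clusters.
    majority_total = sum(max(c.values()) for c in label_counts.values())
    return majority_total / len(cluster_ids)

# Toy example: ten users in two clusters, with two users on the "wrong" side.
clusters = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
stances = ["pro", "pro", "pro", "anti",
           "anti", "anti", "anti", "anti", "anti", "pro"]
purity = cluster_purity(clusters, stances)
print(purity)
```

Purity alone rewards many tiny clusters, which is why the abstract reports it together with the small number of clusters (typically 2-3).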
Soft Seeded SSL Graphs for Unsupervised Semantic Similarity-based Retrieval
Semantic similarity based retrieval is playing an increasingly important role
in many IR systems, such as modern web search, question answering, and similar
document retrieval. Improvements in the retrieval of semantically similar
content are very significant to applications like Quora, Stack Overflow, and
Siri. We propose a novel unsupervised model for semantic similarity based
content retrieval, where we construct semantic flow graphs for each query, and
introduce the concept of "soft seeding" in graph based semi-supervised learning
(SSL) to convert this into an unsupervised model.
We demonstrate the effectiveness of our model on an equivalent question
retrieval problem on the Stack Exchange QA dataset, where our unsupervised
approach significantly outperforms the state-of-the-art unsupervised models,
and produces comparable results to the best supervised models. Our research
provides a method to tackle semantic similarity based retrieval without any
training data, and allows seamless extension to different domain QA
communities, as well as to other semantic equivalence tasks.
Comment: Published in Proceedings of the 2017 ACM Conference on Information
and Knowledge Management (CIKM '17)