Detecting Family Resemblance: Automated Genre Classification.
This paper presents results in automated genre classification of digital documents in PDF format. It describes genre classification as an important ingredient in contextualising scientific data and in retrieving targeted material for improving research. The current paper compares the role of visual layout, stylistic features and language model features in clustering documents and presents results in retrieving five selected genres (Scientific Article, Thesis, Periodicals, Business Report, and Form) from a pool of materials populated with documents of the nineteen most popular genres found in our experimental data set.
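As a hedged illustration of the text-feature side of such a pipeline, the sketch below trains a genre classifier on text extracted from PDFs. The five genre labels come from the abstract; the scikit-learn setup, TF-IDF parameters, and the omission of visual-layout features are simplifying assumptions, not the paper's method.

```python
# Hedged sketch: a text-only genre classifier over text extracted from
# PDFs. Labels are from the abstract; pipeline details are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

GENRES = ["Scientific Article", "Thesis", "Periodicals",
          "Business Report", "Form"]

def train_genre_classifier(texts, labels):
    """texts: strings extracted from PDFs; labels: genre names."""
    clf = make_pipeline(
        TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
        LinearSVC(),
    )
    clf.fit(texts, labels)
    return clf
```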
CharBot: A Simple and Effective Method for Evading DGA Classifiers
Domain generation algorithms (DGAs) are commonly leveraged by malware to create lists of domain names which can be used for command and control (C&C) purposes. Approaches based on machine learning have recently been developed to automatically detect generated domain names in real-time. In this work, we present a novel DGA called CharBot which is capable of producing large numbers of unregistered domain names that are not detected by state-of-the-art classifiers for real-time detection of DGAs, including the recently published methods FANCI (a random forest based on human-engineered features) and LSTM.MI (a deep learning approach). CharBot is very simple, effective and requires no knowledge of the targeted DGA classifiers. We show that retraining the classifiers on CharBot samples is not a viable defense strategy. We believe these findings show that DGA classifiers are inherently vulnerable to adversarial attacks if they rely only on the domain name string to make a decision. Designing a robust DGA classifier may, therefore, necessitate the use of additional information besides the domain name alone. To the best of our knowledge, CharBot is the simplest and most efficient black-box adversarial attack against DGA classifiers proposed to date.
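For intuition, here is a minimal sketch of a character-perturbation DGA in the spirit of CharBot, deriving hard-to-detect domains from benign ones by swapping a couple of characters. The sampling details below are illustrative assumptions, not the authors' published code.

```python
import random
import string

# Illustrative character-perturbation DGA: take a benign domain and
# replace two randomly chosen characters with other valid DNS characters.
DNS_CHARS = string.ascii_lowercase + string.digits + "-"

def perturb(domain: str) -> str:
    name, _, tld = domain.rpartition(".")
    chars = list(name)
    for i in random.sample(range(len(chars)), 2):
        # Pick a replacement different from the current character.
        chars[i] = random.choice(DNS_CHARS.replace(chars[i], ""))
    return "".join(chars) + "." + tld

print(perturb("wikipedia.org"))  # e.g. 'wikipzdua.org' (output varies)
```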
Exploiting the user interaction context for automatic task detection
Detecting the task a user is performing on her computer desktop is important for providing her with contextualized and personalized support. Some recent approaches propose to perform automatic user task detection by means of classifiers using captured user context data. In this paper we improve on that by using an ontology-based user interaction context model that can be automatically populated by (i) capturing simple user interaction events on the computer desktop and (ii) applying rule-based and information extraction mechanisms. We present evaluation results from a large user study we have carried out in a knowledge-intensive business environment, showing that our ontology-based approach provides new contextual features yielding good task detection performance. We also argue that good results can be achieved by training task classifiers 'online' on user context data gathered in laboratory settings. Finally, we isolate a combination of contextual features that presents significantly better discriminative power than classical ones.
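A hedged sketch of the general idea of task detection from captured interaction events (not the paper's ontology-based context model): turn an event stream into simple count features and train a standard classifier. The event schema and feature names below are assumptions.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def event_features(events):
    """events: (application, action) pairs captured for one work session."""
    return Counter(f"{app}:{action}" for app, action in events)

def train_task_classifier(sessions, task_labels):
    # Count-based contextual features fed to a standard classifier.
    clf = make_pipeline(DictVectorizer(), MultinomialNB())
    clf.fit([event_features(s) for s in sessions], task_labels)
    return clf
```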
On Refining Twitter Lists as Ground Truth Data for Multi-Community User Classification
To help scholars and businesses understand and analyse Twitter users, it is useful to have classifiers that can identify the communities that a given user belongs to, e.g. business or politics. Obtaining high quality training data is an important step towards producing an effective multi-community classifier. An efficient approach for creating such ground truth data is to extract users from existing public Twitter lists, where those lists represent different communities, e.g. a list of journalists. However, ground truth datasets obtained using such lists can be noisy, since not all users that belong to a community are good training examples for that community. In this paper, we conduct a thorough failure analysis of a ground truth dataset generated using Twitter lists. We discuss how some categories of users collected from these Twitter public lists could negatively affect the classification performance and therefore should not be used for training. Through experiments with 3 classifiers and 5 communities, we show that removing ambiguous users based on their tweets and profile can indeed result in a 10% increase in F1 performance.
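As a minimal illustration of one plausible refinement step (the paper's failure analysis covers more categories of problematic users than this), the sketch below drops users who appear in lists for more than one community, since their community membership is ambiguous. The data layout is an assumption.

```python
from collections import Counter

def refine_ground_truth(lists_by_community):
    """lists_by_community: dict mapping community name -> set of user ids."""
    counts = Counter(u for users in lists_by_community.values()
                     for u in users)
    # Keep only users whose list membership points to a single community.
    return {community: {u for u in users if counts[u] == 1}
            for community, users in lists_by_community.items()}
```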
Software defect prediction: do different classifiers find the same defects?
During the last 10 years, hundreds of different defect prediction models have been published. The performance of the classifiers used in these models is reported to be similar, with models rarely performing above the predictive performance ceiling of about 80% recall. We investigate the individual defects that four classifiers predict and analyse the level of prediction uncertainty produced by these classifiers. We perform a sensitivity analysis to compare the performance of Random Forest, Naïve Bayes, RPart and SVM classifiers when predicting defects in NASA, open source and commercial datasets. The defect predictions that each classifier makes are captured in a confusion matrix and the prediction uncertainty of each classifier is compared. Despite similar predictive performance values for these four classifiers, each detects different sets of defects. Some classifiers are more consistent in predicting defects than others. Our results confirm that a unique subset of defects can be detected by specific classifiers. However, while some classifiers are consistent in the predictions they make, other classifiers vary in their predictions. Given our results, we conclude that classifier ensembles with decision-making strategies not based on majority voting are likely to perform best in defect prediction.
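The following sketch illustrates the kind of set-level comparison the abstract describes: instead of aggregate scores, it computes which true defects each classifier finds that no other classifier does. The function names and data layout are assumptions.

```python
def true_positives(y_true, y_pred):
    """Indices of actual defects that a classifier correctly flags."""
    return {i for i, (t, p) in enumerate(zip(y_true, y_pred))
            if t == 1 and p == 1}

def unique_detections(y_true, preds_by_clf):
    """Defects each classifier finds that no other classifier does."""
    tps = {name: true_positives(y_true, preds)
           for name, preds in preds_by_clf.items()}
    return {name: tp - set().union(*(other for n, other in tps.items()
                                     if n != name))
            for name, tp in tps.items()}
```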
Transfer Learning for Multi-language Twitter Election Classification
Both politicians and citizens are increasingly embracing social media as a means to disseminate information and comment on various topics, particularly during significant political events, such as elections. Such commentary during elections is also of interest to social scientists and pollsters. To facilitate the study of social media during elections, there is a need to automatically identify posts that are topically related to those elections. However, current studies have focused on elections within English-speaking regions, and hence the resultant election content classifiers are only applicable for elections in countries where the predominant language is English. On the other hand, as social media is becoming more prevalent worldwide, there is an increasing need for election classifiers that can be generalised across different languages, without building a training dataset for each election. In this paper, based upon transfer learning, we study the development of effective and reusable election classifiers for use on social media across multiple languages. We combine transfer learning with different classifiers such as Support Vector Machines (SVM) and state-of-the-art Convolutional Neural Networks (CNN), which make use of word embedding representations for each social media post. We generalise the learned classifier models for cross-language classification by using a linear translation approach to map the word embedding vectors from one language into another. Experiments conducted over two election datasets in different languages show that without using any training data from the target language, linear translations outperform a classical transfer learning approach, namely Transfer Component Analysis (TCA), by 80% in recall and 25% in F1 measure.
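A minimal sketch of the linear translation idea referenced above: learn a matrix W that maps source-language word embeddings into the target language's embedding space from a seed dictionary of translation pairs, then map post embeddings before classification. The least-squares fit and seed-dictionary setup are standard assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def fit_linear_map(X_src, Y_tgt):
    """X_src, Y_tgt: (n_pairs, dim) embeddings of seed translation pairs."""
    W, *_ = np.linalg.lstsq(X_src, Y_tgt, rcond=None)
    return W  # (dim, dim) map from source space to target space

def translate(vectors, W):
    """Map source-language post embeddings into the target space."""
    return vectors @ W
```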
- …