5,681 research outputs found
One-Class Classification: Taxonomy of Study and Review of Techniques
One-class classification (OCC) algorithms aim to build classification models
when the negative class is either absent, poorly sampled or not well defined.
This unique situation constrains the learning of efficient classifiers by
defining class boundary just with the knowledge of positive class. The OCC
problem has been considered and applied under many research themes, such as
outlier/novelty detection and concept learning. In this paper we present a
unified view of the general problem of OCC by presenting a taxonomy of study
for OCC problems, which is based on the availability of training data,
algorithms used and the application domains applied. We further delve into each
of the categories of the proposed taxonomy and present a comprehensive
literature review of the OCC algorithms, techniques and methodologies with a
focus on their significance, limitations and applications. We conclude our
paper by discussing some open research problems in the field of OCC and present
our vision for future research.Comment: 24 pages + 11 pages of references, 8 figure
A bagging SVM to learn from positive and unlabeled examples
We consider the problem of learning a binary classifier from a training set
of positive and unlabeled examples, both in the inductive and in the
transductive setting. This problem, often referred to as \emph{PU learning},
differs from the standard supervised classification problem by the lack of
negative examples in the training set. It corresponds to an ubiquitous
situation in many applications such as information retrieval or gene ranking,
when we have identified a set of data of interest sharing a particular
property, and we wish to automatically retrieve additional data sharing the
same property among a large and easily available pool of unlabeled data. We
propose a conceptually simple method, akin to bagging, to approach both
inductive and transductive PU learning problems, by converting them into series
of supervised binary classification problems discriminating the known positive
examples from random subsamples of the unlabeled set. We empirically
demonstrate the relevance of the method on simulated and real data, where it
performs at least as well as existing methods while being faster
Active learning in annotating micro-blogs dealing with e-reputation
Elections unleash strong political views on Twitter, but what do people
really think about politics? Opinion and trend mining on micro blogs dealing
with politics has recently attracted researchers in several fields including
Information Retrieval and Machine Learning (ML). Since the performance of ML
and Natural Language Processing (NLP) approaches are limited by the amount and
quality of data available, one promising alternative for some tasks is the
automatic propagation of expert annotations. This paper intends to develop a
so-called active learning process for automatically annotating French language
tweets that deal with the image (i.e., representation, web reputation) of
politicians. Our main focus is on the methodology followed to build an original
annotated dataset expressing opinion from two French politicians over time. We
therefore review state of the art NLP-based ML algorithms to automatically
annotate tweets using a manual initiation step as bootstrap. This paper focuses
on key issues about active learning while building a large annotated data set
from noise. This will be introduced by human annotators, abundance of data and
the label distribution across data and entities. In turn, we show that Twitter
characteristics such as the author's name or hashtags can be considered as the
bearing point to not only improve automatic systems for Opinion Mining (OM) and
Topic Classification but also to reduce noise in human annotations. However, a
later thorough analysis shows that reducing noise might induce the loss of
crucial information.Comment: Journal of Interdisciplinary Methodologies and Issues in Science -
Vol 3 - Contextualisation digitale - 201
Detecting Sockpuppets in Deceptive Opinion Spam
This paper explores the problem of sockpuppet detection in deceptive opinion
spam using authorship attribution and verification approaches. Two methods are
explored. The first is a feature subsampling scheme that uses the KL-Divergence
on stylistic language models of an author to find discriminative features. The
second is a transduction scheme, spy induction that leverages the diversity of
authors in the unlabeled test set by sending a set of spies (positive samples)
from the training set to retrieve hidden samples in the unlabeled test set
using nearest and farthest neighbors. Experiments using ground truth sockpuppet
data show the effectiveness of the proposed schemes.Comment: 18 pages, Accepted at CICLing 2017, 18th International Conference on
Intelligent Text Processing and Computational Linguistic
Cross-Lingual Adaptation using Structural Correspondence Learning
Cross-lingual adaptation, a special case of domain adaptation, refers to the
transfer of classification knowledge between two languages. In this article we
describe an extension of Structural Correspondence Learning (SCL), a recently
proposed algorithm for domain adaptation, for cross-lingual adaptation. The
proposed method uses unlabeled documents from both languages, along with a word
translation oracle, to induce cross-lingual feature correspondences. From these
correspondences a cross-lingual representation is created that enables the
transfer of classification knowledge from the source to the target language.
The main advantages of this approach over other approaches are its resource
efficiency and task specificity.
We conduct experiments in the area of cross-language topic and sentiment
classification involving English as source language and German, French, and
Japanese as target languages. The results show a significant improvement of the
proposed method over a machine translation baseline, reducing the relative
error due to cross-lingual adaptation by an average of 30% (topic
classification) and 59% (sentiment classification). We further report on
empirical analyses that reveal insights into the use of unlabeled data, the
sensitivity with respect to important hyperparameters, and the nature of the
induced cross-lingual correspondences
ICE: Enabling Non-Experts to Build Models Interactively for Large-Scale Lopsided Problems
Quick interaction between a human teacher and a learning machine presents
numerous benefits and challenges when working with web-scale data. The human
teacher guides the machine towards accomplishing the task of interest. The
learning machine leverages big data to find examples that maximize the training
value of its interaction with the teacher. When the teacher is restricted to
labeling examples selected by the machine, this problem is an instance of
active learning. When the teacher can provide additional information to the
machine (e.g., suggestions on what examples or predictive features should be
used) as the learning task progresses, then the problem becomes one of
interactive learning.
To accommodate the two-way communication channel needed for efficient
interactive learning, the teacher and the machine need an environment that
supports an interaction language. The machine can access, process, and
summarize more examples than the teacher can see in a lifetime. Based on the
machine's output, the teacher can revise the definition of the task or make it
more precise. Both the teacher and the machine continuously learn and benefit
from the interaction.
We have built a platform to (1) produce valuable and deployable models and
(2) support research on both the machine learning and user interface challenges
of the interactive learning problem. The platform relies on a dedicated,
low-latency, distributed, in-memory architecture that allows us to construct
web-scale learning machines with quick interaction speed. The purpose of this
paper is to describe this architecture and demonstrate how it supports our
research efforts. Preliminary results are presented as illustrations of the
architecture but are not the primary focus of the paper
- …