A bagging SVM to learn from positive and unlabeled examples
We consider the problem of learning a binary classifier from a training set
of positive and unlabeled examples, both in the inductive and in the
transductive setting. This problem, often referred to as PU learning,
differs from the standard supervised classification problem by the lack of
negative examples in the training set. It corresponds to a ubiquitous
situation in many applications such as information retrieval or gene ranking,
when we have identified a set of data of interest sharing a particular
property, and we wish to automatically retrieve additional data sharing the
same property among a large and easily available pool of unlabeled data. We
propose a conceptually simple method, akin to bagging, to approach both
inductive and transductive PU learning problems, by converting them into a series
of supervised binary classification problems discriminating the known positive
examples from random subsamples of the unlabeled set. We empirically
demonstrate the relevance of the method on simulated and real data, where it
performs at least as well as existing methods while being faster.
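The bagging idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the base learner is a nearest-centroid scorer standing in for the SVM, and the out-of-bag averaging, the function names (`train_base`, `pu_bagging_scores`) and the parameters (`n_rounds`, `subsample_size`, `seed`) are assumptions made for illustration only.

```python
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_base(positives, pseudo_negatives):
    """Nearest-centroid scorer; an assumed stand-in for the SVM base learner."""
    c_pos = centroid(positives)
    c_neg = centroid(pseudo_negatives)
    # Higher score: closer to the positive centroid than to the pseudo-negative one.
    return lambda x: sq_dist(x, c_neg) - sq_dist(x, c_pos)


def pu_bagging_scores(positives, unlabeled, n_rounds=50, subsample_size=None, seed=0):
    """Score each unlabeled point by averaging out-of-bag base-learner scores.

    Each round treats a random subsample of the unlabeled set as negatives,
    trains a base learner against the known positives, and scores only the
    unlabeled points left out of that subsample (an assumed aggregation rule).
    """
    rng = random.Random(seed)
    k = min(subsample_size or len(positives), len(unlabeled))
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        drawn = set(rng.sample(range(len(unlabeled)), k))
        score = train_base(positives, [unlabeled[i] for i in drawn])
        for i, x in enumerate(unlabeled):
            if i not in drawn:
                totals[i] += score(x)
                counts[i] += 1
    # Points that were never out of bag get a neutral score of 0.0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]
```

Unlabeled points with higher averaged scores are the more likely hidden positives. Replacing `train_base` with an actual SVM trainer would bring the sketch closer to the method named in the title.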
Sparsity-accuracy trade-off in MKL
We empirically investigate the best trade-off between sparse and
uniformly-weighted multiple kernel learning (MKL) using the elastic-net
regularization on real and simulated datasets. We find that the best trade-off
parameter depends not only on the sparsity of the true kernel-weight spectrum
but also on the linear dependence among kernels and the number of samples. Comment: 8 pages, 2 figures
A survey of outlier detection methodologies
Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise from mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error, or simply natural deviations in populations. Detecting them can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques drawn from the full gamut of Computer Science and Statistics are now used. In this paper, we present a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.
Using Twitter to learn about the autism community
Considering the rising socio-economic burden of autism spectrum disorder
(ASD), timely and evidence-driven public policy decision making and
communication of the latest guidelines pertaining to the treatment and
management of the disorder are crucial. Yet evidence suggests that policy makers
and medical practitioners do not always have a good understanding of the
practices and relevant beliefs of ASD-afflicted individuals' carers, who often
follow questionable recommendations and adopt advice poorly supported by
scientific data. The key goal of the present work is to explore the idea that
Twitter, as a highly popular platform for information exchange, could be used
as a data-mining source to learn about the population affected by ASD -- their
behaviour, concerns, needs etc. To this end, using a large data set of over 11
million harvested tweets as the basis for our investigation, we describe a
series of experiments which examine a range of linguistic and semantic aspects
of messages posted by individuals interested in ASD. Our findings, the first of
their nature in the published scientific literature, strongly motivate
additional research on this topic and present a methodological basis for
further work. Comment: Social Network Analysis and Mining, 201
Completing Low-Rank Matrices with Corrupted Samples from Few Coefficients in General Basis
Subspace recovery from corrupted and missing data is crucial for various
applications in signal processing and information theory. To complete missing
values and detect column corruptions, existing robust Matrix Completion (MC)
methods mostly concentrate on recovering a low-rank matrix from few corrupted
coefficients w.r.t. the standard basis, which, however, does not apply to a more
general basis, e.g., the Fourier basis. In this paper, we prove that the range
space of an matrix with rank can be exactly recovered from few
coefficients w.r.t. general basis, though and the number of corrupted
samples are both as high as . Our model covers
previous ones as special cases, and robust MC can recover the intrinsic matrix
with a higher rank. Moreover, we suggest a universal choice of the
regularization parameter, which is . By our
filtering algorithm, which has theoretical guarantees, we can
further reduce the computational cost of our model. As an application, we also
find that the solutions to extended robust Low-Rank Representation and to our
extended robust MC are mutually expressible, so both our theory and algorithm
can be applied to the subspace clustering problem with missing values under
certain conditions. Experiments verify our theories. Comment: To appear in IEEE Transactions on Information Theory