Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers
Machine Learning (ML) algorithms are used to train computers to perform a
variety of complex tasks and improve with experience. Computers learn how to
recognize patterns, make unintended decisions, or react to a dynamic
environment. Certain trained machines may be more effective than others because
they are based on more suitable ML algorithms or because they were trained
through superior training sets. Although ML algorithms are known and publicly
released, training sets may not be reasonably ascertainable and, indeed, may be
guarded as trade secrets. While much research has been performed about the
privacy of the elements of training sets, in this paper we focus our attention
on ML classifiers and on the statistical information that can be unconsciously
or maliciously revealed from them. We show that it is possible to infer
unexpected but useful information from ML classifiers. In particular, we build
a novel meta-classifier and train it to hack other classifiers, obtaining
meaningful information about their training sets. This kind of information
leakage can be exploited, for example, by a vendor to build more effective
classifiers or to simply acquire trade secrets from a competitor's apparatus,
potentially violating its intellectual property rights.
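The meta-classifier idea in the abstract above can be illustrated with a deliberately small toy. This is a sketch under stated assumptions, not the paper's method: each "shadow" classifier is a single threshold on 1-D data, the hidden training-set property is a shift in the class-1 mean, and the meta-classifier is itself just a threshold on the shadow models' learned parameter. All names (`train_shadow`, `train_meta`) and all numeric settings are hypothetical.

```python
import random
import statistics


def train_shadow(has_property, rng, n=200):
    """Train a toy 1-D classifier and return its only parameter (a threshold).

    Assumption: class 0 ~ N(0, 1); class 1 ~ N(2, 1), or N(3, 1) when the
    hidden training-set property holds.
    """
    mu1 = 3.0 if has_property else 2.0
    x0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x1 = [rng.gauss(mu1, 1.0) for _ in range(n)]
    # The "trained model" is the midpoint between the two class means.
    return (statistics.mean(x0) + statistics.mean(x1)) / 2.0


def train_meta(params, labels):
    """Meta-classifier: a cut on the shadow models' parameter."""
    with_prop = [p for p, y in zip(params, labels) if y]
    without = [p for p, y in zip(params, labels) if not y]
    return (statistics.mean(with_prop) + statistics.mean(without)) / 2.0


rng = random.Random(0)

# Shadow models trained on data whose property we control and therefore know.
shadow_labels = [i % 2 == 0 for i in range(40)]
shadow_params = [train_shadow(y, rng) for y in shadow_labels]
cut = train_meta(shadow_params, shadow_labels)

# "Target" models: only their parameter is inspected, never their training data.
targets = [(train_shadow(y, rng), y) for y in [True, False] * 10]
accuracy = sum((param > cut) == y for param, y in targets) / len(targets)
print(f"property inferred correctly for {accuracy:.0%} of target models")
```

The point of the toy is only that a model's parameters can carry a statistical trace of its training set; real classifiers need a richer parameter encoding and a stronger meta-classifier.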
A Multiple Classifier System Identifies Novel Cannabinoid CB2 Receptor Ligands
Drugs have become an essential part of our lives due to their ability to improve people’s
health and quality of life. However, for many diseases, approved drugs are not yet available
or existing drugs have undesirable side effects, making the pharmaceutical industry strive to
discover new drugs and active compounds. The development of drugs is an expensive
process, which typically starts with the detection of candidate molecules (screening) for an
identified protein target. To this end, high-performance screening techniques have
become critical for mitigating the high costs. Therefore, the popularity of
computer-based screening (often called virtual screening or in-silico screening) has rapidly
increased during the last decade. A wide variety of Machine Learning (ML) techniques has
been used in conjunction with chemical structure and physicochemical properties for
screening purposes including (i) simple classifiers, (ii) ensemble methods, and more recently
(iii) Multiple Classifier Systems (MCS). In this work, we apply an MCS for virtual screening
(D2-MCS) using circular fingerprints. We applied our technique to a dataset of cannabinoid
CB2 ligands obtained from the ChEMBL database. The Enamine HTS collection
(1,834,362 compounds) was virtually screened with D2-MCS, identifying 48,432 potentially active
molecules. This list was subsequently clustered based on circular fingerprints, and from
each cluster the most active compound was retained. From these, the top 60 were kept,
and 21 novel compounds were purchased. Experimental validation confirmed six highly
active hits (>50% displacement at 10 μM and subsequent Ki determination) and an
additional five medium active hits (>25% displacement at 10 μM). D2-MCS hence provided a
hit rate of 29% for highly active compounds and an overall hit rate of 52%.
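The hit rates quoted above follow directly from the counts given in the abstract (21 compounds purchased, six highly active, five medium active). A minimal check of that arithmetic, with the helper name `hit_rate` chosen here purely for illustration:

```python
def hit_rate(hits, tested):
    """Return the hit rate as a percentage of the compounds tested."""
    return 100.0 * hits / tested


tested = 21        # novel compounds purchased and assayed
high_active = 6    # >50% displacement at 10 uM
medium_active = 5  # >25% displacement at 10 uM

high_rate = hit_rate(high_active, tested)
overall_rate = hit_rate(high_active + medium_active, tested)
print(f"{high_rate:.0f}% highly active, {overall_rate:.0f}% overall")
# prints "29% highly active, 52% overall"
```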
EC3: Combining Clustering and Classification for Ensemble Learning
Classification and clustering algorithms have each proven successful
in different contexts. Both of them have their own advantages and
limitations. For instance, although classification algorithms are more powerful
than clustering methods in predicting class labels of objects, they do not
perform well when there is a lack of sufficient manually labeled reliable data.
On the other hand, although clustering algorithms do not produce label
information for objects, they provide supplementary constraints (e.g., if two
objects are clustered together, it is more likely that the same label is
assigned to both of them) that one can leverage for label prediction of a set
of unknown objects. Therefore, systematic utilization of both these types of
algorithms together can lead to better prediction performance. In this paper,
we propose a novel algorithm, called EC3, that merges classification and
clustering together in order to support both binary and multi-class
classification. EC3 is based on a principled combination of multiple
classification and multiple clustering methods using an optimization function.
We theoretically show the convexity and optimality of the problem and solve it
by the block coordinate descent method. We additionally propose iEC3, a variant of
EC3 that handles imbalanced training data. We perform an extensive experimental
analysis by comparing EC3 and iEC3 with 14 baseline methods (7 well-known
standalone classifiers, 5 ensemble classifiers, and 2 existing methods that
merge classification and clustering) on 13 standard benchmark datasets. We show
that our methods outperform the baselines on every dataset, achieving
up to 10% higher AUC. Moreover, our methods are faster (1.21 times faster than
the best baseline) and more resilient to noise and class imbalance than the best
baseline method.
Comment: 14 pages, 7 figures, 11 tables
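The intuition stated in the abstract above, that objects clustered together are more likely to share a label, can be sketched in a few lines. This is not EC3 itself, which solves a convex optimization over multiple classifiers and clusterings. It is a much simpler stand-in that blends each classifier score with the mean score of its cluster. The function name `smooth_with_clusters`, the blending weight `alpha`, and the example data are all assumptions made for illustration.

```python
from collections import defaultdict


def smooth_with_clusters(scores, clusters, alpha=0.5):
    """Blend each classifier score with the mean score of its cluster.

    scores   -- per-object probability of the positive class from a classifier
    clusters -- per-object cluster id from any clustering method
    alpha    -- weight given to the cluster mean (hypothetical parameter)
    """
    members = defaultdict(list)
    for score, cluster in zip(scores, clusters):
        members[cluster].append(score)
    means = {c: sum(v) / len(v) for c, v in members.items()}
    return [(1 - alpha) * s + alpha * means[c] for s, c in zip(scores, clusters)]


# Made-up scores: objects 2 and 5 disagree with the rest of their clusters.
scores = [0.9, 0.8, 0.4, 0.1, 0.2, 0.6]
clusters = [0, 0, 0, 1, 1, 1]

raw_labels = [int(s >= 0.5) for s in scores]
smoothed = smooth_with_clusters(scores, clusters)
labels = [int(s >= 0.5) for s in smoothed]
print(raw_labels, labels)
```

In this example the cluster constraint flips the two borderline objects so that each cluster receives a single label, which is the kind of supplementary signal the abstract describes.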
Meta learning of bounds on the Bayes classifier error
Meta learning uses information from base learners (e.g. classifiers or
estimators) as well as information about the learning problem to improve upon
the performance of a single base learner. For example, the Bayes error rate of
a given feature space, if known, can be used to aid in choosing a classifier,
as well as in feature selection and model selection for the base classifiers
and the meta classifier. Recent work in the field of f-divergence functional
estimation has led to the development of simple and rapidly converging
estimators that can be used to estimate various bounds on the Bayes error. We
estimate multiple bounds on the Bayes error using an estimator that applies
meta learning to slowly converging plug-in estimators to obtain the parametric
convergence rate. We compare the estimated bounds empirically on simulated data
and then estimate the tighter bounds on features extracted from an image patch
analysis of sunspot continuum and magnetogram images.
Comment: 6 pages, 3 figures, to appear in proceedings of 2015 IEEE Signal
Processing and SP Education Workshop
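To make "bounds on the Bayes error" concrete, the sketch below evaluates one classical pair of bounds, the Bhattacharyya bounds for two equally likely classes, and compares them with the exact Bayes error. The abstract does not say which bounds the paper estimates, so this choice is an assumption. The paper also estimates its bounds from samples, whereas this example assumes two known 1-D Gaussians with a shared variance so that everything has a closed form. All function names are illustrative.

```python
import math


def bhattacharyya_coefficient(mu0, mu1, sigma):
    """Bhattacharyya coefficient of N(mu0, sigma^2) and N(mu1, sigma^2)."""
    return math.exp(-((mu0 - mu1) ** 2) / (8.0 * sigma ** 2))


def bayes_error_bounds(bc):
    """Lower and upper Bhattacharyya bounds on the Bayes error, equal priors."""
    lower = 0.5 - 0.5 * math.sqrt(1.0 - bc ** 2)
    upper = 0.5 * bc
    return lower, upper


def exact_bayes_error(mu0, mu1, sigma):
    """Exact Bayes error for two equal-variance Gaussians with equal priors."""
    z = -abs(mu0 - mu1) / (2.0 * sigma)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


bc = bhattacharyya_coefficient(0.0, 2.0, 1.0)
lower, upper = bayes_error_bounds(bc)
exact = exact_bayes_error(0.0, 2.0, 1.0)
print(f"{lower:.3f} <= {exact:.3f} <= {upper:.3f}")
```

When the densities are unknown, the coefficient has to be estimated from data, which is where divergence estimators of the kind the abstract mentions come in.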