3,191 research outputs found
A Convex Relaxation for Weakly Supervised Classifiers
This paper introduces a general multi-class approach to weakly supervised
classification. Inferring the labels and learning the parameters of the model
is usually done jointly through a block-coordinate descent algorithm such as
expectation-maximization (EM), which may lead to local minima. To avoid this
problem, we propose a cost function based on a convex relaxation of the
soft-max loss. We then propose an algorithm specifically designed to
efficiently solve the corresponding semidefinite program (SDP). Empirically,
our method compares favorably to standard ones on different datasets for
multiple instance learning and semi-supervised learning as well as on
clustering tasks.Comment: Appears in Proceedings of the 29th International Conference on
Machine Learning (ICML 2012
Detection of Dispersed Radio Pulses: A machine learning approach to candidate identification and classification
Searching for extraterrestrial, transient signals in astronomical data sets
is an active area of current research. However, machine learning techniques are
lacking in the literature concerning single-pulse detection. This paper
presents a new, two-stage approach for identifying and classifying dispersed
pulse groups (DPGs) in single-pulse search output. The first stage identified
DPGs and extracted features to characterize them using a new peak
identification algorithm which tracks sloping tendencies around local maxima in
plots of signal-to-noise ratio vs. dispersion measure. The second stage used
supervised machine learning to classify DPGs. We created four benchmark data
sets: one unbalanced and three balanced versions using three different
imbalance treatments.We empirically evaluated 48 classifiers by training and
testing binary and multiclass versions of six machine learning algorithms on
each of the four benchmark versions. While each classifier had advantages and
disadvantages, all classifiers with imbalance treatments had higher recall
values than those with unbalanced data, regardless of the machine learning
algorithm used. Based on the benchmarking results, we selected a subset of
classifiers to classify the full, unlabelled data set of over 1.5 million DPGs
identified in 42,405 observations made by the Green Bank Telescope. Overall,
the classifiers using a multiclass ensemble tree learner in combination with
two oversampling imbalance treatments were the most efficient; they identified
additional known pulsars not in the benchmark data set and provided six
potential discoveries, with significantly less false positives than the other
classifiers.Comment: 13 pages, accepted for publication in MNRAS, ref. MN-15-1713-MJ.R
Semi-Supervised First-Person Activity Recognition in Body-Worn Video
Body-worn cameras are now commonly used for logging daily life, sports, and
law enforcement activities, creating a large volume of archived footage. This
paper studies the problem of classifying frames of footage according to the
activity of the camera-wearer with an emphasis on application to real-world
police body-worn video. Real-world datasets pose a different set of challenges
from existing egocentric vision datasets: the amount of footage of different
activities is unbalanced, the data contains personally identifiable
information, and in practice it is difficult to provide substantial training
footage for a supervised approach. We address these challenges by extracting
features based exclusively on motion information then segmenting the video
footage using a semi-supervised classification algorithm. On publicly available
datasets, our method achieves results comparable to, if not better than,
supervised and/or deep learning methods using a fraction of the training data.
It also shows promising results on real-world police body-worn video
Searching for Needles in the Cosmic Haystack
Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best practice. This dissertation is focused on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with Synthetic Majority Oversampling TEchnique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with less false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution performance of RandomForest by 54% on average with less than a 2% average reduction in classification performance. Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (\u3e95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection
- …