Automated novelty detection in the WISE survey with one-class support vector machines
Wide-angle photometric surveys of previously uncharted sky areas or
wavelength regimes will always bring in unexpected sources whose existence and
properties cannot be easily predicted from earlier observations: novelties or
even anomalies. Such objects can be efficiently sought with novelty
detection algorithms. Here we present an application of such a method, called
one-class support vector machines (OCSVM), to search for anomalous patterns
among sources preselected from the mid-infrared AllWISE catalogue covering the
whole sky. To create a model of expected data we train the algorithm on a set
of objects with spectroscopic identifications from the SDSS DR13 database,
present also in AllWISE. OCSVM detects as anomalous those sources whose
patterns - WISE photometric measurements in this case - are inconsistent with
the model. Among the detected anomalies we find artefacts, such as objects with
spurious photometry due to blending, but most importantly also real sources of
genuine astrophysical interest. Among the latter, OCSVM has identified a sample
of heavily reddened AGN/quasar candidates distributed uniformly over the sky
and in a large part absent from other WISE-based AGN catalogues. It also
allowed us to find a specific group of sources of mixed types, mostly stars and
compact galaxies. By combining the semi-supervised OCSVM algorithm with
standard classification methods it will be possible to improve the latter by
accounting for sources which are not present in the training sample but are
otherwise well-represented in the target set. Anomaly detection adds
flexibility to automated source separation procedures and helps verify the
reliability and representativeness of the training samples. It should thus be
considered an essential step in supervised classification schemes to ensure
completeness and purity of produced catalogues.
Comment: 14 pages, 15 figures
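For orientation, the textbook one-class SVM formulation (due to Schölkopf et al.) can be written as below. The abstract itself does not state it, so the symbols are assumptions for illustration: $\Phi$ is the kernel feature map, $k$ the kernel, $\nu \in (0, 1]$ the parameter that bounds the fraction of training points treated as outliers, $n$ the number of training sources, and $\alpha_i$ the dual coefficients.

```latex
\min_{w,\,\xi,\,\rho}\quad
  \frac{1}{2}\lVert w \rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\qquad \text{subject to} \qquad
  \langle w, \Phi(x_i) \rangle \ge \rho - \xi_i, \quad \xi_i \ge 0,

f(x) = \operatorname{sgn}\!\Big( \sum_{i=1}^{n} \alpha_i\, k(x_i, x) - \rho \Big).
```

Under this formulation a source with $f(x) = -1$ lies outside the region learned from the training set. In the setting described above, that would correspond to WISE photometry inconsistent with the SDSS-based model.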
A Graph-Based Semi-Supervised k Nearest-Neighbor Method for Nonlinear Manifold Distributed Data Classification
k Nearest Neighbors (kNN) is one of the most widely used supervised
learning algorithms to classify Gaussian distributed data, but it does not
achieve good results when it is applied to nonlinear manifold distributed data,
especially when only a very limited number of labeled samples is available. In
this paper, we propose a new graph-based kNN algorithm which can effectively
handle both Gaussian distributed data and nonlinear manifold distributed data.
To achieve this goal, we first propose a constrained Tired Random Walk (TRW) by
constructing an -level nearest-neighbor strengthened tree over the graph,
and then compute a TRW matrix for similarity measurement purposes. After this,
the nearest neighbors are identified according to the TRW matrix and the class
label of a query point is determined by the sum of all the TRW weights of its
nearest neighbors. To deal with online situations, we also propose a new
algorithm to handle sequential samples based on a local neighborhood
reconstruction. Comparison experiments are conducted on both synthetic data
sets and real-world data sets to demonstrate the validity of the proposed new
kNN algorithm and its improvements over other versions of kNN algorithms.
Given the widespread appearance of manifold structures in real-world problems
and the popularity of the traditional kNN algorithm, the proposed manifold
version of kNN shows promising potential for classifying manifold-distributed
data.
Comment: 32 pages, 12 figures, 7 tables
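The general idea of voting with graph-based random-walk weights instead of raw Euclidean distances can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's TRW construction: the similarity here is a truncated damped random-walk series over a plain k-nearest-neighbor graph, the strengthened tree is omitted, and all names (`knn_graph`, `walk_similarity`, `classify`, `alpha`, `steps`) are hypothetical.

```python
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph with Gaussian edge weights."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Kernel width: mean distance to the k-th neighbor (an illustrative choice).
    sigma = sum(sorted(row)[k] for row in dist) / n or 1.0
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:k]:
            adj[i][j] = adj[j][i] = math.exp(-((dist[i][j] / sigma) ** 2))
    return adj


def walk_similarity(adj, alpha=0.9, steps=50):
    """Truncated series sum_{t=0..steps} (alpha * P)^t, with P the
    row-normalized adjacency matrix. Assumed stand-in for a TRW matrix."""
    n = len(adj)
    trans = []
    for row in adj:
        total = sum(row)
        trans.append([w / total if total else 0.0 for w in row])
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in sim]
    for _ in range(steps):
        # term <- alpha * term @ trans (one more damped step of the walk)
        term = [
            [alpha * sum(term[i][m] * trans[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)
        ]
        for i in range(n):
            for j in range(n):
                sim[i][j] += term[i][j]
    return sim


def classify(sim, labels, query, k):
    """Sum the walk weights of the k most similar labeled nodes, per class."""
    labeled = [j for j in labels if j != query]
    nearest = sorted(labeled, key=lambda j: sim[query][j], reverse=True)[:k]
    votes = {}
    for j in nearest:
        votes[labels[j]] = votes.get(labels[j], 0.0) + sim[query][j]
    return max(votes, key=votes.get)


# Two parallel chains; only one labeled point per chain.
points = [(float(i), 0.0) for i in range(10)] + [(float(i), 3.0) for i in range(10)]
labels = {0: "A", 19: "B"}
sim = walk_similarity(knn_graph(points, k=2))
predicted = classify(sim, labels, query=9, k=1)
```

In this toy setup the query point (9, 0) is closer in Euclidean distance to the labeled "B" point at (9, 3) than to the labeled "A" point at (0, 0), so plain 1-NN would pick "B". The walk-based similarity follows the chain the query belongs to and assigns "A".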
Semi-supervised LC/MS alignment for differential proteomics
Motivation: Mass spectrometry (MS) combined with high-performance liquid chromatography (LC) has received considerable attention for high-throughput analysis of proteomes. Isotopic labeling techniques such as ICAT [5,6] have been successfully applied to derive differential quantitative information for two protein samples, but at the price of significantly increased complexity of the experimental setup. To overcome these limitations, we consider a label-free setting where correspondences between elements of two samples have to be established prior to the comparative analysis. The alignment between samples is achieved by nonlinear robust ridge regression. The correspondence estimates are guided in a semi-supervised fashion by prior information which is derived from sequenced tandem mass spectra.
Results: The semi-supervised method for finding correspondences was successfully applied to aligning highly complex protein samples, even if they exhibit large variations due to different biological conditions. A large-scale experiment clearly demonstrates that the proposed method bridges the gap between statistical data analysis and label-free quantitative differential proteomics.
Availability: The software will be available on the website
Contact: [email protected]
Radio Galaxy Zoo: Knowledge Transfer Using Rotationally Invariant Self-Organising Maps
With the advent of large scale surveys the manual analysis and classification
of individual radio source morphologies is rendered impossible as existing
approaches do not scale. The analysis of complex morphological features in the
spatial domain is a particularly important task. Here we discuss the challenges
of transferring crowdsourced labels obtained from the Radio Galaxy Zoo project
and introduce a proper transfer mechanism via quantile random forest
regression. By using parallelized rotation and flipping invariant Kohonen-maps,
image cubes of Radio Galaxy Zoo selected galaxies formed from the FIRST radio
continuum and WISE infrared all sky surveys are first projected down to a
two-dimensional embedding in an unsupervised way. This embedding can be seen as
a discretised space of shapes with the coordinates reflecting morphological
features as expressed by the automatically derived prototypes. We find that
these prototypes have reconstructed physically meaningful processes across two
channel images at radio and infrared wavelengths in an unsupervised manner. In
the second step, images are compared with those prototypes to create a
heat-map, which is the morphological fingerprint of each object and the basis
for transferring the user generated labels. These heat-maps have reduced the
feature space by a factor of 248 and can be used as the basis for
subsequent ML methods. Using an ensemble of decision trees we achieve upwards
of 85.7% and 80.7% accuracy when predicting the number of components and peaks
in an image, respectively, using these heat-maps. We also question the
currently used discrete classification schema and introduce a continuous scale
that better reflects the uncertainty in transition between two classes, caused
by sensitivity and resolution limits.
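The heat-map step described above, comparing an image with every prototype while ignoring orientation, can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes square single-channel images, restricts rotations to multiples of 90 degrees plus a flip (the paper's maps handle rotation more finely), and uses hypothetical names throughout.

```python
import math


def rotate90(img):
    """Rotate a square image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip(img):
    """Mirror an image left to right."""
    return [row[::-1] for row in img]


def dihedral_variants(img):
    """The 8 images reachable by 90-degree rotations and a flip."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)
        variants.append(flip(current))
        current = rotate90(current)
    return variants


def distance(a, b):
    """Euclidean distance between two images of equal shape."""
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


def invariant_distance(img, prototype):
    """Smallest distance to the prototype over all 8 orientations."""
    return min(distance(v, prototype) for v in dihedral_variants(img))


def heat_map(img, prototypes):
    """Distance of one image to every prototype on a 2-D grid of prototypes."""
    return [[invariant_distance(img, p) for p in row] for row in prototypes]


# Toy 1x2 prototype grid: a corner blob and a filled square.
prototypes = [[[[1, 0], [0, 0]], [[1, 1], [1, 1]]]]
image = [[0, 0], [0, 1]]  # the corner blob, rotated by 180 degrees
heat = heat_map(image, prototypes)
```

The resulting grid has one value per prototype, so it is much smaller than the input image cube while still recording which learned shapes the object resembles. Because the minimum is taken over all orientations, a rotated or mirrored copy of the same image yields the same heat-map.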
Incremental Generalized Category Discovery
We explore the problem of Incremental Generalized Category Discovery (IGCD).
This is a challenging category incremental learning setting where the goal is
to develop models that can correctly categorize images from previously seen
categories, in addition to discovering novel ones. Learning is performed over a
series of time steps where the model obtains new labeled and unlabeled data,
and discards old data, at each iteration. The difficulty of the problem is
compounded in our generalized setting as the unlabeled data can contain images
from categories that may or may not have been observed before. We present a new
method for IGCD which combines non-parametric categorization with efficient
image sampling to mitigate catastrophic forgetting. To quantify performance, we
propose a new benchmark dataset named iNatIGCD that is motivated by a
real-world fine-grained visual categorization task. In our experiments we
outperform existing related methods.
Comment: This paper is accepted at ICCV 202