Viewpoints: A high-performance high-dimensional exploratory data analysis tool
Scientific data sets continue to increase in both size and complexity. In the
past, dedicated graphics systems at supercomputing centers were required to
visualize large data sets, but as the price of commodity graphics hardware has
dropped and its capability has increased, it is now possible, in principle, to
view large complex data sets on a single workstation. To do this in practice,
an investigator will need software that is written to take advantage of the
relevant graphics hardware. The Viewpoints visualization package described
herein is an example of such software. Viewpoints is an interactive tool for
exploratory visual analysis of large, high-dimensional (multivariate) data. It
leverages the capabilities of modern graphics boards (GPUs) to run on a single
workstation or laptop. Viewpoints is minimalist: it attempts to do a small set
of useful things very well (or at least very quickly) in comparison with
similar packages today. Its basic feature set includes linked scatter plots
with brushing, dynamic histograms, normalization and outlier detection/removal.
Viewpoints was originally designed for astrophysicists, but it has since been
used in a variety of fields that range from astronomy, quantum chemistry, fluid
dynamics, machine learning, bioinformatics, and finance to information
technology server log mining. In this article, we describe the Viewpoints
package and show examples of its usage.
Comment: 18 pages, 3 figures, PASP in press; this version corresponds more
closely to that to be published
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees to select the most suitable technique based on the data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.
Anomaly detection for machine learning redshifts applied to SDSS galaxies
We present an analysis of anomaly detection for machine learning redshift
estimation. Anomaly detection allows the removal of poor training examples,
which can adversely influence redshift estimates. Anomalous training examples
may be photometric galaxies with incorrect spectroscopic redshifts, or galaxies
with one or more poorly measured photometric quantities. We select 2.5 million
'clean' SDSS DR12 galaxies with reliable spectroscopic redshifts, and 6730
'anomalous' galaxies with spectroscopic redshift measurements which are flagged
as unreliable. We contaminate the clean base galaxy sample with galaxies with
unreliable redshifts and attempt to recover the contaminating galaxies using
the Elliptical Envelope technique. We then train four machine learning
architectures for redshift analysis on both the contaminated sample and on the
preprocessed 'anomaly-removed' sample and measure redshift statistics on a
clean validation sample generated without any preprocessing. We find an
improvement on all measured statistics of up to 80% when training on the
anomaly removed sample as compared with training on the contaminated sample for
each of the machine learning routines explored. We further describe a method to
estimate the contamination fraction of a base data sample.
Comment: 13 pages, 8 figures, 1 table, minor text updates to match MNRAS
accepted version
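
The contaminate-then-recover idea in the abstract above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: the data are synthetic 2-D points rather than SDSS photometry, the function names (`fit_gaussian_2d`, `mahalanobis_sq`, `flag_anomalies`) are hypothetical, and the detector is a plain Gaussian fit with a Mahalanobis-distance cut. Elliptic-envelope detectors commonly use a robust covariance estimate instead, so this is a simplified stand-in, not the authors' method.

```python
import random

def fit_gaussian_2d(points):
    """Return the sample mean and covariance terms (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def flag_anomalies(points, contamination):
    """Flag the `contamination` fraction of points farthest from the fitted ellipse."""
    mean, cov = fit_gaussian_2d(points)
    dist = [mahalanobis_sq(p, mean, cov) for p in points]
    n_flag = round(contamination * len(points))
    worst = sorted(range(len(points)), key=dist.__getitem__, reverse=True)[:n_flag]
    flagged = set(worst)
    return [i in flagged for i in range(len(points))]

# Synthetic stand-in data (assumption): a correlated "clean" cloud plus a
# small clump of contaminants placed well away from it.
rng = random.Random(0)
n_clean, n_bad = 500, 10
clean = []
for _ in range(n_clean):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    clean.append((z1, 0.8 * z1 + 0.6 * z2))
bad = [(8 + rng.gauss(0, 0.3), -8 + rng.gauss(0, 0.3)) for _ in range(n_bad)]

# Contaminate the clean sample, then try to recover the contaminants.
sample = clean + bad
flags = flag_anomalies(sample, contamination=n_bad / len(sample))
print(sum(flags[n_clean:]), "of", n_bad, "contaminants flagged")
```

Here the contamination fraction is passed in as known. The abstract notes that the paper also describes a method for estimating it, which this sketch does not attempt.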