Five Tales of Random Forest Regression
We present a set of variations on the theme of Random Forest regression. Two applications to the problem of estimating galactic distances from photometry produce results comparable to or better than all other current approaches to the problem. An extension of the methodology produces error distribution variance estimates for individual regression estimates, a property that appears unique among non-parametric regression estimators. An exponential asymptotic improvement in algorithmic training speed over the current de facto standard implementation is derived from a theoretical model of the training process combined with competent software engineering. A massively parallel implementation of the regression algorithm for a GPGPU cluster, integrated with a distributed database management system, results in a fast round-trip ingest-analyze-archive procedure on a system with total power consumption under 1 kW. Finally, a novel theoretical comparison of the methodology with kernel regression relates the Random Forest bootstrap sample size to the kernel regression bandwidth parameter, resulting in a novel extension of the Random Forest methodology that offers lower mean-squared error than the standard methodology.
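As a rough illustration of the idea of attaching a variance to each individual regression estimate, the sketch below bags one-dimensional regression trees and reports the spread of the per-tree predictions alongside their mean. This is a minimal sketch under stated assumptions, not the method of the work above: the abstract does not describe its estimator, the per-tree spread is only a proxy for an error variance, and every name here (`fit_tree`, `fit_forest`, `predict_with_spread`, `sample_size`, the toy data) is hypothetical. With a single input feature there is no per-split feature subsampling, so this is closer to plain bagging than to a full Random Forest.

```python
import random


def avg(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = avg(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(points, min_leaf=3):
    """Fit a 1-D regression tree to (x, y) pairs.

    Returns a float for a leaf, or (threshold, left_subtree, right_subtree).
    """
    points = sorted(points)
    ys = [y for _, y in points]
    best_cost, best_i = None, None
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical x values
        cost = sse(ys[:i]) + sse(ys[i:])
        if best_cost is None or cost < best_cost:
            best_cost, best_i = cost, i
    if best_i is None:
        return avg(ys)  # too few points (or no valid split): make a leaf
    threshold = 0.5 * (points[best_i - 1][0] + points[best_i][0])
    return (threshold,
            fit_tree(points[:best_i], min_leaf),
            fit_tree(points[best_i:], min_leaf))


def predict_tree(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def fit_forest(points, n_trees=30, sample_size=None, min_leaf=3, seed=0):
    """Fit each tree on a bootstrap sample drawn with replacement.

    sample_size is the bootstrap sample size (defaults to len(points)).
    """
    rng = random.Random(seed)
    k = sample_size or len(points)
    return [fit_tree([rng.choice(points) for _ in range(k)], min_leaf)
            for _ in range(n_trees)]


def predict_with_spread(forest, x):
    """Return (mean prediction, variance of the per-tree predictions)."""
    preds = [predict_tree(tree, x) for tree in forest]
    m = avg(preds)
    return m, sum((p - m) ** 2 for p in preds) / len(preds)


# Toy data (hypothetical): y = 2x plus Gaussian noise.
data_rng = random.Random(1)
train = [(x, 2.0 * x + data_rng.gauss(0.0, 0.1))
         for x in (data_rng.random() for _ in range(120))]

forest = fit_forest(train, n_trees=30, seed=2)
estimate, spread = predict_with_spread(forest, 0.5)
print(f"estimate at x=0.5: {estimate:.3f}, per-tree variance: {spread:.5f}")
```

The `sample_size` argument is exposed because the abstract relates the bootstrap sample size to a kernel bandwidth; varying it in this toy setting changes how smooth the averaged prediction is.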
New Approaches To Photometric Redshift Prediction Via Gaussian Process Regression In The Sloan Digital Sky Survey
Expanding upon the work of Way and Srivastava 2006 we demonstrate how the use
of training sets of comparable size continues to make Gaussian process
regression (GPR) a competitive approach to that of neural networks and other
least-squares fitting methods. This is possible via new large size matrix
inversion techniques developed for Gaussian processes (GPs) that do not require
that the kernel matrix be sparse. This development, combined with a
neural-network kernel function, appears to give superior results for this
problem. Our best fit results for the Sloan Digital Sky Survey (SDSS) Main
Galaxy Sample using u,g,r,i,z filters give an rms error of 0.0201 while our
results for the same filters in the luminous red galaxy sample yield 0.0220. We
also demonstrate that there appears to be a minimum number of training-set
galaxies needed to obtain the optimal fit when using our GPR rank-reduction
methods. We find that morphological information included with many photometric
surveys appears, for the most part, to make the photometric redshift evaluation
slightly worse rather than better. This would indicate that most morphological
information simply adds noise from the GP point of view in the data used
herein. In addition, we show that cross-match catalog results involving
combinations of the Two Micron All Sky Survey, SDSS, and Galaxy Evolution
Explorer have to be evaluated in the context of the resulting cross-match
magnitude and redshift distribution. Otherwise one may be misled into overly
optimistic conclusions.
Comment: 32 pages, ApJ in Press, 2 new figures, 1 new table of comparison
methods, updated discussion, references and typos to reflect version in Press
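For orientation, one widely used neural-network covariance function for Gaussian processes is the arcsine kernel found in standard GP references; whether this is exactly the form used in the work above is an assumption. In this sketch, $\tilde{x} = (1, x_1, \dots, x_d)^{\top}$ is the augmented input vector (for example, the five magnitudes), $\Sigma$ is a matrix of hyperparameters, $K$ is the kernel matrix over the training set, $k_*$ is the vector of kernel values between a test galaxy and the training galaxies, $z$ is the vector of training redshifts, and $\sigma_n^2$ is the noise variance. The second line is the usual GP predictive mean.

```latex
k_{\mathrm{NN}}(x, x') = \frac{2}{\pi}
  \arcsin\!\left(
    \frac{2\,\tilde{x}^{\top} \Sigma\, \tilde{x}'}
         {\sqrt{\left(1 + 2\,\tilde{x}^{\top} \Sigma\, \tilde{x}\right)
                \left(1 + 2\,\tilde{x}'^{\top} \Sigma\, \tilde{x}'\right)}}
  \right)

\bar{z}_* = k_*^{\top} \left( K + \sigma_n^2 I \right)^{-1} z
```

The inverse of $K + \sigma_n^2 I$ is the step whose cost grows with training-set size, which is where large-matrix inversion or rank-reduction techniques of the kind the abstract mentions would enter.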
ArborZ: Photometric Redshifts Using Boosted Decision Trees
Precision photometric redshifts will be essential for extracting cosmological
parameters from the next generation of wide-area imaging surveys. In this paper
we introduce a photometric redshift algorithm, ArborZ, based on the
machine-learning technique of Boosted Decision Trees. We study the algorithm
using galaxies from the Sloan Digital Sky Survey and from mock catalogs
intended to simulate both the SDSS and the upcoming Dark Energy Survey. We show
that it improves upon the performance of existing algorithms. Moreover, the
method naturally leads to the reconstruction of a full probability density
function (PDF) for the photometric redshift of each galaxy, not merely a single
"best estimate" and error, and also provides a photo-z quality figure-of-merit
for each galaxy that can be used to reject outliers. We show that the stacked
PDFs yield a more accurate reconstruction of the redshift distribution N(z). We
discuss limitations of the current algorithm and ideas for future work.
Comment: 10 pages, 13 figures, submitted to Ap
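The stacking of per-galaxy PDFs into an estimate of N(z) can be sketched as follows. This is a minimal illustration, not the ArborZ implementation: it assumes every PDF is tabulated on the same uniform redshift grid with bin width `dz`, and the function names and toy PDFs are hypothetical. For contrast it also builds the histogram obtained when each galaxy is collapsed to its single most probable bin.

```python
def normalize(pdf, dz):
    """Scale a tabulated PDF so that it integrates to 1 on the grid."""
    area = sum(pdf) * dz
    if area <= 0:
        raise ValueError("PDF must have positive area")
    return [p / area for p in pdf]


def stack_pdfs(pdfs, dz):
    """Average the normalized per-galaxy PDFs to estimate N(z)."""
    n_bins = len(pdfs[0])
    stacked = [0.0] * n_bins
    for pdf in pdfs:
        if len(pdf) != n_bins:
            raise ValueError("all PDFs must share the same redshift grid")
        for i, p in enumerate(normalize(pdf, dz)):
            stacked[i] += p
    return [s / len(pdfs) for s in stacked]


def histogram_of_modes(pdfs, dz):
    """N(z) estimate that keeps only each galaxy's most probable bin."""
    counts = [0.0] * len(pdfs[0])
    for pdf in pdfs:
        counts[pdf.index(max(pdf))] += 1.0
    return normalize(counts, dz)


# Toy PDFs on a 5-bin grid (hypothetical values); the third is bimodal.
DZ = 0.1
toy_pdfs = [
    [0.0, 1.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 2.0, 1.0],
    [2.0, 1.0, 0.0, 0.0, 2.0],
]

n_of_z = stack_pdfs(toy_pdfs, DZ)
n_modes = histogram_of_modes(toy_pdfs, DZ)
print("stacked PDFs:      ", [round(v, 2) for v in n_of_z])
print("histogram of modes:", [round(v, 2) for v in n_modes])
```

In this toy case the stacked estimate keeps probability in the highest-redshift bin, which the bimodal and broad PDFs support, while the histogram of modes assigns it nothing.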
Reconstructing galaxy fundamental distributions and scaling relations from photometric redshift surveys. Applications to the SDSS early-type sample
Noisy distance estimates associated with photometric rather than
spectroscopic redshifts lead to a mis-estimate of the luminosities, and produce
a correlated mis-estimate of the sizes. We consider a sample of early-type
galaxies from the SDSS DR6 for which both spectroscopic and photometric
information is available, and apply the generalization of the V_max method to
correct for these biases. We show that our technique recovers the true
redshift, magnitude and size distributions, as well as the true size-luminosity
relation. We find that using only 10% of the spectroscopic information randomly
spaced in our catalog is sufficient for the reconstructions to be accurate
within about 3%, when the photometric redshift error is dz = 0.038. We then
address the problem of extending our method to deep redshift catalogs, where
only photometric information is available. In addition to the specific
applications outlined here, our technique impacts a broader range of studies,
when at least one distance-dependent quantity is involved. It is particularly
relevant for the next generation of surveys, some of which will only have
photometric information.
Comment: 14 pages, 12 figures, 1 table, new section 3.1 and appendix added,
MNRAS in press
Automated measurement of redshift from mid-infrared low resolution spectroscopy
We present a new SED-fitting based routine for redshift determination that is
optimised for mid-infrared (MIR) low-resolution spectroscopy. Its flexible
template scaling increases the sensitivity to slope changes and small scale
features in the spectrum, while a new selection algorithm called Maximum
Combined Pseudo-Likelihood (MCPL) provides increased accuracy and a lower
number of outliers compared to the standard maximum-likelihood (ML) approach.
Unlike ML, MCPL searches for local (instead of absolute) maxima of a
'pseudo-likelihood' (PL) function, and combines results obtained for all the
templates in the library to weed out spurious redshift solutions. The
capabilities of MCPL are demonstrated by comparing its results to those of
regular ML and to the optical spectroscopic redshifts of a sample of 491
Spitzer/IRS spectra from sources at 0<z<3.7. MCPL achieves a redshift accuracy
dz/(1+z)<0.005 for 78% of the galaxies in the sample compared to 68% for ML.
The rate of outliers (dz/(1+z)>0.02) is 14% for MCPL and 22% for ML. chi^2
values for ML solutions are found to correlate with the SNR of the spectra, but
not with redshift accuracy. By contrast, the peak value of the normalised
combined PL (gamma) is found to provide a good indication of the reliability of
the MCPL solution for individual sources. The accuracy and reliability of the
redshifts depend strongly on the MIR SED. Sources with significant polycyclic
aromatic hydrocarbon emission obtain much better results compared to sources
dominated by AGN continuum. Nevertheless, for a given gamma the frequency of
accurate solutions and outliers is largely independent of their SED type. This
reliability indicator for MCPL solutions allows the selection of subsamples with
highly reliable redshifts. In particular, a gamma>0.15 threshold retains 79% of
the sources with dz/(1+z)<0.005 while reducing the outlier rate to 3.8%
(abridged).
Comment: 23 pages, 12 figures, 5 tables. Accepted for publication in MNRA
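The accuracy and outlier statistics quoted above can be computed along the following lines. This is a minimal sketch under assumptions: dz is taken to be the absolute difference between the estimated and the reference (optical spectroscopic) redshift, the z in 1+z is taken to be the reference redshift, and the function names and toy values are hypothetical rather than taken from the paper.

```python
def normalized_errors(z_est, z_ref):
    """|z_est - z_ref| / (1 + z_ref) for each source."""
    return [abs(ze - zr) / (1.0 + zr) for ze, zr in zip(z_est, z_ref)]


def summarize(z_est, z_ref, accurate_below=0.005, outlier_above=0.02):
    """Fraction of accurate solutions and fraction of outliers."""
    errs = normalized_errors(z_est, z_ref)
    n = len(errs)
    return {
        "accurate_fraction": sum(e < accurate_below for e in errs) / n,
        "outlier_fraction": sum(e > outlier_above for e in errs) / n,
    }


def reliability_cut(z_est, z_ref, gamma, gamma_min=0.15):
    """Keep only sources whose reliability indicator exceeds gamma_min."""
    kept = [(ze, zr) for ze, zr, g in zip(z_est, z_ref, gamma) if g > gamma_min]
    return [ze for ze, _ in kept], [zr for _, zr in kept]


# Toy values (hypothetical), not data from the paper.
z_ref = [0.5, 1.0, 2.0, 3.0]
z_est = [0.503, 1.05, 2.006, 3.5]
gamma = [0.4, 0.1, 0.3, 0.05]

full = summarize(z_est, z_ref)
cut = summarize(*reliability_cut(z_est, z_ref, gamma))
print("full sample:    ", full)
print("after gamma cut:", cut)
```

In the toy sample the two sources with low gamma are also the two outliers, so the cut removes them; real samples will not separate this cleanly.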
A Comparison of Photometric Redshift Techniques for Large Radio Surveys
Future radio surveys will generate catalogs of tens of millions of radio sources, for which redshift estimates will be essential to achieve many of the science goals. However, spectroscopic data will be available for only a small fraction of these sources, and in most cases even the optical and infrared photometry will be of limited quality. Furthermore, radio sources tend to be at higher redshift than most optical sources (most radio surveys have a median redshift greater than 1) and so a significant fraction of radio source hosts differ from those for which most photometric redshift templates are designed. We therefore need to develop new techniques for estimating the redshifts of radio sources. As a starting point in this process, we evaluate a number of machine-learning techniques for estimating redshift, together with a conventional template-fitting technique. We pay special attention to how the performance is affected by the incompleteness of the training sample and by sparseness of the parameter space or by limited availability of ancillary multiwavelength data. As expected, we find that the quality of the photometric redshifts degrades as the quality of the photometry decreases, but that even with the limited quality of photometry available for all-sky surveys, useful redshift information is available for the majority of sources, particularly at low redshift. We find that a template-fitting technique performs best in the presence of high-quality and almost complete multi-band photometry, especially if radio sources that are also X-ray emitting are treated separately, using specific templates and priors. When we reduced the quality of photometry to match that available for the EMU all-sky radio survey, the quality of the template-fitting degraded and became comparable to some of the machine-learning methods. Machine-learning techniques currently perform better at low redshift than at high redshift, because of incompleteness of the currently available training data at high redshifts.