    Five Tales of Random Forest Regression

    We present a set of variations on the theme of Random Forest regression: two applications to the problem of estimating galactic distances from photometry, which produce results comparable to or better than all other current approaches to the problem; an extension of the methodology that produces error-distribution variance estimates for individual regression estimates, a property that appears unique among non-parametric regression estimators; an exponential asymptotic improvement in training speed over the current de facto standard implementation, derived from a theoretical model of the training process combined with competent software engineering; a massively parallel implementation of the regression algorithm for a GPGPU cluster, integrated with a distributed database management system, resulting in a fast round-trip ingest-analyze-archive procedure on a system with total power consumption under 1 kW; and a novel theoretical comparison of the methodology with that of kernel regression, relating the Random Forest bootstrap sample size to the kernel regression bandwidth parameter and yielding an extension of the Random Forest methodology that offers lower mean-squared error than the standard methodology.
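
    A minimal sketch (not the paper's implementation) of one of the ideas above: a Random Forest photometric-redshift regressor whose per-object uncertainty is read off the spread of the individual trees' predictions. It assumes scikit-learn and synthetic photometry; the sample sizes and band count are illustrative only.

        # Hedged sketch: per-object variance from the spread of tree predictions.
        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        n_train, n_test, n_bands = 5000, 500, 5            # e.g. u, g, r, i, z magnitudes
        X_train = rng.normal(size=(n_train, n_bands))
        z_train = np.abs(X_train @ rng.normal(size=n_bands)) * 0.1   # fake redshifts
        X_test = rng.normal(size=(n_test, n_bands))

        forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=5, n_jobs=-1)
        forest.fit(X_train, z_train)

        z_phot = forest.predict(X_test)                    # point estimate: the forest mean
        per_tree = np.stack([t.predict(X_test) for t in forest.estimators_])
        z_var = per_tree.var(axis=0)                       # spread across trees as a variance estimate
        print(z_phot[:5], np.sqrt(z_var[:5]))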

    New Approaches To Photometric Redshift Prediction Via Gaussian Process Regression In The Sloan Digital Sky Survey

    Expanding upon the work of Way & Srivastava (2006), we demonstrate how the use of training sets of comparable size continues to make Gaussian process regression (GPR) a competitive approach to that of neural networks and other least-squares fitting methods. This is possible via new large-size matrix inversion techniques developed for Gaussian processes (GPs) that do not require the kernel matrix to be sparse. This development, combined with a neural-network kernel function, appears to give superior results for this problem. Our best-fit results for the Sloan Digital Sky Survey (SDSS) Main Galaxy Sample using the u, g, r, i, z filters give an rms error of 0.0201, while our results for the same filters in the Luminous Red Galaxy sample yield 0.0220. We also demonstrate that there appears to be a minimum number of training-set galaxies needed to obtain the optimal fit when using our GPR rank-reduction methods. We find that the morphological information included with many photometric surveys appears, for the most part, to make the photometric redshift evaluation slightly worse rather than better. This would indicate that most morphological information simply adds noise from the GP point of view in the data used herein. In addition, we show that cross-match catalog results involving combinations of the Two Micron All Sky Survey, SDSS, and the Galaxy Evolution Explorer have to be evaluated in the context of the resulting cross-match magnitude and redshift distribution; otherwise one may be misled into overly optimistic conclusions. Comment: 32 pages, ApJ in press; 2 new figures, 1 new table of comparison methods, updated discussion, references and typos to reflect the version in press.
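
    As a hedged illustration of the regression setup only (not the paper's rank-reduction machinery or its neural-network kernel, neither of which is reproduced here), the sketch below fits an exact Gaussian process with an RBF plus white-noise kernel to a small synthetic ugriz-like sample using scikit-learn.

        # Stand-in GP photo-z regression; exact inversion, so keep n small (O(n^3)).
        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF, WhiteKernel

        rng = np.random.default_rng(1)
        n, n_bands = 800, 5                                # stand-in for u, g, r, i, z magnitudes
        X = rng.normal(size=(n, n_bands))
        z = np.abs(X @ rng.normal(size=n_bands)) * 0.1 + rng.normal(scale=0.01, size=n)

        kernel = 1.0 * RBF(length_scale=np.ones(n_bands)) + WhiteKernel(noise_level=1e-4)
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gp.fit(X[:600], z[:600])

        z_pred, z_std = gp.predict(X[600:], return_std=True)
        rms = np.sqrt(np.mean((z_pred - z[600:]) ** 2))
        print(f"rms error on held-out set: {rms:.4f}")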

    ArborZ: Photometric Redshifts Using Boosted Decision Trees

    Precision photometric redshifts will be essential for extracting cosmological parameters from the next generation of wide-area imaging surveys. In this paper we introduce a photometric redshift algorithm, ArborZ, based on the machine-learning technique of Boosted Decision Trees. We study the algorithm using galaxies from the Sloan Digital Sky Survey and from mock catalogs intended to simulate both the SDSS and the upcoming Dark Energy Survey. We show that it improves upon the performance of existing algorithms. Moreover, the method naturally leads to the reconstruction of a full probability density function (PDF) for the photometric redshift of each galaxy, not merely a single "best estimate" and error, and also provides a photo-z quality figure of merit for each galaxy that can be used to reject outliers. We show that the stacked PDFs yield a more accurate reconstruction of the redshift distribution N(z). We discuss limitations of the current algorithm and ideas for future work. Comment: 10 pages, 13 figures, submitted to ApJ.
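
    The PDF-producing aspect can be imitated with a simple binned-classification stand-in (this is not the ArborZ implementation): discretize redshift into bins, train a boosted-tree classifier over the bins, read a crude per-galaxy PDF off the class probabilities, and stack the PDFs to estimate N(z). All data, bin edges and classifier settings below are illustrative assumptions.

        # Illustrative binned-classification photo-z PDF and stacked N(z).
        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier

        rng = np.random.default_rng(2)
        n, n_bands = 4000, 5
        X = rng.normal(size=(n, n_bands))
        z = np.abs(X @ rng.normal(size=n_bands)) * 0.1

        edges = np.linspace(0.0, z.max() + 1e-6, 21)       # 20 redshift bins
        centers = 0.5 * (edges[:-1] + edges[1:])
        labels = np.digitize(z, edges) - 1                 # bin index per galaxy

        clf = GradientBoostingClassifier(n_estimators=50, max_depth=3)
        clf.fit(X[:3000], labels[:3000])

        pdfs = clf.predict_proba(X[3000:])                 # rows: galaxies, columns: occupied bins
        z_best = centers[clf.classes_[np.argmax(pdfs, axis=1)]]   # most probable bin as "best estimate"
        n_of_z = pdfs.sum(axis=0)                          # stacked PDFs ~ N(z) of the test set
        print(z_best[:5], n_of_z)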

    Reconstructing galaxy fundamental distributions and scaling relations from photometric redshift surveys. Applications to the SDSS early-type sample

    Noisy distance estimates associated with photometric rather than spectroscopic redshifts lead to a mis-estimate of the luminosities and produce a correlated mis-estimate of the sizes. We consider a sample of early-type galaxies from the SDSS DR6 for which both spectroscopic and photometric information is available, and apply a generalization of the V_max method to correct for these biases. We show that our technique recovers the true redshift, magnitude and size distributions, as well as the true size-luminosity relation. We find that using only 10% of the spectroscopic information, randomly spaced in our catalog, is sufficient for the reconstructions to be accurate to within about 3% when the photometric redshift error is dz = 0.038. We then address the problem of extending our method to deep redshift catalogs, where only photometric information is available. In addition to the specific applications outlined here, our technique impacts a broader range of studies in which at least one distance-dependent quantity is involved. It is particularly relevant for the next generation of surveys, some of which will have only photometric information. Comment: 14 pages, 12 figures, 1 table, new section 3.1 and appendix added, MNRAS in press.
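
    For context, a minimal 1/V_max luminosity-function estimator is sketched below. It is the classical method, not the paper's generalization to photometric redshifts: each galaxy in a flux-limited sample is weighted by the inverse of the maximum volume within which it would still pass the magnitude limit. The magnitude limit, sky coverage and catalogue are illustrative assumptions; the cosmology comes from astropy.

        # Classical 1/V_max estimate of the luminosity function (illustrative data).
        import numpy as np
        from astropy.cosmology import FlatLambdaCDM

        cosmo = FlatLambdaCDM(H0=70, Om0=0.3)
        m_lim = 17.77                                      # assumed apparent-magnitude limit
        sky_fraction = 0.2                                 # assumed fractional sky coverage

        rng = np.random.default_rng(3)
        z = rng.uniform(0.02, 0.2, size=5000)              # fake redshifts
        M = rng.normal(-20.5, 1.0, size=5000)              # fake absolute magnitudes
        keep = M + cosmo.distmod(z).value < m_lim          # flux-limited selection
        z, M = z[keep], M[keep]

        # Invert the distance modulus on a grid to find z_max for each galaxy,
        # the redshift at which it would fade to the magnitude limit.
        z_grid = np.linspace(0.001, 1.0, 2000)
        dm_grid = cosmo.distmod(z_grid).value
        z_max = np.interp(m_lim - M, dm_grid, z_grid)

        v_max = sky_fraction * cosmo.comoving_volume(z_max).value    # Mpc^3

        bins = np.arange(-23.0, -18.0, 0.5)
        phi, _ = np.histogram(M, bins=bins, weights=1.0 / v_max)
        phi /= np.diff(bins)                               # number density per magnitude
        print(phi)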

    Automated measurement of redshift from mid-infrared low resolution spectroscopy

    We present a new SED-fitting-based routine for redshift determination that is optimised for mid-infrared (MIR) low-resolution spectroscopy. Its flexible template scaling increases the sensitivity to slope changes and small-scale features in the spectrum, while a new selection algorithm called Maximum Combined Pseudo-Likelihood (MCPL) provides increased accuracy and fewer outliers compared to the standard maximum-likelihood (ML) approach. Unlike ML, MCPL searches for local (instead of absolute) maxima of a 'pseudo-likelihood' (PL) function, and combines results obtained for all the templates in the library to weed out spurious redshift solutions. The capabilities of MCPL are demonstrated by comparing its results to those of regular ML and to the optical spectroscopic redshifts of a sample of 491 Spitzer/IRS spectra from sources at 0<z<3.7. MCPL achieves a redshift accuracy dz/(1+z)<0.005 for 78% of the galaxies in the sample, compared to 68% for ML. The rate of outliers (dz/(1+z)>0.02) is 14% for MCPL and 22% for ML. chi^2 values for ML solutions are found to correlate with the SNR of the spectra, but not with redshift accuracy. By contrast, the peak value of the normalised combined PL (gamma) is found to provide a good indication of the reliability of the MCPL solution for individual sources. The accuracy and reliability of the redshifts depend strongly on the MIR SED: sources with significant polycyclic aromatic hydrocarbon emission obtain much better results than sources dominated by AGN continuum. Nevertheless, for a given gamma the frequency of accurate solutions and outliers is largely independent of SED type. This reliability indicator for MCPL solutions makes it possible to select subsamples with highly reliable redshifts; in particular, a gamma>0.15 threshold retains 79% of the sources with dz/(1+z)<0.005 while reducing the outlier rate to 3.8% (abridged). Comment: 23 pages, 12 figures, 5 tables. Accepted for publication in MNRAS.
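
    The core of the MCPL idea can be sketched as a toy version under stated assumptions (not the published code: the flexible per-segment template scaling and the exact normalisation are omitted, and the spectrum and two-template library are synthetic): for each template, scan a redshift grid with a free scale factor, convert chi^2 into a pseudo-likelihood, keep only its local maxima, combine the per-template curves, and take the strongest combined peak.

        # Toy Maximum Combined Pseudo-Likelihood (MCPL) redshift scan.
        import numpy as np

        def chi2_vs_z(flux, err, template, z_grid, wave_obs, wave_rest):
            """chi^2(z) for one template with a free multiplicative scale."""
            chi2 = np.empty(len(z_grid))
            for i, z in enumerate(z_grid):
                model = np.interp(wave_obs / (1.0 + z), wave_rest, template)
                scale = np.sum(flux * model / err**2) / np.sum(model**2 / err**2)
                chi2[i] = np.sum(((flux - scale * model) / err) ** 2)
            return chi2

        def local_maxima(y):
            """Indices of interior local maxima of a 1-D array."""
            return np.where((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:]))[0] + 1

        # Synthetic observation and a toy two-template "library".
        rng = np.random.default_rng(4)
        wave_rest = np.linspace(5.0, 15.0, 300)            # rest-frame wavelengths (microns)
        templates = [np.exp(-0.5 * ((wave_rest - c) / 0.4) ** 2) + 0.2 for c in (6.2, 7.7)]
        z_true = 1.1
        wave_obs = np.linspace(8.0, 25.0, 120)
        flux = np.interp(wave_obs / (1.0 + z_true), wave_rest, templates[0])
        flux += rng.normal(scale=0.05, size=flux.size)
        err = np.full_like(flux, 0.05)

        z_grid = np.linspace(0.0, 3.0, 301)
        combined = np.zeros_like(z_grid)
        for tpl in templates:
            chi2 = chi2_vs_z(flux, err, tpl, z_grid, wave_obs, wave_rest)
            pl = np.exp(-0.5 * (chi2 - chi2.min()))        # pseudo-likelihood for this template
            peaks = local_maxima(pl)
            keep = np.zeros_like(pl)
            keep[peaks] = pl[peaks]                        # retain only the local maxima
            combined += keep / max(keep.sum(), 1e-300)     # normalise each template's contribution
        z_mcpl = z_grid[np.argmax(combined)]
        print(f"combined-PL redshift estimate: {z_mcpl:.2f} (true value {z_true})")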

    A Comparison of Photometric Redshift Techniques for Large Radio Surveys

    Future radio surveys will generate catalogs of tens of millions of radio sources, for which redshift estimates will be essential to achieve many of the science goals. However, spectroscopic data will be available for only a small fraction of these sources, and in most cases even the optical and infrared photometry will be of limited quality. Furthermore, radio sources tend to be at higher redshift than most optical sources (most radio surveys have a median redshift greater than 1), and so a significant fraction of radio-source hosts differ from those for which most photometric redshift templates are designed. We therefore need to develop new techniques for estimating the redshifts of radio sources. As a starting point in this process, we evaluate a number of machine-learning techniques for estimating redshift, together with a conventional template-fitting technique. We pay special attention to how the performance is affected by the incompleteness of the training sample, by sparseness of the parameter space, and by the limited availability of ancillary multiwavelength data. As expected, we find that the quality of the photometric redshifts degrades as the quality of the photometry decreases, but that even with the limited quality of photometry available for all-sky surveys, useful redshift information is available for the majority of sources, particularly at low redshift. We find that a template-fitting technique performs best in the presence of high-quality and almost complete multi-band photometry, especially if radio sources that are also X-ray emitting are treated separately, using specific templates and priors. When we reduced the quality of photometry to match that available for the EMU all-sky radio survey, the quality of the template fitting degraded and became comparable to that of some of the machine-learning methods. Machine-learning techniques currently perform better at low redshift than at high redshift, because of the incompleteness of the currently available training data at high redshifts.
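
    A small harness in the spirit of such a comparison is sketched below. It is not the paper's pipeline: the data are synthetic, template fitting is not included, and only three off-the-shelf scikit-learn regressors are scored with common photo-z metrics (the normalised median absolute deviation of dz/(1+z) and a |dz|/(1+z) > 0.15 outlier fraction, both assumed thresholds for this sketch).

        # Illustrative comparison of machine-learning photo-z regressors.
        import numpy as np
        from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
        from sklearn.neighbors import KNeighborsRegressor
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(5)
        n, n_bands = 6000, 7                               # e.g. optical plus infrared magnitudes
        X = rng.normal(size=(n, n_bands))
        z = np.abs(X @ rng.normal(size=n_bands)) * 0.3

        X_tr, X_te, z_tr, z_te = train_test_split(X, z, test_size=0.25, random_state=0)

        models = {
            "kNN": KNeighborsRegressor(n_neighbors=10),
            "random forest": RandomForestRegressor(n_estimators=200, n_jobs=-1),
            "boosted trees": GradientBoostingRegressor(n_estimators=200),
        }
        for name, model in models.items():
            pred = model.fit(X_tr, z_tr).predict(X_te)
            dz = (pred - z_te) / (1.0 + z_te)
            nmad = 1.4826 * np.median(np.abs(dz - np.median(dz)))
            outliers = np.mean(np.abs(dz) > 0.15)
            print(f"{name:14s}  NMAD={nmad:.3f}  outlier fraction={outliers:.3f}")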
