Big Universe, Big Data: Machine Learning and Image Analysis for Astronomy
Astrophysics and cosmology are rich with data. The advent of wide-area
digital cameras on large aperture telescopes has led to ever more ambitious
surveys of the sky. Data volumes of entire surveys a decade ago can now be
acquired in a single night and real-time analysis is often desired. Thus,
modern astronomy requires big-data know-how; in particular, it demands highly
efficient machine learning and image analysis algorithms. But scalability is
not the only challenge: Astronomy applications touch several current machine
learning research questions, such as learning from biased data and dealing with
label and measurement noise. We argue that this makes astronomy a great domain
for computer science research, as it pushes the boundaries of data analysis. In
the following, we will present this exciting application area for data
scientists. We will focus on exemplary results, discuss main challenges, and
highlight some recent methodological advancements in machine learning and image
analysis triggered by astronomical applications.
Instructions for setting up and running the experiments
Instructions for setting up and running the experiments, as well as how to interpret the output.
Script to run experiments from Stensbo-Smidt et al. (2016)
Script to run the experiments from Stensbo-Smidt et al. (2016), estimating photometric redshifts and specific star formation rates for galaxies in SDSS using only magnitudes as inputs.

The script requires numpy, pandas, and scikit-learn to run. For feature selection, you will also need the speedynn package: https://github.com/gieseke/speedynn
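The regression setup the script automates can be sketched with scikit-learn alone. Everything below is illustrative: the synthetic magnitudes and the k-nearest-neighbour regressor merely stand in for the SDSS catalogue and the feature-selected models of the actual experiments.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n = 2000
# Synthetic u, g, r, i, z magnitudes standing in for SDSS photometry
mags = rng.uniform(15.0, 22.0, size=(n, 5))
# Toy "redshift" correlated with the colours (magnitude differences)
z = 0.1 * (mags[:, 1] - mags[:, 2]) + 0.05 * (mags[:, 2] - mags[:, 3])
z += rng.normal(0.0, 0.01, size=n)

X_train, X_test, z_train, z_test = train_test_split(mags, z, random_state=0)
# Regress the target directly on the magnitudes, as in the experiments
model = KNeighborsRegressor(n_neighbors=10).fit(X_train, z_train)
pred = model.predict(X_test)
rmse = float(np.sqrt(np.mean((pred - z_test) ** 2)))
print(f"RMSE: {rmse:.4f}")
```

Swapping the synthetic arrays for a pandas DataFrame of real SDSS magnitudes recovers the shape of the pipeline; the speedynn package then replaces the brute-force neighbour search for the feature-selection step.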
Adaptive Cholesky Gaussian Processes
We present a method to approximate Gaussian process regression models for
large datasets by considering only a subset of the data. Our approach is novel
in that the size of the subset is selected on the fly during exact inference
with little computational overhead. From an empirical observation that the
log-marginal likelihood often exhibits a linear trend once a sufficient subset
of a dataset has been observed, we conclude that many large datasets contain
redundant information that only slightly affects the posterior. Based on this,
we provide probabilistic bounds on the full model evidence that can identify
such subsets. Remarkably, these bounds are largely composed of terms that
appear in intermediate steps of the standard Cholesky decomposition, allowing
us to modify the algorithm to adaptively stop the decomposition once enough
data have been observed.
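The stopping idea can be illustrated with a row-by-row Cholesky factorisation that monitors the per-point log-marginal-likelihood contributions as they are produced. This is only a schematic sketch: the paper's probabilistic bounds on the model evidence are replaced here by a naive flatness heuristic, and the kernel choice, `window`, and `tol` values are arbitrary.

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    """Squared-exponential kernel."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def adaptive_cholesky(K, y, window=50, tol=0.1):
    """Row-by-row Cholesky of K that stops once the per-point
    log-marginal-likelihood increments have flattened out, i.e. once
    the log-marginal likelihood grows roughly linearly in the number
    of points observed."""
    n = len(y)
    L = np.zeros((n, n))
    v = np.zeros(n)      # v = L^{-1} y, built by forward substitution
    inc = []             # per-point log-marginal-likelihood terms
    m = n
    for i in range(n):
        # One row of the standard Cholesky decomposition
        for j in range(i):
            L[i, j] = (K[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
        L[i, i] = np.sqrt(K[i, i] - L[i, :i] @ L[i, :i])
        v[i] = (y[i] - L[i, :i] @ v[:i]) / L[i, i]
        # Log density of y_i given the points seen so far
        inc.append(-0.5 * v[i]**2 - np.log(L[i, i]) - 0.5 * np.log(2 * np.pi))
        # Crude stand-in for the paper's bounds: stop when two
        # consecutive windows of increments have similar means
        if i >= 2 * window:
            recent = np.mean(inc[-window:])
            previous = np.mean(inc[-2 * window:-window])
            if abs(recent - previous) < tol:
                m = i + 1
                break
    return L[:m, :m], v[:m], m

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))   # data must be in random order
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 500)
K = rbf(X, X) + 0.1**2 * np.eye(500)
L, v, m = adaptive_cholesky(K, y)
print(f"stopped after {m} of 500 points")
```

Because the decomposition halts mid-way, the returned `L` is the exact Cholesky factor of the leading `m`-by-`m` block of `K`, so the usual GP prediction formulas apply unchanged to the retained subset.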