4,324 research outputs found
The ABACOC Algorithm: a Novel Approach for Nonparametric Classification of Data Streams
Stream mining poses unique challenges to machine learning: predictive models
are required to be scalable, incrementally trainable, must remain bounded in
size (even when the data stream is arbitrarily long), and be nonparametric in
order to achieve high accuracy even in complex and dynamic environments.
Moreover, the learning system must be parameterless (traditional tuning
methods are problematic in streaming settings) and avoid requiring prior
knowledge of the number of distinct class labels occurring in the stream. In
this paper, we introduce a new algorithmic approach for nonparametric learning
in data streams. Our approach addresses all of the above-mentioned challenges by
learning a model that covers the input space using simple local classifiers.
The distribution of these classifiers dynamically adapts to the local (unknown)
complexity of the classification problem, thus achieving a good balance between
model complexity and predictive accuracy. We design four variants of our
approach of increasing adaptivity. By means of an extensive empirical
evaluation against standard nonparametric baselines, we show state-of-the-art
results in terms of accuracy versus model size. For the variant that imposes a
strict bound on the model size, we show better performance against all other
methods measured at the same model size value. Our empirical analysis is
complemented by a theoretical performance guarantee which does not rely on any
stochastic assumption on the source generating the stream.
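The idea of covering the input space with simple local classifiers can be illustrated with a toy sketch. This is not the paper's algorithm: the class name `BallCoverClassifier`, the parameters `initial_radius`, `shrink`, and `max_balls`, and the shrink-on-mistake rule are all assumptions made for illustration. Each local classifier is a ball that predicts the majority label it has seen, labels are discovered as they arrive, and an optional budget keeps the model size bounded.

```python
import math
from collections import Counter


class BallCoverClassifier:
    """Toy online classifier that covers the input space with labelled balls.

    Hypothetical illustration only; not the ABACOC algorithm itself.
    """

    def __init__(self, initial_radius=1.0, shrink=0.5, max_balls=None):
        self.initial_radius = initial_radius  # radius given to every new ball
        self.shrink = shrink                  # radius multiplier after a mistake
        self.max_balls = max_balls            # optional bound on the model size
        self.balls = []                       # dicts with center, radius, counts

    def _nearest(self, x):
        """Return the ball with the closest center and its distance to x."""
        if not self.balls:
            return None, math.inf
        ball = min(self.balls, key=lambda b: math.dist(b["center"], x))
        return ball, math.dist(ball["center"], x)

    def predict(self, x):
        """Predict the majority label of the nearest ball (None if no balls)."""
        ball, _ = self._nearest(x)
        if ball is None:
            return None
        return ball["counts"].most_common(1)[0][0]

    def update(self, x, y):
        """Process one labelled example from the stream."""
        ball, dist = self._nearest(x)
        full = self.max_balls is not None and len(self.balls) >= self.max_balls
        if ball is not None and (dist <= ball["radius"] or full):
            # x is covered (or the budget is exhausted): refine the local model.
            if ball["counts"].most_common(1)[0][0] != y:
                ball["radius"] *= self.shrink  # local mistake: shrink the ball
            ball["counts"][y] += 1
        else:
            # x is not covered: add a new local classifier centered at x.
            self.balls.append({
                "center": tuple(x),
                "radius": self.initial_radius,
                "counts": Counter({y: 1}),
            })
```

Shrinking a ball after a mistake lets later points in that region fall outside it and seed new balls, so more balls accumulate where the labels are harder to separate.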
GenArchBench: Porting and Optimizing a Genomics Benchmark Suite to Arm-based HPC Processors
Arm usage has substantially grown in the High-Performance Computing (HPC) community. Japanese supercomputer Fugaku, powered by Arm-based A64FX processors, held the top position on the Top500 list between June 2020 and June 2022, currently sitting in the second position. The recently released 7th generation of Amazon EC2 instances for compute-intensive workloads (C7g) is also powered by Arm Graviton3 processors. Projects like European Mont-Blanc and U.S. DOE/NNSA Astra are further examples of Arm's irruption into HPC. In parallel, over the last decade, the rapid improvement of genomic sequencing technologies and the exponential growth of sequencing data have placed a significant bottleneck on the computational side. While the majority of genomics applications have been thoroughly tested and optimized for x86 systems, just a few are prepared to perform efficiently on Arm machines, let alone exploit the advantages of the newly introduced Scalable Vector Extensions (SVE). This thesis presents GenArchBench, the first genome analysis benchmark suite targeting Arm architectures. We have selected a set of computationally demanding kernels from the most widely used tools in genome data analysis and ported them to Arm-based A64FX and Graviton3 processors. The porting features the usage of the novel Arm SVE instructions, algorithmic and code optimizations, and the exploitation of Arm-optimized libraries. All in all, the GenArchBench suite comprises 13 multi-core kernels from critical stages of widely used genome analysis pipelines, including base-calling, read mapping, variant calling, and genome assembly. Moreover, our benchmark suite includes different input data sets per kernel (small and large), each with a corresponding regression test to verify the correctness of each execution automatically.
In this work, we present the optimizations implemented in each kernel and a detailed evaluation and comparison of their performance on four different architectures (A64FX, Graviton3, Intel Xeon Platinum, and AMD EPYC). Additionally, as proof of the impact of this work, we study the performance improvement in a production-ready genomics pipeline using the GenArchBench optimized kernels.
Data Mining and Machine Learning in Astronomy
We review the current state of data mining and machine learning in astronomy.
'Data Mining' can have a somewhat mixed connotation from the point of view of a
researcher in this field. If used correctly, it can be a powerful approach,
holding the potential to fully exploit the exponentially increasing amount of
available data, promising great scientific advance. However, if misused, it can
be little more than the black-box application of complex computing algorithms
that may give little physical insight, and provide questionable results. Here,
we give an overview of the entire data mining process, from data collection
through to the interpretation of results. We cover common machine learning
algorithms, such as artificial neural networks and support vector machines,
applications from a broad range of astronomy, emphasizing those where data
mining techniques directly resulted in improved science, and important current
and future directions, including probability density functions, parallel
algorithms, petascale computing, and the time domain. We conclude that, so long
as one carefully selects an appropriate algorithm, and is guided by the
astronomical problem at hand, data mining can be very much the powerful tool,
and not the questionable black box.
Comment: Published in IJMPD. 61 pages, uses ws-ijmpd.cls. Several extra
figures, some minor additions to the tex
REDS: Random Ensemble Deep Spatial prediction
There has been a great deal of recent interest in the development of spatial
prediction algorithms for very large datasets and/or prediction domains. These
methods have primarily been developed in the spatial statistics community, but
there has been growing interest in the machine learning community for such
methods, primarily driven by the success of deep Gaussian process regression
approaches and deep convolutional neural networks. These methods are often
computationally expensive to train and implement, and consequently there has
been a resurgence of interest in random projections and deep learning models
based on random weights, so-called reservoir computing methods. Here, we
combine several of these ideas to develop the Random Ensemble Deep Spatial
(REDS) approach to predict spatial data. The procedure uses random Fourier
features as inputs to an extreme learning machine (a deep neural model with
random weights), and with calibrated ensembles of outputs from this model based
on different random weights, it provides a simple uncertainty quantification.
The REDS method is demonstrated on simulated data and on a classic large
satellite data set.
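The pipeline described above (random Fourier features fed to a random-weight network, with an ensemble over different random draws) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the REDS implementation: it uses a single random hidden layer rather than a deep model, fits only the output weights by ridge regression, and reports the raw ensemble spread without any calibration. All function names and default sizes are hypothetical.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def make_member(dim, n_rff, n_hidden, rng):
    """Draw the fixed (never trained) random weights of one ensemble member."""
    freqs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_rff)]
    phases = [rng.uniform(0, 2 * math.pi) for _ in range(n_rff)]
    hidden = [[rng.gauss(0, 1) / math.sqrt(n_rff) for _ in range(n_rff)]
              for _ in range(n_hidden)]
    return freqs, phases, hidden


def features(s, member):
    """Map a location s to random Fourier features, then a random tanh layer."""
    freqs, phases, hidden = member
    rff = [math.cos(sum(w * v for w, v in zip(f, s)) + p)
           for f, p in zip(freqs, phases)]
    # The trailing 1.0 acts as a bias term for the output layer.
    return [math.tanh(sum(w * z for w, z in zip(row, rff))) for row in hidden] + [1.0]


def fit_member(S, y, member, ridge=1e-3):
    """Fit only the output weights by ridge regression (normal equations)."""
    H = [features(s, member) for s in S]
    k = len(H[0])
    A = [[sum(h[i] * h[j] for h in H) + (ridge if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(h[i] * t for h, t in zip(H, y)) for i in range(k)]
    return solve(A, b)


def fit_ensemble(S, y, n_members=10, n_rff=20, n_hidden=15, seed=0):
    """Fit several members that differ only in their random weights."""
    rng = random.Random(seed)
    members = [make_member(len(S[0]), n_rff, n_hidden, rng) for _ in range(n_members)]
    return [(m, fit_member(S, y, m)) for m in members]


def predict(ensemble, s):
    """Return the ensemble mean and the (uncalibrated) ensemble spread at s."""
    preds = [sum(b * h for b, h in zip(beta, features(s, m))) for m, beta in ensemble]
    mean = sum(preds) / len(preds)
    spread = math.sqrt(sum((p - mean) ** 2 for p in preds) / len(preds))
    return mean, spread
```

Because only the output layer is fitted, training each member reduces to one small linear solve, which is what makes this family of models cheap compared with fully trained deep networks.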
Data-driven Flood Emulation: Speeding up Urban Flood Predictions by Deep Convolutional Neural Networks
Computational complexity has been the bottleneck of applying physically-based
simulations on large urban areas with high spatial resolution for efficient and
systematic flooding analyses and risk assessments. To address this issue of
long computational time, this paper proposes that the prediction of maximum
water depth rasters can be considered as an image-to-image translation problem
where the results are generated from input elevation rasters using the
information learned from data rather than by conducting simulations, which can
significantly accelerate the prediction process. The proposed approach was
implemented by a deep convolutional neural network trained on flood simulation
data of 18 designed hyetographs on three selected catchments. Multiple tests
with both designed and real rainfall events were performed and the results show
that flood prediction by the neural network uses only 0.5% of the time required
by physically-based approaches, with promising accuracy and generalization
ability. The proposed neural network can also potentially be applied to
different but relevant problems including flood predictions for urban layout
planning.
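As a rough reading of the reported figure, using 0.5% of the simulation time corresponds to a speedup of about 200 times. The 10-hour simulation time below is a hypothetical value chosen only for illustration, not a number from the paper.

```latex
\[
\text{speedup} = \frac{t_{\text{sim}}}{t_{\text{NN}}} = \frac{1}{0.005} = 200,
\qquad
t_{\text{sim}} = 10\,\text{h} = 600\,\text{min}
\;\Rightarrow\;
t_{\text{NN}} = 0.005 \times 600\,\text{min} = 3\,\text{min}.
\]
```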