Multi-dimensional Functional Principal Component Analysis
Functional principal component analysis is one of the most commonly employed
approaches in functional and longitudinal data analysis and we extend it to
analyze functional/longitudinal data observed on a general multi-dimensional
domain. The computational issues emerging in the extension are fully addressed
with our proposed solutions. The local linear smoothing technique is employed
to perform estimation because of its capabilities of performing large-scale
smoothing and of handling data with different sampling schemes (possibly on
irregular domains) in addition to its nice theoretical properties. Besides
employing a fast Fourier transform strategy in smoothing, the modern GPGPU
(general-purpose computing on graphics processing units) architecture is
applied to perform parallel computation and save computation time. To resolve
the out-of-memory issue due to large-scale data, the random projection
procedure is applied in the eigendecomposition step. We show that the proposed
estimators can achieve the classical nonparametric rates for longitudinal data
and the optimal convergence rates for functional data when the number of
observations per sample grows at a suitable rate. Finally, the
performance of our approach is demonstrated with simulation studies and the
fine particulate matter (PM 2.5) data measured in Taiwan.
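As a one-dimensional illustration of the local linear smoothing step (the paper's multi-dimensional, FFT-accelerated, GPU-parallel version is far more involved), the sketch below fits a kernel-weighted least-squares line around each evaluation point. The Gaussian kernel and the bandwidth value in the usage lines are illustrative assumptions, not the paper's choices:

```python
import math

def local_linear(x0, xs, ys, h):
    """Local linear estimate of E[y|x] at x0 with bandwidth h (Gaussian kernel)."""
    w = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    # Weighted least squares for intercept a, slope b in y ~ a + b*(x - x0);
    # the fitted intercept is the smoothed value at x0.
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * (x - x0) * y for wi, x, y in zip(w, xs, ys))
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)

# Recover a smooth signal from noisy samples on a regular grid.
xs = [i / 100 for i in range(101)]
ys = [math.sin(2 * math.pi * x) + 0.05 * (-1) ** i for i, x in enumerate(xs)]
smoothed = local_linear(0.5, xs, ys, h=0.1)
```

The same weighted least-squares fit extends to higher-dimensional domains by replacing the scalar slope with a gradient vector; the paper's contribution is making that computation feasible at scale.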
Compressed Multi-Row Storage Format for Sparse Matrices on Graphics Processing Units
A new format for storing sparse matrices is proposed for efficient sparse
matrix-vector (SpMV) product calculation on modern graphics processing units
(GPUs). This format extends the standard compressed row storage (CRS) format
and can be quickly converted to and from it. Computational performance of two
SpMV kernels for the new format is determined for over 130 sparse matrices on
Fermi-class and Kepler-class GPUs and compared with that of five existing
generic algorithms and industrial implementations, including Nvidia cuSparse
CSR and HYB kernels. We observed speedups over the best of the five
alternative kernels.
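For reference, the baseline CRS/CSR layout that the proposed format extends stores a value array, a column-index array, and a row-pointer array; a scalar sparse matrix-vector product over it looks like the sketch below (a sequential CPU version, not the GPU kernels benchmarked in the paper):

```python
def csr_spmv(vals, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CRS/CSR form."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        # Nonzeros of row r occupy vals[row_ptr[r]:row_ptr[r+1]].
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y

# The 3x3 matrix [[1,0,2],[0,3,0],[4,0,5]] in CSR form:
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
```

The irregular inner-loop lengths in this layout are exactly what makes GPU load balancing hard, which is what multi-row extensions of CRS aim to address.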
Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs
We study the factors affecting training time in multi-device deep learning
systems. Given a specification of a convolutional neural network, our goal is
to minimize the time to train this model on a cluster of commodity CPUs and
GPUs. We first focus on the single-node setting and show that by using standard
batching and data-parallel techniques, throughput can be improved by at least
5.5x over state-of-the-art systems on CPUs. This ensures an end-to-end training
speed directly proportional to the throughput of a device regardless of its
underlying hardware, allowing each node in the cluster to be treated as a black
box. Our second contribution is a theoretical and empirical study of the
tradeoffs affecting end-to-end training time in a multiple-device setting. We
identify the degree of asynchronous parallelization as a key factor affecting
both hardware and statistical efficiency. We see that asynchrony can be viewed
as introducing a momentum term. Our results imply that tuning momentum is
critical in asynchronous parallel configurations, and suggest that published
results that have not been fully tuned might report suboptimal performance for
some configurations. For our third contribution, we use our novel understanding
of the interaction between system and optimization dynamics to provide an
efficient hyperparameter optimizer. Our optimizer involves a predictive model
for the total time to convergence and selects an allocation of resources to
minimize that time. We demonstrate that the most popular distributed deep
learning systems fall within our tradeoff space, but do not optimize within the
space. By doing this optimization, our prototype runs 1.9x to 12x faster than
the fastest state-of-the-art systems.
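The "asynchrony can be viewed as introducing a momentum term" observation can be illustrated on a toy quadratic: updating with a gradient that is tau steps stale changes the stability region of the step size in the same way a momentum term does. This simulation is our own illustrative sketch, not the paper's analysis:

```python
def stale_gd(eta, tau, steps=200, w0=1.0):
    """Gradient descent on f(w) = w^2/2 using a gradient that is tau steps old."""
    hist = [w0] * (tau + 1)  # hist[0] is the oldest iterate, hist[-1] the current one
    w = w0
    for _ in range(steps):
        g = hist[0]          # stale gradient: grad f at the iterate from tau steps ago
        w = w - eta * g
        hist.pop(0)
        hist.append(w)
    return w
```

With a small step size both the synchronous (tau = 0) and stale runs converge, but a step size that converges synchronously (e.g. eta = 0.7 here) diverges once the gradients are two steps stale, which is why hyperparameters tuned for synchronous training can be suboptimal in asynchronous configurations.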
OpenCL Performance Prediction using Architecture-Independent Features
OpenCL is an attractive model for heterogeneous high-performance computing
systems, with wide support from hardware vendors and significant performance
portability. To support efficient scheduling on HPC systems it is necessary to
perform accurate performance predictions for OpenCL workloads on varied compute
devices, which is challenging due to diverse computation, communication and
memory access characteristics which result in varying performance between
devices. The Architecture Independent Workload Characterization (AIWC) tool can
be used to characterize OpenCL kernels according to a set of
architecture-independent features. This work presents a methodology where AIWC
features are used to form a model capable of predicting accelerator execution
times. We used this methodology to predict execution times for a set of 37
computational kernels running on 15 different devices representing a broad
range of CPU, GPU and MIC architectures. The predictions are highly accurate,
differing from the measured experimental run-times by an average of only 1.2%,
and correspond to actual execution-time mispredictions of 9 μs to 1 s,
depending on problem size. A previously unencountered code can be instrumented
once and the AIWC metrics embedded in the kernel, to allow performance
prediction across the full range of modelled devices. The results suggest that
this methodology supports correct selection of the most appropriate device for
a previously unencountered code, which is highly relevant to the HPC scheduling
setting.
Comment: 9 pages, 6 figures, International Workshop on High Performance and
Dynamic Reconfigurable Systems and Networks (DRSN-2018), published in
conjunction with The 2018 International Conference on High Performance
Computing & Simulation (HPCS 2018).
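The end-to-end idea of mapping architecture-independent features to execution time can be sketched with a deliberately simple one-feature least-squares fit; the feature values and timings below are synthetic, and the paper's actual model over the full AIWC feature set is richer than this:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ a + b*x (one feature plus intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Hypothetical (feature, measured-runtime) pairs for one device:
ops = [1000.0, 2000.0, 4000.0, 8000.0]          # e.g. an instruction-count feature
runtimes = [502.0, 1002.0, 2002.0, 4002.0]       # microseconds; runtime = 2 + 0.5*ops
a, b = fit_line(ops, runtimes)
predicted = a + b * 16000.0  # prediction for an unseen, larger problem size
```

Because the features are architecture-independent, one such model per device suffices: a new kernel is characterized once and its runtime predicted on every modelled device.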
Scalable learning for geostatistics and speaker recognition
With improved data acquisition methods, the amount of data that is being collected has increased severalfold. One of the objectives in data collection is to learn useful underlying patterns. In order to work with data at this scale, the methods not only need to be effective with the underlying data, but also have to be scalable to handle larger data collections. This thesis focuses on developing scalable and effective methods targeted towards different domains, geostatistics and speaker recognition in particular.
Initially we focus on kernel based learning methods and develop a GPU based parallel framework for this class of problems. An improved numerical algorithm that utilizes the GPU parallelization to further enhance the computational performance of kernel regression is proposed. These methods are then demonstrated on problems arising in geostatistics and speaker recognition.
In geostatistics, data is often collected at scattered locations, and factors like instrument malfunction lead to missing observations. Applications often require the ability to interpolate this scattered spatiotemporal data onto a regular grid continuously over time. This problem can be formulated as a regression problem, and one of the most popular geostatistical interpolation techniques, kriging, is analogous to a standard kernel method: Gaussian process regression. Kriging is computationally expensive and needs major modifications and accelerations in order to be used practically. The GPU framework developed for kernel methods is extended to kriging, and the GPU's texture memory is further exploited for enhanced computational performance.
Speaker recognition deals with the task of verifying a person's identity based on samples of his/her speech - "utterances". This thesis focuses on the text-independent framework, and three new recognition frameworks were developed for this problem. We proposed a kernelized Renyi distance based similarity scoring for speaker recognition. While its performance is promising, it does not generalize well for limited training data and therefore does not compare well to state-of-the-art recognition systems. These systems compensate for the variability in the speech data due to the message, channel variability, noise and reverberation. State-of-the-art systems model each speaker as a mixture of Gaussians (GMM) and compensate for the variability (termed "nuisance"). We propose a novel discriminative framework using a latent variable technique, partial least squares (PLS), for improved recognition. The kernelized version of this algorithm is used to achieve a state-of-the-art speaker ID system that shows results competitive with the best systems reported in NIST's 2010 Speaker Recognition Evaluation.
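The kriging-to-Gaussian-process correspondence mentioned above can be made concrete: simple kriging with a squared-exponential covariance is exactly the GP regression posterior mean. A dense pure-Python sketch follows (the thesis accelerates the kernel-matrix operations on GPU; the length-scale and jitter values here are illustrative assumptions):

```python
import math

def rbf(a, b, ell=0.3):
    """Squared-exponential covariance; the length-scale ell is an illustrative choice."""
    return math.exp(-0.5 * ((a - b) / ell) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def krige(x_star, xs, ys, jitter=1e-6):
    """Simple kriging / GP posterior mean at x_star given observations (xs, ys)."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    alpha = solve(K, list(ys))
    return sum(rbf(x_star, xi) * ai for xi, ai in zip(xs, alpha))
```

The O(n^3) linear solve over the dense covariance matrix is the bottleneck that makes naive kriging impractical at scale, and is the part the thesis targets with GPU acceleration and improved numerics.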
A Real-Time, GPU-Based, Non-Imaging Back-End for Radio Telescopes
Since the discovery of RRATs, interest in single pulse radio searches has
increased dramatically. Due to the large data volumes generated by these
searches, especially in planned surveys for future radio telescopes, such
searches have to be conducted in real-time. This has led to the development of
a multitude of search techniques and real-time pipeline prototypes. In this
work we investigated the applicability of GPUs. We have designed and
implemented a scalable, flexible, GPU-based, transient search pipeline
composed of several processing stages, including RFI mitigation, dedispersion,
event detection and classification, as well as data quantisation and
persistence. These stages are encapsulated as a standalone framework. The
optimised GPU implementation of direct dedispersion achieves a speedup of more
than an order of magnitude when compared to an optimised CPU implementation. We
use a density-based clustering algorithm, coupled with a candidate selection
mechanism to group detections caused by the same event together and
automatically classify them as either RFI or of celestial origin. This setup
was deployed at the Medicina BEST-II array where several test observations were
conducted. Finally, we calculate the number of GPUs required to process all the
beams for the SKA1-mid non-imaging pipeline. We have also investigated the
applicability of GPUs for beamforming, where our implementation achieves more
than 50% of the peak theoretical performance. We also demonstrate that for
large arrays, and in observations where the generated beams need to be
processed outside of the GPU, the system will become PCIe bandwidth limited.
This can be alleviated by processing the synthesised beams on the GPU itself,
and we demonstrate this by integrating the beamformer to the transient
detection pipeline.
Comment: PhD Thesis, 153 pages. arXiv admin note: text overlap with
arXiv:0904.0633, arXiv:1106.5836, arXiv:1201.5380, arXiv:1109.3186,
arXiv:1106.5817, arXiv:1112.2579 by other authors.
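Direct (brute-force) dedispersion, the stage accelerated by over an order of magnitude on GPU above, shifts each frequency channel by the cold-plasma dispersion delay and sums the channels; a reference CPU sketch is below (the channel layout and parameter values are illustrative):

```python
def dedisperse(channels, freqs_mhz, dm, tsamp):
    """Sum frequency channels after removing the dispersion delay.

    channels: list of per-channel time series (lists of floats)
    freqs_mhz: centre frequency of each channel in MHz
    dm: dispersion measure in pc cm^-3; tsamp: sampling time in seconds
    """
    f_ref = max(freqs_mhz)  # delays are measured relative to the highest frequency
    nsamp = len(channels[0])
    out = [0.0] * nsamp
    for chan, f in zip(channels, freqs_mhz):
        # Cold-plasma dispersion delay: 4.148808e3 s * DM * (f^-2 - f_ref^-2), f in MHz
        delay_s = 4.148808e3 * dm * (f ** -2 - f_ref ** -2)
        shift = int(round(delay_s / tsamp))
        for t in range(nsamp):
            if t + shift < nsamp:
                out[t] += chan[t + shift]
    return out

# Synthetic two-channel filterbank: at DM = 100 the pulse arrives 27 samples
# later in the 1400 MHz channel than in the 1500 MHz channel (tsamp = 1 ms).
nsamp = 128
chan_hi = [0.0] * nsamp; chan_hi[50] = 1.0
chan_lo = [0.0] * nsamp; chan_lo[77] = 1.0
profile = dedisperse([chan_lo, chan_hi], [1400.0, 1500.0], dm=100.0, tsamp=0.001)
```

Searching many trial DM values repeats this shift-and-sum over the same data, which is why the operation is both expensive and well suited to GPU parallelization.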
Real-Time Face and Landmark Localization for Eyeblink Detection
Pavlovian eyeblink conditioning is a powerful experiment used in the field of
neuroscience to measure multiple aspects of how we learn in our daily life. To
track the movement of the eyelid during an experiment, researchers have
traditionally made use of potentiometers or electromyography. More recently,
the use of computer vision and image processing alleviated the need for these
techniques but currently employed methods require human intervention and are
not fast enough to enable real-time processing. In this work, face- and
landmark-detection algorithms have been carefully combined in order to provide
fully automated eyelid tracking, and have further been accelerated to make the
first crucial step towards online, closed-loop experiments. Such experiments
have not been achieved so far and are expected to offer significant insights in
the workings of neurological and psychiatric disorders. Based on an extensive
literature search, various algorithms for face detection and landmark
detection have been analyzed and evaluated. Two algorithms were identified as
most suitable for eyelid detection: the Histogram-of-Oriented-Gradients (HOG)
algorithm for face detection and the Ensemble-of-Regression-Trees (ERT)
algorithm for landmark detection. These two algorithms have been accelerated on
GPU and CPU, achieving speedups of 1,753x and 11x, respectively.
To demonstrate the usefulness of our eyelid-detection algorithm, a research
hypothesis was formed and a well-established neuroscientific experiment was
employed: eyeblink detection. Our experimental evaluation reveals an overall
application runtime of 0.533 ms per frame, which is 1,101x faster than
the sequential implementation and well within the real-time requirements of
eyeblink conditioning in humans, i.e. faster than 500 frames per second.
Comment: Added public gitlab repo link with paper source code.
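The HOG face detector selected above works on histograms of gradient orientations; the core of a single HOG cell can be sketched as follows (real implementations add block normalization, bilinear vote interpolation, and a sliding-window classifier, all omitted here):

```python
import math

def hog_cell(patch, bins=9):
    """Unsigned gradient-orientation histogram for one cell of a grayscale patch."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]   # central differences
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0  # fold into [0, 180)
            hist[int(ang // (180.0 / bins)) % bins] += mag  # magnitude-weighted vote
    return hist

# A vertical edge: all gradient energy lands in the 0-degree bin.
edge = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
hist = hog_cell(edge)
```

The per-pixel independence of the gradient and voting steps is what makes HOG a good candidate for the GPU acceleration reported above.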
Forest Density Estimation
We study graph estimation and density estimation in high dimensions, using a
family of density estimators based on forest structured undirected graphical
models. For density estimation, we do not assume the true distribution
corresponds to a forest; rather, we form kernel density estimates of the
bivariate and univariate marginals, and apply Kruskal's algorithm to estimate
the optimal forest on held out data. We prove an oracle inequality on the
excess risk of the resulting estimator relative to the risk of the best forest.
For graph estimation, we consider the problem of estimating forests with
restricted tree sizes. We prove that finding a maximum weight spanning forest
with restricted tree size is NP-hard, and develop an approximation algorithm
for this problem. Viewing the tree size as a complexity parameter, we then
select a forest using data splitting, and prove bounds on excess risk and
structure selection consistency of the procedure. Experiments with simulated
data and microarray data indicate that the methods are a practical alternative
to Gaussian graphical models.
Comment: Extended version of earlier paper titled "Tree density estimation".
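The forest-selection step builds on Kruskal's algorithm run over estimated bivariate edge weights; a minimal sketch of the unrestricted maximum-weight-forest version follows (the restricted-tree-size variant studied above is NP-hard and requires the approximation algorithm described in the paper):

```python
def max_weight_forest(n, edges):
    """Kruskal's algorithm for a maximum-weight forest on n nodes.

    edges: iterable of (weight, u, v); only positive-weight edges can
    increase the objective, so the result is a forest, not always a tree.
    """
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    chosen = []
    for w, u, v in sorted(edges, reverse=True):
        if w <= 0:
            break
        ru, rv = find(u), find(v)
        if ru != rv:           # edge joins two components: keep it
            parent[ru] = rv
            chosen.append((w, u, v))
    return chosen

# Toy graph: weights stand in for estimated held-out edge scores.
forest = max_weight_forest(4, [(5.0, 0, 1), (4.0, 1, 2), (3.0, 0, 2), (2.0, 2, 3)])
```

Greedily adding the heaviest edge that does not create a cycle is exactly optimal in the unrestricted case; it is the per-tree size constraint that breaks this greedy optimality.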
Hardware-Aware Machine Learning: Modeling and Optimization
Recent breakthroughs in Deep Learning (DL) applications have made DL models a
key component in almost every modern computing system. The increased popularity
of DL applications deployed on a wide spectrum of platforms has resulted in a
plethora of design challenges related to the constraints introduced by the
hardware itself. What is the latency or energy cost for an inference made by a
Deep Neural Network (DNN)? Is it possible to predict this latency or energy
consumption before a model is trained? If yes, how can machine learners take
advantage of these models to design the hardware-optimal DNN for deployment?
From lengthening battery life of mobile devices to reducing the runtime
requirements of DL models executing in the cloud, the answers to these
questions have drawn significant attention.
One cannot optimize what isn't properly modeled. Therefore, it is important
to understand the hardware efficiency of DL models during serving for making an
inference, before even training the model. This key observation has motivated
the use of predictive models to capture the hardware performance or energy
efficiency of DL applications. Furthermore, DL practitioners are challenged
with the task of designing the DNN model, i.e., of tuning the hyper-parameters
of the DNN architecture, while optimizing for both accuracy of the DL model and
its hardware efficiency. Therefore, state-of-the-art methodologies have
proposed hardware-aware hyper-parameter optimization techniques. In this paper,
we provide a comprehensive assessment of state-of-the-art work and selected
results on the hardware-aware modeling and optimization for DL applications. We
also highlight several open questions that are poised to give rise to novel
hardware-aware designs in the next few years, as DL applications continue to
significantly impact associated hardware systems and platforms.
Comment: ICCAD'18 Invited Paper.
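A first-order version of the "what is the latency cost of an inference" question can be answered analytically by counting multiply-accumulates and dividing by an assumed sustained throughput; the predictive models surveyed above learn far richer mappings, so treat this as a back-of-the-envelope sketch with a made-up device throughput:

```python
def conv2d_macs(h, w, c_in, c_out, k, stride=1, pad=0):
    """Multiply-accumulate count for one 2-D convolution layer."""
    out_h = (h + 2 * pad - k) // stride + 1
    out_w = (w + 2 * pad - k) // stride + 1
    return out_h * out_w * c_out * c_in * k * k

def latency_estimate(macs, peak_macs_per_s, efficiency=0.3):
    """Crude latency model: an assumed sustained fraction of peak throughput."""
    return macs / (peak_macs_per_s * efficiency)

# First layer of a ResNet-style network: 224x224x3 input, 64 7x7 filters, stride 2.
macs = conv2d_macs(224, 224, 3, 64, 7, stride=2, pad=3)
secs = latency_estimate(macs, peak_macs_per_s=1e12)  # hypothetical 1 TMAC/s device
```

Counts like this ignore memory traffic and kernel-launch overheads, which is precisely why learned predictive models outperform analytical ones on real hardware.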
Five Tales of Random Forest Regression
We present a set of variations on the theme of Random Forest regression. First, two applications to the problem of estimating galactic distances based on photometry, which produce results comparable to or better than all other current approaches to the problem. Second, an extension of the methodology to produce error-distribution variance estimates for individual regression estimates, a property that appears unique among non-parametric regression estimators. Third, an exponential asymptotic improvement in algorithmic training speed over the current de facto standard implementation, derived from a theoretical model of the training process combined with competent software engineering. Fourth, a massively parallel implementation of the regression algorithm for a GPGPU cluster, integrated with a distributed database management system, resulting in a fast round-trip ingest-analyze-archive procedure on a system with total power consumption under 1 kW. Fifth, a novel theoretical comparison of the methodology with that of kernel regression, relating the Random Forest bootstrap sample size to the kernel regression bandwidth parameter and resulting in a novel extension of the Random Forest methodology which offers lower mean-squared error than the standard methodology.
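The per-estimate variance idea mentioned above can be conveyed with the simplest possible ensemble: bagged regression stumps, where the spread of per-resample predictions at a query point serves as an error-variance proxy. The paper's construction uses full Random Forest trees; this toy version only illustrates the mechanism:

```python
import random

def fit_stump(xs, ys):
    """Depth-1 regression tree: best single threshold minimizing squared error."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for j in range(1, len(xs)):
        thr = (xs[order[j - 1]] + xs[order[j]]) / 2.0
        left = [ys[i] for i in order[:j]]
        right = [ys[i] for i in order[j:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, thr, ml, mr)
    _, thr, ml, mr = best
    return lambda x: ml if x < thr else mr

def bagged_predict(x0, xs, ys, n_trees=50, seed=0):
    """Bagged-stump prediction at x0, plus the spread of per-resample predictions."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]        # bootstrap resample
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(stump(x0))
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return mean, var

# Step-function data: bagged stumps recover the step, and the spread of the
# per-resample predictions acts as a variance proxy for each estimate.
xs = [i / 20.0 for i in range(20)]
ys = [0.0 if x < 0.5 else 1.0 for x in xs]
```

Shrinking the bootstrap sample size widens the effective neighbourhood each tree averages over, which is the lever the bandwidth-analogy extension described above exploits.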