
    Multi-dimensional Functional Principal Component Analysis

    Functional principal component analysis is one of the most commonly employed approaches in functional and longitudinal data analysis, and we extend it to analyze functional/longitudinal data observed on a general $d$-dimensional domain. The computational issues that emerge in this extension are fully addressed with our proposed solutions. The local linear smoothing technique is employed for estimation because of its capability to perform large-scale smoothing and to handle data with different sampling schemes (possibly on irregular domains), in addition to its nice theoretical properties. Besides adopting a fast Fourier transform strategy in smoothing, the modern GPGPU (general-purpose computing on graphics processing units) architecture is applied to perform parallel computation and save computation time. To resolve the out-of-memory issue caused by large-scale data, a random projection procedure is applied in the eigendecomposition step. We show that the proposed estimators can achieve the classical nonparametric rates for longitudinal data and the optimal convergence rates for functional data if the number of observations per sample is of the order $(n/\log n)^{d/4}$. Finally, the performance of our approach is demonstrated with simulation studies and with fine particulate matter (PM 2.5) data measured in Taiwan.
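
    The random-projection eigendecomposition mentioned above can be illustrated with a generic randomized sketch of the top eigenpairs of a large smoothed covariance matrix; the function names, oversampling and power-iteration parameters below are illustrative choices, not the authors' implementation.

```python
import numpy as np

def randomized_top_eigs(C, k, oversample=10, n_iter=2, seed=0):
    """Approximate the top-k eigenpairs of a symmetric PSD matrix C
    (e.g. a discretized, smoothed covariance surface) via random projection.
    Generic randomized-SVD sketch, not the paper's exact procedure."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    # Random test matrix; its range approximates the dominant eigenspace of C.
    Q = rng.standard_normal((n, k + oversample))
    for _ in range(n_iter):                 # power iterations sharpen the subspace
        Q, _ = np.linalg.qr(C @ Q)
    B = Q.T @ C @ Q                         # small (k+p) x (k+p) projected problem
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]        # largest eigenvalues first
    return vals[idx], Q @ vecs[:, idx]      # lift eigenvectors back to full space

# Toy usage: covariance of smooth random functions on a 1-D grid.
grid = np.linspace(0, 1, 500)
freqs = np.random.default_rng(1).uniform(1, 3, 200)
X = np.array([np.sin(2 * np.pi * f * grid) for f in freqs])
C = np.cov(X, rowvar=False)
eigvals, eigfuns = randomized_top_eigs(C, k=3)
```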

    Compressed Multi-Row Storage Format for Sparse Matrices on Graphics Processing Units

    A new format for storing sparse matrices is proposed for efficient sparse matrix-vector (SpMV) product calculation on modern graphics processing units (GPUs). This format extends the standard compressed row storage (CRS) format and can be quickly converted to and from it. The computational performance of two SpMV kernels for the new format is determined for over 130 sparse matrices on Fermi-class and Kepler-class GPUs and compared with that of five existing generic algorithms and industrial implementations, including the Nvidia cuSparse CSR and HYB kernels. We found a speedup of up to approximately 60% over the best of the five alternative kernels.
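
    The CMRS format itself is not reproduced here; as a reference point, a minimal sketch of the baseline CRS (CSR) sparse matrix-vector product that the proposed format extends:

```python
import numpy as np

def spmv_crs(vals, col_idx, row_ptr, x):
    """Baseline CRS (CSR) sparse matrix-vector product y = A @ x.
    The CMRS format in the paper extends this layout; details of the
    extension are not reproduced here."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[j] * x[col_idx[j]]
    return y

# 3x3 example:  [[4, 0, 1],
#                [0, 2, 0],
#                [3, 0, 5]]
vals    = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
print(spmv_crs(vals, col_idx, row_ptr, np.array([1.0, 1.0, 1.0])))  # [5. 2. 8.]
```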

    Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs

    We study the factors affecting training time in multi-device deep learning systems. Given a specification of a convolutional neural network, our goal is to minimize the time to train this model on a cluster of commodity CPUs and GPUs. We first focus on the single-node setting and show that by using standard batching and data-parallel techniques, throughput can be improved by at least 5.5x over state-of-the-art systems on CPUs. This ensures an end-to-end training speed directly proportional to the throughput of a device regardless of its underlying hardware, allowing each node in the cluster to be treated as a black box. Our second contribution is a theoretical and empirical study of the tradeoffs affecting end-to-end training time in a multiple-device setting. We identify the degree of asynchronous parallelization as a key factor affecting both hardware and statistical efficiency. We see that asynchrony can be viewed as introducing a momentum term. Our results imply that tuning momentum is critical in asynchronous parallel configurations, and suggest that published results that have not been fully tuned might report suboptimal performance for some configurations. For our third contribution, we use our novel understanding of the interaction between system and optimization dynamics to provide an efficient hyperparameter optimizer. Our optimizer involves a predictive model for the total time to convergence and selects an allocation of resources to minimize that time. We demonstrate that the most popular distributed deep learning systems fall within our tradeoff space, but do not optimize within the space. By doing this optimization, our prototype runs 1.9x to 12x faster than the fastest state-of-the-art systems.
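
    A toy sketch of the abstract's observation that asynchrony acts like an implicit momentum term, comparing explicit momentum with stale-gradient updates on a one-dimensional quadratic; the learning rate, staleness and step counts are arbitrary illustrative values.

```python
# Toy illustration only: minimize f(w) = 0.5 * w^2, so grad(w) = w.
def sgd_momentum(w0, lr, mu, steps):
    w, v = w0, 0.0
    for _ in range(steps):
        v = mu * v - lr * w            # explicit momentum update
        w = w + v
    return w

def async_sgd(w0, lr, staleness, steps):
    # Gradients are computed on parameters that are `staleness` updates old,
    # mimicking several workers writing to a shared parameter server.
    history = [w0] * (staleness + 1)
    w = w0
    for _ in range(steps):
        stale_w = history[0]           # oldest copy of the parameters
        w = w - lr * stale_w           # gradient of 0.5*w^2 at the stale point
        history = history[1:] + [w]
    return w

# Both trajectories decay towards the optimum in a qualitatively similar way.
print(sgd_momentum(1.0, lr=0.1, mu=0.5, steps=50))
print(async_sgd(1.0, lr=0.1, staleness=2, steps=50))
```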

    OpenCL Performance Prediction using Architecture-Independent Features

    OpenCL is an attractive programming model for heterogeneous high-performance computing systems, with wide support from hardware vendors and significant performance portability. To support efficient scheduling on HPC systems it is necessary to perform accurate performance predictions for OpenCL workloads on varied compute devices, which is challenging because diverse computation, communication and memory access characteristics result in varying performance between devices. The Architecture Independent Workload Characterization (AIWC) tool can be used to characterize OpenCL kernels according to a set of architecture-independent features. This work presents a methodology where AIWC features are used to form a model capable of predicting accelerator execution times. We used this methodology to predict execution times for a set of 37 computational kernels running on 15 different devices representing a broad range of CPU, GPU and MIC architectures. The predictions are highly accurate, differing from the measured experimental run-times by an average of only 1.2%, which corresponds to actual execution-time mispredictions of 9 µs to 1 s, depending on problem size. A previously unencountered code can be instrumented once and the AIWC metrics embedded in the kernel, to allow performance prediction across the full range of modelled devices. The results suggest that this methodology supports correct selection of the most appropriate device for a previously unencountered code, which is highly relevant to the HPC scheduling setting. Comment: 9 pages, 6 figures, International Workshop on High Performance and Dynamic Reconfigurable Systems and Networks (DRSN-2018), published in conjunction with The 2018 International Conference on High Performance Computing & Simulation (HPCS 2018).
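
    A hedged sketch of the general approach described above: fit a regression model from architecture-independent features to execution times. The feature names and synthetic data below are placeholders, and the random-forest choice is an assumption rather than the paper's exact model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical AIWC-style feature table: rows are (kernel, device) pairs,
# columns are architecture-independent metrics plus a device identifier.
rng = np.random.default_rng(0)
n = 555                                  # e.g. 37 kernels x 15 devices
X = np.column_stack([
    rng.uniform(1e3, 1e8, n),            # opcount-like feature (placeholder)
    rng.uniform(0.0, 1.0, n),            # branch-entropy-like feature (placeholder)
    rng.uniform(1e2, 1e7, n),            # unique-memory-address-like feature (placeholder)
    rng.integers(0, 15, n),              # device index (one of 15 devices)
])
# Synthetic log-runtimes for illustration only; real targets are measured times.
y = np.log(X[:, 0]) - 2.0 * X[:, 1] + 0.1 * X[:, 3] + rng.normal(0, 0.1, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", model.score(X_te, y_te))
```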

    Scalable learning for geostatistics and speaker recognition

    With improved data acquisition methods, the amount of data that is being collected has increased severalfold. One of the objectives in data collection is to learn useful underlying patterns. In order to work with data at this scale, methods not only need to be effective on the underlying data, but also have to be scalable enough to handle larger data collections. This thesis focuses on developing scalable and effective methods targeted towards different domains, geostatistics and speaker recognition in particular. Initially we focus on kernel-based learning methods and develop a GPU-based parallel framework for this class of problems. An improved numerical algorithm that utilizes the GPU parallelization to further enhance the computational performance of kernel regression is proposed. These methods are then demonstrated on problems arising in geostatistics and speaker recognition. In geostatistics, data is often collected at scattered locations, and factors like instrument malfunction lead to missing observations. Applications often require the ability to interpolate this scattered spatiotemporal data onto a regular grid continuously over time. This problem can be formulated as a regression problem, and one of the most popular geostatistical interpolation techniques, kriging, is analogous to a standard kernel method: Gaussian process regression. Kriging is computationally expensive and needs major modifications and accelerations in order to be used practically. The GPU framework developed for kernel methods is extended to kriging, and the GPU's texture memory is further exploited for enhanced computational performance. Speaker recognition deals with the task of verifying a person's identity based on samples of his/her speech - "utterances". This thesis focuses on the text-independent setting, and three new recognition frameworks were developed for this problem. We propose a kernelized Renyi-distance-based similarity scoring for speaker recognition. While its performance is promising, it does not generalize well for limited training data and therefore does not compare well to state-of-the-art recognition systems. These systems compensate for the variability in the speech data due to the message, channel variability, noise and reverberation. State-of-the-art systems model each speaker as a mixture of Gaussians (GMM) and compensate for the variability (termed "nuisance"). We propose a novel discriminative framework using a latent variable technique, partial least squares (PLS), for improved recognition. The kernelized version of this algorithm is used to achieve a state-of-the-art speaker ID system that shows results competitive with the best systems reported in NIST's 2010 Speaker Recognition Evaluation.
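
    A minimal sketch of the kriging / Gaussian process regression connection drawn above, interpolating scattered 2-D observations onto a regular grid; this is the textbook formulation with an assumed Gaussian kernel, not the GPU-accelerated implementation from the thesis.

```python
import numpy as np

def gp_krige(X_obs, y_obs, X_new, length_scale=1.0, noise=1e-3):
    """Simple kriging / Gaussian process regression with a Gaussian kernel,
    interpolating scattered observations onto new locations."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length_scale**2)
    K = k(X_obs, X_obs) + noise * np.eye(len(X_obs))
    w = np.linalg.solve(K, y_obs)          # weights for the kriging predictor
    return k(X_new, X_obs) @ w             # predicted values at X_new

# Scattered 2-D observations interpolated onto a small regular grid.
rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 1, (50, 2))
y_obs = np.sin(3 * X_obs[:, 0]) + np.cos(3 * X_obs[:, 1])
gx, gy = np.meshgrid(np.linspace(0, 1, 10), np.linspace(0, 1, 10))
X_new = np.column_stack([gx.ravel(), gy.ravel()])
y_grid = gp_krige(X_obs, y_obs, X_new).reshape(10, 10)
```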

    A Real-Time, GPU-Based, Non-Imaging Back-End for Radio Telescopes

    Since the discovery of RRATs, interest in single-pulse radio searches has increased dramatically. Due to the large data volumes generated by these searches, especially in planned surveys for future radio telescopes, such searches have to be conducted in real time. This has led to the development of a multitude of search techniques and real-time pipeline prototypes. In this work we investigated the applicability of GPUs. We have designed and implemented a scalable, flexible, GPU-based transient search pipeline composed of several processing stages, including RFI mitigation, dedispersion, event detection and classification, as well as data quantisation and persistence. These stages are encapsulated as a standalone framework. The optimised GPU implementation of direct dedispersion achieves a speedup of more than an order of magnitude when compared to an optimised CPU implementation. We use a density-based clustering algorithm, coupled with a candidate selection mechanism, to group detections caused by the same event together and automatically classify them as either RFI or of celestial origin. This setup was deployed at the Medicina BEST-II array, where several test observations were conducted. Finally, we calculate the number of GPUs required to process all the beams for the SKA1-mid non-imaging pipeline. We have also investigated the applicability of GPUs for beamforming, where our implementation achieves more than 50% of the peak theoretical performance. We also demonstrate that for large arrays, and in observations where the generated beams need to be processed outside of the GPU, the system will become PCIe bandwidth limited. This can be alleviated by processing the synthesised beams on the GPU itself, and we demonstrate this by integrating the beamformer into the transient detection pipeline. Comment: PhD Thesis, 153 pages. arXiv admin note: text overlap with arXiv:0904.0633, arXiv:1106.5836, arXiv:1201.5380, arXiv:1109.3186, arXiv:1106.5817, arXiv:1112.2579 by other authors.
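
    A minimal sketch of the direct (incoherent) dedispersion step the pipeline accelerates on the GPU: each frequency channel is shifted by the standard cold-plasma dispersion delay for a trial DM and the channels are summed. The function and parameter names are illustrative, not from the thesis.

```python
import numpy as np

K_DM = 4.148808e3  # dispersion constant in MHz^2 pc^-1 cm^3 s

def dedisperse(data, freqs_mhz, dm, dt):
    """data: (n_chan, n_samp) filterbank block; freqs_mhz: channel centres;
    dm: trial dispersion measure; dt: sampling time in seconds."""
    f_ref = freqs_mhz.max()
    out = np.zeros(data.shape[1])
    for chan, f in enumerate(freqs_mhz):
        delay_s = K_DM * dm * (f**-2 - f_ref**-2)
        shift = int(round(delay_s / dt))
        out += np.roll(data[chan], -shift)   # align the channel to the reference
    return out

# Toy usage: 64 channels between 400 and 800 MHz, 1 ms sampling, one trial DM.
freqs = np.linspace(400.0, 800.0, 64)
block = np.random.default_rng(0).normal(size=(64, 4096))
series = dedisperse(block, freqs, dm=100.0, dt=1e-3)
```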

    Real-Time Face and Landmark Localization for Eyeblink Detection

    Pavlovian eyeblink conditioning is a powerful experiment used in the field of neuroscience to measure multiple aspects of how we learn in our daily life. To track the movement of the eyelid during an experiment, researchers have traditionally made use of potentiometers or electromyography. More recently, the use of computer vision and image processing alleviated the need for these techniques, but currently employed methods require human intervention and are not fast enough to enable real-time processing. In this work, face- and landmark-detection algorithms have been carefully combined to provide fully automated eyelid tracking, and have further been accelerated to make the first crucial step towards online, closed-loop experiments. Such experiments have not been achieved so far and are expected to offer significant insights into the workings of neurological and psychiatric disorders. Based on an extensive literature search, various algorithms for face detection and landmark detection have been analyzed and evaluated. Two algorithms were identified as most suitable for eyelid detection: the Histogram-of-Oriented-Gradients (HOG) algorithm for face detection and the Ensemble-of-Regression-Trees (ERT) algorithm for landmark detection. These two algorithms have been accelerated on GPU and CPU, achieving speedups of 1,753× and 11×, respectively. To demonstrate the usefulness of our eyelid-detection algorithm, a research hypothesis was formed and a well-established neuroscientific experiment was employed: eyeblink detection. Our experimental evaluation reveals an overall application runtime of 0.533 ms per frame, which is 1,101× faster than the sequential implementation and well within the real-time requirements of eyeblink conditioning in humans, i.e. faster than 500 frames per second. Comment: Added public gitlab repo link with paper source code.
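
    A hedged sketch of the HOG + ERT combination using dlib, which provides a HOG-based face detector and an ERT landmark predictor; the eye-openness measure and the use of dlib itself are assumptions for illustration, not the paper's accelerated implementation.

```python
import dlib
import numpy as np

# HOG face detector and ERT landmark predictor; the .dat model file is the
# standard 68-point landmark model and must be downloaded separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def eye_openness(gray_frame):
    """Return one eye-openness score per detected face, computed from the
    eye landmarks (points 36-47 of the 68-point model) of a uint8 image."""
    scores = []
    for face in detector(gray_frame):
        shape = predictor(gray_frame, face)
        pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(36, 48)])
        left, right = pts[:6], pts[6:]
        def aspect(eye):   # vertical eyelid separation over horizontal eye width
            return (np.linalg.norm(eye[1] - eye[5]) + np.linalg.norm(eye[2] - eye[4])) \
                   / (2.0 * np.linalg.norm(eye[0] - eye[3]))
        scores.append((aspect(left) + aspect(right)) / 2.0)
    return scores
```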

    Forest Density Estimation

    We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest-structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal's algorithm to estimate the optimal forest on held-out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Comment: Extended version of an earlier paper titled "Tree density estimation".
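
    A minimal sketch of the forest construction described above, with KDE plug-in mutual-information weights scored on held-out data and Kruskal's algorithm (union-find) selecting the forest; the tree-size restriction and the paper's exact estimators are omitted.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_weight(xi, xj, xi_held, xj_held):
    """Held-out plug-in estimate of I(Xi; Xj) from univariate and bivariate KDEs."""
    joint = gaussian_kde(np.vstack([xi, xj]))
    pi, pj = gaussian_kde(xi), gaussian_kde(xj)
    lr = np.log(joint(np.vstack([xi_held, xj_held])) / (pi(xi_held) * pj(xj_held)))
    return lr.mean()

def kruskal_forest(n_vars, weighted_edges):
    """weighted_edges: (weight, i, j); keep an edge if it joins two trees
    and its held-out weight is positive (so the forest may stay a forest)."""
    parent = list(range(n_vars))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    forest = []
    for w, i, j in sorted(weighted_edges, reverse=True):
        ri, rj = find(i), find(j)
        if w > 0 and ri != rj:
            parent[ri] = rj
            forest.append((i, j))
    return forest

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
X[:, 1] += X[:, 0]                          # make variables 0 and 1 dependent
train, held = X[:200], X[200:]
edges = [(mi_weight(train[:, i], train[:, j], held[:, i], held[:, j]), i, j)
         for i in range(4) for j in range(i + 1, 4)]
print(kruskal_forest(4, edges))
```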

    Hardware-Aware Machine Learning: Modeling and Optimization

    Recent breakthroughs in Deep Learning (DL) applications have made DL models a key component in almost every modern computing system. The increased popularity of DL applications deployed on a wide spectrum of platforms has resulted in a plethora of design challenges related to the constraints introduced by the hardware itself. What is the latency or energy cost of an inference made by a Deep Neural Network (DNN)? Is it possible to predict this latency or energy consumption before a model is trained? If so, how can machine learners take advantage of these models to design the hardware-optimal DNN for deployment? From lengthening the battery life of mobile devices to reducing the runtime requirements of DL models executing in the cloud, the answers to these questions have drawn significant attention. One cannot optimize what isn't properly modeled. Therefore, it is important to understand the hardware efficiency of DL models during serving for making an inference, before even training the model. This key observation has motivated the use of predictive models to capture the hardware performance or energy efficiency of DL applications. Furthermore, DL practitioners are challenged with the task of designing the DNN model, i.e., of tuning the hyper-parameters of the DNN architecture, while optimizing for both the accuracy of the DL model and its hardware efficiency. Therefore, state-of-the-art methodologies have proposed hardware-aware hyper-parameter optimization techniques. In this paper, we provide a comprehensive assessment of state-of-the-art work and selected results on hardware-aware modeling and optimization for DL applications. We also highlight several open questions that are poised to give rise to novel hardware-aware designs in the next few years, as DL applications continue to significantly impact associated hardware systems and platforms. Comment: ICCAD'18 Invited Paper.
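
    A generic sketch of the predictive-model idea surveyed above: learn a per-layer latency model from profiled measurements and sum per-layer predictions to estimate a candidate DNN's inference latency before training. All features, data and the linear-model choice here are hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def conv_features(h, w, c_in, c_out, k, stride):
    """Simple descriptors of a convolutional layer (placeholder feature set)."""
    macs = (h // stride) * (w // stride) * c_out * c_in * k * k
    params = c_out * c_in * k * k
    return [macs, params, c_in, c_out, k, stride]

# Profiled (features, measured latency) pairs would come from the target device;
# here both the layer configurations and latencies are synthetic.
rng = np.random.default_rng(0)
layers = [conv_features(rng.integers(8, 224), rng.integers(8, 224),
                        rng.integers(3, 512), rng.integers(16, 512),
                        rng.choice([1, 3, 5]), rng.choice([1, 2]))
          for _ in range(500)]
X = np.array(layers, dtype=float)
y = 1e-8 * X[:, 0] + 0.05 + rng.normal(0, 0.01, len(X))   # synthetic latencies (ms)

latency_model = LinearRegression().fit(X, y)

# Predicted end-to-end latency of a candidate DNN = sum of per-layer predictions.
candidate = np.array([conv_features(224, 224, 3, 64, 3, 1),
                      conv_features(112, 112, 64, 128, 3, 2)], dtype=float)
print("predicted latency (ms):", latency_model.predict(candidate).sum())
```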

    Five Tales of Random Forest Regression

    We present a set of variations on the theme of Random Forest regression: (1) two applications to the problem of estimating galactic distances from photometry, which produce results comparable to or better than all other current approaches; (2) an extension of the methodology that produces error-distribution variance estimates for individual regression estimates, a property that appears unique among non-parametric regression estimators; (3) an exponential asymptotic improvement in algorithmic training speed over the current de facto standard implementation, derived from a theoretical model of the training process combined with competent software engineering; (4) a massively parallel implementation of the regression algorithm for a GPGPU cluster, integrated with a distributed database management system, resulting in a fast round-trip ingest-analyze-archive procedure on a system with total power consumption under 1 kW; and (5) a novel theoretical comparison of the methodology with kernel regression, relating the Random Forest bootstrap sample size to the kernel-regression bandwidth parameter and yielding a novel extension of the Random Forest methodology which offers lower mean-squared error than the standard methodology.
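
    A brief sketch touching on two of the five variations: a per-prediction spread across the individual trees as a stand-in for the error-variance estimates, and the bootstrap sample size (max_samples) as the knob the thesis relates to a kernel-regression bandwidth. Data are synthetic and the scikit-learn implementation is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (2000, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 2000)

# max_samples controls the bootstrap sample size per tree; smaller samples
# act roughly like a wider kernel-regression bandwidth (smoother fits).
forest = RandomForestRegressor(n_estimators=300, max_samples=0.3,
                               random_state=0).fit(X, y)

X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
per_tree = np.stack([t.predict(X_new) for t in forest.estimators_])
print("predictions:", per_tree.mean(axis=0))
print("spread across trees:", per_tree.std(axis=0))   # per-estimate variability proxy
```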