
    Scalable data clustering using GPUs

    The computational demands of multivariate clustering grow rapidly, so processing large data sets, such as those found in flow cytometry, is very time consuming on a single CPU. Fortunately, these techniques lend themselves naturally to large-scale parallel processing. To address the computational demands, graphics processing units, specifically NVIDIA's CUDA framework and Tesla architecture, were investigated as a low-cost, high-performance solution for a number of clustering algorithms. C-means and Expectation Maximization with Gaussian mixture models were implemented using the CUDA framework. The implementations use a hybrid of CUDA, OpenMP, and MPI to scale to many GPUs on multiple nodes in a high-performance computing environment. This framework is envisioned as part of a larger cloud-based workflow service where biologists can apply multiple algorithms and parameter sweeps to their data sets and quickly receive a thorough set of results that can be further analyzed by experts. Improvements over previous GPU-accelerated implementations range from 1.42x to 21x for C-means and from 3.72x to 5.65x for the Gaussian mixture model on non-trivial data sets. Using a single NVIDIA GTX 260, speedups average 90x for C-means and 74x for the Gaussian mixture model on flow cytometry files, compared to optimized C code running on a single core of a modern Intel CPU. On the TeraGrid Lincoln high-performance cluster at NCSA, C-means achieves 42% parallel efficiency and a CPU speedup of 4794x with 128 Tesla C1060 GPUs; the Gaussian mixture model achieves 72% parallel efficiency and a CPU speedup of 6286x.
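    As a concrete illustration of the per-point data parallelism that makes C-means a good GPU fit, here is a minimal CUDA sketch of the fuzzy membership update with one thread per data point; the kernel name, toy data, and launch configuration are hypothetical, not taken from the paper.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// One thread per data point: recompute its fuzzy memberships
// u[i*c + j] = 1 / sum_l (D_ij / D_il)^(1/(m-1)), where D is the
// squared Euclidean distance and m is the fuzzifier.
__global__ void cmeansMemberships(const float* x, const float* centers,
                                  float* u, int n, int d, int c, float m) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float p = 1.0f / (m - 1.0f);   // exponent for squared distances
    for (int j = 0; j < c; ++j) {
        float dij = 1e-12f;        // guard against exact zeros
        for (int k = 0; k < d; ++k) {
            float t = x[i * d + k] - centers[j * d + k];
            dij += t * t;
        }
        float denom = 0.0f;
        for (int l = 0; l < c; ++l) {
            float dil = 1e-12f;
            for (int k = 0; k < d; ++k) {
                float t = x[i * d + k] - centers[l * d + k];
                dil += t * t;
            }
            denom += powf(dij / dil, p);
        }
        u[i * c + j] = 1.0f / denom;
    }
}

int main() {
    const int n = 4, d = 2, c = 2;
    float hx[n * d] = {0, 0, 0, 1, 5, 5, 5, 6};   // four toy points
    float hc[c * d] = {0, 0.5f, 5, 5.5f};          // two toy centers
    float *dx, *dc, *du;
    cudaMalloc(&dx, sizeof(hx)); cudaMalloc(&dc, sizeof(hc));
    cudaMalloc(&du, n * c * sizeof(float));
    cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
    cudaMemcpy(dc, hc, sizeof(hc), cudaMemcpyHostToDevice);
    cmeansMemberships<<<1, 64>>>(dx, dc, du, n, d, c, 2.0f);
    float hu[n * c];
    cudaMemcpy(hu, du, sizeof(hu), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("point %d: u0=%.3f u1=%.3f\n", i, hu[i * c], hu[i * c + 1]);
    cudaFree(dx); cudaFree(dc); cudaFree(du);
    return 0;
}
```

    Because each point's memberships depend only on the fixed centers, the kernel needs no synchronization, which is what lets such implementations scale across many GPUs with OpenMP and MPI handling the partitioning.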

    Parallel implementation of Expectation-Maximisation algorithm for the training of Gaussian Mixture Models

    Most machine learning algorithms need to handle large data sets, which often leads to limitations on processing time and memory. Expectation-Maximization (EM) is one such algorithm; it is used to train one of the most commonly used parametric statistical models, the Gaussian Mixture Model (GMM). All steps of the algorithm are potentially parallelizable, since they iterate over the entire data set. In this study, we propose a parallel implementation of EM for training GMMs using CUDA. Experiments performed with a UCI dataset show a speedup of 7x compared to the sequential version. We also modified the code to improve global memory access and shared memory usage, achieving up to 56.4% occupancy regardless of the number of Gaussians considered in the experiments.
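    A minimal sketch of what such an E-step kernel can look like, assuming diagonal covariances: component parameters are staged in shared memory, in the spirit of the access optimizations the abstract describes, and each thread computes one sample's responsibilities in the log domain. All names, sizes, and toy data are illustrative assumptions.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

#define D 2   // dimensionality, fixed at compile time for the sketch
#define K 3   // number of Gaussian components

// E-step: each thread computes the responsibilities resp[i*K + k] of one
// sample. Component parameters are staged in shared memory so all threads
// in a block read them from fast on-chip storage, not global memory.
__global__ void emEStep(const float* x, const float* mean, const float* var,
                        const float* weight, float* resp, int n) {
    __shared__ float sMean[K * D], sVar[K * D], sW[K];
    int t = threadIdx.x;
    if (t < K * D) { sMean[t] = mean[t]; sVar[t] = var[t]; }
    if (t < K) sW[t] = weight[t];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + t;
    if (i >= n) return;
    float logp[K], maxLog = -1e30f;
    for (int k = 0; k < K; ++k) {
        float s = logf(sW[k]);
        for (int j = 0; j < D; ++j) {
            float diff = x[i * D + j] - sMean[k * D + j];
            float v = sVar[k * D + j];
            s -= 0.5f * (logf(6.2831853f * v) + diff * diff / v);
        }
        logp[k] = s;
        maxLog = fmaxf(maxLog, s);
    }
    float norm = 0.0f;  // log-sum-exp keeps the normalization stable
    for (int k = 0; k < K; ++k) norm += expf(logp[k] - maxLog);
    for (int k = 0; k < K; ++k)
        resp[i * K + k] = expf(logp[k] - maxLog) / norm;
}

int main() {
    const int n = 2;
    float hx[n * D] = {0, 0, 4, 4};
    float hm[K * D] = {0, 0, 4, 4, 2, 2};
    float hv[K * D] = {1, 1, 1, 1, 1, 1};
    float hw[K] = {0.4f, 0.4f, 0.2f};
    float *dx, *dm, *dv, *dw, *dr;
    cudaMalloc(&dx, sizeof(hx)); cudaMalloc(&dm, sizeof(hm));
    cudaMalloc(&dv, sizeof(hv)); cudaMalloc(&dw, sizeof(hw));
    cudaMalloc(&dr, n * K * sizeof(float));
    cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
    cudaMemcpy(dm, hm, sizeof(hm), cudaMemcpyHostToDevice);
    cudaMemcpy(dv, hv, sizeof(hv), cudaMemcpyHostToDevice);
    cudaMemcpy(dw, hw, sizeof(hw), cudaMemcpyHostToDevice);
    emEStep<<<1, 128>>>(dx, dm, dv, dw, dr, n);
    float hr[n * K];
    cudaMemcpy(hr, dr, sizeof(hr), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("x%d: %.3f %.3f %.3f\n", i, hr[i*K], hr[i*K+1], hr[i*K+2]);
    return 0;
}
```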

    Full covariance Gaussian mixture models evaluation on GPU


    A Self Organization-Based Optical Flow Estimator with GPU Implementation

    This work describes a parallelizable optical flow estimator that uses a modified batch version of the Self Organizing Map (SOM). This gradient-based estimator handles the ill-posedness of motion estimation via a novel combination of regression and a self-organization strategy. The aperture problem is explicitly modeled using an algebraic framework that partitions motion estimates obtained from regression into two sets: one (set Hc) with high-confidence estimates and another (set Hp) with low-confidence estimates. The self-organization step uses a specifically designed pair of sets: the training set (Q = Hc) and the initial weights (W = Hc ∪ Hp). It is shown that with this choice of training and initial weight sets, the interpolation of flow vectors is achieved primarily through the regularization property of the SOM. Moreover, the computationally involved step of finding the winner unit in the SOM simplifies to indexing into a 2D array, making the algorithm parallelizable and highly scalable. To preserve flow discontinuities at occlusion boundaries, we design an anisotropic neighborhood function for the SOM that uses a novel OFCE-residual-based distance measure. A multi-resolution (pyramidal) approach is used to estimate large motions. Because the algorithm is scalable, the estimator can be made real-time given a sufficient number of computing cores (for example, on a GPU). Error metrics are computed against the ground-truth motion available in the Middlebury database.
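    A hypothetical CUDA sketch of the batch SOM update this structure enables, with one thread per map unit and no winner search (each sample's winner is its own grid index, as the abstract notes). For brevity, an isotropic Gaussian neighborhood stands in for the paper's anisotropic, OFCE-residual-based one, and all names and toy data are assumed.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Batch SOM update for a 2-D flow field: map units sit on the pixel grid,
// so a sample's winner is simply its own grid position and no search is
// needed. One thread per unit computes a neighborhood-weighted average of
// the high-confidence flow samples (the set Hc).
__global__ void batchSomUpdate(const float2* samples,  // Hc flow estimates
                               const int2* samplePos,  // their grid positions
                               int nSamples,
                               float2* weights,        // H*W map units
                               int W, int H, float sigma) {
    int ux = blockIdx.x * blockDim.x + threadIdx.x;
    int uy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ux >= W || uy >= H) return;
    float sumX = 0.f, sumY = 0.f, sumH = 0.f;
    float inv2s2 = 1.0f / (2.0f * sigma * sigma);
    for (int i = 0; i < nSamples; ++i) {
        float dx = (float)(samplePos[i].x - ux);
        float dy = (float)(samplePos[i].y - uy);
        // Isotropic Gaussian neighborhood; the paper instead uses an
        // anisotropic, OFCE-residual-based distance here.
        float h = expf(-(dx * dx + dy * dy) * inv2s2);
        sumX += h * samples[i].x;
        sumY += h * samples[i].y;
        sumH += h;
    }
    if (sumH > 0.f) {   // units far from all samples keep their weights
        weights[uy * W + ux].x = sumX / sumH;
        weights[uy * W + ux].y = sumY / sumH;
    }
}

int main() {
    const int W = 4, H = 4, n = 2;
    float2 hs[n] = {{1.f, 0.f}, {0.f, 1.f}};  // two confident flow vectors
    int2 hp[n] = {{0, 0}, {3, 3}};            // at opposite grid corners
    float2 hw[W * H] = {};
    float2 *ds, *dw; int2 *dp;
    cudaMalloc(&ds, sizeof(hs)); cudaMalloc(&dp, sizeof(hp));
    cudaMalloc(&dw, sizeof(hw));
    cudaMemcpy(ds, hs, sizeof(hs), cudaMemcpyHostToDevice);
    cudaMemcpy(dp, hp, sizeof(hp), cudaMemcpyHostToDevice);
    dim3 block(8, 8), grid(1, 1);
    batchSomUpdate<<<grid, block>>>(ds, dp, n, dw, W, H, 1.5f);
    cudaMemcpy(hw, dw, sizeof(hw), cudaMemcpyDeviceToHost);
    printf("unit(0,0): (%.2f, %.2f)\n", hw[0].x, hw[0].y);
    return 0;
}
```

    Each unit's update is independent of the others, which is why the estimator scales to as many cores as there are map units.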

    FPGA-Based Acceleration of Expectation Maximization Algorithm using High Level Synthesis

    Expectation Maximization (EM) is a soft clustering algorithm which iteratively partitions data into M clusters. It is one of the most popular data mining algorithms that uses Gaussian Mixture Models (GMM) for probability density modeling, and it is widely used in applications such as signal processing and Machine Learning (ML). EM requires high computation time and a large amount of memory when dealing with large data sets. Conventionally, an HDL-based design methodology is used to program FPGAs for accelerating computationally intensive algorithms. In many real-world applications, FPGAs provide large speedups along with lower power consumption compared to multicore CPUs and GPUs. The Intel FPGA SDK for OpenCL enables developers with no hardware knowledge to program FPGAs with short development times. This thesis presents an optimized implementation of the EM algorithm on Stratix V and Arria 10 FPGAs using the Intel FPGA SDK for OpenCL. A comparison of performance and power consumption between CPU, GPU, and FPGA is presented for various dimension and cluster sizes. Compared to an Intel(R) Xeon(R) CPU E5-2637, our fully optimized OpenCL model for EM targeting the Arria 10 FPGA achieved up to a 1000X speedup in terms of throughput (Tspeedup) and a 5395X speedup in terms of throughput per unit of power consumed (T/Pspeedup). Compared to previous research on EM-GMM implementations on GPUs, the Arria 10 FPGA obtained up to a 64.74X Tspeedup and a 486.78X T/Pspeedup.
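    For reference, the per-iteration M-step that such an accelerator must pipeline can be written as a single pass over the data that accumulates sufficient statistics. The sketch below is a hypothetical CUDA rendering (not the thesis's OpenCL FPGA kernel), with one thread per (component, dimension) pair and a diagonal-covariance model; all names and toy data are assumptions.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// M-step for a diagonal-covariance GMM: one thread per (component k,
// dimension j) pair accumulates the sufficient statistics
//   Nk = sum_i r_ik,  Sx = sum_i r_ik * x_ij,  Sxx = sum_i r_ik * x_ij^2
// in a single pass, then mu = Sx / Nk and var = Sxx / Nk - mu^2.
__global__ void emMStep(const float* x, const float* resp,
                        float* mean, float* var, float* weight,
                        int n, int d, int c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= c * d) return;
    int k = idx / d, j = idx % d;
    float Nk = 0.f, Sx = 0.f, Sxx = 0.f;
    for (int i = 0; i < n; ++i) {
        float r = resp[i * c + k];   // responsibilities from the E-step
        float v = x[i * d + j];
        Nk += r; Sx += r * v; Sxx += r * v * v;
    }
    float mu = Sx / Nk;
    mean[idx] = mu;
    var[idx] = fmaxf(Sxx / Nk - mu * mu, 1e-6f);  // variance floor
    if (j == 0) weight[k] = Nk / (float)n;        // mixing coefficient
}

int main() {
    const int n = 4, d = 1, c = 2;
    float hx[n] = {0.f, 1.f, 9.f, 10.f};
    float hr[n * c] = {1, 0, 1, 0, 0, 1, 0, 1};   // hard toy assignments
    float *dx, *dr, *dm, *dv, *dw;
    cudaMalloc(&dx, sizeof(hx)); cudaMalloc(&dr, sizeof(hr));
    cudaMalloc(&dm, c * sizeof(float)); cudaMalloc(&dv, c * sizeof(float));
    cudaMalloc(&dw, c * sizeof(float));
    cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
    cudaMemcpy(dr, hr, sizeof(hr), cudaMemcpyHostToDevice);
    emMStep<<<1, 32>>>(dx, dr, dm, dv, dw, n, d, c);
    float hm[c], hv[c], hw[c];
    cudaMemcpy(hm, dm, sizeof(hm), cudaMemcpyDeviceToHost);
    cudaMemcpy(hv, dv, sizeof(hv), cudaMemcpyDeviceToHost);
    cudaMemcpy(hw, dw, sizeof(hw), cudaMemcpyDeviceToHost);
    for (int k = 0; k < c; ++k)
        printf("k=%d: pi=%.2f mu=%.2f var=%.2f\n", k, hw[k], hm[k], hv[k]);
    return 0;
}
```

    The single-pass, accumulate-only structure is what makes the step attractive for a deeply pipelined FPGA datapath as well as for GPUs.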

    Using Von Mises-fisher Distribution for Polymer Conformation Analysis in Multi-scale Framework

    In this study we consider the statistical representation of polymer conformations, which is very important when modeling rheology at macroscopic scales. It is impossible to track all relevant microscopic variables for each polymer in a polymer-laden solution due to the huge number of degrees of freedom associated with such fluids. We applied this approach to one of the most descriptive kinetic models of polymers, the Kramers bead-rod model, in which the probability density function models the angle of each rod with respect to the fixed coordinate axes. Toward this goal we apply a mixture of von Mises-Fisher distributions to model a polymer conformation. Expectation-Maximization-based clustering algorithms are used to estimate the conformation given the ensemble of polymers. Both distribution sampling and parameter estimation have been implemented in parallel on CPU- and GPU-based platforms.
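    A minimal sketch, assuming the p = 3 (unit sphere) case relevant to rod orientations, of evaluating a von Mises-Fisher mixture density in CUDA with one rod orientation per thread; the normalizer is C3(kappa) = kappa / (4*pi*sinh(kappa)), and all names and toy parameters are illustrative, not the authors' code.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

#define K 2  // number of mixture components

// Density of a K-component von Mises-Fisher mixture on the unit sphere:
//   f(x) = sum_k alpha_k * C3(kappa_k) * exp(kappa_k * dot(mu_k, x)),
// evaluated for one unit-length rod orientation x per thread.
__global__ void vmfMixtureDensity(const float3* x, const float3* mu,
                                  const float* kappa, const float* alpha,
                                  float* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.f;
    for (int k = 0; k < K; ++k) {
        float dot = mu[k].x * x[i].x + mu[k].y * x[i].y + mu[k].z * x[i].z;
        float kap = kappa[k];
        // log C3(kappa), using log(sinh k) = k + log1p(-exp(-2k)) - log 2
        // to avoid overflow for large concentrations kappa
        float logC = logf(kap) - logf(4.0f * 3.14159265f)
                   - (kap + log1pf(-expf(-2.f * kap)) - logf(2.f));
        acc += alpha[k] * expf(logC + kap * dot);
    }
    f[i] = acc;
}

int main() {
    const int n = 2;
    float3 hx[n]  = {{0.f, 0.f, 1.f}, {1.f, 0.f, 0.f}};  // rod directions
    float3 hmu[K] = {{0.f, 0.f, 1.f}, {1.f, 0.f, 0.f}};  // mean directions
    float hk[K] = {10.f, 10.f}, ha[K] = {0.5f, 0.5f};
    float3 *dx, *dmu; float *dk, *da, *df;
    cudaMalloc(&dx, sizeof(hx));  cudaMalloc(&dmu, sizeof(hmu));
    cudaMalloc(&dk, sizeof(hk));  cudaMalloc(&da, sizeof(ha));
    cudaMalloc(&df, n * sizeof(float));
    cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
    cudaMemcpy(dmu, hmu, sizeof(hmu), cudaMemcpyHostToDevice);
    cudaMemcpy(dk, hk, sizeof(hk), cudaMemcpyHostToDevice);
    cudaMemcpy(da, ha, sizeof(ha), cudaMemcpyHostToDevice);
    vmfMixtureDensity<<<1, 32>>>(dx, dmu, dk, da, df, n);
    float hf[n];
    cudaMemcpy(hf, df, sizeof(hf), cudaMemcpyDeviceToHost);
    printf("f(x0)=%.4f f(x1)=%.4f\n", hf[0], hf[1]);
    return 0;
}
```

    Replacing the Gaussian density with this directional density is the only change needed to reuse an EM responsibility kernel for orientation data.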

    A Fully-Pipelined Hardware Design for Gaussian Mixture Models

    Gaussian Mixture Models (GMMs) are widely used in many applications, such as data mining, signal processing, and computer vision, for probability density modeling and soft clustering. However, the parameters of a GMM need to be estimated from data by, for example, the Expectation-Maximization algorithm for Gaussian Mixture Models (EM-GMM), which is computationally demanding. This paper presents a novel design for the EM-GMM algorithm targeting reconfigurable platforms, with five main contributions. First, a pipeline-friendly EM-GMM with diagonal covariance matrices that can easily be mapped to hardware architectures. Second, a function evaluation unit for the Gaussian probability density based on fixed-point arithmetic. Third, an extension of our approach to support a wide range of dimensions and/or components by fitting multiple pieces of smaller dimensions onto an FPGA chip. Fourth, a cost and performance model that estimates the required logic resources. Fifth, a dataflow design targeting the Maxeler MPCX2000 with a Stratix-5SGSD8 FPGA that runs over 200 times faster than a 6-core Xeon E5645 processor and over 39 times faster than a Pascal TITAN-X GPU. Our design provides a practical solution for training GMMs with hundreds of millions of high-dimensional input instances and for exploring better GMM parameters in low-latency, high-performance applications.
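    The restructuring behind such a pipeline-friendly design can be sketched as follows: folding every parameter-only term into precomputed per-component constants reduces the per-sample Gaussian evaluation to a pure multiply-accumulate chain, the shape that maps onto pipelined FPGA arithmetic. The hypothetical reference model below uses floating point where the paper uses fixed-point, and all names are illustrative.

```cuda
#include <cmath>
#include <cstdio>

// Pipeline-friendly diagonal-covariance Gaussian log-density: everything
// that depends only on the parameters (log mixing weight excluded here,
// log determinant, halved precisions) is folded into per-component
// constants ahead of time, so the per-sample work is a fixed-length
// multiply-accumulate chain with no divisions or transcendentals.
struct DiagGaussian {
    float logConst;     // -0.5 * sum_j log(2*pi*var_j), precomputed
    float halfPrec[4];  // 0.5 / var_j; dimensionality fixed at 4 here
};

__host__ __device__
float logDensity(const DiagGaussian& g, const float* mean, const float* x) {
    float acc = g.logConst;
    for (int j = 0; j < 4; ++j) {   // unrollable MAC chain
        float diff = x[j] - mean[j];
        acc -= g.halfPrec[j] * diff * diff;
    }
    return acc;
}

int main() {
    float mean[4] = {0, 0, 0, 0}, var[4] = {1, 1, 1, 1}, x[4] = {1, 0, 0, 0};
    DiagGaussian g; g.logConst = 0.f;
    for (int j = 0; j < 4; ++j) {   // host-side precomputation
        g.logConst -= 0.5f * logf(2.0f * 3.14159265f * var[j]);
        g.halfPrec[j] = 0.5f / var[j];
    }
    printf("log N(x) = %.4f\n", logDensity(g, mean, x));  // approx -4.176
    return 0;
}
```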