7 research outputs found

    CUDA DSP Filter for ECG Signals

    Real-time processing is critical for the analysis of ECG signals. Before any processing, the signal must be filtered to enable feature extraction and further analysis. In a data processing center that analyzes thousands of connected ECG sensors, signal processing must be performed very fast. In this paper, we focus on parallelizing a sequential DSP filter for heart-signal processing on GPU cores. We hypothesize that the GPU version is much faster than the CPU version. We provide several experiments to test this hypothesis and to compare the performance of the parallelized GPU code with the sequential code. Assuming the hypothesis holds, we also determine the optimal number of threads per block that yields the maximum speedup. Our analysis shows that the parallelized GPU code achieves linear speedups and is much more efficient than classical single-processor sequential processing.
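    The abstract above does not include the paper's code; as a hedged illustration only, a GPU DSP filter of this kind is often written as one thread per output sample, with the threads-per-block count as the tunable parameter the paper studies. The FIR kernel, coefficient values, and signal length below are all assumptions, not the authors' implementation:

```cuda
// Hypothetical sketch: each thread computes one output sample of a
// simple FIR filter over an ECG trace. Illustrative only.
#include <cuda_runtime.h>

#define TAPS 5

__constant__ float d_coeff[TAPS];  // filter coefficients in fast constant memory

__global__ void firFilter(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < TAPS; ++k) {
        int j = i - k;
        if (j >= 0) acc += d_coeff[k] * in[j];  // zero-pad before sample 0
    }
    out[i] = acc;
}

int main() {
    const int n = 1 << 20;  // one simulated ECG trace
    const float h_coeff[TAPS] = {0.2f, 0.2f, 0.2f, 0.2f, 0.2f};
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));  // placeholder input signal
    cudaMemcpyToSymbol(d_coeff, h_coeff, sizeof(h_coeff));
    // The "threads per block" value is the parameter the paper tunes
    // for maximum speedup; 256 is a common starting point.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    firFilter<<<blocks, threadsPerBlock>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

    Varying `threadsPerBlock` (e.g. 64, 128, 256, 512, 1024) while timing the kernel is the kind of experiment the hypothesis test above describes.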

    Use of CUDA for the Continuous Space Language Model

    The training phase of the Continuous Space Language Model (CSLM) was implemented in NVIDIA's hardware/software architecture, the Compute Unified Device Architecture (CUDA). The implementation combines CUBLAS library routines and CUDA kernel calls on three CUDA-enabled devices of varying compute capability, and a time savings over the traditional CPU approach is demonstrated.
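    CSLM training is dominated by dense matrix products in the network's projection and hidden layers, which is why CUBLAS routines apply. As a hedged sketch only (the layer dimensions and call pattern are assumptions, not taken from the paper), one forward-pass layer reduces to a single `cublasSgemm`:

```cuda
// Hypothetical sketch: a CSLM dense layer Y = W * X as a CUBLAS GEMM.
// Dimensions are illustrative, not the paper's configuration.
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int batch = 128, in_dim = 320, out_dim = 500;
    float *d_W, *d_X, *d_Y;
    cudaMalloc(&d_W, out_dim * in_dim * sizeof(float));
    cudaMalloc(&d_X, in_dim * batch * sizeof(float));
    cudaMalloc(&d_Y, out_dim * batch * sizeof(float));
    cudaMemset(d_W, 0, out_dim * in_dim * sizeof(float));  // placeholder weights
    cudaMemset(d_X, 0, in_dim * batch * sizeof(float));    // placeholder inputs

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major GEMM: Y(out_dim x batch) = W(out_dim x in_dim) * X(in_dim x batch)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                out_dim, batch, in_dim,
                &alpha, d_W, out_dim,
                d_X, in_dim,
                &beta, d_Y, out_dim);

    cublasDestroy(handle);
    cudaFree(d_W);
    cudaFree(d_X);
    cudaFree(d_Y);
    return 0;
}
```

    Nonlinearities and the backward pass would be the "CUDA kernel calls" the abstract mentions alongside the CUBLAS routines.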

    Accelerating Stencil Computation on GPGPU by Novel Mapping Method Between the Global Memory and the Shared Memory

    Stencil computation can be accelerated effectively by exploiting the GPU memory hierarchy. In this paper, to reduce the branch divergence of the traditional mapping between global memory and shared memory, we devise a new mapping mechanism in which the conditional statements that load the boundary stencil points of each XY-tile are removed by aligning the ghost zone, which also reduces synchronization overhead. In addition, we make full use of the single XY-tile loaded into registers at every stencil point, common sub-expression elimination, and software prefetching to reduce overhead. Finally, a detailed performance evaluation demonstrates that our optimization policies are close to optimal in terms of memory bandwidth utilization and achieve higher stencil computation performance.
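    For context on the problem the abstract targets, the traditional global-to-shared mapping for a tiled stencil loads a halo ("ghost zone") with per-thread conditionals, and those branches diverge at tile edges. The baseline sketch below, a hedged illustration and not the paper's code, shows exactly the conditional boundary loads the proposed mapping removes:

```cuda
// Hypothetical baseline sketch: traditional shared-memory tiling for a
// 2D 5-point stencil. The four "if" halo loads are the branch-divergent
// statements the paper's aligned ghost-zone mapping eliminates.
#define BX 16
#define BY 16

__global__ void stencil5(const float *in, float *out, int nx, int ny) {
    __shared__ float tile[BY + 2][BX + 2];  // XY-tile plus one-cell halo
    int gx = blockIdx.x * BX + threadIdx.x;
    int gy = blockIdx.y * BY + threadIdx.y;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;
    if (gx >= nx || gy >= ny) return;

    tile[ly][lx] = in[gy * nx + gx];
    // Branch-divergent ghost-zone loads in the traditional mapping:
    if (threadIdx.x == 0 && gx > 0)
        tile[ly][0] = in[gy * nx + gx - 1];
    if (threadIdx.x == BX - 1 && gx < nx - 1)
        tile[ly][BX + 1] = in[gy * nx + gx + 1];
    if (threadIdx.y == 0 && gy > 0)
        tile[0][lx] = in[(gy - 1) * nx + gx];
    if (threadIdx.y == BY - 1 && gy < ny - 1)
        tile[BY + 1][lx] = in[(gy + 1) * nx + gx];
    __syncthreads();

    if (gx > 0 && gx < nx - 1 && gy > 0 && gy < ny - 1)
        out[gy * nx + gx] = 0.2f * (tile[ly][lx] +
                                    tile[ly][lx - 1] + tile[ly][lx + 1] +
                                    tile[ly - 1][lx] + tile[ly + 1][lx]);
}
```

    In the traditional form every warp at a tile boundary takes a different path through these conditionals, which is the divergence and synchronization cost the new mapping is designed to avoid.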

    Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time

    Accelerated computing has become pervasive for increasing computational power and energy efficiency in terms of GFLOPs/Watt. For application areas with the highest demands, for instance high performance computing, data warehousing, and high performance analytics, accelerators like GPUs or Intel's MICs are distributed throughout the cluster. Since current analyses and predictions show that data movement will be the main contributor to energy consumption, we are entering an era of communication-centric heterogeneous systems that operate under hard power constraints. In this work, we analyze data movement optimizations for distributed heterogeneous systems based on CPUs and GPUs. Thread-collaborative processors like GPUs differ significantly in their execution model from general-purpose processors like CPUs, but available communication models are still designed and optimized for CPUs. Similar to heterogeneity in processing, heterogeneity in communication can have a huge impact on energy and time. To analyze this impact, we use multiple workloads with distinct properties regarding computational intensity and communication characteristics. We show for which workloads tailored communication models are essential, not only reducing execution time but also saving energy. Exposing the impact on energy and time for communication-centric heterogeneous systems is crucial for future optimizations, and this work is a first step in that direction.
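    One concrete instance of a communication model tailored to GPUs, offered here as a hedged illustration rather than the paper's method, is replacing a single pageable, synchronous host-device copy with pinned memory, streams, and chunked asynchronous transfers so that data movement overlaps computation:

```cuda
// Hypothetical sketch: overlapping transfers and kernels with pinned
// memory and two CUDA streams. Illustrative of a "tailored" model only.
#include <cuda_runtime.h>

__global__ void compute(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;  // stand-in for real per-element work
}

int main() {
    const int n = 1 << 22, half = n / 2;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, n * sizeof(float));  // pinned host allocation
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    // Split the work so chunk 1's copy-in overlaps chunk 0's kernel.
    for (int c = 0; c < 2; ++c) {
        int off = c * half;
        cudaMemcpyAsync(d_buf + off, h_buf + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        compute<<<(half + 255) / 256, 256, 0, s[c]>>>(d_buf + off, half);
        cudaMemcpyAsync(h_buf + off, d_buf + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```

    Whether such overlap pays off in time and energy depends on the workload's computational intensity and communication pattern, which is the trade-off the study above measures.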