1,144 research outputs found

    SELF-ADAPTING PARALLEL FRAMEWORK FOR LONG-TERM OBJECT TRACKING

    Get PDF
    Object tracking is a crucial field in computer vision that has many uses in human-computer interaction, security and surveillance, video communication and compression, augmented reality, traffic control, etc. Many implementations are introduced in practice, and yet recent methods emphasize on tracking objects adaptively by learning the object’s perspectives and rediscovering it when it becomes untraceable, so that object’s absence problem (in case of occlusion, cluttering or blurring) is resolved. Most of these algorithms have high computational burden on the computational units and need powerful CPUs to attain real-time tracking and high bitrate video processing. These computational units may handle no more than a single video source, making it unsuitable for large-scale implementations like multiple sources or higher resolution videos. In this thesis, we choose one popular algorithm called TLD, Tracking-Learning-Detection, study the core components of the algorithm that impede its performance, and implement these components in a parallel computational environment such as multi-core CPUs, GPUs, etc., also known as heterogeneous computing. OpenCL is used as a development platform to produce parallel kernels for the algorithm. The goals are to create an acceptable heterogeneous computing environment through utilizing current computer technologies, to imbue real-time applications with an alternative implementation methodology, and to circumvent the upcoming limitations of hardware in terms of cost, power, and speedup. We are able to bring true parallel speedup to the existing implementations, which greatly improves the frame rate for long-term object tracking and with some algorithm parameter modification, it provides more accurate object tracking. According to the experiments, developed kernels have achieved a range of performance improvement. As for reduction based kernels, a maximum of 78X speedup is achieved. While for window based kernels, a range of couple hundreds to 2000X speedup is achieved. And for the optical flow tracking kernel, a maximum of 5.7X speedup is recorded. Global speedup is highly dependent on the hardware specifications, especially for memory transfers. With the use of a medium sized input, the self-adapting parallel framework has successfully obtained a fast learning curve and converged to an average of 1.6X speedup compared to the original implementation. Lastly, for future programming convenience, an OpenCL based library is built to facilitate the use of OpenCL programming on parallel hardware devices, hide the complexity of building and compiling OpenCL kernels, and provide a C-based latency measurement tool that is compatible with several operating systems

    QCD simulations with staggered fermions on GPUs

    Full text link
    We report on our implementation of the RHMC algorithm for the simulation of lattice QCD with two staggered flavors on Graphics Processing Units, using the NVIDIA CUDA programming language. The main feature of our code is that the GPU is not used just as an accelerator, but instead the whole Molecular Dynamics trajectory is performed on it. After pointing out the main bottlenecks and how to circumvent them, we discuss the obtained performances. We present some preliminary results regarding OpenCL and multiGPU extensions of our code and discuss future perspectives.Comment: 22 pages, 14 eps figures, final version to be published in Computer Physics Communication

    A portable OpenCL-based unstructured edge-based Finite Element Navier-Stokes solver on graphics hardware

    Get PDF
    The rise of GPUs in modern high-performance systems increases the interest in porting portion of codes to such hardware. The current paper aims to explore the performance of a portable state-of-the-art FE solver on GPU accelerators. Performance evaluation is done by comparing with an existing highly-optimized OpenMP version of the solver. Code portability is ensured by writing the program using the OpenCL 1.1 specifications, while performance portability is sought through an optimization step performed at the beginning of the calculations to find out the optimal parameter set for the solver. The results show that the new implementation can be several times faster than the OpenMP version.Preprin
    • …
    corecore