
1,236 research outputs found

GPU Accelerated Explicit Time Integration Methods for Electro-Quasistatic Fields

Author: Clemens Markus
Richter Christian
Schöps Sebastian
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 30/12/2016

Electro-quasistatic field problems involving nonlinear materials are commonly discretized in space using finite elements. In this paper, it is proposed to solve the resulting system of ordinary differential equations by an explicit Runge-Kutta-Chebyshev time-integration scheme. This avoids the Newton-Raphson iterations that fully implicit time-integration schemes require. However, the electro-quasistatic system of ordinary differential equations has a Laplace-type mass matrix, so parts of the explicit time-integration scheme remain implicit. An iterative solver with a constant preconditioner is shown to efficiently solve the resulting multiple right-hand side problem. This approach allows an efficient parallel implementation on a system featuring multiple graphics processing units. Comment: 4 pages, 5 figures
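To illustrate why a mass matrix keeps part of an explicit scheme implicit, here is a minimal sketch. It assumes a semi-discrete system of the form M u' = -K(u) u, so every explicit stage still needs one linear solve with M. The sketch uses plain forward Euler steps in place of the Runge-Kutta-Chebyshev scheme, a tiny tridiagonal M, a made-up nonlinearity, and a Jacobi preconditioner computed once and reused. All names and values are hypothetical, and nothing here is taken from the paper's implementation.

```python
def matvec(A, x):
    """Dense matrix-vector product for small list-of-lists matrices."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def pcg(A, b, inv_diag, tol=1e-12, max_iter=200):
    """Jacobi-preconditioned conjugate gradients for a symmetric positive definite A."""
    x = [0.0] * len(b)
    r = list(b)                                  # residual b - A x for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]   # apply the fixed preconditioner
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


def rhs(u):
    # Hypothetical nonlinear right-hand side -K(u) u with K(u) = diag(1 + u_i^2).
    return [-(1.0 + ui * ui) * ui for ui in u]


def explicit_euler_step(M, inv_diag, u, h):
    """One forward Euler step of M u' = rhs(u); the solve with M is the implicit part."""
    du = pcg(M, rhs(u), inv_diag)
    return [ui + h * di for ui, di in zip(u, du)]


# Small Laplace-type (tridiagonal) matrix standing in for the mass matrix.
M = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
# The preconditioner is built once and reused for every right-hand side.
INV_DIAG = [1.0 / M[i][i] for i in range(len(M))]
```

Because M does not change between stages or time steps, the preconditioner setup cost is paid once, which is the property the abstract relies on for the multiple right-hand side problem.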

arXiv.org e-Print Archive

TUbiblio

GPU-based Streaming for Parallel Level of Detail on Massive Model Rendering

Author: Cao Yong
Peng Chao
Publication date: 01/01/2011

Rendering massive 3D models in real-time has long been recognized as a very challenging problem because of the limited computational power and memory space available in a workstation. Most existing rendering techniques, especially level of detail (LOD) processing, suffer from their sequential execution nature and do not scale well with the size of the models. We present a GPU-based progressive mesh simplification approach which enables the interactive rendering of large 3D models with hundreds of millions of triangles. Our work contributes to massive model rendering research in two ways. First, we develop a novel data structure to represent the progressive LOD mesh, and design a parallel mesh simplification algorithm for the GPU architecture. Second, we propose a GPU-based streaming approach which adopts a frame-to-frame coherence scheme in order to minimize the high communication cost between CPU and GPU. Our results show that the parallel mesh simplification algorithm and GPU-based streaming approach significantly improve the overall rendering performance.

Computer Science Technical Reports @Virginia Tech

FPGA-Based Acceleration of the Self-Organizing Map (SOM) Algorithm using High-Level Synthesis

Author: Oninda Mohammad Abdul Moin
Publication venue: 'University of Windsor Leddy Library'
Publication date: 17/11/2019

One of the fastest growing and most demanding areas of computer science is Machine Learning (ML). The Self-Organizing Map (SOM), categorized as unsupervised ML, is a popular data-mining algorithm widely used in Artificial Neural Networks (ANNs) for mapping high dimensional data into low dimensional feature maps. SOM, being computationally intensive, requires high computational time and power when dealing with large datasets. Acceleration of many computationally intensive algorithms can be achieved using Field-Programmable Gate Arrays (FPGAs), but it requires extensive hardware knowledge and longer development time when employing the traditional Hardware Description Language (HDL) based design methodology. Open Computing Language (OpenCL) is a standard framework for writing parallel computing programs that execute on heterogeneous computing systems. Intel FPGA Software Development Kit for OpenCL (IFSO) is a High-Level Synthesis (HLS) tool that provides a more efficient alternative to HDL-based design. This research presents an optimized OpenCL implementation of the SOM algorithm on Stratix V and Arria 10 FPGAs using IFSO. Compared to recent SOM implementations on Central Processing Unit (CPU) and Graphics Processing Unit (GPU), our OpenCL implementation on FPGAs provides superior speed performance and power consumption results. Stratix V achieves a speedup of 1.41x - 16.55x compared to AMD and Intel CPU and 2.18x compared to Nvidia GPU, whereas Arria 10 achieves a speedup of 1.63x - 19.15x compared to AMD and Intel CPU and 2.52x compared to Nvidia GPU. In terms of power consumption, Stratix V is 35.53x and 42.53x whereas Arria 10 is 15.82x and 15.93x more power efficient compared to CPU and GPU respectively.
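For readers unfamiliar with SOM, the sketch below shows one training update in its textbook form: find the best matching unit (BMU) for an input vector, then pull every neuron's weights toward the input, scaled by a Gaussian neighborhood on the map grid. This is a plain Python illustration, not the OpenCL kernel described in the abstract. The function names, learning rate, and neighborhood width are assumptions chosen for the example.

```python
import math


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x (squared Euclidean distance)."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)),
    )


def som_update(weights, positions, x, lr, sigma):
    """One SOM training step, updating weights in place and returning the BMU index.

    weights   : list of weight vectors, one per neuron
    positions : grid coordinates of each neuron on the low dimensional map
    lr, sigma : learning rate and neighborhood width (hypothetical values below)
    """
    bmu = best_matching_unit(weights, x)
    for i, w in enumerate(weights):
        # Squared grid distance between neuron i and the BMU.
        d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[bmu]))
        h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood factor
        weights[i] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]
    return bmu


# Two neurons on a 1D map, trained on a single 2D sample.
weights = [[0.0, 0.0], [1.0, 1.0]]
positions = [(0.0,), (1.0,)]
bmu = som_update(weights, positions, [0.9, 0.9], lr=0.5, sigma=1.0)
```

The BMU search and the per-neuron update are independent across neurons, which is what makes the algorithm a natural candidate for parallel hardware.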

Scholarship at UWindsor

FPGA-based High-Performance Collision Detection: An Enabling Technique for Image-Guided Robotic Surgery

Author: Abbott
Abolhassani
Akenine-Moller
Altomonte
Avril
Barequet
Basdogan
Bethea
Bowyer
Brost
Brown
Carter
Che
Chen
Chow
Collange
Cope
Courtecuisse
Fons
Gibson
Govindaraju
Huebner
Kestur
Kim
Kwok
Lee
Li
Liu
Maier-Hein
Maier-Hein
Mainzer
Meijden
Monmasson
Njiki
Okamura
Okamura
Pabst
Papadonikolakis
Peterlik
Redon
Sano
Scharstein
Schostek
Siciliano
Smach
Sridhar
Stoyanov
Stoyanov
Stoyanov
Stoyanov
Sudha
Vachhani
Vadakkepat
Wang
Publication venue: 'Frontiers Media SA'
Publication date: 01/01/2016

Collision detection, which refers to the computational problem of finding the relative placement or configuration of two or more objects, is an essential component of many applications in computer graphics and robotics. In image-guided robotic surgery, real-time collision detection is critical for preserving healthy anatomical structures during the surgical procedure. However, the computational complexity of the problem usually results in algorithms that operate at low speed. In this paper, we present a fast and accurate algorithm for collision detection between Oriented-Bounding-Boxes (OBBs) that is suitable for real-time implementation. Our proposed Sweep and Prune algorithm performs a preliminary filtering to reduce the number of objects that need to be tested by the classical Separating Axis Test algorithm, while the OBB pairs of interest are preserved. These OBB pairs are re-checked by the Separating Axis Test algorithm to obtain their accurate overlapping status. To accelerate the execution, our Sweep and Prune algorithm is tailor-made for the proposed method. Meanwhile, a high performance scalable hardware architecture is proposed by analyzing the intrinsic parallelism of our algorithm, and is implemented on an FPGA platform. Results show that our hardware design on the FPGA platform can achieve around 8X higher running speed than the software design on a CPU platform. As a result, the proposed algorithm can achieve a collision frame rate of 1 kHz, and fulfill the requirement for the medical surgery scenario of Robot Assisted Laparoscopy.
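The two-phase structure described above (a cheap Sweep and Prune filter followed by an exact Separating Axis Test) can be sketched in a simplified 2D setting. The sketch assumes each box is a tuple (cx, cy, hx, hy, angle), giving the center, half-extents, and rotation in radians. It sweeps along the x axis only and tests the four candidate axes of two oriented rectangles. The paper's 3D OBB variant and its tailored Sweep and Prune are not reproduced here, and all function names are hypothetical.

```python
import math


def axes(box):
    """Unit vectors along the two local axes of an oriented box."""
    c, s = math.cos(box[4]), math.sin(box[4])
    return (c, s), (-s, c)


def x_extent(box):
    """Projection interval of the oriented box on the world x axis."""
    cx, _, hx, hy, a = box
    r = hx * abs(math.cos(a)) + hy * abs(math.sin(a))
    return cx - r, cx + r


def sweep_and_prune(boxes):
    """Broad phase: index pairs whose x intervals overlap."""
    order = sorted(range(len(boxes)), key=lambda i: x_extent(boxes[i])[0])
    active, pairs = [], []
    for i in order:
        lo, _ = x_extent(boxes[i])
        # Drop boxes whose interval ended before the current one starts.
        active = [j for j in active if x_extent(boxes[j])[1] >= lo]
        pairs.extend((min(i, j), max(i, j)) for j in active)
        active.append(i)
    return pairs


def sat_overlap(a, b):
    """Narrow phase: 2D Separating Axis Test for two oriented boxes."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    ua, va = axes(a)
    ub, vb = axes(b)
    for lx, ly in (ua, va, ub, vb):
        # Projected radius of each box on the candidate axis.
        ra = a[2] * abs(ua[0] * lx + ua[1] * ly) + a[3] * abs(va[0] * lx + va[1] * ly)
        rb = b[2] * abs(ub[0] * lx + ub[1] * ly) + b[3] * abs(vb[0] * lx + vb[1] * ly)
        if abs(dx * lx + dy * ly) > ra + rb:
            return False  # a separating axis exists, so the boxes do not overlap
    return True


def colliding_pairs(boxes):
    """Run the filter, then re-check only the surviving candidate pairs exactly."""
    return sorted(p for p in sweep_and_prune(boxes) if sat_overlap(boxes[p[0]], boxes[p[1]]))
```

With four unit-sized boxes at (0, 0), (1.5, 0), (10, 0) and (1.5, 3), the broad phase discards every pair involving the distant third box, and the exact test then rejects the pairs that only overlap in x.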

Crossref

Directory of Open Access Journals

Frontiers - Publisher Connector

UCL Discovery

HKU Scholars Hub

Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors

Author: DeHon André
Kapre Nachiket
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009

Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture-specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance-tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3-182 times for a Xilinx Virtex5 LX 330T, 1.3-33 times for an IBM Cell, and 3-131 times for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models.

Crossref

Caltech Authors

DR-NTU (Digital Repository of NTU)