8,192 research outputs found
PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation
High-performance computing has recently seen a surge of interest in
heterogeneous systems, with an emphasis on modern Graphics Processing Units
(GPUs). These devices offer tremendous potential for performance and efficiency
in important large-scale applications of computational science. However,
exploiting this potential can be challenging, as one must adapt to the
specialized and rapidly evolving computing environment currently exhibited by
GPUs. One way of addressing this challenge is to embrace better techniques and
develop tools tailored to their needs. This article presents one simple
technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL,
two open-source toolkits that support this technique.
In introducing PyCUDA and PyOpenCL, this article proposes the combination of
a dynamic, high-level scripting language with the massive performance of a GPU
as a compelling two-tiered computing platform, potentially offering significant
performance and productivity advantages over conventional single-tier, static
systems. The concept of RTCG is simple and easily implemented using existing,
robust infrastructure. Nonetheless it is powerful enough to support (and
encourage) the creation of custom application-specific tools by its users. The
premise of the paper is illustrated by a wide range of examples where the
technique has been applied with considerable success.Comment: Submitted to Parallel Computing, Elsevie
Assessing hyper parameter optimization and speedup for convolutional neural networks
The increased processing power of graphical processing units (GPUs) and the availability of large image datasets has fostered a renewed interest in extracting semantic information from images. Promising results for complex image categorization problems have been achieved using deep learning, with neural networks comprised of many layers. Convolutional neural networks (CNN) are one such architecture which provides more opportunities for image classification. Advances in CNN enable the development of training models using large labelled image datasets, but the hyper parameters need to be specified, which is challenging and complex due to the large number of parameters. A substantial amount of computational power and processing time is required to determine the optimal hyper parameters to define a model yielding good results. This article provides a survey of the hyper parameter search and optimization methods for CNN architectures
Accelerated Neural Networks on OpenCL Devices Using SYCL-DNN
Over the past few years machine learning has seen a renewed explosion of
interest, following a number of studies showing the effectiveness of neural
networks in a range of tasks which had previously been considered incredibly
hard. Neural networks' effectiveness in the fields of image recognition and
natural language processing stems primarily from the vast amounts of data
available to companies and researchers, coupled with the huge amounts of
compute power available in modern accelerators such as GPUs, FPGAs and ASICs.
There are a number of approaches available to developers for utilizing GPGPU
technologies such as SYCL, OpenCL and CUDA, however many applications require
the same low level mathematical routines. Libraries dedicated to accelerating
these common routines allow developers to easily make full use of the available
hardware without requiring low level knowledge of the hardware themselves,
however such libraries are often provided by hardware manufacturers for
specific hardware such as cuDNN for Nvidia hardware or MIOpen for AMD hardware.
SYCL-DNN is a new open-source library dedicated to providing accelerated
routines for neural network operations which are hardware and vendor agnostic.
Built on top of the SYCL open standard and written entirely in standard C++,
SYCL-DNN allows a user to easily accelerate neural network code for a wide
range of hardware using a modern C++ interface. The library is tested on AMD's
OpenCL for GPU, Intel's OpenCL for CPU and GPU, ARM's OpenCL for Mali GPUs as
well as ComputeAorta's OpenCL for R-Car CV engine and host CPU. In this talk we
will present performance figures for SYCL-DNN on this range of hardware, and
discuss how high performance was achieved on such a varied set of accelerators
with such different hardware features.Comment: 4 pages, 3 figures. In International Workshop on OpenCL (IWOCL '19),
May 13-15, 2019, Bosto
Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors
Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance-tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3- 182times for a Xilinx Virtex5 LX 330T, 1.3-33times for an IBM Cell, and 3-131times for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models
Application of parallel distributed processing to space based systems
The concept of using Parallel Distributed Processing (PDP) to enhance automated experiment monitoring and control is explored. Recent very large scale integration (VLSI) advances have made such applications an achievable goal. The PDP machine has demonstrated the ability to automatically organize stored information, handle unfamiliar and contradictory input data and perform the actions necessary. The PDP machine has demonstrated that it can perform inference and knowledge operations with greater speed and flexibility and at lower cost than traditional architectures. In applications where the rule set governing an expert system's decisions is difficult to formulate, PDP can be used to extract rules by associating the information an expert receives with the actions taken
Recommended from our members
Computational Strategies for Scalable Genomics Analysis.
The revolution in next-generation DNA sequencing technologies is leading to explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms for genomics analysis. Various big data technologies have been explored to scale up/out current bioinformatics solutions to mine the big genomics data. In this review, we survey some of these exciting developments in the applications of parallel distributed computing and special hardware to genomics. We comment on the pros and cons of each strategy in the context of ease of development, robustness, scalability, and efficiency. Although this review is written for an audience from the genomics and bioinformatics fields, it may also be informative for the audience of computer science with interests in genomics applications
An Extensible Timing Infrastructure for Adaptive Large-scale Applications
Real-time access to accurate and reliable timing information is necessary to
profile scientific applications, and crucial as simulations become increasingly
complex, adaptive, and large-scale. The Cactus Framework provides flexible and
extensible capabilities for timing information through a well designed
infrastructure and timing API. Applications built with Cactus automatically
gain access to built-in timers, such as gettimeofday and getrusage,
system-specific hardware clocks, and high-level interfaces such as PAPI. We
describe the Cactus timer interface, its motivation, and its implementation. We
then demonstrate how this timing information can be used by an example
scientific application to profile itself, and to dynamically adapt itself to a
changing environment at run time
Report from the MPP Working Group to the NASA Associate Administrator for Space Science and Applications
NASA's Office of Space Science and Applications (OSSA) gave a select group of scientists the opportunity to test and implement their computational algorithms on the Massively Parallel Processor (MPP) located at Goddard Space Flight Center, beginning in late 1985. One year later, the Working Group presented its report, which addressed the following: algorithms, programming languages, architecture, programming environments, the way theory relates, and performance measured. The findings point to a number of demonstrated computational techniques for which the MPP architecture is ideally suited. For example, besides executing much faster on the MPP than on conventional computers, systolic VLSI simulation (where distances are short), lattice simulation, neural network simulation, and image problems were found to be easier to program on the MPP's architecture than on a CYBER 205 or even a VAX. The report also makes technical recommendations covering all aspects of MPP use, and recommendations concerning the future of the MPP and machines based on similar architectures, expansion of the Working Group, and study of the role of future parallel processors for space station, EOS, and the Great Observatories era
- âŠ