High Performance Multiview Video Coding
Following the standardization of the latest video coding standard, High Efficiency Video Coding (HEVC), in 2013, the multiview extension of HEVC (MV-HEVC) was published in 2014 and brought significantly better compression performance, around 50%, for multiview and 3D videos compared to multiple independent single-view HEVC coding. However, the extremely high computational complexity of MV-HEVC demands significant optimization of the encoder. To tackle this problem, this work investigates the possibilities of using modern parallel computing platforms and tools, such as single-instruction-multiple-data (SIMD) instructions, multi-core CPUs, massively parallel GPUs, and computer clusters, to significantly enhance the performance of the multiview video coding (MVC) encoder. These computing tools have very different characteristics, and misuse of them may result in poor performance improvement and sometimes even degradation. To achieve the best possible encoding performance from modern computing tools, different levels of parallelism inside a typical MVC encoder are identified and analyzed. Novel optimization techniques at various levels of abstraction are proposed: non-aggregation massively parallel motion estimation (ME) and disparity estimation (DE) at the prediction unit (PU) level, fractional and bi-directional ME/DE acceleration through SIMD, quantization parameter (QP)-based early termination for coding tree units (CTUs), optimized resource-scheduled wave-front parallel processing for CTUs, and workload-balanced, cluster-based multiple-view parallelism. The results show that the proposed parallel optimization techniques significantly improve execution time with insignificant loss in coding efficiency. This, in turn, proves that modern parallel computing platforms, with appropriate platform-specific algorithm design, are valuable tools for improving the performance of computationally intensive applications.
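As a rough, illustrative sketch of the kind of data-parallel kernel such an encoder accelerates (not the thesis's algorithms: block size, search range, threshold, and function names below are assumptions), the following NumPy code performs a vectorized sum-of-absolute-differences block search with an early-termination cutoff standing in for a QP-derived criterion:

```python
import numpy as np

def block_sad_search(cur, ref, bx, by, block=16, search=8, early_term=512):
    """Vectorized SAD search for one block; early_term mimics a QP-derived cutoff."""
    target = cur[by:by + block, bx:bx + block].astype(np.int32)
    best_mv, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue
            cand = ref[y:y + block, x:x + block].astype(np.int32)
            sad = np.abs(target - cand).sum()   # whole-block SAD in one vectorized op
            if sad < best_sad:
                best_mv, best_sad = (dx, dy), sad
                if best_sad < early_term:       # stop early once "good enough"
                    return best_mv, best_sad
    return best_mv, best_sad
```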
PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation
High-performance computing has recently seen a surge of interest in
heterogeneous systems, with an emphasis on modern Graphics Processing Units
(GPUs). These devices offer tremendous potential for performance and efficiency
in important large-scale applications of computational science. However,
exploiting this potential can be challenging, as one must adapt to the
specialized and rapidly evolving computing environment currently exhibited by
GPUs. One way of addressing this challenge is to embrace better techniques and
develop tools tailored to their needs. This article presents one simple
technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL,
two open-source toolkits that support this technique.
In introducing PyCUDA and PyOpenCL, this article proposes the combination of
a dynamic, high-level scripting language with the massive performance of a GPU
as a compelling two-tiered computing platform, potentially offering significant
performance and productivity advantages over conventional single-tier, static
systems. The concept of RTCG is simple and easily implemented using existing,
robust infrastructure. Nonetheless it is powerful enough to support (and
encourage) the creation of custom application-specific tools by its users. The
premise of the paper is illustrated by a wide range of examples where the
technique has been applied with considerable success. Comment: Submitted to Parallel Computing, Elsevier.
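A minimal sketch of GPU run-time code generation with PyCUDA, assuming a CUDA-capable machine with PyCUDA and NumPy installed; the kernel, its name, and the tunable block size are illustrative choices rather than examples from the paper. A kernel is assembled as a Python string with a run-time parameter substituted in, compiled on the fly with SourceModule, and launched on NumPy arrays:

```python
import numpy as np
import pycuda.autoinit                     # creates a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

BLOCK = 256                                # tuning parameter chosen at run time
kernel_src = """
__global__ void scale(float *dst, const float *src, float alpha, int n)
{
    int i = blockIdx.x * %(BLOCK)d + threadIdx.x;
    if (i < n) dst[i] = alpha * src[i];
}
""" % {"BLOCK": BLOCK}

mod = SourceModule(kernel_src)             # run-time compilation (RTCG)
scale = mod.get_function("scale")

src = np.random.rand(10000).astype(np.float32)
dst = np.empty_like(src)
grid = ((src.size + BLOCK - 1) // BLOCK, 1)
scale(drv.Out(dst), drv.In(src), np.float32(2.0), np.int32(src.size),
      block=(BLOCK, 1, 1), grid=grid)
assert np.allclose(dst, 2.0 * src)
```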
Efficient multitemporal change detection techniques for hyperspectral images on GPU
Hyperspectral images contain hundreds of reflectance values for each pixel.
Detecting regions of change in multiple hyperspectral images of the same
scene taken at different times is of widespread interest for a large number of
applications. For remote sensing, in particular, a very common application is
land-cover analysis. The high dimensionality of the hyperspectral images
makes the development of computationally efficient processing schemes
critical. This thesis focuses on the development of change detection
approaches at object level, based on supervised direct multidate
classification, for hyperspectral datasets. The proposed approaches improve
the accuracy of current state-of-the-art algorithms, and their projection onto
Graphics Processing Units (GPUs) allows their execution in real-time
scenarios.
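A simplified, pixel-level illustration of supervised direct multidate classification (the thesis itself works at object level and targets GPU execution; array shapes, the random-forest classifier, and all names below are assumptions): the spectra of the two acquisition dates are stacked per pixel and a supervised classifier labels change classes:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def direct_multidate_classify(img_t1, img_t2, train_mask, train_labels):
    """img_t1, img_t2: (rows, cols, bands) hyperspectral cubes of the same scene.
    train_mask: boolean (rows, cols) array selecting labelled pixels.
    train_labels: (n_train,) change-class labels for those pixels."""
    rows, cols, bands = img_t1.shape
    # Direct multidate approach: concatenate both dates' spectra per pixel.
    stacked = np.concatenate([img_t1, img_t2], axis=2).reshape(rows * cols, 2 * bands)
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    clf.fit(stacked[train_mask.ravel()], train_labels)
    # Classify every pixel into a change class and reshape back to the scene.
    return clf.predict(stacked).reshape(rows, cols)
```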
Embedded System Optimization of Radar Post-processing in an ARM CPU Core
Algorithms executed on the radar processor system contribute to a significant performance bottleneck of the overall radar system. One key performance concern is
the latency in target detection when dealing with hard-deadline systems. Research has shown software optimization to be one major contributor to radar system performance
improvements. This thesis aims at software optimization using both a manual and an automatic approach, and at analyzing the results to make informed future decisions
while working with an ARM processor system. In order to ascertain an optimized implementation, a question put forward was whether the algorithms on the ARM
processor could work with a 6-antenna implementation without a decline in performance. An answer would also help project how many additional
algorithms can still be added without performance decline.
The manual optimization was done based on a quantitative analysis of the software execution time. The manual optimization approach looked at a vectorization
strategy using the NEON vector registers on the ARM CPU to reimplement the initial Constant False Alarm Rate (CFAR) detection algorithm. An additional
optimization approach was eliminating redundant loops while iterating through the range gates and Doppler filters. In order to determine the best compiler for automatic
code optimization of the radar algorithms on the ARM processor, the GCC and Clang compilers were used to compile both the initial algorithms and the optimized
implementation of the radar post-processing stage.
Analysis of the optimization results showed that it is possible to run the radar post-processing algorithms on the ARM processor with the 6-antenna implementation
without system load stress. In addition, the results show an excellent headroom margin under the defined scenario. The result analysis further revealed that the
effect of dynamic memory allocation cannot be underestimated in situations where performance is a significant concern. The results also demonstrated
that the GCC and Clang compilers each have their strengths and weaknesses. One limiting factor to note for the optimization using the
NEON registers is the effect of the sample size on the optimized implementation. Although it fits the test samples used in the defined scenario, results may
vary for other window cell sizes and might not necessarily satisfy the time constraints.
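The thesis's CFAR variant and its NEON implementation are not reproduced here; the sketch below is a generic cell-averaging CFAR over a one-dimensional power profile, written with vectorized NumPy window sums in the role that NEON vector operations play on the ARM CPU. Guard/training window sizes and the threshold factor are illustrative:

```python
import numpy as np

def ca_cfar(power, guard=2, train=8, scale=4.0):
    """Generic cell-averaging CFAR over a 1-D power profile (e.g. one range line).
    guard: guard cells on each side of the cell under test (CUT).
    train: training cells on each side used to estimate the noise floor.
    scale: threshold multiplier controlling the false-alarm rate."""
    n = power.size
    half = guard + train
    # Sliding-window sums via a cumulative sum (vectorized, NEON-like).
    csum = np.concatenate(([0.0], np.cumsum(power)))
    detections = np.zeros(n, dtype=bool)
    for i in range(half, n - half):
        total = csum[i + half + 1] - csum[i - half]          # full window sum
        guard_sum = csum[i + guard + 1] - csum[i - guard]    # guard cells + CUT
        noise = (total - guard_sum) / (2 * train)            # training-cell average
        detections[i] = power[i] > scale * noise
    return detections
```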
Doctor of Philosophy dissertation
Deep Neural Networks (DNNs) are the state-of-the-art solution in a growing number of tasks, including computer vision, speech recognition, and genomics. However, DNNs are computationally expensive, as they are carefully trained to extract and abstract features from raw data using multiple layers of neurons with millions of parameters. In this dissertation, we primarily focus on inference, e.g., using a DNN to classify an input image. This is an operation that will be repeatedly performed on billions of devices in the datacenter, in self-driving cars, in drones, etc. We observe that DNNs spend a vast majority of their runtime performing matrix-by-vector multiplications (MVM). MVMs have two major bottlenecks: fetching the matrix and performing sum-of-product operations. To address these bottlenecks, we use in-situ computing, where the matrix is stored in programmable resistor arrays, called crossbars, and sum-of-product operations are performed using analog computing. In this dissertation, we propose two hardware units, ISAAC and Newton. In ISAAC, we show that in-situ computing designs can outperform DNN digital accelerators if they leverage pipelining, smart encodings, and can distribute a computation in time and space, within crossbars and across crossbars. In the ISAAC design, roughly half the chip area/power can be attributed to the analog-to-digital conversion (ADC), i.e., it remains the key design challenge in mixed-signal accelerators for deep networks. In spite of the ADC bottleneck, ISAAC is able to out-perform the computational efficiency of the state-of-the-art design (DaDianNao) by 8x. In Newton, we take advantage of a number of techniques to address ADC inefficiency. These techniques exploit matrix transformations, heterogeneity, and smart mapping of computation to the analog substrate. We show that Newton can increase the efficiency of in-situ computing by an additional 2x. Finally, we show that in-situ computing, unfortunately, cannot be easily adapted to handle training of deep networks, i.e., it is only suitable for inference of already-trained networks. By improving the efficiency of DNN inference with ISAAC and Newton, we move closer to low-cost deep learning that in turn will have societal impact through self-driving cars, assistive systems for the disabled, and precision medicine.
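Not the ISAAC or Newton microarchitecture, but a behavioural sketch (with illustrative array size and ADC resolution) of why a resistive crossbar performs an analog matrix-by-vector multiply: cell conductances encode the weights, word-line voltages encode the input vector, the summed bit-line currents are the dot products, and an ADC quantizes the result:

```python
import numpy as np

def crossbar_mvm(weights, x, adc_bits=8):
    """Behavioural model of one analog crossbar matrix-by-vector multiply.
    weights -> cell conductances, x -> word-line voltages (both idealized)."""
    currents = weights.T @ x                 # Kirchhoff's law: each column sums currents
    # An ADC digitizes each column current with limited resolution.
    full_scale = max(np.abs(currents).max(), 1e-12)
    levels = 2 ** adc_bits - 1
    quantized = np.round(currents / full_scale * levels) / levels * full_scale
    return quantized

# Illustrative use: a 128x128 crossbar computing one layer's partial MVM.
W = np.random.rand(128, 128)
x = np.random.rand(128)
y = crossbar_mvm(W, x)
```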
Semantic-Preserving Transformations for Stream Program Orchestration on Multicore Architectures
Because the demand for high performance with big data processing and distributed computing is increasing, the stream programming paradigm has been revisited for its abundance of parallelism, by virtue of independent actors that communicate via data channels. The synchronous data-flow (SDF) programming model is frequently adopted with stream programming languages for its convenience in expressing stream programs as a set of nodes connected by data channels. The static data rates of the SDF programming model enable program transformations that greatly improve the performance of SDF programs on multicore architectures. The major application domains for SDF programs are digital signal processing, audio, video, graphics kernels, networking, and security. This thesis makes the following three contributions that improve the performance of SDF programs. First, a new intermediate representation (IR) called LaminarIR is introduced. LaminarIR replaces FIFO queues with direct memory accesses to reduce the data communication overhead and explicates data dependencies between producer and consumer nodes. We provide transformations and their formal semantics to convert conventional, FIFO-queue-based program representations to LaminarIR. Second, a compiler framework is presented to perform sound, semantics-preserving program transformations from FIFO semantics to LaminarIR. We employ static program analysis to resolve token positions in FIFO queues and replace them with direct memory accesses. Third, a communication-cost-aware program orchestration method is introduced to establish a foundation for LaminarIR parallelization on multicore architectures. The LaminarIR framework, which consists of the aforementioned contributions together with the benchmarks used in the experimental evaluation, has been open-sourced to encourage further research on improving the performance of stream programming languages.
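A toy illustration of the transformation the thesis describes, not the LaminarIR representation itself: a two-actor pipeline first communicates through an explicit FIFO queue, and then, because static data rates make token positions known ahead of time, the queue is collapsed into a direct access to a named location. Actor behaviour and names are illustrative assumptions:

```python
from collections import deque

# FIFO-queue style: the producer pushes tokens, the consumer pops them.
def run_with_fifo(samples):
    fifo = deque()
    out = []
    for s in samples:
        fifo.append(s * 2.0)                  # producer actor: scale
        out.append(fifo.popleft() + 1.0)      # consumer actor: offset
    return out

# "Direct access" style: with static rates (1 token in, 1 token out per firing),
# the token's position is known, so the queue collapses to a named location.
def run_with_direct_access(samples):
    out = []
    for s in samples:
        token = s * 2.0                       # producer writes directly to a scalar slot
        out.append(token + 1.0)               # consumer reads the same slot, no queue ops
    return out

assert run_with_fifo([1.0, 2.0]) == run_with_direct_access([1.0, 2.0])
```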
Parallel Markov Chain Monte Carlo
The increasing availability of multi-core and multi-processor architectures provides
new opportunities for improving the performance of many computer simulations.
Markov Chain Monte Carlo (MCMC) simulations are widely used for approximate
counting problems, Bayesian inference and as a means for estimating very high-dimensional
integrals. As such MCMC has found a wide variety of applications in
fields including computational biology and physics, financial econometrics, machine
learning and image processing.
This thesis presents a number of new methods for reducing the runtime of
Markov Chain Monte Carlo simulations by using SMP machines and/or clusters.
Two of the methods speculatively perform iterations in parallel, reducing the runtime
of MCMC programs whilst producing statistically identical results to conventional
sequential implementations. The other methods apply only to problem domains
that can be presented as an image, and involve using various means of dividing
the image into subimages that can be processed with some degree of independence.
Where possible the thesis includes a theoretical analysis of the reduction in
runtime that may be achieved using our technique under perfect conditions, and
in all cases the methods are tested and compared on a selection of multi-core and
multi-processor architectures. A framework is provided to allow easy construction
of MCMC applications that implement these parallelisation methods.
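The thesis's speculative methods are not reproduced here, but the sketch below conveys the general idea for random-walk Metropolis: several proposals drawn from the current state are evaluated in parallel, and the precomputed evaluations remain valid for as long as proposals keep being rejected, so the chain is statistically identical to a sequential one. The toy target, step size, and degree of speculation are illustrative assumptions:

```python
import numpy as np
from multiprocessing import Pool

def log_post(x):
    """Toy target: standard normal log-density (stand-in for an expensive model)."""
    return -0.5 * x * x

def speculative_mh(n_iters, k_spec=4, step=0.5, seed=0):
    """Random-walk Metropolis where k_spec proposals from the current state are
    evaluated in parallel; while proposals are rejected the precomputed
    evaluations stay valid, so results match a sequential implementation."""
    rng = np.random.default_rng(seed)
    x, lp_x = 0.0, log_post(0.0)
    samples = []
    with Pool() as pool:
        while len(samples) < n_iters:
            proposals = x + step * rng.standard_normal(k_spec)
            lp_props = pool.map(log_post, proposals)   # speculative parallel work
            for y, lp_y in zip(proposals, lp_props):
                if len(samples) == n_iters:
                    break
                if np.log(rng.uniform()) < lp_y - lp_x:
                    x, lp_x = y, lp_y                  # accepted: discard the rest
                    samples.append(x)
                    break
                samples.append(x)                      # rejected: reuse next proposal
    return np.array(samples)

if __name__ == "__main__":
    chain = speculative_mh(10000)
    print(chain.mean(), chain.std())
```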
GPGPU application in fusion science
GPGPUs have firmly earned their reputation in HPC (High Performance Computing) as hardware for massively parallel computation. However, their application in fusion science is quite marginal and not considered a mainstream approach to numerical problems. Computing power has increased immensely over the last decade and continues to grow. GPGPU boards were long an alternative and exotic approach to problem solving and scientific programming, cultivated only by enthusiasts and specialized programmers. Today, about 10 years after the first fully programmable GPUs appeared on the market, and owing to the exponential growth in processing power over those years, GPGPUs are no longer the alternative choice but have become the main choice for solving big problems. Originally developed for, and dominant in, fields such as image and media processing, image rendering, video encoding/decoding, image scaling, stereo vision and pattern recognition, GPGPUs are
also becoming mainstream computation platforms in scientific fields such as signal processing, physics, finance and biology.
This PhD contains solutions and approaches to two problems relevant to fusion and plasma science using GPGPU processing. The first problem belongs to the realms of plasma and accelerator physics. I will present a number of plasma simulations built on the PIC (Particle-In-Cell) method, such as a plasma sheath simulation, an electron beam simulation, a negative ion beam simulation and a space charge compensation simulation. The second problem belongs to the realms of tomography and real-time control. I will present
a number of simulated tomographic plasma reconstructions of the Fourier-Bessel type and their analysis, all in a real-time-oriented approach, i.e. the GPGPU-based implementations are integrated into the MARTe environment.
MARTe is a framework for real-time applications developed at JET (Joint European Torus) and used in several European fusion labs.
These two sets of problems span the full spectrum of GPGPU operating modes. PIC-based problems are large, complex simulations operated as batch processes, which have no time constraint
and operate on huge amounts of memory, while the tomographic plasma reconstructions are online (real-time) processes, which have strict latency/time constraints dictated by the time scales of real-time
control and operate on relatively small amounts of memory. Such a variety of problems covers a very broad range of disciplines and fields of science, such as plasma physics, NBI (Neutral Beam Injector) physics, tokamak physics, parallel computing, iterative/direct matrix solvers, the PIC method, tomography and so on. The thesis also includes an extended performance analysis of Nvidia GPU cards with regard to their applicability to real-time control and real-time performance.
In order to approach the aforementioned problems, I, as a PhD candidate, had to gain knowledge in the relevant fields and build a broad range of practical skills, including parallel/sequential CPU programming,
GPU programming, MARTe programming, MATLAB programming, IDL programming and Python programming.
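The thesis's simulations are not reproduced here; the following is a minimal, illustrative NumPy sketch of one electrostatic 1-D PIC cycle (charge deposition, field solve, particle push on a periodic grid), whose per-particle and per-cell loops are what a GPGPU implementation parallelizes. Grid size, normalization and all names are assumptions:

```python
import numpy as np

def pic_step(x, v, q_over_m, grid_n, length, dt):
    """One electrostatic 1-D PIC cycle: deposit charge, solve the field on the
    grid, gather the field back to particles, and push them (periodic domain)."""
    dx = length / grid_n
    # 1) Deposit particle charge onto the grid (nearest-grid-point weighting).
    cells = (x / dx).astype(int) % grid_n
    rho = np.bincount(cells, minlength=grid_n).astype(float)
    rho -= rho.mean()                          # neutralizing background
    # 2) Solve Poisson's equation with an FFT, then E = -dphi/dx.
    k = 2 * np.pi * np.fft.fftfreq(grid_n, d=dx)
    k[0] = 1.0                                 # avoid division by zero for the mean mode
    phi_hat = np.fft.fft(rho) / k**2
    phi_hat[0] = 0.0
    E = -np.real(np.fft.ifft(1j * k * phi_hat))
    # 3) Gather the field at particle cells and advance the particles.
    v = v + q_over_m * E[cells] * dt
    x = (x + v * dt) % length
    return x, v
```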
- …