11 research outputs found
Enabling Cross-Event Optimization in Discrete-Event Simulation Through Compile-Time Event Batching
A discrete-event simulation (DES) involves the execution of a sequence of
event handlers dynamically scheduled at runtime. As a consequence, a priori
knowledge of the control flow of the overall simulation program is limited. In
particular, powerful optimizations supported by modern compilers can only be
applied on the scope of individual event handlers, which frequently involve
only a few lines of code. We propose a method that extends the scope for
compiler optimizations in discrete-event simulations by generating batches of
multiple events that are subjected to compiler optimizations as contiguous
procedures. A runtime mechanism executes suitable batches at negligible
overhead. Our method does not require any compiler extensions and introduces
only minor additional effort during model development. The feasibility and
potential performance gains of the approach are illustrated on the example of
an idealized proof-of-concept model. We believe that the applicability of the
approach extends to general event-driven programs.
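The batching idea in the abstract above can be pictured with a small sketch. It is not the authors' implementation: the queue model, the handler names and the `BATCHES` table are hypothetical, the fused handler is simplified by hand rather than by a compiler, and the handlers never schedule new events (a real simulator would have to check that no newly scheduled event falls between the two batched ones). Python is used only to show the dispatch structure, so it does not demonstrate the compiler-level gains.

```python
import heapq

# Individual event handlers (hypothetical single-server queue model).
def arrive(state):
    state["arrivals"] += 1
    state["queue_len"] += 1

def depart(state):
    state["queue_len"] -= 1
    state["served"] += 1

# Pre-generated batch: the body of arrive followed by depart, simplified by
# hand the way an optimizer could (the +1/-1 on queue_len cancels out).
def arrive_then_depart(state):
    state["arrivals"] += 1
    state["served"] += 1

HANDLERS = {"arrive": arrive, "depart": depart}
BATCHES = {("arrive", "depart"): arrive_then_depart}

def run(events, use_batches=True):
    """Execute (time, kind) events in timestamp order; return (state, batches_used)."""
    state = {"arrivals": 0, "served": 0, "queue_len": 0}
    # The sequence number breaks ties between events with equal timestamps.
    heap = [(t, seq, kind) for seq, (t, kind) in enumerate(events)]
    heapq.heapify(heap)
    batches_used = 0
    while heap:
        _, _, kind = heapq.heappop(heap)
        if use_batches and heap:
            # Peek at the next event: does (current, next) match a known batch?
            fused = BATCHES.get((kind, heap[0][2]))
            if fused is not None:
                heapq.heappop(heap)  # consume the second event of the batch
                fused(state)
                batches_used += 1
                continue
        HANDLERS[kind](state)
    return state, batches_used

events = [(1.0, "arrive"), (2.0, "depart"), (3.0, "arrive"),
          (4.0, "arrive"), (5.0, "depart")]
batched_state, n_batches = run(events)
plain_state, _ = run(events, use_batches=False)
print(batched_state == plain_state, n_batches)
```

The batched and unbatched runs are expected to end in the same state; the only difference is that some pairs of events are executed as one contiguous procedure.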
Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM
Real-time dense computer vision and SLAM offer great potential for a new
level of scene modelling, tracking and real environmental interaction for many
types of robot, but their high computational requirements mean that use on mass
market embedded platforms is challenging. Meanwhile, trends in low-cost,
low-power processing are towards massive parallelism and heterogeneity, making
it difficult for robotics and vision researchers to implement their algorithms
in a performance-portable way. In this paper we introduce SLAMBench, a
publicly-available software framework which represents a starting point for
quantitative, comparable and validatable experimental research to investigate
trade-offs in performance, accuracy and energy consumption of a dense RGB-D
SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP,
OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D
sequences with trajectory and scene ground truth for reliable accuracy
comparison of different implementations and algorithms. We present an analysis
and breakdown of the constituent algorithmic elements of KinectFusion, and
experimentally investigate their execution time on a variety of multicore and
GPU-accelerated platforms. For a popular embedded platform, we also present an
analysis of energy efficiency for different configuration alternatives.
Comment: 8 pages, ICRA 2015 conference paper
Algorithmic Performance-Accuracy Trade-off in 3D Vision Applications Using HyperMapper
In this paper we investigate an emerging application, 3D scene understanding,
likely to be significant in the mobile space in the near future. The goal of
this exploration is to reduce execution time while meeting our quality of
result objectives. In previous work we showed for the first time that it is
possible to map this application to power constrained embedded systems,
highlighting that decision choices made at the algorithmic design-level have
the most impact.
As the algorithmic design space is too large to be exhaustively evaluated, we
use HyperMapper, a previously introduced multi-objective Random Forest Active
Learning prediction framework, to find good algorithmic designs. We
show that HyperMapper generalizes to a recent cutting-edge 3D scene
understanding algorithm and to a modern GPU-based computer architecture.
HyperMapper automatically beats an expert human hand-tuning the algorithmic
parameters of the class of computer vision applications considered in this
paper. In addition, we use crowd-sourcing via a 3D scene understanding Android
app to show that the Pareto front obtained on an embedded system can be used to
accelerate the same application on all 83 crowd-sourced smartphones and
tablets, with speedups ranging from 2 to over 12.
Comment: 10 pages, Keywords: design space exploration, machine learning,
computer vision, SLAM, embedded systems, GPU, crowd-sourcing
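The Pareto front mentioned in the abstract above can be illustrated with a minimal sketch. It covers only the notion of non-dominated configurations, not HyperMapper's Random Forest active learning; the function name `pareto_front` and the sample (runtime, error) pairs are hypothetical and not taken from the paper.

```python
def pareto_front(points):
    """Return the non-dominated (runtime, error) pairs; lower is better for both."""
    front = []
    # After sorting by runtime (then error), a point is non-dominated exactly
    # when its error is strictly lower than that of every faster point kept.
    for runtime, error in sorted(points):
        if not front or error < front[-1][1]:
            front.append((runtime, error))
    return front

# Hypothetical measurements: (seconds per frame, trajectory error in metres).
configs = [(0.9, 0.05), (0.5, 0.08), (0.7, 0.06),
           (0.6, 0.09), (0.4, 0.20), (1.0, 0.05)]
print(pareto_front(configs))
```

Every configuration left out of the result is matched or beaten on both objectives by some configuration on the front, so only the points on the front represent meaningful trade-offs between execution time and accuracy.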
IR2Vec: LLVM IR based Scalable Program Embeddings
We propose IR2Vec, a Concise and Scalable encoding infrastructure to
represent programs as a distributed embedding in continuous space. This
distributed embedding is obtained by combining representation learning methods
with flow information to capture the syntax as well as the semantics of the
input programs. As our infrastructure is based on the Intermediate
Representation (IR) of the source code, obtained embeddings are both language
and machine independent. The entities of the IR are modeled as relationships,
and their representations are learned to form a seed embedding vocabulary.
Using this infrastructure, we propose two incremental encodings: Symbolic and
Flow-Aware. Symbolic encodings are obtained from the seed embedding vocabulary,
and Flow-Aware encodings are obtained by augmenting the Symbolic encodings with
the flow information.
We show the effectiveness of our methodology on two optimization tasks
(Heterogeneous device mapping and Thread coarsening). Our way of representing
the programs enables us to use non-sequential models, resulting in
orders-of-magnitude faster training. Both encodings generated by IR2Vec
outperform the existing methods in both tasks, even while using simple
machine learning models. In particular, our results improve or match the
state-of-the-art speedup in 11/14 benchmark-suites in the device mapping task
across two platforms and 53/68 benchmarks in the Thread coarsening task across
four different platforms. When compared to the other methods, our embeddings
are more scalable, are not data-hungry, and have better Out-Of-Vocabulary (OOV)
characteristics.
Comment: Accepted in ACM TACO
Efficient execution of Java programs on GPU
Master's dissertation in Informatics Engineering
With the overwhelming increase in demand for computational power from fields such as Big
Data, Deep Machine Learning and Image Processing, Graphics Processing Units (GPUs)
have been seen as a valuable tool to compute the main workload involved. Nonetheless,
these solutions have limited support for object-oriented languages that often require manual
memory handling, which is an obstacle to bringing together the large community of object-oriented programmers and the high-performance computing field.
In this master's thesis, different memory optimizations and their impacts were studied
in a GPU Java context using Aparapi. These include solutions for different identifiable
bottlenecks of commonly used kernels, exploiting the GPU's full capabilities by studying the
hardware and the techniques currently available. The results were set against commonly used
C/OpenCL benchmarks and their respective optimizations, proving that high-level languages can
be a solution to high-performance software demand.
With the increase in computational power demanded by fields such as Big Data, Deep Machine Learning and Image Processing, graphics processing units (GPUs) have been seen as a valuable tool to execute the main workload involved. However, this solution has limited support for object-oriented languages. These often require manual memory handling, which is an obstacle to bringing together the large community of object-oriented programmers and the field of high-performance computing. In this master's dissertation, different memory optimizations and their impacts were studied using Aparapi. The optimizations studied aim to resolve identifiable bottlenecks in frequently used kernels. The results obtained were compared with popular C/OpenCL benchmarks and their respective optimizations, proving that high-level languages can be a solution for programs that require high-performance computing.
LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp’s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.
Toward performance portability for CPUs and GPUs through algorithmic compositions
The diversity of microarchitecture designs in heterogeneous computing systems allows programs to achieve high performance and energy efficiency, but results in substantial software redevelopment cost for each type or generation of hardware. To mitigate this cost, a performance portable programming system is required.
This work presents my solution to the performance portability problem. I argue that a new language is required to replace the current practices of programming systems and achieve practical performance portability. To support my argument, I first demonstrate the limited performance portability of the current practices by showing quantitative and qualitative evidence. I identify the main limiting issues of conventional programming languages. To overcome these issues, I propose a new modular, composition-based programming language that can effectively express an algorithmic design space with functional polymorphism, and a compiler that can effectively explore the design space and facilitate many high-level optimization techniques. This proposed approach achieves no less than 70% of the performance of highly optimized vendor libraries such as Intel MKL and NVIDIA CUBLAS/CUSPARSE on an Intel i7-3820 Sandy Bridge CPU, an NVIDIA C2050 Fermi GPU, and an NVIDIA K20c Kepler GPU.
Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures
The rising pressure to simultaneously improve performance and reduce power consumption is driving more heterogeneity into all aspects of computing devices.
However, wide adoption of specialized computing devices such as GPUs and Xeon Phis comes with a programming challenge. A carefully optimized program that is well matched to the target hardware can run many times faster and more energy efficiently than one that is not.
Ideally, programmers should write their code using a single programming model, and the compiler would transform the program to run optimally on the target architecture.
In practice, however, programmers have to expend great effort to translate performance enjoyed on one platform to another.
As such, single-source code-based portability has gained substantial momentum and OpenCL, a bulk-synchronous programming language, has become a popular choice, among others, to fulfill the need for portability.
The assumed computing model of these languages is inevitably loosely coupled with an underlying architecture, obligating a combined compiler and runtime to find an efficient execution mapping from the input program onto the architecture which best exploits the hardware for performance.
In this dissertation, I argue and demonstrate that obtaining high performance from executing OpenCL programs on CPU is feasible. In order to achieve the goal, I present compiler and runtime techniques to execute OpenCL programs on CPU architectures.
First, I propose a compiler technique in which the execution of fine-grained parallel threads, called work-items, is collectively analyzed to consider the impact of scheduling them with respect to data locality.
By analyzing the memory addresses accessed in a kernel, the technique can make better decisions on how to schedule work-items to construct better memory access patterns, thereby improving performance.
The approach achieves geomean speedups of 3.32x over AMD's and 1.71x over Intel's state-of-the-art implementations on Parboil and Rodinia benchmarks.
Second, I propose a runtime that allows a compiler to deposit differently optimized kernels, mitigating the stress on the compiler in deriving optimal code.
The runtime systematically deploys candidate kernels on a small portion of the actual data to determine which achieves the best performance for the hardware-data combination.
It exploits the fact that OpenCL programs typically come with a large number of independent work-groups, a feature that amortizes the cost of profiling execution of a few work-items, while the overhead is further reduced by retaining the profiling execution result to constitute the final execution output.
The proposed runtime incurs an average execution-time overhead of 3% compared to an ideal/oracular runtime.
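The profiling-based kernel selection described in the abstract above might look roughly like the following sketch. It is an illustration under stated assumptions, not the dissertation's runtime: `run_with_selection`, `sample_size` and the two toy kernels are hypothetical, work-groups are modelled as plain integers, and the candidate kernels are assumed to be functionally equivalent so that the profiled results can be kept as final output.

```python
import time

def run_with_selection(candidates, work_groups, sample_size=2):
    """Profile each candidate on its own slice of work-groups, then run the
    fastest candidate on the remaining ones. Returns (results, best_kernel)."""
    groups = list(work_groups)
    results = {}
    timings = []
    cursor = 0
    for kernel in candidates:
        sample = groups[cursor:cursor + sample_size]
        cursor += len(sample)
        if not sample:
            break  # no work-groups left to profile on
        start = time.perf_counter()
        for g in sample:
            results[g] = kernel(g)  # profiling output is kept, not discarded
        elapsed = time.perf_counter() - start
        timings.append((elapsed / len(sample), kernel))
    if not timings:
        return results, None
    best = min(timings, key=lambda entry: entry[0])[1]
    for g in groups[cursor:]:
        results[g] = best(g)
    return results, best

# Two hypothetical, functionally equivalent "kernels".
def square_mul(g):
    return g * g

def square_pow(g):
    return g ** 2

results, best = run_with_selection([square_mul, square_pow], range(10))
print(sorted(results.items())[:3], best.__name__)
```

Because each candidate is profiled on a different slice of the input and its results are retained, the only extra cost is running the slower candidates on their small samples, which is amortized when the number of independent work-groups is large.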