9 research outputs found
CoreTSAR: Task Scheduling for Accelerator-aware Runtimes
Heterogeneous supercomputers that incorporate computational accelerators
such as GPUs are increasingly popular due to their high
peak performance, energy efficiency and comparatively low cost.
Unfortunately, the programming models and frameworks designed
to extract performance from all computational units still lack the
flexibility of their CPU-only counterparts. Accelerated OpenMP
improves this situation by supporting natural migration of OpenMP
code from CPUs to a GPU. However, these implementations currently
lose one of OpenMP’s best features, its flexibility: typical
OpenMP applications can run on any number of CPUs. GPU implementations
do not transparently employ multiple GPUs on a node
or a mix of GPUs and CPUs. To address these shortcomings, we
present CoreTSAR, our runtime library for dynamically scheduling
tasks across heterogeneous resources, and propose straightforward
extensions that incorporate this functionality into Accelerated
OpenMP. We show that our approach can provide nearly linear
speedup to four GPUs over only using CPUs or one GPU while
increasing the overall flexibility of Accelerated OpenMP
The Design and Implementation of a Bytecode for Optimization on Heterogeneous Systems
As hardware architectures shift towards more heterogeneous platforms with different vari- eties of multi- and many-core processors and graphics processing units (GPUs) by various manufacturers, programmers need a way to write simple and highly optimized code without worrying about the specifics of the underlying hardware. To meet this need, I have designed a virtual machine and bytecode around the goal of optimized execution on highly variable, heterogeneous hardware, instead of having goals such as small bytecodes as was the ob- jective of the Java R Virtual Machine. The approach used here is to combine elements of the Dalvik R virtual machine with concepts from the OpenCL R heterogeneous computing platform, along with an annotation system so that the results of complex compile time analysis can be available to the Just-In-Time compiler. The annotation format is flexible so that the set of annotations can be expanded as the field of heterogeneous computing continues to grow. An initial implementation of this virtual machine was written in the Scala programming language and makes use of the Java bindings for OpenCL to execute code segments on a GPU. The implementation consists of an assembler that converts an assembly version of the bytecode into its binary representation and an interpreter that runs programs from the assembled binary. Because the bytecode contains valuable optimization information, decisions can be made at runtime to choose how best to execute code segments. To demonstrate this concept, the interpreter uses this information to produce OpenCL ker- nel code for specified bytecode blocks and then builds and executes these kernels to improve performance. This hybrid interpreter/Just-In-Time compiler serves as an initial implemen- tation of a virtual machine that provides optimized code tailored to the available hardware on which the application is running
Dynamic Hardware Resource Management for Efficient Throughput Processing.
High performance computing is evolving at a rapid pace, with throughput oriented processors such as graphics processing units (GPUs), substituting for traditional processors as the computational workhorse. Their adoption has seen a tremendous increase as they provide high peak performance and energy efficiency while maintaining a friendly programming interface. Furthermore, many existing desktop, laptop, tablet, and smartphone systems support accelerating non-graphics, data parallel workloads on their GPUs. However, the
multitude of systems that use GPUs as an accelerator run different genres of data parallel applications, which have significantly contrasting runtime characteristics.
GPUs use thousands of identical threads to efficiently exploit the on-chip hardware resources. Therefore, if one thread uses a resource (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource. This contention will eventually saturate the performance of the GPU due to contention for the bottleneck resource,leaving other resources underutilized at the same time. Traditional policies of managing the massive hardware resources work adequately, on well designed traditional scientific
style applications. However, these static policies, which are oblivious to the application’s resource requirement, are not efficient for the large spectrum of data parallel workloads with varying resource requirements. Therefore, several standard hardware policies such as using maximum concurrency, fixed operational frequency and round-robin style scheduling are not efficient for modern GPU applications.
This thesis defines dynamic hardware resource management mechanisms which improve the efficiency of the GPU by regulating the hardware resources at runtime. The first step in successfully achieving this goal is to make the hardware aware of the application’s characteristics at runtime through novel counters and indicators. After this detection, dynamic hardware modulation provides opportunities for increased performance, improved energy consumption, or both, leading to efficient execution. The key mechanisms for modulating the hardware at runtime are dynamic frequency regulation, managing the amount of
concurrency, managing the order of execution among different threads and increasing cache utilization. The resultant increased efficiency will lead to improved energy consumption of the systems that utilize GPUs while maintaining or improving their performance.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113356/1/asethia_1.pd
Improving Programming Support for Hardware Accelerators Through Automata Processing Abstractions
The adoption of hardware accelerators, such as Field-Programmable Gate Arrays,
into general-purpose computation pipelines continues to rise, driven by recent
trends in data collection and analysis as well as pressure from challenging
physical design constraints in hardware. The architectural designs of many of
these accelerators stand in stark contrast to the traditional von Neumann model
of CPUs. Consequently, existing programming languages, maintenance tools, and
techniques are not directly applicable to these devices, meaning that additional
architectural knowledge is required for effective programming and configuration.
Current programming models and techniques are akin to assembly-level programming
on a CPU, thus placing significant burden on developers tasked with using these
architectures. Because programming is currently performed at such low levels of
abstraction, the software development process is tedious and challenging and
hinders the adoption of hardware accelerators.
This dissertation explores the thesis that theoretical finite automata provide a
suitable abstraction for bridging the gap between high-level programming models
and maintenance tools familiar to developers and the low-level hardware
representations that enable high-performance execution on hardware accelerators.
We adopt a principled hardware/software co-design methodology to develop a
programming model providing the key properties that we observe are necessary for success,
namely performance and scalability, ease of use, expressive power, and legacy
support.
First, we develop a framework that allows developers to port existing, legacy
code to run on hardware accelerators by leveraging automata learning algorithms
in a novel composition with software verification, string solvers, and
high-performance automata architectures. Next, we design a domain-specific
programming language to aid programmers writing pattern-searching algorithms and
develop compilation algorithms to produce finite automata, which supports
efficient execution on a wide variety of processing architectures. Then, we
develop an interactive debugger for our new language, which allows developers to
accurately identify the locations of bugs in software while maintaining support
for high-throughput data processing. Finally, we develop two new
automata-derived accelerator architectures to support additional applications,
including the detection of security attacks and the parsing of recursive and
tree-structured data. Using empirical studies, logical reasoning, and
statistical analyses, we demonstrate that our prototype artifacts scale to
real-world applications, maintain manageable overheads, and support developers'
use of hardware accelerators. Collectively, the research efforts detailed in
this dissertation help ease the adoption and use of hardware accelerators for
data analysis applications, while supporting high-performance computation.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/155224/1/angstadt_1.pd
A domain-extensible compiler with controllable automation of optimisations
In high performance domains like image processing, physics simulation or machine learning, program performance is critical. Programmers called performance engineers are responsible for the challenging task of optimising programs. Two major challenges prevent modern compilers targeting heterogeneous architectures from reliably automating optimisation. First, domain specific compilers such as Halide for image processing and TVM for machine learning are difficult to extend with the new optimisations required by new algorithms and hardware. Second, automatic optimisation is often unable to achieve the required performance, and performance engineers often fall back to painstaking manual optimisation.
This thesis shows the potential of the Shine compiler to achieve domain-extensibility, controllable automation, and generate high performance code. Domain-extensibility facilitates adapting compilers to new algorithms and hardware. Controllable automation enables performance engineers to gradually take control of the optimisation process. The first research contribution is to add 3 code generation features to Shine, namely: synchronisation barrier insertion, kernel execution, and storage folding. Adding these features requires making novel design choices in terms of compiler extensibility and controllability. The rest of this thesis builds on these features to generate code with competitive runtime compared to established domain-specific compilers. The second research contribution is to demonstrate how extensibility and controllability are exploited to optimise a standard image processing pipeline for corner detection. Shine achieves 6 well-known image processing optimisations, 2 of them not being supported by Halide. Our results on 4 ARM multi-core CPUs show that the code generated by Shine for corner detection runs up to 1.4× faster than the Halide code. However, we observe that controlling rewriting is tedious, motivating the need for more automation.
The final research contribution is to introduce sketch-guided equality saturation, a semiautomated technique that allows performance engineers to guide program rewriting by specifying rewrite goals as sketches: program patterns that leave details unspecified. We evaluate this approach by applying 7 realistic optimisations of matrix multiplication. Without guidance, the compiler fails to apply the 5 most complex optimisations even given an hour and 60GB of RAM. With the guidance of at most 3 sketch guides, each 10 times smaller than the complete program, the compiler applies the optimisations in seconds using less than 1GB
Parallel and Distributed Execution of Model Management Programs
The engineering process of complex systems involves many stakeholders and development artefacts. Model-Driven Engineering (MDE) is an approach to development which aims to help curtail and better manage this complexity by raising the level of abstraction. In MDE, models are first-class artefacts in the development process. Such models can be used to describe artefacts of arbitrary complexity at various levels of abstraction according to the requirements of their prospective stakeholders. These models come in various sizes and formats and can be thought of more broadly as structured data. Since models are the primary artefacts in MDE, and the goal is to enhance the efficiency of the development process, powerful tools are required to work with such models at an appropriate level of abstraction. Model management tasks – such as querying, validation, comparison, transformation and text generation – are often performed using dedicated languages, with declarative constructs used to improve expressiveness. Despite their semantically constrained nature, the execution engines of these languages rarely capitalize on the optimization opportunities afforded to them. Therefore, working with very large models often leads to poor performance when using MDE tools compared to general-purpose programming languages, which has a detrimental effect on productivity. Given the stagnant single-threaded performance of modern CPUs along with the ubiquity of distributed computing, parallelization of these model management program is a necessity to address some of the scalability concerns surrounding MDE. This thesis demonstrates efficient parallel and distributed execution algorithms for model validation, querying and text generation and evaluates their effectiveness. By fully utilizing the CPUs on 26 hexa-core systems, we were able to improve performance of a complex model validation language by 122x compared to its existing sequential implementation. Up to 11x speedup was achieved with 16 cores for model query and model-to-text transformation tasks
Platform as a service integration for scientific computing using DIRAC
Cada dÃa crece máis a demanda de recursos de computación requirida polos investigadores,
capacidades de cálculo que coexisten co crecente volume de datos xerado actualmente. Estes
investigadores están a demandar un servizo de Computación de Altas Prestacións (HPC) que
permita a execución das suas simulacións dunha forma na que se deslocalicen os recursos para
poder acceder aos máximos posibles, facilitandoo coa forma o máis cómoda e segura para eles.
Doutra banda, as universidades están conectadas con centros de investigación con redes que
pusuen unha velocidade e fiabilidade que posibilitan a execución de traballos de cálculo
cientÃfico. As capacidades de computo existentes en universidades van dende aulas informáticas
para usos docentes, laboratorios, etc., ata clusters de ordenadores pertencentes a grupos de
investigación. Usando tecnoloxÃas grid e cloud estes recursos computacionais heteroxéneos
poderÃan ser reutilizados polos investigadores para realizar simulacións, aportando unha maior
cantidade de cómputo a xa existente e deslocalizando os recursos entre distintos lugares ao
redor do planeta. O obxectivo desta tese é adaptar a contorna para computación distribuÃda
DIRAC, desenvolvida para o proxecto LHCb do CERN, para o seu uso por varias comunidades de
usuarios baseado nas tecnoloxÃas cloud e big data. Esta contorna pusuirÃa repositorios de
software centralizados que permitan proveer o software necesario para que a través dos
entornos na nube se poidan executar as aplicacións dos investigadores en calquera parte do
planeta dunha forma escalable, permitindo aprobeitar tanto recursos dedicados como nondedicados.
Avaliando asà a execución desta plataforma para a realización de cálculos cientÃficos.
Este traballo comezará coa obtención de requisitos, para pasar despois ao proceso de
integración básica. Posteriormente, optimizarase o uso do software cientifico empregado para as
contornas cloud, tratando de adaptalo aos entornos virtualizados. Para iso, será necesario
realizar un estudo estadÃstico que sexa o máis próximo posible aos entornos en producción para
poder determinar e crear as infraestructuras adaptadas evitando asà a perda de rendemento
dentro de recursos. O seguinte caso serÃa utilizar as tecnoloxÃas virtualizadas, adaptando as
arquitecturas creadas, para a creación de sistemas que permitan o envÃo de traballos que
requiran de grandes cantidades de datos no eido do big data dunha forma distribuida