Fake Run-Time Selection of Template Arguments in C++
C++ does not support run-time resolution of template type arguments. To
circumvent this restriction, we can instantiate a template for all possible
combinations of type arguments at compile time and then select the proper
instance at run time by evaluation of some provided conditions. However, for
templates with multiple type parameters, such a solution may easily result in
branching-code bloat. We present a template metaprogramming algorithm called
for_id that allows the user to select the proper template instance at run time
with the theoretical minimum sustained complexity of the branching code.
Comment: Objects, Models, Components, Patterns (50th International Conference, TOOLS 2012)
Whitepaper: The Value of Improving the Separation of Concerns
Microsoft's enterprise customers are demanding better ways to modularize their software systems. They look to the Java community, where these needs are being met with language enhancements, improved developer tools and middleware, and better runtime support. We present a business case for why Microsoft should give priority to supporting better modularization techniques, also known as advanced separation of concerns (ASOC), for the .NET platform, and we provide a roadmap for how to do so.
Development of a performance analysis environment for parallel pattern-based applications
One of the challenges that the scientific community faces nowadays concerns the parallel treatment of data. Every day we produce ever more overwhelming amounts of data, to the point where those data volumes grow exponentially and cannot be processed as we would like.
The importance of this data processing stems mainly from the need, in the scientific field, to discover new advances in science, or simply to devise new algorithms capable of solving ever more complex experiments whose resolution was unfeasible years ago with the resources then available.
As a consequence, changes appear in the internal architecture of modern computers in order to increase their computing capacity and thus cope with the need for massive data processing. Accordingly, the scientific community has implemented different pattern-based parallel programming frameworks to run such experiments faster and more efficiently. Unfortunately, using these programming paradigms is not a simple task, since it requires expertise and programming skills. This is further complicated when developers are unaware of the program's internal behaviour, which leads to unexpected results in certain parts of the code. Inevitably, the need arises to develop tools that help this community analyze the performance and results of their experiments.
Hence, this bachelor's thesis presents the development of a performance analysis environment for parallel applications as a solution to that problem. Specifically, this environment comprises two commonly used techniques, profiling and tracing, which have been added to the GrPPI framework. In this way, users can obtain a general assessment of their applications' performance and act according to the results obtained.
Ingeniería Informática (Plan 2011)
Performance engineering of data-intensive applications
Data-intensive programs deal with big chunks of data and often exhibit compute-intensive characteristics. Among various HPC application domains, big data analytics, machine learning and the more recent deep-learning models are well-known data-intensive applications. An efficient design of such applications demands extensive knowledge of the target hardware and software, particularly the memory/cache hierarchy and the data communication among threads/processes. Such a requirement makes code development an arduous task, as inappropriate data structures and algorithm design may result in excessive run times, let alone hardware incompatibilities while porting the code to other platforms.
In this dissertation, we introduce a set of tools and methods for the performance engineering of parallel data-intensive programs. We start with performance profiling to gain insights on thread communications and relevant code optimizations. Then, by narrowing down our scope to deep-learning applications, we introduce our tools for enhancing the performance portability and scalability of convolutional neural networks (ConvNet) at inference and training phases.
Our first contribution is a novel performance-profiling method to unveil potential communication bottlenecks caused by data-access patterns and thread interactions. Our findings show that data shared between a pair of threads should be reused within a reasonably short interval to preserve data locality, yet existing profilers neglect this aspect and mainly report communication volume. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations on a specific code region. Our experiments show that applying relevant optimizations improves the performance in Rodinia benchmarks by up to 56%.
For the next contribution, we developed a framework for automatic generation of efficient and performance-portable convolution kernels, including Winograd convolutions, for various GPU platforms. We employed a synergy of meta-programming, symbolic execution, and auto-tuning. The results demonstrate efficient kernels generated through an automated optimization pipeline with runtimes close to vendor deep-learning libraries, and the minimum required programming effort confirms the performance portability of our approach. Furthermore, our symbolic execution method exploits repetitive patterns in Winograd convolutions, enabling us to reduce the number of arithmetic operations by up to 62% without compromising the numerical stability.
Lastly, we investigate possible methods to scale the performance of ConvNets in the training and inference phases. Our specialized training platform, equipped with a novel topology-aware network pruning algorithm, enables rapid training, neural architecture search, and network compression. Thus, AI model training can easily be scaled to a multitude of compute nodes, leading to faster model design at lower operating cost. Furthermore, the network compression component scales a ConvNet model down by removing redundant layers, preparing the model for leaner deployment.
Altogether, this work demonstrates the necessity and shows the benefit of performance engineering and parallel programming methods in accelerating emerging data-intensive workloads. With the help of the proposed tools and techniques, we pinpoint data communication bottlenecks and achieve performance portability and scalability in data-intensive applications.
Debugging Woven Code
The ability to debug woven programs is critical to the adoption of Aspect Oriented Programming (AOP). Nevertheless, many AOP systems lack adequate support for debugging, making it difficult to diagnose faults and understand the program's structure and control flow. We discuss why debugging aspect behavior is hard and how harvesting results from related research on debugging optimized code can make the problem more tractable. We also specify general debugging criteria that we feel all AOP systems should support. We present a novel solution to the problem of debugging aspect-enabled programs. Our Wicca system is the first dynamic AOP system to support full source-level debugging of woven code. It introduces a new weaving strategy that combines source weaving with online byte-code patching. Changes to the aspect rules, or base or aspect source code are rewoven and recompiled on-the-fly. We present the results of an experiment that show how these features provide the programmer with a powerful interactive debugging experience with relatively little overhead
Generic Reloading for Languages Based on the Truffle Framework
Reloading running programs is a well-researched and increasingly popular feature of programming language implementations. There are plenty of proposed solutions for various existing programming languages, but typically the solutions target a specific language and are not reusable. In this thesis, we explored how the Truffle language implementation framework could aid language creators in adding reloading capabilities to their languages. We created a reusable reloading core that different Truffle-based languages can hook into to support dynamic updates with a minimum amount of effort on their part. We demonstrate how the Truffle implementations of Python, Ruby and JavaScript can be made reloadable with the developed solution. With Truffle's just-in-time compiler enabled, our solution incurs close to zero overhead on steady-state performance. This approach significantly reduces the effort required to add dynamic update support for existing and future languages.
From Valid Measurements to Valid Mini-Apps
In high-performance computing, performance analysis, tuning, and exploration are relevant throughout the life cycle of an application. State-of-the-art tools provide capable measurement infrastructure, but they lack automation of repetitive tasks, such as iterative measurement-overhead reduction, or tool support for challenging and time-consuming tasks, e.g., mini-app creation. In this thesis, we address this situation with (a) a comparative study on overheads introduced by different tools, (b) the tool PIRA for automatic instrumentation refinement, and (c) a tool-supported approach for mini-app extraction.
In particular, we present PIRA for automatic iterative performance measurement refinement. It performs whole-program analysis using both source-code and runtime information to heuristically determine where in the target application measurement hooks should be placed for a low-overhead assessment. At the moment, PIRA offers a runtime heuristic to identify compute-intensive parts, a performance-model heuristic to identify scalability limitations, and a load imbalance detection heuristic. In our experiments, compared to Score-P’s built-in filtering, PIRA significantly reduces the runtime overhead in 13 out of 15 benchmark cases and typically introduces a slowdown of < 10 %.
To provide PIRA with the required infrastructure, we develop MetaCG — an extendable lightweight whole-program call-graph library for C/C++. The library offers a compiler-agnostic call-graph (CG) representation, a Clang-based tool to construct a target’s CG, and a tool to validate the structure of the MetaCG. In addition to its use in PIRA, we show that whole-program CG analysis reduces the number of allocations to track by the memory tracking sanitizer TypeART by up to a factor of 2,350.
Finally, we combine the presented tools and develop a tool-supported approach to (a) identify, and (b) extract relevant application regions into representative mini-apps. To this end, we present a novel Clang-based source-to-source translator and a type-safe checkpoint-restart (CPR) interface as a common interface to existing MPI-parallel CPR libraries. We evaluate the approach by extracting a mini-app of only 1,100 lines of code from an application of 8.5 million lines of code. The mini-app is subsequently analyzed and maintains the significant characteristics of the original application’s behavior. It is then used for tool-supported parallelization, which led to a speed-up of 35 %.
The software presented in this thesis is available at https://github.com/tudasc
Programming Persistent Memory
Beginning and experienced programmers will use this comprehensive guide to persistent memory programming. You will understand how persistent memory brings together several new software/hardware requirements, and offers great promise for better performance and faster application startup times—a huge leap forward in byte-addressable capacity compared with current DRAM offerings. This revolutionary new technology gives applications significant performance and capacity improvements over existing technologies. It requires a new way of thinking and developing, which makes this highly disruptive to the IT/computing industry. The full spectrum of industry sectors that will benefit from this technology include, but are not limited to, in-memory and traditional databases, AI, analytics, HPC, virtualization, and big data. Programming Persistent Memory describes the technology and why it is exciting the industry. It covers the operating system and hardware requirements as well as how to create development environments using emulated or real persistent memory hardware. The book explains fundamental concepts; provides an introduction to persistent memory programming APIs for C, C++, JavaScript, and other languages; discusses RDMA with persistent memory; reviews security features; and presents many examples. Source code and examples that you can run on your own systems are included.
What You’ll Learn
- Understand what persistent memory is, what it does, and the value it brings to the industry
- Become familiar with the operating system and hardware requirements to use persistent memory
- Know the fundamentals of persistent memory programming: why it is different from current programming methods, and what developers need to keep in mind when programming for persistence
- Look at persistent memory application development by example using the Persistent Memory Development Kit (PMDK)
- Design and optimize data structures for persistent memory
- Study how real-world applications are modified to leverage persistent memory
- Utilize the tools available for persistent memory programming, application performance profiling, and debugging

Who This Book Is For
C, C++, Java, and Python developers, but also software, cloud, and hardware architects across a broad spectrum of sectors, including cloud service providers, independent software vendors, high performance compute, artificial intelligence, data analytics, big data, etc.