15 research outputs found
HaTS: Hardware-Assisted Transaction Scheduler
In this paper we present HaTS, a Hardware-assisted Transaction Scheduler. HaTS improves performance of concurrent applications by classifying the executions of their atomic blocks (or in-memory transactions) into scheduling queues, according to their so called conflict indicators. The goal is to group those transactions that are conflicting while letting non-conflicting transactions proceed in parallel. Two core innovations characterize HaTS. First, HaTS does not assume the availability of precise information associated with incoming transactions in order to proceed with the classification. It relaxes this assumption by exploiting the inherent conflict resolution provided by Hardware Transactional Memory (HTM). Second, HaTS dynamically adjusts the number of the scheduling queues in order to capture the actual application contention level. Performance results using the STAMP benchmark suite show up to 2x improvement over state-of-the-art HTM-based scheduling techniques
Design of an Efficient Interconnection Network of Temperature Sensors
Temperature has become a first class design constraint because high temperatures adversely affect circuit reliability, static power and degrade the performance. In this scenario, thermal characterization of ICs and on-chip temperature monitoring represent fundamental tasks in electronic design. In this work, we analyze the features that an interconnection network of temperature sensors must fulfill. Departing from the network topology, we continue with the proposal of a very light-weight network architecture based on digitalization resource sharing. Our proposal supposes a 16% improvement in area and power consumption compared to traditional approache
On-chip Monitoring: A Light-Weight Interconnection Network Approach
Current nanometer technologies are subjected to several adverse effects that seriously impact the yield and performance of integrated circuits. Such is the case of within-die parameters uncertainties, varying workload conditions, aging, temperature, etc. Monitoring, calibration and dynamic adaptation have appeared as promising solutions to these issues and many kinds of monitors have been presented recently. In this scenario, where systems with hundreds of monitors of different types have been proposed, the need for light-weight monitoring networks has become essential. In this work we present a light-weight network architecture based on digitization resource sharing of nodes that require a time-to-digital conversion. Our proposal employs a single wire interface, shared among all the nodes in the network, and quantizes the time domain to perform the access multiplexing and transmit the information. It supposes a 16% improvement in area and power consumption compared to traditional approaches
Efficient cross-architecture hardware virtualisation
Hardware virtualisation is the provision of an isolated virtual environment that
represents real physical hardware. It enables operating systems, or other system-level
software (the guest), to run unmodified in a “container” (the virtual machine)
that is isolated from the real machine (the host).
There are many use-cases for hardware virtualisation that span a wide-range
of end-users. For example, home-users wanting to run multiple operating systems
side-by-side (such as running a Windows® operating system inside an OS
X environment) will use virtualisation to accomplish this. In research and development
environments, developers building experimental software and hardware
want to prototype their designs quickly, and so will virtualise the platform
they are targeting to isolate it from their development workstation. Large-scale
computing environments employ virtualisation to consolidate hardware, enforce
application isolation, migrate existing servers or provision new servers.
However, the majority of these use-cases call for same-architecture virtualisation,
where the architecture of the guest and the host machines match—a situation
that can be accelerated by the hardware-assisted virtualisation extensions
present on modern processors. But, there is significant interest in virtualising
the hardware of different architectures on a host machine, especially in the
architectural research and development worlds.
Typically, the instruction set architecture of a guest platform will be different
to the host machine, e.g. an ARM guest on an x86 host will use an ARM instruction
set, whereas the host will be using the x86 instruction set. Therefore, to
enable this cross-architecture virtualisation, each guest instruction must be emulated
by the host CPU—a potentially costly operation. This thesis presents a
range of techniques for accelerating this instruction emulation, improving over
a state-of-the art instruction set simulator by 2:64x. But, emulation of the guest
platform’s instruction set is not enough for full hardware virtualisation. In fact,
this is just one challenge in a range of issues that must be considered. Specifically,
another challenge is efficiently handling the way external interrupts are
managed by the virtualisation system. This thesis shows that when employing
efficient instruction emulation techniques, it is not feasible to arbitrarily
divert control-flow without consideration being given to the state of the emulated
processor. Furthermore, it is shown that it is possible for the virtualisation
environment to behave incorrectly if particular care is not given to the point
at which control-flow is allowed to diverge. To solve this, a technique is developed
that maintains efficient instruction emulation, and correctly handles
external interrupt sources.
Finally, modern processors have built-in support for hardware virtualisation
in the form of instruction set extensions that enable the creation of an abstract
computing environment, indistinguishable from real hardware. These extensions
enable guest operating systems to run directly on the physical processor,
with minimal supervision from a hypervisor. However, these extensions are
geared towards same-architecture virtualisation, and as such are not immediately
well-suited for cross-architecture virtualisation. This thesis presents a
technique for exploiting these existing extensions, and using them in a cross-architecture
virtualisation setting, improving the performance of a novel cross-architecture
virtualisation hypervisor over state-of-the-art by 2:5x
SHAPES : Easy and high-level memory layouts
CPU speeds have vastly exceeded those of RAM. As such, developers who aim to achieve high
performance on modern architectures will most likely need to consider how to use CPU caches
effectively, hence they will need to consider how to place data in memory so as to exploit spatial
locality and achieve high memory bandwidth.
Performing such manual memory optimisations usually sacrifices readability, maintainability,
memory safety, and object abstraction. This is further exacerbated in managed languages, such
as Java and C#, where the runtime abstracts away the memory from the developer and such
optimisations are, therefore, almost impossible.
To that extent, we present in this thesis a language extension called SHAPES . SHAPES aims
to offer developers more fine-grained control over the placement of data, without sacrificing
memory safety or object abstraction, hence retaining the expressiveness and familiarity of OOP.
SHAPES introduces the concepts of pools and layouts; programmers group related objects into
pools, and specify how objects are laid out in these pools. Classes and types are annotated
by pool parameters, which allow placement aspects to be changed orthogonally to how the
business logic operates on the objects in the pool. These design decisions disentangle business
logic and memory concerns.
We provide a formal model of SHAPES , present its type and memory safety model, and its
translation into a low-level language. We present our reasoning as to why we can expect
SHAPES to be compiled in an efficient manner in terms of the runtime representation of objects
and the access to their fields.
Moreover, we present SHAPES -z, an implementation of SHAPES as an embeddable language,
and shapeszc , the compiler for SHAPES -z. We provide our our design and implementation
considerations for SHAPES -z and shapeszc . Finally, we evaluate the performance of SHAPES
and SHAPES -z through case studies.Open Acces
SIMD@OpenMP : a programming model approach to leverage SIMD features
SIMD instruction sets are a key feature in current general purpose and high performance architectures. SIMD instructions apply in parallel the same operation to a group of data, commonly known as vector. A single SIMD/vector instruction can, thus, replace a sequence of scalar instructions. Consequently, the number of instructions can be greatly reduced leading to improved execution times.
However, SIMD instructions are not widely exploited by the vast majority of programmers. In many cases, taking advantage of these instructions relies on the compiler. Nevertheless, compilers struggle with the automatic vectorization of codes. Advanced programmers are then compelled to exploit SIMD units by hand, using low-level hardware-specific intrinsics. This approach is cumbersome, error prone and not portable across SIMD architectures.
This thesis targets OpenMP to tackle the underuse of SIMD instructions from three main areas of the programming model: language constructions, compiler code optimizations and runtime algorithms. We choose the Intel Xeon Phi coprocessor (Knights Corner) and its 512-bit SIMD instruction set for our evaluation process. We make four contributions aimed at improving the exploitation of SIMD instructions in this scope.
Our first contribution describes a compiler vectorization infrastructure suitable for OpenMP. This infrastructure targets for-loops and whole functions. We define a set of attributes for expressions that determine how the code is vectorized. Our vectorization infrastructure also implements support for several advanced vector features. This infrastructure is proven to be effective in the vectorization of complex codes and it is the basis upon which we build the following two contributions.
The second contribution introduces a proposal to extend OpenMP 3.1 with SIMD parallelism. Essential parts of this work have become key features of the SIMD proposal included in OpenMP 4.0. We define the "simd" and "simd for" directives that allow programmers to describe SIMD parallelism of loops and whole functions. Furthermore, we propose a set of optional clauses that leads the compiler to generate a more efficient vector code. These SIMD extensions improve the programming efficiency when exploiting SIMD resources. Our evaluation on the Intel Xeon Phi coprocessor shows that our SIMD proposal allows the compiler to efficiently vectorize codes poorly or not vectorized automatically with the Intel C/C++ compiler.
In the third contribution, we propose a vector code optimization that enhances overlapped vector loads. These vector loads redundantly read from memory scalar elements already loaded by other vector loads. Our vector code optimization improves the memory usage of these accesses by means of building a vector register cache and exploiting register-to-register instructions. Our proposal also includes a new clause (overlap) in the context of the SIMD extensions for OpenMP of our first contribution. This new clause allows enabling, disabling and tuning this optimization on demand.
The last contribution tackles the exploitation of SIMD instructions in the OpenMP barrier and reduction primitives. We propose a new combined barrier and reduction tree scheme specifically designed to make the most of SIMD instructions. Our barrier algorithm takes advantage of simultaneous multi-threading technology (SMT) and it utilizes SIMD memory instructions in the synchronization process.
The four contributions of this thesis are an important step in the direction of a more common and generalized use of SIMD instructions. Our work is having an outstanding impact on the whole OpenMP community, ranging from users of the programming model to compiler and runtime implementations. Our proposals in the context of OpenMP improves the programmability of the programming model, the overhead of runtime services and the execution time of applications by means of a better use of SIMD.Los juegos de instrucciones SIMD son un componente clave en las arquitecturas de propósito general y de alto rendimiento actuales. Estas instrucciones aplican en paralelo la misma operación a un conjunto de datos, conocido como vector. Una instrucción SIMD/vectorial puede sustituir una secuencia de instrucciones escalares. Así, el número de instrucciones puede ser reducido considerablemente, dando lugar a mejores tiempos de ejecución.
No obstante, las instrucciones SIMD no son explotadas ampliamente por la mayoría de programadores. En general, beneficiarse de estas instrucciones depende del compilador. Sin embargo, los compiladores tienen dificultades con la vectorización automática de códigos por lo que los programadores avanzados se ven obligados a explotar las unidades SIMD manualmente, empleando intrínsecas de bajo nivel específicas del hardware. Esta aproximación es costosa, propensa a errores y no portable entre arquitecturas.
Esta tesis se centra en el modelo de programación OpenMP para abordar el poco uso de las instrucciones SIMD desde tres áreas: construcciones del lenguaje, optimizaciones de código del compilador y algoritmos del runtime. Hemos escogido el coprocesador Intel Xeon Phi (Knights Corner) y su juego de instrucciones SIMD de 512 bits para nuestra evaluación. Realizamos cuatro contribuciones para mejorar la explotación de las instrucciones SIMD en este ámbito.
Nuestra primera contribución describe una infraestructura de vectorización de compilador adecuada para OpenMP. Esta infraestructura tiene como objetivo la vectorización de bucles y funciones. Para ello definimos un conjunto de atributos que determina como se vectoriza el código. Nuestra evaluación demuestra la efectividad de esta infraestructura en la vectorización de códigos complejos. Esta infraestructura es la base de las dos propuestas siguientes.
En la segunda contribución proponemos una extensión SIMD para de OpenMP 3.1. Partes esenciales de este trabajo se han convertido en características clave de la propuesta sobre SIMD incluida en OpenMP 4.0. Definimos las directivas ‘simd’ y ‘simd for’ que permiten a los programadores describir paralelismo SIMD de bucles y funciones. Además, proponemos un conjunto de cláusulas opcionales que permiten que el compilador genere código vectorial más eficiente. Nuestra evaluación muestra que nuestra propuesta SIMD permite al compilador vectorizar eficientemente códigos pobremente o no vectorizados automáticamente con el compilador Intel C/C++