An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor
Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and the target hardware architecture, and evaluates the scalability of the proposed framework on a cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results obtained with an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture up to eight cores, at which point the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores), with further improvements from compiler optimisations (x14 for bitonic with the same configuration).
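Work-item coalescing, as mentioned in the abstract above, can be pictured with a minimal Python sketch. It is only a conceptual model: the thesis performs this transformation inside an LLVM-based compiler, whereas the kernel, the function names and the contiguous chunking policy below are assumptions made for illustration, and the "cores" are simulated sequentially.

```python
def vector_add_kernel(global_id, a, b, out):
    """Kernel body for a single work-item (hypothetical example kernel)."""
    out[global_id] = a[global_id] + b[global_id]


def run_coalesced(kernel, global_size, num_cores, *args):
    """Run `global_size` work-items on `num_cores` cores, one loop per core.

    Returns the range of work-item ids assigned to each core.
    Cores are simulated one after another; no real parallelism is used.
    """
    chunk = (global_size + num_cores - 1) // num_cores  # ceiling division
    assignments = []
    for core in range(num_cores):
        start = min(core * chunk, global_size)
        stop = min(start + chunk, global_size)
        # Coalesced form: many work-items execute inside one loop on one core,
        # instead of one hardware thread per work-item.
        for global_id in range(start, stop):
            kernel(global_id, *args)
        assignments.append(range(start, stop))
    return assignments
```

With five work-items and two cores, the first core loops over work-items 0 to 2 and the second over 3 and 4.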
From High Level Architecture Descriptions to Fast Instruction Set Simulators
As computer systems become increasingly complex and diverse, so too do the architectures
they implement. This leads to an increase in complexity in the tools used to design
new hardware and software. One particularly important tool in hardware and software
design is the Instruction Set Simulator, which is used to prototype new architectures and
hardware features, verify hardware, and test and debug software. Many Architecture
Description Languages exist which facilitate the description of new architectural or
hardware features, and generate tools such as simulators. However, these typically
suffer from poor performance, are difficult to test effectively, and may be limited in
functionality.
This thesis considers three objectives when developing Instruction Set Simulators:
performance, correctness, and completeness, and presents techniques which contribute
to each of these. Performance is obtained by combining Dynamic Binary Translation
techniques with a novel analysis of high level architecture descriptions. This makes use
of partial evaluation techniques in order to both improve the translation system, and to
improve the quality of the translated code, leading to a performance improvement of over
2.5x compared to a naïve implementation.
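The idea of applying partial evaluation to an instruction description can be illustrated with a small Python sketch. The add-immediate instruction, its fields and the function names are hypothetical and are not taken from the thesis; the sketch only shows how fields that are fixed once an instruction is decoded can be folded into specialised code.

```python
MASK32 = 0xFFFFFFFF


def interpret_add_imm(regs, flags, rd, rn, imm, set_flags):
    """Naive form: every instruction field is re-examined on each execution."""
    result = (regs[rn] + imm) & MASK32
    regs[rd] = result
    if set_flags:
        flags["Z"] = result == 0


def translate_add_imm(rd, rn, imm, set_flags):
    """Specialise the instruction on fields that are known at translation time.

    The test on `set_flags` is resolved once here, so the returned operation
    contains no branch on it.
    """
    if set_flags:
        def op(regs, flags):
            result = (regs[rn] + imm) & MASK32
            regs[rd] = result
            flags["Z"] = result == 0
    else:
        def op(regs, flags):
            regs[rd] = (regs[rn] + imm) & MASK32
    return op
```

The translated operation can be cached and reused each time the same instruction is executed, which is where a translation-based simulator would recover its cost.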
This thesis also presents techniques which contribute to the correctness objective.
Each possible behaviour of each described instruction is used to guide the generation
of a test case. Constraint satisfaction techniques are used to determine the necessary
instruction encoding and context for each behaviour to be produced. It is shown that
this is a significant improvement over benchmark-driven testing, and this technique
has led to the discovery of several bugs and inconsistencies in multiple state of the art
instruction set simulators.
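A toy version of behaviour-guided test generation might look like the following Python sketch. The 5-bit saturating add-immediate instruction is hypothetical, and plain exhaustive enumeration stands in for the constraint satisfaction techniques described above, which would be needed for realistic encodings and contexts.

```python
from itertools import product


def saturating_add_behaviour(encoding, reg_value):
    """Label the behaviour taken by a hypothetical saturating add-immediate."""
    imm = encoding & 0x0F           # bits 3..0: immediate operand
    saturate = (encoding >> 4) & 1  # bit 4: saturation enable
    total = reg_value + imm
    if total <= 0xFF:
        return "no_overflow"
    return "saturate" if saturate else "wrap"


def find_witnesses(behaviour, encodings, contexts):
    """Find one (encoding, context) pair that triggers each behaviour."""
    witnesses = {}
    for encoding, context in product(encodings, contexts):
        label = behaviour(encoding, context)
        witnesses.setdefault(label, (encoding, context))  # keep first witness
    return witnesses
```

Each witness gives the instruction encoding and register context for one test case; a benchmark-driven approach offers no guarantee of reaching the rarer overflow behaviours.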
Finally, several challenges in ‘Full System’ simulation are addressed, contributing
to both the performance and completeness objectives. Full System simulation generally
carries significant performance costs compared with other simulation strategies. Crucially,
instructions which access memory require virtual to physical address translation
and can now cause exceptions. Both of these processes must be correctly and efficiently
handled by the simulator. This thesis presents novel techniques to address this issue
which provide up to a 1.65x speedup over a state of the art solution.
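The cost described above comes from translating every memory access and allowing it to fault. A minimal Python sketch of the conventional approach, a software TLB in front of a page table, is given below. It is a generic baseline and not the novel technique of the thesis; the class names, the single-level page table and the 4 KiB page size are assumptions.

```python
PAGE_SHIFT = 12                      # assumed 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class PageFault(Exception):
    """Raised when a virtual address has no mapping."""


class SoftMMU:
    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.tlb = {}                 # cache of recent translations
        self.walks = 0                # number of slow-path page-table lookups

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        pfn = self.tlb.get(vpn)
        if pfn is None:
            # Slow path: consult the page table and fill the TLB.
            self.walks += 1
            pfn = self.page_table.get(vpn)
            if pfn is None:
                raise PageFault(hex(vaddr))
            self.tlb[vpn] = pfn
        return (pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK)
```

Repeated accesses to the same page hit the TLB and skip the page-table lookup, while an unmapped address surfaces as an exception that the simulator must deliver to the guest.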
An Efficient NoC-based Framework To Improve Dataflow Thread Management At Runtime
This doctoral thesis focuses on how application threads based on the dataflow
execution model can be managed at the Network-on-Chip (NoC) level. The roots of the
dataflow execution model date back to the early 1970s. Applications adhering to this
program execution model follow a simple producer-consumer communication scheme for
synchronising parallel thread activities. In a dataflow execution environment, a
thread can run if and only if all its required inputs are available. Applications running
on a large and complex computing environment can benefit significantly from the
adoption of the dataflow model.
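The firing rule stated above (a thread runs only once all of its inputs are available) can be sketched in a few lines of Python. The class, the token format and the scheduler loop are illustrative assumptions and do not describe the hardware mechanism studied in the thesis.

```python
from collections import deque


class DataflowThread:
    def __init__(self, name, input_slots, body):
        self.name = name
        self.missing = set(input_slots)  # inputs not yet received
        self.inputs = {}
        self.body = body                 # maps inputs -> list of output tokens

    def deliver(self, slot, value):
        """Store one input; return True when the last missing input arrives."""
        self.inputs[slot] = value
        if slot in self.missing:
            self.missing.remove(slot)
            return not self.missing
        return False


def run(threads, initial_tokens):
    """Deliver (target, slot, value) tokens until none remain.

    Returns the names of the threads in the order in which they fired.
    """
    tokens = deque(initial_tokens)
    fired = []
    while tokens:
        target, slot, value = tokens.popleft()
        thread = threads[target]
        if thread.deliver(slot, value):
            fired.append(thread.name)
            tokens.extend(thread.body(thread.inputs))  # producers feed consumers
    return fired
```

A consumer with two input slots stays idle after the first token arrives and fires only when the second producer has delivered its value.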
In the first part of the thesis, the work focuses on the thread distribution mechanism.
It is shown how a scalable hash-based thread distribution mechanism
can be implemented at the router level with low overheads. To enhance the support further,
a tool to monitor the status of dataflow threads and a simple functional model are
also incorporated into the design. Next, a software-defined NoC is proposed to
manage the distribution of dataflow threads by exploiting its reconfigurability.
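Hash-based thread distribution can be illustrated in software as follows. The abstract does not say which hash function or key the routers use, so the CRC32 hash, the 64-bit thread identifier and the function names below are assumptions chosen only to make the sketch deterministic and runnable.

```python
import zlib


def core_for_thread(thread_id, num_cores):
    """Map a thread identifier to a core index with a stateless hash."""
    key = thread_id.to_bytes(8, "little")  # assumed 64-bit thread identifier
    return zlib.crc32(key) % num_cores


def distribute(thread_ids, num_cores):
    """Group thread identifiers by the core that the hash selects."""
    placement = {core: [] for core in range(num_cores)}
    for thread_id in thread_ids:
        placement[core_for_thread(thread_id, num_cores)].append(thread_id)
    return placement
```

Because the mapping depends only on the identifier and the core count, any router can compute a thread's destination locally without consulting a central table, which is what makes the scheme scale.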
The second part of this work focuses on the NoC microarchitecture level. A traditional
2D-mesh topology is combined with a standard ring to understand how such a
hybrid network topology can outperform the traditional topology (such as a 2D-mesh). Finally,
a mixed-integer linear programming based analytical model is proposed
to verify whether the mapping of application threads onto the free cores is optimal. The
proposed mathematical model can be used as a yardstick to verify the solution quality
of the newly developed mapping policy. It is not trivial to provide a complete low-level
framework for dataflow thread execution for better resource and power management.
However, this work could be considered a primary framework on which further improvements
could be made.
Architectural Exploration of KeyRing Self-Timed Processors
SUMMARY
Over recent decades, gains in processor performance have been constrained by the limits
imposed by the energy consumption of electronic systems: from the very low consumption
required by connected objects, to the electricity budgets of servers, by way of thermal
limits and battery life in mobile devices. This strong demand for energy-efficient
processors, coupled with the limits of transistor scaling (which no longer improves
performance at constant power density), leads integrated-circuit designers to explore new
microarchitectures that deliver better performance for a given energy budget. This thesis
follows this trend by proposing a new processor microarchitecture, called KeyRing, designed
with the intent of reducing processor energy consumption.
The dynamic energy consumption of integrated circuits is proportional to the switching
frequency of their transistors. Consequently, design techniques that dynamically reduce the
number of switching transistors are very widely adopted to improve the energy efficiency of
processors. Clock gating is particularly common in synchronous circuits, because it reduces
the impact of the global clock, which is the main source of activity. The KeyRing
microarchitecture presented in this thesis uses a decentralised, asynchronous synchronisation
method to reduce circuit activity. It is derived from the AnARM, a processor developed by
Octasic on the basis of an ad hoc asynchronous microarchitecture. Although it is more energy
efficient than synchronous alternatives, the AnARM is essentially incompatible with standard
synthesis and static timing analysis methods. Moreover, its ad hoc design technique fits only
partially within asynchronous design paradigms. This thesis proposes a rigorous approach to
defining the general principles of this ad hoc design technique, drawing on the asynchronous
literature. The resulting KeyRing microarchitecture is developed together with an automated
design method that overcomes the native incompatibilities between design tools and
asynchronous systems. The proposed method makes full use of the standard design flows of the
microelectronics industry to synthesise and verify KeyRing circuits. This thesis also proposes
experimental protocols intended to strengthen the causal link between the KeyRing
microarchitecture and a reduction in processor energy consumption, compared with equivalent
synchronous alternatives.
ABSTRACT
Over recent years, microprocessors have had to increase their performance while keeping
their power envelope within tight bounds, as dictated by the needs of various markets: from
the ultra-low power requirements of the IoT, to the electrical power consumption budget
in enterprise servers, by way of passive cooling and day-long battery life in mobile devices.
This high demand for power-efficient processors, coupled with the limitations of technology
scaling (which no longer provides improved performance at constant power density), is
leading designers to explore new microarchitectures with the goal of extracting more performance
from a fixed power budget. This work follows this trend by proposing a new
processor microarchitecture, called KeyRing, with a low-power design intent.
The switching activity of integrated circuits—i.e. transistors switching on and off—directly
affects their dynamic power consumption. Circuit-level design techniques such as clock-gating
are widely adopted as they dramatically reduce the impact of the global clock in synchronous
circuits, which constitutes the main source of switching activity. The KeyRing microarchitecture
presented in this work uses an asynchronous clocking scheme that relies on decentralized
synchronization mechanisms to reduce the switching activity of circuits. It is derived from
the AnARM, a power-efficient ARM processor developed by Octasic using an ad hoc asynchronous
microarchitecture. Although it delivers better power-efficiency than synchronous
alternatives, it is for the most part incompatible with standard timing-driven synthesis and
Static Timing Analysis (STA). In addition, its design style does not fit well within the existing
asynchronous design paradigms. This work lays the foundations for a more rigorous
definition of this rather unorthodox design style, using circuits and methods coming from the
asynchronous literature. The resulting KeyRing microarchitecture is developed in combination
with Electronic Design Automation (EDA) methods that alleviate incompatibility issues
related to ad hoc clocking, enabling timing-driven optimizations and verifications of KeyRing
circuits using industry-standard design flows. In addition to bridging the gap with standard
design practices, this work also proposes comprehensive experimental protocols that aim to
strengthen the causal relation between the reported asynchronous microarchitecture and a
reduced power consumption compared with synchronous alternatives.
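The dependence of dynamic power on switching activity noted above is commonly summarised by the first-order CMOS relation below. This is the standard textbook approximation rather than a formula quoted from the thesis, and the symbol names are the conventional ones.

```latex
P_{\text{dyn}} \approx \alpha \, C_{\text{L}} \, V_{\text{DD}}^{2} \, f
```

Here \alpha is the switching activity factor (the fraction of the capacitance switched per cycle), C_L is the switched load capacitance, V_DD is the supply voltage and f is the clock frequency. Clock gating and asynchronous, decentralised synchronisation both aim to lower \alpha without changing the other terms.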
The main achievement of this work is a framework that enables the architectural exploration
of circuits using the KeyRing microarchitecture.
A configurable vector processor for accelerating speech coding algorithms
The growing demand for voice-over-packet (VoIP) services and multimedia-rich
applications has made the efficient, real-time implementation of
low-bit-rate speech coders on embedded VLSI platforms increasingly important. Such speech coders are
designed to substantially reduce bandwidth requirements, thus enabling dense multichannel
gateways in a small form factor. This, however, comes at a high computational cost,
which mandates the use of very high performance embedded processors.
This thesis investigates the potential acceleration of two major ITU-T speech coding
algorithms, namely G.729A and G.723.1, through their efficient implementation on a
configurable extensible vector embedded CPU architecture. New scalar and vector ISAs
were introduced which resulted in up to 80% reduction in the dynamic instruction count
of both workloads. These instructions were subsequently encapsulated into a parametric,
hybrid SISD (scalar processor)–SIMD (vector) processor. This work presents the research
and implementation of the vector datapath of this vector coprocessor, which is tightly coupled
to a Sparc-V8 compliant CPU, the optimization and simulation methodologies
employed, and the use of Electronic System Level (ESL) techniques to rapidly design
SIMD datapaths.