RVSDG: An Intermediate Representation for Optimizing Compilers
Intermediate Representations (IRs) are central to optimizing compilers as the
way the program is represented may enhance or limit analyses and
transformations. Suitable IRs focus on exposing the most relevant information
and establish invariants that different compiler passes can rely on. While
control-flow centric IRs appear to be a natural fit for imperative programming
languages, compiler analyses have increasingly shifted toward understanding
data dependencies and working at multiple abstraction layers simultaneously.
This is partially evidenced by recent developments such as MLIR, proposed by
Google. However, rigorous use of data-flow-centric IRs in general-purpose
compilers has not been evaluated for feasibility and usability, as
previous works provide no practical implementations. We present the
Regionalized Value State Dependence Graph (RVSDG) IR for optimizing compilers.
The RVSDG is a data-flow-centric IR where nodes represent computations, edges
represent computational dependencies, and regions capture the hierarchical
structure of programs. It represents programs in demand-dependence form,
implicitly supports structured control flow, and models entire programs within
a single IR. We provide a complete specification of the RVSDG, construction and
destruction methods, as well as exemplify its utility by presenting Dead Node
and Common Node Elimination optimizations. We implemented a prototype compiler
and evaluate it in terms of performance, code size, compilation time, and
representational overhead. Our results indicate that the RVSDG can serve as a
competitive IR in optimizing compilers while reducing complexity.
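As an illustration of the structure described above (a hypothetical sketch, not the paper's actual implementation), nodes can carry edges to the nodes that produce their operands, a region can record which values leave it, and Dead Node Elimination then reduces to keeping only the nodes reachable from the region's results:

```python
class Node:
    """A computation; `operands` are edges to the producing nodes."""
    def __init__(self, name, operands=()):
        self.name = name
        self.operands = list(operands)

class Region:
    """Holds a set of nodes; `results` are the values that leave the region."""
    def __init__(self):
        self.nodes = []
        self.results = []

    def add(self, name, *operands):
        node = Node(name, operands)
        self.nodes.append(node)
        return node

    def dead_node_elimination(self):
        # A node is live iff it is reachable from the region's results
        # by following dependency edges; everything else is removed.
        live, stack = set(), list(self.results)
        while stack:
            node = stack.pop()
            if node not in live:
                live.add(node)
                stack.extend(node.operands)
        self.nodes = [n for n in self.nodes if n in live]

# Build: c = a + b is the region's result; d = a * a is never used.
r = Region()
a, b = r.add("a"), r.add("b")
c = r.add("add", a, b)
d = r.add("mul", a, a)
r.results = [c]
r.dead_node_elimination()
print(sorted(n.name for n in r.nodes))  # the unused "mul" node is gone
```

Because dependencies are explicit edges rather than an ordering of statements, the pass needs no liveness analysis over control flow, which is one of the simplifications the demand-dependence form buys.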
MAFin: Maximizing Accuracy in FinFET based Approximated Real-Time Computing
We propose MAFin, which exploits the unique temperature effect inversion (TEI) property of FinFET-based multicore platforms, where processing speed increases with temperature, in the context of approximate real-time computing. In approximate real-time computing, the execution of each task can be divided into two parts: (i) the mandatory part, whose execution provides a result of acceptable quality, followed by (ii) the optional part, which can be executed partially or fully to refine the initially obtained result and increase the result-accuracy (QoS) without violating deadlines. With the objective of maximizing QoS on a FinFET-based multicore system, MAFin, our proposed real-time scheduler, first derives a task-to-core allocation and prepares a schedule while respecting system-wide constraints. During execution, MAFin further increases the achieved QoS, balancing performance and temperature on the fly through a prudent, temperature-cognizant frequency-management mechanism, while guaranteeing the imposed constraints. Specifically, MAFin exploits the TEI property of FinFET-based processors, where processor speed is enhanced at increased temperature, to reduce the execution time of individual tasks. The reduced execution time is then traded off either to enhance QoS by executing more of the tasks' optional parts or to improve energy efficiency by turning off the core. In a benchmark-based evaluation on a 4-core system, MAFin surpasses prior art, achieving 70% QoS, which is further enhanced by 8.3% online, with an EDP gain of up to 12%.
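The mandatory/optional task model above can be illustrated with a toy slack-allocation step (a hypothetical sketch of the general imprecise-computation idea, not MAFin's actual scheduler): slack recovered from shorter execution times is spent on the optional parts with the highest QoS gain first.

```python
def allocate_optional(tasks, slack):
    """tasks: list of (optional_time, qos_gain_per_time_unit).
    Returns how much of each task's optional part to execute,
    greedily favoring the highest QoS gain per unit of slack."""
    run = [0] * len(tasks)
    order = sorted(range(len(tasks)), key=lambda i: -tasks[i][1])
    for i in order:
        spend = min(tasks[i][0], slack)  # never exceed the optional part
        run[i] = spend
        slack -= spend
    return run

# Two tasks: 4 units of optional work at gain 1.0, 3 units at gain 2.0.
# With 5 units of slack, the higher-gain task runs fully first.
print(allocate_optional([(4, 1.0), (3, 2.0)], 5))  # [2, 3]
```

MAFin's extra leverage, per the abstract, is that TEI increases the available slack at higher temperatures, which this greedy step can then convert into QoS or into energy savings.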
Adapt or Become Extinct! The Case for a Unified Framework for Deployment-Time Optimization
The High-Performance Computing ecosystem consists of a large variety of execution platforms that demonstrate a wide diversity in hardware characteristics such as CPU architecture, memory organization, interconnection network, accelerators, etc. This environment also presents a number of hard boundaries (walls) for applications, which limit software development (parallel programming wall), performance (memory wall, communication wall), and viability (power wall). The only way to survive in such a demanding environment is by adaptation. In this paper we discuss how dynamic information collected during the execution of an application can be utilized to adapt the execution context, and may lead to performance gains beyond those provided by static information and compile-time adaptation. We consider specialization based on dynamic information such as user input, architectural characteristics such as the memory hierarchy organization, and the execution profile of the application as obtained from the execution platform's performance monitoring units. One of the challenges for future execution platforms is to allow the seamless integration of these various kinds of information with information obtained from static analysis (either ahead-of-time or just-in-time compilation). We extend the notion of information-driven adaptation and outline the architecture of an infrastructure designed to enable information flow and adaptation throughout the life-cycle of an application.
Viterbi Accelerator for Embedded Processor Datapaths
We present a novel architecture for a lightweight Viterbi accelerator that can be tightly integrated inside an embedded processor. We investigate the accelerator's impact on processor performance by using the EEMBC Viterbi benchmark and the in-house Viterbi Branch Metric kernel. Our evaluation based on the EEMBC benchmark shows that an accelerated 65-nm 2.7-ns processor datapath is 20% larger but 90% more cycle efficient than a datapath lacking the Viterbi accelerator, leading to an 87% overall energy reduction and a data throughput of 3.52 Mbit/s.
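For readers unfamiliar with the kernel being accelerated, the following is a minimal hard-decision Viterbi decoder in software (an illustrative sketch: the rate-1/2, constraint-length-3 code with generators 7 and 5 octal is an assumption, not necessarily the code used by the EEMBC benchmark). The branch metric, the inner operation the accelerator targets, is simply the Hamming distance between received and expected output bits:

```python
def encode(bits, polys=(0b111, 0b101), k=3):
    """Rate-1/2 convolutional encoder, constraint length k;
    flushes the shift register with k-1 trailing zeros."""
    state, out = 0, []
    for b in bits + [0] * (k - 1):
        state = ((state << 1) | b) & ((1 << k) - 1)
        for p in polys:
            out.append(bin(state & p).count("1") & 1)
    return out

def viterbi_decode(received, polys=(0b111, 0b101), k=3):
    """Hard-decision Viterbi decoding via add-compare-select."""
    n_states, r = 1 << (k - 1), len(polys)
    INF = float("inf")
    metric = [0] + [INF] * (n_states - 1)      # encoder starts in state 0
    path = [[] for _ in range(n_states)]
    for t in range(0, len(received), r):
        sym = received[t:t + r]
        new_metric = [INF] * n_states
        new_path = [None] * n_states
        for s in range(n_states):
            if metric[s] == INF:
                continue
            for b in (0, 1):
                full = ((s << 1) | b) & ((1 << k) - 1)
                nxt = full & (n_states - 1)
                expect = [bin(full & p).count("1") & 1 for p in polys]
                # Branch metric: Hamming distance to the expected symbol.
                bm = sum(x != y for x, y in zip(sym, expect))
                if metric[s] + bm < new_metric[nxt]:   # compare-select
                    new_metric[nxt] = metric[s] + bm
                    new_path[nxt] = path[s] + [b]
        metric, path = new_metric, new_path
    best = min(range(n_states), key=lambda s: metric[s])
    return path[best][:-(k - 1)]               # drop the flush bits

msg = [1, 0, 1, 1, 0, 0, 1]
assert viterbi_decode(encode(msg)) == msg      # round-trips cleanly
```

The per-state, per-symbol branch-metric and add-compare-select operations in the inner loop are exactly what a datapath-integrated accelerator can collapse into a few cycles.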
Efficient and Flexible Embedded Systems and Datapath Components
The comfort of our daily lives has come to rely on a vast number of embedded systems, such as mobile phones, anti-spin systems for cars, and high-definition video. Meeting often stringent requirements, in terms of high performance, low power dissipation, and low cost, while improving the end-user experience makes these systems complex and nontrivial to design.
This thesis addresses design challenges in three different areas of embedded systems. The presented FlexCore processor intends to improve the programmability of heterogeneous embedded systems while maintaining the performance of application-specific accelerators. This is achieved by integrating accelerators into the datapath of a general-purpose processor, in combination with a wide control word consisting of all control signals in a FlexCore's datapath. Furthermore, a FlexCore processor utilizes a flexible interconnect, which together with the expressiveness of the wide control word improves its performance.
When designing new embedded systems it is important to have efficient components to build from. Arithmetic circuits are especially important, since they are extensively used in all applications. In particular, integer multipliers pose significant design challenges. The proposed twin-precision technique makes it possible to improve both the throughput and power dissipation of conventional integer multipliers when computing narrow-width multiplications. The thesis also shows that the Baugh-Wooley algorithm is more suitable for hardware implementations of signed integer multipliers than the commonly used modified-Booth algorithm.
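A bit-level software sketch (not the thesis's circuit) helps show why Baugh-Wooley maps well to hardware: every partial-product bit is a plain AND gate, except those involving exactly one sign bit, which are simply inverted (a NAND gate), plus two constant '1's; no Booth recoding logic is required.

```python
def baugh_wooley(a, b, n):
    """Multiply two n-bit two's-complement integers using the
    (modified) Baugh-Wooley partial-product scheme; returns the
    2n-bit signed product."""
    abits = [(a >> i) & 1 for i in range(n)]
    bbits = [(b >> i) & 1 for i in range(n)]
    total = 0
    # Plain AND partial products for the non-sign bits.
    for i in range(n - 1):
        for j in range(n - 1):
            total += (abits[i] & bbits[j]) << (i + j)
    # Partial products involving exactly one sign bit are inverted (NAND).
    for j in range(n - 1):
        total += (1 - (abits[n - 1] & bbits[j])) << (n - 1 + j)
    for i in range(n - 1):
        total += (1 - (abits[i] & bbits[n - 1])) << (n - 1 + i)
    # Sign-bit x sign-bit product, plus the two correction constants.
    total += (abits[n - 1] & bbits[n - 1]) << (2 * n - 2)
    total += (1 << n) + (1 << (2 * n - 1))
    total &= (1 << (2 * n)) - 1          # keep 2n bits
    # Interpret the 2n-bit result as two's complement.
    return total - (1 << (2 * n)) if total >> (2 * n - 1) else total

# Exhaustive check against Python's multiply for all 4-bit operand pairs.
assert all(baugh_wooley(x - 8, y - 8, 4) == (x - 8) * (y - 8)
           for x in range(16) for y in range(16))
```

The uniform AND/NAND array structure is also what makes the multiplier amenable to techniques such as twin-precision, where the array is partitioned to compute narrower products.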
A multi-core architecture is a common design choice when a single-core architecture cannot deliver sufficient performance. However, multi-core architectures introduce their own design challenges, such as scheduling applications onto several cores. This thesis presents a novel task management unit, which offloads task scheduling from the conventional cores of a multi-core system, thus improving both the performance and power efficiency of the system.
This thesis proposes novel solutions to a number of relevant issues that need to be addressed when designing embedded systems.