58 research outputs found
Towards a performance- and energy-efficient data filter cache
As CPU data requests to the level-one (L1) data cache (DC) can represent as much as 25% of an embedded processor's total power dissipation, techniques that decrease L1 DC accesses can significantly enhance processor energy efficiency. Filter caches are known to efficiently decrease the number of accesses to instruction caches. However, due to the irregular access pattern of data accesses, a conventional data filter cache (DFC) has a high miss rate, which degrades processor performance. We propose to integrate a DFC with a fast address calculation technique to significantly reduce the impact of misses and to improve performance by enabling one-cycle loads. Furthermore, we show that DFC stalls can be eliminated even after unsuccessful fast address calculations, by simultaneously accessing the DFC and L1 DC on the following cycle. We quantitatively evaluate different DFC configurations, with and without the fast address calculation technique, using different write allocation policies, and qualitatively describe their impact on energy efficiency. The proposed design provides an efficient DFC that yields both energy and performance improvements.
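To make the interaction between the fast address calculation and the filter cache concrete, the following Python sketch models a direct-mapped DFC that is probed with a speculatively formed address. The line size, DFC size, function names, and the exact speculation-failure condition are illustrative assumptions, not the configuration evaluated in the paper.

```python
# Hypothetical behavioral model of a direct-mapped data filter cache (DFC)
# accessed with a speculative "fast address calculation": only the line-offset
# bits of base + offset are added before the DFC probe, while the upper bits
# are taken unchanged from the base register. If the low-bit addition carries
# out of the line offset, or the offset has upper bits set, the speculation
# fails and a normal access follows on the next cycle.

LINE_BYTES = 32          # bytes per cache line (assumed)
DFC_LINES = 16           # number of DFC lines (assumed, i.e. a 512-byte filter cache)

def fast_address(base, offset):
    """Speculative effective address and a flag telling whether it is exact."""
    low = (base + offset) & (LINE_BYTES - 1)
    speculative = (base & ~(LINE_BYTES - 1)) | low
    carry_out = ((base & (LINE_BYTES - 1)) + (offset & (LINE_BYTES - 1))) >= LINE_BYTES
    ok = not carry_out and (offset & ~(LINE_BYTES - 1)) == 0
    return speculative, ok

def dfc_lookup(dfc, addr):
    """One-cycle DFC probe: dfc maps line index -> stored tag."""
    index = (addr // LINE_BYTES) % DFC_LINES
    tag = addr // (LINE_BYTES * DFC_LINES)
    return dfc.get(index) == tag

def load(dfc, base, offset):
    """Returns the effective address and the load latency in cycles."""
    addr = base + offset                      # real effective address
    spec_addr, ok = fast_address(base, offset)
    if ok and dfc_lookup(dfc, spec_addr):
        return addr, 1                        # one-cycle load: DFC hit on speculation
    # Failed speculation or DFC miss: probe the DFC and L1 DC in parallel on the
    # following cycle, so no stall is added beyond the normal L1 DC latency.
    return addr, 2

# Pre-load one DFC line so the example shows a one-cycle speculative hit.
dfc = {}
line_addr = 0x2000
dfc[(line_addr // LINE_BYTES) % DFC_LINES] = line_addr // (LINE_BYTES * DFC_LINES)
print(load(dfc, base=0x2000, offset=8))       # (0x2008, 1): one-cycle load
```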
Speculative tag access for reduced energy dissipation in set-associative L1 data caches
For performance reasons, all ways in set-associative level-one (L1) data caches are accessed in parallel for load operations, even though the requested data can only reside in one of the ways. Thus, a significant amount of energy is wasted when loads are performed. We propose a speculation technique that performs the tag comparison in parallel with the address calculation, leading to the access of only one way during the following cycle on successful speculations. The technique incurs no execution time penalty, has an insignificant area overhead, and does not require any customized SRAM implementation. Assuming a 16 kB 4-way set-associative L1 data cache implemented in a 65-nm process technology, our evaluation based on 20 different MiBench benchmarks shows that the proposed technique on average leads to a 24% data cache energy reduction.
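A rough behavioral model can clarify where the energy is saved: the tag comparison is done a cycle early using the line address predicted from the base register, so only the matching data way needs to be read once the real address is known. The 16 kB, 4-way configuration is taken from the abstract; the 32-byte line size and the interface below are assumptions.

```python
# Hypothetical model of speculative tag access in a set-associative L1 data
# cache: the tag lookup is performed one cycle early, in parallel with the
# effective-address calculation, using the line address predicted from the
# base register. On a correct prediction only the single matching data way is
# read in the following cycle; on a misprediction the cache falls back to a
# conventional parallel read of all ways. 16 kB and 4 ways follow the paper;
# the 32-byte line size is an assumption.

WAYS, LINE = 4, 32
SETS = (16 * 1024) // (LINE * WAYS)            # 128 sets

def split(addr):
    return (addr // LINE) % SETS, addr // (LINE * SETS)   # (set index, tag)

def find_way(cache, index, tag):
    return next((w for w, t in enumerate(cache[index]) if t == tag), None)

def load(cache, base, offset):
    """cache[set][way] holds the tag stored in that way (None if invalid).
    Returns the hit way (or None) and the number of data ways read."""
    predicted = split(base)                    # speculative lookup, cycle 1
    actual = split(base + offset)              # real address, ready in cycle 2
    index, tag = actual
    hit_way = find_way(cache, index, tag)
    if predicted == actual:
        # Correct speculation: the hit way is already known from the early tag
        # compare, so at most one data way is read (zero on a cache miss).
        ways_read = 1 if hit_way is not None else 0
    else:
        ways_read = WAYS                       # conventional access of all ways
    return hit_way, ways_read

cache = [[None] * WAYS for _ in range(SETS)]
index, tag = split(0x1000)
cache[index][0] = tag                          # install one line for the example
print(load(cache, base=0x1000, offset=0))      # (0, 1): hit, one data way read
```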
Efficient Reconfigurable Multipliers Based on the Twin-Precision Technique
During the last decade of integrated electronic design, ever more functionality has been integrated onto the same chip, paving the way for having a whole system on a single chip. The drive for ever more functionality increases the demands on circuit designers, who have to provide the foundation for all this functionality. The desire for increased functionality, and an associated capability to adapt to changing requirements, has led to the design of reconfigurable architectures. With the increased interest in and use of reconfigurable architectures, there is a need for flexible and reconfigurable computational units that can meet the demands of high speed, high throughput, low power, and area efficiency. Multiplications are complex to implement, and they continue to give designers headaches when trying to efficiently implement multipliers in hardware. Multipliers are therefore interesting to study when investigating how to design flexible and reconfigurable computational units. In this thesis the results from investigations on flexible multipliers are presented. The new twin-precision technique, which was developed during this work, makes a multiplier able to adapt to different requirements. By adapting to the actual multiplication bitwidth using the twin-precision technique, it is possible to save power, increase speed, and double computational throughput. The investigations have also led to the conclusion that the long-used and popular modified-Booth multiplier is inferior in all aspects to the less complex Baugh-Wooley multiplier. During this work, a VHDL multiplier generator was created and made publicly available.
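The core of the twin-precision technique can be illustrated with a small behavioral model of an unsigned partial-product array: gating the "cross" partial products to zero turns one N x N multiplier into two independent N/2 x N/2 multipliers. The sketch below is a minimal Python illustration with an assumed N = 8; it is not the generator produced in the thesis.

```python
# A minimal behavioral sketch of the twin-precision idea for an unsigned
# N x N array multiplier: the full partial-product matrix computes one
# N x N product, but if the "cross" partial products (low half of one
# operand times high half of the other) are forced to zero, the same
# array computes two independent N/2 x N/2 products in parallel, one in
# the low and one in the high half of the result. N = 8 is illustrative.

N = 8
H = N // 2

def multiply(a, b, twin=False):
    """Sum the partial-product matrix a_j * b_i << (i + j), optionally
    gating the cross products for twin-precision operation."""
    result = 0
    for i in range(N):
        for j in range(N):
            in_low = i < H and j < H           # low x low partial product
            in_high = i >= H and j >= H        # high x high partial product
            if twin and not (in_low or in_high):
                continue                        # cross products gated to zero
            result += ((a >> j) & 1) * ((b >> i) & 1) << (i + j)
    return result

# Full-precision mode: one 8 x 8 multiplication.
assert multiply(200, 30) == 200 * 30

# Twin-precision mode: two 4 x 4 multiplications in one pass.
a = (7 << H) | 5                                # packs the narrow operands 7 and 5
b = (9 << H) | 12                               # packs the narrow operands 9 and 12
p = multiply(a, b, twin=True)
assert (p & ((1 << N) - 1)) == 5 * 12           # low product in bits 0..7
assert (p >> N) == 7 * 9                        # high product in bits 8..15
```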
Efficient and Flexible Embedded Systems and Datapath Components
The comfort of our daily lives has come to rely on a vast number of embedded systems, such as mobile phones, anti-spin systems for cars, and high-definition video. Improving the end-user experience under often stringent requirements, in terms of high performance, low power dissipation, and low cost, makes these systems complex and nontrivial to design. This thesis addresses design challenges in three different areas of embedded systems. The presented FlexCore processor intends to improve the programmability of heterogeneous embedded systems while maintaining the performance of application-specific accelerators. This is achieved by integrating accelerators into the datapath of a general-purpose processor, in combination with a wide control word consisting of all control signals in a FlexCore's datapath. Furthermore, a FlexCore processor utilizes a flexible interconnect, which together with the expressiveness of the wide control word improves its performance. When designing new embedded systems it is important to have efficient components to build from. Arithmetic circuits are especially important, since they are extensively used in all applications. In particular, integer multipliers present big design challenges. The proposed twin-precision technique makes it possible to improve both throughput and power of conventional integer multipliers when computing narrow-width multiplications. The thesis also shows that the Baugh-Wooley algorithm is more suitable for hardware implementations of signed integer multipliers than the commonly used modified-Booth algorithm. A multi-core architecture is a common design choice when a single-core architecture cannot deliver sufficient performance. However, multi-core architectures introduce their own design challenges, such as scheduling applications onto several cores. This thesis presents a novel task management unit, which offloads task scheduling from the conventional cores of a multi-core system, thus improving both performance and power efficiency of the system. This thesis proposes novel solutions to a number of relevant issues that need to be addressed when designing embedded systems.
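The task management unit is described only at a high level in the abstract; the following Python sketch merely illustrates the general idea of moving the ready-task queue out of the worker cores and into a dedicated unit. The class name, queue discipline, and interface are invented for illustration.

```python
# A heavily simplified sketch of the idea behind a hardware task management
# unit: a dedicated unit owns the ready-task queue and hands tasks to idle
# cores, so the worker cores spend no cycles (and generate no lock traffic)
# on scheduling. This is an assumption-laden illustration, not the unit
# described in the thesis.

from collections import deque

class TaskManagementUnit:
    def __init__(self):
        self.ready = deque()          # ready-task queue kept in the dedicated unit

    def submit(self, task):
        self.ready.append(task)       # a core enqueues a new task with one request

    def fetch(self):
        return self.ready.popleft() if self.ready else None   # idle core asks for work

def worker(core_id, tmu):
    """A core simply asks the TMU for work; no scheduling code runs on the core."""
    while (task := tmu.fetch()) is not None:
        task(core_id)

tmu = TaskManagementUnit()
for n in range(4):
    tmu.submit(lambda cid, n=n: print(f"core {cid}: task {n}"))
worker(0, tmu)
```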
FlexCore: Implementing an Exposed Datapath Processor
The FlexCore processor is the resulting implementation of an exposed-datapath approach conceptualized in the FlexSoC programme. By way of a crossbar switch interconnect, all execution units in a FlexCore datapath can potentially communicate, allowing the inherent hardware parallelism to be utilized. This interconnect enables configuration of a datapath to match an application domain, for example, by way of datapath accelerators. The baseline FlexCore is a general-purpose processor (GPP), and since all FlexCore configurations are extensions to the baseline, they offer GPP functionality as a complement to the domain-specific functionality. This paper gives an overview of the implementation of complete FlexCore processors, accompanied by discussions on datapath interconnects, datapath extensions, and instruction decompression.
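To give a feel for what "exposed datapath" means in practice, the sketch below models a wide control word as a record in which every datapath control signal and every crossbar route is specified explicitly for each cycle. The field names and unit set are invented for illustration and are not the actual FlexCore control-word format; the width of such a word is also why instruction decompression is discussed in the paper.

```python
# A hedged illustration of an exposed-datapath control word: each cycle, the
# program directly specifies the control signals of every execution unit and
# how the crossbar routes values between them. The fields below are assumed
# for illustration only.

from dataclasses import dataclass

@dataclass
class ControlWord:
    alu_op: str           # operation for the ALU ("add", "sub", ...)
    alu_src_a: str        # crossbar source routed to ALU input A
    alu_src_b: str        # crossbar source routed to ALU input B
    mem_read: bool        # load-store unit control
    mem_addr_src: str     # crossbar source routed to the LSU address port
    rf_write: bool        # register-file write enable
    rf_src: str           # crossbar source routed to the register-file write port

# One "instruction" is simply one fully specified control word per cycle,
# e.g. add two register operands and write the ALU result back:
cw = ControlWord(alu_op="add", alu_src_a="rf_port0", alu_src_b="rf_port1",
                 mem_read=False, mem_addr_src="rf_port0",
                 rf_write=True, rf_src="alu_out")
print(cw)
```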
The Case for HPM-Based Baugh-Wooley Multipliers
The modified-Booth algorithm is extensively used for high-speed multiplier circuits. Once, when array multipliers were used, the reduced number of generated partial products significantly improved multiplier performance. In designs based on reduction trees with logarithmic logic depth, however, the reduced number of partial products has a limited impact on overall performance. The Baugh-Wooley algorithm is a different scheme for signed multiplication, but it is not so widely adopted because it may be complicated to deploy on irregular reduction trees. We use the Baugh-Wooley algorithm in our High Performance Multiplier (HPM) tree, which combines a regular layout with a logarithmic logic depth. We show for a range of operand bit-widths that, when implemented in 130-nm and 65-nm process technologies, Baugh-Wooley multipliers exhibit comparable delay, lower power dissipation, and a smaller area footprint than modified-Booth multipliers.
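The Baugh-Wooley scheme referred to above can be captured in a few lines: all partial products are simple AND terms, except that those involving exactly one sign bit are inverted, and two constant bits are added as a correction. The following Python model, with an assumed operand width of N = 8, checks this formulation exhaustively against ordinary signed multiplication; it is a bit-level sketch, not the HPM-tree implementation.

```python
# A minimal bit-level model of Baugh-Wooley signed multiplication: partial
# products are ANDs of operand bits, the ones involving exactly one sign bit
# are inverted, and two constant correction bits are added. N = 8 is an
# assumed width; the result is the 2N-bit two's-complement product.

N = 8

def bit(x, i):
    return (x >> i) & 1

def baugh_wooley(a, b):
    """a, b: N-bit two's-complement values given as unsigned bit patterns."""
    p = 0
    for i in range(N - 1):
        for j in range(N - 1):
            p += (bit(a, i) & bit(b, j)) << (i + j)            # plain partial products
    for j in range(N - 1):
        p += (1 ^ (bit(a, N - 1) & bit(b, j))) << (j + N - 1)  # inverted (sign row)
    for i in range(N - 1):
        p += (1 ^ (bit(a, i) & bit(b, N - 1))) << (i + N - 1)  # inverted (sign column)
    p += (bit(a, N - 1) & bit(b, N - 1)) << (2 * N - 2)        # sign x sign product
    p += (1 << N) + (1 << (2 * N - 1))                         # correction constants
    return p & ((1 << (2 * N)) - 1)                            # keep 2N bits

def to_signed(x, bits):
    return x - (1 << bits) if x & (1 << (bits - 1)) else x

# Exhaustive check against ordinary signed multiplication for all 8-bit pairs
# (runs in a few seconds).
for a in range(1 << N):
    for b in range(1 << N):
        expect = (to_signed(a, N) * to_signed(b, N)) & ((1 << (2 * N)) - 1)
        assert baugh_wooley(a, b) == expect
```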