28,497 research outputs found
An Interpolative Analytical Cache Model with Application to Performance-Power Design Space Exploration
Caches are known to consume up to half of all system power in embedded processors. Co-optimizing performance and power of the cache subsystems is therefore an important step in the design of embedded systems, especially those employing application specific instruction processors. In this project, we propose an analytical cache model that succinctly captures the miss performance of an application over the entire cache parameter space. Unlike exhaustive trace driven simulation, our model requires that the program be simulated once so that a few key characteristics can be obtained. Using these application-dependent characteristics, the model can span the entire cache parameter space consisting of cache sizes, associativity and cache block sizes. In our unified model, we are able to cater for direct-mapped, set and fully associative instruction, data and unified caches. Validation against full trace-driven simulations shows that our model has a high degree of fidelity. Finally, we show how the model can be coupled with a power model for caches such that one can very quickly decide on pareto-optimal performance-power design points for rapid design space exploration.Singapore-MIT Alliance (SMA
A Cache Management Strategy to Replace Wear Leveling Techniques for Embedded Flash Memory
Prices of NAND flash memories are falling drastically due to market growth
and fabrication process mastering while research efforts from a technological
point of view in terms of endurance and density are very active. NAND flash
memories are becoming the most important storage media in mobile computing and
tend to be less confined to this area. The major constraint of such a
technology is the limited number of possible erase operations per block which
tend to quickly provoke memory wear out. To cope with this issue,
state-of-the-art solutions implement wear leveling policies to level the wear
out of the memory and so increase its lifetime. These policies are integrated
into the Flash Translation Layer (FTL) and greatly contribute in decreasing the
write performance. In this paper, we propose to reduce the flash memory wear
out problem and improve its performance by absorbing the erase operations
throughout a dual cache system replacing FTL wear leveling and garbage
collection services. We justify this idea by proposing a first performance
evaluation of an exclusively cache based system for embedded flash memories.
Unlike wear leveling schemes, the proposed cache solution reduces the total
number of erase operations reported on the media by absorbing them in the cache
for workloads expressing a minimal global sequential rate.Comment: Ce papier a obtenu le "Best Paper Award" dans le "Computer System
track" nombre de page: 8; International Symposium on Performance Evaluation
of Computer & Telecommunication Systems, La Haye : Netherlands (2011
LUT saving in embedded FPGAs for cache locking in real-time systems
[EN] In recent years, cache locking have appeared as a
solution to ease the schedulability analysis of real-time systems
using cache memories maintaining, at the same time, similar
performance improvements than regular cache memories. New
devices for the embedded market couple a processor and a
programmable logic device designed to enhance system flexibility
and increase the possibilities of customisation in the field. This
arrangement may help to improve the use of cache locking in
real-time systems. This work proposes the use of this embedded
programmable logic device to implement a logic function that
provides the cache controller the information it needs in order to
determine if a referenced main memory block has to be loaded
and locked into the cache; we have called this circuit a Locking
State Generator. Experiments show the requirements in terms of
number of hardware resources and a way to reduce them and
the circuit complexity. This reduction ranges from 50% up to
80% of the number of hardware resources originally needed to
build the Locking State Generator circuit.This work has been partially supported by PAID-06-11/2055 of Universitat Politecnica de València and TIN2011-28435- C03-01 of Ministerio de Ciencia e Innovacion.Martà Campoy, A.; RodrÃguez Ballester, F.; Ors Carot, R. (2013). LUT saving in embedded FPGAs for cache locking in real-time systems. International Journal On Advances in Systems and Measurements. 6(1-2):190-199. http://hdl.handle.net/10251/46758S19019961-
Revisiting LP-NUCA Energy Consumption: Cache Access Policies and Adaptive Block Dropping
Cache working-set adaptation is key as embedded systems move to multiprocessor and Simultaneous Multithreaded Architectures (SMT) because interthread pollution harms system performance and battery life. Light-Power NUCA (LP-NUCA) is a working-set adaptive cache that depends on temporal-locality to save energy. This work identifies the sources of energy waste in LP-NUCAs: parallel access to the tag and data arrays of the tiles and low locality phases with useless block migration. To counteract both issues, we prove that switching to serial access reduces energy without harming performance and propose a machine learning Adaptive Drop Rate (ADR) controller that minimizes the amount of replacement and migration when locality is low.
This work demonstrates that these techniques efficiently adapt the cache drop and access policies to save energy. They reduce LP-NUCA consumption 22.7% for 1SMT. With interthread cache contention in 2SMT, the savings rise to 29%. Versus a conventional organization, energy--delay improves 20.8% and 25% for 1- and 2SMT benchmarks, and, in 65% of the 2SMT mixes, gains are larger than 20%
Data Cache-Energy and Throughput Models: Design Exploration for Embedded Processors
Most modern 16-bit and 32-bit embedded processors contain cache memories to further increase instruction throughput of the device. Embedded processors that contain cache memories open an opportunity for the low-power research community to model the impact of cache energy consumption and throughput gains. For optimal cache memory configuration mathematical models have been proposed in the past. Most of these models are complex enough to be adapted for modern applications like run-time cache reconfiguration. This paper improves and validates previously proposed energy and throughput models for a data cache, which could be used for overhead analysis for various cache types with relatively small amount of inputs. These models analyze the energy and throughput of a data cache on an application basis, thus providing the hardware and software designer with the feedback vital to tune the cache or application for a given energy budget. The models are suitable for use at design time in the cache optimization process for embedded processors considering time and energy overhead or could be employed at runtime for reconfigurable architectures
Linux kernel compaction through cold code swapping
There is a growing trend to use general-purpose operating systems like Linux in embedded systems. Previous research focused on using compaction and specialization techniques to adapt a general-purpose OS to the memory-constrained environment, presented by most, embedded systems. However, there is still room for improvement: it has been shown that even after application of the aforementioned techniques more than 50% of the kernel code remains unexecuted under normal system operation. We introduce a new technique that reduces the Linux kernel code memory footprint, through on-demand code loading of infrequently executed code, for systems that support virtual memory. In this paper, we describe our general approach, and we study code placement algorithms to minimize the performance impact of the code loading. A code, size reduction of 68% is achieved, with a 2.2% execution speedup of the system-mode execution time, for a case study based on the MediaBench II benchmark suite
Transparent code authentication at the processor level
The authors present a lightweight authentication mechanism that verifies the authenticity of code and thereby addresses the virus and malicious code problems at the hardware level eliminating the need for trusted extensions in the operating system. The technique proposed tightly integrates the authentication mechanism into the processor core. The authentication latency is hidden behind the memory access latency, thereby allowing seamless on-the-fly authentication of instructions. In addition, the proposed authentication method supports seamless encryption of code (and static data). Consequently, while providing the software users with assurance for authenticity of programs executing on their hardware, the proposed technique also protects the software manufacturers’ intellectual property through encryption. The performance analysis shows that, under mild assumptions, the presented technique introduces negligible overhead for even moderate cache sizes
- …