Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors
This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in cache set usage. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at a group granularity (each group comprising several cache sets) and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed in the cache group that exhibits the minimum pressure. CE provides quality of service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5%, and by as much as 28.5%, for the benchmark programs we examined. Furthermore, our evaluation shows that CE also outperforms related CMP cache designs.
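The abstract does not give implementation details; the following minimal sketch, with illustrative names and structure of our own, shows the placement idea it describes: per-group pressure counters sampled over time, with an incoming block steered to the group under minimum pressure.

```python
# Illustrative sketch of pressure-aware placement (not the paper's implementation).
# The LLC's sets are partitioned into groups; a counter per group tracks recent
# pressure. An incoming block is mapped to the group with the lowest pressure,
# decoupling its physical location from its address.

class PressureAwarePlacer:
    def __init__(self, num_groups):
        self.pressure = [0] * num_groups   # per-group pressure counters
        self.location = {}                 # block address -> group (placement table)

    def record_access(self, group):
        self.pressure[group] += 1          # temporal pressure sampled per group

    def place(self, block_addr):
        # Place the incoming block in the group currently under minimum pressure.
        target = min(range(len(self.pressure)), key=self.pressure.__getitem__)
        self.location[block_addr] = target
        return target

    def epoch_end(self):
        # Periodically age the counters so placement tracks *temporal* pressure.
        self.pressure = [p // 2 for p in self.pressure]
```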
Instruction prefetching techniques for ultra low-power multicore architectures
As the gap between processor and memory speeds widens, memory latency has become a critical bottleneck for computing performance, and designers have developed techniques to hide it. Embedded processor design, on the other hand, typically targets low cost and low power consumption, so only techniques that satisfy these constraints are attractive in embedded domains. While out-of-order execution, aggressive speculation, and complex branch prediction can help hide memory access latency in high-performance systems, they carry a heavy power budget and are not suitable for embedded systems. Prefetching is another popular method for hiding memory access latency and has been studied extensively for high-performance processors. For embedded processors with strict power requirements, however, the application of complex prefetching techniques is greatly limited, so a low-power, low-energy solution is what is needed in this context.
In this work, we focus on instruction prefetching for ultra-low-power processing architectures and aim to reduce the energy overhead of this operation by proposing a combination of simple, low-cost, and energy-efficient prefetching techniques. We study a wide range of applications, from cryptography to computer vision, and show that our proposed mechanisms can effectively raise the hit rate of almost all of them above 95%, achieving an average performance improvement of more than 2X. Moreover, by synthesizing our designs using state-of-the-art technologies, we show that the prefetchers increase the system's power consumption by less than 15% and total silicon area by less than 1%. Altogether, the proposed schemes achieve a total energy reduction of 1.9X, enabling significantly longer battery life.
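The individual techniques are not named in the abstract. As a hedged illustration, the sketch below implements next-line prefetching, a classic low-cost instruction prefetcher of the kind such a combination could plausibly build on; all names and parameters are ours.

```python
# Illustrative next-line instruction prefetcher: on a demand access to cache
# line N, fetch line N+1 into a small prefetch buffer. This is one of the
# simplest low-cost techniques a low-power design might combine with others.

LINE_SIZE = 32  # bytes per instruction-cache line (illustrative)

class NextLinePrefetcher:
    def __init__(self, buffer_lines=4):
        self.buffer = []                   # tiny FIFO of prefetched line addresses
        self.buffer_lines = buffer_lines

    def on_fetch(self, pc):
        line = pc // LINE_SIZE
        hit = line in self.buffer          # did an earlier prefetch cover this fetch?
        nxt = line + 1                     # prefetch the sequential successor
        if nxt not in self.buffer:
            self.buffer.append(nxt)
            if len(self.buffer) > self.buffer_lines:
                self.buffer.pop(0)         # FIFO replacement keeps the hardware cheap
        return hit
```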
Application-Specific Memory Subsystems
The disparity in performance between processors and main memories has led computer architects to incorporate large cache hierarchies in modern computers. These cache hierarchies are designed to be general-purpose in that they strive to provide the best possible performance across a wide range of applications. However, such a memory subsystem does not necessarily provide the best possible performance for a particular application.
Although general-purpose memory subsystems are desirable when the workload is unknown and the memory subsystem must remain fixed, when this is not the case a custom memory subsystem may be beneficial. For example, in an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) designed to run a particular application, a custom memory subsystem optimized for that application would be desirable. In addition, when there are tunable parameters in the memory subsystem, it may make sense to change these parameters depending on the application being run. Such a situation arises today with FPGAs and, to a lesser extent, GPUs, and it is plausible that general-purpose computers will begin to support greater flexibility in the memory subsystem in the future.
In this dissertation, we first show that it is possible to create application-specific memory subsystems that provide much better performance than a general-purpose memory subsystem. In addition, we show a way to discover such memory subsystems automatically using a superoptimization technique on memory address traces gathered from applications. This allows one to generate a custom memory subsystem with little effort.
We next show that our memory subsystem superoptimization technique can be used to optimize for objectives other than performance. As an example, we show that it is possible to reduce the number of writes to the main memory, which can be useful for main memories with limited write durability, such as flash or Phase-Change Memory (PCM).
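A minimal sketch of the evaluation step at the heart of such trace-driven superoptimization: replay a recorded address trace through a candidate subsystem model and score it by the chosen objective, whether misses or writes reaching main memory. The single-level write-back cache and all names below are simplifying assumptions, not the dissertation's actual models.

```python
# Illustrative inner loop of trace-driven memory-subsystem search: simulate a
# candidate hierarchy on a recorded address trace and score it by an objective.
# A lone direct-mapped write-back level stands in for the real design space.

def run_trace(trace, num_lines, line_size=64):
    """Return (misses, main_memory_writes) for a direct-mapped write-back cache."""
    tags = [None] * num_lines
    dirty = [False] * num_lines
    misses = writes_to_main = 0
    for op, addr in trace:                     # op is 'R' or 'W'
        line = addr // line_size
        idx, tag = line % num_lines, line // num_lines
        if tags[idx] != tag:                   # miss: evict, write back if dirty
            misses += 1
            if dirty[idx]:
                writes_to_main += 1
            tags[idx], dirty[idx] = tag, False
        if op == 'W':
            dirty[idx] = True                  # write-back: defer the main-memory write
    return misses, writes_to_main

# Score every candidate size and keep the best under the chosen objective,
# e.g. minimizing main-memory writes for flash/PCM endurance rather than misses.
def best_candidate(trace, sizes, objective=lambda misses, writes: writes):
    return min(sizes, key=lambda n: objective(*run_trace(trace, n)))
```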
Finally, we show how to superoptimize memory subsystems for streaming applications, which are a class of parallel applications. In particular, we show that, through the use of ScalaPipe, we can author and deploy streaming applications targeting FPGAs with superoptimized memory subsystems. ScalaPipe is a domain-specific language (DSL) embedded in the Scala programming language for generating streaming applications that can be implemented on CPUs and FPGAs. Using the ScalaPipe implementation, we are able to demonstrate actual performance improvements using the superoptimized memory subsystem with applications implemented in hardware.
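ScalaPipe's API is not shown in the abstract, so the sketch below deliberately avoids it; it only illustrates, in generic Python, the streaming model such applications follow: independent kernels connected by bounded queues, each consuming and producing items as they arrive.

```python
# Generic illustration of the streaming model such DSLs target: independent
# kernels connected by bounded queues. This is NOT ScalaPipe's API, only the
# dataflow idea that a superoptimized memory subsystem would sit underneath.

from queue import Queue
from threading import Thread

def kernel(fn, inq, outq):
    # Each kernel repeatedly consumes one item, applies fn, and emits the result.
    while True:
        item = inq.get()
        if item is None:                  # poison pill terminates the stage
            if outq:
                outq.put(None)
            break
        if outq:
            outq.put(fn(item))
        else:
            print(fn(item))               # final stage acts as the sink

# Two-stage pipeline: square each value, then add one.
q1, q2 = Queue(maxsize=8), Queue(maxsize=8)
stages = [Thread(target=kernel, args=(lambda x: x * x, q1, q2)),
          Thread(target=kernel, args=(lambda x: x + 1, q2, None))]
for t in stages:
    t.start()
for v in range(5):
    q1.put(v)
q1.put(None)
for t in stages:
    t.join()
```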
WCET analysis of multi-level set-associative instruction caches
With the advent of increasingly complex hardware in real-time embedded systems (processors with performance-enhancing features such as pipelines, cache hierarchies, and multiple cores), many processors now have a set-associative L2 cache. Thus, there is a need to consider cache hierarchies when validating the temporal behavior of real-time systems, in particular when estimating tasks' worst-case execution times (WCETs). To the best of our knowledge, there is only one prior approach to WCET estimation for systems with cache hierarchies [Mueller, 1997], which turns out to be unsafe for set-associative caches. In this paper, we highlight the conditions under which the approach described in [Mueller, 1997] is unsafe. A safe static instruction cache analysis method is then presented. Contrary to [Mueller, 1997], our method supports set-associative and fully associative caches. The proposed method is evaluated on medium-size and large programs. We show that the method is tight in most cases. We further show that in all cases WCET estimates are much tighter when considering the cache hierarchy than when considering only the L1 cache. An evaluation of the analysis time is conducted, demonstrating that analyzing the cache hierarchy has a reasonable computation time.
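Static instruction cache analyses of this kind typically rest on abstract cache states. As background (not necessarily this paper's exact formulation), the sketch below shows the standard must-analysis join for set-associative LRU caches: at a control-flow merge, a block stays classified as always-cached only if it is cached on both incoming paths, at its older age.

```python
# Illustrative LRU must-analysis join (Ferdinand-style), a standard building
# block of static instruction-cache analyses; not necessarily this paper's
# exact formulation. A must-state maps each cached block to its maximum age
# (0 = most recently used). At a control-flow join a block survives only if
# it is present on BOTH paths, taking the OLDER (maximum) of its two ages.

def must_join(state_a, state_b, associativity):
    joined = {}
    for block in state_a.keys() & state_b.keys():   # intersection of both paths
        age = max(state_a[block], state_b[block])
        if age < associativity:                     # still guaranteed resident
            joined[block] = age
    return joined

# Example: block 'x' has age 0 on one path and age 2 on the other; after the
# join it is still guaranteed cached, but only with the pessimistic age 2.
print(must_join({'x': 0, 'y': 1}, {'x': 2, 'z': 0}, associativity=4))  # {'x': 2}
```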
Cache-Aware Instruction SPM Allocation for Hard Real-Time Systems
To improve the execution time of a program, parts of its instructions can be allocated to a fast scratchpad memory (SPM) at compile time. This is a well-known technique which can be used to minimize the program's worst-case execution time (WCET). However, modern embedded systems often use cached main memories. An SPM allocation will inevitably lead to changes in the program's memory layout in main memory, resulting in either improved or degraded worst-case caching behavior. We tackle this issue by proposing a cache-aware SPM allocation algorithm based on integer linear programming which accounts for changes in the worst-case cache miss behavior.
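As a simplified stand-in for the allocation problem, the sketch below solves the underlying capacity trade-off as a 0/1 knapsack: choose the code blocks whose move to SPM maximizes WCET reduction within the SPM size. The paper's actual formulation is an ILP that additionally models the cache-layout changes an allocation induces; that coupling is deliberately omitted here.

```python
# Simplified stand-in for cache-aware SPM allocation. The real formulation is
# an integer linear program that also models how moving code shifts the main-
# memory layout and thus worst-case cache behavior; this 0/1 knapsack only
# captures the base trade-off: WCET gain per block vs. limited SPM capacity.

def allocate_spm(blocks, capacity):
    """blocks: list of (size_bytes, wcet_gain); returns the max total WCET gain."""
    best = [0] * (capacity + 1)
    for size, gain in blocks:
        # Iterate capacities downward so each block is taken at most once.
        for cap in range(capacity, size - 1, -1):
            best[cap] = max(best[cap], best[cap - size] + gain)
    return best[capacity]

# Example: three candidate code blocks competing for a 1 KiB scratchpad.
print(allocate_spm([(400, 120), (700, 150), (300, 80)], capacity=1024))  # 230
```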
Improving Instruction Fetch Rate with Code Pattern Cache for Superscalar Architecture
In the past, instruction fetch speeds have been improved by using cache schemes that capture the actual program flow. In this proposal, we present the architecture of a new instruction cache named the code pattern cache (CPC), used with superscalar processors. CPC's operation is based on two fundamental principles: common programs tend to repeat their execution patterns, and efficient storage of a program flow can enhance the performance of an instruction fetch mechanism. CPC saves basic blocks (sets of instructions separated by control instructions) and their boundary addresses while the code is running. Basic blocks and their addresses are stored in two separate structures, called the basic block cache (BBC) and the block pointer cache (BPC), respectively. Later, if the same basic block sequence is expected to execute, it is fetched from the CPC instead of the instruction cache; this mechanism results in a higher likelihood of delivering a larger number of instructions in every clock cycle. We developed single- and multi-threaded simulators for the trace cache (TC), block cache (BC), and CPC, and used them with 10 SPECint2000 benchmarks. The simulation results demonstrated CPC's advantage over TC and BC in terms of trace miss rate and average trace length. Additionally, we used cache models to quantify the timing, area, and power of the three cache schemes. Using an aggregate performance index that combined the simulation and modeling results, CPC was shown to perform better than both TC and BC. During our research, each of the TC, BC, and CPC configurations took 4-6 hours to simulate, so performance comparison of these caches proved to be a very time-consuming process. Neural network models (NNMs) can be time-efficient alternatives to simulation, so we studied their feasibility for representing cache behavior. We developed two NNMs, one to predict the trace miss rate and the other to predict the average trace length for the three caches. The NNMs modeled the caches with reasonable accuracy and produced results in a fraction of a second.
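A minimal sketch of the two-structure organization the abstract describes, with all details beyond the BPC/BBC split assumed for illustration: the BPC maps a block's start address to its boundary metadata, while the BBC holds the block's instructions, so a repeated pattern can be delivered a whole block at a time.

```python
# Illustrative sketch of the CPC's two-structure split: the block pointer
# cache (BPC) maps a basic block's start address to boundary metadata, and
# the basic block cache (BBC) stores the instructions themselves. On a
# repeated pattern, whole blocks are delivered from the CPC instead of the
# instruction cache, raising the instructions fetched per cycle.

class CodePatternCache:
    def __init__(self):
        self.bpc = {}   # start address -> (end address, BBC index)
        self.bbc = []   # list of instruction sequences (one basic block each)

    def record(self, start, end, instructions):
        # Called as a basic block retires: remember its boundaries and body.
        if start not in self.bpc:
            self.bpc[start] = (end, len(self.bbc))
            self.bbc.append(list(instructions))

    def fetch(self, start):
        # On a hit, deliver the whole basic block in one go (wider fetch).
        entry = self.bpc.get(start)
        return None if entry is None else self.bbc[entry[1]]
```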