Software trace cache
We explore the use of compiler optimizations that optimize the layout of instructions in memory. The goal is to enable the code to make better use of the underlying hardware resources, regardless of the specific details of the processor or architecture, in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations. We target not only an improvement in the instruction cache hit rate, but also an increase in the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains, trying to make sequentially executed basic blocks reside in consecutive memory positions, then maps the basic block chains in memory to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and of code layout optimizations in general, on the three main aspects of fetch performance: the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout-optimized codes have some special characteristics that make them more amenable to high-performance instruction fetch. They have a very high rate of not-taken branches and execute long chains of sequential instructions; they also make very effective use of instruction cache lines, mapping only useful instructions that will execute close in time, increasing both spatial and temporal locality.
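The chain-building step described above can be illustrated with a simplified greedy sketch. It is not the published STC algorithm: the function name `build_chains`, the edge-profile format, and the seed ordering are assumptions made for illustration, and the later step of mapping chains to memory to avoid conflict misses is omitted.

```python
def build_chains(edges, entry):
    """Greedily group basic blocks into chains along the hottest edges.

    edges: dict mapping (src, dst) -> profiled execution count (assumed format).
    entry: name of the entry basic block.
    Returns a list of chains; each chain is a list of block names that
    would be laid out in consecutive memory positions.
    """
    # Successors of each block, most frequently followed edge first.
    succs = {}
    for (src, dst), count in edges.items():
        succs.setdefault(src, []).append((count, dst))
    for lst in succs.values():
        lst.sort(key=lambda t: (-t[0], t[1]))

    # Seed order: the entry block first, then blocks by incoming weight.
    blocks = {entry}
    for src, dst in edges:
        blocks.update((src, dst))
    weight = {b: 0 for b in blocks}
    for (_, dst), count in edges.items():
        weight[dst] += count
    seeds = [entry] + sorted(
        (b for b in blocks if b != entry), key=lambda b: (-weight[b], b)
    )

    placed = set()
    chains = []
    for seed in seeds:
        if seed in placed:
            continue
        chain = []
        block = seed
        while block is not None:
            chain.append(block)
            placed.add(block)
            # Follow the hottest edge to a block that is not yet placed.
            block = next(
                (dst for _, dst in succs.get(block, []) if dst not in placed),
                None,
            )
        chains.append(chain)
    return chains
```

With a made-up profile where `A` branches to `B` 90 times and to `C` 10 times, and both reach `D`, the hot path `A`, `B`, `D` ends up in one chain, so the branch in `A` falls through to `B` in the common case, and `C` is placed in a separate chain.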
Linux kernel compaction through cold code swapping
There is a growing trend to use general-purpose operating systems like Linux in embedded systems. Previous research focused on using compaction and specialization techniques to adapt a general-purpose OS to the memory-constrained environment presented by most embedded systems. However, there is still room for improvement: it has been shown that, even after application of the aforementioned techniques, more than 50% of the kernel code remains unexecuted under normal system operation. We introduce a new technique that reduces the Linux kernel code memory footprint through on-demand loading of infrequently executed code, for systems that support virtual memory. In this paper, we describe our general approach, and we study code placement algorithms to minimize the performance impact of the code loading. A code size reduction of 68% is achieved, with a 2.2% execution speedup of the system-mode execution time, for a case study based on the MediaBench II benchmark suite.
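The idea of keeping only frequently executed code resident can be sketched as a profile-driven hot/cold split. This is a minimal illustration under stated assumptions, not the paper's technique: the function name `split_hot_cold`, the per-function granularity, the profile format, and the threshold are all hypothetical, and the on-demand loading mechanism itself is not modeled.

```python
def split_hot_cold(functions, threshold=0):
    """Partition code into resident (hot) and swappable (cold) sets.

    functions: dict mapping name -> (size_bytes, exec_count), an assumed
               profile format.
    threshold: functions executed more than this many times stay resident.
    Returns (hot, cold, reduction), where reduction is the fraction of
    code bytes that no longer needs to be resident.
    """
    hot = {name for name, (_, count) in functions.items() if count > threshold}
    cold = set(functions) - hot
    total = sum(size for size, _ in functions.values())
    resident = sum(functions[name][0] for name in hot)
    reduction = 1 - resident / total if total else 0.0
    return hot, cold, reduction
```

For example, with two hypothetical functions, `func_a` (400 bytes, executed 1000 times) and `func_b` (600 bytes, never executed), `func_b` is classified as cold and the resident code shrinks by 60%.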
Measurement-Based Timing Analysis of the AURIX Caches
Cache memories are among the hardware resources with the highest potential to reduce worst-case execution time (WCET) costs for software programs with tight real-time constraints. Yet the complexity of cache analysis has caused a large fraction of the real-time systems industry to avoid using them, especially in the automotive sector. For measurement-based timing analysis (MBTA) - the dominant technique in domains such as automotive - caches challenge the definition of test scenarios stressful enough to produce (cache) layouts that cause high contention.
In this paper, we present our experience in enabling the use of caches for a real automotive application running on an AURIX multiprocessor, using software randomization and measurement-based probabilistic timing analysis (MBPTA). Our results show that software randomization successfully exposes - in the experiments performed for timing analysis - cache-related variability, in a manner that can be effectively captured by MBPTA.
Statement of Work for Studies in BlueGene/L Scalability and Reconfigurability
As referenced in the subcontract, the work included three major goals: (1) study the performance of an ASCI application, (2) study tradeoffs in using the second CPU in coprocessor mode to optimize use of the L3 scratchpad memory for performing vector-like gather/scatter and streamlining operations, and (3) perform simulator studies of hardware phase detection and identification. We made some modifications to the work contract: work involving the integration of a cache-conscious data placement algorithm to improve cache utilization on BlueGene/L has been added, and work involving the L3 scratchpad memory has been eliminated. This was explained in the previous milestones. In this milestone, we continue to focus on the last goal by modifying a cycle-accurate simulator, sim-alpha [4]. As a premise to hardware phase detection and identification, we need an infrastructure for testing various cache-conscious data placement methods. For this milestone, we discuss the completed framework that handles cache-conscious placement optimizations, which includes profiling data accesses and handling remapped addresses. We also introduce an algorithm (the ccdp profiling tool) that we implemented for assigning remapped addresses for a given code. Our performance results show that, by using our ccdp profiling tool, we achieve reduced miss rates and improved overall simulation performance. For our test cases, we use four applications from the SPEC CPU 2000 suite [2]. In our past milestones, we studied research on implementing cache-conscious data placement techniques. By becoming more familiar with previous research, we can make better decisions when designing our cache-conscious profiling tool. It is important to have a firm understanding of the existing techniques that have proven efficient at improving memory performance, since our tool will produce trace files as input to our enhanced simulator framework.
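The effect of remapped addresses on miss rates can be illustrated with a toy direct-mapped cache model. This is only a sketch under stated assumptions: it is unrelated to sim-alpha or the ccdp tool's actual implementation, the function name `miss_rate`, the cache geometry, and the per-address remap table are hypothetical, and real placement tools remap whole data objects rather than single addresses.

```python
def miss_rate(trace, remap=None, line_size=32, num_lines=4):
    """Miss rate of a direct-mapped cache over a trace of byte addresses.

    trace:     list of integer byte addresses (assumed trace format).
    remap:     optional dict old_address -> new_address, applied before
               each lookup to model a data placement decision.
    line_size: cache line size in bytes (illustrative default).
    num_lines: number of cache lines (illustrative default).
    """
    remap = remap or {}
    tags = [None] * num_lines  # line number currently held in each slot
    misses = 0
    for addr in trace:
        addr = remap.get(addr, addr)
        line = addr // line_size
        index = line % num_lines
        if tags[index] != line:
            misses += 1
            tags[index] = line
    return misses / len(trace) if trace else 0.0
```

In this toy geometry, addresses 0 and 128 map to the same cache slot, so alternating between them misses on every access; remapping 128 to 32 moves it to a different slot and leaves only the two cold misses.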