24,564 research outputs found
Software trace cache
We explore the use of compiler optimizations, which optimize the layout of instructions in memory. The target is to enable the code to make better use of the underlying hardware resources regardless of the specific details of the processor/architecture in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations. We target not only an improvement in the instruction cache hit rate, but also an increase in the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains trying to make sequentially executed basic blocks reside in consecutive memory positions, then maps the basic block chains in memory to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and code layout optimizations in general, on the three main aspects of fetch performance; the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout optimized, codes have some special characteristics that make them more amenable for high-performance instruction fetch. They have a very high rate of not-taken branches and execute long chains of sequential instructions; also, they make very effective use of instruction cache lines, mapping only useful instructions which will execute close in time, increasing both spatial and temporal locality.Peer ReviewedPostprint (published version
Adaptive TTL-Based Caching for Content Delivery
Content Delivery Networks (CDNs) deliver a majority of the user-requested
content on the Internet, including web pages, videos, and software downloads. A
CDN server caches and serves the content requested by users. Designing caching
algorithms that automatically adapt to the heterogeneity, burstiness, and
non-stationary nature of real-world content requests is a major challenge and
is the focus of our work. While there is much work on caching algorithms for
stationary request traffic, the work on non-stationary request traffic is very
limited. Consequently, most prior models are inaccurate for production CDN
traffic that is non-stationary.
We propose two TTL-based caching algorithms and provide provable guarantees
for content request traffic that is bursty and non-stationary. The first
algorithm called d-TTL dynamically adapts a TTL parameter using a stochastic
approximation approach. Given a feasible target hit rate, we show that the hit
rate of d-TTL converges to its target value for a general class of bursty
traffic that allows Markov dependence over time and non-stationary arrivals.
The second algorithm called f-TTL uses two caches, each with its own TTL. The
first-level cache adaptively filters out non-stationary traffic, while the
second-level cache stores frequently-accessed stationary traffic. Given
feasible targets for both the hit rate and the expected cache size, f-TTL
asymptotically achieves both targets. We implement d-TTL and f-TTL and evaluate
both algorithms using an extensive nine-day trace consisting of 500 million
requests from a production CDN server. We show that both d-TTL and f-TTL
converge to their hit rate targets with an error of about 1.3%. But, f-TTL
requires a significantly smaller cache size than d-TTL to achieve the same hit
rate, since it effectively filters out the non-stationary traffic for
rarely-accessed objects
DIA: A complexity-effective decoding architecture
Fast instruction decoding is a true challenge for the design of CISC microprocessors implementing variable-length instructions. A well-known solution to overcome this problem is caching decoded instructions in a hardware buffer. Fetching already decoded instructions avoids the need for decoding them again, improving processor performance. However, introducing such special--purpose storage in the processor design involves an important increase in the fetch architecture complexity. In this paper, we propose a novel decoding architecture that reduces the fetch engine implementation cost. Instead of using a special-purpose hardware buffer, our proposal stores frequently decoded instructions in the memory hierarchy. The address where the decoded instructions are stored is kept in the branch prediction mechanism, enabling it to guide our decoding architecture. This makes it possible for the processor front end to fetch already decoded instructions from the memory instead of the original nondecoded instructions. Our results show that using our decoding architecture, a state-of-the-art superscalar processor achieves competitive performance improvements, while requiring less chip area and energy consumption in the fetch architecture than a hardware code caching mechanism.Peer ReviewedPostprint (published version
DR.SGX: Hardening SGX Enclaves against Cache Attacks with Data Location Randomization
Recent research has demonstrated that Intel's SGX is vulnerable to various
software-based side-channel attacks. In particular, attacks that monitor CPU
caches shared between the victim enclave and untrusted software enable accurate
leakage of secret enclave data. Known defenses assume developer assistance,
require hardware changes, impose high overhead, or prevent only some of the
known attacks. In this paper we propose data location randomization as a novel
defensive approach to address the threat of side-channel attacks. Our main goal
is to break the link between the cache observations by the privileged adversary
and the actual data accesses by the victim. We design and implement a
compiler-based tool called DR.SGX that instruments enclave code such that data
locations are permuted at the granularity of cache lines. We realize the
permutation with the CPU's cryptographic hardware-acceleration units providing
secure randomization. To prevent correlation of repeated memory accesses we
continuously re-randomize all enclave data during execution. Our solution
effectively protects many (but not all) enclaves from cache attacks and
provides a complementary enclave hardening technique that is especially useful
against unpredictable information leakage
Prioritized Garbage Collection: Explicit GC Support for Software Caches
Programmers routinely trade space for time to increase performance, often in
the form of caching or memoization. In managed languages like Java or
JavaScript, however, this space-time tradeoff is complex. Using more space
translates into higher garbage collection costs, especially at the limit of
available memory. Existing runtime systems provide limited support for
space-sensitive algorithms, forcing programmers into difficult and often
brittle choices about provisioning.
This paper presents prioritized garbage collection, a cooperative programming
language and runtime solution to this problem. Prioritized GC provides an
interface similar to soft references, called priority references, which
identify objects that the collector can reclaim eagerly if necessary. The key
difference is an API for defining the policy that governs when priority
references are cleared and in what order. Application code specifies a priority
value for each reference and a target memory bound. The collector reclaims
references, lowest priority first, until the total memory footprint of the
cache fits within the bound. We use this API to implement a space-aware
least-recently-used (LRU) cache, called a Sache, that is a drop-in replacement
for existing caches, such as Google's Guava library. The garbage collector
automatically grows and shrinks the Sache in response to available memory and
workload with minimal provisioning information from the programmer. Using a
Sache, it is almost impossible for an application to experience a memory leak,
memory pressure, or an out-of-memory crash caused by software caching.Comment: to appear in OOPSLA 201
Instruction fetch architectures and code layout optimizations
The design of higher performance processors has been following two major trends: increasing the pipeline depth to allow faster clock rates, and widening the pipeline to allow parallel execution of more instructions. Designing a higher performance processor implies balancing all the pipeline stages to ensure that overall performance is not dominated by any of them. This means that a faster execution engine also requires a faster fetch engine, to ensure that it is possible to read and decode enough instructions to keep the pipeline full and the functional units busy. This paper explores the challenges faced by the instruction fetch stage for a variety of processor designs, from early pipelined processors, to the more aggressive wide issue superscalars. We describe the different fetch engines proposed in the literature, the performance issues involved, and some of the proposed improvements. We also show how compiler techniques that optimize the layout of the code in memory can be used to improve the fetch performance of the different engines described Overall, we show how instruction fetch has evolved from fetching one instruction every few cycles, to fetching one instruction per cycle, to fetching a full basic block per cycle, to several basic blocks per cycle: the evolution of the mechanism surrounding the instruction cache, and the different compiler optimizations used to better employ these mechanisms.Peer ReviewedPostprint (published version
- …