1,744 research outputs found
DR.SGX: Hardening SGX Enclaves against Cache Attacks with Data Location Randomization
Recent research has demonstrated that Intel's SGX is vulnerable to various
software-based side-channel attacks. In particular, attacks that monitor CPU
caches shared between the victim enclave and untrusted software enable accurate
leakage of secret enclave data. Known defenses assume developer assistance,
require hardware changes, impose high overhead, or prevent only some of the
known attacks. In this paper we propose data location randomization as a novel
defensive approach to address the threat of side-channel attacks. Our main goal
is to break the link between the cache observations by the privileged adversary
and the actual data accesses by the victim. We design and implement a
compiler-based tool called DR.SGX that instruments enclave code such that data
locations are permuted at the granularity of cache lines. We realize the
permutation with the CPU's cryptographic hardware-acceleration units providing
secure randomization. To prevent correlation of repeated memory accesses we
continuously re-randomize all enclave data during execution. Our solution
effectively protects many (but not all) enclaves from cache attacks and
provides a complementary enclave hardening technique that is especially useful
against unpredictable information leakage
Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques
Department of Computer Science and EngineeringAs the performance and energy efficiency requirement of GPGPUs have risen, memory management techniques of GPGPUs have improved to meet the requirements by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower latency and higher bandwidth of the memory. However, these methods do not always guarantee improved performance and energy efficiency due to the small cache size and heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management techniques.
In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management techniques. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes based on a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and the cache indexing latency. We also demonstrate that ACI continues to achieve high performance in various settings.
Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (i.e., 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (i.e., 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy efficiency gains over the baseline GPGPU architecture even when enhanced with advanced architectural technologies (e.g., higher capacity, associativity).
Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of a application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on the optimal page allocation ratio, BLPP dynamically allocate pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and state-of-the-art technique (i.e., 13.4% and 16.7%) and performs similar to the static-best version (i.e., 1.2% difference), which requires extensive offline profiling.clos
Software Grand Exposure: SGX Cache Attacks Are Practical
Side-channel information leakage is a known limitation of SGX. Researchers
have demonstrated that secret-dependent information can be extracted from
enclave execution through page-fault access patterns. Consequently, various
recent research efforts are actively seeking countermeasures to SGX
side-channel attacks. It is widely assumed that SGX may be vulnerable to other
side channels, such as cache access pattern monitoring, as well. However, prior
to our work, the practicality and the extent of such information leakage was
not studied.
In this paper we demonstrate that cache-based attacks are indeed a serious
threat to the confidentiality of SGX-protected programs. Our goal was to design
an attack that is hard to mitigate using known defenses, and therefore we mount
our attack without interrupting enclave execution. This approach has major
technical challenges, since the existing cache monitoring techniques experience
significant noise if the victim process is not interrupted. We designed and
implemented novel attack techniques to reduce this noise by leveraging the
capabilities of the privileged adversary. Our attacks are able to recover
confidential information from SGX enclaves, which we illustrate in two example
cases: extraction of an entire RSA-2048 key during RSA decryption, and
detection of specific human genome sequences during genomic indexing. We show
that our attacks are more effective than previous cache attacks and harder to
mitigate than previous SGX side-channel attacks
Reducing Cache Contention On GPUs
The usage of Graphics Processing Units (GPUs) as an application accelerator has become increasingly popular because, compared to traditional CPUs, they are more cost-effective, their highly parallel nature complements a CPU, and they are more energy efficient. With the popularity of GPUs, many GPU-based compute-intensive applications (a.k.a., GPGPUs) present significant performance improvement over traditional CPU-based implementations. Caches, which significantly improve CPU performance, are introduced to GPUs to further enhance application performance. However, the effect of caches is not significant for many cases in GPUs and even detrimental for some cases. The massive parallelism of the GPU execution model and the resulting memory accesses cause the GPU memory hierarchy to suffer from significant memory resource contention among threads. One cause of cache contention arises from column-strided memory access patterns that GPU applications commonly generate in many data-intensive applications. When such access patterns are mapped to hardware thread groups, they become memory-divergent instructions whose memory requests are not GPU hardware friendly, resulting in serialized access and performance degradation. Cache contention also arises from cache pollution caused by lines with low reuse. For the cache to be effective, a cached line must be reused before its eviction. Unfortunately, the streaming characteristic of GPGPU workloads and the massively parallel GPU execution model increase the reuse distance, or equivalently reduce reuse frequency of data. In a GPU, the pollution caused by a large reuse distance data is significant. Memory request stall is another contention factor. A stalled Load/Store (LDST) unit does not execute memory requests from any ready warps in the issue stage. This stall prevents the potential hit chances for the ready warps. This dissertation proposes three novel architectural modifications to reduce the contention: 1) contention-aware selective caching detects the memory-divergent instructions caused by the column-strided access patterns, calculates the contending cache sets and locality information and then selectively caches; 2) locality-aware selective caching dynamically calculates the reuse frequency with efficient hardware and caches based on the reuse frequency; and 3) memory request scheduling queues the memory requests from a warp issuing stage, frees the LDST unit stall and schedules items from the queue to the LDST unit by multiple probing of the cache. Through systematic experiments and comprehensive comparisons with existing state-of-the-art techniques, this dissertation demonstrates the effectiveness of our aforementioned techniques and the viability of reducing cache contention through architectural support. Finally, this dissertation suggests other promising opportunities for future research on GPU architecture
A Survey of Techniques for Architecting TLBs
“Translation lookaside buffer” (TLB) caches virtual to physical address translation information and is used
in systems ranging from embedded devices to high-end servers. Since TLB is accessed very frequently
and a TLB miss is extremely costly, prudent management of TLB is important for improving performance
and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and
managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and
distinctions. We believe that this paper will be useful for chip designers, computer architects and system
engineers
Multi-criteria optimization for energy-efficient multi-core systems-on-chip
The steady down-scaling of transistor dimensions has made possible the evolutionary progress leading to today’s high-performance multi-GHz microprocessors and core based System-on-Chip (SoC) that offer superior performance, dramatically reduced cost per function, and much-reduced physical size compared to their predecessors. On the negative side, this rapid scaling however also translates to high power densities, higher operating temperatures and reduced reliability making it imperative to address design issues that have cropped up in its wake. In particular, the aggressive physical miniaturization have increased CMOS fault sensitivity to the extent that many reliability constraints pose threat to the device normal operation and accelerate the onset of wearout-based failures. Among various wearout-based failure mechanisms, Negative biased temperature instability (NBTI) has been recognized as the most critical source of device aging.
The urge of reliable, low-power circuits is driving the EDA community to develop new design techniques, circuit solutions, algorithms, and software, that can address these critical issues. Unfortunately, this challenge is complicated by the fact that power and reliability are known to be intrinsically conflicting metrics: traditional solutions to improve reliability such as redundancy, increase of voltage levels, and up-sizing of critical devices do contrast with traditional low-power solutions, which rely on compact architectures, scaled supply voltages, and small devices.
This dissertation focuses on methodologies to bridge this gap and establishes an important link between low-power solutions and aging effects. More specifically, we proposed new architectural solutions based on power management strategies to enable the design of low-power, aging aware cache memories.
Cache memories are one of the most critical components for warranting reliable and timely operation. However, they are also more susceptible to aging effects. Due to symmetric structure of a memory cell, aging occurs regardless of the fact that a cell (or word) is accessed or not. Moreover, aging is a worst-case matric and line with worst-case access pattern determines the aging of the entire cache. In order to stop the aging of a memory cell, it must be put into a proper idle state when a cell (or word) is not accessed which require proper management of the idleness of each atomic unit of power management.
We have proposed several reliability management techniques based on the idea of cache partitioning to alleviate NBTI-induced aging and obtain joint energy and lifetime benefits. We introduce graceful degradation mechanism which allows different cache blocks into which a cache is partitioned to age at different rates. This implies that various sub-blocks become unreliable at different times, and the cache keeps functioning with reduced efficiency. We extended the capabilities of this architecture by integrating the concept of reconfigurable caches to maintain the performance of the cache throughout its lifetime. By this strategy, whenever a block becomes unreliable, the remaining cache is reconfigured to work as a smaller size cache with only a marginal degradation of performance.
Many mission-critical applications require guaranteed lifetime of their operations and therefore the hardware implementing their functionality. Such constraints are usually enforced by means of various reliability enhancing solutions mostly based on redundancy which are not energy-friendly. In our work, we have proposed a novel cache architecture in which a smart use of cache partitions for redundancy allows us to obtain cache that meet a desired lifetime target with minimal energy consumption
L2C: Combining Lossy and Lossless Compression on Memory and I/O
In this paper we introduce L2C, a hybrid lossy/lossless compression scheme applicable both to the memory subsystem and I/O traffic of a processor chip. L2C employs general-purpose lossless compression and combines it with state of the art lossy compression to achieve compression ratios up to 16:1 and improve the utilization of chip\u27s bandwidth resources. Compressing memory traffic yields lower memory access time, improving system performance and energy efficiency. Compressing I/O traffic offers several benefits for resource-constrained systems, including more efficient storage and networking.We evaluate L2C as a memory compressor in simulation with a set of approximation-tolerant applications. L2C improves baseline execution time by an average of 50\%, and total system energy consumption by 16%. Compared to the lossy and lossless current state of the art memory compression approaches, L2C improves execution time by 9% and 26% respectively, and reduces system energy costs by 3% and 5%, respectively.I/O compression efficacy is evaluated using a set of real-life datasets. L2C achieves compression ratios of up to 10.4:1 for a single dataset and on average about 4:1, while introducing no more than 0.4% error
Recommended from our members
Efficient fine-grained virtual memory
Virtual memory in modern computer systems provides a single abstraction of the memory hierarchy.
By hiding fragmentation and overlays of physical memory, virtual memory frees applications from managing physical memory and improves programmability.
However, virtual memory often introduces noticeable overhead.
State-of-the-art systems use a paged virtual memory that maps virtual addresses to physical addresses
in page granularity (typically 4 KiB ).This mapping is stored as a page table. Before accessing physically addressed memory, the page table is accessed
to translate virtual addresses to physical addresses. Research shows that the overhead of accessing the page table can even exceed the execution time for some important applications.
In addition, this fine-grained mapping changes the access patterns between virtual and physical address spaces, introducing difficulties to many architecture techniques, such as caches and prefecthers.
In this dissertation, I propose architecture mechanisms to reduce the overhead of accessing and managing fine-grained virtual memory without compromising existing benefits.
There are three main contributions in this dissertation.
First, I investigate the impact of address translation on cache. I examine the restriction of virtually indexed, physically tagged (VIPT) caches with fine-grained paging and conclude that this restriction may lead to sub-optimal cache designs.
I introduce a novel cache strategy, speculatively indexed, physically tagged (SIPT) to enable flexible cache indexing under fine-grained page mapping.
SIPT speculates on the value of a few more index bits (1 - 3 in our experiments) to access the cache speculatively before translation, and then verify that the physical tag matches after translation.
Utilizing the fact that a simple relation generally exists between virtual and physical addresses, because memory allocators often exhibit contiguity, I also propose low-cost mechanisms to predict and correct potential mis-speculations.
Next, I focus on reducing the overhead of address translation for fine-grained virtual memory. I propose a novel architecture mechanism, Embedded Page Translation Information (EMPTI),
to provide general fine-grained page translation information on top of coarse-grained virtual memory.
EMPTI does so by speculating that a virtual address is mapped to a pre-determined physical location and then verifying the translation with a very-low-cost access to metadata embedded with data.
Coarse-grained virtual memory mechanisms (e.g., segmentation) are used to suggest the pre-determined physical location for each virtual page.
Overall, EMPTI achieves the benefits of low overhead translation while keeping the flexibility and programmability of fine-grained paging.
Finally, I improve the efficiency of metadata caching based on the fact that memory mapping contiguity generally exists beyond a page boundary.
In state-of-the-art architectures, caches treat PTEs (page table entries) as regular data. Although this is simple and straightforward,
it fails to maximize the storage efficiency of metadata.
Each page in the contiguously mapped region costs a full 8-byte PTE. However, the delta between virtual addresses and physical addresses remain the same and most metadata are identical.
I propose a novel microarchitectural mechanism that expands the effective PTE storage in the last-level-cache (LLC) and reduces the number of page-walk accesses that miss the LLC.Electrical and Computer Engineerin
- …