3,657 research outputs found
HardScope: Thwarting DOP with Hardware-assisted Run-time Scope Enforcement
Widespread use of memory unsafe programming languages (e.g., C and C++)
leaves many systems vulnerable to memory corruption attacks. A variety of
defenses have been proposed to mitigate attacks that exploit memory errors to
hijack the control flow of the code at run-time, e.g., (fine-grained)
randomization or Control Flow Integrity. However, recent work on data-oriented
programming (DOP) demonstrated highly expressive (Turing-complete) attacks,
even in the presence of these state-of-the-art defenses. Although multiple
real-world DOP attacks have been demonstrated, no efficient defenses are yet
available. We propose run-time scope enforcement (RSE), a novel approach
designed to efficiently mitigate all currently known DOP attacks by enforcing
compile-time memory safety constraints (e.g., variable visibility rules) at
run-time. We present HardScope, a proof-of-concept implementation of
hardware-assisted RSE for the new RISC-V open instruction set architecture. We
discuss our systematic empirical evaluation of HardScope which demonstrates
that it can mitigate all currently known DOP attacks, and has a real-world
performance overhead of 3.2% in embedded benchmarks
Adaptive runtime-assisted block prefetching on chip-multiprocessors
Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well studied technique used to alleviate this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Among software controlled prefetching we find a wide variety of schemes, including runtime-directed prefetching and more specifically runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speed up of 32 and 10 % on average compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction of up to 18 and 3 % on average in energy-to-solution.Peer ReviewedPostprint (author's final draft
HALLS: An Energy-Efficient Highly Adaptable Last Level STT-RAM Cache for Multicore Systems
Spin-Transfer Torque RAM (STT-RAM) is widely considered a promising
alternative to SRAM in the memory hierarchy due to STT-RAM's non-volatility,
low leakage power, high density, and fast read speed. The STT-RAM's small
feature size is particularly desirable for the last-level cache (LLC), which
typically consumes a large area of silicon die. However, long write latency and
high write energy still remain challenges of implementing STT-RAMs in the CPU
cache. An increasingly popular method for addressing this challenge involves
trading off the non-volatility for reduced write speed and write energy by
relaxing the STT-RAM's data retention time. However, in order to maximize
energy saving potential, the cache configurations, including STT-RAM's
retention time, must be dynamically adapted to executing applications' variable
memory needs. In this paper, we propose a highly adaptable last level STT-RAM
cache (HALLS) that allows the LLC configurations and retention time to be
adapted to applications' runtime execution requirements. We also propose
low-overhead runtime tuning algorithms to dynamically determine the best
(lowest energy) cache configurations and retention times for executing
applications. Compared to prior work, HALLS reduced the average energy
consumption by 60.57% in a quad-core system, while introducing marginal latency
overhead.Comment: To Appear on IEEE Transactions on Computers (TC
Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors
This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large scale chip multiprocessors (CMPs). Our work is motivated by large asymmetry in cache sets usages. CE decouples the physical locations of cache blocks from their addresses for the sake of reducing misses caused by destructive interferences. Temporal pressure at the on-chip last-level cache, is continuously collected at a group (comprised of cache sets) granularity, and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at a cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% for the benchmark programs we examined. Furthermore, evaluations manifested the outperformance of CE versus related CMP cache designs
Improving Phase Change Memory Performance with Data Content Aware Access
A prominent characteristic of write operation in Phase-Change Memory (PCM) is
that its latency and energy are sensitive to the data to be written as well as
the content that is overwritten. We observe that overwriting unknown memory
content can incur significantly higher latency and energy compared to
overwriting known all-zeros or all-ones content. This is because all-zeros or
all-ones content is overwritten by programming the PCM cells only in one
direction, i.e., using either SET or RESET operations, not both. In this paper,
we propose data content aware PCM writes (DATACON), a new mechanism that
reduces the latency and energy of PCM writes by redirecting these requests to
overwrite memory locations containing all-zeros or all-ones. DATACON operates
in three steps. First, it estimates how much a PCM write access would benefit
from overwriting known content (e.g., all-zeros, or all-ones) by
comprehensively considering the number of set bits in the data to be written,
and the energy-latency trade-offs for SET and RESET operations in PCM. Second,
it translates the write address to a physical address within memory that
contains the best type of content to overwrite, and records this translation in
a table for future accesses. We exploit data access locality in workloads to
minimize the address translation overhead. Third, it re-initializes unused
memory locations with known all-zeros or all-ones content in a manner that does
not interfere with regular read and write accesses. DATACON overwrites unknown
content only when it is absolutely necessary to do so. We evaluate DATACON with
workloads from state-of-the-art machine learning applications, SPEC CPU2017,
and NAS Parallel Benchmarks. Results demonstrate that DATACON significantly
improves system performance and memory system energy consumption compared to
the best of performance-oriented state-of-the-art techniques.Comment: 18 pages, 21 figures, accepted at ACM SIGPLAN International Symposium
on Memory Management (ISMM
- …