Bounding Cache Miss Costs of Multithreaded Computations Under General Schedulers
We analyze the caching overhead incurred by a class of multithreaded algorithms when scheduled by an arbitrary scheduler. We obtain bounds that match or improve upon the well-known $O(Q + S \cdot (M/B))$ caching cost for the randomized work stealing (RWS) scheduler, where $S$ is the number of steals, $Q$ is the sequential caching cost, and $M$ and $B$ are the cache size and block (or cache line) size, respectively.

Comment: Extended abstract in Proceedings of ACM Symp. on Parallel Alg. and Architectures (SPAA) 2017, pp. 339-350. This revision has a few small updates, including a missing citation and the replacement of some big-Oh terms with precise constants.
Using the Spring Physical Model to Extend a Cooperative Caching Protocol for Many-Core Processors
As the number of embedded cores grows, the off-chip memory wall becomes an overwhelming bottleneck, so it is increasingly important to exploit on-chip data storage efficiently. In previous work, we proposed a data sliding mechanism that stores data in a core's closest neighborhood, even under heavy stress loads. However, each cache block is allowed to migrate only once to a neighbor's cache (i.e., 1-Chance Forwarding). In this paper, we propose an extension of this mechanism that expands the cooperative caching area. Our work is based on an adaptive physical model in which each cache block is treated as a mass connected to a spring. This technique constrains data migration according to the spring constant and the difference in workload between cores. The adaptive data sliding approach leads to a balanced spread of data across the chip and therefore improves on-chip storage. On-chip data access is evaluated analytically; results show that the extended data sliding increases the global cache hit rate on the chip, especially in the presence of juxtaposed hot spots.
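To make the spring analogy concrete, here is a minimal sketch of what such a migration test could look like. The decision rule, the load metric, and all names are illustrative assumptions, not the protocol's actual implementation: a block that has already slid some hops from its home core feels a restoring force proportional to that distance, and may slide one hop further only if the workload imbalance between the source core and the candidate neighbor exceeds that force.

```c
#include <stdbool.h>
#include <stdio.h>

struct cache_block {
    int home_core; /* core the block originally belonged to */
    int hops;      /* current distance (in hops) from home_core */
};

/* Hypothetical test: allow one more hop of "sliding" only when the
 * load difference overcomes the spring's restoring force F = k * x,
 * so hot cores can push blocks toward idle neighbors, while blocks
 * are held back toward home as the imbalance fades. */
static bool may_slide(const struct cache_block *b,
                      double load_from, double load_to, double spring_k)
{
    double restoring = spring_k * (double)(b->hops + 1);
    return (load_from - load_to) > restoring;
}

int main(void)
{
    struct cache_block b = { .home_core = 0, .hops = 1 };
    /* heavily loaded core (0.9) next to an idle neighbor (0.2): allowed */
    printf("slide allowed: %d\n", may_slide(&b, 0.9, 0.2, 0.25));
    /* nearly balanced cores: the spring wins and migration is blocked */
    printf("slide allowed: %d\n", may_slide(&b, 0.5, 0.45, 0.25));
    return 0;
}
```

In this reading, a larger spring constant keeps blocks close to their home core, while a smaller one widens the cooperative caching area.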
Improving the Performance and Endurance of Persistent Memory with Loose-Ordering Consistency
Persistent memory provides high-performance data persistence at main memory.
Memory writes need to be performed in strict order to satisfy storage
consistency requirements and enable correct recovery from system crashes.
Unfortunately, adhering to such a strict order significantly degrades system
performance and persistent memory endurance. This paper introduces a new
mechanism, Loose-Ordering Consistency (LOC), that satisfies the ordering
requirements at significantly lower performance and endurance loss. LOC
consists of two key techniques. First, Eager Commit eliminates the need to
perform a persistent commit record write within a transaction. We do so by
ensuring that we can determine the status of all committed transactions during
recovery by storing necessary metadata information statically with blocks of
data written to memory. Second, Speculative Persistence relaxes the write
ordering between transactions by allowing writes to be speculatively written to
persistent memory. A speculative write is made visible to software only after
its associated transaction commits. To enable this, our mechanism supports the tracking of committed transaction IDs and multi-versioning in the CPU cache. Our
evaluations show that LOC reduces the average performance overhead of memory
persistence from 66.9% to 34.9% and the memory write traffic overhead from
17.1% to 3.4% on a variety of workloads.

Comment: This paper has been accepted by IEEE Transactions on Parallel and Distributed Systems.
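As a rough illustration of the Eager Commit idea, the sketch below shows how recovery could determine a transaction's status from metadata stored alongside the data itself, with no separate commit record. The block format and the counting rule are assumptions for illustration, not the paper's actual on-memory layout or recovery protocol.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

struct pm_block {
    uint64_t tx_id;     /* transaction that wrote this block */
    uint32_t tx_blocks; /* total number of blocks in that transaction */
    uint8_t  data[48];  /* payload, padded toward a cache-line size */
};

/* Assumed recovery rule: a transaction committed iff all of its blocks
 * reached persistent memory, i.e. the number of blocks found for tx
 * equals the per-block record of the transaction's size. */
static bool tx_committed(const struct pm_block *mem, size_t n, uint64_t tx)
{
    uint32_t seen = 0, total = 0;
    for (size_t i = 0; i < n; i++) {
        if (mem[i].tx_id == tx) {
            seen++;
            total = mem[i].tx_blocks;
        }
    }
    return total > 0 && seen == total;
}

int main(void)
{
    /* tx 1 persisted both of its 2 blocks; tx 2 lost one in a crash */
    struct pm_block mem[3] = {
        { .tx_id = 1, .tx_blocks = 2 },
        { .tx_id = 1, .tx_blocks = 2 },
        { .tx_id = 2, .tx_blocks = 2 },
    };
    printf("tx1 committed: %d\n", tx_committed(mem, 3, 1)); /* prints 1 */
    printf("tx2 committed: %d\n", tx_committed(mem, 3, 2)); /* prints 0 */
    return 0;
}
```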
Software trace cache
We explore compiler optimizations that improve the layout of instructions in memory. The goal is to enable the code to make better use of the underlying hardware resources, regardless of the specific details of the processor or architecture, in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations: we aim not only to improve the instruction cache hit rate, but also to increase the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains, trying to make sequentially executed basic blocks reside in consecutive memory positions, and then maps the basic block chains in memory so as to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and of code layout optimizations in general, on the three main aspects of fetch performance: the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout-optimized codes have special characteristics that make them more amenable to high-performance instruction fetch: they have a very high rate of not-taken branches and execute long chains of sequential instructions; they also make very effective use of instruction cache lines, mapping only useful instructions that will execute close in time, which increases both spatial and temporal locality.
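The chain-building step can be pictured with a small greedy sketch: starting from a seed basic block, repeatedly append the hottest not-yet-placed successor according to profile counts, so frequently executed fall-through paths become contiguous in memory. The profile numbers and the greedy rule below are simplified assumptions; the actual STC algorithm also handles seed selection and conflict-aware mapping of the resulting chains.

```c
#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 5

/* edge[i][j]: profiled executions of the control transfer BBi -> BBj
 * (made-up numbers for illustration) */
static const int edge[NBLOCKS][NBLOCKS] = {
    { 0, 90, 10,  0,  0 },
    { 0,  0,  0, 85,  5 },
    { 0,  0,  0,  0, 10 },
    { 0,  0,  0,  0, 80 },
    { 0,  0,  0,  0,  0 },
};

int main(void)
{
    bool placed[NBLOCKS] = { false };
    int cur = 0; /* seed the chain with the entry block */

    printf("layout order:");
    while (cur >= 0) {
        placed[cur] = true;
        printf(" BB%d", cur);
        /* follow the hottest edge to a block not yet laid out */
        int best = -1, best_w = 0;
        for (int j = 0; j < NBLOCKS; j++) {
            if (!placed[j] && edge[cur][j] > best_w) {
                best = j;
                best_w = edge[cur][j];
            }
        }
        cur = best;
    }
    /* prints: BB0 BB1 BB3 BB4; the cold block BB2 would be placed
     * by a later pass, away from the hot chain */
    printf("\n");
    return 0;
}
```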
Execution Integrity with In-Place Encryption
Instruction set randomization (ISR) was initially proposed with the main goal of countering code-injection attacks. However, ISR seems to have lost its appeal: code-injection attacks became less attractive once protection mechanisms such as data execution prevention (DEP) were widely deployed, while code-reuse attacks became more prevalent.
In this paper, we show that ISR can be extended to also protect against
code-reuse attacks while at the same time offering security guarantees similar
to those of software diversity, control-flow integrity, and information hiding.
We present Scylla, a scheme that deploys a new technique for in-place code
encryption to hide the code layout of a randomized binary, and restricts the
control flow to a benign execution path. This allows us to i) implicitly
restrict control-flow targets to basic block entries without requiring the
extraction of a control-flow graph, ii) achieve execution integrity within
legitimate basic blocks, and iii) hide the underlying code layout under
malicious read access to the program. Our analysis demonstrates that Scylla is
capable of preventing state-of-the-art attacks such as just-in-time
return-oriented programming (JIT-ROP) and crash-resistant oriented programming
(CROP). We extensively evaluate our prototype implementation of Scylla and show that its performance overhead is feasible in practice. We also provide details on how this overhead can be significantly reduced with dedicated hardware support.
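A toy illustration of the underlying in-place encryption idea at basic-block granularity is sketched below: each block's bytes are XOR-encrypted with a key derived from the block's legitimate entry address, so they decode correctly only when execution enters at a real block entry. This is a conceptual sketch under assumed names and a deliberately weak key derivation; the actual scheme operates on machine instructions with binary rewriting and proper cryptographic keys.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-entry key derivation; a real scheme would mix in a
 * secret key rather than depend on the address alone. */
static uint8_t key_for_entry(uintptr_t entry)
{
    return (uint8_t)(0x5A ^ (entry * 2654435761u));
}

/* XOR-encrypt or decrypt a block in place, keyed by its entry point. */
static void xor_block(uint8_t *code, size_t len, uintptr_t entry)
{
    uint8_t k = key_for_entry(entry);
    for (size_t i = 0; i < len; i++)
        code[i] ^= k;
}

int main(void)
{
    /* stand-in "instruction bytes" for one basic block */
    uint8_t block[8] = { 0x55, 0x48, 0x89, 0xE5, 0x90, 0x90, 0x5D, 0xC3 };
    uintptr_t entry = (uintptr_t)block;

    xor_block(block, sizeof block, entry); /* encrypt in place */
    /* A reader leaking memory sees only ciphertext, and a jump into the
     * middle of the block would derive a different key and decode
     * garbage; only the legitimate entry restores the original bytes. */
    xor_block(block, sizeof block, entry); /* decrypt at legitimate entry */
    printf("first byte after correct decryption: 0x%02X\n", block[0]);
    return 0;
}
```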
Experimental Evaluation of Cache-Related Preemption Delay Aware Timing Analysis
In the presence of caches, preemptive scheduling may incur a significant overhead referred to as cache-related preemption delay (CRPD). CRPD is caused by preempting tasks evicting cached memory blocks of preempted tasks, which have to be reloaded when the preempted tasks resume their execution.
In this paper, we experimentally evaluate state-of-the-art techniques that account for CRPD during timing analysis. We find that purely synthetically generated task sets may yield misleading conclusions regarding the relative precision of different CRPD analysis techniques, and regarding the impact of CRPD on schedulability in general. Based on task characterizations obtained by static worst-case execution time (WCET) analysis, we shed new light on the state of the art.
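One classic bound from this literature (the UCB/ECB approach) illustrates how such analyses work: the delay of a single preemption is at most the block reload time multiplied by the number of useful cache blocks (UCBs) of the preempted task that collide with the evicting cache blocks (ECBs) of the preempting task. The sketch below computes that bound for a direct-mapped cache, with cache-set footprints modeled as bitmasks; the footprints and reload time are made-up values.

```c
#include <stdint.h>
#include <stdio.h>

/* count bits set in the intersection of two 64-set cache footprints */
static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

int main(void)
{
    uint64_t ucb_preempted  = 0x00000000FFFF0000ull; /* useful blocks   */
    uint64_t ecb_preempting = 0x0000000000FF00FFull; /* evicting blocks */
    double brt_us = 0.1; /* block reload time in microseconds (assumed) */

    /* CRPD <= BRT * |UCB(preempted) & ECB(preempting)| */
    double crpd = brt_us * popcount64(ucb_preempted & ecb_preempting);
    printf("CRPD bound: %.2f us\n", crpd); /* 8 colliding sets -> 0.80 us */
    return 0;
}
```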
A Delay-Aware Caching Algorithm for Wireless D2D Caching Networks
Recently, wireless caching techniques have been studied to satisfy lower
delay requirements and offload traffic from peak periods. By storing parts of
the popular files at the mobile users, users can locate some of their requested
files in their own caches or the caches at their neighbors. In the latter case,
when a user receives files from its neighbors, device-to-device (D2D)
communication is enabled. D2D communication underlaid with cellular networks is
also a new paradigm for the upcoming 5G wireless systems. By allowing a pair of
adjacent D2D users to communicate directly, D2D communication can achieve
higher throughput, better energy efficiency and lower traffic delay. In this
work, we propose a very efficient caching algorithm for D2D-enabled cellular
networks to minimize the average transmission delay. Instead of searching over
all possible solutions, our algorithm selects, in each iteration, the placement that provides the best delay improvement, gradually forming a caching policy with very low transmission delay and high throughput. This algorithm is also extended to address a more general scenario, in which the distributions of fading coefficients and the values of system parameters potentially change over time. Via numerical results, the superiority of the proposed algorithm is verified by comparing it with a naive algorithm in which all users simply cache their favorite files.
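The greedy structure described above can be sketched as follows: instead of enumerating every caching policy, repeatedly commit the single user-file placement with the largest reduction in average delay, until all caches are full or no placement helps. The delay model (local hit, D2D hit, base-station fetch) and all parameters are toy assumptions, not the paper's system model.

```c
#include <stdio.h>
#include <stdbool.h>

#define USERS 3
#define FILES 4
#define CACHE_SLOTS 2

static const double pop[FILES] = { 0.4, 0.3, 0.2, 0.1 }; /* request prob. */
static bool cached[USERS][FILES];

/* Toy delay model: local hit 1 ms, D2D hit from a neighbor 3 ms,
 * base-station fetch 10 ms; a real model would use fading statistics. */
static double avg_delay(void)
{
    double d = 0;
    for (int u = 0; u < USERS; u++)
        for (int f = 0; f < FILES; f++) {
            double df = 10.0; /* default: fetch from the base station */
            for (int v = 0; v < USERS; v++)
                if (cached[v][f]) /* local copy beats D2D beats BS */
                    df = (v == u) ? 1.0 : (df > 3.0 ? 3.0 : df);
            d += pop[f] * df;
        }
    return d / USERS;
}

int main(void)
{
    int used[USERS] = { 0 };
    for (;;) {
        double base = avg_delay(), best_gain = 0;
        int bu = -1, bf = -1;
        /* try every feasible single placement and keep the best one */
        for (int u = 0; u < USERS; u++) {
            if (used[u] == CACHE_SLOTS) continue;
            for (int f = 0; f < FILES; f++) {
                if (cached[u][f]) continue;
                cached[u][f] = true;
                double gain = base - avg_delay();
                cached[u][f] = false;
                if (gain > best_gain) { best_gain = gain; bu = u; bf = f; }
            }
        }
        if (bu < 0) break; /* no improving placement left */
        cached[bu][bf] = true;
        used[bu]++;
        printf("cache file %d at user %d (gain %.3f ms)\n", bf, bu, best_gain);
    }
    printf("final average delay: %.3f ms\n", avg_delay());
    return 0;
}
```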