948 research outputs found

    Polynomial-Time Fence Insertion for Structured Programs

    Get PDF
    To enhance performance, common processors feature relaxed memory models that reorder instructions. However, the correctness of concurrent programs is often dependent on the preservation of the program order of certain instructions. Thus, the instruction set architectures offer memory fences. Using fences is a subtle task with performance and correctness implications: using too few can compromise correctness and using too many can hinder performance. Thus, fence insertion algorithms that given the required program orders can automatically find the optimum fencing can enhance the ease of programming, reliability, and performance of concurrent programs. In this paper, we consider the class of programs with structured branch and loop statements and present a greedy and polynomial-time optimum fence insertion algorithm. The algorithm incrementally reduces fence insertion for a control-flow graph to fence insertion for a set of paths. In addition, we show that the minimum fence insertion problem with multiple types of fence instructions is NP-hard even for straight-line programs

    A concurrency semantics for relaxed atomics that permits optimisation and avoids thin-air executions

    Get PDF
    Copyright is held by the owner/author(s). Despite much research on concurrent programming languages, especially for Java and C/C++, we still do not have a satisfactory definition of their semantics, one that admits all common optimisations without also admitting undesired behaviour. Especially problematic are the "thin-Air" examples involving high-performance concurrent accesses, such as C/C++11 relaxed atomics. The C/C++11 model is in a per-candidate-execution style, and previous work has identified a tension between that and the fact that compiler optimisations do not operate over single candidate executions in isolation; rather, they operate over syntactic representations that represent all executions. In this paper we propose a novel approach that circumvents this difficulty. We define a concurrency semantics for a core calculus, including relaxed-Atomic and non-Atomic accesses, and locks, that admits a wide range of optimisation while still forbidding the classic thin-Air examples. It also addresses other problems relating to undefined behaviour. The basic idea is to use an event-structure representation of the current state of each thread, capturing all of its potential executions, and to permit interleaving of execution and transformation steps over that to reflect optimisation (possibly dynamic) of the code. These are combined with a non-multi-copy-Atomic storage subsystem, to reflect common hardware behaviour. The semantics is defined in a mechanised and executable form, and designed to be implementable above current relaxed hardware and strong enough to support the programming idioms that C/C++11 does for this fragment. It offers a potential way forward for concurrent programming language semantics, beyond the current C/C++11 and Java models.This work was partly funded by the EPSRC Programme Grant REMS: Rigorous Engineering for Mainstream Systems, EP/K008528/

    Omphale: Streamlining the Communication for Jobs in a Multi Processor System on Chip

    Get PDF
    Our Multi Processor System on Chip (MPSoC) template provides processing tiles that are connected via a network on chip. A processing tile contains a processing unit and a Scratch Pad Memory (SPM). This paper presents the Omphale tool that performs the first step in mapping a job, represented by a task graph, to such an MPSoC, given the SPM sizes as constraints. Furthermore a memory tile is introduced. The result of Omphale is a Cyclo Static DataFlow (CSDF) model and a task graph where tasks communicate via sliding windows that are located in circular buffers. The CSDF model is used to determine the size of the buffers and the communication pattern of the data. A buffer must fit in the SPM of the processing unit that is reading from it, such that low latency access is realized with a minimized number of stall cycles. If a task and its buffer exceed the size of the SPM, the task is examined for additional parallelism or the circular buffer is partly located in a memory tile. This results in an extended task graph that satisfies the SPM size constraints

    C์˜ ์ €์ˆ˜์ค€ ๊ธฐ๋Šฅ๊ณผ ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™” ์กฐํ™”์‹œํ‚ค๊ธฐ

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2019. 2. ํ—ˆ์ถฉ๊ธธ.์ฃผ๋ฅ˜ C ์ปดํŒŒ์ผ๋Ÿฌ๋“ค์€ ํ”„๋กœ๊ทธ๋žจ์˜ ์„ฑ๋Šฅ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด ๊ณต๊ฒฉ์ ์ธ ์ตœ์ ํ™”๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š”๋ฐ, ๊ทธ๋Ÿฐ ์ตœ์ ํ™”๋Š” ์ €์ˆ˜์ค€ ๊ธฐ๋Šฅ์„ ์‚ฌ์šฉํ•˜๋Š” ํ”„๋กœ๊ทธ๋žจ์˜ ํ–‰๋™์„ ๋ฐ”๊พธ๊ธฐ๋„ ํ•œ๋‹ค. ๋ถˆํ–‰ํžˆ๋„ C ์–ธ์–ด๋ฅผ ๋””์ž์ธํ•  ๋•Œ ์ €์ˆ˜์ค€ ๊ธฐ๋Šฅ๊ณผ ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”๋ฅผ ์ ์ ˆํ•˜๊ฒŒ ์กฐํ™”์‹œํ‚ค๊ฐ€ ๊ต‰์žฅํžˆ ์–ด๋ ต๋‹ค๋Š” ๊ฒƒ์ด ํ•™๊ณ„์™€ ์—…๊ณ„์˜ ์ค‘๋ก ์ด๋‹ค. ์ €์ˆ˜์ค€ ๊ธฐ๋Šฅ์„ ์œ„ํ•ด์„œ๋Š”, ๊ทธ๋Ÿฌํ•œ ๊ธฐ๋Šฅ์ด ์‹œ์Šคํ…œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ์‚ฌ์šฉ๋˜๋Š” ํŒจํ„ด์„ ์ž˜ ์ง€์›ํ•ด์•ผ ํ•œ๋‹ค. ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด์„œ๋Š”, ์ฃผ๋ฅ˜ ์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ณต์žกํ•˜๊ณ ๋„ ํšจ๊ณผ์ ์ธ ์ตœ์ ํ™”๋ฅผ ์ž˜ ์ง€์›ํ•ด์•ผ ํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ €์ˆ˜์ค€ ๊ธฐ๋Šฅ๊ณผ ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”๋ฅผ ๋™์‹œ์— ์ž˜ ์ง€์›ํ•˜๋Š” ์‹คํ–‰์˜๋ฏธ๋Š” ์˜ค๋Š˜๋‚ ๊นŒ์ง€ ์ œ์•ˆ๋œ ๋ฐ”๊ฐ€ ์—†๋‹ค. ๋ณธ ๋ฐ•์‚ฌํ•™์œ„ ๋…ผ๋ฌธ์€ ์‹œ์Šคํ…œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ์š”๊ธดํ•˜๊ฒŒ ์‚ฌ์šฉ๋˜๋Š” ์ €์ˆ˜์ค€ ๊ธฐ๋Šฅ๊ณผ ์ฃผ์š”ํ•œ ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”๋ฅผ ์กฐํ™”์‹œํ‚จ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ, ์šฐ๋ฆฐ ๋‹ค์Œ ์„ฑ์งˆ์„ ๋งŒ์กฑํ•˜๋Š” ๋Š์Šจํ•œ ๋™์‹œ์„ฑ, ๋ถ„ํ•  ์ปดํŒŒ์ผ, ์ •์ˆ˜-ํฌ์ธํ„ฐ ๋ณ€ํ™˜์˜ ์‹คํ–‰์˜๋ฏธ๋ฅผ ์ฒ˜์Œ์œผ๋กœ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ์งธ, ๊ธฐ๋Šฅ์ด ์‹œ์Šคํ…œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ํŒจํ„ด๊ณผ, ๊ทธ๋Ÿฌํ•œ ํŒจํ„ด์„ ๋…ผ์ฆํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ฒ•์„ ์ง€์›ํ•œ๋‹ค. ๋‘˜์งธ, ์ฃผ์š”ํ•œ ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”๋“ค์„ ์ง€์›ํ•œ๋‹ค. ์šฐ๋ฆฌ๊ฐ€ ์ œ์•ˆํ•œ ์‹คํ–‰์˜๋ฏธ์— ์ž์‹ ๊ฐ์„ ์–ป๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” ๋…ผ๋ฌธ์˜ ์ฃผ์š” ๊ฒฐ๊ณผ๋ฅผ ๋Œ€๋ถ€๋ถ„ Coq ์ฆ๋ช…๊ธฐ ์œ„์—์„œ ์ฆ๋ช…ํ•˜๊ณ , ๊ทธ ์ฆ๋ช…์„ ๊ธฐ๊ณ„์ ์ด๊ณ  ์—„๋ฐ€ํ•˜๊ฒŒ ํ™•์ธํ–ˆ๋‹ค.To improve the performance of C programs, mainstream compilers perform aggressive optimizations that may change the behaviors of programs that use low-level features in unidiomatic ways. Unfortunately, despite many years of research and industrial efforts, it has proven very difficult to adequately balance the conflicting criteria for low-level features and compiler optimizations in the design of the C programming language. On the one hand, C should support the common usage patterns of the low-level features in systems programming. On the other hand, C should also support the sophisticated and yet effective optimizations performed by mainstream compilers. None of the existing proposals for C semantics, however, sufficiently support low-level features and compiler optimizations at the same time. In this dissertation, we resolve the conflict between some of the low-level features crucially used in systems programming and major compiler optimizations. Specifically, we develop the first formal semantics of relaxed-memory concurrency, separate compilation, and cast between integers and pointers that (1) supports their common usage patterns and reasoning principles for programmers, and (2) provably validates major compiler optimizations at the same time. To establish confidence in our formal semantics, we have formalized most of our key results in the Coq theorem prover, which automatically and rigorously checks the validity of the results.Abstract Acknowledgements Chapter I Prologue Chapter II Relaxed-Memory Concurrency Chapter III Separate Compilation and Linking Chapter IV Cast between Integers and Pointers Chapter V Epilogue ์ดˆ๋กDocto

    Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors

    Get PDF
    This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large scale chip multiprocessors (CMPs). Our work is motivated by large asymmetry in cache sets usages. CE decouples the physical locations of cache blocks from their addresses for the sake of reducing misses caused by destructive interferences. Temporal pressure at the on-chip last-level cache, is continuously collected at a group (comprised of cache sets) granularity, and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at a cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% for the benchmark programs we examined. Furthermore, evaluations manifested the outperformance of CE versus related CMP cache designs

    Lasagne : a static binary translator for weak memory model architectures

    Get PDF
    Funding: This work was supported by a UK RISE Grant.The emergence of new architectures create a recurring challenge to ensure that existing programs still work on them. Manually porting legacy code is often impractical. Static binary translation (SBT) is a process where a programโ€™s binary is automatically translated from one architecture to another, while preserving their original semantics. However, these SBT tools have limited support to various advanced architectural features. Importantly, they are currently unable to translate concurrent binaries. The main challenge arises from the mismatches of the memory consistency model specified by the different architectures, especially when porting existing binaries to a weak memory model architecture. In this paper, we propose Lasagne, an end-to-end static binary translator with precise translation rules between x86 and Arm concurrency semantics. First, we propose a concurrency model for Lasagneโ€™s intermediate representation (IR) and formally proved mappings between the IR and the two architectures. The memory ordering is preserved by introducing fences in the translated code. Finally, we propose optimizations focused on raising the level of abstraction of memory address calculations and reducing the number offences. Our evaluation shows that Lasagne reduces the number of fences by up to about 65%, with an average reduction of 45.5%, significantly reducing their runtime overhead.Postprin

    Memory Consistency and Cache Coherency in Network-on-Chip Based Multi-Core Systems

    Get PDF
    The complexity of modern Systems-on-Chips (SoC) is increasing with technology innovations. Designers of such systems are devoting significant attention not only to computation attributes, but increasingly more and more on communications characteristics. Having in mind scalability challenges, Networks-on-Chip (NoC) are already de facto standard for the communication backbone of SoC systems. As such, those systems are targeting more and more parallel execution of user de๏ฌned, real-time applications, but the computer engineering society aims at hiding underlying platform speci๏ฌc characteristics and providing user with platform-independent services. Shared memory services are quite often a needed crucial property of such systems, therefore providing a coherent view, ensuring memory consistency, and still achieving the desired performance system characteristics is a huge challenge for scientists nowadays. With the invention of 3D integration, and opportunities of stacking memory modules on top of it, the concept of scalable shared memory will be one of the main memory access concepts besides message passing. In this thesis, the concept of a scalable coherency protocol which dynamically adopts to inputs of system and shared resources, is presented. Protocol ingredients, structure and internal modules interaction are described in detail. The conceptual idea of this protocol, in๏ฌ‚uenced by widely accepted best practices in bus based systems as well of other NoC systems, is implemented for one particular type of NoC platform - XhiNoC (extendable Hierarchical Network-on Chip). The feasibility of the presented concept for distributed shared memory (DSM) coherency within NoC-based SoC architectures is con๏ฌrmed by simulation-based experimental results.The complexity of modern Systems-on-Chips (SoC) is increasing with technology innovations. Designers of such systems are devoting significant attention not only to computation attributes, but increasingly more and more on communications characteristics. Having in mind scalability challenges, Networks-on-Chip (NoC) are already de facto standard for the communication backbone of SoC systems. As such, those systems are targeting more and more parallel execution of user de๏ฌned, real-time applications, but the computer engineering society aims at hiding underlying platform speci๏ฌc characteristics and providing user with platform-independent services. Shared memory services are quite often a needed crucial property of such systems, therefore providing a coherent view, ensuring memory consistency, and still achieving the desired performance system characteristics is a huge challenge for scientists nowadays. With the invention of 3D integration, and opportunities of stacking memory modules on top of it, the concept of scalable shared memory will be one of the main memory access concepts besides message passing. In this thesis, the concept of a scalable coherency protocol which dynamically adopts to inputs of system and shared resources, is presented. Protocol ingredients, structure and internal modules interaction are described in detail. The conceptual idea of this protocol, in๏ฌ‚uenced by widely accepted best practices in bus based systems as well of other NoC systems, is implemented for one particular type of NoC platform - XhiNoC (extendable Hierarchical Network-on Chip). The feasibility of the presented concept for distributed shared memory (DSM) coherency within NoC-based SoC architectures is con๏ฌrmed by simulation-based experimental results

    Memory Consistency Models

    Get PDF
    Abstract: The memory consistency model for a shared-memory multiprocessor specifies the behaviour of memory with respect to read and write operations from multiple processors. Relaxed models that impose fewer memory ordering constraints offer the potential for higher performance by allowing hardware and software to overlap and reorder memory operations. Many of the previously proposed models either fail to provide reasonable programming semantics or are biased toward programming ease at the cost of sacrificing performance. The optimizations enabled by relaxed models are extremely effective in hiding virtually the full latency of writes in architectures with blocking reads. We evaluate all the consistency models and the comparison for the weak consistency model and release consistency model, the performance benefits of exploiting relaxed models based on detailed simulations of realistic parallel applications. We believe that the combined benefits in hardware and software will make relaxed models universal in future multiprocessors, as is already evidenced by their adoption in several commercial systems
    • โ€ฆ
    corecore