94 research outputs found
Predictable and composable system-on-chip memory controllers
Contemporary System-on-Chip (SoC) become more and more complex, as increasing integration results in a larger number of concurrently executing applications. These applications consist of tasks that are mapped on heterogeneous multi-processor platforms with distributed memory hierarchies, where SRAMs and SDRAMs are shared by a variety of arbiters. Some applications have real-time requirements, meaning that they must perform a particular computation before a deadline to guarantee functional correctness, or to prevent quality degradation. Mapping the applications on the platform such that all real-time requirements are satisfied is very challenging. The number of possible mappings of tasks to processing elements and data structures to memories may be large, and appropriate configuration settings must be determined once the mapping is chosen. Verifying that a particular mapping satisfies all application requirements is typically done by system-level simulation. However, resource sharing causes interference between applications, making their temporal behaviors inter-dependent. All concurrently executing applications must hence be verified together, causing the verification complexity of the system to increase exponentially with the number of applications. Together these factors contribute to making the integration and verification process a dominant part of SoC development, both in terms of time and money. Predictable and composable systems are proposed to manage the increasing verification complexity. Predictable systems provide lower bounds on application performance, while applications in composable systems are completely isolated and cannot affect each other’s temporal behavior by even a single clock cycle. Predictable systems enable formal verification that covers all possible interactions with the platform. However, this assumes that the behavior of an application is captured in a performance model, which is not the case for many applications. Composability offers a complementary verification approach by letting these applications be verified independently by simulation with linear verification complexity. A limitation of current predictable and composable systems is that there are no memory controllers supporting the concepts in a general way. Current SRAM controllers can be shared in a predictable way with a variety of arbiters, but are only composable if statically scheduled or shared using time-division multiplexing. Existing SDRAM controllers are not composable, and are either unpredictable or limited to applications that are statically scheduled. This thesis addresses the limitations of current predictable and composable systems by proposing a general predictable and composable memory controller, thereby addressing the mapping and verification problem in embedded systems. The proposed memory controller is divided into a front-end and a back-end. The back-end is specific for DDR2/DDR3 SDRAM and makes the memory behave in a predictable manner using precomputed memory patterns that are dynamically combined at run time. The front-end contains buffering and an arbiter in the class of Latency-Rate (LR) servers, which is a class with many well-known predictable arbiters. We extend this class with a Credit-Controlled Static-Priority (CCSP) arbiter that is developed specifically for shared resources with latency-critical requestors and high loads, such as memories. Three key features of CCSP are: 1) It accommodates latency-critical requestors with low bandwidth requirements without wasting bandwidth. 2) Over-allocated bandwidth can be made negligible at an increased area cost, without affecting latency. 3) It has a small implementation that runs fast enough to keep up with most DDR2/DDR3 memories. The proposed front-end is general and can be used with other predictable resources, such as SRAM controllers. The proposed memory controller hence supports multiple arbiter and memory types, thus addressing the diversity in modern SoCs. The combination of front-end and predictable memory behaves like a LR server, which is the shared resource abstraction used in this work. In essence, a LR server guarantees a requestor a minimum bandwidth and a maximum latency, enabling formal verification of real-time requirements. The LR server model is compatible with several commonly used formal analysis frameworks, such as network calculus and data-flow analysis. Our memory controller hence allows any combination of predictable memory and LR arbiter to be used transparently for formal verification of applications with any of these frameworks. Sharing a predictable memory at run-time results in interference between requestors, making the memory controller non-composable. This is addressed by adding a Delay Block to the front-end that delays all signals sent from the front-end to a requestor to always emulate worst-case interference. This makes requestors unable to affect each other’s temporal behavior, which is sufficient to guarantee composability on the level of applications. Our predictable memory controller hence offers composable service with a variety of memory and arbiter types, which widely extends the scope of composable platforms. Another benefit of this approach is that it enables composable service to be dynamically enabled and disabled, enabling requestors that do not require composable service to use slack bandwidth to improve performance. The predictable and composable memory controller is supported by a configuration flow that automatically computes memory patterns and arbiter settings to satisfy given bandwidth and latency requirements. The flow uses abstraction to separate the configuration of the memory and the arbiter, enabling settings to be computed in a streamlined fashion for all supported memories and arbiters
Cache-affinity scheduling for fine grain multithreading
Cache utilisation is often very poor in multithreaded applications, due to the loss of data access locality incurred by frequent context switching. This problem is compounded on shared memory multiprocessors when dynamic load balancing is introduced and thread migration disrupts cache content. In this paper, we present a technique, which we refer to as ‘batching’, for reducing the negative impact of fine grain multithreading on cache performance. Prototype schedulers running on uniprocessors and shared memory multiprocessors are described, and finally experimental results which illustrate the improvements observed after applying our techniques are presented.peer-reviewe
Dynamics and pragmatics for high performance concurrency
This thesis is concerned with support at all levels for building highly concurrent and dynamic parallel processing systems. The CSP model of concurrency, as (largely) embodied in the occam programming language is used due to its simplicity, expressiveness, architecture- independent nature, and potential for high performance. Additionally, occam provides guarantees regarding freedom from aliasing and race-hazard error. This thesis addresses one of the grand challenges of present day computer science: providing a software technology that offers the dynamic flexibility and performance of mainstream object oriented environments with the level of safety, formal analysis, modularity and lightweight concurrency offered by CSP/occam. Two approaches to this challenge are possible: do something to make the mainstream languages (e.g. Java, C++) safe, or make occam dynamic -- without compromising its existing good properties. This thesis follows the latter route. The first part of this thesis concentrates on enhancing the occam language and run-time system, on a commodity platform (IBM PC) running the freely available Linux operating system. After a brief introduction to the various components of the kroc occam system, additions and extensions to the occam programming language and supporting run-time system are examined. These provide a greater degree of programming flexibility in occam (for example, by adding support for dynamic allocation, mobile semantics and dynamic network construction), without compromising the safety of programs which use them. Benchmarks are reported that demonstrate significant improvements in performance (for example, channel communication in tens of nano-seconds). The second part concentrates on improving the level of interaction between occam programs and the OS environment. Providing easy access to sockets and networking, for example. This thesis concludes with a discussion of the work presented herein, with consideration given to parallels with object-oriented languages. Also described are details of ongoing and potential future research. The modified language grammar, details of new compiler generated code, and miscellany are provided in the appendices.EThOS - Electronic Theses Online ServiceGBUnited Kingdo
Dynamics and pragmatics for high performance concurrency
This thesis is concerned with support at all levels for building highly concurrent and dynamic parallel processing systems. The CSP model of concurrency, as (largely) embodied in the occam programming language is used due to its simplicity, expressiveness, architecture- independent nature, and potential for high performance. Additionally, occam provides guarantees regarding freedom from aliasing and race-hazard error. This thesis addresses one of the grand challenges of present day computer science: providing a software technology that offers the dynamic flexibility and performance of mainstream object oriented environments with the level of safety, formal analysis, modularity and lightweight concurrency offered by CSP/occam. Two approaches to this challenge are possible: do something to make the mainstream languages (e.g. Java, C++) safe, or make occam dynamic -- without compromising its existing good properties. This thesis follows the latter route. The first part of this thesis concentrates on enhancing the occam language and run-time system, on a commodity platform (IBM PC) running the freely available Linux operating system. After a brief introduction to the various components of the kroc occam system, additions and extensions to the occam programming language and supporting run-time system are examined. These provide a greater degree of programming flexibility in occam (for example, by adding support for dynamic allocation, mobile semantics and dynamic network construction), without compromising the safety of programs which use them. Benchmarks are reported that demonstrate significant improvements in performance (for example, channel communication in tens of nano-seconds). The second part concentrates on improving the level of interaction between occam programs and the OS environment. Providing easy access to sockets and networking, for example. This thesis concludes with a discussion of the work presented herein, with consideration given to parallels with object-oriented languages. Also described are details of ongoing and potential future research. The modified language grammar, details of new compiler generated code, and miscellany are provided in the appendices
Life of occam-Pi
This paper considers some questions prompted by a brief review of the history of computing. Why is programming so hard? Why is concurrency considered an “advanced” subject? What’s the matter with Objects? Where did all the Maths go? In searching for answers, the paper looks at some concerns over fundamental ideas within object orientation (as represented by modern programming languages), before focussing on the concurrency model of communicating processes and its particular expression in the occam family of languages. In that focus, it looks at the history of occam, its underlying philosophy (Ockham’s Razor), its semantic foundation on Hoare’s CSP, its principles of process oriented design and its development over almost three decades into occam-? (which blends in the concurrency dynamics of Milner’s ?-calculus). Also presented will be an urgent need for rationalisation – occam-? is an experiment that has demonstrated significant results, but now needs time to be spent on careful review and implementing the conclusions of that review. Finally, the future is considered. In particular, is there a future
Recommended from our members
A Globally Arbitrated Memory Tree for Mixed-Time-Criticality Systems
Embedded systems are increasingly based on multi-core platforms to accommodate a growing number of applications, some of which have real-time requirements. Resources, such as off-chip DRAM, are typically shared between the applications using memory interconnects with different arbitration polices to cater to diverse bandwidth and latency requirements. However, traditional centralized interconnects are not scalable as the number of clients increase. Similarly, current distributed interconnects either cannot satisfy the diverse requirements or have decoupled arbitration stages, resulting in larger area, power and worst-case latency. The four main contributions of this article are: 1) a Globally Arbitrated Memory Tree (GAMT) with a distributed architecture that scales well with the number of cores, 2) an RTL-level implementation that can be configured with five arbitration policies (three distinct and two as special cases), 3) the concept of mixed arbitration policies that allows the policy to be selected individually per core, and 4) a worst-case analysis for a mixed arbitration policy that combines TDM and FBSP arbitration.We compare the performance of GAMT with centralized implementations and show that it can run up to four times faster and have over 51 and 37 percent reduction in area and power consumption, respectively, for a given bandwidth
- …