Memory consistency directed cache coherence protocols for scalable multiprocessors
The memory consistency model, which formally specifies the behavior of the
memory system, is used by programmers to reason about parallel programs. From a
hardware design perspective, weaker consistency models permit various optimizations
in a multiprocessor system: this thesis focuses on designing and optimizing the cache
coherence protocol for a given target memory consistency model.
Traditional directory coherence protocols are designed to be compatible with the
strictest memory consistency model, sequential consistency (SC). When they are used
for chip multiprocessors (CMPs) that provide more relaxed memory consistency models,
such protocols turn out to be unnecessarily strict. Usually, this comes at the cost of
scalability, in terms of per-core storage due to sharer tracking, which poses a problem
with the increasing number of cores in today’s CMPs, most of which are no longer sequentially
consistent. The recent convergence towards programming language based relaxed
memory consistency models has sparked renewed interest in lazy cache coherence
protocols. These protocols exploit synchronization information by enforcing coherence
only at synchronization boundaries via self-invalidation. As a result, such protocols do
not require sharer tracking which benefits scalability. On the downside, such protocols
are only readily applicable to a restricted set of consistency models, such as Release
Consistency (RC), which expose synchronization information explicitly. In particular,
existing architectures with stricter consistency models (such as x86) cannot readily
make use of lazy coherence protocols without either: adapting the protocol to satisfy
the stricter consistency model; or changing the architecture’s consistency model to (a
variant of) RC, typically at the expense of backward compatibility. The first part of
this thesis explores both these options, with a focus on a practical approach satisfying
backward compatibility.
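The lazy coherence idea described above (coherence enforced only at synchronization boundaries via self-invalidation, with no sharer tracking) can be illustrated with a toy model. This is a minimal sketch under simplifying assumptions, not the protocol proposed in the thesis: the class name LazyCache and its methods are hypothetical, the shared level is a plain dictionary, and timing, cache capacity and concurrency are ignored.

```python
class LazyCache:
    """Toy private cache: no sharer tracking, coherence only at sync points."""

    def __init__(self, memory):
        self.memory = memory  # shared dict standing in for the shared cache level
        self.lines = {}       # address -> locally cached value
        self.dirty = set()    # addresses written locally but not yet written back

    def read(self, addr):
        # A hit returns the local copy even if another core has since written.
        if addr not in self.lines:
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def release(self):
        # Make all local writes visible at the shared level.
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()

    def acquire(self):
        # Self-invalidate every clean line; dirty lines hold our own writes.
        self.lines = {a: v for a, v in self.lines.items() if a in self.dirty}


memory = {}
producer, consumer = LazyCache(memory), LazyCache(memory)
consumer.read(0x10)           # consumer caches the old value
producer.write(0x10, 42)
producer.release()
stale = consumer.read(0x10)   # still the cached copy: nothing invalidated it
consumer.acquire()
fresh = consumer.read(0x10)   # refetched after self-invalidation
```

In this model the consumer only observes the producer's write after its own acquire, which is why such protocols depend on synchronization being visible to the hardware.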
Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and
SPARC processors, and existing parallel programs written for these architectures, we
first propose TSO-CC, a lazy cache coherence protocol for the TSO memory consistency
model. TSO-CC does not track sharers and instead relies on self-invalidation and
detection of potential acquires (in the absence of explicit synchronization) using per
cache line timestamps to efficiently and lazily satisfy the TSO memory consistency
model. Our results show that TSO-CC achieves, on average, performance comparable
to a MESI directory protocol, while TSO-CC’s storage overhead per cache line scales
logarithmically with increasing core count.
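To make the scaling contrast concrete, the sketch below compares per-line tracking storage for a full-map sharer vector with a single core-ID pointer. It assumes one presence bit per core for the full map and a pointer of ceil(log2(N)) bits; the function names are hypothetical, and the actual TSO-CC per-line metadata (which also includes timestamps) is not modelled here.

```python
def full_map_bits(cores: int) -> int:
    """Bits per cache line for a full-map sharer vector (one bit per core)."""
    return cores


def pointer_bits(cores: int) -> int:
    """Bits per cache line for one core-ID pointer, i.e. ceil(log2(cores))."""
    return max(1, (cores - 1).bit_length())


# Linear versus logarithmic growth as the core count increases.
table = [(n, full_map_bits(n), pointer_bits(n)) for n in (16, 64, 256, 1024)]
for cores, vector, pointer in table:
    print(f"{cores:5d} cores: full map {vector:5d} bits, pointer {pointer:2d} bits")
```

Under these assumptions, going from 16 to 1024 cores multiplies the full-map cost by 64 but only adds six bits to the pointer.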
Next, we propose an approach for the x86-64 architecture, which is a compromise
between retaining the original consistency model and using a more storage efficient
lazy coherence protocol. First, we propose a mechanism to convey synchronization
information via a simple ISA extension, while retaining backward compatibility with
legacy codes and older microarchitectures. Second, we propose RC3 (based on TSO-CC),
a scalable cache coherence protocol for RCtso, the resulting memory consistency
model. RC3 does not track sharers and relies on self-invalidation on acquires. To
satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1
timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving
performance comparable to a MESI directory protocol for RC optimized programs.
RC3’s storage overhead per cache line scales logarithmically with increasing core count
and reduces on-chip coherence storage overheads by 45% compared to TSO-CC.
Finally, it is imperative that hardware adheres to the promised memory consistency
model. Indeed, consistency directed coherence protocols cannot be verified against conventional
coherence definitions (e.g. SWMR), and few existing verification
methodologies apply. Furthermore, as the full consistency model is used as a specification,
their interaction with other components (e.g. pipeline) of a system must not be
neglected in the verification process. Therefore, verifying a system with such protocols
in the context of interacting components is even more important than before. One
common way to do this is via executing tests, where specific threads of instruction
sequences are generated and their executions are checked for adherence to the consistency
model. It would be extremely beneficial to execute such tests under simulation,
i.e. when the functional design implementation of the hardware is being prototyped.
Most prior verification methodologies, however, target post-silicon environments, which
when used for simulation-based memory consistency verification would be too slow.
We propose McVerSi, a test generation framework for fast memory consistency
verification of a full-system design implementation under simulation. Our primary
contribution is a Genetic Programming (GP) based approach to memory consistency test
generation, which relies on a novel crossover function that prioritizes memory operations
contributing to non-determinism, thereby increasing the probability of uncovering
memory consistency bugs. To guide tests towards exercising as much logic as possible,
the simulator’s reported coverage is used as the fitness function. Furthermore, we
increase test throughput by making the test workload simulation-aware. We evaluate
our proposed framework using the Gem5 cycle accurate simulator in full-system mode
with Ruby (with configurations that use Gem5’s MESI protocol, and our proposed
TSO-CC together with an out-of-order pipeline). We discover 2 new bugs in the MESI
protocol due to the faulty interaction of the pipeline and the cache coherence protocol,
highlighting that even conventional protocols should be verified rigorously in the
context of a full-system. Crucially, these bugs would not have been discovered through
individual verification of the pipeline or the coherence protocol. We study 11 bugs
in total. Our GP-based test generation approach finds all bugs consistently, therefore
providing much higher guarantees compared to alternative approaches (pseudo-random
test generation and litmus tests).
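The idea of a crossover that favours memory operations contributing to non-determinism can be sketched as follows. This is only an illustration under stated assumptions, not McVerSi's actual algorithm: a test is modelled as a flat list of (thread, kind, address) tuples, "contributing to non-determinism" is approximated as touching an address that several threads access with at least one write, and the names racy_addresses, selective_crossover and keep_other are hypothetical. The coverage-based fitness function is not modelled.

```python
import random
from typing import Dict, List, Set, Tuple

Op = Tuple[int, str, int]  # (thread id, "R" or "W", address)


def racy_addresses(test: List[Op]) -> Set[int]:
    """Addresses accessed by more than one thread, with at least one write."""
    threads: Dict[int, Set[int]] = {}
    written: Set[int] = set()
    for tid, kind, addr in test:
        threads.setdefault(addr, set()).add(tid)
        if kind == "W":
            written.add(addr)
    return {a for a, tids in threads.items() if len(tids) > 1 and a in written}


def selective_crossover(a: List[Op], b: List[Op], rng: random.Random,
                        keep_other: float = 0.5) -> List[Op]:
    """Position-wise crossover that always keeps parent a's racy operations."""
    racy = racy_addresses(a)
    child = []
    for op_a, op_b in zip(a, b):
        # Racy operations of parent a survive unconditionally; the remaining
        # slots are filled from either parent at random.
        if op_a[2] in racy or rng.random() < keep_other:
            child.append(op_a)
        else:
            child.append(op_b)
    return child


parent_a = [(0, "W", 1), (1, "R", 1), (0, "R", 2), (1, "R", 3)]
parent_b = [(0, "R", 9), (1, "R", 9), (0, "R", 9), (1, "R", 9)]
child = selective_crossover(parent_a, parent_b, random.Random(0))
```

Here the write and read of address 1 by different threads are always inherited, while the thread-private accesses may be replaced by operations from the other parent.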
Eventual Consistency: Origin and Support
Eventual consistency is demanded nowadays in geo-replicated services that need to be highly scalable and available. According to the CAP constraints, when network partitions may arise, a distributed service should choose between being strongly consistent or being highly available. Since scalable services should be available, a relaxed consistency (while the network is partitioned) is the preferred choice. Eventual consistency is not a common data-centric consistency model, but only a state convergence condition to be added to a relaxed consistency model. There are still several aspects of eventual consistency that have not been analysed in depth in previous works: 1. which are the oldest replication proposals providing eventual consistency, 2. which replica consistency models provide the best basis for building eventually consistent services, 3. which mechanisms should be considered for implementing an eventually consistent service, and 4. which are the best combinations of those mechanisms for achieving different concrete goals. This paper provides some notes on these important topics.
Notes on Theory of Distributed Systems
Notes for the Yale course CPSC 465/565 Theory of Distributed Systems
Data Management Solutions for Tackling Big Data Variety
Variety is one of the three defining characteristics of Big Data; the others being Volume and Velocity. There are several aspects of this data variety: diversity in data formats (text, video, audio) and structure (relational, graph, etc.), variety in access methodologies (OLTP, OLAP), and distribution heterogeneity within the workloads (read-heavy, high contention). Data management solutions for modern-day applications need to tackle this variety. This dissertation provides an understanding of the challenges associated with the different elements of variety, and proposes several solutions for efficiently handling its various aspects. First, the dissertation studies the challenges related to variety in data structure and access methodologies, and the resultant heterogeneity at the data infrastructure level. Applications now employ several data-processing engines with different underlying representations, like row, column, graph etc., to process their data. We propose Janus, which introduces a novel data-movement pipeline, which enables the use of different representations to support both high throughput of transactions and diverse analytics, while still ensuring consistent real-time analytics in a scale-out setting. Janus partitions the data at different representations, and allows distributed transactions and diverse partitioning strategies at the representations. Then, we propose Typhon and Cerberus, which define and enforce consistency semantics for application data spread across representations. Second, this dissertation proposes solutions for handling distribution heterogeneity within the workloads. Workloads can have skewed distributions in terms of operation-type, data access or temporal variation. We propose strongly-consistent quorum reads for Raft-like consensus protocols, which can be utilized to scale read-heavy workloads.
For supporting high contention transaction workloads, we integrate an existing dynamic timestamp allocation based concurrency control mechanism in a distributed OLTP setting, and analyze its performance. Third, we study IoT applications, which have to deal with both physical heterogeneity of the sensors, as well as diverse data-processing demands. We propose a multi-representation based architecture catering to IoT applications, and also present the initial design of M-stream, a computation framework for enabling integration and monitoring of uncertain data from multiple sensors. Through analysis, illustrative examples and extensive evaluation of the proposed protocols, this dissertation demonstrates that the proposed solutions can be employed for efficiently handling the different aspects of variety of data-intensive applications.
A distributed analysis and monitoring framework for the compact Muon solenoid experiment and a pedestrian simulation
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The design of a parallel and distributed computing system is a very complicated task. It requires a detailed understanding of the design issues and of the theoretical and practical aspects of their solutions. Firstly, this thesis discusses in detail the major concepts and components required to make parallel and distributed computing a reality. A multithreaded and distributed framework capable of analysing the simulation data produced by a pedestrian simulation software was developed. Secondly, this thesis discusses the origins and fundamentals of Grid computing and the motivations for its use in High Energy Physics. Access to the data produced by the Large Hadron Collider (LHC) has to be provided for more than five thousand scientists all over the world. Users who run analysis jobs on the Grid do not necessarily have expertise in Grid computing. Simple, user-friendly and reliable monitoring of the analysis jobs is one of the key components of the operations of the distributed analysis; reliable monitoring is one of the crucial components of the Worldwide LHC Computing Grid for providing the functionality and performance that is required by the LHC experiments. The CMS Dashboard Task Monitoring and the CMS Dashboard Job Summary monitoring applications were developed to serve the needs of the CMS community.
Location Awareness in Multi-Agent Control of Distributed Energy Resources
The integration of Distributed Energy Resource (DER) technologies such as heat pumps, electric vehicles and small-scale generation into the electricity grid at the household level is limited by technical constraints. This work argues that location is an important aspect for the control and integration of DER and that network topology can be inferred without the use of a centralised network model. It addresses DER integration challenges by presenting a novel approach that uses a decentralised multi-agent system where equipment controllers learn and use their location within the low-voltage section of the power system.
Models of electrical networks exhibiting technical constraints were developed. Through theoretical analysis and real network data collection, various sources of location data were identified and new geographical and electrical techniques were developed for deriving network topology using Global Positioning System (GPS) and 24-hour voltage logs. The multi-agent system paradigm and societal structures were examined as an approach to a multi-stakeholder domain and congregations were used as an aid to decentralisation in a non-hierarchical, non-market-based approach. Through formal description of the agent attitude INTEND2, the novel technique of Intention Transfer was applied to an agent congregation to provide an opt-in, collaborative system.
Test facilities for multi-agent systems were developed and culminated in a new embedded controller test platform that integrated a real-time dynamic electrical network simulator to provide a full-feedback system integrated with control hardware. Finally, a multi-agent control system was developed and implemented that used location data in providing demand-side response to a voltage excursion, with the goals of improving power quality, reducing generator disconnections, and deferring network reinforcement.
The resulting communicating and self-organising energy agent community, as demonstrated on a unique hardware-in-the-loop platform, provides an application model and test facility to inspire agent-based, location-aware smart grid applications across the power systems domain.
System support for object replication in distributed systems
Distributed systems are composed of a collection of cooperating but failure prone system components. The number of components in such systems is often large and, despite low probabilities of any particular component failing, the likelihood that there will be at least a small number of failures within the system at a given time is high. Therefore, distributed systems must be able to withstand partial failures. By being resilient to partial failures, a distributed system becomes more able to offer a dependable service and therefore more useful. Replication is a well known technique used to mask partial failures and increase reliability in distributed computer systems. However, replication management requires sophisticated distributed control algorithms, and is therefore a labour intensive and error prone task. Furthermore, replication is in most cases employed due to applications' non-functional requirements for reliability, as dependability is generally an orthogonal issue to the problem domain of the application. If system level support for replication is provided, the application developer can devote more effort to application specific issues. Distributed systems are inherently more complex than centralised systems. Encapsulation and abstraction of components and services can be of paramount importance in managing their complexity. The use of object oriented techniques and languages, providing support for encapsulation and abstraction, has made development of distributed systems more manageable. In systems where applications are being developed using object-oriented techniques, system support mechanisms must recognise this, and provide support for the object-oriented approach. The architecture presented exploits object-oriented techniques to improve transparency and to reduce the application programmer involvement required to use the replication mechanisms. 
This dissertation describes an approach to implementing system support for object replication, which is distinct from other approaches such as replicated objects in that objects are not specially designed for replication. Additionally, object replication, in contrast to data replication, is a function-shipping approach and deals with the replication of both operations and data. Object replication is complicated by objects' encapsulation of local state and the arbitrary interaction patterns that may exist among objects. Although fully transparent object replication has not been achieved, my thesis is that partial system support for replication of program-level objects is practicable and assists the development of certain classes of reliable distributed applications. I demonstrate the usefulness of this approach by describing a prototype implementation and showing how it supports the development of an example toy application. To increase their flexibility, the system support mechanisms described are tailorable. The approach adopted in this work is to provide partial support for object replication, relying on some assistance from the application developer to supply application dependent functionality within particular collators for dealing with processing of results from object replicas. Care is taken to make the programming model as simple and concise as possible.
Multigrain shared memory
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998, by Donald Yeung. Includes bibliographical references (p. 197-203).
Programming models for many-core architectures: a co-design approach
Common many-core processors contain tens of cores and distributed memory. Compared to a multicore system, which only has a few tightly coupled cores sharing a single bus and memory, several complex problems arise. Notably, many cores require many parallel tasks to fully utilize the cores, and communication happens in a distributed and decentralized way. Therefore, programming such a processor requires the application to exhibit concurrency. In contrast to a single-core application, a concurrent application has to deal with memory state changes with an observable (non-deterministic) intermediate state. The complexity introduced by these problems makes programming a many-core system with a single-core-based programming approach notoriously hard.
The central concept of this thesis is that abstractions, which are related to (many-core) programming, are structured in a single platform model. A platform is a layered view of the hardware, a memory model, a concurrency model, a model of computation, and compile-time and run-time tooling. Then, a programming model is a specific view on this platform, which is used by a programmer. In this view, some details can be hidden from the programmer's perspective, some details cannot. For example, an operating system presents an infinite number of parallel virtual execution units to the application whilst it hides details regarding scheduling. On the other hand, a programmer usually has to balance the workload among threads by hand.
This thesis presents modifications to different abstraction layers of a many-core architecture, in order to make the system as a whole more efficient, and to reduce the programming complexity. These modifications influence other abstractions in the platform, and especially the programming model. Therefore, this thesis applies co-design on all models. Notably, co-design of the memory model, concurrency model, and model of computation is required for a scalable implementation of lambda-calculus. Moreover, only the combination of requirements of the many-core hardware from one side and the concurrency model from the other leads to a memory model abstraction. Hence, this thesis shows that to cope with the current trends in many-core architectures from a programming perspective, it is essential and feasible to inspect and adapt all abstractions collectively.