235 research outputs found
Distributed Shared Memory: A Survey of Issues and Algorithms
26 pagesA distributed shared memory (DSM) is an implementation of the shared memory
abstraction on a multicomputer architecture which has no physically shared memory.
Shared memory is important (as a programming model) not only because of the vast
number of existing applications which use it, but also because it is a more appropriate
paradigm for certain algorithms. The DSM concept was demonstrated to be viable by
Li, in IVY. Recently, there has been a surge of new projects which implement DSM in
a variety of software and hardware environments.
This paper gives an integrated overview of distributed shared memory. We discuss
theoretical lower bounds on the performance of DSM systems, design choices such as
structure and granularity, access, coherence semantics, scalability, and heterogeneity,
and open problems in DSM. In addition, we describe algorithms used to implement and
improve efficiency: reducing thrashing, eliminating false sharing, matching the coherence
protocol to the type of sharing, and relaxing the semantics of the memory coherence
provided. A spectrum of current DSM systems are used as illustrative examples
Language independent modelling of parallelism
To make programs work in parallel contexts without any hazards, programming languages require changes to their structures and compilers. One of the most complicated parts is memory models and how programming languages deal with memory interactions. Different processors provide a different level of safety guarantees (i.e. ARM provides relaxed whereas Intel provides strong guarantees). On the other hand, different programming languages provide different structures for parallel computation and have individual protocols for communicating with parallel processes. Unfortunately, no specific choice is best in all situations. This thesis focuses on memory models of various programming languages and processors highlighting some positive and negative features from the point of view of programmability, performance and portability. In order to give some evidence of problems and performance bottlenecks, some small programs have been developed. This thesis also concentrates on incorrect behaviors, especially on data race conditions in programs, providing suggestions on how to avoid them. Also, some litmus tests on systems featuring different vendors' processors were performed to observe data races on each system. Nowadays programming paradigms also became a big issue. Some of the programming styles support observable non-determinism which is the main reason for incorrect behavior in programs. In this thesis, different programming models are also discussed based on the current state of the available research. Also, the imperative and functional paradigms in different contexts are compared. Finally, a mathematical problem was solved using two different paradigms to provide some practical evidence of the theory
Software Coherence in Multiprocessor Memory Systems
Processors are becoming faster and multiprocessor memory interconnection systems are not keeping up. Therefore, it is necessary to have threads and the memory they access as near one another as possible. Typically, this involves putting memory or caches with the processors, which gives rise to the problem of coherence: if one processor writes an address, any other processor reading that address must see the new value. This coherence can be maintained by the hardware or with software intervention. Systems of both types have been built in the past; the hardware-based systems tended to outperform the software ones. However, the ratio of processor to interconnect speed is now so high that the extra overhead of the software systems may no longer be significant. This issue is explored both by implementing a software maintained system and by introducing and using the technique of offline optimal analysis of memory reference traces. It finds that in properly built systems, software maintained coherence can perform comparably to or even better than hardware maintained coherence. The architectural features necessary for efficient software coherence to be profitable include a small page size, a fast trap mechanism, and the ability to execute instructions while remote memory references are outstanding
Least space-time first scheduling algorithm : scheduling complex tasks with hard deadline on parallel machines
Both time constraints and logical correctness are essential to real-time systems and failure to specify and observe a time constraint may result in disaster. Two orthogonal issues arise in the design and analysis of real-time systems: one is the specification of the system, and the semantic model describing the properties of real-time programs; the other is the scheduling and allocation of resources that may be shared by real-time program modules.
The problem of scheduling tasks with precedence and timing constraints onto a set of processors in a way that minimizes maximum tardiness is here considered. A new scheduling heuristic, Least Space Time First (LSTF), is proposed for this NP-Complete problem. Basic properties of LSTF are explored; for example, it is shown that (1) LSTF dominates Earliest-Deadline-First (EDF) for scheduling a set of tasks on a single processor (i.e., if a set of tasks are schedulable under EDF, they are also schedulable under LSTF); and (2) LSTF is more effective than EDF for scheduling a set of independent simple tasks on multiple processors.
Within an idealized framework, theoretical bounds on maximum tardiness for scheduling algorithms in general, and tighter bounds for LSTF in particular, are proven for worst case behavior. Furthermore, simulation benchmarks are developed, comparing the performance of LSTF with other scheduling disciplines for average case behavior.
Several techniques are introduced to integrate overhead (for example, scheduler and context switch) and more realistic assumptions (such as inter-processor communication cost) in various execution models. A workload generator and symbolic simulator have been implemented for comparing the performance of LSTF (and a variant -- LSTF+) with that of several standard scheduling algorithms.
LSTF\u27s execution model, basic theories, and overhead considerations have been defined and developed. Based upon the evidence, it is proposed that LSTF is a good and practical scheduling algorithm for building predictable, analyzable, and reliable complex real-time systems.
There remain some open issues to be explored, such as relaxing some current restrictions, discovering more properties and theorems of LSTF under different models, etc. We strongly believe that LSTF can be a practical scheduling algorithm in the near future
Memory consistency directed cache coherence protocols for scalable multiprocessors
The memory consistency model, which formally specifies the behavior of the
memory system, is used by programmers to reason about parallel programs. From a
hardware design perspective, weaker consistency models permit various optimizations
in a multiprocessor system: this thesis focuses on designing and optimizing the cache
coherence protocol for a given target memory consistency model.
Traditional directory coherence protocols are designed to be compatible with the
strictest memory consistency model, sequential consistency (SC). When they are used
for chip multiprocessors (CMPs) that provide more relaxed memory consistency models,
such protocols turn out to be unnecessarily strict. Usually, this comes at the cost of
scalability, in terms of per-core storage due to sharer tracking, which poses a problem
with increasing number of cores in today’s CMPs, most of which no longer are sequentially
consistent. The recent convergence towards programming language based relaxed
memory consistency models has sparked renewed interest in lazy cache coherence
protocols. These protocols exploit synchronization information by enforcing coherence
only at synchronization boundaries via self-invalidation. As a result, such protocols do
not require sharer tracking which benefits scalability. On the downside, such protocols
are only readily applicable to a restricted set of consistency models, such as Release
Consistency (RC), which expose synchronization information explicitly. In particular,
existing architectures with stricter consistency models (such as x86) cannot readily
make use of lazy coherence protocols without either: adapting the protocol to satisfy
the stricter consistency model; or changing the architecture’s consistency model to (a
variant of) RC, typically at the expense of backward compatibility. The first part of
this thesis explores both these options, with a focus on a practical approach satisfying
backward compatibility.
Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and
SPARC processors, and existing parallel programs written for these architectures, we
first propose TSO-CC, a lazy cache coherence protocol for the TSO memory consistency
model. TSO-CC does not track sharers and instead relies on self-invalidation and
detection of potential acquires (in the absence of explicit synchronization) using per
cache line timestamps to efficiently and lazily satisfy the TSO memory consistency
model. Our results show that TSO-CC achieves, on average, performance comparable
to a MESI directory protocol, while TSO-CC’s storage overhead per cache line scales
logarithmically with increasing core count.
Next, we propose an approach for the x86-64 architecture, which is a compromise
between retaining the original consistency model and using a more storage efficient
lazy coherence protocol. First, we propose a mechanism to convey synchronization
information via a simple ISA extension, while retaining backward compatibility with
legacy codes and older microarchitectures. Second, we propose RC3 (based on TSOCC),
a scalable cache coherence protocol for RCtso, the resulting memory consistency
model. RC3 does not track sharers and relies on self-invalidation on acquires. To
satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1
timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving
performance comparable to a MESI directory protocol for RC optimized programs.
RC3’s storage overhead per cache line scales logarithmically with increasing core count
and reduces on-chip coherence storage overheads by 45% compared to TSO-CC.
Finally, it is imperative that hardware adheres to the promised memory consistency
model. Indeed, consistency directed coherence protocols cannot use conventional coherence
definitions (e.g. SWMR) to be verified against, and few existing verification
methodologies apply. Furthermore, as the full consistency model is used as a specification,
their interaction with other components (e.g. pipeline) of a system must not be
neglected in the verification process. Therefore, verifying a system with such protocols
in the context of interacting components is even more important than before. One
common way to do this is via executing tests, where specific threads of instruction
sequences are generated and their executions are checked for adherence to the consistency
model. It would be extremely beneficial to execute such tests under simulation,
i.e. when the functional design implementation of the hardware is being prototyped.
Most prior verification methodologies, however, target post-silicon environments, which
when used for simulation-based memory consistency verification would be too slow.
We propose McVerSi, a test generation framework for fast memory consistency
verification of a full-system design implementation under simulation. Our primary
contribution is a Genetic Programming (GP) based approach to memory consistency test
generation, which relies on a novel crossover function that prioritizes memory operations
contributing to non-determinism, thereby increasing the probability of uncovering
memory consistency bugs. To guide tests towards exercising as much logic as possible,
the simulator’s reported coverage is used as the fitness function. Furthermore, we
increase test throughput by making the test workload simulation-aware. We evaluate
our proposed framework using the Gem5 cycle accurate simulator in full-system mode
with Ruby (with configurations that use Gem5’s MESI protocol, and our proposed
TSO-CC together with an out-of-order pipeline). We discover 2 new bugs in the MESI
protocol due to the faulty interaction of the pipeline and the cache coherence protocol,
highlighting that even conventional protocols should be verified rigorously in the
context of a full-system. Crucially, these bugs would not have been discovered through
individual verification of the pipeline or the coherence protocol. We study 11 bugs
in total. Our GP-based test generation approach finds all bugs consistently, therefore
providing much higher guarantees compared to alternative approaches (pseudo-random
test generation and litmus tests)
Fault tolerant architectures for integrated aircraft electronics systems
Work into possible architectures for future flight control computer systems is described. Ada for Fault-Tolerant Systems, the NETS Network Error-Tolerant System architecture, and voting in asynchronous systems are covered
Performance Analysis of Live-Virtual-Constructive and Distributed Virtual Simulations: Defining Requirements in Terms of Temporal Consistency
This research extends the knowledge of live-virtual-constructive (LVC) and distributed virtual simulations (DVS) through a detailed analysis and characterization of their underlying computing architecture. LVCs are characterized as a set of asynchronous simulation applications each serving as both producers and consumers of shared state data. In terms of data aging characteristics, LVCs are found to be first-order linear systems. System performance is quantified via two opposing factors; the consistency of the distributed state space, and the response time or interaction quality of the autonomous simulation applications. A framework is developed that defines temporal data consistency requirements such that the objectives of the simulation are satisfied. Additionally, to develop simulations that reliably execute in real-time and accurately model hierarchical systems, two real-time design patterns are developed: a tailored version of the model-view-controller architecture pattern along with a companion Component pattern. Together they provide a basis for hierarchical simulation models, graphical displays, and network I/O in a real-time environment. For both LVCs and DVSs the relationship between consistency and interactivity is established by mapping threads created by a simulation application to factors that control both interactivity and shared state consistency throughout a distributed environment
Study of fault-tolerant software technology
Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance
- …