The cache coherence protocol of the data diffusion machine
The Data Diffusion Machine (DDM) is a scalable shared memory multiprocessor in which the location of a datum in the machine is completely decoupled from its address. A "snooping" data access protocol automatically duplicates and migrates data to wherever it is needed. The protocol also handles data coherence and replacement. The hardware organization consists of a hierarchy of buses and data controllers linking an arbitrary number of processors, each having a large set-associative memory. Each data controller has a set-associative directory containing status bits for data under its control. To any one processor, the rest of the system appears as a shared memory system, which makes the DDM a general architecture. The DDM is scalable in that there may be any number of levels in the hierarchy. The logical topmost bus (or any other bus) can be implemented by an unlimited number of physical buses, removing an anticipated bottleneck.
Moving the shared memory closer to the processors: DDM
Multiprocessors with shared memory are considered more general and easier
to program than message-passing machines. The scalability is, however, in favor
of the latter. There are a number of proposals showing how the poor scalability
of shared memory multiprocessors can be improved by the introduction of private
caches attached to the processors. These caches are kept consistent with each
other by cache-coherence protocols.
In this paper we introduce a new class of architectures called Cache Only
Memory Architectures (COMA). These architectures provide the programming
paradigm of the shared-memory architectures, but are believed to be more scalable.
COMAs have no physically shared memory; instead, the caches attached to
the processors contain all the memory in the system, and their size is therefore
large. A datum is allowed to be in any or many of the caches, and will automatically be moved to where it is needed by a cache-coherence protocol, which also
ensures that the last copy of a datum is never lost. The location of a datum in
the machine is completely decoupled from its address.
We also introduce one example of COMA: the Data Diffusion Machine (DDM).
The DDM is based on a hierarchical network structure, with processor/memory
pairs at its tips. Remote accesses generally cause only a limited amount of traffic
over a limited part of the machine.
The architecture is scalable in that there can be any number of levels in the
hierarchy, and that the root bus of the hierarchy can be implemented by several
buses, increasing the bandwidth.
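The COMA idea described in this abstract (data replicated on read, migrated on write, and the last copy never dropped) can be illustrated with a toy model. This is only a minimal sketch under stated assumptions, not the DDM protocol itself: the class `ToyComa`, its methods, and the "relocate to the next node" eviction policy are all hypothetical, and the bus hierarchy and directories are left out entirely.

```python
class ToyComa:
    """Toy cache-only memory: each node has only a local memory (a dict).

    Hypothetical illustration; there is no home location for an address,
    so a datum lives wherever a copy currently happens to be.
    """

    def __init__(self, num_nodes):
        self.memories = [dict() for _ in range(num_nodes)]

    def holders(self, addr):
        # Nodes that currently hold a copy of addr.
        return [n for n, mem in enumerate(self.memories) if addr in mem]

    def write(self, node, addr, value):
        # Invalidate every other copy so the writer holds the only one.
        for other in self.holders(addr):
            if other != node:
                del self.memories[other][addr]
        self.memories[node][addr] = value

    def read(self, node, addr):
        mem = self.memories[node]
        if addr not in mem:
            holders = self.holders(addr)
            if not holders:
                raise KeyError(addr)
            # Replicate the datum into the reader's local memory.
            mem[addr] = self.memories[holders[0]][addr]
        return mem[addr]

    def evict(self, node, addr):
        value = self.memories[node].pop(addr)
        if not self.holders(addr):
            # Last copy: relocate it instead of losing it (assumed policy).
            target = (node + 1) % len(self.memories)
            self.memories[target][addr] = value


machine = ToyComa(3)
machine.write(0, 0x10, "a")   # datum starts at node 0
machine.read(2, 0x10)         # node 2 gets its own copy
machine.write(2, 0x10, "b")   # node 0's copy is invalidated
machine.evict(2, 0x10)        # last copy moves to another node
```

The point of the sketch is that no address is tied to a node: after the final eviction the datum is still readable from any node, even though the node that wrote it no longer holds it.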
MGSim - Simulation tools for multi-core processor architectures
MGSim is an open source discrete event simulator for on-chip hardware
components, developed at the University of Amsterdam. It is intended to be a
research and teaching vehicle to study the fine-grained hardware/software
interactions on many-core and hardware multithreaded processors. It includes
support for core models with different instruction sets, a configurable
multi-core interconnect, multiple configurable cache and memory models, a
dedicated I/O subsystem, and comprehensive monitoring and interaction
facilities. The default model configuration shipped with MGSim implements
Microgrids, a many-core architecture with hardware concurrency management.
MGSim is furthermore written mostly in C++ and uses object classes to represent
chip components. It is optimized for architecture models that can be described
as process networks.
Comment: 33 pages, 22 figures, 4 listings, 2 tables
Analytic model of a cache-only memory architecture
An approximate analytic model of a shared memory multiprocessor with a Cache Only Memory Architecture (COMA), the bus-based Data Diffusion Machine (DDM), is presented and validated. It describes the timing and interference in the system as a function of the hardware, the protocols, the topology and the workload. Model results have been compared to results from an independent simulator. The comparison shows good model accuracy, especially for non-saturated systems, where the errors in response times and device utilizations are independent of the number of processors and remain below 10% in 90% of the simulations. Therefore, the model can be used as an average performance prediction tool that avoids expensive simulations in the design of systems with many processors.
Reducing data movement on large shared memory systems by exploiting computation dependencies
Shared memory systems are becoming increasingly complex as they typically integrate several storage devices. That brings different access latencies or bandwidth rates depending on the proximity between the cores where memory accesses are issued and the storage devices containing the requested data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist in migrating threads, memory pages or both and are generally applied by the system software. We propose techniques at the runtime system level to further mitigate the impact of NUMA effects on parallel applications' performance. We leverage runtime system metadata expressed in terms of a task dependency graph, where nodes are pieces of serial code and edges are control or data dependencies between them, to efficiently reduce data transfers. Our approach, based on graph partitioning, adds negligible overhead and is able to provide performance improvements up to 1.52× and average improvements of 1.12× with respect to the best state-of-the-art approach when deployed on a 288-core shared-memory system. Our approach reduces the coherence traffic by 2.28× on average with respect to the state-of-the-art.
This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the European Union’s Horizon 2020 research and innovation programme (grant agreements 671697 and 779877). I. Sánchez Barrera has been partially supported by the Spanish Ministry of Education, Culture and Sport under Formación del Profesorado Universitario fellowship number FPU15/03612. M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship number RYC-2016-21104.
Peer Reviewed. Postprint (published version)
Runtime-assisted cache coherence deactivation in task parallel programs
With increasing core counts, the scalability of directory-based cache coherence has become a challenging problem. To reduce the area and power needs of the directory, recent proposals reduce its size by classifying data as private or shared, and disable coherence for private data. However, existing classification methods suffer from inaccuracies and require complex hardware support with limited scalability.
This paper proposes a hardware/software co-designed approach: the runtime system identifies data that is guaranteed by the programming model semantics to not require coherence and notifies the microarchitecture. The microarchitecture deactivates coherence for this private data and powers off unused directory capacity. Our proposal reduces directory accesses to just 26% of the baseline system, and supports a 64x smaller directory with only 2.8% performance degradation. By dynamically calibrating the directory size, our proposal saves 86% of dynamic energy consumption in the directory without harming performance.
This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the European Union's Horizon 2020 research and innovation programme (grant agreements 671697 and 779877). M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship number RYC-2016-21104.
Peer Reviewed. Postprint (author's final draft)
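To make the mechanism in this abstract concrete, the following toy model shows how a runtime hint can let accesses to private data bypass the directory. It is a hypothetical sketch, not the paper's design: `ToyDirectory`, `deactivate`, `reactivate`, and the block numbering are assumptions made for illustration, and the sketch models neither directory resizing nor energy.

```python
class ToyDirectory:
    """Toy coherence directory that skips blocks declared private.

    Hypothetical illustration: a runtime system calls deactivate() for
    blocks it knows need no coherence, and reactivate() afterwards.
    """

    def __init__(self):
        self.entries = {}         # block id -> set of sharer core ids
        self.noncoherent = set()  # blocks with coherence switched off
        self.accesses = 0         # number of directory lookups performed

    def deactivate(self, blocks):
        for block in blocks:
            self.noncoherent.add(block)
            # A non-coherent block no longer needs a directory entry.
            self.entries.pop(block, None)

    def reactivate(self, blocks):
        self.noncoherent.difference_update(blocks)

    def access(self, core, block):
        if block in self.noncoherent:
            return False  # served without consulting the directory
        self.accesses += 1
        self.entries.setdefault(block, set()).add(core)
        return True


directory = ToyDirectory()
directory.deactivate(range(0, 4))   # runtime: blocks 0..3 are task-private
for block in range(8):
    directory.access(0, block)      # only blocks 4..7 reach the directory
```

In this sketch half of the accesses never touch the directory and only four entries are allocated, which is the kind of effect that would allow a smaller directory; the figures quoted in the abstract come from the paper's own evaluation, not from this model.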
Architecture and Design of Medical Processor Units for Medical Networks
This paper introduces analogical and deductive methodologies for the design of
medical processor units (MPUs). From the study of evolution of numerous earlier
processors, we derive the basis for the architecture of MPUs. These specialized
processors perform unique medical functions encoded as medical operational
codes (mopcs). From a pragmatic perspective, MPUs function very close to CPUs.
Both processors have unique operation codes that command the hardware to
perform a distinct chain of subprocesses upon operands and generate a specific
result unique to the opcode and the operand(s). In medical environments, an MPU
decodes the mopcs, executes a series of medical sub-processes, and sends out
secondary commands to the medical machine. Whereas operands in a typical
computer system are numerical and logical entities, the operands in a medical
machine are objects such as patients, blood samples, tissues, operating
rooms, medical staff, medical bills, patient payments, etc. We follow the
functional overlap between the two processes and evolve the design of medical
computer systems and networks.
Comment: 17 pages