424 research outputs found
The Effects of the architectural design, replacement algorithm, and size parameters of cache memory in uniprocessor computer systems
To investigate the effects of cache coherency on multiprocessors, it is helpful to first explore coherency issues within uniprocessors, working with a small part of a big problem instead of attacking the big problem from the start. This thesis will investigate the design and implementation of three different cache designs, varying the mapping strategy, replacement algorithm, and size parameters to determine the effects each have on the cache miss ratio, coherency, and average memory access time. VHDL is used to create software models of each cache design investigated, so that parameter values can be easily changed, and so that no money or time is wasted by first prototyping the cache design in actual hardware. These VHDL implementations are presented, along with several test-bench programs that were used to not only validate the performance of the VHDL implementation, but also to explore the program-dependent performance factors and coherency. Several snoopy cache coherence protocols are presented at the end of the thesis, in order to suggest future research into the VHDL implementation of shared-memory multiprocessors
Exploring the value of supporting multiple DSM protocols in Hardware DSM Controllers
Journal ArticleThe performance of a hardware distributed shared memory (DSM) system is largely dependent on its architect's ability to reduce the number of remote memory misses that occur. Previous attempts to solve this problem have included measures such as supporting both the CC-NUMA and S-COMA architectures is the same machine and providing a programmable DSM controller that can emulate any DSM mechanism. In this paper we first present the design of a DSM controller that supports multiple DSM protocols in custom hardware, and allows the programmer or compiler to specify on a per-variable basis what protocol to use to keep that variable coherent. This simulated performance of this DSM controller compares favorably with that of conventional single-protocol custom hardware designs, often outperforming the conventional systems by a factor of two. To achieve these promising results, that multi-protocol DSM controller needed to support only two DSM architectures (CC-NUMA and S-COMA) and three coherency protocols (both release and sequentially consistent write invalidate and release consistent write update). This work demonstrates the value of supporting a degree of flexibility in one's DSM controller design and suggests what operations such a flexible DSM controller should support
DIRECTORY BASED CACHE COHERENCY, ORGANIZATION, OPERATIONS AND CHALLENGES IN IMPLEMENTATION - STUDY
Todays systems are designed with Multi Core Architecture. The idea behind this is to achieve high system throuput. Once the Processor clock speed reached its saturation, designers opted for having multiple cores. Each Core or Processor equipped with their own private cache memory. But under Chip Multiprocessor, where all the processor have access to shared memory, having respective cache memory will result with Cache Coherency Problem. In Directory Protocol, for each block of data there is a directory entry that contains a number of pointers. The purpose of this number is to mention the locations of block copies. The important advantage of directory based protocols is that they scale much better than snoopy protocols. In addition to this it has the advantage of ability to exploit arbitrary point-to-point interconnects. But mean time it also has the overhead in terms of the storage and manipulation of directory state. This paper discus different Directory Based implementation, operations along with and its implementation difficulties
Extending and Implementing the Self-adaptive Virtual Processor for Distributed Memory Architectures
Many-core architectures of the future are likely to have distributed memory
organizations and need fine grained concurrency management to be used
effectively. The Self-adaptive Virtual Processor (SVP) is an abstract
concurrent programming model which can provide this, but the model and its
current implementations assume a single address space shared memory. We
investigate and extend SVP to handle distributed environments, and discuss a
prototype SVP implementation which transparently supports execution on
heterogeneous distributed memory clusters over TCP/IP connections, while
retaining the original SVP programming model
Castell: a heterogeneous cmp architecture scalable to hundreds of processors
Technology improvements and power constrains have taken multicore architectures to dominate
microprocessor designs over uniprocessors. At the same time, accelerator based architectures
have shown that heterogeneous multicores are very efficient and can provide high throughput for
parallel applications, but with a high-programming effort. We propose Castell a scalable chip
multiprocessor architecture that can be programmed as uniprocessors, and provides the high
throughput of accelerator-based architectures.
Castell relies on task-based programming models that simplify software development. These
models use a runtime system that dynamically finds, schedules, and adds hardware-specific features
to parallel tasks. One of these features is DMA transfers to overlap computation and data
movement, which is known as double buffering. This feature allows applications on Castell
to tolerate large memory latencies and lets us design the memory system focusing on memory
bandwidth.
In addition to provide programmability and the design of the memory system, we have used
a hierarchical NoC and added a synchronization module. The NoC design distributes memory
traffic efficiently to allow the architecture to scale. The synchronization module is a consequence
of the large performance degradation of application for large synchronization latencies.
Castell is mainly an architecture framework that enables the definition of domain-specific
implementations, fine-tuned to a particular problem or application. So far, Castell has been
successfully used to propose heterogeneous multicore architectures for scientific kernels, video
decoding (using H.264), and protein sequence alignment (using Smith-Waterman and clustalW).
It has also been used to explore a number of architecture optimizations such as enhanced DMA
controllers, and architecture support for task-based programming models.
ii
- …