    Simulation models of shared-memory multiprocessor systems

    Quantitative performance evaluation of SCI memory hierarchies

    A multiple-bus, active backplane architecture for multiprocessor systems

    This research investigates several problems associated with current multiprocessor interconnection networks, focusing primarily on general-purpose, shared-memory configurations. The project deals with all aspects of the interconnection, from the architectural level to the physical backplane. A bus-based architecture is presented as an alternative to the limitations of current schemes. This dissertation will focus on the physical layer implementation.

    For increased reliability, performance and scalability, a multiple-bus architecture is proposed. Each bus uses a word-serial approach to keep the total number of bus signals manageable. A source-synchronous transfer protocol allows data to be streamed at a high rate, thus increasing the pin-efficiency of the bus. The control acquisition scheme combines collision detection and priority arbitration to minimize bus access time without requiring additional signal lines. Cache coherence, message passing, and synchronization primitives are provided within the bus protocol to support multiple-processor systems.

    To reduce the capacitive loading on the bus, an active backplane is employed. This moves the transceiver and bus interface unit from the plug-in module down to the backplane. In addition to increasing the characteristic impedance of the bus, it also reduces the end-to-end propagation delay. Another advantage of moving the bus transceivers to the backplane is the uniform load presented to the bus, regardless of whether a slot is populated.

    Due to the reduction in drive current required, a custom CMOS transceiver, suitable for VLSI implementation, is used. It incorporates the collision detection circuitry required for the control acquisition scheme. Initial transceiver prototypes have been designed and fabricated in 2-µm CMOS. These have been successfully tested at transfer rates in excess of 50 MHz.
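    As a rough illustration of the hybrid control acquisition idea described above, the following Python sketch models one arbitration round: a lone requester is granted the bus immediately (collision detection), while simultaneous requests fall back to priority arbitration. The function and data layout are hypothetical and are not taken from the bus specification.

        def acquire_bus(requesters):
            """requesters: list of (node_id, priority) pairs currently asserting a request."""
            if not requesters:
                return None                          # bus stays idle
            if len(requesters) == 1:
                return requesters[0][0]              # lone driver: no collision, immediate grant
            # Collision detected: resolve deterministically by priority arbitration.
            return max(requesters, key=lambda r: r[1])[0]

        print(acquire_bus([(3, 1)]))                 # single requester wins without arbitration
        print(acquire_bus([(3, 1), (7, 5)]))         # collision: higher-priority node 7 wins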

    A logical layer protocol for ActiveBus architecture

    This research investigates several problems associated with current multiprocessor interconnection networks, focusing primarily on general-purpose, shared-memory configurations. The project deals with all aspects of the interconnection, from the architectural level to the physical backplane. A multiple-bus-based architecture is presented as an alternative to the limitations of current schemes. This dissertation will focus on the logical layer specification.

    The ActiveBus, a multiple, active-bus interconnection, is proposed. Multiple buses increase the bandwidth as well as the reliability of the interconnection, while the active backplane presents a reduced and uniform capacitive load.

    A logical layer protocol was designed for each bus to work independently, to achieve fault tolerance. Each bus uses a word-serial approach to keep the total number of bus signal lines manageable. A dual clocking scheme is proposed: the faster clock is used for data transfer, while the other clock, referred to as the sync clock, is used for arbitration and handshaking.

    The absence of discontinuities on the bus, coupled with a source-synchronous transfer protocol, allows data to be streamed at a high rate, thus increasing the pin-efficiency of the bus. The data transmission rate is limited only by clock skew. In addition, the ActiveBus interface unit and the source-synchronous protocol move the synchronization penalty from the shared bus to the private buffer in the unit.

    The protocol uses a new arbitration scheme, termed Previous Priority First. This hybrid control acquisition scheme combines collision detection and priority arbitration to minimize bus access time without requiring additional signal lines. Collision detection provides quick access in an unsaturated system, while priority arbitration guarantees the deterministic election of the master in a saturated system. The scheme also incorporates a fairness mode to minimize starvation and bus access delay in the system.

    The cache coherence scheme supports both copy-back and write-through policies to reduce overhead. The MOESI protocol with snoopy caches, being the most general, is followed. Message passing and synchronization primitives are provided within the bus protocol to support multiple-processor systems. These primitives attempt to minimize the traffic generated by spin locks or memory hot spots.
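    The coherence scheme above follows the standard MOESI protocol with snoopy caches. The sketch below encodes the conventional snoop-side state transitions for shared-read and read-exclusive bus operations, assuming a textbook MOESI definition rather than the dissertation's exact protocol tables.

        SNOOP_NEXT_STATE = {
            # (current state, observed bus operation) -> next state
            ("M", "BusRd"):  "O",   # dirty owner supplies data and keeps ownership
            ("O", "BusRd"):  "O",   # owner keeps supplying data to new sharers
            ("E", "BusRd"):  "S",   # another reader appears, drop exclusivity
            ("S", "BusRd"):  "S",
            ("M", "BusRdX"): "I",   # another cache wants exclusive (write) access
            ("O", "BusRdX"): "I",
            ("E", "BusRdX"): "I",
            ("S", "BusRdX"): "I",
        }

        def snoop(state, bus_op):
            # Invalid lines, and combinations not listed above, are unaffected by the snoop.
            return SNOOP_NEXT_STATE.get((state, bus_op), state)

        assert snoop("M", "BusRd") == "O"
        assert snoop("S", "BusRdX") == "I"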

    Castell: a heterogeneous cmp architecture scalable to hundreds of processors

    Technology improvements and power constraints have led multicore architectures to dominate microprocessor designs over uniprocessors. At the same time, accelerator-based architectures have shown that heterogeneous multicores are very efficient and can provide high throughput for parallel applications, but at the cost of a high programming effort. We propose Castell, a scalable chip multiprocessor architecture that can be programmed like a uniprocessor yet provides the high throughput of accelerator-based architectures. Castell relies on task-based programming models that simplify software development. These models use a runtime system that dynamically finds, schedules, and adds hardware-specific features to parallel tasks. One of these features is DMA transfers to overlap computation and data movement, a technique known as double buffering. This feature allows applications on Castell to tolerate large memory latencies and lets us design the memory system with a focus on memory bandwidth. In addition to providing programmability and designing the memory system, we use a hierarchical NoC and add a synchronization module. The NoC design distributes memory traffic efficiently to allow the architecture to scale. The synchronization module addresses the large performance degradation that applications suffer under long synchronization latencies. Castell is mainly an architecture framework that enables the definition of domain-specific implementations, fine-tuned to a particular problem or application. So far, Castell has been successfully used to propose heterogeneous multicore architectures for scientific kernels, video decoding (using H.264), and protein sequence alignment (using Smith-Waterman and ClustalW). It has also been used to explore a number of architecture optimizations, such as enhanced DMA controllers and architecture support for task-based programming models.
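    The double buffering mentioned above follows a well-known pattern: while one tile of data is being processed, the DMA transfer for the next tile is already in flight. The Python sketch below outlines that pattern with hypothetical dma_start/dma_wait helpers standing in for the runtime's DMA interface; it is a minimal illustration, not Castell's actual runtime code.

        def process_tiles(tiles, compute, dma_start, dma_wait):
            pending = dma_start(tiles[0])                # prefetch the first tile
            for i in range(len(tiles)):
                current = dma_wait(pending)              # block until the current tile has arrived
                if i + 1 < len(tiles):
                    pending = dma_start(tiles[i + 1])    # overlap: next transfer runs during compute
                compute(current)                         # work on the tile just received

        # Trivial stand-ins so the sketch runs without real DMA hardware.
        process_tiles([list(range(4)) for _ in range(3)],
                      compute=lambda buf: print(sum(buf)),
                      dma_start=lambda tile: tile,
                      dma_wait=lambda handle: handle)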

    Interconnect design for the edge computing system-on-chip

    Nowadays, the majority of systems-on-chip are designed by placing various IP blocks, such as CPUs, memories and accelerators, on the same chip. With advances in silicon manufacturing technology, it has become possible to place hundreds of CPU cores and other design blocks on the same chip. The communication system that transfers data between chip components largely affects overall chip performance, computational speed and response time to external events. Firstly, this thesis studies the main on-chip interconnect design paradigms. According to the presented research, various architectures may be chosen for an interconnect design depending on the required complexity and the number of subsystems. Shared and hybrid bus interconnects are among the oldest means of on-chip communication and are efficient for small systems with no more than ten IP blocks. Crossbar or bus-matrix interconnects can build on-chip communication systems that efficiently interconnect dozens of system-on-chip modules. Networks-on-chip provide a communication solution for large-scale chip designs with hundreds of IP blocks. The second part of this thesis focuses on the novel Ballast chip implementation and its interconnect design. Ballast is a heterogeneous multiprocessor chip designed for edge-computing and general-purpose computing applications. In this thesis, the Ballast interconnect was designed from scratch using a cascaded-crossbar approach that connects three open-source AXI bus matrices. The designed interconnect efficiently connects 6 bus masters with 9 slaves and provides up to 9.6 GB/s of bandwidth for the highest-performance CPU subsystem.
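    As a simplified view of one job a crossbar-based interconnect performs, the sketch below routes each transaction to a slave port by matching its address against a static memory map. The address ranges, region names and port numbers are invented for illustration and are not Ballast's actual memory map.

        MEMORY_MAP = [
            # (base address, size in bytes, slave port)
            (0x0000_0000, 0x0001_0000, 0),   # e.g. boot ROM
            (0x8000_0000, 0x4000_0000, 1),   # e.g. external DRAM
            (0xC000_0000, 0x0010_0000, 2),   # e.g. peripheral region
        ]

        def decode(addr):
            for base, size, slave in MEMORY_MAP:
                if base <= addr < base + size:
                    return slave
            raise ValueError(f"address {addr:#x} is not mapped to any slave port")

        assert decode(0x8000_0040) == 1
        assert decode(0xC000_1000) == 2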

    Simulating the data diffusion machine
