414 research outputs found
A VLSI architecture for enhancing software reliability
As a solution to the software crisis, we propose an architecture that supports and encourages the use of programming techniques and mechanisms for enhancing software reliability. The proposed architecture provides efficient mechanisms for detecting a wide variety of run-time errors, for supporting data abstraction, module-based programming and encourages the use of small protection domains through a highly efficient capability mechanism. The proposed architecture also provides efficient support for user-specified exception handlers and both event-driven and trace-driven debugging mechanisms. The shortcomings of the existing capability-based architectures that were designed with a similar goal in mind are examined critically to identify their problems with regard to capability translation, domain switching, storage management, data abstraction and interprocess communication. Assuming realistic VLSI implementation constraints, an instruction set for the proposed architecture is designed. Performance estimates of the proposed system are then made from the microprograms corresponding to these instructions based on observed characteristics of similar systems and language usage. A comparison of the proposed architecture with similar ones, both in terms of functional characteristics and low-level performance indicates the proposed design to be superior
Acceleration of the hardware-software interface of a communication device for parallel systems
During the last decades the ever growing need for computational power fostered the development of parallel computer architectures. Applications need to be parallelized and optimized to be able to exploit modern system architectures. Today, scalability of applications is more and more limited both by development resources, as programming of complex parallel applications becomes increasingly demanding, and by the fundamental scalability issues introduced by the cost of communication in distributed memory systems. Lowering the latency of communication is mandatory to increase scalability and serves as an enabling technology for programming of distributed memory systems at a higher abstraction layer using higher degrees of compiler driven automation. At the same time it can increase performance of such systems in general. In this work, the software/hardware interface and the network interface controller functions of the EXTOLL network architecture, which is specifically designed to satisfy the needs of low-latency networking for high-performance computing, is presented. Several new architectural contributions are made in this thesis, namely a new efficient method for virtual-tophysical address-translation named ATU and a novel method to issue operations to a virtual device in an optimal way which has been termed Transactional I/O. This new method needs changes in the architecture of the host CPU the device is connected to. Two additional methods that emulate most of the characteristics of Transactional I/O are developed and employed in the development of the EXTOLL hardware to facilitate usage together with contemporary CPUs. These new methods heavily leverage properties of the HyperTransport interface used to connect the device to the CPU. Finally, this thesis also introduces an optimized remote-memory-access architecture for efficient split-phase transactions and atomic operations. The complete architecture has been prototyped using FPGA technology enabling a more precise analysis and verification than is possible using simulation alone. The resulting design utilizes 95 % of a 90 nm FPGA device and reaches speeds of 200 MHz and 156 MHz in the different clock domains of the design. The EXTOLL software stack is developed and a performance evaluation of the software using the EXTOLL hardware is performed. The performance evaluation shows an excellent start-up latency value of 1.3 μs, which competes with the most advanced networks available, in spite of the technological performance handicap encountered by FPGA technology. The resulting network is, to the best of the knowledge of the author, the fastest FPGA-based interconnection network for commodity processors ever built
Rediflow architecture prospectus
Journal ArticleRediflow is intended as a multi-function (symbolic and numeric) multiprocessor, demonstrating techniques for achieving speedup for Lisp-coded problems through the use of advanced programming concepts, high-speed communication, and dynamic load-distribution, in a manner suitable for scaling to upwards of 10,000 processors. An initial physical realization is proposed employing 16 nodes (initially in a hypercube topology), with processor, memory, and intelligent switch at each node
A Compiler Target Model for Line Associative Registers
LARs (Line Associative Registers) are very wide tagged registers, used for both register-wide SWAR (SIMD Within a Register )operations and scalar operations on arbitrary fields. LARs include a large data field, type tags, source addresses, and a dirty bit, which allow them to not only replace both caches and registers in the conventional memory hierarchy, but improve on both their functions. This thesis details a LAR-based architecture, and describes the design of a compiler which can generate code for a LAR-based design. In particular, type conversion, alignment, and register allocation are discussed in detail
Integrated shared-memory and message-passing communication in the Alewife multiprocessor
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (p. 237-246) and index.by John David Kubiatowicz.Ph.D
Architectural Enhancements for Data Transport in Datacenter Systems
Datacenter systems run myriad applications, which frequently communicate with each other and/or Input/Output (I/O) devices—including network adapters, storage devices, and accelerators. Due to the growing speed of I/O devices and the emergence of microservice-based programming models, the I/O software stacks have become a critical factor in end-to-end communication performance. As such, I/O software stacks have been evolving rapidly in recent years. Datacenters rely on fast, efficient “Software Data Planes”, which orchestrate data transfer between applications and I/O devices. The goal of this dissertation is to enhance the performance, efficiency, and scalability of software data planes by diagnosing their existing issues and addressing them through hardware-software solutions.
In the first step, I characterize challenges of modern software data planes, which bypass the operating system kernel to avoid associated overheads. Since traditional interrupts and system calls cannot be delivered to user code without kernel assistance, kernel-bypass data planes use spinning cores on I/O queues to identify work/data arrival. Spin-polling obviously wastes CPU cycles on checking empty queues; however, I show that it entails even more drawbacks: (1) Full-tilt spinning cores perform more (useless) polling work when there is less work pending in the queues. (2) Spin-polling scales poorly with the number of polled queues due to processor cache capacity constraints, especially when traffic is unbalanced. (3) Spin-polling also scales poorly with the number of cores due to the overhead of polling and operation rate limits. (4) Whereas shared queues can mitigate load imbalance and head-of-line blocking, synchronization overheads of spinning on them limit their potential benefits.
Next, I propose a notification accelerator, dubbed HyperPlane, which replaces spin-polling in software data planes. Design principles of HyperPlane are: (1) not iterating on empty I/O queues to find work/data in ready ones, (2) blocking/halting when all queues are empty rather than spinning fruitlessly, and (3) allowing multiple cores to efficiently monitor a shared set of queues. These principles lead to queue scalability, work proportionality, and enjoying theoretical merits of shared queues. HyperPlane is realized with a programming model front-end and a hardware microarchitecture back-end. Evaluation of HyperPlane shows its significant advantage in terms of throughput, average/tail latency, and energy efficiency over a state-of-the-art spin-polling-based software data plane, with very small power and area overheads.
Finally, I focus on the data transfer aspect in software data planes. Cache misses incurred by accessing I/O data are a major bottleneck in software data planes. Despite considerable efforts put into delivering I/O data directly to the last-level cache, some access latency is still exposed. Cores cannot prefetch such data to nearer caches in today's systems because of the complex access pattern of data buffers and the lack of an appropriate notification mechanism that can trigger the prefetch operations. As such, I propose HyperData, a data transfer accelerator based on targeted prefetching. HyperData prefetches exact (rather than predicted) data buffers (or a required subset to avoid cache pollution) to the L1 cache of the consumer core at the right time. Prefetching can be done for both core-peripheral and core-core communications. HyperData's prefetcher is programmable and supports various queue formats—namely, direct (regular), indirect (Virtio), and multi-consumer queues. I show that with a minor overhead, HyperData effectively hides data access latency in software data planes, thereby improving both application- and system-level performance and efficiency.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/169826/1/hosseing_1.pd
Recommended from our members
Cherub: A hardware distributed single shared address space memory architecture
Increased computer throughput can be achieved through the use of parallel processing. The granularity of a parallel program is the average number of instructions performed by the tasks constituting it. Coarse-grained programs typically execute huge numbers of instructions per task (w 105). The tasks in fine-grained programs are typically short (æ 103). In general, the finer the program grain, the greater the potential for exploiting parallelism. Amdahl’s Law shows that in the absence of overheads, the more potential parallelism that is realised in an algorithm, the faster it will be. The economical granularity of tasks is determined by the intertask communications overhead. Break-even occurs when processing is approximately equally divided between useful work and overhead.
The two common parallel programming paradigms are shared variable and message passing. Shared variable is, in general, the more natural of the two as it allows implicit communication between tasks. This encourages the programmer to make use of fine-grained tasks. The message passing paradigm requires explicit communication between tasks. This encourages the programmer to use coarser-grained tasks.
Two kinds of parallel architecture have become established. The first is the multiprocessor, which is built around a shared bus giving broadcast communications and a shared memory. This is characterised by low communications overhead, but limited scalability. The second is the multicomputer, which is based on point-to-point communications with larger communications overhead, but good scalability. Quantitatively, the low overhead of the multiprocessor is well matched to fine-grain tasks and, hence, to supporting the shared variable paradigm, while the high overhead of the multicomputer matches it to coarse-grain parallelism and, hence, to the message passing paradigm.
Currently, there appears to be no middle ground in parallel computing; an architecture which can support both several hundred medium-grained (« 104 instructions) parallel tasks and the shared variable programming paradigm would be advantageous in many applications.
This thesis asserts that it is possible to implement a new computer architecture, Cherub, which has at least 200 processors and is able to support shared variable programming with an optimal task granularity of around 104 instructions. This can be achieved through the combination of a hardware-based distributed shared single address space and a wafer-scale communications network.
To support the thesis, the dissertation first specifies a programmer’s interface to Cherub which is simple enough to implement in hardware. It then designs algorithms which provide this interface, allowing the requirements of the underlying network to be estimated. Finally, a wafer scale communications network is outlined, and simulations are used to demonstrate that it can provide the performance required to successfully implement Cherub
- …