Pond: A robust, scalable, massively parallel computer architecture
A new computer architecture, intended for implementation in late- and post-silicon technologies, is proposed. The architecture is a fine-grained, inherently parallel system consisting of a large grid of thousands or millions of simple atomic processors (APs) employing a simple instruction set. Each AP is configured as either a program instruction or a data storage element. These elements are organized into logical entities, analogous to traditional programming functions/methods and data structures. Programming work is underway to compile and run programs from traditional sequential code, with parallelism automatically discovered at both the instruction and function levels and integrated into the object code that is then sent to the processor. The result is a massively parallel architecture that fully exploits instruction- and thread-level parallelism. The architecture design is presented, in-progress work on converting existing code is discussed, and examples are shown to indicate the speedup potential of this new architecture compared to current architectures.
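As a rough illustration of the execution model this abstract describes (not the paper's actual design), the sketch below models APs that each hold either one instruction or one data value, grouped into a logical entity analogous to a function. The class names and the toy two-operation ISA are assumptions for illustration only.

```python
# Minimal sketch of a Pond-style grid of atomic processors.
# AtomicProcessor, LogicalEntity, and the toy ISA are illustrative
# assumptions, not the paper's implementation.

from dataclasses import dataclass

@dataclass
class AtomicProcessor:
    role: str                 # each AP is either an 'instr' or 'data' element
    payload: tuple = ()       # for instruction APs: (op, dst, src)
    value: int = 0            # for data APs: the stored word

class LogicalEntity:
    """A chain of instruction APs, analogous to a traditional function."""
    def __init__(self, instr_aps):
        self.instr_aps = instr_aps

    def run(self, data_aps):
        # Each instruction AP fires on the shared data APs; in the real
        # architecture many such entities would execute in parallel.
        for ap in self.instr_aps:
            op, dst, src = ap.payload
            if op == "mov":
                data_aps[dst].value = data_aps[src].value
            elif op == "add":
                data_aps[dst].value += data_aps[src].value

# Two data APs and a two-instruction entity: y = x; y = y + x
data = {"x": AtomicProcessor("data", value=21),
        "y": AtomicProcessor("data")}
entity = LogicalEntity([AtomicProcessor("instr", ("mov", "y", "x")),
                        AtomicProcessor("instr", ("add", "y", "x"))])
entity.run(data)
print(data["y"].value)  # 42
```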
Designing with RoBs for High Performance VLIW Architecture
VLIW architectures have become widespread due to the combined benefits of simple hardware and compiler-extracted instruction-level parallelism. The VLIW instruction set architecture and its hardware implementation are tightly coupled, and a novel simultaneous multithreading VLIW architecture with a dynamic dispatch mechanism that uses the complex logic of reorder buffers (RoBs) to maximize ILP has been proposed. Since the dynamic instruction schedule of many applications seldom changes, it is reasonable to store and reuse the schedule instead of reconstructing it each time. The new VLIW architecture shows that it can effectively increase processor efficiency and thereby improve performance.
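The store-and-reuse idea can be sketched as a schedule cache: the dynamic schedule derived for a block is built once and replayed on later visits rather than re-derived through the RoB each time. The dependence analysis below is deliberately naive and purely illustrative; the function names and block encoding are assumptions.

```python
# Sketch of schedule reuse: cache the issue groups produced for a
# block of instructions, keyed by its start PC.

schedule_cache = {}  # block start PC -> list of issue groups

def build_schedule(block):
    """Greedy list scheduling: group independent ops into wide issues."""
    groups, written = [], set()
    for op, dst, srcs in block:
        # Start a new issue group when this op reads a value produced
        # by an op already placed in the current group.
        if not groups or any(s in written for s in srcs):
            groups.append([])
            written = set()
        groups[-1].append((op, dst, srcs))
        written.add(dst)
    return groups

def fetch_schedule(pc, block):
    if pc not in schedule_cache:       # slow path: construct once
        schedule_cache[pc] = build_schedule(block)
    return schedule_cache[pc]          # fast path: reuse the stored schedule

block = [("mul", "t0", ("a", "b")),
         ("add", "t1", ("c", "d")),    # independent of the mul
         ("add", "t2", ("t0", "t1"))]  # depends on both
print(fetch_schedule(0x400, block))    # two issue groups
```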
A case for merging the ILP and DLP paradigms
The goal of this paper is to show that instruction-level parallelism (ILP) and data-level parallelism (DLP) can be merged in a single architecture to execute vectorizable code at a performance level that cannot be achieved using either paradigm on its own. We show that the combination of the two techniques yields very high performance at low cost and low complexity, and that this architecture can reach a performance equivalent to a superscalar processor sustaining 10 instructions per cycle. The machine exploiting both types of parallelism improves upon the ILP-only machine by factors of 1.5-1.8. We also present a study of the scalability of both paradigms and show that, when resources are increased to reach a 16-issue machine, the advantage of the ILP+DLP machine over the ILP-only machine grows to 2.0-3.45. While the peak IPC achieved by the ILP machine is 4, the ILP+DLP machine exceeds 10 instructions per cycle.
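An Amdahl-style back-of-the-envelope model shows why adding DLP on top of ILP yields factors in this range rather than the full vector length: only the vectorizable fraction of the work benefits. The vectorizable fraction and vector length below are assumed for illustration, not taken from the paper's experiments.

```python
# Amdahl-style model of ILP+DLP speedup over an ILP-only machine:
# the vectorizable fraction runs vlen times faster, the rest is unchanged.

def speedup(vec_fraction, vlen):
    return 1.0 / ((1.0 - vec_fraction) + vec_fraction / vlen)

# With half the dynamic work vectorizable at vlen=16, the combined
# machine gains roughly 1.9x -- the same order as the 1.5-1.8x
# factors reported in the abstract.
print(f"{speedup(0.5, 16):.2f}x")
```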
A system-on-chip vector multiprocessor for transmission line modelling acceleration
We discuss a configurable, System-on-Chip vector multiprocessor for accelerating the Transmission Line Modeling (TLM) algorithm, with an architecture capable of exploiting the two primary forms of parallelism in the code: thread- and data-level parallelism. Theoretical results demonstrate an order-of-magnitude reduction in the dynamic instruction count for a scalar-processor/vector-coprocessor configuration at a vector length of sixteen 32-bit single-precision elements. Furthermore, a multi-vector SoC architecture consisting of ten such vector accelerators provides a near-linear theoretical performance benefit of the order of 88% in three out of four benchmark configurations, which is orthogonal to the benefit realized by vectorization alone. We discuss this potent architecture in detail and present implementation data for the 2-way multiprocessor VLSI macrocell.
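The dynamic-instruction-count argument can be made concrete with a simple count: at a vector length of 16, each vector instruction replaces 16 scalar operations plus most of the per-element loop overhead. The generic stencil-style op counts below stand in for the TLM scatter/gather kernel and are assumptions, not the paper's measured figures.

```python
# Dynamic instruction count: scalar loop vs. vectorized loop (vlen=16).
# ops_per_node and loop_overhead are illustrative assumptions.

def scalar_instr_count(n, ops_per_node, loop_overhead=3):
    # every node pays compute ops plus index/branch overhead
    return n * (ops_per_node + loop_overhead)

def vector_instr_count(n, ops_per_node, vlen=16, loop_overhead=3):
    iters = n // vlen                 # one vector instr covers vlen nodes
    return iters * (ops_per_node + loop_overhead)

n, ops = 1 << 20, 8
ratio = scalar_instr_count(n, ops) / vector_instr_count(n, ops)
print(f"{ratio:.0f}x fewer dynamic instructions")  # ~16x: an order of magnitude
```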
Precise exception handling for a self-timed processor
Self-timed systems structured as multiple concurrent processes and communicating through self-timed queues are a convenient way to implement decoupled computer architectures. Machines of this type can exploit instruction level parallelism in a natural way, and can be easily modified and extended. However, providing a precise exception model for a self-timed micropipelined processor can be difficult, since the processor state does not change at uniformly discrete intervals. We present a precise exception method implemented for Fred, a self-timed, decoupled, pipelined computer architecture with out-of-order instruction completion.
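The generic sketch below illustrates the problem Fred addresses rather than Fred's actual self-timed mechanism: when instructions complete out of order, precision is recovered by retiring results strictly in program order, so an exception discards all younger, already-completed work. The queue structure and names are assumptions.

```python
# Precise exceptions with out-of-order completion: results may arrive
# in any order, but retirement proceeds from the program-order head
# and stops at the first faulting instruction.

from collections import OrderedDict

class CommitQueue:
    def __init__(self, n_instrs):
        self.slots = OrderedDict((i, None) for i in range(n_instrs))

    def complete(self, idx, result, exception=None):
        self.slots[idx] = (result, exception)   # may arrive out of order

    def retire(self):
        """Commit from the head; stop precisely at the first exception."""
        committed = []
        for idx, entry in list(self.slots.items()):
            if entry is None:
                break                            # head not yet complete
            result, exc = entry
            if exc:
                return committed, (idx, exc)     # precise exception point
            committed.append((idx, result))
            del self.slots[idx]
        return committed, None

q = CommitQueue(4)
q.complete(2, "r2")                  # finishes early, cannot retire yet
q.complete(0, "r0")
q.complete(1, None, exception="page fault")
print(q.retire())                    # commits instr 0, faults precisely at 1
```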
Application of compiler-assisted multiple instruction rollback recovery to speculative execution
Speculative execution is a method to increase instruction-level parallelism that can be exploited by both superscalar and VLIW architectures. The key to a successful general speculation strategy is a repair mechanism to handle mispredicted branches and accurate reporting of exceptions for speculated instructions. Multiple instruction rollback is a technique developed for recovery from transient processor failure. Many of the difficulties encountered during recovery from branch misprediction, or from instruction re-execution due to an exception in a speculative execution architecture, are similar to those encountered during multiple instruction rollback. The applicability of a recently developed compiler-assisted multiple instruction rollback scheme to aid in speculative execution repair is investigated. Extensions to the compiler-assisted scheme to support branch and exception repair are presented, along with performance measurements across ten application programs.
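The shared repair idea can be sketched as follows: treat a mispredicted branch like a transient fault and roll back by restoring the old value of every register written since the speculation point. The undo-log scheme below is illustrative of rollback in general, not the paper's compiler-assisted analysis.

```python
# Multiple-instruction rollback via an undo log: every speculative
# register write records the prior value; rollback undoes the writes
# in reverse order, restoring the pre-speculation state.

class RollbackWindow:
    def __init__(self, regfile):
        self.regs = regfile
        self.undo_log = []          # (reg, old_value), most recent last

    def write(self, reg, value):
        self.undo_log.append((reg, self.regs[reg]))
        self.regs[reg] = value

    def commit(self):
        self.undo_log.clear()       # speculation confirmed, state is final

    def rollback(self):
        # Undo in reverse order, as multiple-instruction rollback requires.
        while self.undo_log:
            reg, old = self.undo_log.pop()
            self.regs[reg] = old

regs = {"r1": 10, "r2": 20}
w = RollbackWindow(regs)
w.write("r1", 99)                   # speculated past a branch
w.write("r2", 77)
w.rollback()                        # branch mispredicted: repair
print(regs)                         # {'r1': 10, 'r2': 20}
```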
Performance evaluation of CSMT for VLIW processors
Clustered VLIW embedded processors have become widespread due to the benefits of simple hardware and low power. However, while some applications exhibit large amounts of instruction-level parallelism (ILP) and benefit from very wide machines, others have little ILP, which wastes precious resources in wide processors. Simultaneous multithreading (SMT) is a well-known technique that improves resource utilization by exploiting thread-level parallelism at the instruction grain. However, implementing SMT for VLIWs requires complex structures. CSMT (Cluster-level Simultaneous MultiThreading) allows some degree of SMT in clustered VLIW processors. CSMT uses the set of operations that execute simultaneously in a given cluster (named a bundle) as the assignment unit, and all bundles belonging to a VLIW instruction from a given thread are issued simultaneously. To minimize cluster conflicts between threads, a very simple hardware-based cluster renaming mechanism is proposed. The experimental results show that CSMT significantly improves ILP compared with other multithreading approaches suited to VLIW. For instance, with 4 threads, CSMT shows an average speedup of 113% over a single-thread VLIW architecture and 36% over interleaved multithreading (IMT). In some cases the speedup can be as high as 228% over the single-thread architecture and 97% over IMT. CSMT for a 2-thread processor also achieves almost the same performance as IMT for a 4-thread processor, and outperforms it in some cases.
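A minimal sketch of the CSMT issue step, under stated assumptions: bundles from different threads are merged into one cycle when their clusters are disjoint, a thread issues only if all of its bundles fit (matching the all-bundles-issue-together rule above), and a trivial rotation stands in for the paper's cluster renaming hardware.

```python
# CSMT-style bundle merging with a toy cluster-renaming scheme.
# The rotation-based renaming and all-or-nothing policy are assumptions
# illustrating the abstract's description, not the paper's exact logic.

N_CLUSTERS = 4

def rename(bundles, thread_id):
    # Rotate each thread's cluster assignment to reduce conflicts.
    return {(c + thread_id) % N_CLUSTERS: ops for c, ops in bundles.items()}

def merge(threads):
    """A thread issues this cycle only if all its renamed bundles fit."""
    issue = {}                                   # cluster -> (thread, ops)
    for tid, bundles in enumerate(threads):
        renamed = rename(bundles, tid)
        if all(c not in issue for c in renamed):
            issue.update({c: (tid, ops) for c, ops in renamed.items()})
    return issue

# Both threads want cluster 0; renaming moves thread 1 to cluster 1,
# so both VLIW instructions issue in the same cycle.
t0 = {0: ["add", "mul"]}
t1 = {0: ["ld", "st"]}
print(merge([t0, t1]))   # {0: (0, ['add', 'mul']), 1: (1, ['ld', 'st'])}
```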
A novel architecture for large windows processors
Several processor architectures with large instruction windows have been proposed. They improve performance by maintaining hundreds of instructions in flight to increase the level of instruction-level parallelism (ILP). Such architectures replace the re-order buffer (ROB) with a checkpointing mechanism and out-of-order release of processor resources. Checkpointing, however, leads to imprecise state recovery on mispredicted branches and exceptions, and to frequent re-execution of current-path instructions during state recovery. It also requires large register files, complicating the renaming, allocation, and release of physical registers. This technical report proposes a new processor architecture that uses neither a traditional ROB nor checkpointing, avoids the above-mentioned problems, and has a fast, distributed state recovery mechanism. Its novel register management architecture allows the implementation of large register files with simpler and more scalable register renaming and commit, and is also key to the precise recovery mechanism.
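The abstract does not spell out its register management scheme, so the sketch below only illustrates the problem it targets: recycling physical registers by reference counting, so that a register is released as soon as its last consumer is done, independent of ROB-ordered commit. Everything here is a generic illustration, not the report's mechanism.

```python
# Out-of-order register release via reference counting: a dead physical
# register returns to the free list without waiting for in-order commit.

class RegisterFile:
    def __init__(self, n_phys):
        self.free = list(range(n_phys))
        self.refcount = {}

    def rename(self, arch_reg, rename_map):
        """Allocate a new physical register; return it and the old mapping."""
        new = self.free.pop()
        self.refcount[new] = 1
        old = rename_map.get(arch_reg)
        rename_map[arch_reg] = new
        return new, old

    def release(self, phys):
        # Recycled as soon as the last reader drops it, in any order.
        self.refcount[phys] -= 1
        if self.refcount[phys] == 0:
            self.free.append(phys)

rf, rmap = RegisterFile(8), {}
p_first, _ = rf.rename("r1", rmap)      # first definition of r1
p_second, old = rf.rename("r1", rmap)   # redefinition; old copy is dead
rf.release(old)                          # freed without ROB ordering
print(rmap["r1"] == p_second, old in rf.free)  # True True
```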