31,594 research outputs found
Flexible compiler-managed L0 buffers for clustered VLIW processors
Wire delays are a major concern for current and forthcoming processors. One approach to attack this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the data cache remains centralized. However, as technology evolves, the latency of such a centralized cache increase leading to an important performance impact. In this paper, we propose to include flexible low-latency buffers in each cluster in order to reduce the performance impact of higher cache latencies. The reduced number of entries in each buffer permits the design of flexible ways to map data from L1 to these buffers. The proposed L0 buffers are managed by the compiler, which is responsible to decide which memory instructions make us of them. Effective instruction scheduling techniques are proposed to generate code that exploits these buffers. Results for the Mediabench benchmark suite show that the performance of a clustered VLIW processor with a unified L1 data cache is improved by 16% when such buffers are used. In addition, the proposed architecture also shows significant advantages over both MultiVLIW processors and clustered processors with a word-interleaved cache, two state-of-the-art designs with a distributed L1 data cache.Peer ReviewedPostprint (published version
Distributed data cache designs for clustered VLIW processors
Wire delays are a major concern for current and forthcoming processors. One approach to deal with this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the L1 data cache typically remains centralized in What we call partially distributed architectures. However, as technology evolves, the relative latency of such a centralized cache will increase, leading to an important impact on performance. In this paper, we propose partitioning the L1 data cache among clusters for clustered VLIW processors. We refer to this kind of design as fully distributed processors. In particular; we propose and evaluate three different configurations: a snoop-based cache coherence scheme, a word-interleaved cache, and flexible LO-buffers managed by the compiler. For each alternative, instruction scheduling techniques targeted to cyclic code are developed. Results for the Mediabench suite'show that the performance of such fully distributed architectures is always better than the performance of a partially distributed one with the same amount of resources. In addition, the key aspects of each fully distributed configuration are explored.Peer ReviewedPostprint (published version
Low-Power Reconfigurable Architectures for High-Performance Mobile Nodes
Modern embedded systems have an emerging demand on high
performance and low power circuits. Traditionally special functional units for
each application are developed separately. These are plugged to a general
purpose processors to extend its instruction set making it an application specific
instruction set processor. As this strategy reaches its boundaries in area and
complexity reconfigurable architectures propose to be more flexible. Thus
combining both approaches to a reconfigurable application specific processor is
going to be the upcoming solution for future embedded systems
An Experimental Microarchitecture for a Superconducting Quantum Processor
Quantum computers promise to solve certain problems that are intractable for
classical computers, such as factoring large numbers and simulating quantum
systems. To date, research in quantum computer engineering has focused
primarily at opposite ends of the required system stack: devising high-level
programming languages and compilers to describe and optimize quantum
algorithms, and building reliable low-level quantum hardware. Relatively little
attention has been given to using the compiler output to fully control the
operations on experimental quantum processors. Bridging this gap, we propose
and build a prototype of a flexible control microarchitecture supporting
quantum-classical mixed code for a superconducting quantum processor. The
microarchitecture is based on three core elements: (i) a codeword-based event
control scheme, (ii) queue-based precise event timing control, and (iii) a
flexible multilevel instruction decoding mechanism for control. We design a set
of quantum microinstructions that allows flexible control of quantum operations
with precise timing. We demonstrate the microarchitecture and microinstruction
set by performing a standard gate-characterization experiment on a transmon
qubit.Comment: 13 pages including reference. 9 figure
Generic Pipelined Processor Modeling and High Performance Cycle-Accurate Simulator Generation
Detailed modeling of processors and high performance cycle-accurate
simulators are essential for today's hardware and software design. These
problems are challenging enough by themselves and have seen many previous
research efforts. Addressing both simultaneously is even more challenging, with
many existing approaches focusing on one over another. In this paper, we
propose the Reduced Colored Petri Net (RCPN) model that has two advantages:
first, it offers a very simple and intuitive way of modeling pipelined
processors; second, it can generate high performance cycle-accurate simulators.
RCPN benefits from all the useful features of Colored Petri Nets without
suffering from their exponential growth in complexity. RCPN processor models
are very intuitive since they are a mirror image of the processor pipeline
block diagram. Furthermore, in our experiments on the generated cycle-accurate
simulators for XScale and StrongArm processor models, we achieved an order of
magnitude (~15 times) speedup over the popular SimpleScalar ARM simulator.Comment: Submitted on behalf of EDAA (http://www.edaa.com/
Maximizing resource utilization by slicing of superscalar architecture
Superscalar architectural techniques increase instruction throughput from one instruction per cycle to more than one instruction per cycle. Modern processors make use of several processing resources to achieve this kind of throughput. Control units perform various functions to minimize stalls and to ensure a continuous feed of instructions to execution units. It is vital to ensure that instructions ready for execution do not encounter a bottleneck in the execution stage; This thesis work proposes a dynamic scheme to increase efficiency of execution stage by a methodology called block slicing. Implementing this concept in a wide, superscalar pipelined architecture introduces minimal additional hardware and delay in the pipeline. The hardware required for the implementation of the proposed scheme is designed and assessed in terms of cost and delay. Performance measures of speed-up, throughput and efficiency have been evaluated for the resulting pipeline and analyzed
- …