Search CORE

30 research outputs found

Hardware schemes for early register release

Author: González Colás Antonio María
Monreal Arnal Teresa
Valero Cortés Mateo
Viñals Yufera Víctor
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2002
Field of study

Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is quite related to the size and number of ports of the register file. In conventional register renaming schemes, register releasing is conservatively done only after the instruction that redefines the same register is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early releasing hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.Peer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

On the efficiency of reductions in µ-SIMD media extensions

Author: Corbal San Adrián Jesús
Espasa Sans Roger
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2001
Field of study

Many important multimedia applications contain a significant fraction of reduction operations. Although, in general, multimedia applications are characterized for having high amounts of Data Level Parallelism, reductions and accumulations are difficult to parallelize and show a poor tolerance to increases in the latency of the instructions. This is specially significant for µ-SIMD extensions such as MMX or AltiVec. To overcome the problem of reductions in µ-SIMD ISAs, designers tend to include more and more complex instructions able to deal with the most common forms of reductions in multimedia. As long as the number of processor pipeline stages grows, the number of cycles needed to execute these multimedia instructions increases with every processor generation, severely compromising performance. The paper presents an in-depth discussion of how reductions/accumulations are performed in current µ-SIMD architectures and evaluates the performance trade-offs for near-future highly aggressive superscalar processors with three different styles of µ-SIMD extensions. We compare a MMX-like alternative to a MDMX-like extension that has packed accumulators to attack the reduction problem, and we also compare it to MOM, a matrix register ISA. We show that while packed accumulators present several advantages, they introduce artificial recurrences that severely degrade performance for processors with high number of registers and long latency operations. On the other hand, the paper demonstrates that longer SIMD media extensions such as MOM can take great advantage of accumulators by exploiting the associative parallelism implicit in reductions.Peer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

Three-dimensional memory vectorization for high bandwidth media memory systems

Author: Corbal San Adrián Jesús
Espasa Sans Roger
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2002
Field of study

Vector processors have good performance, cost and adaptability when targeting multimedia applications. However, for a significant number of media programs, conventional memory configurations fail to deliver enough memory references per cycle to feed the SIMD functional units. This paper addresses the problem of the memory bandwidth. We propose a novel mechanism suitable for 2-dimensional vector architectures and targeted at providing high effective bandwidth for SIMD memory instructions. The basis of this mechanism is the extension of the scope of vectorization at the memory level, so that 3-dimensional memory patterns can be fetched into a second-level register file. By fetching long blocks of data and by reusing 2-dimensional memory streams at this second-level register file, we obtain a significant increase in the effective memory bandwidth. As side benefits, the new 3-dimensional load instructions provide a high robustness to memory latency and a significant reduction of the cache activity, thus reducing power and energy requirements. At the investment of a 50% more area than a regular SIMD register file, we have measured and average speed-up of 13% and the potential for power savings in the L2 cache of a 30%.Peer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

Hierarchical clustered register file organization for VLIW processors

Author: Ayguadé Parra Eduard
Llosa Espuny José Francisco
Valero Cortés Mateo
Zalamea León Francisco Javier
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2003
Field of study

Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first level register file. The local register files are connected to a global second level register file, which provides access to memory. All intercluster communications are done through the second level register file. This paper also proposes MIRS-HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, inserts communication operations, performs register allocation and spill insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades-off performance and technology requirements.Peer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

Late allocation and early release of physical registers

Author: González Colás Antonio María
González González José
Monreal Arnal Teresa
Valero Cortés Mateo
Viñals Yufera Víctor
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2004
Field of study

The register file is one of the critical components of current processors in terms of access time and power consumption. Among other things, the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, both register allocation and releasing are conservatively done, the former at the rename stage, before registers are loaded with values, and the latter at the commit stage of the instruction redefining the same register, once registers are not used any more. We introduce VP-LAER, a renaming scheme that allocates registers later and releases them earlier than conventional schemes. Specifically, physical registers are allocated at the end of the execution stage and released as soon as the processor realizes that there will be no further use of them. VP-LAER enhances register utilization, that is, the fraction of allocated registers having a value to be read in the future. Detailed cycle-level simulations show either a significant speedup for a given register file size or a reduction in the register file size for a given performance level, especially for floating-point codes, where the register file pressure is usually high.Peer ReviewedPostprint (published version

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tile size selection for low-power tile-based architectures

Author: Akella Venkatesh
Brown Michael
Chong Frederic T.
Franklin Diana
Mankin Jennifer
Oliver John Y.
Rao Ravishankar
Publication venue: DigitalCommons@CalPoly
Publication date: 01/01/2006
Field of study

In this paper, we investigate the power implications of tile size selection for tile-based processors. We refer to this investigation as a tile granularity study. This is accomplished by distilling the architectural cost of tiles with different computational widths into a system metric we call the Granularity Indicator (GI). The GI is then compared against the communications exposed when algorithms are partitioned across multiple tiles. Through this comparison, the tile granularity that best fits a given set of algorithms can be determined, reducing the system power for that set of algorithms. When the GI analysis is applied to the Synchroscalar tile architecture[1], we find that Synchroscalar\u27s already low power consumption can be further reduced by 14% when customized for execution of the 802.11a receiver. In addition, the GI can also be a used to evaluate tile size when considering multiple applications simultaneously, providing a convenient platform for hardware-software co-design

Crossref

DigitalCommons@CalPoly

RingScalar: A Complexity-Effective Out-of-Order Superscalar Microarchitecture

Author: Asanovic Krste
Tseng Jessica H.
Publication venue
Publication date: 18/09/2006
Field of study

RingScalar is a complexity-effective microarchitecture for out-of-order superscalar processors, that reduces the area, latency, and power of all major structures in the instruction flow. The design divides an N-way superscalar into N columns connected in a unidirectional ring, where each column contains a portion of the instruction window, a bank of the register file, and an ALU. The design exploits the fact that most decoded instructions are waiting on just one operand to use only a single tag per issue window entry, and to restrict instruction wakeup and value bypass to only communicate with the neighboring column. Detailed simulations of four-issue single-threaded machines running SPECint2000 show that RingScalar has IPC only 13% lower than an idealized superscalar, while providing large reductions in area, power, and circuit latency

DSpace@MIT

Reducing the complexity of the register file in dynamic superscalar processors

Author: Balasubramonian Rajeev
Dwarkadas Sandhya
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2001
Field of study

Journal ArticleDynamic superscalar processors execute multiple instructions out-of-order by looking for independent operations within a large window. The number of physical registers within the processor has a direct impact on the size of this window as most in-flight instructions require a new physical register at dispatch. A large multi-ported register file helps improve the instruction-level parallelism (ILP), but may have a detrimental effect on clock speed, especially in future wire-limited technologies. In this paper, we propose a register file organization that reduces register file size and port requirements for a given amount of ILP. We use a two-level register file organization to reduce register file size requirements, and a banked organization to reduce port requirements. We demonstrate empirically that the resulting register file organizations have reduced latency and (in the case of the banked organization) energy requirements for similar instructions per cycle (IPC) performance and improved instructions per second (IPS) performance in comparison to a conventional monolithic register file. The choice of organization is dependent on design goals

The University of Utah: J. Willard Marriott Digital Library

Recommended from our members

Single-level dynamic register caching architecture for high-performance superscalar processors

Author: Liebert John A.
Publication venue: 'Oregon State University'
Publication date
Field of study

The amount of instruction level parallelism (ILP) that can be exploited depends greatly on the size of the instruction window and the number of in-flight instructions the processor can support. However, this requires a register file with a large set of physical registers for renaming and multiple ports to provide register accesses to several instructions at once. The number of registers and ports a register file must contain will increase as the next generation wide-issue processors take advantage of more ILP, which will also increase its access time, area, and power dissipation. This paper proposes a method called Dynamic Register Caching, which uses a small, fast register cache along with a slow full register file in a single-level configuration, and splits the porting requirement between the two with each one capable of supplying values to FUs. This reduces the miss penalty found in previous multi-level schemes to just the access time of the full register file. The proposed method uses In-Cache bits and Register-port Select logic to keep track of operands in the register cache and the availability of free ports on both register files, and a simple instruction steering mechanism to determine which register file will supply values to instructions. This technique of dynamically steering instructions requires slightly more logic to implement, but incurs no additional delay and insures that load balance is a non-issue. Our study based on SimpleScalar microarchitecture simulation shows that the proposed scheme provides on average 15~22% improvement in IPC, 47~56% reduction in total area, and 23~49% reduction in power compared to a monolithic register file

ScholarsArchive@OSU

Energy models for network-on-chip components

Author: Bhat S.
Publication venue
Publication date: 31/01/2006
Field of study

Pure OAI Repository