Abstract-This paper presents a novel variable-latency multiplier architecture, suitable for implementation as a self-timed multiplier core or as a fully synchronous multicycle multiplier core. The architecture combines a second-order Booth algorithm with a split carry-save array pipelined organization, incorporating multiple row skipping and a completion-predicting carry-select final adder. The paper reports the architecture and logic design, the CMOS circuit design, and the performance evaluation. In 0.35-µm CMOS, the expected sustainable cycle time for a 32-bit synchronous implementation is 2.25 ns. Instruction-level simulations estimate 54% single-cycle and 46% two-cycle operations in SPEC95 execution. Using the same CMOS process, the 32-bit asynchronous implementation is expected to reach an average 1.76 ns throughput and 3.48 ns latency in SPEC95 execution.
I. INTRODUCTION

FAST integer multipliers are a key topic in the VLSI design of high-speed microprocessors. Recent results have shown that, through a careful full-custom CMOS design, a 54 × 54-bit multiplication in less than 3 ns is possible [21]. However, with commonly available CMOS processes, micro-architectures with 2-ns cycle time are commercially available [28]. As a result, due to the registers' setup and hold times, even a fast 32-b multiplication may not fit in a single cycle, and pipelined multicycle multipliers are a common design choice to keep the whole microarchitecture from being limited by a relatively slow multiplier.
Data dependency always limits the throughput of pipelined arithmetic units [22], because idle cycles arise between consecutive dependent operations. To overcome this, synchronous variable-latency pipelined addition units have recently been proposed in industrial DSP design [30]. A variable-latency unit operates as a normal pipelined unit, but for most operands it can complete its operation in a single cycle, thus avoiding the insertion of idle cycles and improving the average throughput. A synchronous signal flags the cycle in which the operation has completed. A more aggressive implementation of this idea is inherent in asynchronous design, with self-timed units capable of an average response faster than the worst case [6], for they best fit the conceptual design of a synchronous, fixed-latency instruction set architecture [22]. In fact, variable latency is present in some nonpipelined multicycle multiply units based on an iterative sequential algorithm, targeting low-cost CPUs [35]. Another example of this approach is the 8-b multiplier recently presented in [45]. Synchronous variable latency has also been proposed for addition [32] and implemented in a high-speed pipelined VLSI adder [30]. No specific work has addressed synchronous variable-latency multipliers targeting high speed.
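As a rough illustration of why variable latency helps dependent operations, the expected number of cycles before a result is available can be estimated from the operation mix. The sketch below uses the figures quoted in the abstract (54% single-cycle, 46% two-cycle, 2.25 ns cycle time); the function name `average_cycles` and the comparison against a fixed two-cycle unit are illustrative assumptions, not the evaluation method of the paper.

```python
def average_cycles(latency_mix):
    """Expected cycles per operation for a mix given as {cycles: fraction}."""
    total = sum(latency_mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(cycles * fraction for cycles, fraction in latency_mix.items())


# Figures taken from the abstract; everything else here is an assumption.
SPEC95_MIX = {1: 0.54, 2: 0.46}   # cycles -> fraction of multiplications
CYCLE_TIME_NS = 2.25              # synchronous 32-bit implementation

avg_cycles = average_cycles(SPEC95_MIX)
avg_time_ns = avg_cycles * CYCLE_TIME_NS

# Hypothetical baseline: a unit that always needs two cycles.
fixed_time_ns = 2 * CYCLE_TIME_NS

print(f"average cycles per multiply: {avg_cycles:.2f}")
print(f"average result time: {avg_time_ns:.3f} ns (fixed two-cycle: {fixed_time_ns:.2f} ns)")
```

Under these assumptions the average comes to about 1.46 cycles, or roughly 3.3 ns, against 4.5 ns for a unit that always takes two cycles. The estimate ignores pipelining overlap between independent operations, so it only bounds the benefit for back-to-back dependent multiplications.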
On the other hand, several research works have addressed asynchronous VLSI multipliers. An early example of the variable-latency concept with an asynchronous implementation is in [36]. Some studies address serial asynchronous multipliers, with a low-speed target [13].
This paper presents an integer (two's complement negative coding) pipelined multiplier architecture, which combines several algorithmic and design techniques to allow the VLSI implementation as a self-timed multiplier core or as a fully synchronous variable-latency multiplier core. The synchronous version is essentially a novel design, while the asynchronous version is a substantial evolution of [26] in the architecture (use of Booth encoding, different data-dependent carry-save array, different final adder) with additional differences in the
