Design of Synchronous Section-Carry Based Carry Lookahead Adders with Improved Figure of Merit
The section-carry based carry lookahead adder (SCBCLA) architecture was
proposed as an efficient alternative to the conventional carry lookahead adder
(CCLA) architecture for the physical implementation of computer arithmetic. In
previous related works, self-timed SCBCLA architectures and synchronous SCBCLA
architectures were realized using standard cells and FPGAs. In this work, we
deal with improved realizations of synchronous SCBCLA architectures designed in
a semi-custom fashion using standard cells. The improvement is quantified in
terms of a figure of merit (FOM), where the FOM is defined as the inverse of
the product of power, delay and area. Since the power, delay and area of a
digital design should be minimized, the FOM should be maximized.
Starting from an efficient conventional carry lookahead generator, we show how
an optimized section-carry based carry lookahead generator is realized. In
comparison with our recent work dealing with standard cells based
implementation of SCBCLAs to perform 32-bit addition of two binary operands, we
show in this work that with improved section-carry based carry lookahead
generators, the resulting SCBCLAs exhibit significant improvements in FOM.
Compared to the earlier optimized hybrid SCBCLA, the proposed optimized hybrid
SCBCLA improves the FOM by 88.3%. The optimized hybrid CCLA also improves
its FOM, by 77.3% over the earlier optimized hybrid CCLA. However,
the proposed optimized hybrid SCBCLA still has a better FOM
than the current optimized hybrid CCLA, by 15.3%. All the CCLAs and SCBCLAs
are implemented to realize 32-bit dual-operand binary addition using a 32/28nm
CMOS process.
Comment: arXiv admin note: text overlap with arXiv:1603.0796
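The FOM definition in the abstract above can be illustrated with a minimal Python sketch. The function names and the numeric inputs are hypothetical and are not taken from the paper; only the formula (inverse of the product of power, delay and area) comes from the abstract.

```python
def figure_of_merit(power, delay, area):
    """FOM = 1 / (power * delay * area); a larger value is better."""
    if min(power, delay, area) <= 0:
        raise ValueError("power, delay and area must be positive")
    return 1.0 / (power * delay * area)


def percent_improvement(fom_new, fom_old):
    """Relative FOM gain of a new design over an old one, in percent."""
    return 100.0 * (fom_new - fom_old) / fom_old


# Hypothetical design metrics in arbitrary units (NOT results from the paper).
baseline = figure_of_merit(power=2.0, delay=1.0, area=4.0)
improved = figure_of_merit(power=2.0, delay=0.8, area=4.0)
gain = percent_improvement(improved, baseline)
```

With these made-up inputs, a 20% shorter delay at unchanged power and area gives a FOM gain of about 25%, because the FOM scales with the reciprocal of each metric.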
Approximate Ripple Carry and Carry Lookahead Adders - A Comparative Analysis
Approximate ripple carry adders (RCAs) and carry lookahead adders (CLAs) are
presented and compared with accurate RCAs and CLAs for performing a
32-bit addition. The accurate and approximate RCAs and CLAs are implemented
using a 32/28nm CMOS process. Approximations ranging from 4 to 20 bits are
considered for the less significant adder bit positions. The simulation results
show that approximate RCAs report reductions in the power-delay product (PDP)
ranging from 19.5% to 82% relative to the accurate RCA for approximation sizes
varying from 4 to 20 bits. Also, approximate CLAs report reductions in PDP
ranging from 16.7% to 74.2% relative to the accurate CLA for approximation
sizes varying from 4 to 20 bits. On average, for the approximation sizes considered, it is
observed that approximate CLAs achieve a 46.5% reduction in PDP compared to the
approximate RCAs. Hence, approximate CLAs are preferable over approximate RCAs
for the low power implementation of approximate computer arithmetic
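The abstract above does not say which approximation is applied to the less significant bit positions. As an illustration only, the sketch below assumes a simplified lower-part OR scheme: the low `approx_bits` positions are computed with a bitwise OR, and the upper positions are added exactly with no carry from the lower part. The function name `approx_add` and its parameters are hypothetical.

```python
def approx_add(a, b, width=32, approx_bits=8):
    """Approximate addition: OR the low bits, add the high bits exactly.

    Assumed scheme (not confirmed by the abstract): the lower part has no
    carry chain and sends no carry into the exact upper part.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    low_mask = (1 << approx_bits) - 1
    low = (a | b) & low_mask                       # approximate lower part
    high = (a >> approx_bits) + (b >> approx_bits)  # exact upper part
    return ((high << approx_bits) | low) & mask


exact = (5 + 3) & 0xFFFFFFFF
approx = approx_add(5, 3, approx_bits=4)
error = exact - approx
```

Setting `approx_bits=0` reduces the sketch to an exact modular addition, which gives a convenient reference when measuring the error introduced for larger approximation sizes.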
ASIC-based Implementation of Synchronous Section-Carry Based Carry Lookahead Adders
The section-carry based carry lookahead adder (SCBCLA) topology was proposed
as an improved high-speed alternative to the conventional carry lookahead adder
(CCLA) topology in previous works. Self-timed and FPGA-based implementations of
SCBCLAs and CCLAs were considered earlier, and it was found that SCBCLAs could
help in delay reduction i.e. pave the way for improved speed compared to CCLAs
at the expense of some increase in area and/or power parameters. In this work,
we consider semi-custom ASIC-based implementations of different variants of
SCBCLAs and CCLAs to perform 32-bit dual-operand addition. Based on the
simulation results for 32-bit dual-operand addition obtained by targeting a
high-end 32/28nm CMOS process, it is found that an optimized SCBCLA
architecture reports a 9.8% improvement in figure-of-merit (FOM) compared to an
optimized CCLA architecture, where the FOM is defined as the inverse of the
product of power, delay, and area. It is generally inferred from the
simulations that the SCBCLA architecture could be more beneficial compared to
the CCLA architecture in terms of the design metrics whilst benefitting a
variety of computer arithmetic operations involving dual-operand and/or
multi-operand additions. Also, it is observed that heterogeneous CLA
architectures tend to fare well compared to homogeneous CLA architectures, as
substantiated by the simulation results.
Comment: in the book Recent Advances in Circuits, Systems, Signal Processing
and Communications, included in ISI/SCI Web of Science and Web of Knowledge,
Proceedings of the 10th International Conference on Circuits, Systems, Signal
and Telecommunications, pp. 58-64, 2016, Barcelona, Spain
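For readers unfamiliar with the conventional carry lookahead principle that the CCLA abstracts build on, here is a minimal Python sketch. It models a conventional CLA with 4-bit groups whose group carries ripple from one group to the next; it does not reproduce the section-carry (SCBCLA) generator, whose equations are not given in the abstracts. The function names and the group size are assumptions.

```python
def lookahead_carries(g, p, cin):
    """Carries c1..cn from the expanded lookahead equations.

    c[i+1] = g[i] | p[i]g[i-1] | ... | p[i]...p[0]cin, so each carry
    depends only on generate/propagate signals and the group carry-in.
    """
    carries = []
    for i in range(len(g)):
        c = 0
        for j in range(i + 1):
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            c |= term
        term = cin
        for k in range(i + 1):
            term &= p[k]
        carries.append(c | term)
    return carries


def cla_add(a, b, width=32, group=4):
    """Add two unsigned integers group by group; returns (sum, carry_out)."""
    total, carry = 0, 0
    for base in range(0, width, group):
        n = min(group, width - base)
        g = [(a >> (base + i)) & (b >> (base + i)) & 1 for i in range(n)]
        p = [((a >> (base + i)) ^ (b >> (base + i))) & 1 for i in range(n)]
        c = [carry] + lookahead_carries(g, p, carry)
        for i in range(n):
            total |= (p[i] ^ c[i]) << (base + i)  # sum bit = p XOR carry-in
        carry = c[n]
    return total, carry
```

In hardware the expanded carry equations are evaluated in parallel within a group, which is where the speed advantage over a plain ripple carry adder comes from; the nested loops here only mirror the logic, not the timing.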
Designing high-speed, low-power full adder cells based on carbon nanotube technology
This article presents novel high-speed and low-power full adder cells based
on carbon nanotube field effect transistors (CNFETs). Four full adder cells are
proposed in this article. The first (named CN9P4G) and the second (CN9P8GBUFF)
utilize 13 and 17 CNFETs respectively. The third design, named CN10PFS, uses
only 10 transistors and is full swing. Finally, CN8P10G uses 18 transistors and
is divided into two modules, so that the Sum and Cout signals are produced in
parallel. All inputs are used directly, without inversion. These
designs also exploit a special feature of CNFETs, namely control of the
threshold voltage by adjusting the CNFET diameter, to achieve the best
performance and correct voltage levels. All simulations were performed using
Synopsys HSPICE, and the proposed designs are compared to other classical and
modern CMOS and CNFET-based full adder cells in terms of delay, power
consumption and power-delay product.
Comment: 13 Pages, 13 Figures, 2 Tables
Asynchronous Early Output Dual-Bit Full Adders Based on Homogeneous and Heterogeneous Delay-Insensitive Data Encoding
This paper presents the designs of asynchronous early output dual-bit full
adders without and with redundant logic (implicit) corresponding to homogeneous
and heterogeneous delay-insensitive data encoding. For homogeneous
delay-insensitive data encoding only dual-rail i.e. 1-of-2 code is used, and
for heterogeneous delay-insensitive data encoding 1-of-2 and 1-of-4 codes are
used. The 4-phase return-to-zero protocol is used for handshaking. To
demonstrate the merits of the proposed dual-bit full adder designs, 32-bit
ripple carry adders (RCAs) are constructed comprising dual-bit full adders. The
proposed dual-bit full adders based 32-bit RCAs incorporating redundant logic
feature reduced latency and area compared to their non-redundant counterparts
with no accompanying power penalty. In comparison with the weakly indicating
32-bit RCA constructed using homogeneously encoded dual-bit full adders
containing redundant logic, the early output 32-bit RCA comprising the proposed
homogeneously encoded dual-bit full adders with redundant logic reports
corresponding reductions in latency and area by 22.2% and 15.1% with no
associated power penalty. Similarly, the early output 32-bit RCA
constructed using the proposed heterogeneously encoded dual-bit full adder
with redundant logic reduces latency and area by 21.5% and 21.3%,
respectively, compared to the weakly indicating 32-bit RCA that consists of
heterogeneously encoded dual-bit full adders with redundant logic, with no
power overhead. The simulation results obtained are based on a 32/28nm CMOS
process technology
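The delay-insensitive codes and the return-to-zero protocol named in the abstract above can be sketched in Python as follows. The wire ordering within each codeword and all function names are conventions assumed here for illustration; the abstract does not specify them.

```python
def dual_rail(bit):
    """1-of-2 code: one bit on two wires (rail0, rail1), exactly one high."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    return (1, 0) if bit == 0 else (0, 1)


def one_of_four(high_bit, low_bit):
    """1-of-4 code: two bits on four wires, exactly one high."""
    if high_bit not in (0, 1) or low_bit not in (0, 1):
        raise ValueError("bits must be 0 or 1")
    value = 2 * high_bit + low_bit
    return tuple(1 if i == value else 0 for i in range(4))


def is_spacer(wires):
    """All wires low: the return-to-zero phase between data items."""
    return not any(wires)


def is_valid_data(wires):
    """Exactly one wire high: a complete codeword has arrived."""
    return sum(wires) == 1


def four_phase_rtz(codewords):
    """Interleave each data codeword with an all-zero spacer."""
    sequence = []
    for word in codewords:
        sequence.append(word)
        sequence.append((0,) * len(word))
    return sequence
```

A homogeneously encoded adder would use `dual_rail` for every signal, while a heterogeneously encoded one would mix `dual_rail` and `one_of_four` codewords, as the abstract describes.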
Asynchronous Ripple Carry Adder based on Area Optimized Early Output Dual-Bit Full Adder
This technical note presents the design of a new area optimized asynchronous
early output dual-bit full adder (DBFA). An asynchronous ripple carry adder
(RCA) is constructed based on the new asynchronous DBFAs and existing
asynchronous early output single-bit full adders (SBFAs). The asynchronous
DBFAs and SBFAs incorporate redundant logic and are encoded using the
delay-insensitive dual-rail code (i.e. homogeneous data encoding) and follow a
4-phase return-to-zero handshaking. Compared to the previous asynchronous RCAs
involving DBFAs and SBFAs, which are based on homogeneous or heterogeneous
delay-insensitive data encodings and which correspond to different timing
models, the early output asynchronous RCA incorporating the proposed DBFAs
and/or SBFAs is found to result in reduced area for the dual-operand addition
operation and feature significantly less latency than the asynchronous RCAs
which consist of only SBFAs. The proposed asynchronous DBFA requires 28.6% less
silicon than the previously reported asynchronous DBFA. For a 32-bit
asynchronous RCA, utilizing 2 stages of SBFAs in the least significant
positions and 15 stages of DBFAs in the more significant positions leads to
optimization in the latency. The new early output 32-bit asynchronous RCA
containing DBFAs and SBFAs reports the following optimizations in design
metrics over its counterparts: i) an 18.8% reduction in area compared to a
previously reported 32-bit early output asynchronous RCA which also has 15
stages of DBFAs and 2 stages of SBFAs, and ii) a 29.4% reduction in latency
compared to a 32-bit early output asynchronous RCA containing only SBFAs.
Comment: 12 pages. arXiv admin note: substantial text overlap with
arXiv:1706.04487, arXiv:1704.0761
Weighted p-bits for FPGA implementation of probabilistic circuits
Probabilistic spin logic (PSL) is a recently proposed computing paradigm
based on unstable stochastic units called probabilistic bits (p-bits) that can
be correlated to form probabilistic circuits (p-circuits). These p-circuits can
be used to solve problems of optimization, inference and also to implement
precise Boolean functions in an "inverted" mode, where a given Boolean circuit
can operate in reverse to find the input combinations that are consistent with
a given output. In this paper we present a scalable FPGA implementation of such
invertible p-circuits. We implement a "weighted" p-bit that combines stochastic
units with localized memory structures. We also present a generalized tile of
weighted p-bits to which a large class of problems beyond invertible Boolean
logic can be mapped, and how invertibility can be applied to interesting
problems such as the NP-complete Subset Sum Problem by solving a small instance
of this problem in hardware
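The abstract above does not state the p-bit update rule. The sketch below assumes a commonly used behavioral model in which a p-bit outputs sgn(tanh(I) + r), with r drawn uniformly from (-1, 1), so that its time-averaged output follows tanh(I). The function names, sample count and seed are hypothetical, and the weighted p-bit and FPGA tile of the paper are not modeled.

```python
import math
import random


def p_bit(input_signal, rng):
    """One stochastic sample in {-1, +1} under the assumed tanh model."""
    r = rng.uniform(-1.0, 1.0)
    return 1 if math.tanh(input_signal) + r >= 0 else -1


def average_output(input_signal, samples=10000, seed=0):
    """Time average of a p-bit held at a fixed input; tends to tanh(input)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return sum(p_bit(input_signal, rng) for _ in range(samples)) / samples
```

With zero input the output fluctuates evenly between -1 and +1, while a strongly positive or negative input pins it to one value; correlating many such units through their inputs is what forms a p-circuit.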
Soft Realization: a Bio-inspired Implementation Paradigm
Researchers traditionally solve computational problems through rigorous
and deterministic algorithms, an approach known as Hard Computing. These
precise algorithms have widely been realized using digital technology as an
inherently reliable and accurate implementation platform, in either hardware
or software form. This rigid form of implementation, which we refer to as Hard
Realization, relies on strict algorithmic accuracy constraints dictated to
digital design engineers. Hard realization accepts whatever implementation
cost is necessary to
preserve computation precision and determinism throughout all the design and
implementation steps. Despite its prior accomplishments, this conventional
paradigm has encountered serious challenges with today's emerging applications
and implementation technologies. Unlike traditional hard computing, the
emerging soft and bio-inspired algorithms do not rely on fully precise and
deterministic computation. Moreover, the incoming nanotechnologies face
increasing reliability issues that prevent them from being efficiently
exploited in hard realization of applications. This article examines Soft
Realization, a novel approach to the design and implementation of an important
category of applications, inspired by the internal structure of the brain. The
proposed paradigm mitigates major weaknesses of hard realization by (1)
alleviating incompatibilities with today's soft and bio-inspired algorithms
such as artificial neural networks, fuzzy systems, and human sense signal
processing applications, and (2) resolving the destructive inconsistency with
unreliable nanotechnologies. Our experimental results on a set of well-known
soft applications implemented using the proposed soft realization paradigm in
both reliable and unreliable technologies indicate that significant energy,
delay, and area savings can be obtained compared to the conventional
implementation.
Comment: Imprecise (approximate) computing and relaxed fault tolerance
are some, but not all, instances of Soft Realization. Soft
realization and imprecise computing were first introduced around 2005 in H.R.
Mahdiani's PhD thesis proposal. The first imprecise computing paper was
published in 2010. This manuscript was written in 2012, submitted to Nature in
2017 and rejected by the editor
Optimized Configurable Architectures for Scalable Soft-Input Soft-Output MIMO Detectors with 256-QAM
This paper presents an optimized low-complexity and high-throughput
multiple-input multiple-output (MIMO) signal detector core for detecting
spatially-multiplexed data streams. The core architecture supports various
layer configurations up to 4, while achieving near-optimal performance, as well
as configurable modulation constellations up to 256-QAM on each layer. The core
is capable of operating as a soft-input soft-output log-likelihood ratio (LLR)
MIMO detector which can be used in the context of iterative detection and
decoding. High area-efficiency is achieved via algorithmic and architectural
optimizations performed at two levels. First, distance computations and slicing
operations for an optimal 2-layer maximum a posteriori (MAP) MIMO detector are
optimized to eliminate the use of multipliers and reduce the overhead of
slicing in the presence of soft-input LLRs. We show that distances can be
easily computed using elementary addition operations, while optimal slicing is
done via efficient comparisons with soft decision boundaries, resulting in a
simple feed-forward pipelined architecture. Second, to support more layers, an
efficient channel decomposition scheme is presented that reduces the detection
of multiple layers into multiple 2-layer detection subproblems, which map onto
the 2-layer core with a slight modification using a distance accumulation stage
and a post-LLR processing stage. Various architectures are accordingly
developed to achieve a desired detection throughput and run-time
reconfigurability by time-multiplexing of one or more component cores. The
proposed core is applied as well to design an optimal multi-user MIMO detector
for LTE. The core occupies an area of 1.58MGE and achieves a throughput of 733
Mbps for 256-QAM when synthesized in 90 nm CMOS
Superconductor Digital Electronics: Scalability and Energy Efficiency Issues
Superconductor digital electronics using Josephson junctions as ultrafast
switches and magnetic-flux encoding of information was proposed over 30 years
ago as a sub-terahertz clock frequency alternative to semiconductor electronics
based on complementary metal-oxide-semiconductor (CMOS) transistors. Recently,
interest in developing superconductor electronics has been renewed due to a
search for energy saving solutions in applications related to high-performance
computing. The current state of superconductor electronics and fabrication
processes is reviewed in order to evaluate whether this technology is
scalable to the very large scale integration (VLSI) required to achieve
computation complexities comparable to CMOS processors. A fully planarized
process at MIT Lincoln Laboratory, perhaps the most advanced process developed
so far for superconductor electronics, is used as an example. The process has
nine superconducting layers: eight Nb wiring layers with the minimum feature
size of 350 nm, and a thin superconducting layer for making compact
high-kinetic-inductance bias inductors. All circuit layers are fully planarized
using chemical mechanical planarization (CMP) of SiO2 interlayer dielectric.
The physical limitations imposed on the circuit density by Josephson junctions,
circuit inductors, shunt and bias resistors, etc., are discussed. Energy
dissipation in superconducting circuits is also reviewed in order to estimate
whether this technology, which requires cryogenic refrigeration, can be energy
efficient. Fabrication process development required for increasing the density
of superconductor digital circuits by a factor of ten and achieving densities
above 10^7 Josephson junctions per cm^2 is described.
Comment: 20 pages, 14 figures, 153 references, submitted to Low Temperature
Physics