677 research outputs found
Configurable 3D-integrated focal-plane sensor-processor array architecture
A mixed-signal Cellular Visual Microprocessor architecture with digital processors is
described. An ASIC implementation is also demonstrated. The architecture is composed of a
regular sensor readout circuit array, prepared for 3D face-to-face type integration, and one or
several cascaded array of mainly identical (SIMD) processing elements. The individual array
elements derived from the same general HDL description and could be of different in size, aspect
ratio, and computing resources
Throughput-Distortion Computation Of Generic Matrix Multiplication: Toward A Computation Channel For Digital Signal Processing Systems
The generic matrix multiply (GEMM) function is the core element of
high-performance linear algebra libraries used in many
computationally-demanding digital signal processing (DSP) systems. We propose
an acceleration technique for GEMM based on dynamically adjusting the
imprecision (distortion) of computation. Our technique employs adaptive scalar
companding and rounding to input matrix blocks followed by two forms of packing
in floating-point that allow for concurrent calculation of multiple results.
Since the adaptive companding process controls the increase of concurrency (via
packing), the increase in processing throughput (and the corresponding increase
in distortion) depends on the input data statistics. To demonstrate this, we
derive the optimal throughput-distortion control framework for GEMM for the
broad class of zero-mean, independent identically distributed, input sources.
Our approach converts matrix multiplication in programmable processors into a
computation channel: when increasing the processing throughput, the output
noise (error) increases due to (i) coarser quantization and (ii) computational
errors caused by exceeding the machine-precision limitations. We show that,
under certain distortion in the GEMM computation, the proposed framework can
significantly surpass 100% of the peak performance of a given processor. The
practical benefits of our proposal are shown in a face recognition system and a
multi-layer perceptron system trained for metadata learning from a large music
feature database.Comment: IEEE Transactions on Signal Processing (vol. 60, 2012
Rutger's CAM2000 chip architecture
This report describes the architecture and instruction set of the Rutgers CAM2000 memory chip. The CAM2000 combines features of Associative Processing (AP), Content Addressable Memory (CAM), and Dynamic Random Access Memory (DRAM) in a single chip package that is not only DRAM compatible but capable of applying simple massively parallel operations to memory. This document reflects the current status of the CAM2000 architecture and is continually updated to reflect the current state of the architecture and instruction set
Content addressable memory project
A parameterized version of the tree processor was designed and tested (by simulation). The leaf processor design is 90 percent complete. We expect to complete and test a combination of tree and leaf cell designs in the next period. Work is proceeding on algorithms for the computer aided manufacturing (CAM), and once the design is complete we will begin simulating algorithms for large problems. The following topics are covered: (1) the practical implementation of content addressable memory; (2) design of a LEAF cell for the Rutgers CAM architecture; (3) a circuit design tool user's manual; and (4) design and analysis of efficient hierarchical interconnection networks
A configurable vector processor for accelerating speech coding algorithms
The growing demand for voice-over-packer (VoIP) services and multimedia-rich
applications has made increasingly important the efficient, real-time implementation of
low-bit rates speech coders on embedded VLSI platforms. Such speech coders are
designed to substantially reduce the bandwidth requirements thus enabling dense multichannel
gateways in small form factor. This however comes at a high computational cost
which mandates the use of very high performance embedded processors.
This thesis investigates the potential acceleration of two major ITU-T speech coding
algorithms, namely G.729A and G.723.1, through their efficient implementation on a
configurable extensible vector embedded CPU architecture. New scalar and vector ISAs
were introduced which resulted in up to 80% reduction in the dynamic instruction count
of both workloads. These instructions were subsequently encapsulated into a parametric,
hybrid SISD (scalar processor)âSIMD (vector) processor. This work presents the research
and implementation of the vector datapath of this vector coprocessor which is tightly-coupled
to a Sparc-V8 compliant CPU, the optimization and simulation methodologies
employed and the use of Electronic System Level (ESL) techniques to rapidly design
SIMD datapaths
DESIGN AND IMPLEMENTATION OF THE INSTRUCTION SET ARCHITECTURE FOR DATA LARS
The ideal memory system assumed by most programmers is one which has high capacity, yet allows any word to be accessed instantaneously. To make the hardware approximate this performance, an increasingly complex memory hierarchy, using caches and techniques like automatic prefetch, has evolved. However, as the gap between processor and memory speeds continues to widen, these programmer-visible mechanisms are becoming inadequate.
Part of the recent increase in processor performance has been due to the introduction of programmer/compiler-visible SWAR (SIMD Within A Register) parallel processing on increasingly wide DATA LARs (Line Associative Registers) as a way to both improve data access speed and increase efficiency of SWAR processing. Although the base concept of DATA LARs predates this thesis, this thesis presents the first instruction set architecture specification complete enough to allow construction of a detailed prototype hardware design. This design was implemented and tested using a hardware simulator
Acceleration of PageRank with customized precision based on mantissa segmentation
[EN] We describe the application of a communication-reduction technique for the PageRank algorithm that dynamically adapts the precision of the data access to the numerical requirements of the algorithm as the iteration converges. Our variable-precision strategy, using a customized precision format based on mantissa segmentation (CPMS), abandons the IEEE 754 single- and double-precision number representation formats employed in the standard implementation of PageRank, and instead handles the data in memory using a customized floating-point format. The customized format enables fast data access in different accuracy, prevents overflow/underflow by preserving the IEEE 754 double-precision exponent, and efficiently avoids data duplication, since all bits of the original IEEE 754 double-precision mantissa are preserved in memory, but re-organized for efficient reduced precision access. With this approach, the truncated values (omitting significand bits), as well as the original IEEE double-precision values, can be retrieved without duplicating the data in different formats. Our numerical experiments on an NVIDIA V100 GPU (Volta architecture) and a server equipped with two Intel Xeon Platinum 8168 CPUs (48 cores in total) expose that, compared with a standard ieee double-precision implementation, the CPMS-based PageRank completes about 10% faster if high-accuracy output is needed, and about 30% faster if reduced output accuracy is acceptable.H. Anzt was supported by the "Impuls und Vernetzungsfond" of the Helmholtz Association under grant VH-NG-1241. G. Flegar and E. S. Quintana-Orti were supported by project TIN2017-82972-R of the MINECO and FEDER. This work was also supported by the EU H2020 project 732631 "OPRECOMP. Open Transprecision Computing,' and the US Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Numbers DE-SC0016513 and DE-SC-0010042Gruetzmacher, T.; Cojean, T.; Flegar, G.; Anzt, H.; Quintana-Orti, ES. (2020). Acceleration of PageRank with customized precision based on mantissa segmentation. ACM Transactions on Parallel Computing. 7(1):1-19. https://doi.org/10.1145/3380934S1197
- âŠ