155 research outputs found

    Book Review

    Get PDF
    A Scholarly Review of “Error Control for Network-On-Chip Links” (Authors: Bo Fu and Paul Ampadu, 2012)Fu, B.; and Ampadu, P. 2012. Error Control for Network-On-Chip Links.Springer Science+Business Media, LLC, New York, NY, USA.Available: <http://dx.doi.org/10.1007/978-1-4419-9313-7>

    SCONNA: A Stochastic Computing Based Optical Accelerator for Ultra-Fast, Energy-Efficient Inference of Integer-Quantized CNNs

    Full text link
    The acceleration of a CNN inference task uses convolution operations that are typically transformed into vector-dot-product (VDP) operations. Several photonic microring resonators (MRRs) based hardware architectures have been proposed to accelerate integer-quantized CNNs with remarkably higher throughput and energy efficiency compared to their electronic counterparts. However, the existing photonic MRR-based analog accelerators exhibit a very strong trade-off between the achievable input/weight precision and VDP operation size, which severely restricts their achievable VDP operation size for the quantized input/weight precision of 4 bits and higher. The restricted VDP operation size ultimately suppresses computing throughput to severely diminish the achievable performance benefits. To address this shortcoming, we for the first time present a merger of stochastic computing and MRR-based CNN accelerators. To leverage the innate precision flexibility of stochastic computing, we invent an MRR-based optical stochastic multiplier (OSM). We employ multiple OSMs in a cascaded manner using dense wavelength division multiplexing, to forge a novel Stochastic Computing based Optical Neural Network Accelerator (SCONNA). SCONNA achieves significantly high throughput and energy efficiency for accelerating inferences of high-precision quantized CNNs. Our evaluation for the inference of four modern CNNs at 8-bit input/weight precision indicates that SCONNA provides improvements of up to 66.5x, 90x, and 91x in frames-per-second (FPS), FPS/W and FPS/W/mm2, respectively, on average over two photonic MRR-based analog CNN accelerators from prior work, with Top-1 accuracy drop of only up to 0.4% for large CNNs and up to 1.5% for small CNNs. We developed a transaction-level, event-driven python-based simulator for the evaluation of SCONNA and other accelerators (https://github.com/uky-UCAT/SC_ONN_SIM.git).Comment: To Appear at IPDPS 202

    SIGNAL PROCESSING TECHNIQUES AND APPLICATIONS

    Get PDF
    As the technologies scaling down, more transistors can be fabricated into the same area, which enables the integration of many components into the same substrate, referred to as system-on-chip (SoC). The components on SoC are connected by on-chip global interconnects. It has been shown in the recent International Technology Roadmap of Semiconductors (ITRS) that when scaling down, gate delay decreases, but global interconnect delay increases due to crosstalk. The interconnect delay has become a bottleneck of the overall system performance. Many techniques have been proposed to address crosstalk, such as shielding, buffer insertion, and crosstalk avoidance codes (CACs). The CAC is a promising technique due to its good crosstalk reduction, less power consumption and lower area. In this dissertation, I will present analytical delay models for on-chip interconnects with improved accuracy. This enables us to have a more accurate control of delays for transition patterns and lead to a more efficient CAC, whose worst-case delay is 30-40% smaller than the best of previously proposed CACs. As the clock frequency approaches multi-gigahertz, the parasitic inductance of on-chip interconnects has become significant and its detrimental effects, including increased delay, voltage overshoots and undershoots, and increased crosstalk noise, cannot be ignored. We introduce new CACs to address both capacitive and inductive couplings simultaneously.Quantum computers are more powerful in solving some NP problems than the classical computers. However, quantum computers suffer greatly from unwanted interactions with environment. Quantum error correction codes (QECCs) are needed to protect quantum information against noise and decoherence. Given their good error-correcting performance, it is desirable to adapt existing iterative decoding algorithms of LDPC codes to obtain LDPC-based QECCs. Several QECCs based on nonbinary LDPC codes have been proposed with a much better error-correcting performance than existing quantum codes over a qubit channel. In this dissertation, I will present stabilizer codes based on nonbinary QC-LDPC codes for qubit channels. The results will confirm the observation that QECCs based on nonbinary LDPC codes appear to achieve better performance than QECCs based on binary LDPC codes.As the technologies scaling down further to nanoscale, CMOS devices suffer greatly from the quantum mechanical effects. Some emerging nano devices, such as resonant tunneling diodes (RTDs), quantum cellular automata (QCA), and single electron transistors (SETs), have no such issues and are promising candidates to replace the traditional CMOS devices. Threshold gate, which can implement complex Boolean functions within a single gate, can be easily realized with these devices. Several applications dealing with real-valued signals have already been realized using nanotechnology based threshold gates. Unfortunately, the applications using finite fields, such as error correcting coding and cryptography, have not been realized using nanotechnology. The main obstacle is that they require a great number of exclusive-ORs (XORs), which cannot be realized in a single threshold gate. Besides, the fan-in of a threshold gate in RTD nanotechnology needs to be bounded for both reliability and performance purpose. In this dissertation, I will present a majority-class threshold architecture of XORs with bounded fan-in, and compare it with a Boolean-class architecture. I will show an application of the proposed XORs for the finite field multiplications. The analysis results will show that the majority class outperforms the Boolean class architectures in terms of hardware complexity and latency. I will also introduce a sort-and-search algorithm, which can be used for implementations of any symmetric functions. Since XOR is a special symmetric function, it can be implemented via the sort-and-search algorithm. To leverage the power of multi-input threshold functions, I generalize the previously proposed sort-and-search algorithm from a fan-in of two to arbitrary fan-ins, and propose an architecture of multi-input XORs with bounded fan-ins

    Optical multiple access techniques for on-board routing

    Get PDF
    The purpose of this research contract was to design and analyze an optical multiple access system, based on Code Division Multiple Access (CDMA) techniques, for on board routing applications on a future communication satellite. The optical multiple access system was to effect the functions of a circuit switch under the control of an autonomous network controller and to serve eight (8) concurrent users at a point to point (port to port) data rate of 180 Mb/s. (At the start of this program, the bit error rate requirement (BER) was undefined, so it was treated as a design variable during the contract effort.) CDMA was selected over other multiple access techniques because it lends itself to bursty, asynchronous, concurrent communication and potentially can be implemented with off the shelf, reliable optical transceivers compatible with long term unattended operations. Temporal, temporal/spatial hybrids and single pulse per row (SPR, sometimes termed 'sonar matrices') matrix types of CDMA designs were considered. The design, analysis, and trade offs required by the statement of work selected a temporal/spatial CDMA scheme which has SPR properties as the preferred solution. This selected design can be implemented for feasibility demonstration with off the shelf components (which are identified in the bill of materials of the contract Final Report). The photonic network architecture of the selected design is based on M(8,4,4) matrix codes. The network requires eight multimode laser transmitters with laser pulses of 0.93 ns operating at 180 Mb/s and 9-13 dBm peak power, and 8 PIN diode receivers with sensitivity of -27 dBm for the 0.93 ns pulses. The wavelength is not critical, but 830 nm technology readily meets the requirements. The passive optical components of the photonic network are all multimode and off the shelf. Bit error rate (BER) computations, based on both electronic noise and intercode crosstalk, predict a raw BER of (10 exp -3) when all eight users are communicating concurrently. If better BER performance is required, then error correction codes (ECC) using near term electronic technology can be used. For example, the M(8,4,4) optical code together with Reed-Solomon (54,38,8) encoding provides a BER of better than (10 exp -11). The optical transceiver must then operate at 256 Mb/s with pulses of 0.65 ns because the 'bits' are now channel symbols

    Low-Power Embedded Design Solutions and Low-Latency On-Chip Interconnect Architecture for System-On-Chip Design

    Get PDF
    This dissertation presents three design solutions to support several key system-on-chip (SoC) issues to achieve low-power and high performance. These are: 1) joint source and channel decoding (JSCD) schemes for low-power SoCs used in portable multimedia systems, 2) efficient on-chip interconnect architecture for massive multimedia data streaming on multiprocessor SoCs (MPSoCs), and 3) data processing architecture for low-power SoCs in distributed sensor network (DSS) systems and its implementation. The first part includes a low-power embedded low density parity check code (LDPC) - H.264 joint decoding architecture to lower the baseband energy consumption of a channel decoder using joint source decoding and dynamic voltage and frequency scaling (DVFS). A low-power multiple-input multiple-output (MIMO) and H.264 video joint detector/decoder design that minimizes energy for portable, wireless embedded systems is also designed. In the second part, a link-level quality of service (QoS) scheme using unequal error protection (UEP) for low-power network-on-chip (NoC) and low latency on-chip network designs for MPSoCs is proposed. This part contains WaveSync, a low-latency focused network-on-chip architecture for globally-asynchronous locally-synchronous (GALS) designs and a simultaneous dual-path routing (SDPR) scheme utilizing path diversity present in typical mesh topology network-on-chips. SDPR is akin to having a higher link width but without the significant hardware overhead associated with simple bus width scaling. The last part shows data processing unit designs for embedded SoCs. We propose a data processing and control logic design for a new radiation detection sensor system generating data at or above Peta-bits-per-second level. Implementation results show that the intended clock rate is achieved within the power target of less than 200mW. We also present a digital signal processing (DSP) accelerator supporting configurable MAC, FFT, FIR, and 3-D cross product operations for embedded SoCs. It consumes 12.35mW along with 0.167mm2 area at 333MHz

    Securing in-memory processors against Row Hammering Attacks

    Get PDF
    Modern applications on general purpose processors require both rapid and power-efficient computing and memory components. As applications continue to improve, the demand for high speed computation, fast-access memory, and a secure platform increases. Traditional Von Neumann Architectures split the computing and memory units, causing both latency and high power-consumption issues; henceforth, a hybrid memory processing system is proposed, known as in-memory processing. In-memory processing alleviates the delay of computation and minimizes power-consumption; such improvements saw a 14x speedup improvement, 87\% fewer power consumption, and appropriate linear scalability versus performance. Several applications of in-memory processing include data-driven applications such as Artificial Intelligence (AI), Convolutional and Deep Neural Networks (CNNs/DNNs). However, processing-in-memory can also suffer from a security and reliability issue known as the Row Hammer Security Bug; this security exploit flips bits within memory without access, leading to error injection, system crashes, privilege separation, and total hijack of a system; the novel Row Hammer security bug can negatively impact the accuracies of CNNs and DNNs via flipping the bits of stored weight values without direct access. Weights of neural networks are stored in a variety of data patterns, resulting in either a solid (all 1s or all 0s), checkered (alternating 1s and 0s in both rows and columns), row-stripe (alternating 1s and 0s in rows), or column-striped (alternating 1s and 0s in columns) manner; the row-stripe data pattern exhibits the largest likelihood of a Row Hammer attack, resulting in the accuracies of neural networks dropping over 30\%. A row-stripe avoidance coding scheme is proposed to reduce the probability of the Row Hammer Attack occurring within neural networks. The coding scheme encodes the binary portion of a weight in a CNN or DNN to reduce the chance of row-stripe data patterns, overall reducing the likelihood of a Row Hammer attack occurring while improving the overall security of the in-memory processing system

    On-board processing for future satellite communications systems: Satellite-Routed FDMA

    Get PDF
    A frequency division multiple access (FDMA) 30/20 GHz satellite communications architecture without on-board baseband processing is investigated. Conceptual system designs are suggested for domestic traffic models totaling 4 Gb/s of customer premises service (CPS) traffic and 6 Gb/s of trunking traffic. Emphasis is given to the CPS portion of the system which includes thousands of earth terminals with digital traffic ranging from a single 64 kb/s voice channel to hundreds of channels of voice, data, and video with an aggregate data rate of 33 Mb/s. A unique regional design concept that effectively smooths the non-uniform traffic distribution and greatly simplifies the satellite design is employed. The satellite antenna system forms thirty-two 0.33 deg beam on both the uplinks and the downlinks in one design. In another design matched to a traffic model with more dispersed users, there are twenty-four 0.33 deg beams and twenty-one 0.7 deg beams. Detailed system design techniques show that a single satellite producing approximately 5 kW of dc power is capable of handling at least 75% of the postulated traffic. A detailed cost model of the ground segment and estimated system costs based on current information from manufacturers are presented
    • …
    corecore