5 research outputs found

    Computing the Similarity Estimate Using Approximate Memory

    Get PDF
    In many computing applications there is a need to compute the similarity of sets of elements. When the sets have many elements or the comparison involves many sets, computing the similarity requires significant computational effort and storage capacity. As a reasonably accurate estimate is sufficient in most cases, many algorithms for similarity estimation have been proposed during the last decades. Those algorithms compute signatures for the sets and use them to estimate similarity. However, as the number of sets that need to be compared grows, even these similarity estimation algorithms require significant memory with its associated power dissipation. This article considers, for the first time, the use of approximate memories for similarity estimation. A theoretical analysis and simulation results are provided; initially it is shown that similarity sketches can tolerate large bit error rates and thus can benefit from using approximate memories without substantially compromising the accuracy of the similarity estimate. An understanding of the effect of errors in the stored signatures on the similarity estimate is pursued. A scheme to mitigate the impact of errors is presented; the proposed scheme tolerates even larger bit error rates and does not need additional memory. For example, bit error rates of up to 10^-4 have less than a 1% impact on the accuracy of the estimate when the memory is unprotected, and larger bit error rates can be tolerated if the memory is parity protected. These findings can be used for voltage supply scaling and increasing the refresh time in SRAMs and DRAMs. Based on those initial results, an enhanced implementation is further proposed for unprotected memories that further extends the range of tolerated BERs and enables power savings of up to 61.31% for SRAMs. In conclusion, this article shows that the use of approximate memories in sketches for similarity estimation provides significant benefits with a negligible impact on accuracy. This work was supported by ACHILLES project PID2019-104207RB-I00 and Go2Edge network RED2018-102585-T funded by the Spanish Agencia Estatal de Investigación (AEI) 10.13039/501100011033 and by the Madrid Community research project TAPIR-CM under Grant P2018/TCS-4496. The research of S. Liu and F. Lombardi was supported by NSF under Grants CCF-1953961 and 1812467.
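
    The sketch-based estimation and bit-error tolerance described above can be illustrated with a short, self-contained Python sketch. This is a minimal illustration only, assuming MinHash signatures (one common similarity sketch), 32-bit stored values, and independent uniform bit flips at rate `ber`; the paper's exact signature scheme, memory model, and mitigation scheme are not reproduced here.

```python
# Minimal sketch: MinHash similarity estimation with bit errors injected into
# the stored signature values (assumed error model: independent uniform flips).
import random

def minhash_signature(items, k=128, seed=0):
    rng = random.Random(seed)
    # k hash functions simulated by salted built-in tuple hashing
    salts = [rng.getrandbits(64) for _ in range(k)]
    return [min(hash((salt, x)) & 0xFFFFFFFF for x in items) for salt in salts]

def inject_bit_errors(signature, ber, width=32, seed=1):
    rng = random.Random(seed)
    noisy = []
    for value in signature:
        for bit in range(width):
            if rng.random() < ber:
                value ^= 1 << bit        # flip this stored bit with probability ber
        noisy.append(value)
    return noisy

def estimate_jaccard(sig_a, sig_b):
    # fraction of matching signature positions approximates the Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set(range(0, 6000))
b = set(range(2000, 8000))               # true Jaccard = 4000 / 8000 = 0.5
sa, sb = minhash_signature(a), minhash_signature(b)
print("error-free estimate:", estimate_jaccard(sa, sb))
print("estimate at BER=1e-4:", estimate_jaccard(inject_bit_errors(sa, 1e-4),
                                                inject_bit_errors(sb, 1e-4, seed=2)))
```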

    Concurrent Classifier Error Detection (CCED) in Large Scale Machine Learning Systems

    Full text link
    The complexity of Machine Learning (ML) systems increases each year, with current implementations of large language models or text-to-image generators having billions of parameters and requiring billions of arithmetic operations. As these systems are widely utilized, ensuring their reliable operation is becoming a design requirement. Traditional error detection mechanisms introduce circuit or time redundancy that significantly impacts system performance. An alternative is the use of Concurrent Error Detection (CED) schemes that operate in parallel with the system and exploit its properties to detect errors. CED is attractive for large ML systems because it can potentially reduce the cost of error detection. In this paper, we introduce Concurrent Classifier Error Detection (CCED), a scheme to implement CED in ML systems using a concurrent ML classifier to detect errors. CCED identifies a set of check signals in the main ML system and feeds them to a concurrent ML classifier that is trained to detect errors. The proposed CCED scheme has been implemented and evaluated on two widely used large-scale ML models: Contrastive Language Image Pretraining (CLIP), used for image classification, and Bidirectional Encoder Representations from Transformers (BERT), used for natural language applications. The results show that more than 95 percent of the errors are detected when using a simple Random Forest classifier that is orders of magnitude simpler than CLIP or BERT. These results illustrate the potential of CCED to implement error detection in large-scale ML models.
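
    To make the CCED flow above concrete, the following minimal Python sketch is provided. It assumes the check signals can be summarized as a small per-inference feature vector and that labeled error/no-error examples are available (e.g., from fault-injection runs); the synthetic features and the Random Forest settings are placeholders, not the paper's actual check signals or configuration.

```python
# Minimal sketch of the CCED idea: a small concurrent classifier is trained on
# check-signal features to flag erroneous inferences of a large main model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical check signals, e.g. summary statistics of intermediate activations.
n, d = 2000, 8
clean = rng.normal(0.0, 1.0, size=(n, d))                       # error-free runs
faulty = rng.normal(0.0, 1.0, size=(n, d)) \
         + rng.normal(0.0, 2.0, size=(n, d)) * (rng.random((n, d)) < 0.2)
X = np.vstack([clean, faulty])
y = np.concatenate([np.zeros(n), np.ones(n)])                   # 1 = error present

# Concurrent classifier: far simpler than the main model (CLIP/BERT in the paper).
detector = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[::2], y[::2])
print("held-out detection accuracy:", detector.score(X[1::2], y[1::2]))
```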

    Stochastic dividers for low latency neural networks

    Get PDF
    Due to the low complexity of its arithmetic unit designs, stochastic computing (SC) has attracted considerable interest for implementing Artificial Neural Networks (ANNs) in resource-limited applications, because ANNs must usually perform a large number of arithmetic operations. To attain a high computation accuracy in an SC-based ANN, extended stochastic logic is utilized together with standard SC units and thus, a stochastic divider is required to perform the conversion between these logic representations. However, the conventional divider incurs a large computation latency, which limits SC implementations of ANNs used in applications needing high performance. Therefore, there is a need to design fast stochastic dividers for SC-based ANNs. Recent works (e.g., a binary searching and triple modular redundancy (BS-TMR) based stochastic divider) target a reduction in computation latency while keeping the same accuracy compared with the traditional design. However, this divider still requires N iterations to deal with 2^N-bit stochastic sequences, and thus the latency increases in proportion to the sequence length. In this paper, a decimal searching and TMR (DS-TMR) based stochastic divider is initially proposed to further reduce the computation latency; it requires only two iterations to calculate the quotient, regardless of the sequence length. Moreover, a trade-off design between accuracy and hardware is also presented. An SC-based Multi-Layer Perceptron (MLP) is then considered to show the effectiveness of the proposed dividers over current designs. Results show that when utilizing the proposed dividers, the MLP achieves the lowest computation latency while keeping the same classification accuracy; although incurring an area increase, the overhead due to the proposed dividers is low over the entire MLP. When the product of implementation area, latency, power and number of clock cycles is used as a combined metric for both hardware design and computation complexity, the proposed designs are also shown to be superior to SC-based MLPs (at the same level of accuracy) employing other dividers found in the technical literature, as well as to the commonly used 32-bit floating point implementation. The work of Shanshan Liu, Farzad Niknia, and Fabrizio Lombardi was supported by the NSF Grant CCF-1953961 and Grant 1812467. The work of Pedro Reviriego was supported in part by the Spanish Ministry of Science and Innovation under project ACHILLES (Grant PID2019-104207RB-I00) and the Go2Edge Network (Grant RED2018-102585-T), and in part by the Madrid Community Research Agency under Grant TAPIR-CM P2018/TCS-4496. The work of Weiqiang Liu was supported by the NSFC under Grant 62022041 and Grant 61871216. The work of Ahmed Louri was supported by the NSF Grant CCF-1812495 and Grant 1953980.
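
    As background for the latency discussion above, the following minimal Python sketch models a binary-search style stochastic divider in the unipolar domain, where multiplication is an AND of independent bitstreams. It is only a behavioural illustration of why N search iterations are needed for N bits of quotient resolution; it does not implement the BS-TMR divider or the proposed DS-TMR divider.

```python
# Behavioural sketch of binary-search stochastic division (unipolar domain).
import random

def bitstream(p, length, rng):
    # unipolar stochastic encoding: P(bit = 1) = p
    return [1 if rng.random() < p else 0 for _ in range(length)]

def stochastic_divide(x, y, n_bits=8, length=2**12, seed=0):
    rng = random.Random(seed)
    x_mean = sum(bitstream(x, length, rng)) / length
    q, step = 0.0, 0.5
    for _ in range(n_bits):                  # one search iteration per quotient bit
        candidate = q + step
        qy = [a & b for a, b in zip(bitstream(candidate, length, rng),
                                    bitstream(y, length, rng))]   # q * y via AND
        if sum(qy) / length <= x_mean:       # keep the bit if q * y is still <= x
            q = candidate
        step /= 2
    return q

print(stochastic_divide(0.3, 0.6))           # expected quotient close to 0.5
```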

    Tolerance of Siamese Networks (SNs) to Memory Errors: Analysis and Design

    Full text link
    This paper considers memory errors in a Siamese Network (SN) through an extensive analysis and proposes two schemes (using a weight filter and a code) to provide efficient hardware solutions for error tolerance. Initially, the impact of memory errors on the weights of the SN (stored as floating-point (FP) numbers) is analyzed; this shows that the degradation is mostly caused by outliers in the weights. Two schemes are subsequently proposed. For the weight filter, an analysis is pursued to establish the filter bounds from the maximum/minimum values of the weight distributions, by which outliers can be removed from the operation of the SN. A code scheme for protecting the sign and exponent bits of each weight in an FP number is also proposed; this code incurs no memory overhead by utilizing the 4 least significant bits (LSBs) to store parity bits. Simulation shows that the filter has a better performance for multi-bit error correction (a reduction of 95.288% in changed predictions), while the code achieves superior results for single-bit error correction (a reduction of 99.775% in changed predictions). The combined method that uses the two proposed schemes retains their advantages, so it is adaptive to all scenarios. ASIC-based FP designs of the SN using serial and hybrid implementations are also presented; these pipelined designs utilize a novel multi-layer perceptron (MLP) (as the branch networks of the SN) that operates at a frequency of 681.2 MHz (at a 32nm technology node), which is significantly higher than existing designs found in the technical literature. The proposed error-tolerant approaches also show advantages in overhead compared with, for example, a traditional error correction code (ECC). These error-tolerant MLP-based designs are well suited to hardware/power-constrained platforms.
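
    The weight-filter scheme described above can be illustrated with a short Python/NumPy sketch. It assumes float32 weights, filter bounds set to the minimum/maximum of the error-free weight distribution, and out-of-bounds weights replaced by zero; these choices are illustrative, and the paper's exact bound selection and the parity-based code scheme are not reproduced here.

```python
# Minimal sketch: a bit flip in an FP exponent creates an outlier weight, which
# a bounds-based filter removes from the computation.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=1024).astype(np.float32)
lo, hi = weights.min(), weights.max()        # bounds from the clean distribution

# Inject a single bit flip in the exponent MSB of one stored weight.
bits = weights.view(np.uint32).copy()
bits[10] ^= np.uint32(1 << 30)               # exponent MSB flip -> huge outlier
corrupted = bits.view(np.float32)

# Filter: values outside the learned bounds are zeroed out (treated as removed).
filtered = np.where((corrupted >= lo) & (corrupted <= hi), corrupted, 0.0)
print("corrupted weight:", corrupted[10], "-> after filter:", filtered[10])
```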

    Delta Sigma Modulator-Based Dividers for Accurate and Low Latency Stochastic Computing Systems

    Full text link
    Stochastic computing (SC) has received considerable research interest in the past decade. Significant efforts have been devoted to reducing the computation latency of the stochastic divider, which is the most complex unit in SC. However, current SC systems still lack dividers that can operate in a timely manner with other SC units through aligned processing periods. Moreover, all existing stochastic dividers cannot perform accurate division for input values near the center of the SC computation range. This paper proposes two Delta Sigma Modulator (DSM) based stochastic dividers. The proposed first-order DSM-based divider significantly reduces the additional clock cycles needed for division and also slightly increases the accuracy (e.g., compared with the fastest existing divider of 10-bit resolution, a reduction of 87.5% in the number of additional clock cycles is accomplished, with an average mean square error (MSE) that decreases from 10^-3.9 to 10^-4.0). Moreover, a fully compatible second-order DSM-based divider is proposed. It achieves a higher division accuracy (e.g., an MSE of 10^-4.7 for 10-bit resolution) and does not require additional clock cycles, at the cost of a slightly increased hardware overhead. As an emerging application, SC-based neural networks are implemented as a case study to evaluate the advantages of the proposed designs. The synthesis results show that compared to the network implementation with the most efficient existing stochastic divider, the use of the proposed dividers reduces the total hardware overhead of the network by 32.0% to 46.6% and slightly improves the classification accuracy. Overall, the proposed divider designs enable an SC system to operate with aligned timing, thus resulting in a better implementation.
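
    The division principle behind a first-order DSM-based divider can be sketched behaviourally in a few lines of Python. The model below assumes unipolar values with 0 <= x <= y <= 1 and a simple accumulate-and-quantize loop whose output bitstream mean converges to x / y; the paper's first- and second-order dividers and their integration with other SC units are more elaborate.

```python
# Behavioural model of a first-order delta-sigma style divider: the accumulator
# feedback forces the output bitstream mean toward x / y.
def dsm_divide(x, y, length=4096):
    acc = 0.0
    ones = 0
    for _ in range(length):
        bit = 1 if acc >= 0.0 else 0        # 1-bit quantizer
        ones += bit
        acc += x - bit * y                  # integrate the residual error
    return ones / length                    # bitstream mean approximates x / y

print(dsm_divide(0.3, 0.6))                 # expected ~0.5
print(dsm_divide(0.2, 0.8))                 # expected ~0.25
```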