15 research outputs found

    RELIABLE LOW-LATENCY VITERBI ALGORITHM ARCHITECTURES BENCHMARKED ON ASIC AND FPGA

    Get PDF
    The Viterbi algorithm is commonly used in a number of sensitive usage models, including decoding the convolutional codes employed in communications such as satellite communication, cellular relay, and wireless local area networks. Moreover, the algorithm has been applied to automatic speech recognition and storage devices. In this thesis, efficient error detection schemes for architectures based on low-latency, low-complexity Viterbi decoders are presented. The merit of the proposed schemes is that reliability requirements, overhead tolerance, and performance degradation limits are embedded in the structures and can be adapted accordingly. We also present three variants of recomputing with encoded operands and its modifications to detect both transient and permanent faults, coupled with signature-based schemes

    Reliable Low-Latency and Low-Complexity Viterbi Architectures Benchmarked on ASIC and FPGA

    Get PDF
    The Viterbi algorithm is commonly applied in a number of sensitive usage models, including decoding convolutional codes used in communications such as satellite communication, cellular relay, and wireless local area networks. Moreover, the algorithm has been applied to automatic speech recognition and storage devices. In this thesis, efficient error detection schemes for architectures based on low-latency, low-complexity Viterbi decoders are presented. The merit of the proposed schemes is that reliability requirements, overhead tolerance, and performance degradation limits are embedded in the structures and can be adapted accordingly. We also present three variants of recomputing with encoded operands and its modifications to detect both transient and permanent faults, coupled with signature-based schemes. The instrumented decoder architecture has been subjected to extensive error detection assessments through simulations, and to application-specific integrated circuit (ASIC) [32 nm library] and field-programmable gate array (FPGA) [Xilinx Virtex-6 family] implementations for benchmarking. The proposed fine-grained approaches can be tailored to reliability objectives and to the tolerable degradation in performance and implementation metrics
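
    As a rough illustration of the recomputing-with-encoded-operands idea mentioned in this abstract (a generic sketch, not the thesis's actual architectures), the Python snippet below guards a toy add-compare-select (ACS) step of a Viterbi decoder by recomputing it with left-shifted operands, decoding the result, and comparing it against the original; the function names and the one-bit shift are illustrative assumptions.

        def acs(pm0, pm1, bm0, bm1):
            # Add-compare-select: the core Viterbi recursion step.
            cand0 = pm0 + bm0
            cand1 = pm1 + bm1
            return (cand0, 0) if cand0 <= cand1 else (cand1, 1)

        def acs_with_reso(pm0, pm1, bm0, bm1, shift=1):
            # Recompute with shifted (encoded) operands, shift the metric back,
            # and flag a fault if the two results disagree.
            normal = acs(pm0, pm1, bm0, bm1)
            enc_metric, enc_decision = acs(pm0 << shift, pm1 << shift,
                                           bm0 << shift, bm1 << shift)
            recomputed = (enc_metric >> shift, enc_decision)
            return normal, recomputed != normal

        # Example: fault-free runs agree, so no fault is flagged.
        result, fault = acs_with_reso(pm0=3, pm1=5, bm0=2, bm1=1)
        assert result == (5, 0) and not fault

    In hardware, the second (encoded) computation would reuse the same datapath at a later cycle, so a transient fault affecting only one of the two passes shows up as a mismatch.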

    Education and Research Integration of Emerging Multidisciplinary Medical Devices Security

    Get PDF
    Traditional embedded systems such as secure smart cards and nano-sensor networks have been utilized in various usage models. Nevertheless, emerging secure deeply-embedded systems, e.g., implantable and wearable medical devices, have a comparably larger “attack surface”. Specifically, with respect to medical devices, a security breach can be life-threatening (and adopting traditional solutions might not be practical due to the tight constraints of these often battery-powered systems); unlike traditional embedded systems, it is not only a matter of financial loss. Unfortunately, although emerging cryptographic engineering research for such deeply-embedded systems has started addressing this critical problem, university education (at both the graduate and undergraduate levels) lags comparably behind. One of the pivotal reasons for this lag is the multi-disciplinary nature of the emerging security bottlenecks. Motivated by this, in this work, at Rochester Institute of Technology, we present an effective research and education integration strategy to overcome this issue in one of the most critical deeply-embedded systems, i.e., medical devices. Moreover, we present the results of two years of implementation of the presented strategy at the graduate level through fault analysis attacks, a variant of side-channel attacks. We note that the authors also supervised an undergraduate student, and the outcome of the presented work has been assessed for that student as well; however, the emphasis is on graduate-level integration. The results show the success of the presented methodology while pinpointing the challenges encountered in integrating medical devices security research and teaching, compared with traditional embedded system security. We would like to emphasize that our integration approaches are general and scalable to other critical infrastructures as well

    Efficient Error Detection Architectures for Low-Energy Block Ciphers with the Case Study of Midori Benchmarked on FPGA

    Get PDF
    Achieving secure, high-performance implementations of efficient block ciphers for constrained applications such as implantable and wearable medical devices is a priority. However, the security of these algorithms is not guaranteed in the presence of malicious and natural faults. Recently, a new lightweight block cipher, Midori, has been proposed, which optimizes energy consumption while also having low latency and hardware complexity. The cipher comes in two energy-efficient variants, Midori64 and Midori128, with block sizes of 64 and 128 bits, respectively. In this thesis, fault diagnosis schemes for the variants of Midori are proposed. To the best of our knowledge, no fault diagnosis scheme has been presented in the literature for Midori to date. The fault diagnosis schemes are provided for the nonlinear S-box layer and for the round structures of both the 64-bit and 128-bit Midori symmetric-key ciphers. The proposed schemes are benchmarked on a field-programmable gate array (FPGA) and their error coverage is assessed with fault-injection simulations. These proposed error detection architectures make the implementations of this new low-energy lightweight block cipher more reliable
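
    As a loose sketch of one common ingredient of S-box fault diagnosis, parity prediction (a generic illustration, not necessarily the scheme proposed in the thesis, and using a placeholder 4-bit S-box rather than Midori's actual Sb0/Sb1 tables), the check below compares the parity of each S-box output against a parity value predicted from the input:

        # Placeholder 4-bit S-box (a stand-in, not Midori's actual tables).
        SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

        def parity(x):
            # Parity (XOR of all bits) of a nibble.
            return bin(x).count("1") & 1

        # Predicted output parity, derived offline; in hardware this would be a
        # small, independent logic block running alongside the S-box datapath.
        PREDICTED_PARITY = [parity(y) for y in SBOX]

        def sbox_with_parity_check(nibble):
            # Apply the S-box and flag a fault if the output parity disagrees
            # with the prediction computed from the input nibble.
            out = SBOX[nibble]
            fault = parity(out) != PREDICTED_PARITY[nibble]
            return out, fault

        # Fault-free lookups never raise the flag.
        assert all(not sbox_with_parity_check(x)[1] for x in range(16))

    Because the prediction logic is separate from the S-box itself, a fault in the S-box datapath that flips an odd number of output bits produces a parity mismatch and is detected.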

    Multidisciplinary Approaches and Challenges in Integrating Emerging Medical Devices Security Research and Education

    Get PDF
    Traditional embedded systems such as secure smart cards and nano-sensor networks have been utilized in various usage models. Nevertheless, emerging secure deeply-embedded systems, e.g., implantable and wearable medical devices, have comparably larger “attack surface”. Specifically, with respect to medical devices, a security breach can be life-threatening (for which adopting traditional solutions might not be practical due to tight constraints of these often-battery-powered systems), and unlike traditional embedded systems, it is not only a matter of financial loss. Unfortunately, although emerging cryptographic engineering research mechanisms for such deeply-embedded systems have started solving this critical, vital problem, university education (at both graduate and undergraduate level) lags comparably. One of the pivotal reasons for such a lag is the multi-disciplinary nature of the emerging security bottlenecks. Based on the aforementioned motivation, in this work, at Rochester Institute of Technology, we present an effective research and education integration strategy to overcome this issue in one of the most critical deeply-embedded systems, i.e., medical devices. Moreover, we present the results of two years of implementation of the presented strategy at graduate-level through fault analysis attacks, a variant of side-channel attacks. We note that the authors also supervise an undergraduate student and the outcome of the presented work has been assessed for that student as well; however, the emphasis is on graduate-level integration. The results of the presented work show the success of the presented methodology while pinpointing the challenges encountered compared to traditional embedded system security research/teaching integration of medical devices security. We would like to emphasize that our integration approaches are general and scalable to other critical infrastructures as well

    A flexible heterogeneous hardware/software solution for real-time high-definition H.264 motion estimation

    Get PDF
    The MPEG-4 AVC/H.264 video compression standard introduces a high degree of motion estimation complexity. Quarter-pixel accuracy and variable block size significantly enhance compression performance over previous standards, but increase computation requirements. Firstly, a DSP-based solution achieves real-time integer motion estimation. Nevertheless, fractional-pixel refinement is too computationally intensive to be efficiently processed on a software-based processor. Secondly, to address this restriction, a flexible and low-complexity VLSI sub-pixel refinement coprocessor is designed. Thanks to an improved datapath, a high throughput is achieved with low logic resources. Finally, we propose a heterogeneous (DSP-FPGA) solution to handle real-time motion estimation with variable block size and fractional-pixel accuracy for high-definition video. It combines efficiency and programmability, and its flexibility offers complexity versus performance trade-offs. The system achieves motion estimation of 720p sequences at up to 60 frames per second
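
    To make the integer-pel block-matching step concrete, here is a minimal full-search motion search in Python using the sum of absolute differences (SAD) criterion; this is a textbook illustration, not the paper's DSP/FPGA partitioning or its fractional-pixel refinement, and the block size and search range are arbitrary assumptions.

        import numpy as np

        def sad(a, b):
            # Sum of absolute differences between two equally sized blocks.
            return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

        def integer_motion_search(cur, ref, bx, by, bsize=16, search=8):
            # Exhaustive integer-pel search: return the displacement (dx, dy)
            # within +/- search pixels that minimises the SAD, plus its cost.
            block = cur[by:by + bsize, bx:bx + bsize]
            best, best_cost = (0, 0), None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + bsize > ref.shape[0] or x + bsize > ref.shape[1]:
                        continue
                    cost = sad(block, ref[y:y + bsize, x:x + bsize])
                    if best_cost is None or cost < best_cost:
                        best_cost, best = cost, (dx, dy)
            return best, best_cost

    The fractional-pixel refinement handled by the coprocessor would then interpolate the reference around the winning integer displacement and repeat a much smaller search at half- and quarter-pel positions.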

    Design Techniques for High Performance Wireline Communication and Security Systems

    Full text link
    As the amount of data traffic on the internet grows exponentially, towards thousands of exabytes by 2020, high-performance and high-efficiency communication and security solutions are constantly in high demand, calling for innovative solutions. Within-server communication dominates today's network data transfer, outweighing between-server and server-to-user data transfer by an order of magnitude. Solutions for within-server communication tend to be very wideband, i.e., on the order of tens of gigahertz, and equalizers are widely deployed to provide extended bandwidth at reasonable cost. However, using equalizers typically costs signal-to-noise ratio (SNR) at the receiver side, and the SNR available from the channel degrades as the data rate increases, making it harder to meet tight constraints on error rate, delay, and power consumption. In this thesis, two equalization solutions that address optimal equalizer implementation are discussed. One is a low-power, high-speed maximum likelihood sequence detection (MLSD) design that achieves record energy efficiency, below 10 picojoules per bit. The other is a phase-shaping equalizer design that suppresses inter-symbol interference at almost zero cost in SNR. The growing amount of communication also challenges the design of security subsystems, and the emerging need for post-quantum security adds to the difficulties. Most currently deployed cryptographic primitives rely on the hardness of discrete logarithms, which could potentially be solved efficiently with a powerful enough quantum computer. Efficient post-quantum encryption solutions have therefore become of substantial value. In this thesis, a fast and efficient lattice encryption application-specific integrated circuit is presented that surpasses the energy efficiency of embedded processors by 4 orders of magnitude.
    PhD, Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/146092/1/shisong_1.pd
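
    For context on the lattice-based encryption mentioned at the end of this abstract, the snippet below is a toy Regev-style LWE key generation, encryption, and decryption in Python with deliberately tiny, insecure parameters; it only illustrates the general principle and bears no relation to the parameters, scheme, or architecture of the ASIC in the thesis.

        import numpy as np

        rng = np.random.default_rng(0)
        n, m, q = 8, 16, 257                 # toy, insecure parameters

        # Key generation: secret s, public pair (A, b = A s + e mod q) with small noise e.
        s = rng.integers(0, q, n)
        A = rng.integers(0, q, (m, n))
        e = rng.integers(-1, 2, m)           # noise in {-1, 0, 1}
        b = (A @ s + e) % q

        def encrypt(bit):
            # Random subset-sum of the public samples; message in the "high" position.
            r = rng.integers(0, 2, m)
            u = (r @ A) % q
            v = (r @ b + bit * (q // 2)) % q
            return u, v

        def decrypt(u, v):
            # Remove s.u; what remains is bit*(q//2) plus a small noise term.
            d = (v - u @ s) % q
            return int(q // 4 < d < 3 * q // 4)

        assert all(decrypt(*encrypt(bit)) == bit for bit in (0, 1, 1, 0))

    With these bounds the accumulated noise stays well below q/4, so decryption always recovers the bit; a real post-quantum scheme uses structured (e.g., ring) variants and far larger parameters.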

    Architectural Solutions for NanoMagnet Logic

    Get PDF
    The successful era of CMOS technology is coming to an end. The limit on the minimum fabrication dimensions of transistors and the increasing leakage power hinder the technological scaling that has characterized the last decades. This problem has been addressed in several ways by changing the architectures implemented in CMOS, adopting parallel processors and thus increasing throughput at the same operating frequency. However, architectural alternatives cannot be the definitive answer to a continuous increase in performance dictated by Moore's law; the problem must be addressed from a technological point of view. Several alternative technologies that could substitute CMOS in the coming years are currently under study. Among them, magnetic technologies such as NanoMagnet Logic (NML) are interesting because they do not dissipate any leakage power. Moreover, magnets have memory capability, so it is possible to merge logic and memory in the same device. However, magnetic circuits, and NML in this specific research, also have some important drawbacks that need to be addressed: first, the circuit clock frequency is limited to 100 MHz to avoid errors in data propagation; second, there is a connection between circuit layout and timing, and in particular longer wires have longer latency. These drawbacks are intrinsic to the technology and therefore cannot be avoided; the only option is to limit their impact from an architectural point of view.

    The first step followed in the research path of this thesis is indeed the choice and optimization of architectures able to deal with the problems of NML. Systolic Arrays are identified as an ideal solution for this technology, because they are regular structures with local interconnections that limit the long latency of wires; moreover, they are composed of several Processing Elements that work in parallel, thus exploiting parallelization to increase throughput and limiting the impact of the low clock frequency. Through the analysis of Systolic Arrays for NML, several possible improvements have been identified and addressed: 1) a rigorous way to increase throughput with interleaving has been defined, providing equations that allow estimating the number of operations to be interleaved and the rules for providing inputs; 2) a latency-insensitive circuit has been designed, which exploits a data communication protocol between processing elements to avoid data synchronization problems. This feature has been exploited to design a latency-insensitive Systolic Array able to execute the Floyd-Steinberg dithering algorithm. All the improvements presented in this framework apply to Systolic Arrays implemented in any technology, so they can also be exploited to increase the performance of today's CMOS parallel circuits. This research path is presented in Chapter 3.

    While Systolic Arrays are an interesting solution for NML, their usage could be quite limited because they are normally application-specific. The second research path addresses this problem: a Reconfigurable Systolic Array is presented that can be programmed to execute several algorithms. This architecture has been tested by implementing many algorithms, including FIR and IIR filters, the Discrete Cosine Transform, and Matrix Multiplication. This research path is presented in Chapter 4.

    In common Von Neumann architectures, the logic part of the circuit and the memory are separated, and today bus communication between logic and memory represents the bottleneck of the system. This problem is addressed by presenting Logic-In-Memory (LIM), an architecture where memory elements are merged into logic ones. This research path aims at defining a real LIM architecture, which has been done in two steps. The first step is represented by an architecture composed of three layers: memory, routing, and logic. In the second step the routing plane is no longer present, and its features are inherited by the memory plane. In this solution a pyramidal memory model is used, where memories near the logic elements contain the most probably used data, while the other memory layers contain the remaining data and the instruction set. This circuit has been tested with odd-even sort algorithms (a brief sketch of the algorithm follows this abstract) and benchmarked against GPUs and ASICs. This research path is presented in Chapter 5.

    MagnetoElastic NML (ME-NML) is a technological improvement of the NML principle, proposed by researchers of Politecnico di Torino, where the clock system is based on the stretch induced in a piezoelectric substrate when a voltage is applied to its boundaries. The main advantage of this solution is that it consumes much less power than the classic clock implementation. This technology has not yet been investigated from an architectural point of view or on complex circuits. In this research field, a standard methodology for the design of ME-NML circuits has been proposed, based on a Standard Cell Library and an enhanced VHDL model. The effectiveness of this methodology has been proved by designing a Galois Field Multiplier. Moreover, the serial-parallel trade-off in ME-NML has been investigated, designing three different solutions for the Multiply and Accumulate structure. This research path is presented in Chapter 6.

    While ME-NML is an extremely interesting technology, it needs to be combined with other, faster technologies to obtain a truly competitive system. Signal interfaces between NML and other technologies (mainly CMOS) have rarely been presented in the literature. A mixed-technology multiplexer is designed and presented as the basis for a CMOS-to-NML interface. The reverse interface (from ME-NML to CMOS) is instead based on a sensing circuit for the Faraday effect: a change in the polarization of a magnet induces an electric field that can be used to generate an input signal for a CMOS circuit. This research path is presented in Chapter 7.

    The research work presented in this thesis represents a fundamental milestone on the path towards nanotechnologies. The most important achievement is the design and simulation of complex circuits with NML, benchmarking this technology with real application examples. The characterization of a technology on complex functions is a major step that has not yet been addressed in the literature for NML; only in this way is it possible to intercept in advance any weakness of NanoMagnet Logic that cannot be discovered considering only small circuits. Moreover, the architectural improvements introduced in this thesis, although technology-driven, can actually be applied to any technology; we have demonstrated the advantages that derive from applying them to CMOS circuits. This thesis therefore represents a major step in two directions: the first is the enhancement of NML technology; the second is a general improvement of parallel architectures and the development of the new Logic-In-Memory paradigm
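
    Since the Logic-In-Memory architecture above is benchmarked with odd-even sorting, the short Python sketch below shows the generic odd-even transposition sort itself: n phases of purely local compare-and-swap operations, which is the property that makes it map naturally onto systolic or logic-in-memory arrays (this is the textbook algorithm, not the thesis's hardware mapping).

        def odd_even_transposition_sort(values):
            # n phases; each phase performs only neighbour compare-and-swaps,
            # alternating between even-indexed and odd-indexed pairs, so every
            # step touches only adjacent elements (local interconnections only).
            a = list(values)
            n = len(a)
            for phase in range(n):
                start = phase % 2
                for i in range(start, n - 1, 2):
                    if a[i] > a[i + 1]:
                        a[i], a[i + 1] = a[i + 1], a[i]
            return a

        assert odd_even_transposition_sort([5, 1, 4, 2, 3]) == [1, 2, 3, 4, 5]

    In an array of processing elements, each phase runs in one step with all compare-and-swap operations in parallel, so the whole sort completes in n steps regardless of the data.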

    High performance and error resilient probabilistic inference system for machine learning

    Get PDF
    Many real-world machine learning applications can be viewed as inferring the best label assignment in maximum a posteriori probability (MAP) problems. Since these MAP problems are NP-hard in general, they are often handled with approximate inference algorithms on Markov random fields (MRFs), such as belief propagation (BP). This approximate inference is still computationally demanding, however, so custom hardware accelerators are attractive for high performance and energy efficiency. Various custom hardware implementations employ BP to achieve reasonable performance for real-world applications such as stereo matching. Because it lacks convergence guarantees, however, BP often fails to provide the right answer, degrading the performance of the hardware. We therefore consider sequential tree-reweighted message passing (TRW-S), which avoids many of BP's convergence problems through the sequential execution of its computations, but whose sequential nature makes a high-throughput parallel implementation challenging. In this work, we propose a novel streaming hardware architecture that parallelizes the sequential computations of TRW-S. Experimental results on stereo matching benchmarks show promising performance of our hardware implementation compared to the software implementation as well as other BP-based custom hardware or GPU implementations.

    From this result, we further demonstrate video-rate, high-quality stereo matching on a hybrid CPU+FPGA platform. We propose three frame-level optimization techniques to fully exploit the computational resources of the platform and achieve significant speed-up. We first propose a message reuse scheme guided by simple scene change detection, which lets the current inference build on the previous frame's result whenever the two are expected to be similar. We also exploit frame-level parallelization to process multiple frames in parallel on the multiple FPGAs available in the platform. This parallelized hardware procedure is further pipelined with data management on the CPU to overlap the execution of the two and thereby reduce the overall processing time of a stereo video sequence. Experimental results with real-world stereo video sequences show video-rate operation of our stereo matching system for QVGA stereo videos.

    Next, we consider the error resilience of the message passing hardware for energy-efficient implementation. Modern nanoscale CMOS process technologies suffer from reliability problems caused by process, temperature, and voltage variations. Conventional approaches to such unreliability (e.g., designing for the worst-case scenario) are complex and inefficient in terms of hardware resources and energy consumption. As machine learning applications are inherently probabilistic and robust to errors, statistical error compensation (SEC) techniques can play a significant role in achieving robust and energy-efficient implementations. SEC embraces the statistical nature of errors and uses statistical and probabilistic techniques to build robust systems; energy efficiency is obtained by trading off the enhanced robustness against energy. In this work, we analyze the error resilience of our message passing inference hardware subject to hardware errors (e.g., errors caused by timing violations in circuits) and explore the application of a popular SEC technique, algorithmic noise tolerance (ANT), to this hardware (see the sketch following this abstract). Analysis and simulations show that the TRW-S message passing hardware tolerates small-magnitude arithmetic errors, but large-magnitude errors cause significantly inaccurate inference results that need to be corrected using SEC. Experimental results show that the proposed ANT-based hardware can tolerate an error rate of 21.3% with a performance degradation of only 3.5% and an energy saving of 39.7% compared to error-free hardware.

    Lastly, we extend our TRW-S hardware toward a general-purpose machine learning framework. We propose an advanced streaming architecture with a flexible choice of MRF setting that achieves a 10-40x speedup across a variety of computer vision applications. Furthermore, we provide a better theoretical understanding of the error resiliency of TRW-S, and of the implications of ANT for TRW-S, under more general MRF settings, along with strong empirical support
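
    As an informal illustration of the algorithmic noise tolerance (ANT) technique referenced above (a generic reduced-precision-replica example in Python; the estimator, error model, operand width, and threshold are assumptions, not the thesis's actual design), the main multiplier output is accepted only if it stays close to a cheap estimate computed from the operands' most significant bits:

        import random

        def estimate_product(a, b, drop_bits=8):
            # Reduced-precision estimator: multiply only the high bits.
            # For 16-bit operands the estimation error stays below 2**25.
            return ((a >> drop_bits) * (b >> drop_bits)) << (2 * drop_bits)

        def faulty_product(a, b, flip_prob=0.05):
            # Main multiplier with occasional large-magnitude errors,
            # a stand-in for timing-violation faults in the datapath.
            out = a * b
            if random.random() < flip_prob:
                out ^= 1 << random.randrange(28, 32)   # flip a high-order bit
            return out

        def ant_product(a, b, threshold=1 << 26):
            # Accept the main output unless it deviates from the cheap
            # estimate by more than the threshold; otherwise fall back
            # to the estimate (small quality loss instead of a large error).
            main = faulty_product(a, b)
            est = estimate_product(a, b)
            return est if abs(main - est) > threshold else main

        # 16-bit operands: fault-free results pass through exactly, while
        # large injected errors are detected and replaced by the estimate.
        x, y = 51234, 40321
        print(ant_product(x, y), x * y)

    The threshold is chosen above the estimator's worst-case error but below the magnitude of the injected high-order bit flips, which is the basic trade-off behind ANT's error-rate versus quality-degradation numbers quoted in the abstract.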