986 research outputs found

    Fast Implementation of Lifting Based DWT Architecture For Image Compression

    Get PDF
    Technological growth in semiconductor industry have led to unprecedented demand for faster area efficient and low power VLSI circuits for complex image processing applications DWT-IDWT is one of the most popular IP that is used for image transformation In this work a high speed low power DWT IDWT architecture is designed and implemented on ASIC using 130nm Technology 2D DWT architecture based on lifting scheme architecture uses multipliers and adders thus consuming power This paper addresses power reduction in multiplier by proposing a modified algorithm for BZFAD multiplier The proposed BZFAD multiplier is 65 faster and occupies 44 less area compared with the generic multipliers The DWT architecture designed based on modified BZFAD multiplier achieves 35 less power reduction and operates at frequency of 200MHz with latency of 1536 clock cycles for 512x512 image The developed DWT can be used as an IP for VLSI implementatio

    A Synthesizable single-cycle multiply-accumulator

    Get PDF
    The multiplication and multiply-accumulate operations are expensive to implement in hardware for Digital Signal Processing, video, and graphics applications. A standard multiply-accumulator has three inputs and a single output that is equal to the product of two of its inputs added to the third input. For some applications it is desirable for a multiply-accumulator to have two outputs; one output that is the product of the first two inputs, and a second output that is the multiply-accumulate result. The goal of this thesis is to investigate algorithms and architectures used to design multipliers and multiply-accumulators, and to create a multiply-accumulator that computes both outputs in a single clock cycle. Often times in high speed designs the most time-consuming operations are pipelined to meet the system timing requirements. If the multiply-accumulate computation can be reduced to a single-cycle operation the overall processor performance can be improved for many applications. A multiply-accumulator with two outputs can be created using a combination of standard multiply, add, or multiply-accumulate components. Using these components, a multiplier and a multiply-accumulator can be used to produce the outputs in the most time-efficient manner. A multiplier and an adder will result in a smaller design with a larger worst-case delay. Therefore, the goal is to create a multiply-accumulator that is comparable in speed, but requires less area than a design using an industry standard multiplier and multiply-accumulator

    Energy-efficient embedded machine learning algorithms for smart sensing systems

    Get PDF
    Embedded autonomous electronic systems are required in numerous application domains such as Internet of Things (IoT), wearable devices, and biomedical systems. Embedded electronic systems usually host sensors, and each sensor hosts multiple input channels (e.g., tactile, vision), tightly coupled to the electronic computing unit (ECU). The ECU extracts information by often employing sophisticated methods, e.g., Machine Learning. However, embedding Machine Learning algorithms poses essential challenges in terms of hardware resources and energy consumption because of: 1) the high amount of data to be processed; 2) computationally demanding methods. Leveraging on the trade-off between quality requirements versus computational complexity and time latency could reduce the system complexity without affecting the performance. The objectives of the thesis are to develop: 1) energy-efficient arithmetic circuits outperforming state of the art solutions for embedded machine learning algorithms, 2) an energy-efficient embedded electronic system for the \u201celectronic-skin\u201d (e-skin) application. As such, this thesis exploits two main approaches: Approximate Computing: In recent years, the approximate computing paradigm became a significant major field of research since it is able to enhance the energy efficiency and performance of digital systems. \u201cApproximate Computing\u201d(AC) turned out to be a practical approach to trade accuracy for better power, latency, and size . AC targets error-resilient applications and offers promising benefits by conserving some resources. Usually, approximate results are acceptable for many applications, e.g., tactile data processing,image processing , and data mining ; thus, it is highly recommended to take advantage of energy reduction with minimal variation in performance . In our work, we developed two approximate multipliers: 1) the first one is called \u201cMETA\u201d multiplier and is based on the Error Tolerant Adder (ETA), 2) the second one is called \u201cApproximate Baugh-Wooley(BW)\u201d multiplier where the approximations are implemented in the generation of the partial products. We showed that the proposed approximate arithmetic circuits could achieve a relevant reduction in power consumption and time delay around 80.4% and 24%, respectively, with respect to the exact BW multiplier. Next, to prove the feasibility of AC in real world applications, we explored the approximate multipliers on a case study as the e-skin application. The e-skin application is defined as multiple sensing components, including 1) structural materials, 2) signal processing, 3) data acquisition, and 4) data processing. Particularly, processing the originated data from the e-skin into low or high-level information is the main problem to be addressed by the embedded electronic system. Many studies have shown that Machine Learning is a promising approach in processing tactile data when classifying input touch modalities. In our work, we proposed a methodology for evaluating the behavior of the system when introducing approximate arithmetic circuits in the main stages (i.e., signal and data processing stages) of the system. Based on the proposed methodology, we first implemented the approximate multipliers on the low-pass Finite Impulse Response (FIR) filter in the signal processing stage of the application. We noticed that the FIR filter based on (Approx-BW) outperforms state of the art solutions, while respecting the tradeoff between accuracy and power consumption, with an SNR degradation of 1.39dB. Second, we implemented approximate adders and multipliers respectively into the Coordinate Rotational Digital Computer (CORDIC) and the Singular Value Decomposition (SVD) circuits; since CORDIC and SVD take a significant part of the computationally expensive Machine Learning algorithms employed in tactile data processing. We showed benefits of up to 21% and 19% in power reduction at the cost of less than 5% accuracy loss for CORDIC and SVD circuits when scaling the number of approximated bits. 2) Parallel Computing Platforms (PCP): Exploiting parallel architectures for near-threshold computing based on multi-core clusters is a promising approach to improve the performance of smart sensing systems. In our work, we exploited a novel computing platform embedding a Parallel Ultra Low Power processor (PULP), called \u201cMr. Wolf,\u201d for the implementation of Machine Learning (ML) algorithms for touch modalities classification. First, we tested the ML algorithms at the software level; for RGB images as a case study and tactile dataset, we achieved accuracy respectively equal to 97% and 83.5%. After validating the effectiveness of the ML algorithm at the software level, we performed the on-board classification of two touch modalities, demonstrating the promising use of Mr. Wolf for smart sensing systems. Moreover, we proposed a memory management strategy for storing the needed amount of trained tensors (i.e., 50 trained tensors for each class) in the on-chip memory. We evaluated the execution cycles for Mr. Wolf using a single core, 2 cores, and 3 cores, taking advantage of the benefits of the parallelization. We presented a comparison with the popular low power ARM Cortex-M4F microcontroller employed, usually for battery-operated devices. We showed that the ML algorithm on the proposed platform runs 3.7 times faster than ARM Cortex M4F (STM32F40), consuming only 28 mW. The proposed platform achieves 15 7 better energy efficiency than the classification done on the STM32F40, consuming 81mJ per classification and 150 pJ per operation

    A study of arithmetic circuits and the effect of utilising Reed-Muller techniques

    Get PDF
    Reed-Muller algebraic techniques, as an alternative means in logic design, became more attractive recently, because of their compact representations of logic functions and yielding of easily testable circuits. It is claimed by some researchers that Reed-Muller algebraic techniques are particularly suitable for arithmetic circuits. In fact, no practical application in this field can be found in the open literature.This project investigates existing Reed-Muller algebraic techniques and explores their application in arithmetic circuits. The work described in this thesis is concerned with practical applications in arithmetic circuits, especially for minimizing logic circuits at the transistor level. These results are compared with those obtained using the conventional Boolean algebraic techniques. This work is also related to wider fields, from logic level design to layout level design in CMOS circuits, the current leading technology in VLSI. The emphasis is put on circuit level (transistor level) design. The results show that, although Boolean logic is believed to be a more general tool in logic design, it is not the best tool in all situations. Reed-Muller logic can generate good results which can't be easily obtained by using Boolean logic.F or testing purposes, a gate fault model is often used in the conventional implementation of Reed-Muller logic, which leads to Reed-Muller logic being restricted to using a small gate set. This usually leads to generating more complex circuits. When a cell fault model, which is more suitable for regular and iterative circuits, such as arithmetic circuits, is used instead of the gate fault model in Reed-Muller logic, a wider gate set can be employed to realize Reed-Muller functions. As a result, many circuits designed using Reed-Muller logic can be comparable to that designed using Boolean logic. This conclusion is demonstrated by testing many randomly generated functions.The main aim of this project is to develop arithmetic circuits for practical application. A number of practical arithmetic circuits are reported. The first one is a carry chain adder. Utilising the CMOS circuit characteristics, a simple and high speed carry chain is constructed to perform the carry operation. The proposed carry chain adder can be reconstructed to form a fast carry skip adder, and it is also found to be a good application for residue number adders. An algorithm for an on-line adder and its implementation are also developed. Another circuit is a parallel multiplier based on 5:3 counter. The simulations show that the proposed circuits are better than many previous designs, in terms of the number of transistors and speed. In addition, a 4:2 compressor for a carry free adder is investigated. It is shown that the two main schemes to construct the 4:2 compressor have a unified structure. A variant of the Baugh and Wooley algorithm is also studied and generalized in this work

    The Fifth NASA Symposium on VLSI Design

    Get PDF
    The fifth annual NASA Symposium on VLSI Design had 13 sessions including Radiation Effects, Architectures, Mixed Signal, Design Techniques, Fault Testing, Synthesis, Signal Processing, and other Featured Presentations. The symposium provides insights into developments in VLSI and digital systems which can be used to increase data systems performance. The presentations share insights into next generation advances that will serve as a basis for future VLSI design

    Serial-data computation in VLSI

    Get PDF

    The implementation and applications of multiple-valued logic

    Get PDF
    Multiple-Valued Logic (MVL) takes two major forms. Multiple-valued circuits can implement the logic directly by using multiple-valued signals, or the logic can be implemented indirectly with binary circuits, by using more than one binary signal to represent a single multiple-valued signal. Techniques such as carry-save addition can be viewed as indirectly implemented MVL. Both direct and indirect techniques have been shown in the past to provide advantages over conventional arithmetic and logic techniques in algorithms required widely in computing for applications such as image and signal processing. It is possible to implement basic MVL building blocks at the transistor level. However, these circuits are difficult to design due to their non binary nature. In the design stage they are more like analogue circuits than binary circuits. Current integrated circuit technologies are biased towards binary circuitry. However, in spite of this, there is potential for power and area savings from MVL circuits, especially in technologies such as BiCMOS. This thesis shows that the use of voltage mode MVL will, in general not provide bandwidth increases on circuit buses because the buses become slower as the number of signal levels increases. Current mode MVL circuits however do have potential to reduce power and area requirements of arithmetic circuitry. The design of transistor level circuits is investigated in terms of a modern production technology. A novel methodology for the design of current mode MVL circuits is developed. The methodology is based upon the novel concept of the use of non-linear current encoding of signals, providing the opportunity for the efficient design of many previously unimplemented circuits in current mode MVL. This methodology is used to design a useful set of basic MVL building blocks, and fabrication results are reported. The creation of libraries of MVL circuits is also discussed. The CORDIC algorithm for two dimensional vector rotation is examined in detail as an example for indirect MVL implementation. The algorithm is extended to a set of three dimensional vector rotators using conventional arithmetic, redundant radix four arithmetic, and Taylor's series expansions. These algorithms can be used for two dimensional vector rotations in which no scale factor corrections are needed. The new algorithms are compared in terms of basic VLSI criteria against previously reported algorithms. A pipelined version of the redundant arithmetic algorithm is floorplanned and partially laid out to give indications of wiring overheads, and layout densities. An indirectly implemented MVL algorithm such as the CORDIC algorithm described in this thesis would clearly benefit from direct implementation in MVL

    Proceedings of the 21st Conference on Formal Methods in Computer-Aided Design – FMCAD 2021

    Get PDF
    The Conference on Formal Methods in Computer-Aided Design (FMCAD) is an annual conference on the theory and applications of formal methods in hardware and system verification. FMCAD provides a leading forum to researchers in academia and industry for presenting and discussing groundbreaking methods, technologies, theoretical results, and tools for reasoning formally about computing systems. FMCAD covers formal aspects of computer-aided system design including verification, specification, synthesis, and testing
    • …
    corecore