
    Algorithms and VLSI architectures for parametric additive synthesis

    A parametric additive synthesis approach to sound synthesis is advantageous as it can model sounds at a large scale, unlike the classical sinusoidal additive synthesis paradigms. It is known that a large body of naturally occurring sounds are resonant in character and thus fit the concept well. This thesis is concerned with the computational optimisation of a superclass of formant synthesis which extends the sinusoidal parameters with a spread parameter known as bandwidth. A modified formant algorithm is introduced here which can be traced back to work done at IRCAM, Paris. When impulse driven, a filter-based approach to modelling a formant limits the computational workload. It is assumed that the filter's coefficients are fixed at initialisation, thus avoiding interpolation, which can cause the filter to become chaotic. A filter which is more complex than a second-order section is required. Temporal resolution of an impulse generator is achieved by using a two-stage polyphase decimator which drives many filterbanks. Each filterbank describes one formant and is composed of sub-elements which allow variation of the formant's parameters. A resource manager is discussed to overcome the possibility of all sub-banks operating in unison. All filterbanks for one voice are connected in series to the impulse generator and their outputs are summed and scaled accordingly. An explorative study of number systems for DSP algorithms and their architectures is also presented, and a new theoretical mechanism for multi-level logic based DSP is invented; its aims are to reduce the number of transistors and to increase their functionality. Finally, synthesis algorithms and VLSI architectures are reviewed in a case study comparing a filter-based bit-serial sinusoidal generator with a CORDIC-based one. They are of similar size, but the latter is always guaranteed to be stable.
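
    As a concrete illustration of the impulse-driven, filter-based formant model described above, here is a minimal sketch assuming a single fixed-coefficient two-pole resonator per formant and hypothetical parameter values; the thesis itself argues that a filter more complex than such a second-order section is required, so this shows only the principle of coefficients fixed at initialisation.

```python
import math

def resonator_coeffs(f0, bandwidth, fs):
    """Fixed two-pole resonator for one formant; coefficients are computed
    once at initialisation, so no interpolation can destabilise the filter."""
    r = math.exp(-math.pi * bandwidth / fs)   # pole radius from the bandwidth (spread) parameter
    theta = 2.0 * math.pi * f0 / fs           # pole angle from the formant centre frequency
    return (1.0 - r), -2.0 * r * math.cos(theta), r * r   # crude gain, a1, a2

def render_formant(f0=800.0, bandwidth=80.0, pitch_hz=100.0, fs=48000, n=4800):
    """Drive the resonator with an impulse train: one formant of one voice."""
    g, a1, a2 = resonator_coeffs(f0, bandwidth, fs)
    period = int(fs / pitch_hz)
    y1 = y2 = 0.0
    out = []
    for i in range(n):
        x = 1.0 if i % period == 0 else 0.0   # impulse excitation
        y = g * x - a1 * y1 - a2 * y2         # direct-form difference equation
        y1, y2 = y, y1
        out.append(y)
    return out
```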

    THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT

    A 16-bit floating-point (FP) Arithmetic Logic Unit (ALU) was designed and implemented in 0.35µm CMOS technology. Typical uses of the 16-bit FP ALU include graphics processors and embedded multimedia applications. The ALUs of modern microprocessors use a fused multiply-add (FMA) design technique. An advantage of the FMA is that it removes the need for the comparator required by a normal FP adder. The FMA consists of a multiplier, shifters, adders and a rounding circuit. A fast multiplier based on the Wallace tree configuration was designed. The number of partial products was greatly reduced by the use of a modified Booth encoder, and the Wallace tree was chosen to reduce the number of partial-product reduction layers. The multiplier also involved the design of a pass-transistor-based 4:2 compressor, whose average delay of 55 ps was found to be seven times faster than a full-adder-based 4:2 compressor. The shifters consist of separate left and right shifters built from multiplexers; the shift amount is calculated from the exponents of the three operands. The addition operation is implemented using a carry-skip adder (CSK). The average delay of the CSK was 1.05 ns, about 400 ps slower than a carry-lookahead adder; its advantages are reduced power, gate count and area compared to a similarly sized carry-lookahead adder. The adder computes the sum of the multiplier result and the shifted addend. In most modern computers, division is performed in software, eliminating the need for a separate hardware unit; here, the FMA hardware unit was reused to perform FP division. The FP divider uses the Newton-Raphson algorithm to solve division by iteration. The initial approximation, accurate to five bits, was assumed to be pre-stored in cache memory, and a separate clock cycle for the cache read was assumed before the start of the FP division operation. To significantly reduce the area of the design, only one multiplier was used. Round-to-nearest was implemented using an 11-bit variable CSK adder, the preferred technique among the rounding schemes compared. In both FMA and division, rounding was performed after the computation of the final result, during the last clock cycle of operation. Testability analysis was performed for the multiplier, the most complex and critical part of the FP ALU. The specific aim was to ensure the correct operation of the multiplier and thus guarantee the correctness of the FMA circuit at the layout stage. The multiplier's output was tested by identifying the minimal number of input vectors that toggle the inputs of the multiplier's 4:2 compressors. The test vectors were identified in a semi-automated manner using the Perl scripting language. The multiplier was tested with a set of thirty-one vectors, achieving a fault coverage of 90.09%. The layout was implemented using the IC Station tool from Mentor Graphics and resulted in a chip area of 1.96 mm². The specifications for the basic arithmetic operations were met successfully: FP division completes within six clock cycles, while FMA, FP addition, FP subtraction and FP multiplication complete within three clock cycles.
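
    The division scheme described above can be sketched as follows. The Newton-Raphson recurrence is the one named in the abstract, while the linear seed 48/17 - (32/17)m is a standard textbook choice standing in here for the five-bit approximation pre-stored in cache; accuracy roughly doubles per iteration, so two passes cover the 11-bit half-precision significand.

```python
import math

def nr_reciprocal(d, iterations=2):
    """Newton-Raphson reciprocal refinement: x -> x * (2 - d*x).
    A ~5-bit seed reaches the 11-bit half-precision significand in two
    iterations, each of which reuses the single shared FMA multiplier."""
    m, e = math.frexp(d)                  # d = m * 2**e, with m in [0.5, 1)
    x = 48.0 / 17.0 - (32.0 / 17.0) * m   # textbook linear seed, ~4-5 bits accurate
    for _ in range(iterations):
        x = x * (2.0 - m * x)             # one multiply-add pass per iteration
    return math.ldexp(x, -e)              # undo the exponent scaling

def fp_divide(a, d):
    """a / d computed as a * (1/d), the structure the FMA-based divider follows."""
    return a * nr_reciprocal(d)
```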

    Resiliency Mechanisms for In-Memory Column Stores

    The key objective of database systems is to reliably manage data, while high query throughput and low query latency are core requirements. To date, database research activities have mostly concentrated on the second part. However, due to the constant shrinking of transistor feature sizes, integrated circuits are becoming more and more unreliable, and transient hardware errors in the form of multi-bit flips are becoming more and more prominent. A recent study (2013) of a large high-performance cluster with around 8,500 nodes measured a failure rate of 40 FIT per DRAM device; for that system, this means a single- or multi-bit flip occurs every 10 hours, which is unacceptably high for enterprise and HPC scenarios. Causes can be cosmic rays, heat, or electrical crosstalk, with the latter being actively exploited through the RowHammer attack. It has been shown that memory cells are more prone to bit flips than logic gates, and several surveys have found multi-bit flip events in main memory modules of today's data centers. Due to the shift towards in-memory data management systems, where all business-related data and query intermediate results are kept solely in fast main memory, such systems are in great danger of delivering corrupt results to their users. Hardware techniques cannot be scaled to compensate for the exponentially increasing error rates. In other domains, there is an increasing interest in software-based solutions to this problem, but the proposed methods come with huge runtime and/or storage overheads, which are unacceptable for in-memory data management systems. In this thesis, we investigate how to integrate bit flip detection mechanisms into in-memory data management systems. To achieve this goal, we first build an understanding of bit flip detection techniques and select two error codes, AN codes and XOR checksums, suited to the requirements of in-memory data management systems. The most important requirement is the effectiveness of the codes in detecting bit flips; we meet this goal through AN codes, which exhibit better and adaptable error detection capabilities compared to those found in today's hardware. The second most important goal is efficiency in terms of coding latency; we meet this by introducing fundamental performance improvements to AN codes and by vectorizing both chosen codes' operations. We integrate bit flip detection mechanisms into the lowest storage layer and the query processing layer in such a way that the rest of the data management system and the user can stay oblivious of any error detection. This covers both base columns and pointer-heavy index structures such as the ubiquitous B-Tree. Additionally, our approach allows adaptable, on-the-fly bit flip detection during query processing with only very little impact on query latency, and AN coding allows intermediate results to be recoded with virtually no performance penalty. We support our claims with exhaustive runtime and throughput measurements throughout the whole thesis and with an end-to-end evaluation using the Star Schema Benchmark. To the best of our knowledge, we are the first to present such holistic and fast bit flip detection in a large software infrastructure such as an in-memory data management system.
    Finally, most of the source code fragments used to obtain the results in this thesis are open source and freely available.

    Table of contents:
    1 INTRODUCTION
      1.1 Contributions of this Thesis
      1.2 Outline
    2 PROBLEM DESCRIPTION AND RELATED WORK
      2.1 Reliable Data Management on Reliable Hardware
      2.2 The Shift Towards Unreliable Hardware
      2.3 Hardware-Based Mitigation of Bit Flips
      2.4 Data Management System Requirements
      2.5 Software-Based Techniques For Handling Bit Flips
        2.5.1 Operating System-Level Techniques
        2.5.2 Compiler-Level Techniques
        2.5.3 Application-Level Techniques
      2.6 Summary and Conclusions
    3 ANALYSIS OF CODING TECHNIQUES
      3.1 Selection of Error Codes
        3.1.1 Hamming Coding
        3.1.2 XOR Checksums
        3.1.3 AN Coding
        3.1.4 Summary and Conclusions
      3.2 Probabilities of Silent Data Corruption
        3.2.1 Probabilities of Hamming Codes
        3.2.2 Probabilities of XOR Checksums
        3.2.3 Probabilities of AN Codes
        3.2.4 Concrete Error Models
        3.2.5 Summary and Conclusions
      3.3 Throughput Considerations
        3.3.1 Test Systems Descriptions
        3.3.2 Vectorizing Hamming Coding
        3.3.3 Vectorizing XOR Checksums
        3.3.4 Vectorizing AN Coding
        3.3.5 Summary and Conclusions
      3.4 Comparison of Error Codes
        3.4.1 Effectiveness
        3.4.2 Efficiency
        3.4.3 Runtime Adaptability
      3.5 Performance Optimizations for AN Coding
        3.5.1 The Modular Multiplicative Inverse
        3.5.2 Faster Softening
        3.5.3 Faster Error Detection
        3.5.4 Comparison to Original AN Coding
        3.5.5 The Multiplicative Inverse Anomaly
      3.6 Summary
    4 BIT FLIP DETECTING STORAGE
      4.1 Column Store Architecture
        4.1.1 Logical Data Types
        4.1.2 Storage Model
        4.1.3 Data Representation
        4.1.4 Data Layout
        4.1.5 Tree Index Structures
        4.1.6 Summary
      4.2 Hardened Data Storage
        4.2.1 Hardened Physical Data Types
        4.2.2 Hardened Lightweight Compression
        4.2.3 Hardened Data Layout
        4.2.4 UDI Operations
        4.2.5 Summary and Conclusions
      4.3 Hardened Tree Index Structures
        4.3.1 B-Tree Verification Techniques
        4.3.2 Justification For Further Techniques
        4.3.3 The Error Detecting B-Tree
      4.4 Summary
    5 BIT FLIP DETECTING QUERY PROCESSING
      5.1 Column Store Query Processing
      5.2 Bit Flip Detection Opportunities
        5.2.1 Early Onetime Detection
        5.2.2 Late Onetime Detection
        5.2.3 Continuous Detection
        5.2.4 Miscellaneous Processing Aspects
        5.2.5 Summary and Conclusions
      5.3 Hardened Intermediate Results
        5.3.1 Materialization of Hardened Intermediates
        5.3.2 Hardened Bitmaps
      5.4 Summary
    6 END-TO-END EVALUATION
      6.1 Prototype Implementation
        6.1.1 AHEAD Architecture
        6.1.2 Diversity of Physical Operators
        6.1.3 One Concrete Operator Realization
        6.1.4 Summary and Conclusions
      6.2 Performance of Individual Operators
        6.2.1 Selection on One Predicate
        6.2.2 Selection on Two Predicates
        6.2.3 Join Operators
        6.2.4 Grouping and Aggregation
        6.2.5 Delta Operator
        6.2.6 Summary and Conclusions
      6.3 Star Schema Benchmark Queries
        6.3.1 Query Runtimes
        6.3.2 Improvements Through Vectorization
        6.3.3 Storage Overhead
        6.3.4 Summary and Conclusions
      6.4 Error Detecting B-Tree
        6.4.1 Single Key Lookup
        6.4.2 Key Value-Pair Insertion
      6.5 Summary
    7 SUMMARY AND CONCLUSIONS
      7.1 Future Work
    A APPENDIX
      A.1 List of Golden As
      A.2 More on Hamming Coding
        A.2.1 Code examples
        A.2.2 Vectorization
    BIBLIOGRAPHY
    LIST OF FIGURES
    LIST OF TABLES
    LIST OF LISTINGS
    LIST OF ACRONYMS
    LIST OF SYMBOLS
    LIST OF DEFINITIONS
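
    A small sketch of the two selected codes under illustrative assumptions: the constant A below is merely an odd example rather than one of the thesis's "golden As", the 64-bit word width is assumed, and decoding via the modular multiplicative inverse mirrors the optimisation the outline lists in Section 3.5.1.

```python
A = 233                       # illustrative odd constant; the thesis derives "golden As"
WORD = 1 << 64                # assumed working width of the hardened integers
A_INV = pow(A, -1, WORD)      # modular multiplicative inverse for divide-free decoding

def an_encode(n):
    """AN coding: store n * A. Data must satisfy n <= WORD // A."""
    return (n * A) % WORD

def an_decode(c):
    """Check and decode in one multiply: for a valid codeword the inverse
    multiplication yields n <= WORD // A; any bit flip that makes c a
    non-multiple of A yields a larger value and is detected."""
    n = (c * A_INV) % WORD
    if n > WORD // A:
        raise ValueError("bit flip detected in AN-coded value")
    return n

def xor_checksum(words):
    """XOR checksum over a block of words, e.g. one column partition."""
    s = 0
    for w in words:
        s ^= w
    return s

# usage: flipping a single stored bit breaks divisibility by A
corrupted = an_encode(42) ^ (1 << 13)
assert corrupted % A != 0
```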

    Energy efficient hardware acceleration of multimedia processing tools

    The world of mobile devices is experiencing an ongoing trend of feature enhancement and general-purpose multimedia platform convergence. This trend poses many grand challenges, the most pressing being their limited battery life as a consequence of delivering computationally demanding features. The envisaged mobile application features can be considered to be accelerated by a set of underpinning hardware blocks. Based on the survey that this thesis presents on modern video compression standards and their associated enabling technologies, it is concluded that tight energy and throughput constraints can still be effectively tackled at the algorithmic level in order to design re-usable optimised hardware acceleration cores. To prove these conclusions, the work in this thesis focuses on two of the basic enabling technologies that support mobile video applications, namely the Shape Adaptive Discrete Cosine Transform (SA-DCT) and its inverse, the SA-IDCT. The hardware architectures presented in this work have been designed with energy efficiency in mind. This goal is achieved by employing high-level techniques such as redundant computation elimination, parallelism and low-switching computation structures. Both architectures compare favourably against the relevant prior art in the literature. The SA-DCT/IDCT technologies are instances of a more general computation: both are Constant Matrix Multiplication (CMM) operations. Thus, this thesis also proposes an algorithm for the efficient hardware design of any general CMM-based enabling technology. The proposed algorithm leverages the effective solution-search capability of genetic programming. A bonus feature of the proposed modelling approach is that it is itself amenable to hardware acceleration; another is an early-exit mechanism that achieves large search-space reductions. Results show an improvement on state-of-the-art algorithms, with potential for even greater future savings.
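
    Since the abstract casts SA-DCT/IDCT as Constant Matrix Multiplication, a minimal sketch of the standard shift-add baseline may help: canonical signed-digit (CSD) recoding is the conventional starting point that search methods such as the thesis's genetic programming improve upon. All names here are illustrative, and the sketch handles non-negative constants only.

```python
def csd_digits(c):
    """Canonical signed-digit recoding: express a non-negative constant as a
    minimal sum of +/-2^k terms, the usual starting point for shift-add CMM."""
    digits, k = [], 0
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)          # +1 if c % 4 == 1, -1 if c % 4 == 3
            digits.append((d, k))
            c -= d
        c >>= 1
        k += 1
    return digits                    # e.g. 7 -> [(-1, 0), (+1, 3)], i.e. 8 - 1

def cmm_output(constant_row, x):
    """One output of y = Cx computed with shifts and adds only, no multiplier."""
    acc = 0
    for c, xi in zip(constant_row, x):
        for sign, shift in csd_digits(c):
            acc += sign * (xi << shift)
    return acc

assert cmm_output([7, 12], [3, 2]) == 7 * 3 + 12 * 2
```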

    Energy, area and speed optimized signal processing on FPGA

    Matrix multiplication and the Fast Fourier Transform (FFT) are two computationally intensive DSP functions widely used as kernel operations in applications such as graphics, imaging and wireless communication. Traditionally, the performance metrics for signal processing have been latency and throughput. Energy efficiency has become increasingly important with the proliferation of portable mobile devices, as in software-defined radio. An FPGA-based system is a viable solution for the requirements of adaptability and high computational power, but FPGA resources are limited, so energy, area and latency must be traded off against each other. There are numerous ways to map an algorithm to an FPGA, and optimisation would normally require determining the parameters of each candidate design by low-level simulation, which is vastly time-consuming. A high-level energy model is therefore needed, in which parameters can be determined at the algorithm and architecture level rather than by low-level simulation. In this dissertation, matrix multiplication algorithms are implemented with pipelining and parallel processing to increase throughput and reduce latency, and thereby reduce energy dissipation; this, however, increases area through the larger number of processing elements. Most of the design area is used by the multipliers, and it grows further with increasing input word width, which is difficult for VLSI implementation. A word-width decomposition technique is therefore used with these algorithms to keep the size of the multipliers fixed irrespective of the input data width. The FFT algorithms are implemented with pipelining to increase throughput. To reduce the energy and area of the complex multipliers used for multiplication with twiddle factors, distributed arithmetic is used to provide a multiplier-less architecture, and parallel distributed arithmetic models are used to compensate for its speed. This dissertation also proposes a method for optimising the parameters of these two kernel applications at a high level by constructing a high-level energy model over the specified algorithms and architectures. Results obtained from the model are compared with those obtained from low-level simulation to estimate the error.
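
    The distributed arithmetic idea used above for the twiddle-factor multiplications can be sketched as follows, with illustrative function names: the inner product is computed without multipliers, using one precomputed-table lookup and one shift-add per bit plane of the inputs.

```python
def da_table(coeffs):
    """Distributed arithmetic precomputation: entry m holds the sum of the
    coefficients selected by the bits of m (2^K words for K coefficients)."""
    return [sum(c for k, c in enumerate(coeffs) if (m >> k) & 1)
            for m in range(1 << len(coeffs))]

def da_inner_product(coeffs, xs, bits=8):
    """Multiplier-less sum(c_k * x_k): one table lookup and one shift-add per
    bit plane. Inputs are given as `bits`-wide two's-complement patterns."""
    lut = da_table(coeffs)
    acc = 0
    for b in range(bits):
        addr = 0
        for k, x in enumerate(xs):    # gather bit b of every input word
            addr |= ((x >> b) & 1) << k
        if b == bits - 1:             # the sign bit carries negative weight
            acc -= lut[addr] << b
        else:
            acc += lut[addr] << b
    return acc

# usage: inputs -1 and 3 encoded as 8-bit two's complement
assert da_inner_product([3, 5], [0xFF, 0x03]) == 3 * (-1) + 5 * 3
```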

    Low Power Digital Filter Implementation in FPGA

    Digital filters suitable for hearing aid applications have been developed from a low-power perspective and implemented in an FPGA in this dissertation. Hearing aids are primarily meant for improving hearing and speech comprehension. Digital hearing aids score over their analog counterparts, as they provide flexible gain besides facilitating feedback reduction and noise elimination. Recent advances in DSP and microelectronics have led to the development of superior digital hearing aids. Many researchers have investigated algorithms suitable for hearing aid applications that demand low noise, feedback cancellation, echo cancellation, etc.; however, the toughest challenge is the implementation, with the additional constraints of power and area. The device must consume as little power as possible to support extended battery life and should be as small as possible for increased portability. In this thesis we have attempted to investigate digital filter algorithms that are hardware configurable from a low-power viewpoint. The suitability of a decimation filter for hearing aid application is investigated, and the decimation filter is implemented using a 'Distributed Arithmetic' approach. While designing this filter, it is observed that the comb-half-band FIR-FIR filter design uses less hardware and consumes less power than the comb-FIR-FIR design. This filter is implemented on a Virtex-II Pro board from Xilinx, and the resource estimator from System Generator is used to estimate the resources. However, Distributed Arithmetic is highly serial in nature, its latency is high, and the power consumption of this filter implementation is not very low. We therefore proceeded to an 'Adaptive Hearing Aid' using a Booth-Wallace tree multiplier. This algorithm is also implemented in the FPGA, and the power of the whole system is calculated using the Xilinx XPower analyser. It is observed that the hearing aid with the Booth-Wallace tree multiplier consumes about 25% less power than one using a Booth multiplier, so the Booth-Wallace tree multiplier is comparatively more power efficient. The above two approaches are purely algorithmic. Next, we combine circuit-level VLSI design with the algorithmic approach for further power reduction: a MAC-based FDF-FIR filter (algorithm) that uses a dual-edge-triggered latch (DET) (circuit) is used for the hearing aid device. It is observed that the DET-based MAC FIR filter consumes about 41% less power than the traditional single-edge-triggered (SET) one, and the proposed low-power latch provides a power saving of up to 65% in the FIR filter. This technique consumes less power than previous approaches that apply low-power techniques only at the algorithmic abstraction level. The DET-based MAC FIR filter is tested for real-time validation and is observed to work correctly for various signals (speech, music, voice with music). The gain of the filter is tested and found to be 27 dB (maximum), which matches most hearing aid manufacturers' specifications. Hence it can be concluded that an FDF FIR digital filter in conjunction with the low-power latch is a strong candidate for hearing aid applications.
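
    A minimal behavioural sketch of the MAC-based decimating FIR structure referred to above, with hypothetical names; the thesis's actual savings come from the Booth-Wallace multiplier and the dual-edge-triggered latch at circuit level, below this level of abstraction.

```python
def mac_fir_decimate(coeffs, samples, m=2):
    """FIR filtering fused with keep-one-in-m decimation: a single
    time-multiplexed multiply-accumulate (MAC) unit evaluates
    y[n] = sum_k h[k] * x[n-k], and only every m-th output is computed."""
    out = []
    for n in range(0, len(samples), m):
        acc = 0.0
        for k, h in enumerate(coeffs):       # the loop one MAC unit iterates
            if n - k >= 0:
                acc += h * samples[n - k]    # one multiply-accumulate per tap
        out.append(acc)
    return out
```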

    Comparison of the hardware cost of posits and IEEE-754

    The posit number system is an elegant encoding of floating-point values proposed as a drop-in replacement for the IEEE-754 standard. On one side, posits sacrifice some of IEEE-754's complexity (directed rounding modes, infinities, NaNs). On the other, their variable-size exponent and significand fields require extra encoding and decoding steps, and their higher best-case accuracy requires wider datapaths. The posit encoding/decoding overhead can be reduced by keeping posits decoded in processor registers, with the operators suitably modified to avoid double-rounding issues. An unbiased quantitative comparison of the hardware costs of the two encodings is based on an analytical study and an open-source C++ library suitable for High-Level Synthesis. This library offers parametrised posit and IEEE-754 operators for addition/subtraction, multiplication, and exact accumulation, all developed with the same high design effort and fully compliant with their respective standards. The library improves on the state of the art of posit hardware arithmetic, and even so, the IEEE-754 operators remain between 30% and 60% faster and smaller than their posit counterparts.
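
    To make the encoding overhead discussed above concrete, here is a readable reference decoder for the posit format itself, not the library's implementation; the posit<16,1> parameters and all names are chosen purely for illustration.

```python
def posit_to_float(bits, n=16, es=1):
    """Readable posit decoder: value = (-1)^s * 2^(r*2^es + e) * (1 + f).
    The variable-length regime field is what drives the decoding cost
    discussed above."""
    bits &= (1 << n) - 1
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")                   # NaR, the single special value
    sign = bits >> (n - 1)
    if sign:
        bits = -bits & ((1 << n) - 1)         # two's-complement negate first
    rest, width = bits & ((1 << (n - 1)) - 1), n - 1
    first = (rest >> (width - 1)) & 1         # regime: run of identical bits
    run = 1
    while run < width and ((rest >> (width - 1 - run)) & 1) == first:
        run += 1
    regime = run - 1 if first else -run
    consumed = min(run + 1, width)            # run plus its terminating bit
    remaining = width - consumed
    exp_bits = min(es, remaining)             # missing exponent bits read as 0
    exponent = ((rest >> (remaining - exp_bits)) & ((1 << exp_bits) - 1)) << (es - exp_bits)
    frac_bits = remaining - exp_bits
    fraction = (rest & ((1 << frac_bits) - 1)) / (1 << frac_bits)
    return (-1.0) ** sign * 2.0 ** (regime * (1 << es) + exponent) * (1.0 + fraction)

assert posit_to_float(0x4000) == 1.0          # 0b0100...0 decodes to +1
```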

    Proceedings of the 7th Conference on Real Numbers and Computers (RNC'7)

    These are the proceedings of RNC'7.

    Efficient implementation of Curve25519 for ARM microcontrollers

    Advisor: Diego de Freitas Aranha. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: With the advent of ubiquitous computing, the Internet of Things will see numerous devices connected to each other while exchanging data that is often sensitive by nature. Breaching the secrecy of this data may cause irreparable damage. This raises concerns about the security of the communication and of the devices themselves, which usually lack tamper-resistance mechanisms or physical protection and often have few or no security measures. While developing efficient and secure cryptography as a means of providing information security services is not a new problem, this new environment, with its wide attack surface, imposes new challenges on cryptographic engineering. A safe approach to this problem is to reuse well-known and thoroughly analyzed building blocks, such as the Transport Layer Security (TLS) protocol. In the latest version of this standard, the Elliptic Curve Cryptography options were expanded beyond government-backed parameters, to include the Curve25519 proposal and related cryptographic protocols. This work investigates efficient and secure implementations of Curve25519 to build a key exchange protocol on an ARM Cortex-M4 microcontroller, along with the related signature scheme Ed25519 and a digital signature scheme proposal called qDSA. As a result, performance-critical operations, such as the 256-bit multiplier, were greatly optimized; in particular, a 50% speedup was achieved, improving the performance of higher-level protocols.
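
    For reference, the X25519 key exchange built on Curve25519 can be written compactly over Python big integers following RFC 7748. This sketch is a readability aid rather than the thesis's implementation, whose optimisations live in the constant-time 256-bit field arithmetic beneath these formulas; in particular, the branches below must be replaced by branch-free conditional swaps in real code.

```python
P = 2**255 - 19        # the Curve25519 prime
A24 = 121665           # (486662 - 2) / 4, the curve constant used by the ladder

def x25519(k_bytes, u_bytes):
    """Reference X25519 scalar multiplication (RFC 7748 Montgomery ladder)."""
    k = int.from_bytes(k_bytes, "little")
    k &= ~7                              # clamp: clear the 3 low bits
    k &= (1 << 255) - 1                  # clear bit 255
    k |= 1 << 254                        # set bit 254
    u = int.from_bytes(u_bytes, "little") & ((1 << 255) - 1)

    x1, x2, z2, x3, z3, swap = u, 1, 0, u, 1, 0
    for t in reversed(range(255)):
        kt = (k >> t) & 1
        if swap ^ kt:                    # conditional swap driven by key bits
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = kt
        a, b = (x2 + z2) % P, (x2 - z2) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        da, cb = d * a % P, c * b % P
        x3, z3 = (da + cb) ** 2 % P, x1 * (da - cb) ** 2 % P   # differential add
        x2, z2 = aa * bb % P, e * (aa + A24 * e) % P           # doubling
    if swap:
        x2, z2 = x3, z3
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")

# usage: shared point from a 32-byte secret and the base point u = 9
shared = x25519(bytes(31) + b"\x42", (9).to_bytes(32, "little"))
```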

    Serial-data computation in VLSI
