374 research outputs found

Serial-data computation in VLSI

Author: Smith Stewart Gresty
Publication venue: The University of Edinburgh
Publication date: 01/01/1987

Turvaliste reaalarvuoperatsioonide efektiivsemaks muutmine (Making secure real-number operations more efficient)

Author: Krips Toomas
Publication date: 07/05/2019

Nowadays data and its analysis are ubiquitous and very useful. Because of this popularity, the ways in which data and computation on it can relate to each other have proliferated. We focus on the cases where the owners of the data and those who compute on it do not coincide, either partially or totally. Examples are medical data, which the owners want to keep secret but on which collective statistics are useful, and outsourcing computation to a more powerful but not fully trusted party. The discipline that studies these cases is called secure computation. This field has mostly worked on integer and bit data types, as they are easier to work with and the other cases can be reduced to integer and bit manipulations. However, applying these reductions bluntly gives inefficient results. This thesis therefore studies secure computation on real numbers and presents three methods for improving its efficiency. The first method concerns fixed-point and floating-point numbers. Fixed-point numbers are simple in construction but can lack precision and flexibility. Floating-point numbers, on the other hand, are precise and flexible but rather complicated in nature, which in the secure setting translates to expensive operations. The first method thus combines the two number types for greater efficiency. The second method is based on the fact that, in the concrete paradigm we use, it makes little difference timewise whether we perform one operation or a million in parallel. We therefore perform many instances of a fast operation in parallel in order to evaluate a more complicated one. Thirdly, we introduce a new real number type. We use pairs of integers (a, b) to represent the real number a − φb, where φ = 1.618... is the golden ratio. This number type allows us to perform addition and multiplication relatively quickly and also achieves reasonable granularity.

https://www.ester.ee/record=b522708
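The third number type in the abstract can be illustrated in the clear (without any secret sharing) as a minimal sketch. It assumes only the representation stated above, a pair (a, b) standing for a − φb, and uses the identity φ² = φ + 1 to derive the product rule. The function names `gs_add`, `gs_mul` and `gs_value` are hypothetical and are not taken from the thesis.

```python
# Plain (non-secure) arithmetic on pairs (a, b) that stand for a - PHI*b.
# Function names are illustrative assumptions, not the thesis's API.

PHI = (1 + 5 ** 0.5) / 2  # golden ratio, satisfies PHI**2 == PHI + 1


def gs_add(x, y):
    """(a - PHI*b) + (c - PHI*d) = (a + c) - PHI*(b + d)."""
    return (x[0] + y[0], x[1] + y[1])


def gs_mul(x, y):
    """(a - PHI*b)(c - PHI*d) = ac - PHI*(ad + bc) + PHI^2 * bd.

    Substituting PHI^2 = PHI + 1 gives (ac + bd) - PHI*(ad + bc - bd),
    so the product is again a pair of integers.
    """
    a, b = x
    c, d = y
    return (a * c + b * d, a * d + b * c - b * d)


def gs_value(x):
    """Real number represented by the integer pair x."""
    return x[0] - PHI * x[1]


x, y = (3, -2), (5, 7)
print(gs_value(gs_mul(x, y)), gs_value(x) * gs_value(y))
```

Both operations use only integer additions and multiplications, which is consistent with the abstract's claim that addition and multiplication are relatively quick in this representation.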

DSpace at Tartu University Library

Computing the fast Fourier transform on SIMD microprocessors

Author: Blake Anthony Martin
Publication venue: 'University of Waikato'
Publication date: 18/06/2012

This thesis describes how to compute the fast Fourier transform (FFT) of a power-of-two length signal on single-instruction, multiple-data (SIMD) microprocessors faster than or very close to the speed of state-of-the-art libraries such as FFTW (“Fastest Fourier Transform in the West”), SPIRAL and Intel Integrated Performance Primitives (IPP). The conjugate-pair algorithm has advantages in terms of memory bandwidth, and three implementations of this algorithm, which incorporate latency and spatial locality optimizations, are automatically vectorized at the algorithm level of abstraction. Performance results on 2-way, 4-way and 8-way SIMD machines show that the performance scales much better than FFTW or SPIRAL. The implementations presented in this thesis are compiled into a high-performance FFT library called SFFT (“Streaming Fast Fourier Transform”), and benchmarked against FFTW, SPIRAL, Intel IPP and Apple Accelerate on sixteen x86 machines and two ARM NEON machines, and shown to be, in many cases, faster than these state-of-the-art libraries, but without having to perform extensive machine-specific calibration, thus demonstrating that there are good heuristics for predicting the performance of the FFT on SIMD microprocessors (i.e., the need for empirical optimization may be overstated).
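As a rough illustration of the conjugate-pair algorithm named in the abstract, the following is a plain recursive, scalar sketch for power-of-two lengths. It is not the vectorized SFFT code and includes none of the latency or locality optimizations. The function name `conjugate_pair_fft` is an assumption made for this example. The memory-bandwidth advantage comes from the two odd sub-transforms sharing one twiddle factor and its complex conjugate.

```python
import cmath


def conjugate_pair_fft(x):
    """Recursive conjugate-pair split-radix FFT (illustrative, scalar).

    len(x) must be a power of two. The signal is split into x[2m],
    x[4m + 1] and x[4m - 1] (indices taken modulo n), so the two odd
    sub-transforms use the twiddle w**k and its conjugate w**(-k).
    """
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    q = n // 4
    u = conjugate_pair_fft(x[0::2])                                  # x[2m]
    z = conjugate_pair_fft(x[1::4])                                  # x[4m + 1]
    zp = conjugate_pair_fft([x[(4 * m - 1) % n] for m in range(q)])  # x[4m - 1]
    out = [0j] * n
    for k in range(q):
        w = cmath.exp(-2j * cmath.pi * k / n)  # one twiddle per butterfly
        a = w * z[k]
        b = w.conjugate() * zp[k]              # conjugate reuses the same twiddle
        s = a + b
        d = 1j * (a - b)
        out[k] = u[k] + s
        out[k + 2 * q] = u[k] - s
        out[k + q] = u[k + q] - d
        out[k + 3 * q] = u[k + q] + d
    return out
```

A quick sanity check is to compare the output with a direct O(n²) evaluation of the discrete Fourier transform on a short signal.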

Research Commons@Waikato

Highly accelerated simulations of glassy dynamics using GPUs: caveats on limited floating-point precision

Author: Peter H. Colberg, Felix Höfling
Publication venue: 'Elsevier BV'
Publication date: 01/01/2011

Modern graphics processing units (GPUs) provide impressive computing resources, which can be accessed conveniently through the CUDA programming interface. We describe how GPUs can be used to considerably speed up molecular dynamics (MD) simulations for system sizes ranging up to about 1 million particles. Particular emphasis is put on the numerical long-time stability in terms of energy and momentum conservation, and caveats on limited floating-point precision are issued. Strict energy conservation over 10^8 MD steps is obtained by double-single emulation of the floating-point arithmetic in accuracy-critical parts of the algorithm. For the slow dynamics of a supercooled binary Lennard-Jones mixture, we demonstrate that the use of single-precision floating-point arithmetic may give quantitatively and even physically wrong results. For simulations of a Lennard-Jones fluid, the described implementation shows speedup factors of up to 80 compared to a serial implementation for the CPU, and a single GPU was found to compare with a parallelised MD simulation using 64 distributed cores.

Comment: 12 pages, 7 figures, to appear in Comp. Phys. Comm., HALMD package licensed under the GPL, see http://research.colberg.org/projects/halm
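The double-single emulation mentioned in the abstract can be sketched on the CPU as follows. The example assumes the common construction in which a value is stored as an unevaluated sum of two single-precision numbers and added with Knuth's error-free two-sum. It is not taken from the HALMD source, and the helper names (`f32`, `two_sum`, `ds_add`, `compare_sums`) are hypothetical. Single precision is simulated by rounding Python floats through `struct`.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Knuth's two-sum: returns (s, err) with s = fl(a + b) and a + b = s + err."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err


def ds_add(x, y):
    """Add two double-single numbers, each a (hi, lo) pair of singles."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)            # renormalise so that |lo| stays small
    lo = f32(e - f32(hi - s))
    return hi, lo


def compare_sums(step, count):
    """Accumulate `step` `count` times in single and in double-single."""
    step = f32(step)
    single = 0.0
    ds = (0.0, 0.0)
    for _ in range(count):
        single = f32(single + step)
        ds = ds_add(ds, (step, 0.0))
    reference = step * count   # evaluated in double precision
    return single, ds[0] + ds[1], reference
```

Calling `compare_sums(0.1, 50000)` shows the plain single-precision accumulator drifting away from the reference while the double-single accumulator stays close to it, which is the kind of long-time accumulation error the abstract warns about.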

arXiv.org e-Print Archive

Institute of Transport Research:Publications

Crossref

Acceleration Techniques for Sparse Recovery Based Plane-wave Decomposition of a Sound Field

Author: Samarawickrama Mahendra
Publication venue: Faculty of Engineering and Information Technologies, School of Electrical and Information Engineering
Publication date: 28/02/2017

Plane-wave decomposition by sparse recovery is a reliable and accurate technique that can be used for source localization, beamforming, etc. In this work, we introduce techniques to accelerate it. The method consists of two main algorithms: the spherical Fourier transformation (SFT) and sparse recovery. Of the two, sparse recovery is the more computationally intensive. We implement the SFT on an FPGA and the sparse recovery on a multithreaded computing platform, so that the multithreaded platform can be fully utilized for the sparse recovery. Implementing the SFT on an FPGA also helps to flexibly integrate the microphones and improves the portability of the microphone array. For implementing the SFT on an FPGA, we develop a scalable FPGA design model that enables the quick design of the SFT architecture on FPGAs. The model considers the number of microphones, the number of SFT channels and the cost of the FPGA, and provides the design of a resource-optimized and cost-effective FPGA architecture as the output. We then investigate the performance of the sparse recovery algorithm executed on various multithreaded computing platforms (i.e., chip-multiprocessor, multiprocessor, GPU, manycore). Finally, we investigate the influence of modifying the dictionary size on the computational performance and the accuracy of the sparse recovery algorithms. We introduce novel sparse-recovery techniques which use non-uniform dictionaries to improve the performance of the sparse recovery on a parallel architecture.

Sydney eScholarship

A high-speed integrated circuit with applications to RSA Cryptography

Author: Onions Paul David
Publication venue: 'University of Plymouth'
Publication date: 01/01/1995

Merged with duplicate record 10026.1/833 on 01.02.2017 by CS (TIS)

The rapid growth in the use of computers and networks in government, commercial and private communications systems has led to an increasing need for these systems to be secure against unauthorised access and eavesdropping. To this end, modern computer security systems employ public-key ciphers, of which probably the best known is the RSA cryptosystem, to provide both secrecy and authentication facilities. The basic RSA cryptographic operation is a modular exponentiation where the modulus and exponent are integers typically greater than 500 bits long. Therefore, obtaining reasonable encryption rates with the RSA cipher requires that it be implemented in hardware. This thesis presents the design of a high-performance VLSI device, called the WHiSpER chip, that can perform the modular exponentiations required by the RSA cryptosystem for moduli and exponents up to 506 bits long. The design has an expected throughput in excess of 64kbit/s, making it attractive for use both as a general RSA processor within the security function provider of a security system, and for direct use on moderate-speed public communication networks such as ISDN. The thesis investigates the low-level techniques used for implementing high-speed arithmetic hardware in general, and reviews the methods used by designers of existing modular multiplication/exponentiation circuits with respect to circuit speed and efficiency. A new modular multiplication algorithm, MMDDAMMM, based on Montgomery arithmetic, together with an efficient multiplier architecture, are proposed that remove the speed bottleneck of previous designs. Finally, the implementation of the new algorithm and architecture within the WHiSpER chip is detailed, along with a discussion of the application of the chip to ciphering and key generation.
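For orientation, the following sketch shows textbook Montgomery multiplication driving a square-and-multiply modular exponentiation, the operation the abstract identifies as the core of RSA. It is a generic software illustration and not the MMDDAMMM algorithm or the WHiSpER architecture, whose details are not given here. The function names are assumptions, and the modular inverse via `pow(n, -1, r)` needs Python 3.8 or later.

```python
def mont_setup(n):
    """Precompute parameters for an odd modulus n, with R = 2**bits > n."""
    if n < 3 or n % 2 == 0:
        raise ValueError("Montgomery arithmetic needs an odd modulus > 1")
    bits = n.bit_length()
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r      # n * n_prime == -1 (mod R)
    return bits, r, n_prime


def mont_mul(a, b, n, bits, n_prime):
    """Return a * b * R**-1 mod n using only shifts, masks and multiplies."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask   # chosen so that t + m*n is divisible by R
    u = (t + m * n) >> bits
    return u - n if u >= n else u


def mont_pow(base, exp, n):
    """Left-to-right square-and-multiply in the Montgomery domain."""
    bits, r, n_prime = mont_setup(n)
    x = (base % n) * r % n              # base in Montgomery form
    acc = r % n                         # 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, bits, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, bits, n_prime)
    return mont_mul(acc, 1, n, bits, n_prime)   # convert back


print(mont_pow(65, 17, 3233) == pow(65, 17, 3233))
```

The attraction for hardware is that the reduction step divides by a power of two (a shift) instead of dividing by the modulus.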

Plymouth Electronic Archive and Research Library

OpenGrey Repository

An alternative architecture for performing basic computer arithmetical operations

Author: Djang Kevin
Publication venue: 'Oregon State University'

The arithmetic portions of almost all modern processor architectures are of very similar design. We use the term "traditional" to describe this design, the primary characteristics of which are native support for integer and floating-point number types and special disjoint instructions and hardware for each supported type. Decades of refinement have endowed this traditional arithmetic architecture with high performance, but also certain inherent limitations. The highly specific instruction sets and circuitry that provide optimized performance for supported number types also make it difficult to synthesize unsupported number types and manipulate them in an efficient manner. This trait also applies when using supported number types for arbitrary ranges greater than those directly implemented by the processor. In this thesis we present an alternative to the traditional computer arithmetic architecture, designed to address the limitations of the traditional approach while preserving most of its benefits. Instead of the specific number representation support provided by the instructions, hardware and native data types in a traditional ALU/FPU pair, we define a single data type, the XLU digit, that forms a base from which other number types may be easily derived, along with a set of instruction primitives from which basic arithmetic operations may be efficiently realized. Our data type has a signed-digit representation, which allows algorithms for addition, subtraction and multiplication to achieve a high degree of parallelism at the primitive instruction level. The instruction primitives and algorithms are designed to hide or eliminate as much branching as possible, further increasing instruction-level independence. We provide details of the data type, an overview of the set of instruction primitives, and a discussion of how to use those instruction primitives to perform basic arithmetic algorithms for addition, subtraction and multiplication. We also give examples for three derived number representations: integer, fixed-point and floating-point numbers. We believe that our approach of building from a unified base provides flexibility and scalability beyond that of the traditional arithmetic architecture. Our data type, the XLU digit, and the primitive operations to manipulate it may be implemented with modest amounts of circuitry, and this, together with the highly parallel nature of the entire design, means that many XLU circuit blocks can be realized in the same silicon area as one traditional ALU/FPU pair. An ALU or FPU may only work when it has the correct type to work on, whereas we believe any and all XLUs available to the processor can be kept busy almost all of the time, achieving greater utilization of the available silicon.
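To show why a signed-digit representation permits highly parallel addition, here is a generic radix-4 sketch with digits in −3..3, in the style of classic signed-digit arithmetic. The radix, digit set and function names are assumptions chosen for illustration; the abstract does not specify the XLU digit format, so this is not the thesis's design.

```python
RADIX = 4   # assumed radix; digits range over -3..3


def sd_add(x, y):
    """Carry-free addition of little-endian signed-digit lists.

    Each result digit depends only on positions i and i - 1, so all
    positions could be computed at the same time in hardware.
    """
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    interim = [0] * n
    transfer = [0] * (n + 1)
    for i in range(n):                 # step 1: independent per position
        p = x[i] + y[i]                # p is in -6..6
        if p >= 3:
            transfer[i + 1], interim[i] = 1, p - RADIX
        elif p <= -3:
            transfer[i + 1], interim[i] = -1, p + RADIX
        else:
            interim[i] = p             # interim digit is always in -2..2
    # step 2: absorbing a transfer of -1, 0 or 1 keeps digits within -3..3
    return [interim[i] + transfer[i] for i in range(n)] + [transfer[n]]


def sd_value(digits):
    """Integer value of a little-endian signed-digit list."""
    return sum(d * RADIX ** i for i, d in enumerate(digits))


a, b = [3, 3, 3], [3, -2, 1]
total = sd_add(a, b)
print(total, sd_value(total) == sd_value(a) + sd_value(b))
```

Because the interim digit is limited to −2..2, adding the incoming transfer can never generate a second transfer, so no carry chain ripples through the word.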

ScholarsArchive@OSU

Doctor of Philosophy

Author: Ha Linh Khanh
Publication venue: University of Utah
Publication date: 15/08/2011

Stochastic methods, dense free-form mapping, atlas construction, and total variation are examples of advanced image processing techniques which are robust but computationally demanding. These algorithms often require a large amount of computational power as well as massive memory bandwidth. These requirements used to be fulfilled only by supercomputers. The development of heterogeneous parallel subsystems and computation-specialized devices such as Graphic Processing Units (GPUs) has brought the requisite power to commodity hardware, opening up opportunities for scientists to experiment and evaluate the influence of these techniques on their research and practical applications. However, harnessing the processing power from modern hardware is challenging. The differences between multicore parallel processing systems and conventional models are significant, often requiring algorithms and data structures to be redesigned significantly for efficiency. It also demands in-depth knowledge about modern hardware architectures to optimize these implementations, sometimes on a per-architecture basis. The goal of this dissertation is to introduce a solution for this problem based on a 3D image processing framework, using high performance APIs at the core level to utilize parallel processing power of the GPUs. The design of the framework facilitates an efficient application development process, which does not require scientists to have extensive knowledge about GPU systems, and encourages them to harness this power to solve their computationally challenging problems.
To present the development of this framework, four main problems are described, and the solutions are discussed and evaluated: (1) essential components of a general 3D image processing library: data structures and algorithms, as well as how to implement these building blocks on the GPU architecture for optimal performance; (2) an implementation of unbiased atlas construction algorithms, an illustration of how to solve a highly complex and computationally expensive algorithm using this framework; (3) an extension of the framework to account for geometry descriptors to solve registration challenges with large scale shape changes and high intensity-contrast differences; and (4) an out-of-core streaming model, which enables developers to implement multi-image processing techniques on commodity hardware.

The University of Utah: J. Willard Marriott Digital Library

A computer-aided design for digital filter implementation

Author: Lai P. K. M. J.
Publication venue: Department of Electrical Engineering, Imperial College London
Publication date: 01/01/1979

Imperial Users onl

Spiral - Imperial College Digital Repository