374 research outputs found

    Serial-data computation in VLSI

    Get PDF

    Turvaliste reaalarvuoperatsioonide efektiivsemaks muutmine

    Get PDF
    Nowadays data and its analysis are ubiquitous and very useful. Owing to this popularity, many combinations of how these two can relate to each other have emerged. We focus on the cases where the owners of the data and those who compute on it do not coincide, either partially or totally. Examples are medical data, whose owners want secrecy but which is useful to analyse collectively, and outsourced computation. The discipline that studies these cases is called secure computation. This field has mostly worked on integer and bit data types, as they are easier to work with, and because the other cases can be reduced to integer and bit manipulations. However, applying these reductions bluntly gives inefficient results. This thesis therefore studies secure computation on real numbers and presents three methods for improving its efficiency. The first method concerns fixed-point and floating-point numbers. Fixed-point numbers are simple in construction but can lack precision and flexibility. Floating-point numbers, on the other hand, are precise and flexible but rather complicated in nature, which in a secure setting translates into expensive operations. The first method combines these two number types for greater efficiency. The second method is based on the fact that, in the concrete paradigm we use, it makes little difference timewise whether we perform one operation or a million in parallel. We therefore perform many instances of a fast operation in parallel in order to evaluate a more complicated one. Thirdly, we introduce a new real number type. We use pairs of integers (a, b) to represent the real number a - φb, where φ = 1.618... is the golden ratio. This number type allows us to perform addition and multiplication relatively quickly and also achieves reasonable granularity.
    https://www.ester.ee/record=b522708
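The pair arithmetic of the golden-ratio representation can be sketched in a few lines. The snippet below is a plaintext illustration only (the thesis operates on secret-shared values); the multiplication rule falls out of the identity φ² = φ + 1, and the helper names are ours, not the thesis's:

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio; satisfies PHI**2 == PHI + 1

def to_float(p):
    # (a, b) encodes the real number a - phi*b
    a, b = p
    return a - PHI * b

def gr_add(p, q):
    # Addition is simply componentwise on the integer pairs.
    return (p[0] + q[0], p[1] + q[1])

def gr_mul(p, q):
    # (a1 - phi*b1)(a2 - phi*b2)
    #   = a1*a2 - phi*(a1*b2 + a2*b1) + phi**2 * b1*b2
    #   = (a1*a2 + b1*b2) - phi*(a1*b2 + a2*b1 - b1*b2)   using phi**2 = phi + 1
    a1, b1 = p
    a2, b2 = q
    return (a1 * a2 + b1 * b2, a1 * b2 + a2 * b1 - b1 * b2)
```

For example, (3, 1) encodes 3 - φ and (2, -1) encodes 2 + φ; their product reduces exactly to the integer pair (5, 0), i.e. the real number 5.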

    Computing the fast Fourier transform on SIMD microprocessors

    Get PDF
    This thesis describes how to compute the fast Fourier transform (FFT) of a power-of-two length signal on single-instruction, multiple-data (SIMD) microprocessors faster than, or very close to, the speed of state-of-the-art libraries such as FFTW (“Fastest Fourier Transform in the West”), SPIRAL and Intel Integrated Performance Primitives (IPP). The conjugate-pair algorithm has advantages in terms of memory bandwidth, and three implementations of this algorithm, which incorporate latency and spatial-locality optimizations, are automatically vectorized at the algorithm level of abstraction. Performance results on 2-way, 4-way and 8-way SIMD machines show that the performance scales much better than FFTW or SPIRAL. The implementations presented in this thesis are compiled into a high-performance FFT library called SFFT (“Streaming Fast Fourier Transform”), and benchmarked against FFTW, SPIRAL, Intel IPP and Apple Accelerate on sixteen x86 machines and two ARM NEON machines, and shown to be, in many cases, faster than these state-of-the-art libraries, but without having to perform extensive machine-specific calibration, thus demonstrating that there are good heuristics for predicting the performance of the FFT on SIMD microprocessors (i.e., the need for empirical optimization may be overstated).
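For reference, the baseline that all of these libraries elaborate on is the radix-2 Cooley-Tukey recursion. The sketch below is that textbook algorithm, not the conjugate-pair variant or the vectorized SFFT code, checked against a naive DFT:

```python
import cmath

def fft(x):
    # Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two.
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    # Twiddle factors applied to the odd half.
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]

def dft(x):
    # Naive O(n^2) DFT, used only to verify the FFT output.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

The recursion does O(n log n) work; the SIMD question the thesis studies is how to lay out these butterflies so that 2-, 4- and 8-wide vector units stay busy.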

    Highly accelerated simulations of glassy dynamics using GPUs: caveats on limited floating-point precision

    Full text link
    Modern graphics processing units (GPUs) provide impressive computing resources, which can be accessed conveniently through the CUDA programming interface. We describe how GPUs can be used to considerably speed up molecular dynamics (MD) simulations for system sizes ranging up to about 1 million particles. Particular emphasis is put on the numerical long-time stability in terms of energy and momentum conservation, and caveats on limited floating-point precision are issued. Strict energy conservation over 10^8 MD steps is obtained by double-single emulation of the floating-point arithmetic in accuracy-critical parts of the algorithm. For the slow dynamics of a supercooled binary Lennard-Jones mixture, we demonstrate that the use of single floating-point precision may result in quantitatively and even physically wrong results. For simulations of a Lennard-Jones fluid, the described implementation shows speedup factors of up to 80 compared to a serial implementation for the CPU, and a single GPU was found to compare with a parallelised MD simulation using 64 distributed cores.
    Comment: 12 pages, 7 figures, to appear in Comp. Phys. Comm., HALMD package licensed under the GPL, see http://research.colberg.org/projects/halm
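The double-single idea, carrying a value as an unevaluated sum of two machine words, rests on error-free transformations such as Knuth's two-sum. The sketch below illustrates the principle in ordinary double precision (the GPU code pairs two single-precision floats; this is not the HALMD implementation):

```python
def two_sum(a, b):
    # Knuth's error-free transformation: returns (s, e) with s = fl(a + b)
    # and a + b == s + e exactly, for any two IEEE doubles.
    s = a + b
    bp = s - a
    e = (a - (s - bp)) + (b - bp)
    return s, e

def compensated_sum(values):
    # Accumulate a running sum plus a running error term, analogous to
    # the double-single emulation used in the accuracy-critical kernels.
    s = 0.0
    err = 0.0
    for v in values:
        s, e = two_sum(s, v)
        err += e
    return s + err
```

A naive sum of [1e16, 1.0, -1e16] returns 0.0 because the 1.0 is rounded away; the compensated version recovers the exact answer 1.0, which is the same mechanism that keeps the long-time energy drift bounded.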

    Acceleration Techniques for Sparse Recovery Based Plane-wave Decomposition of a Sound Field

    Get PDF
    Plane-wave decomposition by sparse recovery is a reliable and accurate technique which can be used for source localization, beamforming, etc. In this work, we introduce techniques to accelerate plane-wave decomposition by sparse recovery. The method consists of two main algorithms: spherical Fourier transformation (SFT) and sparse recovery. Of the two, sparse recovery is the more computationally intensive. We implement the SFT on an FPGA and the sparse recovery on a multithreaded computing platform, so that the multithreaded platform can be fully utilized for sparse recovery. On the other hand, implementing the SFT on an FPGA helps to flexibly integrate the microphones and improves the portability of the microphone array. For implementing the SFT on an FPGA, we develop a scalable FPGA design model that enables the quick design of the SFT architecture on FPGAs. The model considers the number of microphones, the number of SFT channels and the cost of the FPGA, and provides the design of a resource-optimized and cost-effective FPGA architecture as the output. We then investigate the performance of the sparse recovery algorithm executed on various multithreaded computing platforms (i.e., chip-multiprocessor, multiprocessor, GPU, manycore). Finally, we investigate the influence of modifying the dictionary size on the computational performance and the accuracy of the sparse recovery algorithms. We introduce novel sparse-recovery techniques which use non-uniform dictionaries to improve the performance of the sparse recovery on a parallel architecture.
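To illustrate the kind of computation being accelerated, the core of a greedy sparse-recovery step (matching pursuit) over a toy dictionary can be sketched as follows. This is a generic textbook routine with invented names, not the thesis's accelerated implementation, and atoms are assumed unit-norm:

```python
def matching_pursuit(signal, dictionary, iterations=2):
    # Greedy decomposition: at each step, pick the dictionary atom most
    # correlated with the residual and subtract its contribution.
    residual = list(signal)
    coeffs = [0.0] * len(dictionary)
    for _ in range(iterations):
        # Correlation of each atom with the current residual.
        corr = [sum(r * a for r, a in zip(residual, atom))
                for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(corr[i]))
        coeffs[best] += corr[best]
        residual = [r - corr[best] * a
                    for r, a in zip(residual, dictionary[best])]
    return coeffs
```

The inner correlation loop over all atoms is what dominates the cost and what dictionary size directly scales, which is why the thesis studies non-uniform dictionaries and parallel platforms for exactly this step.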

    A high-speed integrated circuit with applications to RSA Cryptography

    Get PDF
    Merged with duplicate record 10026.1/833 on 01.02.2017 by CS (TIS)
    The rapid growth in the use of computers and networks in government, commercial and private communications systems has led to an increasing need for these systems to be secure against unauthorised access and eavesdropping. To this end, modern computer security systems employ public-key ciphers, of which probably the most well known is the RSA ciphersystem, to provide both secrecy and authentication facilities. The basic RSA cryptographic operation is a modular exponentiation where the modulus and exponent are integers typically greater than 500 bits long. Therefore, to obtain reasonable encryption rates using the RSA cipher requires that it be implemented in hardware. This thesis presents the design of a high-performance VLSI device, called the WHiSpER chip, that can perform the modular exponentiations required by the RSA cryptosystem for moduli and exponents up to 506 bits long. The design has an expected throughput in excess of 64kbit/s making it attractive for use both as a general RSA processor within the security function provider of a security system, and for direct use on moderate-speed public communication networks such as ISDN. The thesis investigates the low-level techniques used for implementing high-speed arithmetic hardware in general, and reviews the methods used by designers of existing modular multiplication/exponentiation circuits with respect to circuit speed and efficiency. A new modular multiplication algorithm, MMDDAMMM, based on Montgomery arithmetic, together with an efficient multiplier architecture, are proposed that remove the speed bottleneck of previous designs. Finally, the implementation of the new algorithm and architecture within the WHiSpER chip is detailed, along with a discussion of the application of the chip to ciphering and key generation.
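Montgomery arithmetic removes the division by the modulus from each multiplication, replacing it with shifts and masks, which is what makes it attractive in hardware. The sketch below shows the underlying principle in plain Python (the general REDC-based technique, not the MMDDAMMM algorithm or the WHiSpER datapath); the modulus must be odd, as in RSA:

```python
def montgomery_pow(a, e, n):
    # Modular exponentiation via Montgomery multiplication (REDC); n odd.
    k = n.bit_length()
    R = 1 << k                        # R > n, gcd(R, n) = 1 since n is odd
    n_prime = (-pow(n, -1, R)) % R    # n * n_prime == -1 (mod R)

    def redc(t):
        # Returns t * R^{-1} mod n for 0 <= t < R*n, with no division by n:
        # only a mask (mod R) and a shift (div R) are needed.
        m = (t * n_prime) & (R - 1)
        u = (t + m * n) >> k
        return u - n if u >= n else u

    x = R % n                         # 1 in Montgomery form
    base = (a * R) % n                # a in Montgomery form
    while e > 0:                      # binary square-and-multiply
        if e & 1:
            x = redc(x * base)
        base = redc(base * base)
        e >>= 1
    return redc(x)                    # convert out of Montgomery form
```

Because every `redc` costs only two multiplications, an addition, a mask and a shift, the loop body maps directly onto the kind of pipelined multiplier architecture the thesis proposes.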

    Doctor of Philosophy

    Get PDF
    Stochastic methods, dense free-form mapping, atlas construction, and total variation are examples of advanced image processing techniques which are robust but computationally demanding. These algorithms often require a large amount of computational power as well as massive memory bandwidth. These requirements used to be fulfilled only by supercomputers. The development of heterogeneous parallel subsystems and computation-specialized devices such as Graphics Processing Units (GPUs) has brought the requisite power to commodity hardware, opening up opportunities for scientists to experiment and evaluate the influence of these techniques on their research and practical applications. However, harnessing the processing power from modern hardware is challenging. The differences between multicore parallel processing systems and conventional models are significant, often requiring algorithms and data structures to be redesigned significantly for efficiency. It also demands in-depth knowledge about modern hardware architectures to optimize these implementations, sometimes on a per-architecture basis. The goal of this dissertation is to introduce a solution for this problem based on a 3D image processing framework, using high performance APIs at the core level to utilize the parallel processing power of the GPUs. The design of the framework facilitates an efficient application development process, which does not require scientists to have extensive knowledge about GPU systems, and encourages them to harness this power to solve their computationally challenging problems.
    To present the development of this framework, four main problems are described, and the solutions are discussed and evaluated: (1) essential components of a general 3D image processing library: data structures and algorithms, as well as how to implement these building blocks on the GPU architecture for optimal performance; (2) an implementation of unbiased atlas construction algorithms, an illustration of how to solve a highly complex and computationally expensive algorithm using this framework; (3) an extension of the framework to account for geometry descriptors to solve registration challenges with large scale shape changes and high intensity-contrast differences; and (4) an out-of-core streaming model, which enables developers to implement multi-image processing techniques on commodity hardware.
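The out-of-core streaming idea in (4), visiting a large image one tile at a time so that only a tile ever needs to be resident in memory, can be sketched in a few lines. The `stream_process` helper below is a hypothetical illustration of the pattern, not the framework's API:

```python
def stream_process(image, tile_rows, op):
    # Out-of-core style traversal: the image (a list of rows standing in
    # for data too large for device memory) is processed one tile of
    # rows at a time, so peak residency is bounded by the tile size.
    out = []
    for start in range(0, len(image), tile_rows):
        tile = image[start:start + tile_rows]   # load one tile
        out.extend(op(row) for row in tile)     # process and evict
    return out
```

In a real GPU framework the tile load and eviction would be asynchronous host-device transfers overlapped with kernel execution; the traversal structure is the same.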

    A computer-aided design for digital filter implementation

    Get PDF
