374 research outputs found
Turvaliste reaalarvuoperatsioonide efektiivsemaks muutmine
Nowadays data and its analysis are widespread and highly useful. Because of this popularity, the ways in which data and the computations performed on it can relate to each other have also multiplied. Our work focuses on the cases where the owners of the data and the parties who should analyse it do not coincide, either partially or completely. One example is medical data, which its owners want to keep secret but whose collective analysis is useful. Another example is delegating computation to a party that has more computing power but is not fully trusted. The field that studies such problems is called secure multi-party computation.
This field has focused mainly on the case where the data is in integer or bit form, since such data is easier to analyse and the other cases can be derived from it: adding and multiplying bits suffices to compute anything that is computable at all. This is true in theory, but doing everything directly at the level of bits or integers gives inefficient results. This doctoral thesis therefore studies secure multi-party computation on real numbers and methods for making it more efficient.
First we consider floating-point and fixed-point numbers. Floating-point numbers are very flexible and precise, but have a rather complicated structure. Fixed-point numbers are simple in nature but suffer in precision. The first method of the thesis combines the two in order to exploit the good properties of both.
The second technique rests on the fact that, in the paradigm used, there is little difference in running time between performing one operation and performing a million in parallel. The second method of the thesis therefore performs very many instances of a simple operation in parallel in order to compute a more complicated one.
The third technique represents real numbers with pairs of integers (a,b), which stand for the real number a - φb, where φ = 1.618... is the golden ratio. It turns out that this allows fairly efficient addition and multiplication with reasonable precision.
Nowadays data and its analysis are ubiquitous and very useful. Due to this popularity, different combinations of how these two can relate to each other proliferate. We focus on the cases where the owners of the data and those who compute on them don't coincide either partially or totally. Examples are medical data, where the owners want secrecy but where doing statistics on the data collectively is useful, or outsourcing computation. The discipline that studies these cases is called secure computation.
This field has mostly worked with integer and bit data types, as they are easier to handle and the other cases can be reduced to integer and bit manipulations. However, applying these reductions naively gives inefficient results. Thus this thesis studies secure computation on real numbers and presents three methods for improving efficiency.
The first method concerns fixed-point and floating-point numbers. Fixed-point numbers are simple in construction, but can lack precision and flexibility. Floating-point numbers, on the other hand, are precise and flexible, but are rather complicated in nature, which in the secure setting translates to expensive operations. The first method thus combines those two number types for greater efficiency.
The second method is based on the fact that in the concrete paradigm we use, it makes little difference timewise whether we perform one operation or a million operations in parallel. Thus we attempt to perform many instances of a fast operation in parallel in order to evaluate a more complicated one.
Thirdly, we introduce a new real number type. We use pairs of integers (a,b) to represent the real number a - φb, where φ = 1.618... is the golden ratio. This number type allows us to perform addition and multiplication relatively quickly and also achieves reasonable granularity.
https://www.ester.ee/record=b522708
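The golden-ratio number type described above can be illustrated with a minimal plaintext sketch. It only shows why addition and multiplication stay within integer pairs (using φ² = φ + 1); the class name `GoldenNumber` and its methods are hypothetical, and the secret-sharing machinery that the thesis works in is deliberately left out.

```python
from dataclasses import dataclass

PHI = (1 + 5 ** 0.5) / 2  # golden ratio, 1.618...


@dataclass(frozen=True)
class GoldenNumber:
    """Hypothetical plaintext model: the pair (a, b) stands for a - PHI * b."""
    a: int
    b: int

    def __add__(self, other):
        # (a - phi*b) + (c - phi*d) = (a + c) - phi*(b + d)
        return GoldenNumber(self.a + other.a, self.b + other.b)

    def __mul__(self, other):
        a, b, c, d = self.a, self.b, other.a, other.b
        # (a - phi*b)(c - phi*d) = ac - phi*(ad + bc) + phi^2 * bd
        # and phi^2 = phi + 1, so the result is (ac + bd) - phi*(ad + bc - bd).
        return GoldenNumber(a * c + b * d, a * d + b * c - b * d)

    def to_float(self):
        # Only for inspection; the arithmetic above never leaves the integers.
        return self.a - PHI * self.b


x = GoldenNumber(3, 1)
y = GoldenNumber(-2, 5)
print(x + y, (x + y).to_float())
print(x * y, (x * y).to_float())
```

Both operations need only integer additions and multiplications, which is the property the abstract credits for the speed of this number type.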
Computing the fast Fourier transform on SIMD microprocessors
This thesis describes how to compute the fast Fourier transform (FFT) of a power-of-two length signal on single-instruction, multiple-data (SIMD) microprocessors faster than or very close to the speed of state-of-the-art libraries such as FFTW ("Fastest Fourier Transform in the West"), SPIRAL and Intel Integrated Performance Primitives (IPP).
The conjugate-pair algorithm has advantages in terms of memory bandwidth, and three implementations of this algorithm, which incorporate latency and spatial locality optimizations, are automatically vectorized at the algorithm level of abstraction. Performance results on 2-way, 4-way and 8-way SIMD machines show that the performance scales much better than that of FFTW or SPIRAL.
The implementations presented in this thesis are compiled into a high-performance FFT library called SFFT ("Streaming Fast Fourier Transform"), and benchmarked against FFTW, SPIRAL, Intel IPP and Apple Accelerate on sixteen x86 machines and two ARM NEON machines, and shown to be, in many cases, faster than these state-of-the-art libraries, but without having to perform extensive machine-specific calibration, thus demonstrating that there are good heuristics for predicting the performance of the FFT on SIMD microprocessors (i.e., the need for empirical optimization may be overstated).
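As a reference point for the conjugate-pair algorithm named above, the following is a minimal scalar sketch of the conjugate-pair split-radix recursion for power-of-two lengths. It is not the SFFT implementation and contains none of its vectorization or locality optimizations; the function names are hypothetical, and `naive_dft` is included only as a correctness check.

```python
import cmath


def conjugate_pair_fft(x):
    """Recursive conjugate-pair split-radix FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        return [complex(x[0] + x[1]), complex(x[0] - x[1])]
    q = n // 4
    u = conjugate_pair_fft(x[0::2])                                # x[2m]
    z = conjugate_pair_fft(x[1::4])                                # x[4m + 1]
    zp = conjugate_pair_fft([x[(4 * m - 1) % n] for m in range(q)])  # x[4m - 1], cyclic
    out = [0j] * n
    for k in range(q):
        w = cmath.exp(-2j * cmath.pi * k / n)
        a = w * z[k]
        b = w.conjugate() * zp[k]  # the twiddle pair is w and its conjugate
        out[k] = u[k] + (a + b)
        out[k + 2 * q] = u[k] - (a + b)
        out[k + q] = u[k + q] - 1j * (a - b)
        out[k + 3 * q] = u[k + q] + 1j * (a - b)
    return out


def naive_dft(x):
    """O(n^2) reference transform used only to check the result."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


signal = [float(i % 5) - 1.5 * (i % 3) for i in range(16)]
error = max(abs(p - q) for p, q in zip(conjugate_pair_fft(signal), naive_dft(signal)))
print(error)
```

Because the two odd sub-transforms use a twiddle factor and its complex conjugate, only one twiddle needs to be loaded per butterfly, which is the usual explanation for the memory-bandwidth advantage mentioned in the abstract.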
Highly accelerated simulations of glassy dynamics using GPUs: caveats on limited floating-point precision
Modern graphics processing units (GPUs) provide impressive computing
resources, which can be accessed conveniently through the CUDA programming
interface. We describe how GPUs can be used to considerably speed up molecular
dynamics (MD) simulations for system sizes ranging up to about 1 million
particles. Particular emphasis is put on the numerical long-time stability in
terms of energy and momentum conservation, and caveats on limited
floating-point precision are issued. Strict energy conservation over 10^8 MD
steps is obtained by double-single emulation of the floating-point arithmetic
in accuracy-critical parts of the algorithm. For the slow dynamics of a
supercooled binary Lennard-Jones mixture, we demonstrate that the use of
single-precision floating-point arithmetic may result in quantitatively and even
physically wrong results. For simulations of a Lennard-Jones fluid, the
described implementation shows speedup factors of up to 80 compared to a serial
implementation for the CPU, and a single GPU was found to compare with a
parallelised MD simulation using 64 distributed cores.
Comment: 12 pages, 7 figures, to appear in Comp. Phys. Comm., HALMD package
licensed under the GPL, see http://research.colberg.org/projects/halm
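The double-single emulation mentioned in the abstract above can be sketched on the CPU as follows. This is a generic illustration of the technique (an unevaluated sum of two single-precision values, updated with an error-free addition), not code from the HALMD package; the helper names are hypothetical, and binary32 rounding is emulated with `struct` so that the sketch needs only the standard library.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Error-free addition in binary32: returns (s, err) with s + err == a + b."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err


def ds_add(hi, lo, x):
    """Add a binary32 value x to the double-single number hi + lo."""
    s, e = two_sum(hi, x)
    e = f32(e + lo)
    new_hi = f32(s + e)
    new_lo = f32(e - f32(new_hi - s))
    return new_hi, new_lo


def accumulate(x, steps):
    """Sum x `steps` times; return the absolute errors (plain single, double-single)."""
    x = f32(x)
    plain = 0.0
    hi, lo = 0.0, 0.0
    for _ in range(steps):
        plain = f32(plain + x)
        hi, lo = ds_add(hi, lo, x)
    exact = x * steps  # exact in binary64 for the sizes used here
    return abs(plain - exact), abs((hi + lo) - exact)


plain_err, ds_err = accumulate(0.1, 100_000)
print(plain_err, ds_err)
```

The plain single-precision accumulator drifts visibly over many steps, while the double-single pair stays close to the exact sum, which is the kind of long-time behaviour the abstract warns about for conserved quantities.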
Acceleration Techniques for Sparse Recovery Based Plane-wave Decomposition of a Sound Field
Plane-wave decomposition by sparse recovery is a reliable and accurate technique that can be used for source localization, beamforming, etc. In this work, we introduce techniques to accelerate it. The method consists of two main algorithms: the spherical Fourier transformation (SFT) and sparse recovery. Of the two, sparse recovery is the more computationally intensive. We implement the SFT on an FPGA and the sparse recovery on a multithreaded computing platform, so that the multithreaded platform can be fully utilized for the sparse recovery. In addition, implementing the SFT on an FPGA helps to flexibly integrate the microphones and improves the portability of the microphone array. For implementing the SFT on an FPGA, we develop a scalable FPGA design model that enables the quick design of the SFT architecture on FPGAs. The model considers the number of microphones, the number of SFT channels and the cost of the FPGA, and provides the design of a resource-optimized and cost-effective FPGA architecture as the output. Then we investigate the performance of the sparse recovery algorithm executed on various multithreaded computing platforms (i.e., chip-multiprocessor, multiprocessor, GPU, manycore). Finally, we investigate the influence of modifying the dictionary size on the computational performance and the accuracy of the sparse recovery algorithms. We introduce novel sparse-recovery techniques which use non-uniform dictionaries to improve the performance of the sparse recovery on a parallel architecture.
A high-speed integrated circuit with applications to RSA Cryptography
Merged with duplicate record 10026.1/833 on 01.02.2017 by CS (TIS)
The rapid growth in the use of computers and networks in government, commercial and
private communications systems has led to an increasing need for these systems to be
secure against unauthorised access and eavesdropping. To this end, modern computer
security systems employ public-key ciphers, of which probably the best known is the
RSA cryptosystem, to provide both secrecy and authentication facilities.
The basic RSA cryptographic operation is a modular exponentiation where the modulus
and exponent are integers typically greater than 500 bits long. Therefore, obtaining reasonable
encryption rates with the RSA cipher requires that it be implemented in hardware.
This thesis presents the design of a high-performance VLSI device, called the WHiSpER
chip, that can perform the modular exponentiations required by the RSA cryptosystem
for moduli and exponents up to 506 bits long. The design has an expected throughput
in excess of 64 kbit/s, making it attractive for use both as a general RSA processor within
the security function provider of a security system, and for direct use on moderate-speed
public communication networks such as ISDN.
The thesis investigates the low-level techniques used for implementing high-speed arithmetic
hardware in general, and reviews the methods used by designers of existing modular
multiplication/exponentiation circuits with respect to circuit speed and efficiency.
A new modular multiplication algorithm, MMDDAMMM, based on Montgomery arithmetic,
together with an efficient multiplier architecture, are proposed that remove the
speed bottleneck of previous designs.
Finally, the implementation of the new algorithm and architecture within the WHiSpER
chip is detailed, along with a discussion of the application of the chip to ciphering and key
generation.
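To make the role of Montgomery arithmetic in the abstract above concrete, here is a minimal software sketch of textbook Montgomery reduction and a square-and-multiply modular exponentiation built on it. It is not the MMDDAMMM algorithm or the WHiSpER architecture, whose details the abstract does not give; the function names are hypothetical, and `pow(n, -1, r)` assumes Python 3.8 or later.

```python
def montgomery_setup(n):
    """Precompute constants for an odd modulus n (R = 2**k > n)."""
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    k = n.bit_length()
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # n * n_prime == -1 (mod R)
    return k, r, n_prime


def redc(t, n, k, n_prime):
    """Montgomery reduction: returns t * R^-1 mod n for 0 <= t < R * n."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k  # exact division by R, done with a shift
    return u - n if u >= n else u


def montgomery_pow(base, exponent, n):
    """Left-to-right square-and-multiply using only Montgomery products."""
    k, r, n_prime = montgomery_setup(n)
    r2 = (r * r) % n
    x = redc((base % n) * r2, n, k, n_prime)  # base in Montgomery form
    acc = redc(r2, n, k, n_prime)             # 1 in Montgomery form
    for bit in bin(exponent)[2:]:
        acc = redc(acc * acc, n, k, n_prime)
        if bit == "1":
            acc = redc(acc * x, n, k, n_prime)
    return redc(acc, n, k, n_prime)           # leave Montgomery form


modulus = (1 << 506) - 57  # an arbitrary odd 506-bit modulus, for illustration only
print(montgomery_pow(0x1234567, 65537, modulus) == pow(0x1234567, 65537, modulus))
```

Inside the loop every reduction is a multiplication, an addition and a shift, with no trial division by the modulus, which is why Montgomery-style multipliers are attractive for hardware.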
An alternative architecture for performing basic computer arithmetical operations
The arithmetic portions of almost all modern processor architectures are of very similar design. We use the term "traditional" to describe this design, the primary characteristics of which are native support for integer and floating-point number types and special disjoint instructions and hardware for each supported type. Decades of refinement have endowed this traditional arithmetic architecture with high performance, but also certain inherent limitations.
The highly specific instruction sets and circuitry that provide optimized performance for supported number types also make it difficult to synthesize unsupported number types and manipulate them efficiently. The same difficulty arises when supported number types are used for arbitrary ranges greater than those directly implemented by the processor.
In this thesis we present an alternative to the traditional computer arithmetic architecture, designed to address the limitations of the traditional approach while preserving most of its benefits.
Instead of the specific number representation support provided by the instructions, hardware and native data types in a traditional ALU/FPU pair, we define a single data type, the XLU digit, which forms a base from which other number types may be easily derived, along with a set of instruction primitives from which basic arithmetic operations may be efficiently realized.
Our data type has a signed-digit representation, which allows algorithms for addition, subtraction and multiplication to achieve a high degree of parallelism at the primitive instruction level. The instruction primitives and algorithms are designed to hide or eliminate as much branching as possible, further increasing instruction-level independence.
We provide details of the data type, an overview of the set of instruction primitives, and a discussion of how to use those instruction primitives to perform basic arithmetic algorithms for addition, subtraction and multiplication. We also give examples for three derived number representations: integer, fixed-point and floating-point numbers.
We believe that our approach of building from a unified base provides flexibility and scalability beyond that of the traditional arithmetic architecture.
Our data type, the XLU digit, and the primitive operations to manipulate it may be implemented with modest amounts of circuitry. This, together with the highly parallel nature of the entire design, means that many XLU circuit blocks can be realized in the same silicon area as one traditional ALU/FPU pair. An ALU or FPU can only work when it has data of the correct type to work on, whereas we believe any and all XLUs available to the processor can be kept busy almost all of the time, achieving greater utilization of the available silicon.
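The parallelism that a signed-digit representation gives to addition, as claimed in the abstract above, can be illustrated with a generic carry-free adder. The abstract does not specify the radix or digit set of the XLU digit, so the radix 10 with digits in [-6, 6] used here is purely an assumption for illustration, and the function names are hypothetical.

```python
RADIX = 10      # assumed radix, not taken from the thesis
MAX_DIGIT = 6   # assumed digit set is -6..6


def sd_add(x, y):
    """Add two signed-digit numbers given least-significant digit first.

    x and y must have the same length, with every digit in [-6, 6].
    """
    n = len(x)
    transfer = [0] * (n + 1)
    interim = [0] * n
    # Step 1: every position is handled independently of all the others.
    for i in range(n):
        p = x[i] + y[i]
        if p >= MAX_DIGIT:
            transfer[i + 1] = 1
        elif p <= -MAX_DIGIT:
            transfer[i + 1] = -1
        interim[i] = p - RADIX * transfer[i + 1]  # always within [-5, 5]
    # Step 2: absorb the incoming transfer; it cannot trigger a further carry.
    return [interim[i] + transfer[i] for i in range(n)] + [transfer[n]]


def sd_value(digits):
    """Integer value of a signed-digit number (least-significant digit first)."""
    return sum(d * RADIX ** i for i, d in enumerate(digits))


a = [6, -6, 5, 0]
b = [6, 6, -5, 3]
total = sd_add(a, b)
print(total, sd_value(total), sd_value(a) + sd_value(b))
```

Each output digit depends only on two neighbouring input positions, so the addition takes constant depth regardless of word length, in contrast to a ripple of carries in an ordinary binary adder.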
Doctor of Philosophy
Stochastic methods, dense free-form mapping, atlas construction, and total variation are examples of advanced image processing techniques which are robust but computationally demanding. These algorithms often require a large amount of computational power as well as massive memory bandwidth. These requirements used to be fulfilled only by supercomputers. The development of heterogeneous parallel subsystems and computation-specialized devices such as Graphics Processing Units (GPUs) has brought the requisite power to commodity hardware, opening up opportunities for scientists to experiment and evaluate the influence of these techniques on their research and practical applications. However, harnessing the processing power from modern hardware is challenging. The differences between multicore parallel processing systems and conventional models are significant, often requiring algorithms and data structures to be redesigned significantly for efficiency. It also demands in-depth knowledge about modern hardware architectures to optimize these implementations, sometimes on a per-architecture basis. The goal of this dissertation is to introduce a solution for this problem based on a 3D image processing framework, using high performance APIs at the core level to utilize parallel processing power of the GPUs. The design of the framework facilitates an efficient application development process, which does not require scientists to have extensive knowledge about GPU systems, and encourages them to harness this power to solve their computationally challenging problems.
To present the development of this framework, four main problems are described, and the solutions are discussed and evaluated: (1) essential components of a general 3D image processing library: data structures and algorithms, as well as how to implement these building blocks on the GPU architecture for optimal performance; (2) an implementation of unbiased atlas construction algorithms, an illustration of how to solve a highly complex and computationally expensive algorithm using this framework; (3) an extension of the framework to account for geometry descriptors to solve registration challenges with large scale shape changes and high intensity-contrast differences; and (4) an out-of-core streaming model, which enables developers to implement multi-image processing techniques on commodity hardware.
A computer-aided design for digital filter implementation
Imperial Users onl