49 research outputs found

    Fast Fourier transforms on energy-efficient application-specific processors

    Many of the current applications used in battery-powered devices come from the digital signal processing, telecommunication, and multimedia domains. Traditionally, application-specific fixed-function circuits, in the form of application-specific integrated circuits (ASICs), have been used in these designs to reach the required performance and energy-efficiency. The complexity of these applications has grown over the years, and the design complexity has grown even faster, which implies longer design times. At the same time, more and more standards must be supported, so optimised fixed-function implementations for every function in every standard are impractical. The non-recurring engineering costs of integrated circuits have also increased significantly, so manufacturers can afford fewer chip iterations. Although tailoring a circuit to a specific application provides the best performance and/or energy-efficiency, such an approach lacks flexibility. For example, if an error is found after manufacturing, an expensive chip iteration is required. In addition, new functionality cannot be added afterwards to support the evolution of standards. Flexibility can be obtained with software-based implementation technologies. Unfortunately, general-purpose processors do not provide the energy-efficiency of fixed-function circuit designs. A useful trade-off between flexibility and performance is an implementation based on application-specific processors (ASPs), where programmability provides the flexibility and computational resources customised for the given application provide the performance. In this thesis, application-specific processors are considered using the fast Fourier transform as the representative algorithm. The architectural template used here is the transport triggered architecture (TTA), which resembles very long instruction word machines, but operand execution resembles data-flow machines rather than traditional operand triggering. The developed TTA processors exploit the inherent parallelism of the application. In addition, several characteristics of the application have been identified and exploited by developing customised functional units that speed up the execution. Several customisations are proposed for the data path of the processor, but it is also important to match the memory bandwidth to the computation speed, which calls for a memory organisation supporting parallel memory accesses. The proposed optimisations have been used to improve the energy-efficiency of the processor, and experiments show that a programmable solution can have energy-efficiency comparable to fixed-function ASIC designs.
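
    For orientation, the kernel this thesis accelerates can be written down in a few lines of sequential C. The sketch below is a textbook iterative radix-2 decimation-in-time FFT, not the fixed-point TTA implementation developed in the work; it is included only to show the butterfly structure and the fact that all butterflies within a stage are independent, which is exactly the parallelism that customised functional units and parallel memory banks can exploit.

        #include <complex.h>
        #include <math.h>
        #include <stddef.h>

        /* Minimal iterative radix-2 DIT FFT; n must be a power of two.
         * All butterflies within one stage are independent of each other. */
        static void fft_radix2(double complex *x, size_t n)
        {
            /* bit-reversal permutation */
            for (size_t i = 1, j = 0; i < n; i++) {
                size_t bit = n >> 1;
                for (; j & bit; bit >>= 1)
                    j ^= bit;
                j ^= bit;
                if (i < j) {
                    double complex t = x[i];
                    x[i] = x[j];
                    x[j] = t;
                }
            }
            /* butterfly stages */
            const double pi = acos(-1.0);
            for (size_t len = 2; len <= n; len <<= 1) {
                double complex wlen = cexp(-2.0 * pi * I / (double)len);
                for (size_t i = 0; i < n; i += len) {
                    double complex w = 1.0;
                    for (size_t k = 0; k < len / 2; k++) {
                        double complex u = x[i + k];
                        double complex v = w * x[i + k + len / 2];
                        x[i + k]           = u + v;  /* butterfly upper output */
                        x[i + k + len / 2] = u - v;  /* butterfly lower output */
                        w *= wlen;
                    }
                }
            }
        }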

    Doctor of Philosophy

    Stochastic methods, dense free-form mapping, atlas construction, and total variation are examples of advanced image processing techniques which are robust but computationally demanding. These algorithms often require a large amount of computational power as well as massive memory bandwidth. These requirements used to be fulfilled only by supercomputers. The development of heterogeneous parallel subsystems and computation-specialized devices such as Graphics Processing Units (GPUs) has brought the requisite power to commodity hardware, opening up opportunities for scientists to experiment and evaluate the influence of these techniques on their research and practical applications. However, harnessing the processing power from modern hardware is challenging. The differences between multicore parallel processing systems and conventional models are significant, often requiring algorithms and data structures to be redesigned significantly for efficiency. It also demands in-depth knowledge about modern hardware architectures to optimize these implementations, sometimes on a per-architecture basis. The goal of this dissertation is to introduce a solution for this problem based on a 3D image processing framework, using high performance APIs at the core level to utilize the parallel processing power of GPUs. The design of the framework facilitates an efficient application development process, which does not require scientists to have extensive knowledge about GPU systems, and encourages them to harness this power to solve their computationally challenging problems. To present the development of this framework, four main problems are described, and the solutions are discussed and evaluated: (1) essential components of a general 3D image processing library: data structures and algorithms, as well as how to implement these building blocks on the GPU architecture for optimal performance; (2) an implementation of unbiased atlas construction algorithms, an illustration of how to solve a highly complex and computationally expensive algorithm using this framework; (3) an extension of the framework to account for geometry descriptors to solve registration challenges with large scale shape changes and high intensity-contrast differences; and (4) an out-of-core streaming model, which enables developers to implement multi-image processing techniques on commodity hardware.
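
    The out-of-core streaming model in item (4) can be pictured independently of any GPU API: a volume larger than device memory is processed in slabs, each staged in with a halo of neighbouring slices, filtered, and written back. The C sketch below is a minimal rendering of that pattern under assumed placeholder names (process_volume, load_slices, store_slices); it is not the dissertation's actual interface.

        #include <stdlib.h>

        /* Illustrative out-of-core pattern: a volume of nz slices is streamed
         * through memory in slabs of at most 'slab' slices, each extended by a
         * halo of neighbouring slices so that a z-dependent filter sees valid
         * data at the slab borders. load_slices/store_slices stand in for disk
         * I/O or host<->device transfers; all names here are placeholders.     */
        typedef void (*slab_kernel)(float *buf, size_t nx, size_t ny,
                                    size_t first, size_t count, size_t loaded);

        void process_volume(size_t nx, size_t ny, size_t nz,
                            size_t slab, size_t halo, slab_kernel kernel,
                            void (*load_slices)(float *dst, size_t z_lo, size_t z_hi),
                            void (*store_slices)(const float *src, size_t z_lo, size_t z_hi))
        {
            float *buf = malloc(nx * ny * (slab + 2 * halo) * sizeof *buf);
            if (!buf)
                return;
            for (size_t z0 = 0; z0 < nz; z0 += slab) {
                size_t z1 = (z0 + slab < nz) ? z0 + slab : nz;   /* slab [z0, z1) */
                size_t lo = (z0 > halo) ? z0 - halo : 0;         /* staged range  */
                size_t hi = (z1 + halo < nz) ? z1 + halo : nz;
                load_slices(buf, lo, hi);                        /* stage in      */
                kernel(buf, nx, ny, z0 - lo, z1 - z0, hi - lo);  /* filter slab   */
                store_slices(buf + nx * ny * (z0 - lo), z0, z1); /* write back    */
            }
            free(buf);
        }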

    Reconfigurable architecture for very large scale microelectronic systems


    Multi-core architectures with coarse-grained dynamically reconfigurable processors for broadband wireless access technologies

    Broadband Wireless Access technologies have significant market potential, especially the WiMAX protocol which can deliver data rates of tens of Mbps. Strong demand for high performance WiMAX solutions is forcing designers to seek help from multi-core processors that offer competitive advantages in terms of all performance metrics, such as speed, power and area. Through the provision of a degree of flexibility similar to that of a DSP and performance and power consumption advantages approaching that of an ASIC, coarse-grained dynamically reconfigurable processors are proving to be strong candidates for processing cores used in future high performance multi-core processor systems. This thesis investigates multi-core architectures with a newly emerging dynamically reconfigurable processor, RICA, targeting WiMAX physical layer applications. A novel master-slave multi-core architecture is proposed, using RICA processing cores. A SystemC based simulator, called MRPSIM, is devised to model this multi-core architecture. This simulator provides fast simulation speed and timing accuracy, offers flexible architectural options to configure the multi-core architecture, and enables the analysis and investigation of multi-core architectures. Meanwhile a profiling-driven mapping methodology is developed to partition the WiMAX application into multiple tasks as well as schedule and map these tasks onto the multi-core architecture, aiming to reduce the overall system execution time. Both the MRPSIM simulator and the mapping methodology are seamlessly integrated with the existing RICA tool flow. Based on the proposed master-slave multi-core architecture, a series of diverse homogeneous and heterogeneous multi-core solutions are designed for different fixed WiMAX physical layer profiles. Implemented in ANSI C and executed on the MRPSIM simulator, these multi-core solutions contain different numbers of cores, combine various memory architectures and task partitioning schemes, and deliver high throughputs at relatively low area costs. Meanwhile a design space exploration methodology is developed to search the design space for multi-core systems to find suitable solutions under certain system constraints. Finally, laying a foundation for future multithreading exploration on the proposed multi-core architecture, this thesis investigates the porting of a real-time operating system, Micro C/OS-II, to a single RICA processor. A multitasking version of WiMAX is implemented on a single RICA processor with the operating system support.
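
    The profiling-driven mapping step can be illustrated with a deliberately simple heuristic: tasks weighted by profiled cycle counts are assigned, one by one, to the least-loaded slave core so that the overall makespan stays low. The C sketch below shows only this greedy idea; the task costs are invented and the real methodology integrated with the RICA tool flow and MRPSIM is considerably more involved.

        #include <stdio.h>

        /* Greedy longest-processing-time mapping: visit tasks in descending
         * order of profiled cost and place each on the least-loaded core.   */
        #define NTASKS 8
        #define NCORES 3

        int main(void)
        {
            /* profiled task costs in cycles, already sorted in descending order */
            long cost[NTASKS] = { 900, 750, 600, 450, 400, 300, 200, 100 };
            long load[NCORES] = { 0 };
            int  map[NTASKS];

            for (int t = 0; t < NTASKS; t++) {
                int best = 0;
                for (int c = 1; c < NCORES; c++)
                    if (load[c] < load[best])
                        best = c;
                map[t] = best;
                load[best] += cost[t];
            }
            for (int t = 0; t < NTASKS; t++)
                printf("task %d -> core %d\n", t, map[t]);
            for (int c = 0; c < NCORES; c++)
                printf("core %d load = %ld cycles\n", c, load[c]);
            return 0;
        }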

    Preventing row hammering and improving main memory performance with time window counters

    Thesis (Ph.D.) -- Seoul National University, Graduate School of Convergence Science and Technology, August 2020. Advisor: Jung Ho Ahn. Computer systems using DRAM are exposed to row-hammer (RH) attacks, which can flip data in a DRAM row without directly accessing it, simply by frequently activating its adjacent rows. There have been a number of proposals to prevent RH, including both probabilistic and deterministic solutions. However, the probabilistic solutions provide protection with no capability to detect attacks and have a non-zero probability of missing protection. Counter-based deterministic solutions, on the other hand, either incur large area overhead or suffer from noticeable performance drops on adversarial memory access patterns. To overcome these challenges, we propose a new counter-based RH prevention solution named Time Window Counter (TWiCe) based row refresh, which accurately detects potential RH attacks using only a small number of counters with minimal performance impact. We first make the key observation that the number of rows that can cause RH is limited by the maximum row activation frequency and the DRAM cell retention time. We calculate the maximum number of required counter entries per DRAM bank, with which TWiCe prevents RH with a strong deterministic guarantee. TWiCe incurs no performance overhead on normal DRAM operations and less than 0.7% area and energy overheads over contemporary DRAM devices. Our evaluation shows that TWiCe adds no more than 0.006% of additional DRAM row activations for adversarial memory access patterns, including RH attack scenarios. To reduce the area and energy overhead further, we propose a threshold-adjusted rank-level TWiCe. We first introduce pseudo-associative TWiCe (pa-TWiCe), which can search hundreds of TWiCe table entries energy-efficiently. In addition, by exploiting the pa-TWiCe structure, we propose rank-level TWiCe, which reduces the number of required entries further by managing the table entries at rank level. We also adjust the thresholds of TWiCe to reduce the number of entries without increasing false-positive detections on general workloads. Finally, we propose extending TWiCe into a hot-page detector to improve main-memory performance. The TWiCe table contains the row addresses that have been activated frequently in the recent past, and these rows are likely to be activated again due to temporal locality in memory accesses. We show how the hot-page detection in TWiCe can be combined with a DRAM page-swap methodology to reduce the DRAM latency for the hot pages. Our evaluation shows that low-latency DRAM using TWiCe achieves up to 12.2% IPC improvement over a baseline DDR4 device for multi-threaded workloads.
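
    The counter mechanism can be sketched in a few lines of C: each bank keeps a small table of (row, count) entries, an activation either bumps an existing entry or allocates a new one, and once a count crosses the row-hammer threshold the row's neighbours are refreshed and the count restarts. The table size, threshold value, and handling of a full table below are illustrative simplifications; in particular, the pruning of entries whose activation rate stays low at each time-window boundary, which is what bounds TWiCe's table size, is only hinted at in a comment.

        #include <stdint.h>

        #define TABLE_ENTRIES 16          /* per-bank counter table (illustrative) */
        #define RH_THRESHOLD  2048        /* activations before adjacent refresh   */

        struct entry { uint32_t row; uint32_t count; int valid; };
        static struct entry table[TABLE_ENTRIES];

        /* Called on every ACT to 'row'; returns 1 if an adjacent-row refresh
         * (ARR) should be issued for the neighbours of 'row'.               */
        int on_activate(uint32_t row)
        {
            int free_slot = -1;
            for (int i = 0; i < TABLE_ENTRIES; i++) {
                if (table[i].valid && table[i].row == row) {
                    if (++table[i].count >= RH_THRESHOLD) {
                        table[i].count = 0;       /* neighbours refreshed, restart */
                        return 1;
                    }
                    return 0;
                }
                if (!table[i].valid && free_slot < 0)
                    free_slot = i;
            }
            if (free_slot >= 0)                   /* start tracking a new hot row */
                table[free_slot] = (struct entry){ .row = row, .count = 1, .valid = 1 };
            /* A real scheme also prunes entries whose activation rate stays below
             * the threshold at each time-window boundary, bounding the table size. */
            return 0;
        }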

    High-Performance VLSI Architectures for Lattice-Based Cryptography

    Lattice-based cryptography is a cryptographic primitive built upon hard problems on point lattices. Cryptosystems relying on lattice-based cryptography have attracted huge attention in the last decade since they offer post-quantum security and remarkable algorithmic constructions. In particular, homomorphic encryption (HE) and post-quantum cryptography (PQC) are the two main applications of lattice-based cryptography. Meanwhile, efficient hardware implementations of these advanced cryptographic schemes are in demand for achieving high-performance implementations. This dissertation investigates novel, high-performance very large-scale integration (VLSI) architectures for lattice-based cryptography, including HE and PQC schemes. It first presents different architectures for number-theoretic transform (NTT)-based polynomial multiplication, one of the crucial parts of the fundamental arithmetic for lattice-based HE and PQC schemes. Then a high-speed modular integer multiplier is proposed, particularly for lattice-based cryptography. In addition, a novel modular polynomial multiplier is presented that exploits the fast finite impulse response (FIR) filter architecture to reduce the computational complexity of schoolbook modular polynomial multiplication for lattice-based PQC schemes. Finally, an NTT and Chinese remainder theorem (CRT)-based high-speed modular polynomial multiplier is presented for HE schemes whose moduli are large integers.
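
    For context on the arithmetic these architectures target, the sketch below shows plain schoolbook multiplication in the negacyclic ring Z_q[x]/(x^n + 1) used by many lattice-based schemes; an NTT-based multiplier computes the same product with O(n log n) butterfly operations instead of the O(n^2) loop shown here. The function name and the assumption that coefficients already lie in [0, q) with q small enough for 64-bit intermediates are illustrative choices, not details of the proposed hardware.

        #include <stdint.h>

        /* Schoolbook multiplication of a, b in Z_q[x]/(x^n + 1) (negacyclic ring).
         * Coefficients are assumed to lie in [0, q) and q*q must fit in int64_t.
         * An NTT-based multiplier computes the same result in O(n log n) work.   */
        void polymul_negacyclic(const int64_t *a, const int64_t *b, int64_t *c,
                                int n, int64_t q)
        {
            for (int k = 0; k < n; k++)
                c[k] = 0;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    int64_t prod = a[i] * b[j] % q;
                    int k = i + j;
                    if (k < n)
                        c[k] = (c[k] + prod) % q;
                    else                      /* x^n = -1 wraps with a sign flip */
                        c[k - n] = ((c[k - n] - prod) % q + q) % q;
                }
            }
        }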

    Harnessing the power of GPUs for problems in real algebraic geometry

    This thesis presents novel parallel algorithms that leverage the power of GPUs (Graphics Processing Units) for exact computations with polynomials having large integer coefficients. The significance of such computations, especially in real algebraic geometry, is hard to overstate. On massively-parallel architectures such as GPUs, the degree of data-level parallelism exposed by an algorithm is the main performance factor. We attain high efficiency through the use of structured matrix theory to assist the realization of the relevant operations on polynomials on the graphics hardware. A detailed complexity analysis, assuming the PRAM model, also confirms that our approach achieves a substantially better parallel complexity than the classical algorithms used for symbolic computations. Aside from the theoretical considerations, a large portion of this work is dedicated to the actual algorithm development and optimization techniques, where we pay close attention to the specifics of the graphics hardware. As a byproduct of this work, we have developed high-throughput modular arithmetic which we expect to be useful for other GPU applications, in particular public-key cryptography. We further discuss algorithms for the solution of a system of polynomial equations, topology computation of algebraic curves, and curve visualization, all of which can profit to the full extent from GPU acceleration. Extensive benchmarking on real data demonstrates the superiority of our algorithms over several state-of-the-art approaches available to date.
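
    A basic building block of such high-throughput modular arithmetic is single-word modular multiplication. The portable C sketch below uses a 128-bit intermediate (the unsigned __int128 extension of GCC/Clang is an assumption of this sketch); a GPU kernel would typically replace the division by Montgomery or Barrett reduction built from multiply-high instructions.

        #include <stdint.h>

        /* (a * b) mod m for 64-bit operands, via a 128-bit intermediate. */
        static inline uint64_t mulmod_u64(uint64_t a, uint64_t b, uint64_t m)
        {
            return (uint64_t)(((unsigned __int128)a * b) % m);
        }

        /* Modular exponentiation built on top of mulmod_u64, a typical
         * primitive for both symbolic computation and public-key workloads. */
        static uint64_t powmod_u64(uint64_t base, uint64_t exp, uint64_t m)
        {
            uint64_t r = 1 % m;
            base %= m;
            while (exp) {
                if (exp & 1)
                    r = mulmod_u64(r, base, m);
                base = mulmod_u64(base, base, m);
                exp >>= 1;
            }
            return r;
        }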

    Toatie: functional hardware description with dependent types

    Describing correct circuits remains a tall order, despite four decades of evolution in Hardware Description Languages (HDLs). Many enticing circuit architectures require recursive structures or complex compile-time computation, two patterns that prove difficult to capture in traditional HDLs. In a signal processing context, the Fast FIR Algorithm (FFA) structure for efficient parallel filtering proves to be naturally recursive, and most Multiple Constant Multiplication (MCM) blocks decompose multiplications into graphs of simple shifts and adds using demanding compile-time computation. Generalised versions of both remain mostly in academic folklore. The implementations which do exist are often ad hoc circuit generators, written in software languages. These pose challenges for verification and are resistant to composition. Embedded functional HDLs, that represent circuits as data, allow for these descriptions at the cost of forcing the designer to work at the gate level. A promising alternative is to use a stand-alone compiler, representing circuits as plain functions, exemplified by the CλaSH HDL. This, however, raises new challenges in capturing a circuit's staging: which expressions in the single language should be reduced during compile-time elaboration, and which should remain in the circuit's run-time? To better reflect the physical separation between circuit phases, this work proposes a new functional HDL (representing circuits as functions) with first-class staging constructs. Orthogonal to this, there are also long-standing challenges in the verification of parameterised circuit families. Industry surveys have consistently reported that only a slim minority of FPGA projects reach production without non-trivial bugs. While a healthy growth in the adoption of automatic formal methods is also reported, the majority of testing remains dynamic, presenting difficulties for testing entire circuit families at once. This research offers an alternative verification methodology via the combination of dependent types and automatic synthesis of user-defined data types. Given precise enough types for synthesisable data, this environment can be used to develop circuit families with full functional verification in a correct-by-construction fashion. This approach allows for verification of entire circuit families (not just one concrete member) and side-steps the state-space explosion of model checking methods. Beyond the existing work, this research offers synthesis of combinatorial circuits, not just a software model of their behaviour. This additional step requires careful consideration of staging, erasure & irrelevance, deriving bit representations of user-defined data types, and a new synthesis scheme. This thesis contributes steps towards HDLs with sufficient expressivity for awkward, combinatorial signal processing structures, allowing for a correct-by-construction approach, and a prototype compiler for netlist synthesis.
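
    The Multiple Constant Multiplication idea mentioned above, replacing constant multipliers with graphs of shifts and adds, fits in a few lines: in the illustrative example below the two constants 45 and 63 share the intermediate term 9x, so both products are formed without any general multiplier. The constants are arbitrary and this is just one of many valid adder graphs for them.

        #include <stdint.h>
        #include <stdio.h>

        /* Multiple Constant Multiplication by shifts and adds only:
         * t = 9*x is shared between the two constant products.       */
        static void mcm_45_63(int32_t x, int32_t *y45, int32_t *y63)
        {
            int32_t t = (x << 3) + x;      /* t   = 9x          */
            *y45 = (t << 2) + t;           /* 45x = 36x + 9x    */
            *y63 = (t << 3) - t;           /* 63x = 72x - 9x    */
        }

        int main(void)
        {
            int32_t a, b;
            mcm_45_63(7, &a, &b);
            printf("%d %d\n", a, b);       /* prints 315 441 */
            return 0;
        }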

    Energy efficient hardware acceleration of multimedia processing tools

    The world of mobile devices is experiencing an ongoing trend of feature enhancement and general-purpose multimedia platform convergence. This trend poses many grand challenges, the most pressing being their limited battery life as a consequence of delivering computationally demanding features. The envisaged mobile application features can be considered to be accelerated by a set of underpinning hardware blocks. Based on the survey that this thesis presents on modern video compression standards and their associated enabling technologies, it is concluded that tight energy and throughput constraints can still be effectively tackled at the algorithmic level in order to design re-usable optimised hardware acceleration cores. To prove these conclusions, the work in this thesis is focused on two of the basic enabling technologies that support mobile video applications, namely the Shape Adaptive Discrete Cosine Transform (SA-DCT) and its inverse, the SA-IDCT. The hardware architectures presented in this work have been designed with energy efficiency in mind. This goal is achieved by employing high-level techniques such as redundant computation elimination, parallelism, and low-switching computation structures. Both architectures compare favourably against the relevant prior art in the literature. The SA-DCT/IDCT technologies are instances of a more general computation: both are Constant Matrix Multiplication (CMM) operations. Thus, this thesis also proposes an algorithm for the efficient hardware design of any general CMM-based enabling technology. The proposed algorithm leverages the effective solution-search capability of genetic programming. A bonus feature of the proposed modelling approach is that it is further amenable to hardware acceleration. Another bonus feature is an early-exit mechanism that achieves large search-space reductions. Results show an improvement on state-of-the-art algorithms, with future potential for even greater savings.
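
    The CMM view treats a transform such as the SA-DCT as y = C·x with a fixed coefficient matrix C. The sketch below shows the direct fixed-point form of that computation; it is this inner constant-multiply loop that a shift-and-add CMM network, such as one found by the genetic-programming search proposed here, would replace. The 4x4 size and Q12 scaling are placeholders, not the SA-DCT coefficients.

        #include <stdint.h>

        #define N 4
        #define QBITS 12   /* fixed-point scale of the constant matrix (illustrative) */

        /* Direct constant-matrix multiplication y = C*x in Q12 fixed point.
         * Hardware CMM/MCM optimisation replaces each constant multiply below
         * with a shared network of shifts and adds.                           */
        void cmm_apply(const int32_t C[N][N], const int32_t x[N], int32_t y[N])
        {
            for (int i = 0; i < N; i++) {
                int64_t acc = 0;
                for (int j = 0; j < N; j++)
                    acc += (int64_t)C[i][j] * x[j];
                y[i] = (int32_t)(acc >> QBITS);   /* rescale back from Q12 */
            }
        }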