
    Towards Complete Emulation of Quantum Algorithms using High-Performance Reconfigurable Computing

    Quantum computing is a promising technology that can potentially demonstrate supremacy over classical computing in solving specific classically intractable problems. However, in its current nascent stage, quantum computing faces major challenges. Two of the main challenges are quantum state decoherence and the low scalability of current quantum devices. Decoherence is a process in which the state of the quantum computer is destroyed by interaction with the environment. Decoherence places constraints on the realistic applicability of quantum algorithms, as real-life applications usually require complex equivalent quantum circuits to be realized. For example, encoding classical data on quantum computers for solving I/O- and data-intensive applications generally requires complex quantum circuits that violate decoherence constraints. In addition, current quantum devices are of intermediate scale, having low quantum bit (qubit) counts and often producing inaccurate or noisy measurements. Consequently, benchmarking of existing quantum algorithms and the investigation of new applications are heavily dependent on classical simulations that use costly, resource-intensive computing platforms. Hardware-based emulation has been proposed as a more cost-effective and power-efficient alternative. Hardware-based emulation methods can take advantage of hardware parallelism and acceleration to produce results at higher throughput and with lower power requirements.
This work proposes a hardware-based emulation methodology for quantum algorithms, using cost-effective Field Programmable Gate Array (FPGA) technology. The proposed methodology consists of three components that are required for complete emulation of quantum algorithms: the first component models classical-to-quantum (C2Q) data encoding, the second emulates the behavior of quantum algorithms, and the third models the process of measuring the quantum state and extracting classical information, i.e., quantum-to-classical (Q2C) data decoding. The proposed emulation methodology is used to investigate and optimize methods for C2Q/Q2C data encoding/decoding, as well as several important quantum algorithms such as the Quantum Fourier Transform (QFT), the Quantum Haar Transform (QHT), and Quantum Grover’s Search (QGS). This work delivers contributions in terms of reducing the complexity of quantum circuits, extending and optimizing quantum algorithms, and developing new quantum applications. For example, decoherence-optimized circuits for C2Q/Q2C data encoding/decoding are proposed and evaluated using the proposed emulation methodology. Multi-level decomposable forms of optimized QHT circuits are presented and used to demonstrate dimension reduction of high-resolution data. Additionally, a novel extension to the QGS algorithm is proposed to enable search for dynamically changing multi-patterns in unordered data. Finally, a novel quantum application is presented that combines QHT and dynamic multi-pattern QGS to perform pattern recognition using dimension reduction on high-resolution spatio-spectral data. For higher emulation performance and scalability of the framework, hardware design techniques and architectural optimizations are investigated and proposed. The emulation architectures are designed and implemented on a high-performance reconfigurable computer (HPRC). For reference and comparison, the proposed quantum circuits are also implemented on a state-of-the-art quantum computer.
Experimental results show that the proposed hardware architectures enable emulation of quantum algorithms with higher scalability, higher accuracy, and higher throughput compared to existing hardware-based emulators. As a case study, quantum image processing using multi-spectral images is considered for the experimental evaluations. The analysis and results of this work demonstrate that quantum computers and methodologies based on quantum algorithms will be highly useful in realistic data-intensive domains such as remote-sensing hyperspectral imagery and high-energy physics (HEP).
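As a rough software analogue of the three-stage flow described above (and not of the FPGA architecture itself), the sketch below emulates amplitude-based C2Q encoding, a dense-matrix QFT, and sampling-based Q2C decoding on a classical machine; the function names and the 3-qubit example are illustrative assumptions only, and qubit-ordering conventions are ignored.

```python
# Minimal classical emulation of the C2Q -> algorithm -> Q2C pipeline (illustrative only).
import numpy as np

def c2q_amplitude_encode(data):
    """C2Q: map a classical vector onto a normalized quantum state vector."""
    state = np.asarray(data, dtype=complex)
    return state / np.linalg.norm(state)

def qft_matrix(n_qubits):
    """Dense QFT operator on n_qubits (a 2^n x 2^n unitary)."""
    N = 2 ** n_qubits
    omega = np.exp(2j * np.pi / N)
    j, k = np.meshgrid(np.arange(N), np.arange(N))
    return omega ** (j * k) / np.sqrt(N)

def q2c_measure(state, shots=10000, seed=0):
    """Q2C: sample measurement outcomes from the state's probability distribution."""
    rng = np.random.default_rng(seed)
    probs = np.abs(state) ** 2
    return rng.choice(len(state), size=shots, p=probs)

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # 3-qubit example input
    state = c2q_amplitude_encode(data)                 # C2Q encoding
    state = qft_matrix(3) @ state                      # quantum algorithm (QFT)
    outcomes = q2c_measure(state)                      # Q2C decoding via measurement
    print(np.bincount(outcomes, minlength=8) / len(outcomes))
```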

    CycloneNTT: An NTT/FFT Architecture Using Quasi-Streaming of Large Datasets on DDR- and HBM-based FPGA Platforms

    The Number-Theoretic Transform (NTT) is a variation of the Fast Fourier Transform (FFT) over finite fields. NTT is being increasingly used in blockchain and zero-knowledge proof applications. Although FFT and NTT are widely studied for FPGA implementation, we believe CycloneNTT is the first to solve this problem for large data sets ($\ge 2^{24}$, 64-bit numbers) that would not fit in on-chip RAM. CycloneNTT uses a state-of-the-art butterfly network and maps the dataflow to hybrid FIFOs composed of on-chip SRAM and external memory. This results in a quasi-streaming data access pattern that minimizes external memory access latency and maximizes throughput. We implement two variants of CycloneNTT optimized for DDR and HBM external memories. Although historically this problem has been shown to be memory-bound, CycloneNTT's quasi-streaming access pattern is optimized to the point that, when using HBM (Xilinx C1100), the architecture becomes compute-bound. On the DDR-based platform (AWS F1), the latency of the application equals that of streaming the entire dataset $\log N$ times to/from external memory. Moreover, by exploiting HBM's larger number of channels and applying a series of additional optimizations, CycloneNTT requires only $\frac{1}{6}\log N$ passes.
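For readers unfamiliar with the kernel being accelerated, the sketch below is a textbook iterative radix-2 NTT over a small prime field. It only illustrates the butterfly structure and the $\log N$ passes over the data mentioned above; CycloneNTT's FIFO mapping, external-memory streaming, and 64-bit field are not modeled, and the prime 998244353 with primitive root 3 is a common textbook choice, not a parameter from the paper.

```python
# Iterative Cooley-Tukey NTT over Z_p with p = 119 * 2^23 + 1 (illustrative only).
P = 998244353
ROOT = 3

def ntt(a):
    """In-place NTT of a list whose length is a power of two."""
    n = len(a)
    # bit-reversal permutation
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # butterfly stages: log2(n) passes over the data
    length = 2
    while length <= n:
        w_len = pow(ROOT, (P - 1) // length, P)   # primitive length-th root of unity
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w % P
                a[k] = (u + v) % P
                a[k + length // 2] = (u - v) % P
                w = w * w_len % P
        length <<= 1
    return a

print(ntt([1, 2, 3, 4, 5, 6, 7, 8]))
```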

    Computing and Compressing Electron Repulsion Integrals on FPGAs

    The computation of electron repulsion integrals (ERIs) over Gaussian-type orbitals (GTOs) is a challenging problem in quantum-mechanics-based atomistic simulations. In practical simulations, several trillions of ERIs may have to be computed for every time step. In this work, we investigate FPGAs as accelerators for the ERI computation. We use template parameters, here within the Intel oneAPI tool flow, to create customized designs for 256 different ERI quartet classes, based on their orbitals. To maximize data reuse, all intermediates are buffered in FPGA on-chip memory with a customized layout. The pre-calculation of intermediates also helps to overcome data dependencies caused by multi-dimensional recurrence relations. The involved loop structures are partially or even fully unrolled for high throughput of the FPGA kernels. Furthermore, a lossy compression algorithm utilizing arbitrary-bitwidth integers is integrated into the FPGA kernels. To the best of our knowledge, this is the first work on ERI computation on FPGAs that supports more than just the single most basic quartet class. Also, the integration of ERI computation and compression is a novelty that is not yet covered by CPU or GPU libraries. Our evaluation shows that, using 16-bit integers for the ERI compression, the fastest FPGA kernels exceed a performance of 10 GERIS ($10 \times 10^9$ ERIs per second) on one Intel Stratix 10 GX 2800 FPGA, with maximum absolute errors around $10^{-7}$ to $10^{-5}$ Hartree. The measured throughput can be accurately explained by a performance model. The FPGA kernels deployed on 2 FPGAs outperform similar computations using the widely used libint reference on a two-socket server with 40 Xeon Gold 6148 CPU cores of the same process technology by factors of up to 6.0x, and on a new two-socket server with 128 EPYC 7713 CPU cores by up to 1.9x.
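The abstract does not spell out the compression scheme beyond lossy quantization to fixed-width integers, so the following is only a generic sketch of that idea: floating-point ERI values are scaled to 16-bit signed integers with one shared scale per block, and the maximum absolute error is reported. The block size, rounding mode, and scale selection are assumptions, not the paper's algorithm.

```python
# Generic block-wise 16-bit quantization sketch (not the paper's FPGA scheme).
import numpy as np

def compress_block(values):
    """Quantize a block of floats to 16-bit signed integers plus one shared scale."""
    qmax = 2 ** 15 - 1
    peak = float(np.max(np.abs(values)))
    scale = peak / qmax if peak > 0 else 1.0
    codes = np.round(values / scale).astype(np.int16)
    return codes, scale

def decompress_block(codes, scale):
    """Reconstruct approximate floating-point values from the integer codes."""
    return codes.astype(np.float64) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    eris = rng.normal(scale=1e-3, size=4096)   # stand-in values, not real ERIs
    codes, scale = compress_block(eris)
    restored = decompress_block(codes, scale)
    print("max abs error:", np.max(np.abs(eris - restored)))
```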

    Online Modeling and Tuning of Parallel Stream Processing Systems

    Writing performant computer programs is hard. Code for high-performance applications is profiled, tweaked, and refactored for months, specifically for the hardware on which it is to run. Consumer application code doesn't get the benefit of the endless massaging that high-performance code receives, even though heterogeneous processor environments are beginning to resemble those in more performance-oriented arenas. This thesis offers a path to performant, parallel code (through stream processing) that is tuned online and automatically adapts to the environment it is given. This approach has the potential to reduce the tuning costs associated with high-performance code and brings the benefit of performance tuning to consumer applications where it would otherwise be cost-prohibitive. This thesis introduces a stream processing library and multiple techniques to enable its online modeling and tuning. Stream processing (also termed data-flow programming) is a compute paradigm that views an application as a set of logical kernels connected via communication links, or streams. Stream processing is increasingly used by computational-x and x-informatics fields (e.g., biology, astrophysics) where the focus is on safe and fast parallelization of specific big-data applications. A major advantage of stream processing is that it enables parallelization without requiring manual end-user management of the non-deterministic behavior often characteristic of more traditional parallel processing methods. Many big-data and high-performance applications involve high-throughput processing, necessitating the use of many parallel compute kernels on several compute cores. Optimizing the orchestration of kernels has been the focus of much theoretical and empirical modeling work. Purely theoretical parallel programming models can fail when the assumptions implicit within the model are mismatched with reality (i.e., the model is incorrectly applied). Often it is unclear whether the assumptions are actually being met, even when verified under controlled conditions. Full empirical optimization solves this problem by extensively searching the range of likely configurations under native operating conditions. This, however, is expensive in both time and energy. For large, massively parallel systems, even deciding which modeling paradigm to use is often prohibitively expensive and, unfortunately, transient (with workload and hardware). In an ideal world, a parallel run-time would re-optimize an application continuously to match its environment, with little additional overhead. This work presents methods aimed at doing just that, through low-overhead instrumentation, modeling, and optimization. Online optimization provides a good trade-off between static optimization and online heuristics. To enable online optimization, modeling decisions must be fast and relatively accurate. Online modeling and optimization of a stream processing system first requires a stream processing framework that is amenable to the intended type of dynamic manipulation. To fill this void, we developed the RaftLib C++ template library, which enables use of the stream processing paradigm in C++ applications (this run-time is the basis of almost all the work within this dissertation). The application topology is specified by the user; almost everything else is optimizable by the run-time. RaftLib takes advantage of the knowledge gained during the design of several prior streaming languages (notably Auto-Pipe).
The resultant framework enables online migration of tasks, auto-parallelization, online buffer reallocation, and other useful dynamic behaviors that were not available in many previous stream processing systems. Several benchmark applications have been designed to assess the performance gains of our approaches and to compare performance against other leading stream processing frameworks. Information is essential to any modeling task; to that end, a low-overhead instrumentation framework has been developed that is both dynamic and adaptive. Discovering a fast and relatively optimal configuration for a stream processing application often necessitates solving for buffer sizes within a finite-capacity queueing network. We show that a generalized gain/loss network flow model can bootstrap this process under certain conditions. Any modeling effort requires that a model be selected, which is often a highly manual task involving many expensive operations. This dissertation demonstrates that machine learning methods (such as a support vector machine) can successfully select models at run-time for a streaming application. The full set of approaches is incorporated into the open-source RaftLib framework.
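As a language-agnostic illustration of the kernel/stream model described above (RaftLib itself is a C++ template library, and none of the names below come from its API), the sketch connects three kernels through bounded FIFOs; the queue capacity stands in for the buffer sizes in the finite-capacity queueing network that the run-time would tune online.

```python
# Three kernels (producer -> transform -> consumer) linked by bounded queues.
import queue
import threading

SENTINEL = object()   # end-of-stream marker

def producer(out_q, n):
    for i in range(n):
        out_q.put(i)                 # blocks when the downstream buffer is full
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            return
        out_q.put(item * item)       # the "compute kernel"

def consumer(in_q, results):
    while True:
        item = in_q.get()
        if item is SENTINEL:
            return
        results.append(item)

def run_pipeline(n=1000, buffer_size=8):
    q1, q2 = queue.Queue(buffer_size), queue.Queue(buffer_size)
    results = []
    threads = [
        threading.Thread(target=producer, args=(q1, n)),
        threading.Thread(target=transform, args=(q1, q2)),
        threading.Thread(target=consumer, args=(q2, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(len(run_pipeline()))
```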

    Communication-Efficient Probabilistic Algorithms: Selection, Sampling, and Checking

    Get PDF
    This dissertation addresses three fundamental classes of problems in big-data systems, for which we develop communication-efficient probabilistic algorithms: the first part considers various selection problems, the second part weighted sampling, and the third part the probabilistic correctness checking of basic operations in big-data frameworks. This work is motivated by a growing need for communication efficiency, stemming from the fact that the share of supercomputer acquisition costs and energy consumption, as well as of the running time of distributed applications, that is attributable to the network and its use keeps growing. Surprisingly few communication-efficient algorithms are known for fundamental big-data problems; in this work we close some of these gaps. First, we consider various selection problems, starting with the distributed version of the classical selection problem, i.e., finding the element of rank k in a large distributed input. We show how this problem can be solved communication-efficiently without assuming that the input elements are randomly distributed. To this end, we replace the pivot-selection method in a long-known algorithm and show that this suffices. We then show that selection from locally sorted sequences (multisequence selection) can be solved considerably faster if the exact rank of the output element is allowed to vary within a certain range. We subsequently use this to construct a distributed priority queue with bulk operations, which we later employ to draw weighted samples from data streams (reservoir sampling). Finally, we consider the problem of identifying, with a sampling-based approach, the globally most frequent objects as well as those whose associated values add up to the largest sums. The chapter on weighted sampling first presents new construction algorithms for a classical data structure for this problem, so-called alias tables. We begin with the first linear-time construction algorithm for this data structure that requires only constant auxiliary memory. We then parallelize this algorithm for shared memory, obtaining the first parallel construction algorithm for alias tables. After that, we show how the problem can be tackled on distributed systems with a two-level algorithm. Next, we present an output-sensitive algorithm for weighted sampling with replacement; output-sensitive means that the running time of the algorithm depends on the number of unique elements in the output rather than on the size of the sample. This algorithm can be used sequentially, on shared-memory machines, and on distributed systems, and it is the first such algorithm in all three settings. We then adapt it to weighted sampling without replacement by combining it with an estimator for the number of unique elements in a sample drawn with replacement. Poisson sampling, a generalization of Bernoulli sampling to weighted elements, can be reduced to integer sorting, and we show how an existing approach can be parallelized.
For sampling from data streams, we adapt a sequential algorithm and show how it can be parallelized in a mini-batch model using the bulk priority queue introduced in the selection chapter. The chapter closes with an extensive evaluation of our alias-table construction algorithms, our output-sensitive algorithm for weighted sampling with replacement, and our algorithm for weighted reservoir sampling. To probabilistically verify the correctness of distributed algorithms, we propose checkers for basic operations of big-data frameworks. We show that checking numerous operations can be reduced to two "core" checkers, namely checking aggregations and checking whether one sequence is a permutation of another. While several approaches to the latter problem have been known for some time and are easy to parallelize, our sum-aggregation checker is a novel application of the same data structure that underlies counting Bloom filters and the count-min sketch. We implemented both checkers in Thrill, a big-data framework. Experiments with deliberately injected errors confirm the detection accuracy predicted by our theoretical analysis, even when we use commonly employed fast hash functions with theoretically suboptimal properties. Scaling experiments on a supercomputer show that our checkers incur only a very small runtime overhead, on the order of 2%, while almost guaranteeing the correctness of the result.
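To make the alias-table discussion concrete, here is the textbook Walker/Vose construction with O(1) sampling; the dissertation's specific contributions (linear-time construction with constant auxiliary memory and its shared-memory and distributed parallelizations) are not reproduced in this sketch.

```python
# Classic alias-table construction (Vose's method) and O(1) weighted sampling.
import random

def build_alias_table(weights):
    """Return (prob, alias) arrays for O(1) weighted sampling."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]          # move l's excess mass into s's slot
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                   # leftovers are numerically ~1.0
        prob[i] = 1.0
    return prob, alias

def sample(prob, alias, rng=random):
    """Draw one index proportionally to the original weights."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

prob, alias = build_alias_table([0.1, 0.2, 0.3, 0.4])
counts = [0] * 4
for _ in range(100000):
    counts[sample(prob, alias)] += 1
print(counts)
```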