293 research outputs found
FPGA acceleration of sequence analysis tools in bioinformatics
Thesis (Ph.D.) -- Boston University. With advances in biotechnology and computing power, biological data are being produced at an exceptional rate. The purpose of this study is to analyze the application of FPGAs to accelerating high-impact production biosequence analysis tools. Compared with the alternatives, FPGAs offer enormous compute power, lower power consumption, and reasonable flexibility.
BLAST has become the de facto standard in bioinformatic approximate string matching, so its acceleration is of fundamental importance. It is a complex, highly optimized system consisting of tens of thousands of lines of code and a large number of heuristics. Our idea is to emulate the main phases of its algorithm on the FPGA. Using our FPGA engine, we quickly reduce the database to a small fraction of its size, and then use the original code to process the query. On a standard FPGA-based system, we achieved a 12x speedup over a highly optimized multithreaded reference code.
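The prefilter-then-verify flow described above can be sketched in software. This is a minimal illustration of seed-based prefiltering, not the thesis's FPGA engine; all function names are illustrative, and the fixed k-mer seeding stands in for BLAST's more elaborate heuristics.

```python
# Sketch of a seed-based prefilter: keep only database sequences that
# share at least one exact k-mer ("seed") with the query, then hand the
# much smaller database to the full aligner for exact processing.

def kmers(seq, k=11):
    """All overlapping k-mers (substrings of length k) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def prefilter(query, database, k=11):
    """Return only the database sequences sharing a seed with the query."""
    query_seeds = kmers(query, k)
    return [s for s in database if kmers(s, k) & query_seeds]

db = ["ACGTACGTACGTAGG", "TTTTTTTTTTTTTTT", "GGGACGTACGTACGT"]
hits = prefilter("ACGTACGTACGT", db, k=8)
```

In the thesis's setup, this filtering stage runs on the FPGA at high throughput, and only the surviving fraction of the database is processed by the unmodified BLAST code.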
Multiple Sequence Alignment (MSA) -- the extension of pairwise sequence alignment to multiple sequences -- is critical to solving many biological problems. Previous attempts to accelerate Clustal-W, the most commonly used MSA code, have directly mapped a portion of the code to the FPGA. We use a new approach: we apply prefiltering of the kind commonly used in BLAST to perform the initial all-pairs alignments. This results in a speedup of 80x to 190x over the CPU code (8 cores). The quality is comparable to the original according to a commonly used benchmark suite evaluated with respect to multiple distance metrics.
The challenge in FPGA-based acceleration is finding a suitable application mapping. Unfortunately, many software heuristics do not map directly to hardware, so other methods must be applied. One is restructuring: an entirely new algorithm is applied. Another is to analyze application utilization and develop accuracy/performance tradeoffs. Using our prefiltering approach and novel FPGA programming models, we have achieved significant speedups over the reference programs. We have applied approximation, seeding, and filtering to this end. The bulk of this study introduces the pros and cons of these acceleration models for biosequence analysis tools.
Recovery From Node Failure in Distributed Query Processing
While distributed query processing has many advantages, the use of many independent, physically widespread computers almost universally leads to reliability issues. Several techniques have been developed to provide redundancy and the ability to recover from node failure during query processing. In this survey, we examine three such techniques -- upstream backup, active standby, and passive standby -- that have been used in both distributed stream processing and the distributed processing of static data. We also compare several recent systems that use these techniques and explore which recovery techniques work well under various conditions.
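The first of the surveyed techniques, upstream backup, can be sketched in a few lines: a producer retains every tuple it emits until the downstream node acknowledges it, so that a replacement node can be brought up and fed the unacknowledged tuples after a failure. The class and method names below are illustrative, not from any of the surveyed systems.

```python
# Minimal upstream-backup sketch: the producer keeps emitted tuples in a
# replay buffer until the consumer acknowledges them; after a consumer
# failure, the unacknowledged tuples are replayed to a replacement node.

class UpstreamBackupProducer:
    def __init__(self):
        self.buffer = {}      # seq_no -> tuple, awaiting acknowledgment
        self.next_seq = 0

    def emit(self, tup):
        """Send a tuple downstream, retaining it until acknowledged."""
        seq = self.next_seq
        self.buffer[seq] = tup
        self.next_seq += 1
        return seq, tup

    def ack(self, seq):
        """Consumer durably processed tuple `seq`; trim the buffer."""
        self.buffer.pop(seq, None)

    def replay(self):
        """On consumer failure: everything not yet acknowledged, in order."""
        return [self.buffer[s] for s in sorted(self.buffer)]

p = UpstreamBackupProducer()
p.emit("a"); p.emit("b"); p.emit("c")
p.ack(0)                     # only "a" was acknowledged before the crash
```

The tradeoff the survey examines is visible even here: upstream backup needs no standby replica, but recovery time grows with the replay buffer, whereas the standby variants pay for redundancy continuously to recover faster.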
Metadata-Aware Query Processing over Data Streams
Many modern applications need to process queries over potentially infinite data streams to provide answers in real time. This dissertation proposes novel techniques to optimize CPU and memory utilization in stream processing by exploiting metadata about the streaming data or the queries. It focuses on four topics: 1) exploiting stream metadata to optimize SPJ query operators via operator configuration, 2) exploiting stream metadata to optimize SPJ query plans via query rewriting, 3) exploiting workload metadata to optimize parameterized queries via indexing, and 4) exploiting event constraints to optimize event stream processing via run-time early termination.

The first part of this dissertation proposes algorithms for one of the most common and expensive query operators, the join, that identify and purge no-longer-needed data from operator state at runtime based on punctuations. Exploiting the combination of punctuations and commonly used window constraints is also studied. Extensive experimental evaluations demonstrate both reduced memory usage and improved execution time under the proposed strategies.

The second part proposes herald-driven runtime query plan optimization techniques. We identify four query optimization techniques and design a lightweight algorithm to efficiently detect optimization opportunities at runtime upon receiving heralds. We also propose a novel execution paradigm that supports multiple concurrent logical plans while maintaining a single physical plan. An extensive experimental study confirms that our techniques significantly reduce query execution times.

The third part deals with the shared execution of parameterized queries instantiated from a query template. We design a lightweight index mechanism that provides multiple access paths to the data to facilitate a wide range of parameterized queries. To withstand workload fluctuations, we propose an index tuning framework that tunes the index configurations in a timely manner. Extensive experimental evaluations demonstrate the effectiveness of the proposed strategies.

The last part proposes event query optimization techniques that exploit event constraints, such as exclusiveness or ordering relationships among events, extracted from workflows. Significant performance gains are shown to be achieved by the proposed constraint-aware event processing techniques.
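The punctuation-based state purging described for the join operator can be sketched as follows. This is a simplified symmetric hash join with illustrative names, not the dissertation's implementation: a punctuation on stream A for a key promises that no further A-tuples with that key will arrive, so B-tuples buffered only to match future A-tuples become dead state.

```python
# Simplified symmetric hash join over two streams A and B. A punctuation
# on stream A for `key` promises no more A-tuples with that key, so the
# B-state for `key` (kept only to match future A-tuples) can be purged.
# A-state for `key` must remain to join with future B-tuples.

class PunctuatedJoin:
    def __init__(self):
        self.state_a = {}   # key -> buffered A-tuples
        self.state_b = {}   # key -> buffered B-tuples

    def on_a(self, key, value):
        self.state_a.setdefault(key, []).append(value)
        # Join the new A-tuple against all buffered B-tuples for the key.
        return [(value, b) for b in self.state_b.get(key, [])]

    def on_b(self, key, value):
        self.state_b.setdefault(key, []).append(value)
        return [(a, value) for a in self.state_a.get(key, [])]

    def on_punct_a(self, key):
        # No further A-tuples with `key` will arrive: purge dead B-state.
        self.state_b.pop(key, None)

j = PunctuatedJoin()
j.on_a(1, "a1")
out = j.on_b(1, "b1")       # matches the buffered a1
j.on_punct_a(1)             # B-state for key 1 is now unreachable
```

The memory savings the evaluation reports come from exactly this effect: without punctuations, both hash tables grow without bound on unbounded streams.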
Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques
The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the landscape of the semiconductor field over the last 15 years has established power as a first-class design concern. As a result, the computing-systems community is forced to find alternative design approaches that facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at the software, hardware, and architectural levels. Over the last decade, a plethora of approximation techniques has emerged in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing: it reviews its motivation, terminology, and principles, and it classifies and presents the technical details of the state-of-the-art software and hardware approximation techniques.

Comment: Under review at ACM Computing Surveys.
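One widely studied software-level approximation of the kind this survey classifies is loop perforation: executing only a fraction of a loop's iterations and accepting a bounded accuracy loss. A minimal sketch (the function names and the sampling stride are illustrative, not from the survey):

```python
# Loop perforation: skip iterations of an expensive loop and work with
# the sampled subset. Here the exact mean of a list is approximated by
# visiting only every `stride`-th element -- roughly 1/stride the work,
# at a controlled loss of accuracy.

def exact_mean(xs):
    return sum(xs) / len(xs)

def perforated_mean(xs, stride=4):
    sampled = xs[::stride]          # execute only 1/stride of the iterations
    return sum(sampled) / len(sampled)

data = list(range(1000))
exact = exact_mean(data)            # 499.5
approx = perforated_mean(data, stride=4)
```

The accuracy/performance tradeoff is tuned through the perforation rate, which is exactly the kind of knob the survey's taxonomy of software approximation techniques is organized around.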
Scalable Computer System Design for Heterogeneous Natural Language Processing Models
Thesis (Ph.D.) -- Seoul National University Graduate School, Dept. of Electrical and Computer Engineering, 2021. 2. Advisor: Jangwoo Kim.

Modern neural-network (NN) accelerators have been successful by accelerating a small number of basic operations (e.g., convolution, fully connected, feedback) comprising the specific target neural-network models (e.g., CNN, RNN). However, this approach no longer works for the emerging full-scale natural language processing (NLP) neural-network models (e.g., memory networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only their own basic operations and/or customize them for specific models only, which leads to low performance improvement and narrow model coverage. An ideal NLP accelerator should therefore first identify all performance-critical operations required by different NLP models and support them in a single accelerator to achieve high model coverage, and should adaptively optimize its architecture to achieve the best performance for the given model.

To address these scalability and model/configuration diversity issues, the dissertation introduces two novel projects (i.e., MnnFast and NLP-Fast) to efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems (i.e., high memory bandwidth, heavy computation, and cache contention) in memory-augmented neural networks. Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation due to the model/configuration diversity in emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (i.e., CPU, GPU, FPGA) and thoroughly evaluate their performance improvement on each platform.

As the importance of natural language processing grows, companies and research groups are proposing diverse and complex kinds of NLP models; these models are becoming more complex in structure, larger in scale, and more varied in kind. This dissertation proposes several key ideas to address the complexity, scalability, and diversity of such NLP models: (1) static/dynamic analysis to identify the distribution of performance overheads across different NLP models; (2) a holistic model-parallelization technique, guided by that analysis, that optimizes memory usage for the dominant performance bottlenecks; (3) techniques that reduce the computation of various operations, together with a dynamic-scheduler technique that resolves the skewness problem introduced by the computation reduction; and (4) a technique that derives an optimized design for each model to cope with per-model performance diversity. Because these core techniques apply generically across many kinds of hardware accelerators (e.g., CPU, GPU, FPGA, ASIC), they can be used broadly in computer-system design for NLP models. The dissertation shows that the proposed techniques all achieve meaningful performance improvements in CPU, GPU, and FPGA environments.

1 INTRODUCTION 1
2 Background 6
2.1 Memory Networks 6
2.2 Deep Learning for NLP 9
3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks 14
3.1 Motivation & Design Goals 14
3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements 15
3.1.2 Performance Problems in MemNN - High Computation 16
3.1.3 Performance Problems in MemNN - Shared Cache Contention 17
3.1.4 Design Goals 18
3.2 MnnFast 19
3.2.1 Column-Based Algorithm 19
3.2.2 Zero Skipping 22
3.2.3 Embedding Cache 25
3.3 Implementation 26
3.3.1 General-Purpose Architecture - CPU 26
3.3.2 General-Purpose Architecture - GPU 28
3.3.3 Custom Hardware (FPGA) 29
3.4 Evaluation 31
3.4.1 Experimental Setup 31
3.4.2 CPU 33
3.4.3 GPU 35
3.4.4 FPGA 37
3.4.5 Comparison Between CPU and FPGA 39
3.5 Conclusion 39
4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models 40
4.1 Motivation & Design Goals 40
4.1.1 High Model Complexity 40
4.1.2 High Memory Bandwidth 41
4.1.3 Heavy Computation 42
4.1.4 Huge Performance Variation 43
4.1.5 Design Goals 43
4.2 NLP-Fast 44
4.2.1 Bottleneck Analysis of NLP Models 44
4.2.2 Holistic Model Partitioning 47
4.2.3 Cross-operation Zero Skipping 51
4.2.4 Adaptive Hardware Reconfiguration 54
4.3 NLP-Fast Toolkit 56
4.4 Implementation 59
4.4.1 General-Purpose Architecture - CPU 59
4.4.2 General-Purpose Architecture - GPU 61
4.4.3 Custom Hardware (FPGA) 62
4.5 Evaluation 64
4.5.1 Experimental Setup 65
4.5.2 CPU 65
4.5.3 GPU 67
4.5.4 FPGA 69
4.6 Conclusion 72
5 Related Work 73
5.1 Various DNN Accelerators 73
5.2 Various NLP Accelerators 74
5.3 Model Partitioning 75
5.4 Approximation 76
5.5 Improving Flexibility 78
5.6 Resource Optimization 78
6 Conclusion 80
Abstract (In Korean) 106
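The zero-skipping optimization that appears in both MnnFast and NLP-Fast (Sections 3.2.2 and 4.2.3 of the outline above) can be illustrated in plain software. This is a sketch under the assumption that attention-style weight vectors contain many (near-)zero entries after softmax; the function names and threshold are illustrative, not the authors' implementation.

```python
# Zero skipping in a weighted sum of vectors: entries whose weight is
# (near) zero contribute nothing to the output, so the corresponding
# rows need not be read or multiplied at all -- saving both computation
# and off-chip memory bandwidth, the two bottlenecks the thesis targets.

def weighted_sum_dense(weights, rows):
    dim = len(rows[0])
    out = [0.0] * dim
    for w, row in zip(weights, rows):
        for i in range(dim):
            out[i] += w * row[i]
    return out

def weighted_sum_zero_skip(weights, rows, threshold=1e-6):
    dim = len(rows[0])
    out = [0.0] * dim
    for w, row in zip(weights, rows):
        if abs(w) < threshold:      # skip the whole row: no reads, no MACs
            continue
        for i in range(dim):
            out[i] += w * row[i]
    return out

w = [0.7, 0.0, 0.3, 1e-9]           # softmax-like weights, mostly negligible
r = [[1.0, 2.0], [9.0, 9.0], [3.0, 4.0], [5.0, 5.0]]
dense = weighted_sum_dense(w, r)
skipped = weighted_sum_zero_skip(w, r)
```

On hardware, the win is larger than the skipped multiplications alone, because the skipped rows are never fetched from memory in the first place.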
Pay One, Get Hundreds for Free: Reducing Cloud Costs through Shared Query Execution
Cloud-based data analysis is now common practice because of the lower system management overhead and the pay-as-you-go pricing model. The pricing model, however, is not always suitable for query processing, as heavy use results in high costs. For example, in query-as-a-service systems, where users are charged per processed byte, collections of queries frequently accessing the same data can become expensive. The problem is compounded by the limited options users have to optimize query execution when using declarative interfaces such as SQL. In this paper, we show how, without modifying existing systems and without the involvement of the cloud provider, it is possible to significantly reduce the overhead, and hence the cost, of query-as-a-service systems. Our approach is based on query rewriting: multiple concurrent queries are combined into a single query. Our experiments show that the aggregated amount of work done by the shared execution is smaller than in a query-at-a-time approach. Since queries are charged per byte processed, the cost of executing a group of queries is often the same as executing a single one of them. As an example, we demonstrate that shared execution of the TPC-H benchmark is up to 100x cheaper in Amazon Athena and up to 16x cheaper in Google BigQuery than a query-at-a-time approach, while achieving higher throughput.
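The economics behind the rewriting idea can be sketched as follows. The paper rewrites SQL text; this Python model (illustrative names throughout) only shows why sharing a scan helps under per-byte billing: N queries over the same table cost N scans one at a time, but a merged query evaluates every query's predicate during a single scan.

```python
# Shared-execution sketch: instead of scanning the table once per query,
# scan it once and evaluate each query's predicate on every row, tagging
# results by query. Under per-byte billing, the scan is paid once, not N
# times, while each query still receives exactly its own result set.

def run_one_at_a_time(table, predicates):
    # N queries -> N full scans of the table.
    results = {q: [row for row in table if pred(row)]
               for q, pred in predicates.items()}
    rows_scanned = len(table) * len(predicates)
    return results, rows_scanned

def run_shared(table, predicates):
    # One scan serves all queries.
    results = {q: [] for q in predicates}
    for row in table:
        for q, pred in predicates.items():
            if pred(row):
                results[q].append(row)
    return results, len(table)

table = [{"price": p} for p in (5, 15, 25)]
preds = {"q1": lambda r: r["price"] > 10,
         "q2": lambda r: r["price"] < 20}
solo, solo_scanned = run_one_at_a_time(table, preds)
shared, shared_scanned = run_shared(table, preds)
```

With per-byte pricing, the shared variant's bill is that of a single query, which is the source of the up-to-100x cost reduction the paper reports.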