293 research outputs found
FPGA acceleration of sequence analysis tools in bioinformatics
Thesis (Ph.D.) -- Boston University. With advances in biotechnology and computing power, biological data are being produced at an exceptional rate. The purpose of this study is to analyze the application of FPGAs to accelerating high-impact production biosequence analysis tools. Compared with the alternatives, FPGAs offer enormous compute power, lower power consumption, and reasonable flexibility.
BLAST has become the de facto standard in bioinformatic approximate string matching, so its acceleration is of fundamental importance. It is a complex, highly optimized system consisting of tens of thousands of lines of code and a large number of heuristics. Our idea is to emulate the main phases of its algorithm on the FPGA. Using our FPGA engine, we quickly reduce the database to a small fraction of its size, and then use the original code to process the query. On a standard FPGA-based system, we achieved a 12x speedup over a highly optimized multithreaded reference code.
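The prefilter-then-verify flow described above can be sketched in software. This is a minimal illustration of seed-based prefiltering, not the thesis's FPGA engine; all function names are illustrative, and the fixed k-mer seeding stands in for BLAST's more elaborate heuristics.

```python
# Sketch of a seed-based prefilter: keep only database sequences that
# share at least one exact k-mer ("seed") with the query, then hand the
# much smaller database to the full aligner for exact processing.

def kmers(seq, k=11):
    """All overlapping k-mers (substrings of length k) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def prefilter(query, database, k=11):
    """Return only the database sequences sharing a seed with the query."""
    query_seeds = kmers(query, k)
    return [s for s in database if kmers(s, k) & query_seeds]

db = ["ACGTACGTACGTAGG", "TTTTTTTTTTTTTTT", "GGGACGTACGTACGT"]
hits = prefilter("ACGTACGTACGT", db, k=8)
```

In the thesis's setup, this filtering stage runs on the FPGA at high throughput, and only the surviving fraction of the database is processed by the unmodified BLAST code.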
Multiple Sequence Alignment (MSA) -- the extension of pairwise sequence alignment to multiple sequences -- is critical to solving many biological problems. Previous attempts to accelerate Clustal-W, the most commonly used MSA code, have directly mapped a portion of the code to the FPGA. We use a new approach: we apply prefiltering of the kind commonly used in BLAST to perform the initial all-pairs alignments. This results in a speedup of 80x to 190x over the CPU code (8 cores). The quality is comparable to the original according to a commonly used benchmark suite evaluated with respect to multiple distance metrics.
The challenge in FPGA-based acceleration is finding a suitable application mapping. Unfortunately, many software heuristics do not map directly to hardware, so other methods must be applied. One is restructuring: an entirely new algorithm is applied. Another is to analyze application utilization and develop accuracy/performance tradeoffs. Using our prefiltering approach and novel FPGA programming models, we have achieved significant speedups over the reference programs. We have applied approximation, seeding, and filtering to this end. The bulk of this study introduces the pros and cons of these acceleration models for biosequence analysis tools.
Recovery From Node Failure in Distributed Query Processing
While distributed query processing has many advantages, the use of many independent, physically widespread computers almost universally leads to reliability issues. Several techniques have been developed to provide redundancy and the ability to recover from node failure during query processing. In this survey, we examine three such techniques -- upstream backup, active standby, and passive standby -- that have been used in both distributed stream processing and the distributed processing of static data. We also compare several recent systems that use these techniques and explore which recovery techniques work well under various conditions.
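The first of the surveyed techniques, upstream backup, can be sketched in a few lines: a producer retains every tuple it emits until the downstream node acknowledges it, so that a replacement node can be brought up and fed the unacknowledged tuples after a failure. The class and method names below are illustrative, not from any of the surveyed systems.

```python
# Minimal upstream-backup sketch: the producer keeps emitted tuples in a
# replay buffer until the consumer acknowledges them; after a consumer
# failure, the unacknowledged tuples are replayed to a replacement node.

class UpstreamBackupProducer:
    def __init__(self):
        self.buffer = {}      # seq_no -> tuple, awaiting acknowledgment
        self.next_seq = 0

    def emit(self, tup):
        """Send a tuple downstream, retaining it until acknowledged."""
        seq = self.next_seq
        self.buffer[seq] = tup
        self.next_seq += 1
        return seq, tup

    def ack(self, seq):
        """Consumer durably processed tuple `seq`; trim the buffer."""
        self.buffer.pop(seq, None)

    def replay(self):
        """On consumer failure: everything not yet acknowledged, in order."""
        return [self.buffer[s] for s in sorted(self.buffer)]

p = UpstreamBackupProducer()
p.emit("a"); p.emit("b"); p.emit("c")
p.ack(0)                     # only "a" was acknowledged before the crash
```

The tradeoff the survey examines is visible even here: upstream backup needs no standby replica, but recovery time grows with the replay buffer, whereas the standby variants pay for redundancy continuously to recover faster.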
Metadata-Aware Query Processing over Data Streams
Many modern applications need to process queries over potentially infinite data streams to provide answers in real time. This dissertation proposes novel techniques to optimize CPU and memory utilization in stream processing by exploiting metadata about the streaming data or the queries. It focuses on four topics: 1) exploiting stream metadata to optimize SPJ query operators via operator configuration, 2) exploiting stream metadata to optimize SPJ query plans via query rewriting, 3) exploiting workload metadata to optimize parameterized queries via indexing, and 4) exploiting event constraints to optimize event stream processing via run-time early termination.

The first part of this dissertation proposes algorithms for one of the most common and expensive query operators, the join, that identify and purge no-longer-needed data from operator state at runtime based on punctuations. Exploiting the combination of punctuations and commonly used window constraints is also studied. Extensive experimental evaluations demonstrate both reduced memory usage and improved execution time under the proposed strategies.

The second part proposes herald-driven runtime query plan optimization techniques. We identify four query optimization techniques and design a lightweight algorithm to efficiently detect optimization opportunities at runtime upon receiving heralds. We also propose a novel execution paradigm that supports multiple concurrent logical plans while maintaining a single physical plan. An extensive experimental study confirms that our techniques significantly reduce query execution times.

The third part deals with the shared execution of parameterized queries instantiated from a query template. We design a lightweight index mechanism that provides multiple access paths to the data to facilitate a wide range of parameterized queries. To withstand workload fluctuations, we propose an index tuning framework that tunes the index configurations in a timely manner. Extensive experimental evaluations demonstrate the effectiveness of the proposed strategies.

The last part proposes event query optimization techniques that exploit event constraints, such as exclusiveness or ordering relationships among events, extracted from workflows. Significant performance gains are shown to be achieved by the proposed constraint-aware event processing techniques.
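The punctuation-based state purging described for the join operator can be sketched as follows. This is a simplified symmetric hash join with illustrative names, not the dissertation's implementation: a punctuation on stream A for a key promises that no further A-tuples with that key will arrive, so B-tuples buffered only to match future A-tuples become dead state.

```python
# Simplified symmetric hash join over two streams A and B. A punctuation
# on stream A for `key` promises no more A-tuples with that key, so the
# B-state for `key` (kept only to match future A-tuples) can be purged.
# A-state for `key` must remain to join with future B-tuples.

class PunctuatedJoin:
    def __init__(self):
        self.state_a = {}   # key -> buffered A-tuples
        self.state_b = {}   # key -> buffered B-tuples

    def on_a(self, key, value):
        self.state_a.setdefault(key, []).append(value)
        # Join the new A-tuple against all buffered B-tuples for the key.
        return [(value, b) for b in self.state_b.get(key, [])]

    def on_b(self, key, value):
        self.state_b.setdefault(key, []).append(value)
        return [(a, value) for a in self.state_a.get(key, [])]

    def on_punct_a(self, key):
        # No further A-tuples with `key` will arrive: purge dead B-state.
        self.state_b.pop(key, None)

j = PunctuatedJoin()
j.on_a(1, "a1")
out = j.on_b(1, "b1")       # matches the buffered a1
j.on_punct_a(1)             # B-state for key 1 is now unreachable
```

The memory savings the evaluation reports come from exactly this effect: without punctuations, both hash tables grow without bound on unbounded streams.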
Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques
The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the landscape of the semiconductor field over the last 15 years has established power as a first-class design concern. As a result, the computing-systems community is forced to find alternative design approaches that facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at the software, hardware, and architectural levels. Over the last decade, a plethora of approximation techniques has emerged in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing: it reviews its motivation, terminology, and principles, and it classifies and presents the technical details of the state-of-the-art software and hardware approximation techniques.

Comment: Under review at ACM Computing Surveys.
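One widely studied software-level approximation of the kind this survey classifies is loop perforation: executing only a fraction of a loop's iterations and accepting a bounded accuracy loss. A minimal sketch (the function names and the sampling stride are illustrative, not from the survey):

```python
# Loop perforation: skip iterations of an expensive loop and work with
# the sampled subset. Here the exact mean of a list is approximated by
# visiting only every `stride`-th element -- roughly 1/stride the work,
# at a controlled loss of accuracy.

def exact_mean(xs):
    return sum(xs) / len(xs)

def perforated_mean(xs, stride=4):
    sampled = xs[::stride]          # execute only 1/stride of the iterations
    return sum(sampled) / len(sampled)

data = list(range(1000))
exact = exact_mean(data)            # 499.5
approx = perforated_mean(data, stride=4)
```

The accuracy/performance tradeoff is tuned through the perforation rate, which is exactly the kind of knob the survey's taxonomy of software approximation techniques is organized around.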
Scalable Computer System Design for Heterogeneous Natural Language Processing Models
Thesis (Ph.D.) -- Seoul National University Graduate School, Dept. of Electrical and Computer Engineering, 2021. 2. Advisor: Jangwoo Kim.

Modern neural-network (NN) accelerators have been successful by accelerating a small number of basic operations (e.g., convolution, fully connected, feedback) comprising the specific target neural-network models (e.g., CNN, RNN). However, this approach no longer works for the emerging full-scale natural language processing (NLP) neural-network models (e.g., memory networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only their own basic operations and/or customize them for specific models only, which leads to low performance improvement and narrow model coverage. An ideal NLP accelerator should therefore first identify all performance-critical operations required by different NLP models and support them in a single accelerator to achieve high model coverage, and should adaptively optimize its architecture to achieve the best performance for the given model.

To address these scalability and model/configuration diversity issues, the dissertation introduces two novel projects (i.e., MnnFast and NLP-Fast) to efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems (i.e., high memory bandwidth, heavy computation, and cache contention) in memory-augmented neural networks. Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation due to the model/configuration diversity in emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (i.e., CPU, GPU, FPGA) and thoroughly evaluate their performance improvement on each platform.

As the importance of natural language processing grows, companies and research groups are proposing diverse and complex kinds of NLP models; these models are becoming more complex in structure, larger in scale, and more varied in kind. This dissertation proposes several key ideas to address the complexity, scalability, and diversity of such NLP models: (1) static/dynamic analysis to identify the distribution of performance overheads across different NLP models; (2) a holistic model-parallelization technique, guided by that analysis, that optimizes memory usage for the dominant performance bottlenecks; (3) techniques that reduce the computation of various operations, together with a dynamic-scheduler technique that resolves the skewness problem introduced by the computation reduction; and (4) a technique that derives an optimized design for each model to cope with per-model performance diversity. Because these core techniques apply generically across many kinds of hardware accelerators (e.g., CPU, GPU, FPGA, ASIC), they can be used broadly in computer-system design for NLP models. The dissertation shows that the proposed techniques all achieve meaningful performance improvements in CPU, GPU, and FPGA environments.

1 INTRODUCTION 1
2 Background 6
2.1 Memory Networks 6
2.2 Deep Learning for NLP 9
3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks 14
3.1 Motivation & Design Goals 14
3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements 15
3.1.2 Performance Problems in MemNN - High Computation 16
3.1.3 Performance Problems in MemNN - Shared Cache Contention 17
3.1.4 Design Goals 18
3.2 MnnFast 19
3.2.1 Column-Based Algorithm 19
3.2.2 Zero Skipping 22
3.2.3 Embedding Cache 25
3.3 Implementation 26
3.3.1 General-Purpose Architecture - CPU 26
3.3.2 General-Purpose Architecture - GPU 28
3.3.3 Custom Hardware (FPGA) 29
3.4 Evaluation 31
3.4.1 Experimental Setup 31
3.4.2 CPU 33
3.4.3 GPU 35
3.4.4 FPGA 37
3.4.5 Comparison Between CPU and FPGA 39
3.5 Conclusion 39
4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models 40
4.1 Motivation & Design Goals 40
4.1.1 High Model Complexity 40
4.1.2 High Memory Bandwidth 41
4.1.3 Heavy Computation 42
4.1.4 Huge Performance Variation 43
4.1.5 Design Goals 43
4.2 NLP-Fast 44
4.2.1 Bottleneck Analysis of NLP Models 44
4.2.2 Holistic Model Partitioning 47
4.2.3 Cross-operation Zero Skipping 51
4.2.4 Adaptive Hardware Reconfiguration 54
4.3 NLP-Fast Toolkit 56
4.4 Implementation 59
4.4.1 General-Purpose Architecture - CPU 59
4.4.2 General-Purpose Architecture - GPU 61
4.4.3 Custom Hardware (FPGA) 62
4.5 Evaluation 64
4.5.1 Experimental Setup 65
4.5.2 CPU 65
4.5.3 GPU 67
4.5.4 FPGA 69
4.6 Conclusion 72
5 Related Work 73
5.1 Various DNN Accelerators 73
5.2 Various NLP Accelerators 74
5.3 Model Partitioning 75
5.4 Approximation 76
5.5 Improving Flexibility 78
5.6 Resource Optimization 78
6 Conclusion 80
Abstract (In Korean) 106
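The zero-skipping optimization that appears in both MnnFast and NLP-Fast (Sections 3.2.2 and 4.2.3 of the outline above) can be illustrated in plain software. This is a sketch under the assumption that attention-style weight vectors contain many (near-)zero entries after softmax; the function names and threshold are illustrative, not the authors' implementation.

```python
# Zero skipping in a weighted sum of vectors: entries whose weight is
# (near) zero contribute nothing to the output, so the corresponding
# rows need not be read or multiplied at all -- saving both computation
# and off-chip memory bandwidth, the two bottlenecks the thesis targets.

def weighted_sum_dense(weights, rows):
    dim = len(rows[0])
    out = [0.0] * dim
    for w, row in zip(weights, rows):
        for i in range(dim):
            out[i] += w * row[i]
    return out

def weighted_sum_zero_skip(weights, rows, threshold=1e-6):
    dim = len(rows[0])
    out = [0.0] * dim
    for w, row in zip(weights, rows):
        if abs(w) < threshold:      # skip the whole row: no reads, no MACs
            continue
        for i in range(dim):
            out[i] += w * row[i]
    return out

w = [0.7, 0.0, 0.3, 1e-9]           # softmax-like weights, mostly negligible
r = [[1.0, 2.0], [9.0, 9.0], [3.0, 4.0], [5.0, 5.0]]
dense = weighted_sum_dense(w, r)
skipped = weighted_sum_zero_skip(w, r)
```

On hardware, the win is larger than the skipped multiplications alone, because the skipped rows are never fetched from memory in the first place.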
Pay One, Get Hundreds for Free: Reducing Cloud Costs through Shared Query Execution
Cloud-based data analysis is now common practice because of the lower system management overhead and the pay-as-you-go pricing model. The pricing model, however, is not always suitable for query processing, as heavy use results in high costs. For example, in query-as-a-service systems, where users are charged per processed byte, collections of queries frequently accessing the same data can become expensive. The problem is compounded by the limited options users have to optimize query execution when using declarative interfaces such as SQL. In this paper, we show how, without modifying existing systems and without the involvement of the cloud provider, it is possible to significantly reduce the overhead, and hence the cost, of query-as-a-service systems. Our approach is based on query rewriting: multiple concurrent queries are combined into a single query. Our experiments show that the aggregated amount of work done by the shared execution is smaller than in a query-at-a-time approach. Since queries are charged per byte processed, the cost of executing a group of queries is often the same as executing a single one of them. As an example, we demonstrate that shared execution of the TPC-H benchmark is up to 100x cheaper in Amazon Athena and up to 16x cheaper in Google BigQuery than a query-at-a-time approach, while achieving higher throughput.
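The economics behind the rewriting idea can be sketched as follows. The paper rewrites SQL text; this Python model (illustrative names throughout) only shows why sharing a scan helps under per-byte billing: N queries over the same table cost N scans one at a time, but a merged query evaluates every query's predicate during a single scan.

```python
# Shared-execution sketch: instead of scanning the table once per query,
# scan it once and evaluate each query's predicate on every row, tagging
# results by query. Under per-byte billing, the scan is paid once, not N
# times, while each query still receives exactly its own result set.

def run_one_at_a_time(table, predicates):
    # N queries -> N full scans of the table.
    results = {q: [row for row in table if pred(row)]
               for q, pred in predicates.items()}
    rows_scanned = len(table) * len(predicates)
    return results, rows_scanned

def run_shared(table, predicates):
    # One scan serves all queries.
    results = {q: [] for q in predicates}
    for row in table:
        for q, pred in predicates.items():
            if pred(row):
                results[q].append(row)
    return results, len(table)

table = [{"price": p} for p in (5, 15, 25)]
preds = {"q1": lambda r: r["price"] > 10,
         "q2": lambda r: r["price"] < 20}
solo, solo_scanned = run_one_at_a_time(table, preds)
shared, shared_scanned = run_shared(table, preds)
```

With per-byte pricing, the shared variant's bill is that of a single query, which is the source of the up-to-100x cost reduction the paper reports.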