293 research outputs found

    FPGA acceleration of sequence analysis tools in bioinformatics

    Thesis (Ph.D.)--Boston University
    With advances in biotechnology and computing power, biological data are being produced at an exceptional rate. The purpose of this study is to analyze the application of FPGAs to accelerate high-impact production biosequence analysis tools. Compared with the alternatives, FPGAs offer enormous compute power, lower power consumption, and reasonable flexibility. BLAST has become the de facto standard in bioinformatic approximate string matching, and so its acceleration is of fundamental importance. It is a complex, highly optimized system consisting of tens of thousands of lines of code and a large number of heuristics. Our idea is to emulate the main phases of its algorithm on the FPGA. Using our FPGA engine, we quickly reduce the database to a small fraction of its original size, and then use the original code to process the query against that remainder. On a standard FPGA-based system, we achieved a 12x speedup over a highly optimized multithreaded reference code. Multiple Sequence Alignment (MSA)--the extension of pairwise sequence alignment to multiple sequences--is critical to solving many biological problems. Previous attempts to accelerate Clustal-W, the most commonly used MSA code, have directly mapped a portion of the code to the FPGA. We take a new approach: we apply prefiltering of the kind commonly used in BLAST to perform the initial all-pairs alignments. This results in a speedup of 80x to 190x over the CPU code (8 cores). The quality is comparable to the original according to a commonly used benchmark suite evaluated with respect to multiple distance metrics. The challenge in FPGA-based acceleration is finding a suitable application mapping. Unfortunately, many software heuristics do not map directly to hardware, so other methods must be applied. One is restructuring: an entirely new algorithm is applied. Another is to analyze application utilization and develop accuracy/performance tradeoffs. Using our prefiltering approach and novel FPGA programming models, we have achieved significant speedups over the reference programs. We have applied approximation, seeding, and filtering to this end. The bulk of this study introduces the pros and cons of these acceleration models for biosequence analysis tools.
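    The prefiltering stage lends itself to a compact illustration. The sketch below (Python, with hypothetical names kmer_set and prefilter; the thesis realizes this stage in FPGA hardware, not software) drops database sequences that share no k-mer seed with the query, so the unmodified aligner only processes the surviving fraction.

        # Hypothetical software sketch of seed-based prefiltering; the thesis
        # implements the equivalent filter as an FPGA engine.
        def kmer_set(seq, k=11):
            """All overlapping k-mers (seeds) of a sequence."""
            return {seq[i:i + k] for i in range(len(seq) - k + 1)}

        def prefilter(query, database, k=11):
            """Keep only database sequences sharing at least one seed with the query."""
            seeds = kmer_set(query, k)
            return [s for s in database if seeds & kmer_set(s, k)]

        # The surviving sequences are handed to the original, unmodified aligner.
        db = ["ACGTACGTACGTACG", "TTTTTTTTTTTTTTT", "ACGTACGTTTTTTTT"]
        print(prefilter("ACGTACGTACGT", db, k=8))  # keeps the 1st and 3rd sequences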

    Recovery From Node Failure in Distributed Query Processing

    While distributed query processing has many advantages, the use of many independent, physically widespread computers almost universally leads to reliability issues. Several techniques have been developed to provide redundancy and the ability to recover from node failure during query processing. In this survey, we examine three such techniques--upstream backup, active standby, and passive standby--that have been used both in distributed stream data processing and in the distributed processing of static data. We also compare several recent systems that use these techniques and explore which recovery techniques work well under various conditions.
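    To make the trade-offs concrete, here is a minimal toy sketch (Python; all class and method names are illustrative, not from any surveyed system) of passive standby: the primary periodically ships a state snapshot to the standby, and tuples received since the last snapshot are buffered, upstream-backup style, for replay on failover.

        import copy

        class Operator:
            """Toy stateful operator: a running sum."""
            def __init__(self):
                self.state = 0
            def process(self, tup):
                self.state += tup

        class PassiveStandby:
            """Checkpoint state every `interval` tuples; buffer the rest for replay."""
            def __init__(self, interval=3):
                self.primary = Operator()
                self.checkpoint = copy.deepcopy(self.primary)
                self.buffer = []
                self.interval = interval
                self.count = 0
            def process(self, tup):
                self.primary.process(tup)
                self.buffer.append(tup)
                self.count += 1
                if self.count % self.interval == 0:
                    self.checkpoint = copy.deepcopy(self.primary)  # ship snapshot
                    self.buffer.clear()
            def failover(self):
                """Restore the last snapshot, then replay buffered tuples."""
                standby = copy.deepcopy(self.checkpoint)
                for tup in self.buffer:
                    standby.process(tup)
                return standby

        ps = PassiveStandby(interval=3)
        for t in (1, 2, 3, 4, 5):
            ps.process(t)
        assert ps.failover().state == ps.primary.state  # snapshot (6) + replay of 4, 5

    Active standby would instead run a second Operator on every tuple (no replay, but double the work); pure upstream backup would keep only the buffer and rebuild state from scratch, trading recovery time for lower runtime overhead.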

    Metadata-Aware Query Processing over Data Streams

    Many modern applications need to process queries over potentially infinite data streams to provide answers in real time. This dissertation proposes novel techniques to optimize CPU and memory utilization in stream processing by exploiting metadata on streaming data or queries. It focuses on four topics: 1) exploiting stream metadata to optimize SPJ query operators via operator configuration, 2) exploiting stream metadata to optimize SPJ query plans via query rewriting, 3) exploiting workload metadata to optimize parameterized queries via indexing, and 4) exploiting event constraints to optimize event stream processing via run-time early termination. The first part of this dissertation proposes algorithms for one of the most common and expensive query operators, the join, to identify and purge no-longer-needed data from its state at runtime based on punctuations; exploiting the combination of punctuations and commonly used window constraints is also studied. Extensive experimental evaluations demonstrate both reduced memory usage and improved execution time under the proposed strategies. The second part proposes herald-driven runtime query plan optimization techniques. We identify four query optimization techniques and design a lightweight algorithm to efficiently detect optimization opportunities at runtime upon receiving heralds. We also propose a novel execution paradigm that supports multiple concurrent logical plans while maintaining a single physical plan. An extensive experimental study confirms that our techniques significantly reduce query execution times. The third part deals with the shared execution of parameterized queries instantiated from a query template. We design a lightweight index mechanism that provides multiple access paths to the data to facilitate a wide range of parameterized queries. To withstand workload fluctuations, we propose an index tuning framework that tunes the index configurations in a timely manner. Extensive experimental evaluations demonstrate the effectiveness of the proposed strategies. The last part proposes event query optimization techniques that exploit event constraints, such as exclusiveness or ordering relationships among events, extracted from workflows. Our constraint-aware event processing techniques are shown to achieve significant performance gains.
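    As a flavor of the first topic, the sketch below (Python; a symmetric hash join with hypothetical names, not the dissertation's operators) shows how a punctuation asserting "no more left tuples with key <= k will arrive" lets the join purge right-side state that could only ever have matched such tuples.

        from collections import defaultdict

        class PunctuatedJoin:
            """Symmetric hash join that purges state on punctuations."""
            def __init__(self):
                self.left = defaultdict(list)
                self.right = defaultdict(list)
            def insert(self, side, key, value):
                mine, other = (self.left, self.right) if side == "L" else (self.right, self.left)
                mine[key].append(value)
                return [(value, v) for v in other[key]]  # emit join matches
            def punctuate(self, side, key_bound):
                """No future tuples from `side` with key <= key_bound: state for
                those keys on the other side can never match again, so drop it."""
                other = self.right if side == "L" else self.left
                for k in [k for k in other if k <= key_bound]:
                    del other[k]

        j = PunctuatedJoin()
        j.insert("R", 1, "r1")
        print(j.insert("L", 1, "l1"))  # [('l1', 'r1')]
        j.punctuate("L", 1)            # right-side state for key 1 is purged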

    Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques

    The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the semiconductor landscape of the last 15 years has made power a first-class design concern. As a result, the computing-systems community has been forced to find alternative design approaches that facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at the software, hardware, and architectural levels. The last decade has produced a plethora of approximation techniques in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing; it reviews its motivation, terminology, and principles, and classifies and presents the technical details of state-of-the-art software and hardware approximation techniques.
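    As a taste of the software end of that spectrum, the following sketch (Python; the example function is ours, not from the article) shows loop perforation, a classic program-level approximation: skip a fixed fraction of loop iterations and accept a bounded error in exchange for proportionally less work.

        def mean_exact(xs):
            return sum(xs) / len(xs)

        def mean_perforated(xs, stride=4):
            """Visit every `stride`-th element: roughly stride-times less work,
            at the cost of a sampling error in the result."""
            sampled = xs[::stride]
            return sum(sampled) / len(sampled)

        data = [float(i % 100) for i in range(1_000_000)]
        print(mean_exact(data), mean_perforated(data, stride=8))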

    이쒅 μžμ—°μ–΄ 처리 λͺ¨λΈμ„ μœ„ν•œ ν™•μž₯ν˜• 컴퓨터 μ‹œμŠ€ν…œ 섀계

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: Jangwoo Kim.
    Modern neural-network (NN) accelerators have been successful by accelerating a small number of basic operations (e.g., convolution, fully-connected, feedback) comprising the specific target neural-network models (e.g., CNN, RNN). However, this approach no longer works for the emerging full-scale natural language processing (NLP) neural-network models (e.g., Memory Networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only their own basic operations and/or customize them for specific models, which leads to low performance improvement and narrow model coverage. An ideal NLP accelerator should therefore first identify all performance-critical operations required by different NLP models and support them in a single accelerator to achieve high model coverage, and should adaptively optimize its architecture to achieve the best performance for the given model. To address these scalability and model/configuration-diversity issues, this dissertation introduces two projects, MnnFast and NLP-Fast, to efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems in memory-augmented neural networks: high memory bandwidth, heavy computation, and cache contention. Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation caused by model/configuration diversity in emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (CPU, GPU, FPGA) and thoroughly evaluate their performance improvement on each platform.
    As natural language processing has grown in importance, companies and research groups have been introducing diverse and complex NLP models: the models are becoming more complex in structure, larger in scale, and more varied in kind. To address this complexity, scalability, and diversity, this dissertation proposes several key ideas: (1) it performs static/dynamic analyses to characterize the distribution of performance overheads across diverse NLP models; (2) it proposes a holistic model-parallelization technique that optimizes the memory usage of the main bottlenecks identified by that analysis; (3) it proposes techniques that reduce the computation of several operations, together with a dynamic scheduler that resolves the skewness the reduction introduces; and (4) to cope with the performance diversity of current NLP models, it proposes a technique that derives a design optimized for each model. Because these techniques apply generically across hardware accelerators (e.g., CPU, GPU, FPGA, ASIC), they can be adopted broadly in computer-system design for NLP models.
λ³Έ λ…Όλ¬Έμ—μ„œλŠ” ν•΄λ‹Ή κΈ°μˆ λ“€μ„ μ μš©ν•˜μ—¬ CPU, GPU, FPGA 각각의 ν™˜κ²½μ—μ„œ, μ œμ‹œλœ κΈ°μˆ λ“€μ΄ λͺ¨λ‘ μœ μ˜λ―Έν•œ μ„±λŠ₯ν–₯상을 달성함을 보여쀀닀.1 INTRODUCTION 1 2 Background 6 2.1 Memory Networks 6 2.2 Deep Learning for NLP 9 3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks 14 3.1 Motivation & Design Goals 14 3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements 15 3.1.2 Performance Problems in MemNN - High Computation 16 3.1.3 Performance Problems in MemNN - Shared Cache Contention 17 3.1.4 Design Goals 18 3.2 MnnFast 19 3.2.1 Column-Based Algorithm 19 3.2.2 Zero Skipping 22 3.2.3 Embedding Cache 25 3.3 Implementation 26 3.3.1 General-Purpose Architecture - CPU 26 3.3.2 General-Purpose Architecture - GPU 28 3.3.3 Custom Hardware (FPGA) 29 3.4 Evaluation 31 3.4.1 Experimental Setup 31 3.4.2 CPU 33 3.4.3 GPU 35 3.4.4 FPGA 37 3.4.5 Comparison Between CPU and FPGA 39 3.5 Conclusion 39 4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models 40 4.1 Motivation & Design Goals 40 4.1.1 High Model Complexity 40 4.1.2 High Memory Bandwidth 41 4.1.3 Heavy Computation 42 4.1.4 Huge Performance Variation 43 4.1.5 Design Goals 43 4.2 NLP-Fast 44 4.2.1 Bottleneck Analysis of NLP Models 44 4.2.2 Holistic Model Partitioning 47 4.2.3 Cross-operation Zero Skipping 51 4.2.4 Adaptive Hardware Reconfiguration 54 4.3 NLP-Fast Toolkit 56 4.4 Implementation 59 4.4.1 General-Purpose Architecture - CPU 59 4.4.2 General-Purpose Architecture - GPU 61 4.4.3 Custom Hardware (FPGA) 62 4.5 Evaluation 64 4.5.1 Experimental Setup 65 4.5.2 CPU 65 4.5.3 GPU 67 4.5.4 FPGA 69 4.6 Conclusion 72 5 Related Work 73 5.1 Various DNN Accelerators 73 5.2 Various NLP Accelerators 74 5.3 Model Partitioning 75 5.4 Approximation 76 5.5 Improving Flexibility 78 5.6 Resource Optimization 78 6 Conclusion 80 Abstract (In Korean) 106Docto

    Pay One, Get Hundreds for Free: Reducing Cloud Costs through Shared Query Execution

    Cloud-based data analysis is now common practice because of its lower system-management overhead and its pay-as-you-go pricing model. The pricing model, however, is not always suitable for query processing, as heavy use results in high costs. For example, in query-as-a-service systems, where users are charged per processed byte, collections of queries that frequently access the same data can become expensive. The problem is compounded by the limited options users have to optimize query execution when using declarative interfaces such as SQL. In this paper, we show how, without modifying existing systems and without the involvement of the cloud provider, it is possible to significantly reduce the overhead, and hence the cost, of query-as-a-service systems. Our approach is based on query rewriting: multiple concurrent queries are combined into a single query. Our experiments show that the aggregate amount of work done by the shared execution is smaller than in a query-at-a-time approach. Since queries are charged per byte processed, the cost of executing a group of queries is often the same as executing a single one of them. As an example, we demonstrate how shared execution of the TPC-H benchmark is up to 100x cheaper in Amazon Athena and up to 16x cheaper in Google BigQuery than a query-at-a-time approach, while achieving higher throughput.
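    A minimal sketch of the rewrite (Python; the helper fuse_counts and its SQL shape are illustrative, and the paper's rewriter is far more general): N filtered aggregate queries over the same table are fused into one scan by pushing each query's WHERE predicate into a CASE expression, so a per-byte-charged provider bills a single pass over the data.

        def fuse_counts(table, predicates):
            """Fuse N filtered COUNT queries over `table` into one scan."""
            cols = ",\n  ".join(
                f"COUNT(CASE WHEN {p} THEN 1 END) AS q{i}"
                for i, p in enumerate(predicates)
            )
            return f"SELECT\n  {cols}\nFROM {table}"

        print(fuse_counts("lineitem", ["l_quantity > 30", "l_discount < 0.05"]))
        # SELECT
        #   COUNT(CASE WHEN l_quantity > 30 THEN 1 END) AS q0,
        #   COUNT(CASE WHEN l_discount < 0.05 THEN 1 END) AS q1
        # FROM lineitem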