
    Novel Architectures for Offloading and Accelerating Computations in Artificial Intelligence and Big Data

    With the end of Moore's Law and Dennard Scaling, performance gains in general-purpose architectures have slowed significantly in recent years. While increasing the number of cores has been a viable approach for further performance gains, Amdahl's Law and its implications for parallelization limit this path as well. Consequently, research has shifted toward different approaches, including domain-specific custom architectures tailored to specific workloads. This has led to a new golden age for computer architecture, as noted in the Turing Award Lecture by Hennessy and Patterson, and has spawned several new architectures and architectural advances targeted at today's most demanding workloads, including Machine Learning. This thesis introduces a hierarchy of architectural improvements ranging from minor incremental changes, such as High-Bandwidth Memory, to more complex architectural extensions that offload workloads from the general-purpose CPU to more specialized accelerators. Finally, we introduce novel architectural paradigms, namely Near-Data and In-Network Processing, as the most complex architectural improvements. This cumulative dissertation then investigates several architectural improvements to accelerate Sum-Product Networks, a novel Machine Learning approach from the class of Probabilistic Graphical Models. Furthermore, we use these improvements as case studies to discuss the impact of novel architectures, showing that both minor and major architectural changes can significantly increase performance in Machine Learning applications. In addition, this thesis presents recent work on Near-Data Processing, which introduces Smart Storage Devices as a novel architectural paradigm that is especially interesting in the context of Big Data. We discuss how Near-Data Processing can improve performance in different database settings by offloading database operations to smart storage devices. Offloading data-reductive operations, such as selections, reduces the amount of data transferred, thus improving performance and alleviating bandwidth-related bottlenecks. Using Near-Data Processing as a use case, we also discuss how Machine Learning approaches, like Sum-Product Networks, can improve novel architectures. Specifically, we introduce an approach for offloading Cardinality Estimation using Sum-Product Networks that could enable more intelligent decision-making in smart storage devices. Overall, we show that Machine Learning can benefit from novel architectures, and that Machine Learning can in turn be applied to improve the applications of those architectures.
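    To make the Sum-Product Network workload concrete, the following is a minimal, self-contained Python sketch of SPN inference; the toy structure, the weights, and the Bernoulli leaves are illustrative assumptions, not the networks studied in the thesis.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Leaf:
        # Univariate Bernoulli leaf over one binary variable (illustrative choice).
        var: int
        p_true: float
        def value(self, evidence):
            return self.p_true if evidence[self.var] else 1.0 - self.p_true

    @dataclass
    class Product:
        children: List[object]
        def value(self, evidence):
            v = 1.0
            for c in self.children:
                v *= c.value(evidence)   # product nodes multiply child values
            return v

    @dataclass
    class Sum:
        weights: List[float]             # mixture weights, assumed to sum to 1
        children: List[object]
        def value(self, evidence):
            return sum(w * c.value(evidence)
                       for w, c in zip(self.weights, self.children))

    # Toy SPN over two binary variables X0 and X1.
    spn = Sum([0.6, 0.4],
              [Product([Leaf(0, 0.9), Leaf(1, 0.2)]),
               Product([Leaf(0, 0.3), Leaf(1, 0.7)])])
    print(spn.value({0: True, 1: False}))  # joint probability of one assignment

    Evaluation is a single bottom-up pass of multiplications and weighted additions over a DAG, which is exactly the kind of regular arithmetic that maps well onto custom accelerators.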

    ์ด์ข… ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ์„ ์œ„ํ•œ ํ™•์žฅํ˜• ์ปดํ“จํ„ฐ ์‹œ์Šคํ…œ ์„ค๊ณ„

    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, College of Engineering, February 2021. Advisor: Jangwoo Kim. Modern neural-network (NN) accelerators have been successful by accelerating a small number of basic operations (e.g., convolution, fully-connected, feedback) comprising the specific target neural-network models (e.g., CNN, RNN). However, this approach no longer works for the emerging full-scale natural language processing (NLP)-based neural network models (e.g., Memory Networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only proposal-specific basic operations and/or customize them for specific models only, which leads to low performance improvement and narrow model coverage. Therefore, an ideal NLP accelerator should first identify all performance-critical operations required by different NLP models and support them in a single accelerator to achieve high model coverage, and it should adaptively optimize its architecture to achieve the best performance for the given model. To address these scalability and model/config-diversity issues, the dissertation introduces two novel projects (i.e., MnnFast and NLP-Fast) to efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems (i.e., high memory bandwidth, heavy computation, and cache contention) in memory-augmented neural networks. Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation caused by the model/config diversity of emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (i.e., CPU, GPU, FPGA) and thoroughly evaluate their performance improvement on each platform. As the importance of natural language processing grows, companies and research groups are proposing diverse and complex kinds of NLP models: these models are becoming more complex in structure, larger in scale, and more diverse in kind. To address this complexity, scalability, and diversity, this dissertation presents several key ideas: (1) static/dynamic analysis to identify how performance overheads are distributed across diverse NLP models; (2) a holistic model-parallelization technique that optimizes the memory usage of the main performance bottlenecks identified by that analysis; (3) techniques that reduce the computation of various operations, together with a dynamic scheduler that resolves the skewness introduced by the computation reduction; (4) a technique that derives a design optimized for each model to address the performance diversity of current NLP models. Because these core techniques apply generically to many kinds of hardware accelerators (e.g., CPU, GPU, FPGA, ASIC), they can be broadly applied to computer system design for natural language processing models. This dissertation applies these techniques on CPU, GPU, and FPGA platforms and shows that all of the proposed techniques achieve meaningful performance improvements in each environment.
    Contents:
    1 Introduction
    2 Background
      2.1 Memory Networks
      2.2 Deep Learning for NLP
    3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks
      3.1 Motivation & Design Goals
        3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements
        3.1.2 Performance Problems in MemNN - High Computation
        3.1.3 Performance Problems in MemNN - Shared Cache Contention
        3.1.4 Design Goals
      3.2 MnnFast
        3.2.1 Column-Based Algorithm
        3.2.2 Zero Skipping
        3.2.3 Embedding Cache
      3.3 Implementation
        3.3.1 General-Purpose Architecture - CPU
        3.3.2 General-Purpose Architecture - GPU
        3.3.3 Custom Hardware (FPGA)
      3.4 Evaluation
        3.4.1 Experimental Setup
        3.4.2 CPU
        3.4.3 GPU
        3.4.4 FPGA
        3.4.5 Comparison Between CPU and FPGA
      3.5 Conclusion
    4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models
      4.1 Motivation & Design Goals
        4.1.1 High Model Complexity
        4.1.2 High Memory Bandwidth
        4.1.3 Heavy Computation
        4.1.4 Huge Performance Variation
        4.1.5 Design Goals
      4.2 NLP-Fast
        4.2.1 Bottleneck Analysis of NLP Models
        4.2.2 Holistic Model Partitioning
        4.2.3 Cross-operation Zero Skipping
        4.2.4 Adaptive Hardware Reconfiguration
      4.3 NLP-Fast Toolkit
      4.4 Implementation
        4.4.1 General-Purpose Architecture - CPU
        4.4.2 General-Purpose Architecture - GPU
        4.4.3 Custom Hardware (FPGA)
      4.5 Evaluation
        4.5.1 Experimental Setup
        4.5.2 CPU
        4.5.3 GPU
        4.5.4 FPGA
      4.6 Conclusion
    5 Related Work
      5.1 Various DNN Accelerators
      5.2 Various NLP Accelerators
      5.3 Model Partitioning
      5.4 Approximation
      5.5 Improving Flexibility
      5.6 Resource Optimization
    6 Conclusion
    Abstract (In Korean)

    Data-intensive Systems on Modern Hardware : Leveraging Near-Data Processing to Counter the Growth of Data

    Over the last decades, our society has undergone a tremendous shift toward using information technology in almost every daily routine, entailing an incredible growth of the data collected day by day in Web, IoT, and AI applications. At the same time, magneto-mechanical HDDs are being replaced by semiconductor storage such as SSDs, equipped with modern Non-Volatile Memories, like Flash, which yield significantly faster access latencies and higher levels of parallelism. Likewise, the execution speed of processing units has increased considerably, as server architectures nowadays comprise up to multiple hundreds of independently working CPU cores along with a variety of specialized computing co-processors such as GPUs or FPGAs. However, the burden of moving the continuously growing data to the best-fitting processing unit is inherently linked to today's computer architecture, which is based on the data-to-code paradigm. In light of Amdahl's Law, this leads to the conclusion that even with today's powerful processing units, the speedup of systems is limited, since the fraction of parallel work is largely I/O-bound. Therefore, throughout this cumulative dissertation, we investigate the paradigm shift toward code-to-data, formally known as Near-Data Processing (NDP), which relieves contention on the I/O bus by offloading processing to intelligent computational storage devices, where the data is originally located. Firstly, we identify Native Storage Management as the essential foundation for NDP due to its direct control of physical storage management within the database. Upon this, the interface is extended to propagate address-mapping information and to invoke NDP functionality on the storage device. As the former can become very large, we introduce Physical Page Pointers as one novel NDP abstraction for self-contained, immutable database objects. Secondly, the on-device navigation and interpretation of data are elaborated. To this end, we introduce cross-layer Parsers and Accessors as another NDP abstraction that can be executed on the heterogeneous processing capabilities of modern computational storage devices. Thereby, the compute placement and resource configuration per NDP request are identified as a major performance criterion. Our experimental evaluation shows an improvement in execution durations of 1.4x to 2.7x compared to traditional systems. Moreover, we propose a framework for the automatic generation of Parsers and Accessors on FPGAs to ease their application in NDP. Thirdly, we investigate the interplay of NDP and modern workload characteristics like HTAP. To this end, we present different offloading models and focus on an intervention-free execution. By propagating the Shared State with the latest modifications of the database to the computational storage device, the device is able to process data with transactional guarantees. Thus, we extend the design space of HTAP with NDP by providing a solution that optimizes for performance isolation, data freshness, and the reduction of data transfers. In contrast to traditional systems, we experience no significant drop in performance when an OLAP query is invoked, but rather a steady throughput that is 30% faster. Lastly, in-situ result-set management and consumption as well as NDP pipelines are proposed to achieve flexibility in processing data on heterogeneous hardware. As those produce final and intermediary results, we continue by investigating their management and find that on-device materialization comes at a low cost but enables novel consumption modes and reuse semantics. Thereby, we achieve significant performance improvements of up to 400x by reusing once-materialized results multiple times.
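    To illustrate the selection pushdown that such NDP systems perform, here is a minimal Python sketch contrasting the data-to-code and code-to-data paths; the SmartSSD class and its scan method are hypothetical stand-ins for a real computational storage interface.

    class SmartSSD:
        """Toy model of a computational storage device holding raw tuples."""
        def __init__(self, tuples):
            self._tuples = tuples      # data resides on the device

        def read_all(self):
            # Traditional data-to-code path: every tuple crosses the I/O bus.
            return list(self._tuples)

        def scan(self, predicate):
            # NDP code-to-data path: the predicate runs on the device, and
            # only qualifying tuples are transferred to the host.
            return [t for t in self._tuples if predicate(t)]

    device = SmartSSD([(i, i % 100) for i in range(1_000_000)])

    # Host-side filtering moves all 1,000,000 tuples over the bus ...
    host_result = [t for t in device.read_all() if t[1] == 42]
    # ... while the pushed-down selection moves only the ~10,000 matches.
    ndp_result = device.scan(lambda t: t[1] == 42)
    assert host_result == ndp_result

    The data-reductive selection shrinks the transfer volume by roughly the predicate's selectivity, which is exactly the bandwidth relief the dissertation targets.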

    Quantification and segmentation of breast cancer diagnosis: efficient hardware accelerator approach

    In mammography images, the eccentric area corresponds to the breast-density percentage measurement. The technical challenge of quantification in radiology leads to misinterpretation during screening. Feedback from society, institutions, and industry shows that quantification and segmentation frameworks have rapidly become the primary methodologies for structuring and interpreting digital mammogram images. Segmentation clustering algorithms suffer setbacks with overlapping clusters, proportion, and multidimensional scaling when mapping and leveraging the data. In combination, mammogram quantification constitutes a long-standing focus area. The proposed algorithm must reduce complexity, target the iteratively distributed data points, and merge cluster-centroid updates into a single updating process to avoid large storage requirements. The initial test segment of the mammogram database is critical for evaluating performance and determining the Area Under the Curve (AUC) in alignment with medical policy. In addition, a new image clustering algorithm anticipates the need for large-scale serial and parallel processing. Since no off-the-shelf solution exists on the market, communication protocols between devices must be implemented. Exploiting and targeting hardware-task utilization will further extend the prospects for improving the clustering, and benchmarking the devices' resources and performance is required. Finally, the medically relevant clusters were objectively validated using qualitative and quantitative inspection. The proposed method should overcome the technical challenges that radiologists face.
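    As a rough sketch of the single-pass, low-storage centroid updating the abstract alludes to, the following Python code performs an online (streaming) k-means update; the online k-means formulation is a generic stand-in under stated assumptions, not the thesis's proposed algorithm.

    import numpy as np

    def online_kmeans(points, k, seed=0):
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), k, replace=False)].astype(float)
        counts = np.zeros(k)
        for p in points:                  # one streaming pass, O(k) extra memory
            j = np.argmin(((centroids - p) ** 2).sum(axis=1))
            counts[j] += 1
            centroids[j] += (p - centroids[j]) / counts[j]   # incremental mean
        return centroids

    pixels = np.random.rand(10_000, 1)    # e.g., normalized pixel intensities
    print(online_kmeans(pixels, k=3))

    Merging assignment and centroid update into one incremental step means no per-point membership needs to be stored, which is what makes such clustering amenable to hardware acceleration.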

    Hardware Acceleration for Unstructured Big Data and Natural Language Processing.

    The confluence of the rapid growth in electronic data in recent years and the renewed interest in domain-specific hardware accelerators presents exciting technical opportunities. Traditional scale-out solutions for processing the vast amounts of text data have been shown to be energy- and cost-inefficient. In contrast, custom hardware accelerators can provide higher throughputs, lower latencies, and significant energy savings. In this thesis, I present a set of hardware accelerators for unstructured big-data processing and natural language processing. The first accelerator, called HAWK, aims to speed up the processing of ad hoc queries against large in-memory logs. HAWK is motivated by the observation that traditional software-based tools for processing large text corpora use memory bandwidth inefficiently due to software overheads and thus fall far short of the peak scan rates possible on modern memory systems. HAWK is designed to process data at a constant rate of 32 GB/s, faster than most extant memory systems. I demonstrate that HAWK outperforms state-of-the-art software solutions for text processing, almost by an order of magnitude in many cases. HAWK occupies an area of 45 sq-mm in its Pareto-optimal configuration and consumes 22 W of power, well within the area and power envelopes of modern CPU chips. The second accelerator I propose aims to speed up similarity-measurement calculations for semantic search in the natural language processing space. By leveraging the latency-hiding concepts of multi-threading and simple scheduling mechanisms, my design maximizes functional-unit utilization. This similarity-measurement accelerator provides speedups of 36x-42x over optimized software running on server-class cores, while requiring 56x-58x lower energy and only 1.3% of the area.
    PhD. Computer Science and Engineering. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/116712/1/prateekt_1.pd
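    For context, the similarity-measurement kernel such an accelerator targets can be sketched in a few lines of Python; the cosine-similarity formulation over an embedding matrix is a common generic choice assumed here, not necessarily the thesis's exact kernel.

    import numpy as np

    def cosine_similarities(query, docs):
        # docs: (N, D) matrix of document vectors; query: (D,) vector.
        dots = docs @ query
        norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
        return dots / norms

    docs = np.random.rand(10_000, 128)
    query = np.random.rand(128)
    top10 = np.argsort(cosine_similarities(query, docs))[-10:][::-1]
    print(top10)   # indices of the most similar documents

    The kernel is dominated by independent dot products, so interleaving many of them hides memory latency and keeps the functional units busy, which is the utilization argument the abstract makes.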

    Improving Programming Support for Hardware Accelerators Through Automata Processing Abstractions

    The adoption of hardware accelerators, such as Field-Programmable Gate Arrays, into general-purpose computation pipelines continues to rise, driven by recent trends in data collection and analysis as well as pressure from challenging physical design constraints in hardware. The architectural designs of many of these accelerators stand in stark contrast to the traditional von Neumann model of CPUs. Consequently, existing programming languages, maintenance tools, and techniques are not directly applicable to these devices, meaning that additional architectural knowledge is required for effective programming and configuration. Current programming models and techniques are akin to assembly-level programming on a CPU, thus placing a significant burden on developers tasked with using these architectures. Because programming is currently performed at such low levels of abstraction, the software development process is tedious and challenging, and this hinders the adoption of hardware accelerators. This dissertation explores the thesis that theoretical finite automata provide a suitable abstraction for bridging the gap between the high-level programming models and maintenance tools familiar to developers and the low-level hardware representations that enable high-performance execution on hardware accelerators. We adopt a principled hardware/software co-design methodology to develop a programming model providing the key properties that we observe are necessary for success, namely performance and scalability, ease of use, expressive power, and legacy support. First, we develop a framework that allows developers to port existing, legacy code to run on hardware accelerators by leveraging automata learning algorithms in a novel composition with software verification, string solvers, and high-performance automata architectures. Next, we design a domain-specific programming language to aid programmers writing pattern-searching algorithms and develop compilation algorithms to produce finite automata, which support efficient execution on a wide variety of processing architectures. Then, we develop an interactive debugger for our new language, which allows developers to accurately identify the locations of bugs in software while maintaining support for high-throughput data processing. Finally, we develop two new automata-derived accelerator architectures to support additional applications, including the detection of security attacks and the parsing of recursive and tree-structured data. Using empirical studies, logical reasoning, and statistical analyses, we demonstrate that our prototype artifacts scale to real-world applications, maintain manageable overheads, and support developers' use of hardware accelerators. Collectively, the research efforts detailed in this dissertation help ease the adoption and use of hardware accelerators for data analysis applications, while supporting high-performance computation.
    PhD. Computer Science & Engineering. University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/155224/1/angstadt_1.pd
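    To ground the automata abstraction, here is a minimal Python sketch of the subset-style NFA simulation that automata processing engines execute natively in hardware; the transition function and the example pattern are illustrative assumptions.

    def run_nfa(delta, start, accepting, text):
        # Subset simulation: track the set of states the NFA could be in.
        states = {start}
        for ch in text:
            states = set().union(*(delta(s, ch) for s in states))
        return bool(states & accepting)

    # NFA recognizing strings containing the substring "ab": state 0 loops
    # on any symbol and guesses the match start on 'a'; state 2 accepts.
    def delta(s, ch):
        if s == 0:
            return {0, 1} if ch == 'a' else {0}
        if s == 1:
            return {2} if ch == 'b' else set()
        return {2} if s == 2 else set()

    print(run_nfa(delta, start=0, accepting={2}, text="xxabyy"))  # True

    On an automata accelerator, all active states advance in parallel on each input symbol, which is why the model delivers high-throughput pattern matching while remaining a familiar, analyzable abstraction for developers.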