457 research outputs found

    Energy Aware Runtime Systems for Elastic Stream Processing Platforms

    Following steady growth in the computational performance required of processors, the multicore revolution started around 20 years ago. This revolution was mainly an answer to power dissipation constraints that restricted further increases of clock frequency in single-core processors. The multicore revolution brought in not only the challenge of parallel programming, i.e. developing software that exploits the full capabilities of manycore architectures, but also the challenge of programming heterogeneous platforms. The question “on which processing element should a specific computational unit be mapped?” is well known in the embedded community. With the introduction of general-purpose graphics processing units (GPGPUs) and digital signal processors (DSPs) alongside many-core processors on different system-on-chip platforms, heterogeneous parallel platforms are nowadays widespread across several domains, from consumer devices to media processing platforms for telecom operators. Finding a mapping together with a suitable hardware architecture is a process called design-space exploration. This process is very challenging in heterogeneous many-core architectures, which promise benefits in terms of energy efficiency. The main problem is the exponential growth of the design space to be explored. With the recent trend of increasing levels of heterogeneity on the chip, selecting the parameters to take into account when mapping software to hardware is still an open research topic in the embedded area. For example, the current Linux scheduler performs poorly when mapping tasks to the computing elements available in hardware. The only metric it considers is CPU workload, which, as recent work has shown, does not match the true performance demands of the applications. Relying on this metric may produce an incorrect allocation of resources, resulting in a waste of energy.
This research originates from the observation that these approaches do not fully support the dynamic behavior of stream processing applications, especially when this behavior is established only at runtime. This research will contribute to the general goal of developing energy-efficient solutions for designing streaming applications on heterogeneous and parallel hardware platforms. Streaming applications are nowadays widespread in the software domain. Their distinctive characteristic is the retrieval of multiple streams of data and the need to process them in real time. The proposed work will develop new approaches to address the challenging problem of efficient runtime coordination of dynamic applications, focusing on energy and performance management.

    Domain Computing: The Next Generation of Computing

    Computers are indispensable in our daily lives. The first generation of computing started the era of human automation computing. These machines' computational resources, however, were completely centralized in local machines. With the appearance of networks, the second generation of computing significantly improved data availability and portability so that computing resources could be efficiently shared across networks. The service-oriented third generation of computing provided functionality by breaking down applications into services, on-demand computing through utility and cloud infrastructures, and ubiquitous access from geographically widespread networks. Services, as primary computing resources, are spread from local to worldwide. These services loosely couple applications and servers, which allows services to scale up easily with higher availability. Locating, utilizing, and optimizing computational resources becomes even more challenging as these resources become more available, fault-tolerant, scalable, better performing, and spatially distributed. The critical question becomes how applications can dynamically utilize and optimize unique, duplicate, or competing resources at runtime in the most efficient and effective way without code changes, while providing highly available, scalable, secure, and easy-to-develop services. Domain computing proposes a new way to manage computational resources and applications. Domain computing dynamically manages resources within logical entities called domains, without being bound to physical machines, so that application functionality can be extended at runtime. Moreover, domain computing introduces domains as a replacement for the traditional computer in order to run applications and to link computational resources distributed over networks into domains, so that a user can greatly improve and optimize resource utilization at a global level.
By negotiating with different layers, domain computing dynamically links different resources, shares resources, and cooperates with other domains at runtime, so applications can adapt more quickly to dynamically changing environments and gain better performance. Domain computing also presents a new way to develop applications that are resource-stateless. In this work, a prototype system was built and the performance of its various aspects was examined, including network throughput, response time, variance, resource publishing and subscription, and secured communications.

    Model-Based Code Generation Framework for Parallel and Distributed Embedded Systems

    Doctoral thesis, Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2020. Soonhoi Ha (하순회).
While various software development methodologies have been proposed to increase the design productivity and maintainability of software, they usually focus on the development of application software running on a single processing element, without concern for the non-functional requirements of an embedded system such as latency and resource requirements. In this thesis, we present a model-based software development method for parallel and distributed embedded systems. An application is specified as a set of tasks that follow a set of given rules for communication and synchronization in a hierarchical fashion, independently of the hardware platform. Having such rules enables static analysis that detects some software errors at compile time, which reduces the verification difficulty. A platform-specific program is synthesized automatically once the mapping of tasks onto processing elements is determined. We also propose the program synthesizer, which generates code that satisfies the platform requirements of parallel and distributed embedded systems. Because multiple formal models that express dynamic behavior can be composed hierarchically, the synthesizer manages multiple task graphs at different levels of the hierarchy so that tasks run in parallel. The synthesizer also shows how to manage code for heterogeneous platforms and how to generate code for various communication methods. The viability of the proposed software development method is verified with a real-life surveillance application that runs on six processing elements with three remote communication methods, and with a remote deep learning example that uses heterogeneous multiprocessing components on distributed systems. Measured and estimated development costs also show that supporting a new platform or network requires relatively little effort. Since tolerance to unexpected errors is a required feature of many embedded systems, we also support automatic fault-tolerant code generation.
Fault tolerance is applied by modifying the task graph according to the selected fault tolerance configuration, so an application developer can easily adopt the non-functional requirement of fault tolerance. To compare the effort of supporting fault tolerance, we also implemented fault tolerance manually. The fault tolerance method is tested with a fault injection tool that emulates fault scenarios and injects faults randomly. This fault injection tool, which was used to test our fault-tolerance method, is another contribution of this thesis. Emulating fault scenarios by intentionally injecting faults is commonly used to test and verify the robustness of a system. To emulate faults on an embedded system, we present a run-time fault injection framework that can inject reproducible faults into both the kernel layer and the application layer of Linux-based systems. For injecting faults into the kernel layer, two complementary fault injection techniques are used. One is based on the Kernel GNU Debugger, and the other uses a hardware breakpoint supported by the ARM architecture. For application-level fault injection, a GDB-based fault injection method is used to inject faults into an application on the same system or on a remote system.
The viability of the proposed fault injection tool is demonstrated by real-life experiments with an ODROID-XU4 system.
Contents:
Chapter 1 Introduction: 1.1 Motivation; 1.2 Contribution; 1.3 Dissertation Organization
Chapter 2 Background: 2.1 HOPES: Hope of Parallel Embedded Software (2.1.1 Software Development Procedure; 2.1.2 Components of HOPES); 2.2 Universal Execution Model (2.2.1 Task Graph Specification; 2.2.2 Dataflow specification of an Application; 2.2.3 Task Code Specification and Generic APIs; 2.2.4 Meta-data Specification)
Chapter 3 Program Synthesis for Parallel and Distributed Embedded Systems: 3.1 Motivational Example; 3.2 Program Synthesis Overview; 3.3 Program Synthesis from Hierarchically-mixed Models; 3.4 Platform Code Synthesis; 3.5 Communication Code Synthesis; 3.6 Experiments (3.6.1 Development Cost of Supporting New Platforms and Networks; 3.6.2 Program Synthesis for the Surveillance System Example; 3.6.3 Remote GPU-accelerated Deep Learning Example); 3.7 Document Generation; 3.8 Related Works
Chapter 4 Model Transformation for Fault-tolerant Code Synthesis: 4.1 Fault-tolerant Code Synthesis Techniques; 4.2 Applying Fault Tolerance Techniques in HOPES; 4.3 Experiments (4.3.1 Development Cost of Applying Fault Tolerance; 4.3.2 Fault Tolerance Experiments); 4.4 Random Fault Injection Experiments; 4.5 Related Works
Chapter 5 Fault Injection Framework for Linux-based Embedded Systems: 5.1 Background (5.1.1 Fault Injection Techniques; 5.1.2 Kernel GNU Debugger; 5.1.3 ARM Hardware Breakpoint); 5.2 Fault Injection Framework (5.2.1 Overview; 5.2.2 Architecture; 5.2.3 Fault Injection Techniques; 5.2.4 Implementation); 5.3 Experiments (5.3.1 Experiment Setup; 5.3.2 Performance Comparison of Two Fault Injection Methods; 5.3.3 Bit-flip Fault Experiments; 5.3.4 eMMC Controller Fault Experiments)
Chapter 6 Conclusion
Bibliography
Abstract (in Korean)
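The thesis outline above lists bit-flip fault experiments. As a rough illustration of the bit-flip fault model only, the following Python sketch inverts one bit of a fixed-width integer. The function name `flip_bit` and the 32-bit default width are assumptions made for this example; the sketch does not reproduce the thesis's KGDB, hardware-breakpoint, or GDB-based injection mechanisms.

```python
def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with bit number `bit` inverted, truncated to `width` bits.

    Hypothetical helper: it models a single-bit upset in a register or
    memory word, not the injection mechanism itself.
    """
    if not 0 <= bit < width:
        raise ValueError("bit index outside the word width")
    mask = (1 << width) - 1          # keep the result inside the word
    return (value ^ (1 << bit)) & mask


if __name__ == "__main__":
    word = 0b0101
    faulty = flip_bit(word, 2)       # invert bit 2 of the word
    print(bin(word), "->", bin(faulty))
```

Flipping the same bit twice restores the original value, which makes a single injected fault easy to undo when an experiment has to be repeated.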

    Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

    In the modern era of technology, a paradigm shift has been witnessed in areas involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). Specifically, Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI applications such as computer vision, image and video processing, and robotics. Given mature digital technologies and the availability of authentic data and data handling infrastructure, DNNs have become a credible choice for solving more complex real-life problems. In certain situations, the performance and accuracy of a DNN are far better than those of human intelligence. However, DNNs are computationally demanding in terms of the resources and time needed to handle their computations. Furthermore, general-purpose architectures like CPUs have difficulty handling such computationally intensive algorithms. Therefore, the research community has invested considerable interest and effort in specialized hardware architectures such as the Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), and Coarse Grained Reconfigurable Array (CGRA) for the effective implementation of computationally intensive algorithms. This paper surveys the research carried out on the development and deployment of DNNs using the aforementioned specialized hardware architectures and embedded AI accelerators. The review describes in detail the specialized hardware-based accelerators used in the training and/or inference of DNNs. A comparative study of the accelerators discussed is also made, based on factors like power, area, and throughput. Finally, future research and development directions are discussed, such as future trends in DNN implementation on specialized hardware accelerators.
This review article is intended to serve as a guide for hardware architectures for accelerating and improving the effectiveness of deep learning research.

    Heterogeneity-aware scheduling and data partitioning for system performance acceleration

    Over the past decade, heterogeneous processors and accelerators have become increasingly prevalent in modern computing systems. Compared with previous homogeneous parallel machines, the hardware heterogeneity in modern systems provides new opportunities and challenges for performance acceleration. Classic operating systems optimisation problems such as task scheduling, and application-specific optimisation techniques such as the adaptive data partitioning of parallel algorithms, are both required to work together to address hardware heterogeneity. Significant effort has been invested in this problem, but existing work either focuses on a specific type of heterogeneous system or algorithm, or offers a high-level framework without insight into how heterogeneity differs between types of system. A general software framework is required, which can not only be adapted to multiple types of systems and workloads, but is also equipped with the techniques to address a variety of hardware heterogeneity. This thesis presents approaches to design general heterogeneity-aware software frameworks for system performance acceleration. It covers a wide variety of systems, including an OS scheduler targeting on-chip asymmetric multi-core processors (AMPs) on mobile devices, a hierarchical many-core supercomputer, and multi-FPGA systems for high performance computing (HPC) centers. Considering the factors arising from on-chip AMP heterogeneity, such as thread criticality, core sensitivity, and relative fairness, it suggests a collaborative approach that co-designs the task selector and core allocator of the OS scheduler. Considering the typical sources of heterogeneity in HPC systems, such as the memory hierarchy, bandwidth limitations, and asymmetric physical connections, it proposes an application-specific automatic data partitioning method for a modern supercomputer, and a schedule based on a topological-ranking heuristic for a multi-FPGA based reconfigurable cluster.
Experiments on both a full system simulator (GEM5) and real systems (Sunway TaihuLight Supercomputer and Xilinx Multi-FPGA based clusters) demonstrate the significant advantages of the suggested approaches compared against the state-of-the-art on a variety of workloads.
Acknowledgement: "This work is supported by St Leonards 7th Century Scholarship and Computer Science PhD funding from University of St Andrews; by UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1)."

    DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs

    The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet-of-Things is a critical enabler to support pervasive Deep Learning-enhanced applications. Low-Cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads to reduce area overheads and increase energy efficiency, which requires explicit DMA-based memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive topology-dependent tiling and double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY), an automatic tool to deploy DNNs on low cost MCUs with typically less than 1 MB of on-chip SRAM memory. DORY abstracts tiling as a Constraint Programming (CP) problem: it maximizes L1 memory utilization under the topological constraints imposed by each DNN layer. Then, it generates ANSI C code to orchestrate off- and on-chip transfers and computation phases. Furthermore, to maximize speed, DORY augments the CP formulation with heuristics promoting performance-effective tile sizes. As a case study for DORY, we target GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low power MCU-class devices on the market. On this device, DORY achieves up to 2.5x better MAC/cycle than the GreenWaves proprietary software solution and 18.1x better than the state-of-the-art result on an STM32-F746 MCU on single layers. Using our tool, GAP-8 can perform end-to-end inference of a 1.0-MobileNet-128 network consuming just 63 pJ/MAC on average at 4.3 fps, 15.4x better than an STM32-F746. We release all our developments (the DORY framework, the optimized backend kernels, and the related heuristics) as open-source software.
Comment: 14 pages, 12 figures, 4 tables, 2 listings. Accepted for publication in IEEE Transactions on Computers, https://ieeexplore.ieee.org/document/9381618
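The idea of choosing tile sizes that maximize L1 utilization under a memory budget can be sketched in a few lines of Python. This is not DORY's formulation or code: it swaps the Constraint Programming solver for plain exhaustive search, and the cost model (int8 data, stride-1 convolution without padding, tiling only over output channels and output rows, every buffer doubled for double-buffering) and the names `tile_bytes` and `best_tile` are assumptions made for illustration.

```python
from itertools import product


def tile_bytes(c_in, c_out_t, h_t, w, k):
    """L1 bytes for one tile of a k x k convolution (assumed int8, stride 1, no padding).

    c_out_t and h_t are the tiled output channels and output rows;
    w is the full output width.
    """
    in_bytes = c_in * (h_t + k - 1) * (w + k - 1)   # input rows/cols feeding the tile
    weight_bytes = c_out_t * c_in * k * k           # filters for the tiled channels
    out_bytes = c_out_t * h_t * w                   # output tile
    return in_bytes + weight_bytes + out_bytes


def best_tile(c_in, c_out, h, w, k, l1_budget):
    """Exhaustively pick (c_out_t, h_t) maximizing L1 use within the budget.

    Returns (bytes_used, c_out_t, h_t), or None if no tile fits.
    """
    best = None
    for c_out_t, h_t in product(range(1, c_out + 1), range(1, h + 1)):
        used = 2 * tile_bytes(c_in, c_out_t, h_t, w, k)  # x2: double-buffering
        if used <= l1_budget and (best is None or used > best[0]):
            best = (used, c_out_t, h_t)
    return best


if __name__ == "__main__":
    # Hypothetical layer: 32 -> 64 channels, 16x16 output, 3x3 kernel, 64 KiB of L1.
    print(best_tile(c_in=32, c_out=64, h=16, w=16, k=3, l1_budget=64 * 1024))
```

Exhaustive search is only workable here because the sketch has two small integer variables; a real tiling problem has more dimensions and constraints.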