338 research outputs found

    A Model-Based Code Generation Framework for Parallel and Distributed Embedded Systems

    Get PDF
    Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Computer Science and Engineering, February 2020. Advisor: Soonhoi Ha.
Although various software development methodologies have been proposed to improve software design productivity and maintainability, most of this work focuses on running application software on a single processor. Moreover, it does not consider the non-functional requirements of embedded systems, such as latency and resource requirements, so general software development methodologies are not well suited to developing embedded software. This thesis introduces a development methodology that expresses software targeting parallel and distributed embedded systems as a model and uses that model for software analysis and development. In our model, an application consists of multiple tasks that can be expressed hierarchically and is specified independently of the hardware platform. Communication and synchronization between tasks follow rules defined by the model, and these rules make it possible to detect software errors through static analysis before the program is actually executed, which helps reduce the verification complexity of the application. A program that runs on a designated hardware platform can be synthesized automatically after the tasks are mapped onto processors. This thesis proposes the program synthesizer used in this model-based development methodology; based on the specified platform requirements, it generates code that runs on parallel and distributed embedded systems. The dynamic behavior of an application is expressed by composing multiple formal models hierarchically, and the synthesizer can execute tasks from this hierarchical model while taking parallelism into account. We also show how the program synthesizer manages code so that various platforms and networks can be supported. The applicability of the proposed development methodology is evaluated with a real-life surveillance software system example consisting of six hardware platforms and three kinds of networks, and with a remote deep learning example that uses heterogeneous multiprocessors. In addition, the development cost required for the program synthesizer to support a new platform or network is measured and estimated, confirming that a new platform can be supported with relatively little effort. Since many embedded systems must tolerate unexpected hardware errors, we also studied automatic generation of fault tolerance code. The technique modifies the task graph according to the selected fault tolerance configuration, so that application developers can easily apply the non-functional requirement of fault tolerance. We compared this support against a manual implementation of fault tolerance and performed experiments that reproduce fault scenarios or inject faults at random using a fault injection tool. Finally, the fault injection tool used in the fault tolerance experiments is another contribution of this thesis: a tool that injects faults into the application and kernel layers of Linux-based systems. Reproducing fault scenarios by injecting faults is a widely used way to verify the robustness of a system, and the tool developed here can inject reproducible faults while the system is running. For kernel-level fault injection it provides two methods, one based on the Kernel GNU Debugger and the other on ARM hardware breakpoints; for application-level fault injection, a GDB-based method can inject faults into an application on the same or a remote system. Experiments with the fault injection tool were conducted on an ODROID-XU4 board.
While various software development methodologies have been proposed to increase the design productivity and maintainability of software, they usually focus on the development of application software running on a single processing element, without concern for the non-functional requirements of an embedded system such as latency and resource requirements. In this thesis, we present a model-based software development method for parallel and distributed embedded systems. An application is specified, independently of the hardware platform, as a hierarchical set of tasks that follow given rules for communication and synchronization. Having such rules enables us to perform static analysis that checks for certain software errors at compile time, reducing the verification difficulty. A platform-specific program is synthesized automatically after the mapping of tasks onto processing elements is determined. We also propose the program synthesizer, which generates code that satisfies the platform requirements of parallel and distributed embedded systems. Since multiple models expressing dynamic behavior can be composed hierarchically, the synthesizer manages multiple task graphs with different hierarchies and runs tasks in parallel. The synthesizer also demonstrates how code for heterogeneous platforms is managed and how various communication methods are generated. The viability of the proposed software development method is verified with a real-life surveillance application that runs on six processing elements connected by three remote communication methods, and with a remote deep learning example that uses heterogeneous multiprocessing components on distributed systems. Measurement and estimation of the development cost further show that supporting a new platform or network requires only a small effort. Since tolerance to unexpected errors is a required feature of many embedded systems, we also support automatic fault-tolerant code generation. Fault tolerance is applied by modifying the task graph according to the selected fault-tolerance configuration, so this non-functional requirement can easily be adopted by an application developer. To compare the effort of supporting fault tolerance, a manual implementation is also carried out, and the fault-tolerance method is tested with a fault injection tool that emulates fault scenarios and injects faults randomly.
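The communication and synchronization rules described above are what make compile-time checking possible. The following is a minimal sketch of that kind of static check, assuming an SDF-like model in which every channel declares fixed token production and consumption rates; the task names and rates are hypothetical examples, not taken from the thesis.

```python
# Minimal sketch of a compile-time consistency check for a dataflow task graph.
# Assumes an SDF-like rule: each channel has fixed production/consumption rates,
# so a repetition vector must satisfy prod * reps[src] == cons * reps[dst].
# Task names and rates are made-up examples, not from the thesis.
import math
from fractions import Fraction

# (source task, destination task, tokens produced per firing, tokens consumed per firing)
channels = [
    ("camera", "detect", 1, 1),
    ("detect", "encode", 2, 1),
    ("encode", "send", 1, 4),
]

def repetition_vector(channels):
    """Return the number of firings per task if the rates are consistent, else None.

    Single-pass sketch: assumes the channel list grows the graph connectedly.
    """
    reps = {}
    for src, dst, prod, cons in channels:
        if src not in reps and dst not in reps:
            reps[src] = Fraction(1)              # start a new connected component
        if src in reps and dst not in reps:
            reps[dst] = reps[src] * prod / cons
        elif dst in reps and src not in reps:
            reps[src] = reps[dst] * cons / prod
        elif reps[src] * prod != reps[dst] * cons:
            return None                          # inconsistent rates: a buffer would grow or starve
    scale = math.lcm(*(r.denominator for r in reps.values()))
    return {task: int(r * scale) for task, r in reps.items()}

print(repetition_vector(channels))  # {'camera': 2, 'detect': 2, 'encode': 4, 'send': 1}
```

A rate inconsistency detected here corresponds to the kind of software error that such a model lets the tool reject before any platform-specific code is generated.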
Our fault injection tool, which was used to test our fault-tolerance method, is another contribution of this thesis. Emulating fault scenarios by intentionally injecting faults is a common way to test and verify the robustness of a system. To emulate faults on an embedded system, we present a run-time fault injection framework that can inject faults into both the kernel and application layers of Linux-based systems. For injecting faults at the kernel layer, two complementary techniques are used: one is based on the Kernel GNU Debugger, and the other uses hardware breakpoints supported by the ARM architecture. For application-level fault injection, a GDB-based method is used to inject faults into a remote application. The viability of the proposed fault injection tool is demonstrated by real-life experiments on an ODROID-XU4 system.
Table of Contents
Chapter 1 Introduction 1
  1.1 Motivation 1
  1.2 Contribution 6
  1.3 Dissertation Organization 8
Chapter 2 Background 9
  2.1 HOPES: Hope of Parallel Embedded Software 9
    2.1.1 Software Development Procedure 9
    2.1.2 Components of HOPES 12
  2.2 Universal Execution Model 13
    2.2.1 Task Graph Specification 13
    2.2.2 Dataflow Specification of an Application 15
    2.2.3 Task Code Specification and Generic APIs 21
    2.2.4 Meta-data Specification 23
Chapter 3 Program Synthesis for Parallel and Distributed Embedded Systems 24
  3.1 Motivational Example 24
  3.2 Program Synthesis Overview 26
  3.3 Program Synthesis from Hierarchically-mixed Models 30
  3.4 Platform Code Synthesis 33
  3.5 Communication Code Synthesis 36
  3.6 Experiments 40
    3.6.1 Development Cost of Supporting New Platforms and Networks 40
    3.6.2 Program Synthesis for the Surveillance System Example 44
    3.6.3 Remote GPU-accelerated Deep Learning Example 46
  3.7 Document Generation 48
  3.8 Related Works 49
Chapter 4 Model Transformation for Fault-tolerant Code Synthesis 56
  4.1 Fault-tolerant Code Synthesis Techniques 56
  4.2 Applying Fault Tolerance Techniques in HOPES 61
  4.3 Experiments 62
    4.3.1 Development Cost of Applying Fault Tolerance 62
    4.3.2 Fault Tolerance Experiments 62
  4.4 Random Fault Injection Experiments 65
  4.5 Related Works 68
Chapter 5 Fault Injection Framework for Linux-based Embedded Systems 70
  5.1 Background 70
    5.1.1 Fault Injection Techniques 70
    5.1.2 Kernel GNU Debugger 71
    5.1.3 ARM Hardware Breakpoint 72
  5.2 Fault Injection Framework 74
    5.2.1 Overview 74
    5.2.2 Architecture 75
    5.2.3 Fault Injection Techniques 79
    5.2.4 Implementation 83
  5.3 Experiments 90
    5.3.1 Experiment Setup 90
    5.3.2 Performance Comparison of Two Fault Injection Methods 90
    5.3.3 Bit-flip Fault Experiments 92
    5.3.4 eMMC Controller Fault Experiments 94
Chapter 6 Conclusion 97
Bibliography 99
Abstract in Korean 108
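The GDB-based, application-level injection described in the abstract can be approximated with stock gdb in batch mode. The sketch below is not the thesis's tool; it assumes a locally running victim process built with debug symbols, a global integer variable to corrupt, and ptrace permission to attach. The PID, variable name, and bit position are hypothetical.

```python
# Minimal sketch of GDB-based fault injection into a running process (not the
# thesis's tool). Attaches gdb in batch mode, XORs one bit of a named variable,
# then detaches so the target keeps running with the corrupted value.
# Requires ptrace permission (e.g., same user, permissive kernel.yama.ptrace_scope).
import subprocess

def flip_bit(pid: int, variable: str, bit: int) -> None:
    """Flip `bit` of global `variable` in process `pid` using gdb's batch mode."""
    args = [
        "gdb",
        "-batch",                      # run the -ex commands, then exit
        "-p", str(pid),                # attach to the victim process
        "-ex", f"set var {variable} = {variable} ^ (1 << {bit})",
        "-ex", "detach",
    ]
    subprocess.run(args, check=True)

if __name__ == "__main__":
    # Hypothetical usage: corrupt bit 3 of `sensor_value` in process 4242.
    flip_bit(4242, "sensor_value", 3)
```

For a remote application, one way to issue the same commands is against a gdbserver running in multi-process mode on the target, using "target extended-remote host:port" and "attach <pid>" instead of attaching with -p.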

    GeantV: Results from the prototype of concurrent vector particle transport simulation in HEP

    Full text link
    Full detector simulation was among the largest CPU consumers in all CERN experiment software stacks for the first two runs of the Large Hadron Collider (LHC). In the early 2010s, the projections were that simulation demands would scale linearly with the luminosity increase, compensated only partially by an increase in computing resources. Extending fast simulation approaches to more use cases, covering a larger fraction of the simulation budget, is only part of the solution because of intrinsic precision limitations. The remainder corresponds to speeding up the simulation software by several factors, which is out of reach using simple optimizations on the current code base. In this context, the GeantV R&D project was launched, aiming to redesign the legacy particle transport codes so that they benefit from fine-grained parallelism features such as vectorization, as well as from increased code and data locality. This paper presents in detail the results and achievements of this R&D, as well as the conclusions and lessons learnt from the beta prototype. Comment: 34 pages, 26 figures, 24 tables
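To make the fine-grained parallelism idea concrete, here is a toy sketch, in no way GeantV code, contrasting a per-particle propagation loop with a "basketized" update that advances a whole batch of particles in one vectorized step. It assumes field-free, straight-line transport and made-up particle data.

```python
# Toy illustration of basketized, vectorized particle transport (not GeantV code).
# A "basket" of particles is advanced in one NumPy call instead of one particle
# at a time, which is the kind of data-level parallelism GeantV aimed to exploit.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
pos = rng.uniform(-1.0, 1.0, size=(n, 3))            # particle positions
dirs = rng.normal(size=(n, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit direction vectors

def propagate_scalar(pos, dirs, step):
    out = pos.copy()
    for i in range(len(pos)):                         # one particle per iteration
        out[i] = pos[i] + step * dirs[i]
    return out

def propagate_basket(pos, dirs, step):
    return pos + step * dirs                          # whole basket in one vector op

if __name__ == "__main__":
    step = 0.1
    a = propagate_scalar(pos[:1000], dirs[:1000], step)   # small slice: scalar loop is slow
    b = propagate_basket(pos, dirs, step)
    assert np.allclose(a, b[:1000])
```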

    A Streamline Upwind/Petrov-Galerkin FEM based time-accurate solution of 3D time-domain Maxwell's equations for dispersive materials

    Get PDF
    Although most broadband frequency-response simulations are performed under the assumption of constant electromagnetic material parameters, this assumption is not valid for many materials found in nature. In this dissertation, the time-accurate solution of the 3D time-domain Maxwell's equations for dispersive materials in a Streamline Upwind/Petrov-Galerkin framework is investigated. For this purpose, the permittivity associated with a material is expressed as a matrix, enabling the solution of anisotropic material models with multiple poles; here, diagonally isotropic models with up to two poles are investigated. Near-field to far-field transformations are implemented to enable the solution of open-boundary problems such as radiation patterns and radar cross sections, and a unified perfectly matched layer is implemented to efficiently terminate the computational region. The numerical solution of these equations is tightly coupled and compared against a loosely coupled approach to improve efficiency. An alternative diagonal stabilization matrix is proposed, implemented, and compared with a non-sparse stabilization matrix derived from the flux Jacobians. Along with this new stabilization parameter, scalability is improved by coupling the perfectly matched layer equations with Maxwell's equations. Further efficiency gains are achieved by allowing a variable number of equations to be solved throughout the domain.
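As a concrete instance of the dispersive-material description above, the sketch below evaluates a diagonally isotropic relative permittivity with up to two poles, eps_r(omega) = eps_inf + sum_p d_eps_p / (1 + j*omega*tau_p), and assembles it as a 3x3 matrix. Debye-type poles and the numerical pole values are assumptions for illustration only, not parameters from the dissertation.

```python
# Illustrative evaluation of a diagonally isotropic, two-pole Debye permittivity
# as a 3x3 matrix (pole parameters are made up, not from the dissertation).
# eps_r(omega) = eps_inf + sum_p d_eps_p / (1 + 1j * omega * tau_p)
import numpy as np

def debye_permittivity(omega, eps_inf, poles):
    """poles: list of (delta_eps, tau) pairs; returns a 3x3 complex matrix."""
    eps = eps_inf + sum(d / (1.0 + 1j * omega * tau) for d, tau in poles)
    return eps * np.eye(3)          # diagonally isotropic: same value on the diagonal

if __name__ == "__main__":
    poles = [(3.0, 8.1e-12), (1.2, 1.0e-10)]   # (delta_eps, relaxation time in seconds)
    for f in (1e9, 5e9, 20e9):                  # a few microwave frequencies in Hz
        eps = debye_permittivity(2 * np.pi * f, eps_inf=2.0, poles=poles)
        print(f"{f / 1e9:5.1f} GHz  eps_xx = {eps[0, 0]:.3f}")
```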

    Doctor of Philosophy

    Get PDF
    Recent trends in high performance computing present larger and more diverse computers using multicore nodes, possibly with accelerators and/or coprocessors, and reduced memory. These changes pose formidable challenges for application code to attain scalability. Software frameworks that execute machine-independent application code using a runtime system that shields users from architectural complexities offer a portable solution for easy programming. The Uintah framework, for example, solves a broad class of large-scale problems on structured adaptive grids using fluid-flow solvers coupled with particle-based solids methods. However, the original Uintah code had limited scalability, as tasks were run in a predefined order based solely on static analysis of the task graph and used only the message passing interface (MPI) for parallelism. By using a new hybrid multithread and MPI runtime system, this research has made it possible for Uintah to scale to 700K central processing unit (CPU) cores when solving challenging fluid-structure interaction problems. Those problems often involve moving objects with adaptive mesh refinement and thus highly variable and unpredictable work patterns. This research has also demonstrated the ability to run capability jobs on heterogeneous systems with Nvidia graphics processing unit (GPU) accelerators or Intel Xeon Phi coprocessors. The new runtime system for Uintah executes directed acyclic graphs of computational tasks in a scalable, asynchronous, and dynamic fashion on the multicore CPUs and/or accelerators/coprocessors of a node. Uintah's clear separation between application and runtime code has led to scalability increases without significant changes to application code. This research concludes that the adaptive directed acyclic graph (DAG)-based approach provides a very powerful abstraction for solving challenging multiscale multiphysics engineering problems. Excellent scalability with regard to both the different processors and communication performance is achieved on some of the largest and most powerful computers available today.
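A minimal sketch of the dynamic DAG-execution idea follows; it is not Uintah's runtime. Tasks become ready as soon as their predecessors finish and are dispatched to a worker pool, so the execution order is decided at run time rather than by a static schedule. The task names and dependency graph are hypothetical.

```python
# Minimal sketch of dynamic DAG-based task execution (not Uintah's runtime):
# a task is submitted to a worker pool as soon as all of its inputs are done.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# Hypothetical task graph: task -> set of tasks it depends on.
deps = {
    "interpolate": set(),
    "solve_fluid": {"interpolate"},
    "solve_solid": {"interpolate"},
    "exchange": {"solve_fluid", "solve_solid"},
    "advance": {"exchange"},
}

def run_task(name):
    print("running", name)
    return name

def execute(deps, workers=4):
    pending = {task: set(d) for task, d in deps.items()}   # unfinished prerequisites
    running = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:

        def launch_ready():
            for task in [t for t, d in pending.items() if not d]:
                del pending[task]
                running.add(pool.submit(run_task, task))

        launch_ready()
        while running:
            done, running = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                finished = future.result()
                for d in pending.values():
                    d.discard(finished)                     # retire the finished task
            launch_ready()

if __name__ == "__main__":
    execute(deps)
```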

    Architectural explorations for streaming accelerators with customized memory layouts

    Get PDF
    The basic concept of the single-core architecture in general-purpose processors fits well with a sequential programming model. The integration of multiple cores on a single chip has allowed processors to run parts of a program in parallel. However, exploiting the enormous parallelism available in many high-performance applications and their data is difficult using general-purpose multicores alone. The emergence of streaming accelerators and the corresponding programming models has improved this situation by providing architectures oriented to processing data streams. The basic idea behind the design of these architectures responds to the need to process huge data sets. These high-performance, stream-oriented devices allow fast processing of data through the efficient use of parallel computation and inter-process communication. Stream-oriented accelerators, like other processors, consist of various microarchitectural components such as memory structures, compute units, control units, I/O channels, and I/O controls. However, the data-streaming requirements add some special features and impose other restrictions that affect performance. These devices generally offer a large number of computational resources, but they force the data sets to be reorganized in parallel, maximizing independence, in order to feed the compute resources in the form of streams. Arranging data into independent sets of parallel streams is not a simple task: the structure of an algorithm may have to be changed as a whole, or the algorithm may even have to be rewritten from scratch. However, all these efforts to reorder an application's data access patterns may not be very useful for achieving optimal performance, because of possible microarchitectural limitations of the target platform regarding the hardware prefetch mechanisms, the size and granularity of the local storage, and the flexibility to lay out data serially inside the local storage. The limitations of a general-purpose streaming platform for data prefetching, storage, and the other procedures needed to organize and maintain data as parallel, independent streams could be removed by employing microarchitecture-level techniques, including the use of application-specific customized memories in the front-end of a streaming architecture. The objective of this thesis is to present architectural explorations of streaming accelerators with customized memory designs. In general, the thesis covers three main aspects of such accelerators, which can be classified as: i) design of application-specific accelerators with customized memory layouts, ii) template-based design of accelerators with customized memories, and iii) design space explorations for stream-oriented devices with standard and customized memories. The thesis concludes with the conceptual proposal of a Blacksmith Streaming Architecture (BSArc). The Blacksmith computing model allows the hardware-level adoption of an application-specific front-end that uses a GPU as the back-end, making it possible to maximize the exploitation of an application's data locality and data-level parallelism while providing a higher data flow to the back-end. We consider that the design of these processors with specialized memories should be provided by application domain experts in the form of templates.
The basic concept behind the architecture of a general-purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip has helped CPUs run parts of a program in parallel. However, the huge parallelism available in many high-performance applications and the corresponding data is hard to exploit with these general-purpose multicores. Streaming accelerators and the corresponding programming models improve upon this situation by providing throughput-oriented architectures. The basic idea behind the design of these architectures matches the ever-increasing requirements of processing huge data sets. These high-performance, throughput-oriented devices enable fast processing of data through efficient parallel computation and streaming-based communication. Throughput-oriented streaming accelerators, like other processors, consist of numerous types of microarchitectural components, including memory structures, compute units, control units, I/O channels, and I/O controls. However, the throughput requirements add some special features and impose other restrictions for performance reasons. These devices normally offer a large number of compute resources but require applications to arrange parallel and maximally independent data sets to feed the compute resources in the form of streams. Arranging data into independent sets of parallel streams is not an easy task: it may require changing the structure of an algorithm as a whole, or even writing a new algorithm from scratch for the target application. Even then, all these efforts to rearrange application data access patterns may not be very helpful in achieving optimal performance, because of possible microarchitectural constraints of the target platform on the hardware prefetching mechanisms, the size and granularity of the local storage, and the flexibility of data marshaling inside the local storage. The constraints of a general-purpose streaming platform on prefetching, storing, and maneuvering data to arrange and maintain it as parallel and independent streams could be removed by employing microarchitecture-level design approaches, including the use of application-specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for streaming accelerators using customized memory layouts. In general, the thesis covers three main aspects of such streaming accelerators: i) design of application-specific accelerators with customized memory layouts, ii) template-based design support for customized memory accelerators, and iii) design space explorations for throughput-oriented devices with standard and customized memories. The thesis concludes with a conceptual proposal of a Blacksmith Streaming Architecture (BSArc).
Blacksmith computing allows the hardware-level adoption of an application-specific front-end with a GPU-like streaming back-end. This gives an opportunity to exploit the maximum possible data locality and data-level parallelism of an application while providing a powerful, throughput-oriented back-end. We consider that the design of these specialized memory layouts for the front-end of the device is provided by application domain experts in the form of templates. These templates are adjustable according to the device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time. However, a simulation framework supports the architectural explorations, gives insight into the proposal, and predicts potential performance benefits for such an architecture.
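To illustrate the data-marshaling problem the thesis targets, here is a small, purely illustrative sketch that rearranges an array-of-structures record layout into structure-of-arrays streams and splits them into fixed-size tiles, the kind of reshaping a customized front-end memory would perform before feeding a streaming back-end. The field names and tile size are made up.

```python
# Illustrative only: rearranging data into independent, parallel streams.
# An array-of-structures layout is converted to structure-of-arrays streams and
# split into fixed-size tiles, mimicking what a customized front-end memory
# would do before feeding a streaming back-end.
import numpy as np

# Hypothetical records: (x, y, z, mass) per element, interleaved in memory (AoS).
aos = np.arange(32, dtype=np.float32).reshape(8, 4)

# Structure-of-arrays: one contiguous stream per field, independent of the others.
soa = {name: np.ascontiguousarray(aos[:, i]) for i, name in enumerate("xyzm")}

# Tiling: split each stream into fixed-size chunks that fit the local storage.
TILE = 4
tiles = {name: [col[i:i + TILE] for i in range(0, len(col), TILE)]
         for name, col in soa.items()}

print(soa["x"])        # contiguous stream of x values
print(tiles["m"][0])   # first tile of the mass stream
```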

    High-Performance Modelling and Simulation for Big Data Applications

    Get PDF
    This open access book was prepared as the final publication of the COST Action IC1406 "High-Performance Modelling and Simulation for Big Data Applications (cHiPSet)" project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predicting and analysing natural and complex systems in science and engineering. As their level of abstraction rises to provide a better discernment of the domain at hand, their representation becomes increasingly demanding in computational and data resources. On the other hand, High Performance Computing typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication and visualisation systems to underpin complex data-intensive applications in distinct scientific and technical domains. It is then arguably required to have a seamless interaction of High Performance Computing with Modelling and Simulation in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for its members and distinguished guests to openly discuss novel perspectives and topics of interest for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications.

    A mixed-signal computer architecture and its application to power system problems

    Get PDF
    Radical changes are taking place in the landscape of modern power systems. This massive shift in the way the system is designed and operated has been termed the advent of the "smart grid". One of its implications is a strong market pull for faster power system analysis computing. This work is concerned in particular with transient simulation, which is one of the most demanding power system analyses. This refers to the imitation of the operation of the real-world system over time, for time scales that cover the majority of slow electromechanical transient phenomena. The general mathematical formulation of the simulation problem includes a set of non-linear differential algebraic equations (DAEs). The algebraic part of this set includes heavy linear algebra computations related to the admittance matrix of the topology. These computations are a critical factor in the overall performance of a transient simulator. This work proposes the use of analog electronic computing as a means of exceeding the performance barriers of conventional digital computers for the linear algebra operations. Analog computing is integrated into the framework of a power system transient simulator, yielding significant computational performance benefits to the latter. Two hybrid, analog and digital computers are presented. The first prototype has been implemented using reconfigurable hardware. In its core, analog computing is used for linear algebra operations, while pipelined digital resources on a field programmable gate array (FPGA) handle all remaining computations. The properties of the analog hardware are thoroughly examined, with special attention to accuracy and timing. The application of the platform to the transient analysis of power system dynamics showed a speedup of two orders of magnitude against conventional software solutions. The second prototype is proposed as a future conceptual architecture that would overcome the limitations of the already implemented hardware while retaining its virtues. The design space of this future architecture has been thoroughly explored with the help of a software emulator. For one possible suggested implementation, speedups of four orders of magnitude against software solvers have been observed for the linear algebra operations.
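To make the role of the admittance-matrix algebra concrete, here is a minimal sketch, not the thesis's solver, of one backward-Euler step of a tiny linear network C dv/dt + G v = i(t). The repeated nodal solve (G + C/h) v_next = i_next + (C/h) v_n is the kind of linear-algebra kernel the thesis offloads to analog hardware; the component values and network are made up.

```python
# Minimal sketch (not the thesis's solver): backward-Euler transient step of a
# tiny linear network  C dv/dt + G v = i(t).  The nodal solve below is the kind
# of admittance-matrix linear algebra the thesis accelerates in analog hardware.
import numpy as np

G = np.array([[2.0, -1.0],        # nodal conductance matrix in siemens (made-up values)
              [-1.0, 1.5]])
C = np.diag([1e-3, 2e-3])         # nodal capacitance matrix in farads
h = 1e-4                          # time step in seconds

def step(v, i_next):
    """Advance the node voltages by one backward-Euler step."""
    A = G + C / h                       # constant for a fixed step size
    b = i_next + (C / h) @ v
    return np.linalg.solve(A, b)        # the performance-critical linear solve

v = np.zeros(2)
for n in range(5):
    v = step(v, i_next=np.array([1.0, 0.0]))   # 1 A injected at node 1
    print(n, v)
```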