1,054 research outputs found

    Dynamic Load Balancing Based on Applications Global States Monitoring

    Get PDF
    8 pages à paraîtreInternational audienceThe paper presents how to use a special novel distributed program design framework with evolved global control mechanisms to assure processor load balancing during execution of application programs. The new framework supports a programmer with an API and GUI for automated graphical design of program execution control based on global application states monitoring. The framework provides highlevel distributed control primitives at process level and a special control infrastructure for global asynchronous execution control at thread level. Both kinds of control assume observations of current multicore processor performance and communication throughput enabled in the executive distributed system. Methods for designing processor load balancing control based on a system of program and system properties metrics and computational data migration between application executive processes is presented and assessed by experiments with execution of graph representations of distributed programs

    Macroservers: An Execution Model for DRAM Processor-In-Memory Arrays

    Get PDF
    The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Newer developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in single memory accesses and supporting complex address manipulations in the memory. Furthermore, large arrays of PIMs can be arranged into a massively parallel architecture. In this report, we describe an object-based programming model based on the notion of a macroserver. Macroservers encapsulate a set of variables and methods; threads, spawned by the activation of methods, operate asynchronously on the variables' state space. Data distributions provide a mechanism for mapping large data structures across the memory region of a macroserver, while work distributions allow explicit control of bindings between threads and data. Both data and work distributuions are first-class objects of the model, supporting the dynamic management of data and threads in memory. This offers the flexibility required for fully exploiting the processing power and memory bandwidth of a PIM array, in particular for irregular and adaptive applications. Thread synchronization is based on atomic methods, condition variables, and futures. A special type of lightweight macroserver allows the formulation of flexible scheduling strategies for the access to resources, using a monitor-like mechanism

    Distributed multi-threading in GNU prolog

    Get PDF
    Embora a computação paralela já tenha sido alvo de inúmeros estudos, o processo de a tornar acessível as massas ainda mal começou. Através da combinação com o Prolog de um ambiente de programação distribuída e multithreaded, como o PM2, torna-se possível ter computações paralelas e concorrentes usando programação em logica. Com este objetivo foi desenvolvido o PM2-Prolog, um interface Prolog para o sistema PM2. Tal sistema permite correr aplicações Prolog multithreaded em múltiplas instâncias do GNU Prolog num ambiente distribuído, tirando, assim, partido dos recursos disponíveis nos computadores ligados numa rede. Em problemas computacionalmente pesados, onde o tempo de execução é crucial, existe particular vantagem em usar este sistema. A API do sistema oferece primitivas para gestão de threads e para comunicação explícita entre threads. Testes preliminares mostram um ganho de desempenho quase linear, em comparação com uma versão sequencial. /ABSTRACT - Although parallel computing has been widely researched, the process of bringing concurrency and parallel programming to the mainstream has just begun. Combining a distributed multi-threading environment like PM2 with Prolog, opens the way to exploit concurrency and parallel computing using logic programming. To achieve such a purpose, we developed PM2-Prolog, a Prolog interface to the PM2 system. It allows multithreaded Prolog applications to run in multiple GNU Prolog engines in a distributed environment, thus taking advantage of the resources available on a computer network. This is especially useful for computationally intensive problems, where performance is an important factor. The system API offers thread management primitives, as well as explicit communication between threads. Preliminary test results show an almost linear speedup, when compared to a sequential version

    CEEME: compensating events based execution monitoring enforcement for Cyber-Physical Systems

    Get PDF
    Fundamentally, inherently observable events in Cyber-Physical Systems with tight coupling between cyber and physical components can result in a confidentiality violation. By observing how the physical elements react to cyber commands, adversaries can identify critical links in the system and force the cyber control algorithm to make erroneous decisions. Thus, there is a propensity for a breach in confidentiality leading to further attacks on availability or integrity. Due to the highly integrated nature of Cyber-Physical Systems, it is also extremely difficult to map the system semantics into a security framework under existing security models. The far-reaching objective of this research is to develop a science of selfobfuscating systems based on the composition of simple building blocks. A model of Nondeducibility composes the building blocks under Information Flow Security Properties. To this end, this work presents fundamental theories on external observability for basic regular networks and the novel concept of event compensation that can enforce Information Flow Security Properties at runtime --Abstract, page iii

    Acta Cybernetica : Volume 9. Number 3.

    Get PDF

    Some techniques for automated, resource-aware distributed and mobile computing in a multi-paradigm programming system

    Get PDF
    Distributed parallel execution systems speed up applications by splitting tasks into processes whose execution is assigned to different receiving nodes in a high-bandwidth network. On the distributing side, a fundamental problem is grouping and scheduling such tasks such that each one involves sufñcient computational cost when compared to the task creation and communication costs and other such practical overheads. On the receiving side, an important issue is to have some assurance of the correctness and characteristics of the code received and also of the kind of load the particular task is going to pose, which can be specified by means of certificates. In this paper we present in a tutorial way a number of general solutions to these problems, and illustrate them through their implementation in the Ciao multi-paradigm language and program development environment. This system includes facilities for parallel and distributed execution, an assertion language for specifying complex programs properties (including safety and resource-related properties), and compile-time and run-time tools for performing automated parallelization and resource control, as well as certification of programs with resource consumption assurances and efñcient checking of such certificates

    Toward Reliable and Efficient Message Passing Software for HPC Systems: Fault Tolerance and Vector Extension

    Get PDF
    As the scale of High-performance Computing (HPC) systems continues to grow, researchers are devoted themselves to achieve the best performance of running long computing jobs on these systems. My research focus on reliability and efficiency study for HPC software. First, as systems become larger, mean-time-to-failure (MTTF) of these HPC systems is negatively impacted and tends to decrease. Handling system failures becomes a prime challenge. My research aims to present a general design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. Using multiple overlapping topologies to optimize the detection and propagation, minimizing the incurred overhead sand guaranteeing the scalability of the entire framework. Results from different machines and benchmarks compared to related works shows that my design and implementation outperforms non-HPC solutions significantly, and is competitive with specialized HPC solutions that can manage only MPI applications. Second, I endeavor to implore instruction level parallelization to achieve optimal performance. Novel processors support long vector extensions, which enables researchers to exploit the potential peak performance of target architectures. Intel introduced Advanced Vector Extension (AVX512 and AVX2) instructions for x86 Instruction Set Architecture (ISA). Arm introduced Scalable Vector Extension (SVE) with a new set of A64 instructions. Both enable greater parallelisms. My research utilizes long vector reduction instructions to improve the performance of MPI reduction operations. Also, I use gather and scatter feature to speed up the packing and unpacking operation in MPI. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architecture and efficient

    Data-Driven Self-Tuning in a Coordination Programming Language

    Get PDF
    Coordination programming is a paradigm for managing composition, communication, and synchronisation of concurrent components. AstraKahn is a new dataflow coordination language based on Gilles Kahn’s model of process network with some significant refinements. AstraKahn provides a mechanism of implicit data parallelism that is expected to rely on self-tuning, i.e. adaptive optimisation of execution parameters in order to improve the performance of the program. This is achieved by providing a programmer with a number of special network primitives that allow an AstraKahn runtime system to extract optimisation parameters and adjust them while monitoring the performance of execution. In this thesis, we present the architecture of an AstraKahn prototype including a compiler and a runtime system. On the runtime system level the built-in compound network primitives are constructed from simple ones. This approach allows us to make the implementation clear and easily extensible. As a minor contribution we present a number of potential self-tuning heuristics for a simple network pattern. Also, for illustrative purposes, a practical application of the morphism pattern is presented. The particle-in-cell problem, whose parallelisation requires load-balancing, is formulated this way

    Χρήση μοντέλου παράλληλου προγραμματισμού για σύνθεση αρχιτεκτονικών

    Get PDF
    The problem of automatically generating hardware modules from high level application representations has been at the forefront of EDA research during the last few years. In this Dissertation we introduce a methodology to automatically synthesize hardware accelerators from OpenCL applications. OpenCL is a recent industry supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our methodology maps OpenCL kernels into hardware accelerators based on architectural templates that explicitly decouple computation from memory communication whenever this is possible. The templates can be tuned to provide a wide repertoire of accelerators that meet user performance requirements and FPGA device characteristics. Furthermore a set of high- and low-level compiler optimizations is applied to generate optimized accelerators. Our experimental evaluation shows that the generated accelerators are tuned efficiently to match the applications memory access pattern and computational complexity and to achieve user performance requirements. An important objective of our tool is to expand the FPGA development user base to software engineers thereby expanding the scope of FPGAs beyond the realm of hardware design.To πρόβλημα της αυτόματης δημιουργίας μονάδων υλικό από παραστάσεις υψηλού επιπέδου εφαρμογής είναι στην πρώτη γραμμή της EDA έρευνας κατά τη διάρκεια των τελευταίων ετών. Σε αυτή την διατριβή παρουσιάζουμε μια μεθοδολογία για τη αυτόματη σύνθεση επιταχυντές υλικού από εφαρμογές OpenCL. OpenCL είναι ένα πρόσφατο πρότυπο για τη σύνταξη των προγραμμάτων που εκτελούνται σε πλατφόρμες πολλαπλών πυρήνων και επιταχυντές όπως GPUs. Η μεθοδολογία μας μετατρέπει προγράμματα OpenCL σε επιταχυντές υλικού με βάση αρχιτεκτονικά πρότυπα που ρητά αποσυνδέει τους υπολογισμούς από την μεταφορά δεδομένων από/προς την μνήμη όποτε αυτό είναι δυνατό. Τα πρότυπα μπορούν να συντονιστούν ώστε να παρέχουν ένα ευρύ ρεπερτόριο από επιταχυντές που πληρούν τις απαιτήσεις απόδοσης των χρηστών και τα χαρακτηριστικά της συσκευής FPGA. Επιπλέον ένα σύνολο υψηλής και χαμηλής στάθμης βελτιστοποιήσεις μεταγλωττιστή εφαρμόζεται για να παράγει βελτιστοποιημένα επιταχυντές. Η πειραματική αξιολόγηση δείχνει ότι οι επιταχυντές που δημιουργούνται αποτελεσματικά συντονισμένοι για να ταιριάζει με το μοτίβο πρόσβασης στην μνήμη κάθε εφαρμογής και την υπολογιστική πολυπλοκότητα και να επιτύχουν τις απαιτήσεις απόδοσης των χρηστών. Ένας σημαντικός στόχος του εργαλείου μας είναι η επέκταση της βάσης χρηστών πλατφόρμες FPGA για μηχανικούς λογισμικού ώστε να γίνει ανάπτυξη FPGA συστήματα από μηχανικούς λογισμικού χωρίς την ανάγκη για εμπειρία σχεδιασμού υλικού
    corecore