7 research outputs found

    Design and Implementation of a Distributed Middleware for Parallel Execution of Legacy Enterprise Applications

    A typical enterprise uses a local area network of computers to perform its business. During off-working hours, the computational capacity of these networked computers is underused or unused. To exploit this capacity, an application normally has to be recoded to expose the concurrency inherent in its computation, which is clearly not possible for legacy applications whose source code is unavailable. This thesis presents the design and implementation of a distributed middleware that can automatically execute a legacy application on multiple networked computers by parallelizing it. The middleware runs multiple copies of the binary executable in parallel on different hosts in the network. It wraps the binary executable of the legacy application in order to capture kernel-level data-access system calls and perform them over multiple computers in a safe and conflict-free manner. The middleware also incorporates a dynamic scheduling technique that executes the target application in minimum time by scavenging the available CPU cycles of the hosts in the network. This scheduler also allows host CPU availability to change over time, rescheduling the replicas performing the computation so as to minimize execution time. A prototype implementation of this middleware has been developed as a proof of concept of the design. The implementation has been evaluated on several typical case studies, and the test results confirm that the middleware works as expected.
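The rescheduling idea described above can be sketched as a proportional-share assignment that is simply recomputed whenever host availability changes. This is an illustrative sketch only, not the thesis middleware; the host names and availability figures are hypothetical.

```python
# Illustrative sketch: redistributing a pool of work units among hosts in
# proportion to their currently available CPU cycles. Hypothetical hosts/values.

def assign_blocks(total_blocks, availability):
    """Split `total_blocks` work units among hosts proportionally to their
    available CPU fraction; leftovers go to the most-available hosts."""
    total_avail = sum(availability.values())
    shares = {h: int(total_blocks * a / total_avail) for h, a in availability.items()}
    leftover = total_blocks - sum(shares.values())
    for h in sorted(availability, key=availability.get, reverse=True)[:leftover]:
        shares[h] += 1
    return shares

# Initial schedule during off-hours: hostA and hostB are mostly idle.
plan = assign_blocks(100, {"hostA": 0.9, "hostB": 0.9, "hostC": 0.2})
print(plan)  # {'hostA': 45, 'hostB': 45, 'hostC': 10}

# Later, availability shifts; the remaining 60 blocks are simply reassigned,
# mirroring the dynamic rescheduling described in the abstract.
replan = assign_blocks(60, {"hostA": 0.3, "hostB": 0.9, "hostC": 0.8})
print(replan)  # {'hostA': 9, 'hostB': 27, 'hostC': 24}
```

Recomputing shares on each availability change keeps every replica busy without requiring any change to the legacy binary itself.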

    Efficient And Scalable Evaluation Of Continuous, Spatio-temporal Queries In Mobile Computing Environments

    A variety of research exists on the processing of continuous queries in large, mobile environments. Each method tries, in its own way, to address the computational bottleneck of constantly processing so many queries. In this research, we present a two-pronged approach to this problem. First, we introduce an efficient and scalable system for monitoring traditional, continuous queries by leveraging the parallel processing capability of the Graphics Processing Unit (GPU). We examine a naive CPU-based solution for continuous range-monitoring queries, and we then extend this system using the GPU. Additionally, with mobile communication devices becoming a commodity, location-based services will become ubiquitous. To cope with the very high intensity of location-based queries, we propose a view-oriented approach to the location database, reducing computation costs by exploiting computation sharing among queries that require the same view. Our studies show that by exploiting the parallel processing power of the GPU, we are able to significantly scale the number of mobile objects while maintaining an acceptable level of performance. Our second approach was to view this research problem as one belonging to the domain of data streams. Several works have convincingly argued that the two research fields of spatio-temporal data streams and the management of moving objects can naturally come together [IlMI10, ChFr03, MoXA04]. For example, the output of a GPS receiver monitoring the position of a mobile object is viewed as a data stream of location updates. This data stream, along with those from the plausibly many other mobile objects, is received at a centralized server, which processes the streams upon arrival, effectively updating the answers to the currently active queries in real time.
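The naive CPU baseline for continuous range monitoring can be sketched as a per-cycle re-evaluation of every query against every object. This is a hedged illustration with invented data, not the paper's system; the point is that the O(queries × objects) loop has exactly the independent, data-parallel structure that maps onto one GPU thread per query-object pair.

```python
# Illustrative CPU baseline for continuous range-monitoring queries.
# Each update cycle, every query is checked against every moving object.

def evaluate_range_queries(objects, queries):
    """objects: {id: (x, y)}; queries: {qid: (xmin, ymin, xmax, ymax)}.
    Returns {qid: set of object ids currently inside the rectangle}."""
    results = {}
    for qid, (xmin, ymin, xmax, ymax) in queries.items():
        results[qid] = {oid for oid, (x, y) in objects.items()
                        if xmin <= x <= xmax and ymin <= y <= ymax}
    return results

objs = {1: (2.0, 3.0), 2: (8.0, 8.0), 3: (2.5, 2.5)}
qs = {"q1": (0, 0, 5, 5), "q2": (7, 7, 9, 9)}
print(evaluate_range_queries(objs, qs))  # {'q1': {1, 3}, 'q2': {2}}
```

Because each query-object test is independent, the GPU version replaces the nested loop with a parallel grid of such tests, which is where the reported scalability comes from.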
For this second approach, we present GEDS, a scalable, Graphics Processing Unit (GPU)-based framework for the evaluation of continuous spatio-temporal queries over spatio-temporal data streams. Specifically, GEDS employs the computation-sharing and parallel-processing paradigms to deliver scalability in the evaluation of continuous, spatio-temporal range queries and continuous, spatio-temporal kNN queries. The GEDS framework utilizes the parallel processing capability of the GPU, a stream processor by trade, to handle the computation required in this application. Experimental evaluation shows promising performance and demonstrates the scalability and efficacy of GEDS in spatio-temporal data streaming environments. Additional performance studies demonstrate that, even in light of the costs associated with memory transfers, the parallel processing power provided by GEDS clearly outweighs those costs. Finally, in an effort to move beyond the analysis of specific algorithms over the GEDS framework, we take a broader approach in our analysis of GPU computing. What algorithms are appropriate for the GPU? What types of applications can benefit from the parallel and stream processing power of the GPU? And can we identify a class of algorithms that are best suited for GPU computing? To answer these questions, we develop an abstract performance model detailing the relationship between the CPU and the GPU. From this model, we are able to extrapolate a list of attributes common to successful GPU-based applications, thereby providing insight into which algorithms and applications are best suited for the GPU, along with an estimated theoretical speedup for such GPU-based applications.
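The trade-off between memory-transfer cost and kernel speedup that the abstract highlights can be captured by a one-line model. This is an illustrative reading, not the paper's actual performance model: a GPU offload only pays off when the host-device transfer time does not swamp the kernel-time savings.

```python
# Illustrative CPU/GPU offload model (hypothetical timings, not GEDS data):
# estimated speedup must amortize host-device memory-transfer time.

def estimated_speedup(t_cpu, t_transfer, t_kernel):
    """Theoretical speedup of a GPU offload once transfer costs are included."""
    return t_cpu / (t_transfer + t_kernel)

# A kernel 20x faster than the CPU still only pays off if transfers are cheap:
print(estimated_speedup(t_cpu=100.0, t_transfer=5.0, t_kernel=5.0))   # 10.0
print(estimated_speedup(t_cpu=100.0, t_transfer=95.0, t_kernel=5.0))  # 1.0
```

The second call shows the failure mode the abstract alludes to: an application whose data movement dominates gains nothing from the GPU, however fast the kernel.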

    A Method for Optimizing Parallel Computations in Distributed Heterogeneous Computer Systems Based on Multi-Level Load Balancing

    Structure and scope of the master's thesis: the thesis is presented on 128 pages and consists of an introduction, 3 chapters and a conclusion; it contains 20 figures, 4 tables, 13 formulas, and a bibliography of 33 titles on 5 pages. Relevance of the work. Today, an increasing number of computer systems have a heterogeneous structure, based on the use of a central processing unit together with accelerators of a different architecture. At the same time, parallel systems alone no longer provide a sufficient level of performance for complex software systems; it is now standard practice for high-load computations to deploy them in distributed computer systems, in which parallel systems are only constituent elements. For all distributed systems, one of the key factors determining system efficiency is skilful load balancing at all levels of the system. A whole branch of computer science is devoted to this NP-complete problem, but it remains an open problem, since a perfect solution cannot currently be found in acceptable time.
And if the end nodes of the distributed system are themselves heterogeneous computer systems, the task becomes even more complicated, since computations must be balanced between heterogeneous elements with typically very different performance and communication characteristics. Currently, there is a large set of standard load-balancing methods for distributed systems, but most of them are aimed at specific types of tasks, and only a small number are task-independent. Modern tools for organizing parallel computations in distributed computer systems with accelerators adopt a low level of abstraction, leading to a number of difficulties and limitations. The load-balancing methods built into them are oriented towards homogeneous and mostly fully connected systems. Purpose of the work: to increase the efficiency of parallel computing in distributed heterogeneous computer systems by optimally distributing and balancing the load on their constituent elements. Research tasks: 1. Analyse modern approaches to organizing parallel computations in DHCS, as well as the main methods for optimizing such computations; 2. Identify the main problems that arise in organizing DHCS and parallel computations in them, and investigate and systematize existing proposals for their solution; 3. Highlight the main problems and disadvantages of existing approaches and technologies, and propose solutions to the identified problems and ideas for improving the operation of DHCS; 4. Conduct experimental studies of the effectiveness of the proposed methods and approaches by developing and testing software systems for real DHCS; 5. Carry out a comparative analysis of the created software systems for DHCS against existing implementations of alternative approaches, in terms of the speedup and efficiency achieved. Object of research: parallel and distributed computing, high-performance computing.
Subject of research: methods of organizing parallel computations in distributed heterogeneous computer systems, and methods of computation scheduling and load balancing in DHCS. Research methods: methods of statistical data processing, the theory of parallel and distributed computing, scheduling theory, optimization theory, compiler theory, the theory of algorithms, and graph theory. Publications: 1. “Doslidzhennia efektyvnosti dribnozernystoho paralelizmu v bahatoiadernykh kompiuternykh systemakh” [The investigation of the effectiveness of fine-grained parallelism in multicore computer systems], Visnyk NTUU “KPI”. Informatyka, upravlinnia ta obchysliuvalna tekhnika: zbirnyk naukovykh prats [Herald of NTUU “KPI”. Information technology, management and computing technics: a collection of scientific papers], vol. 66, pp. 56–61, in press. Also, the results of this study were presented at the international conferences CSNT-2017, CSNT-2018 and ICSFTI-2018 and published in the relevant proceedings of these conferences. 2. «Neural network acceleration method in the two-component CPU-GPU computer systems», Proceedings of the 6th International Conference “High Performance Computing” (HPC-UA 2020) [under review]. Also, the results of this study were presented at the international conferences CSNT-2019 and ICSFTI-2019 and published in the relevant proceedings of these conferences. 3. «Zastosuvannia tekhnolohii WCF dlia pidvyshchennia efektyvnosti obchyslen v suchasnykh rozpodilenykh komp’iuternykh systemakh» [The application of WCF technology to increase the efficiency of computing in modern distributed computer systems], Proceedings of the XIII International Conference CSNT-2020 [in press]. 4.
«Zastosuvannia tekhnolohii WCF dlia pidvyshchennia efektyvnosti paralelnykh obchyslen v khmarnykh rozpodilenykh komp’iuternykh systemakh» [The application of WCF technology to increase the efficiency of parallel computing in cloud distributed computer systems], Proceedings of the International Conference on Security, Fault Tolerance, Intelligence (ICSFTI-2020) [in press].
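One level of the multi-level balancing described in this abstract can be sketched as a throughput-proportional work split between heterogeneous nodes. This is a minimal sketch with hypothetical numbers, not the thesis method: work is partitioned so that a slow CPU node and a fast accelerator node are predicted to finish at roughly the same time.

```python
# Minimal sketch of throughput-proportional balancing between heterogeneous
# nodes (hypothetical node names and throughput figures).

def balance(work_items, throughput):
    """throughput: {node: items processed per second}. Returns per-node counts
    chosen so that estimated completion times are approximately equal."""
    total = sum(throughput.values())
    counts = {n: round(work_items * t / total) for n, t in throughput.items()}
    # push any rounding drift onto the fastest node
    fastest = max(throughput, key=throughput.get)
    counts[fastest] += work_items - sum(counts.values())
    return counts

rates = {"cpu_node": 200.0, "gpu_node": 800.0}
shares = balance(1000, rates)
print(shares)  # {'cpu_node': 200, 'gpu_node': 800}
print({n: shares[n] / rates[n] for n in rates})  # both ~1.0 s -> balanced
```

In a real DHCS the throughput figures would themselves be measured and updated at runtime, and the same proportional split would be applied recursively at each level of the system.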

    Parallel program performance prediction using deterministic task graph analysis

    In this paper, we consider analytical techniques for predicting detailed performance characteristics of a single shared-memory parallel program for a particular input. Analytical models for parallel programs have been successful at providing simple qualitative insights and bounds on program scalability, but have been less successful in practice at providing detailed insights and metrics for program performance (leaving these to measurement or simulation). We develop a conceptually simple modeling technique called deterministic task graph analysis that provides detailed performance prediction for shared-memory programs with arbitrary task graphs, a wide variety of task scheduling policies, and significant communication and resource contention. Unlike many previous models that are stochastic, our model assumes deterministic task execution times (while retaining stochastic models for communication and resource contention). This assumption is supported by a previous study of the influence of non-deterministic delays in parallel programs. We evaluate our model in three ways. First, an experimental evaluation shows that our analysis technique is accurate and efficient for a variety of shared-memory programs, including programs with large and/or complex task graphs, sophisticated task scheduling, and highly non-uniform task execution times.
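The deterministic core of the technique can be illustrated with a small example. This sketch shows only the simplest case, unlimited processors with no contention, where fixed task times make the predicted completion time the critical path of the DAG; the paper's full model additionally handles schedulers and contention. The task graph and times below are invented.

```python
# Illustrative critical-path analysis of a task DAG with deterministic
# (fixed) task execution times. Graph and timings are hypothetical.

def completion_time(tasks, deps):
    """tasks: {name: execution time}; deps: {name: [predecessor names]}.
    Returns the completion time on unlimited processors (critical path),
    computed via memoized finish times."""
    finish = {}
    def t_finish(name):
        if name not in finish:
            start = max((t_finish(p) for p in deps.get(name, [])), default=0.0)
            finish[name] = start + tasks[name]
        return finish[name]
    return max(t_finish(n) for n in tasks)

tasks = {"load": 2.0, "fft_a": 4.0, "fft_b": 3.0, "reduce": 1.0}
deps = {"fft_a": ["load"], "fft_b": ["load"], "reduce": ["fft_a", "fft_b"]}
print(completion_time(tasks, deps))  # 7.0: load -> fft_a -> reduce
```

Because every task time is a fixed number rather than a distribution, a single traversal of the graph yields the prediction, which is what makes the approach both detailed and efficient.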

    System-Level Power Estimation Methodology for MPSoC based Platforms

    With the advent of new submicron silicon integration technologies, power consumption in Multiprocessor Systems-on-Chip (MPSoC) has become a primary factor in the design flow. Taking this key factor into account from the earliest design phases plays an essential role, since it increases component reliability and reduces the time-to-market of the final product. Shifting the design entry point up to the system level is the most important countermeasure adopted to manage the increasing complexity of MPSoC designs. The reason is that decisions taken at this level, early in the design cycle, have the greatest impact on the final design in terms of power and energy efficiency. However, taking decisions at this level is very difficult, since the design space is extremely wide, and exploring it has so far been a mostly manual activity. Efficient system-level power estimation tools are therefore necessary to enable proper Design Space Exploration (DSE) based on power/energy and timing.
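The general shape of state-based system-level power estimation can be sketched in a few lines. This is an illustrative sketch only, not the thesis methodology: each component is given a power figure per activity state (all numbers below are hypothetical), and energy is the power-weighted sum of time spent in each state, as reported by a system-level simulation trace.

```python
# Illustrative state-based power estimation. The power table (watts per
# component and activity state) is hypothetical.
POWER = {
    "cpu": {"active": 0.50, "idle": 0.05},
    "dsp": {"active": 0.30, "idle": 0.02},
    "bus": {"active": 0.10, "idle": 0.01},
}

def energy(trace):
    """trace: list of (component, state, seconds). Returns energy in joules."""
    return sum(POWER[comp][state] * secs for comp, state, secs in trace)

trace = [("cpu", "active", 2.0), ("cpu", "idle", 1.0),
         ("dsp", "active", 1.5), ("bus", "active", 3.0)]
print(energy(trace))  # 0.5*2 + 0.05*1 + 0.3*1.5 + 0.1*3 = 1.8 J
```

Running such an estimator over traces from many candidate architectures is what makes power-aware design space exploration tractable at the system level.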

    Effective visualisation of callgraphs for optimisation of parallel programs: a design study

    Parallel programs are increasingly used to perform scientific calculations on supercomputers. Optimising parallel applications to scale well, and ensuring maximum parallelisation, is a challenging task. The performance of parallel programs is affected by a range of factors, such as limited network bandwidth, parallel algorithms, memory latency and the speed of the processors. The term “performance bottlenecks” refers to obstacles that cause slow execution of a parallel program. Visualisation tools are used to identify performance bottlenecks of parallel applications in an attempt to optimise the execution of the programs and fully utilise the available computational resources. TAU (Tuning and Analysis Utilities) callgraph visualisation is one such tool commonly used to analyse the performance of parallel programs. The callgraph visualisation shows the relationships between different parts (for example, routines, subroutines, modules and functions) of the parallel program executed during the run. TAU’s callgraph tool has limitations: it cannot effectively display the large volumes of performance data (metrics) generated during the execution of a parallel program, and the relationships between different parts of the program executed during the run can be hard to see. The aim of this work is to design an effective callgraph visualisation that enables users to efficiently identify performance bottlenecks incurred during the execution of a parallel program. This design study employs a user-centred iterative methodology to develop a new callgraph visualisation, involving expert users in the three developmental stages of the system: these design stages develop prototypes of increasing fidelity, from a paper prototype to high-fidelity interactive prototypes in the final design.
The paper-based prototype of the new callgraph visualisation was evaluated by a single expert from the University of Oregon’s Performance Research Lab, which developed the original callgraph visualisation tool. This expert is a computer scientist who holds a doctoral degree in computer and information science from the University of Oregon and heads the Performance Research Lab. The interactive prototype (first high-fidelity design) was evaluated against the original TAU callgraph system by a team of expert users comprising doctoral graduates and undergraduate computer scientists from the University of Tennessee, United States of America (USA). The final complete prototype (second high-fidelity design) of the callgraph visualisation was developed with the D3.js JavaScript library and evaluated by users (doctoral graduates and undergraduate computer science students) from the University of Tennessee, USA. Most of these users have between 3 and 20 years of experience in High Performance Computing (HPC), while the expert has more than 20 years of experience in developing visualisation tools used to analyse the performance of parallel programs. The expert and users were chosen to test the new callgraphs against the original callgraphs because they have experience in analysing, debugging, parallelising, optimising and developing parallel programs. After the evaluations, the final callgraph visualisation design was found to be effective, interactive, informative and easy to use. It is anticipated that the final design of the callgraph visualisation will help parallel computing users to effectively identify performance bottlenecks within parallel programs, and enable full utilisation of the computational resources within a supercomputer.
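The data underlying any callgraph visualisation is a mapping from routines to timing metrics. The following sketch (illustrative only; not TAU's data format, and the routine names and times are invented) shows how exclusive (own) times and call edges yield the inclusive times typically used to size callgraph nodes, assuming each routine appears once in the call tree.

```python
# Illustrative callgraph metric computation: inclusive time per routine,
# assuming a tree-shaped callgraph where each callee is counted once.

def inclusive_times(exclusive, calls):
    """exclusive: {routine: own time}; calls: {caller: [callees]}.
    Inclusive time = own time + inclusive time of everything it calls."""
    memo = {}
    def inc(r):
        if r not in memo:
            memo[r] = exclusive[r] + sum(inc(c) for c in calls.get(r, []))
        return memo[r]
    return {r: inc(r) for r in exclusive}

excl = {"main": 1.0, "solve": 5.0, "exchange_halo": 2.0}
graph = {"main": ["solve"], "solve": ["exchange_halo"]}
print(inclusive_times(excl, graph))
# {'main': 8.0, 'solve': 7.0, 'exchange_halo': 2.0}
```

Sizing or colouring nodes by these inclusive values is what lets a user spot, at a glance, that the `solve`/`exchange_halo` path dominates the run, which is exactly the bottleneck-finding task the design study targets.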

    Optimizations and Cost Models for multi-core architectures: an approach based on parallel paradigms

    The trend in modern microprocessor architectures is clear: multi-core chips are here to stay, and researchers expect multiprocessors with 128 to 1024 cores on a chip within a few years. Yet the software community is only slowly taking the path towards parallel programming: while some works target multi-cores, these are usually inherited from previous tools for SMP architectures, and rarely exploit the specific characteristics of multi-cores. Most importantly, current tools have no facilities to guarantee performance or portability among architectures. Our research group was one of the first to propose the structured parallel programming approach to solve the problem of performance portability and predictability. This was successfully demonstrated years ago for distributed and shared-memory multiprocessors, and we strongly believe that the same approach should be applied to multi-core architectures. The main problem with performance portability is that optimizations are effective only under specific conditions, making them dependent on both the specific program and the target architecture. For this reason, in current parallel programming (in general, but especially with multi-cores), optimization usually follows a try-and-decide approach: each optimization must be implemented and tested on the specific parallel program to understand its benefits. If we want to make a step forward and really achieve some form of performance portability, we require some kind of prediction of the expected performance of a program. The concept of performance modeling is quite old in the world of parallel programming; yet, in recent years, this line of research has seen only small improvements: cost models describing multi-cores are missing, mainly because of the increasing complexity of microarchitectures and the poor knowledge of specific implementation details of current processors.
In the first part of this thesis we show that performance modeling is still feasible, by studying the Tilera TilePro64. The high number of on-chip cores in this processor (64) required the use of several innovative solutions, such as a complex interconnection network and multiple memory interfaces per chip. Because of these features, the TilePro64 can be considered a preview of what to expect in future multi-core processors. The availability of a cycle-accurate simulator and extensive documentation allowed us to model the architecture, and in particular its memory subsystem, at the accuracy level required to compare optimizations. In the second part, focused on optimizations, we cover one of the most important issues of multi-core architectures: the memory subsystem. In this area multi-cores differ strongly in structure from off-chip parallel architectures, both SMP and NUMA, thus opening new opportunities. In detail, we investigate the problem of data distribution over the memory controllers in several commercial multi-cores, and the efficient use of the cache-coherency mechanisms offered by the TilePro64 processor. Finally, by using the performance model, we study different implementations, derived from the previous optimizations, of a simple test-case application. We are able to predict the best version using only data profiled from a sequential execution. The accuracy of the model has been verified by experimentally comparing the implementations on the real architecture, giving results within 1–2% accuracy.
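The version-selection step described above can be sketched as follows. This is a hedged illustration, not the thesis's actual cost model: a naive model predicts each candidate implementation's completion time from profiled sequential data plus per-worker coordination overhead, and the lowest prediction is chosen without ever running the parallel versions. All names and numbers are invented.

```python
# Illustrative cost-model-driven selection of a parallel implementation.
# The model and the candidate parameters are hypothetical.

def predict(t_seq, workers, overhead_per_worker):
    """Naive cost model: perfect partitioning plus a per-worker coordination cost."""
    return t_seq / workers + overhead_per_worker * workers

def best_version(t_seq, candidates):
    """candidates: {name: (workers, overhead)}. Returns the predicted-fastest name."""
    return min(candidates, key=lambda n: predict(t_seq, *candidates[n]))

candidates = {
    "farm_8":  (8, 0.05),   # 8 workers, moderate coordination cost
    "farm_32": (32, 0.05),  # more workers; overhead starts to dominate
    "map_16":  (16, 0.01),  # different structure, lower per-worker cost
}
print(best_version(10.0, candidates))  # 'map_16'
```

Even this toy model captures the key property claimed in the abstract: once the model is calibrated from a sequential profile, choosing among implementations becomes a prediction problem rather than a try-and-decide experiment.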