
    Parametric micro-level performance models for parallel computing and parallel implementation of hydrostatic MM5

    This dissertation presents parametric micro-level performance models and a parallel implementation of the hydrostatic version of MM5.

    Parametric micro-level (PM) performance models are introduced to address the important issue of how to realistically model parallel performance. These models can be used to predict execution times and identify performance bottlenecks. Accurate prediction and analysis of execution times is achieved by incorporating precise details of interprocessor communication, memory operations, auxiliary instructions, and the effects of communication and computation schedules. The parameters provide the flexibility to study various algorithmic and architectural issues. The development and verification process, the parameters, and the scope of applicability of these models are discussed. A coherent view of performance is obtained from the execution profiles generated by PM models. The models are targeted at a large class of numerical algorithms commonly implemented on both SIMD and MIMD machines. Specific models are presented for matrix multiplication, LU decomposition, and FFT on a 2-D processor array with distributed memory. A case study covers comparisons of parallel machines and of parallel algorithms. In the comparison of parallel machines, PM models are used to analyze execution times so as to relate performance to the architectural attributes of a machine. In the comparison of parallel algorithms, PM models are used to study the performance of two LU decomposition algorithms, non-blocked and blocked, and to identify the tradeoffs between them. This analysis is useful for determining an optimum block size for the blocked algorithm. The case study is done on the MasPar MP-1 and MP-2 machines.

    The dissertation also describes the parallel implementation of the hydrostatic version of MM5 (the fifth-generation Mesoscale Model), which has been widely used for climate studies. The model was parallelized in a machine-independent manner using the Runtime System Library (RSL), a runtime library for handling message passing and index transformation. The dissertation discusses validation of the parallel implementation of MM5 using field data and presents performance results. The parallel model was tested on the IBM SP1, a distributed memory parallel computer.
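    The abstract does not reproduce the model equations, so the C++ sketch below only illustrates the general shape of a micro-level parametric model: predicted execution time as a sum of computation, memory, and communication terms. The parameter names and the simplified Cannon-style operation count are assumptions for illustration, not the dissertation's actual model.

        #include <cstdio>

        // Hypothetical micro-level parameters (all times in seconds).
        struct PMParams {
            double t_flop;    // time per floating-point operation
            double t_mem;     // time per local memory access
            double t_startup; // per-message communication startup latency
            double t_word;    // per-word transfer time
        };

        // Predicted time for n x n matrix multiplication on a p x p processor
        // grid with block-distributed operands, using a simplified Cannon-style
        // count: each processor computes an (n/p) x (n/p) block of C and
        // exchanges its A and B blocks p times.
        double predict_matmul(int n, int p, const PMParams& m) {
            double b = static_cast<double>(n) / p;                    // block edge
            double comp = 2.0 * b * b * n * m.t_flop;                 // flops
            double mem  = 3.0 * b * b * n * m.t_mem;                  // loads/stores
            double comm = 2.0 * p * (m.t_startup + b * b * m.t_word); // block shifts
            return comp + mem + comm;
        }

        int main() {
            PMParams machine{50e-9, 40e-9, 80e-6, 1e-6}; // made-up machine values
            std::printf("predicted time: %g s\n", predict_matmul(1024, 8, machine));
        }

    Varying the parameters in such a model is what lets one study architectural tradeoffs (e.g., startup latency versus bandwidth) without rerunning the application.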

    IPCC++: A Concurrent C++ for Centralized and Distributed Memory Models.

    InterProcess Communication with C++ (IPCC++) is a concurrent object-oriented programming language that supports concurrency for centralized and distributed memory models while maintaining the high level of abstraction associated with object-oriented languages. The IPCC++ language model is a natural extension of the C++ programming language that introduces and supports the following concurrency features: a process concept, a mechanism for process instantiation, static and dynamic process declaration, inter-object concurrency, a monitor structure, condition variables, a socket structure, typed message-passing interprocess communication, synchronous and asynchronous communication, the client/server paradigm, and run-time communication error detection. These concurrency features are introduced as complete objects, using the primitives of object-oriented programming languages as the vehicle for their introduction. The underlying implementation of the components utilizes Parallel Virtual Machine (PVM), a software system that provides an abstraction of the UNIX operating system. A description of the object-oriented and concurrency paradigms is presented. The IPCC++ language model, which represents both paradigms, is defined, and an overview of the language and the features it supports is provided. The execution environment of the IPCC++ language model is described, along with the components of the model used to establish the IPCC++ environment. IPCC++ supports both the centralized and distributed memory models. Each memory model is defined along with the IPCC++ components necessary to support interprocess communication for it. The centralized memory model uses the monitor structure and condition variables to facilitate centralized interprocess communication, while the distributed memory model uses the socket structure along with a message-passing protocol to support distributed interprocess communication. The producer-consumer concurrency problem is presented with a corresponding IPCC++ solution designed for a centralized memory model, and the dining philosophers problem is presented with a corresponding IPCC++ solution designed for a distributed memory model. The language design and concurrency features of IPCC++ are discussed and compared with current research efforts that introduce concurrency to C++ for centralized and distributed memory models. A description of the IPCC++ implementation model, the preprocessing design, and the research contributions of the IPCC++ language is provided.
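    The abstract does not show IPCC++ syntax, so the sketch below expresses the same centralized-memory producer-consumer pattern it describes, i.e., a monitor with condition variables, in standard C++. The BoundedBuffer class and its capacity are illustrative choices, not IPCC++ constructs.

        #include <condition_variable>
        #include <cstddef>
        #include <iostream>
        #include <mutex>
        #include <queue>
        #include <thread>

        // Monitor-style bounded buffer: all shared state is guarded by one
        // mutex, and condition variables coordinate producers and consumers,
        // mirroring the monitor/condition-variable components IPCC++ offers
        // for its centralized memory model.
        class BoundedBuffer {
            std::queue<int> items_;
            const std::size_t capacity_;
            std::mutex m_;
            std::condition_variable not_full_, not_empty_;
        public:
            explicit BoundedBuffer(std::size_t cap) : capacity_(cap) {}
            void put(int v) {
                std::unique_lock<std::mutex> lk(m_);
                not_full_.wait(lk, [&]{ return items_.size() < capacity_; });
                items_.push(v);
                not_empty_.notify_one();
            }
            int get() {
                std::unique_lock<std::mutex> lk(m_);
                not_empty_.wait(lk, [&]{ return !items_.empty(); });
                int v = items_.front();
                items_.pop();
                not_full_.notify_one();
                return v;
            }
        };

        int main() {
            BoundedBuffer buf(4);
            std::thread producer([&]{ for (int i = 0; i < 8; ++i) buf.put(i); });
            std::thread consumer([&]{ for (int i = 0; i < 8; ++i)
                                          std::cout << buf.get() << ' '; });
            producer.join();
            consumer.join();
        }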

    On the Quality Properties of Model Transformations: Performance and Correctness

    The increasing complexity of software due to continuous technological advances has motivated the use of models in the software development process. Initially, models were mainly used as drafts to help developers understand their programs; later they were used extensively, and a new discipline called Model-Driven Engineering (MDE) was born. In the MDE paradigm, aside from the models themselves, model transformations (MT) are garnering interest because they allow the analysis and manipulation of models. The performance, scalability, and correctness of model transformations have therefore become critical issues and deserve thorough study. Existing model transformation engines are principally based on sequential, in-memory execution strategies, so their ability to transform very large models in parallel and in distributed environments is limited. Current tools and languages cannot cope with models that are not located on a single machine and, worse, most of them require the model to be in a single file. Moreover, once a model transformation has been written and executed, whether sequentially or in parallel, it is necessary to rely on methods, mechanisms, and tools for checking its correctness. The contribution of this dissertation is twofold. First, we introduce a novel execution platform that permits the parallel execution of both out-place and in-place model transformations, regardless of whether the models fit into a single machine's memory. This platform can be used as a target for high-level transformation language compilers, so existing model transformations do not need to be rewritten in another language but only have to be executed more efficiently; a further advantage is that a developer who is familiar with an existing model transformation language does not need to learn a new one. In addition to performance, the correctness of model transformations is an essential aspect that must be addressed if MTs are to be used in realistic industrial settings. Because the most popular model transformation languages are rule-based, i.e., transformations written in those languages comprise rules that define how model elements are transformed, the second contribution of this thesis is a static approach for locating faulty rules in model transformations. Current approaches that can fully prove correctness, such as model checking techniques, require an unacceptable amount of time and memory. Our approach cannot fully prove correctness but can be very useful for identifying bugs quickly and cost-effectively at an early development stage.
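    For readers unfamiliar with rule-based, out-place transformations, the C++ sketch below is a schematic analogue of the classic Class-to-Table example: each rule pairs a guard with a mapping, and matched source elements produce new target elements without modifying the source model. The Class, Table, and Rule types are hypothetical stand-ins, not the thesis's transformation language.

        #include <functional>
        #include <iostream>
        #include <string>
        #include <vector>

        // Schematic source and target metamodels (hypothetical).
        struct Class { std::string name; bool persistent; };
        struct Table { std::string name; };

        // A transformation rule: a guard deciding whether it applies to an
        // element, plus a mapping that creates the corresponding target element.
        struct Rule {
            std::function<bool(const Class&)> guard;
            std::function<Table(const Class&)> map;
        };

        int main() {
            // Class2Table: only persistent classes become tables.
            std::vector<Rule> rules{{
                [](const Class& c) { return c.persistent; },
                [](const Class& c) { return Table{c.name + "_tbl"}; },
            }};
            std::vector<Class> model{{"Order", true}, {"Draft", false}};
            std::vector<Table> out;  // out-place: the source model is untouched
            for (const auto& c : model)
                for (const auto& r : rules)
                    if (r.guard(c)) out.push_back(r.map(c));
            for (const auto& t : out) std::cout << t.name << '\n';
        }

    A fault-localization approach of the kind the thesis proposes would statically relate an observed incorrect target element back to the rule whose guard and mapping produced it.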

    A scalable system for factored learning in the cloud

    Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013. This electronic version was submitted by the student author; the certified thesis is available in the Institute Archives and Special Collections. Cataloged from the student-submitted PDF version of the thesis. Includes bibliographical references (pages 79-81).

    This work presents FlexGP, a new system designed for scalable machine learning in the cloud. FlexGP presents a learner-agnostic, data-parallel approach to cloud-based distributed learning using existing single-machine algorithms, without any dependence on distributed file systems or shared memory between instances. We design and implement asynchronous and decentralized launch and peer-discovery protocols to start and configure a distributed network of learners. Through a unique process of factoring the data and parameters across the learners, FlexGP ensures this network consists of heterogeneous learners producing diverse models. These models are then filtered and fused to produce a meta-model for prediction. Using a thoughtfully designed test framework, FlexGP is run on a real-world regression problem from a large database. The results demonstrate the reliability and robustness of the system, even when learning from very little training data across multiple factorings, and establish FlexGP as a vital tool for effectively leveraging the cloud for machine learning tasks.

    by Owen C. Derby. M.Eng.
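    The thesis's launch and fusion protocols are not detailed in the abstract. The C++ sketch below only illustrates the factoring-and-fusion idea generically: each learner trains on a random subset of rows and features (producing diverse models), and predictions are averaged into a meta-model. The trivial per-feature-mean "model" is a hypothetical stand-in for a real single-machine learner, not FlexGP's actual algorithm.

        #include <algorithm>
        #include <iostream>
        #include <numeric>
        #include <random>
        #include <vector>

        // One learner's factored view: a random half of the features, trained
        // on a random half of the rows. Its "model" is per-feature means.
        struct Model { std::vector<int> features; std::vector<double> means; };

        Model train(const std::vector<std::vector<double>>& X, std::mt19937& rng) {
            const int n_feats = static_cast<int>(X[0].size());
            std::vector<int> idx(n_feats);
            std::iota(idx.begin(), idx.end(), 0);
            std::shuffle(idx.begin(), idx.end(), rng);  // factor the features
            Model m;
            m.features.assign(idx.begin(), idx.begin() + n_feats / 2);
            std::bernoulli_distribution keep(0.5);      // factor the rows
            for (int f : m.features) {
                double sum = 0; int cnt = 0;
                for (const auto& row : X)
                    if (keep(rng)) { sum += row[f]; ++cnt; }
                m.means.push_back(cnt ? sum / cnt : 0.0);
            }
            return m;
        }

        int main() {
            std::mt19937 rng(42);
            std::uniform_real_distribution<double> u(0.0, 1.0);
            std::vector<std::vector<double>> X(100, std::vector<double>(6));
            for (auto& row : X) for (auto& v : row) v = u(rng); // synthetic data

            std::vector<Model> ensemble;                // diverse learners
            for (int i = 0; i < 4; ++i) ensemble.push_back(train(X, rng));

            // Fusion: average each learner's output into a meta-model prediction.
            double fused = 0.0;
            for (const auto& m : ensemble)
                fused += std::accumulate(m.means.begin(), m.means.end(), 0.0)
                         / m.means.size();
            std::cout << "fused prediction: " << fused / ensemble.size() << '\n';
        }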

    Scalable data abstractions for distributed parallel computations

    The ability to express a program as a hierarchical composition of parts is an essential tool in managing the complexity of software, and a key abstraction this provides is the separation of the representation of data from the computation. Many current parallel programming models use a shared memory model to provide data abstraction, but this does not scale well to large numbers of cores due to non-determinism and access latency. This paper proposes a simple programming model that allows scalable parallel programs to be expressed with distributed representations of data, giving the programmer the flexibility to employ shared or distributed styles of data-parallelism where applicable. The model admits an efficient implementation and, with the provision of a small set of primitive capabilities in the hardware, can be compiled to operate directly on the hardware, in the same way that stack-based allocation operates for subroutines in sequential machines.
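    As a rough illustration of separating data representation from computation (the paper's own abstractions are not shown in this summary), the C++ sketch below hides a partitioned layout behind an array-like type and exposes a data-parallel map; threads stand in for distributed cores. The PartitionedArray type and its interface are assumptions for illustration only.

        #include <cstddef>
        #include <iostream>
        #include <numeric>
        #include <thread>
        #include <vector>

        // A toy distributed data abstraction: the array's representation is
        // split into per-worker partitions, while computations are written
        // against the abstraction rather than the layout.
        class PartitionedArray {
            std::vector<std::vector<double>> parts_;
        public:
            PartitionedArray(std::size_t n, std::size_t n_parts) : parts_(n_parts) {
                for (std::size_t i = 0; i < n; ++i)
                    parts_[i % n_parts].push_back(static_cast<double>(i));
            }
            // Data-parallel map: each partition is transformed independently.
            template <class F>
            void map(F f) {
                std::vector<std::thread> workers;
                for (auto& p : parts_)
                    workers.emplace_back([&p, f] { for (auto& x : p) x = f(x); });
                for (auto& w : workers) w.join();
            }
            double sum() const {
                double s = 0;
                for (const auto& p : parts_)
                    s = std::accumulate(p.begin(), p.end(), s);
                return s;
            }
        };

        int main() {
            PartitionedArray a(1000, 4);
            a.map([](double x) { return 2.0 * x; }); // computation, not layout
            std::cout << a.sum() << '\n';            // expect 999000
        }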