
    Contributions to the development of an integrated toolbox of solvers in Derivative-Free Optimization

    This dissertation is framed within the ongoing research project BoostDFO - Improving the performance and moving to newer dimensions in Derivative-Free Optimization. The final goal of this project is to develop efficient and robust algorithms for Global and/or Multiobjective Derivative-Free Optimization. This type of optimization is typically required in complex scientific/industrial applications, where function evaluations are time-consuming and derivatives are neither available for use nor amenable to numerical approximation. Problems often present several conflicting objectives, or users aspire to obtain global solutions. Inspired by successful approaches used in single-objective local Derivative-Free Optimization, we intend to address the inherent problem of huge execution times by resorting to parallel/cloud computing and carrying out a detailed performance analysis. As a result, an integrated toolbox for solving single/multi-objective, local/global Derivative-Free Optimization problems is made available, with recommendations for taking advantage of parallelization and cloud computing, providing easy access to several efficient and robust algorithms and making it possible to tackle harder Derivative-Free Optimization problems.
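    The local single-objective setting that inspired the project can be illustrated with compass (coordinate) search, a classic directional direct-search method that uses only function values. The sketch below is a generic illustration, not code from the BoostDFO toolbox; the objective and tolerances are hypothetical stand-ins for an expensive black-box simulation.

        #include <cstdio>
        #include <vector>

        // Hypothetical black-box objective (stands in for a costly simulation).
        double objective(const std::vector<double>& x) {
            return (x[0] - 1.0) * (x[0] - 1.0) + 2.0 * (x[1] + 0.5) * (x[1] + 0.5);
        }

        int main() {
            std::vector<double> x = {0.0, 0.0};   // starting point
            double step = 1.0;                    // initial step size
            double fx = objective(x);
            while (step > 1e-6) {
                bool improved = false;
                // Poll the 2n coordinate directions +/- step * e_i.
                for (std::size_t i = 0; i < x.size() && !improved; ++i) {
                    for (double s : {step, -step}) {
                        std::vector<double> y = x;
                        y[i] += s;
                        double fy = objective(y);
                        if (fy < fx) { x = y; fx = fy; improved = true; break; }
                    }
                }
                if (!improved) step *= 0.5;       // shrink on an unsuccessful poll
            }
            std::printf("minimizer ~ (%g, %g), f = %g\n", x[0], x[1], fx);
            return 0;
        }

    The 2n polled points in each iteration are independent function evaluations, which is precisely where the parallel/cloud evaluation studied in the project pays off.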

    Three Highly Parallel Computer Architectures and Their Suitability for Three Representative Artificial Intelligence Problems

    Virtually all current Artificial Intelligence (AI) applications are designed to run on sequential (von Neumann) computer architectures. As a result, current systems do not scale up: as knowledge is added to these systems, a point is reached where their performance quickly degrades. The performance of a von Neumann machine is limited by the bandwidth between memory and processor (the von Neumann bottleneck). The bottleneck is avoided by distributing the processing power across the memory of the computer; in this scheme the memory becomes the processor (a "smart memory"). This paper highlights the relationship between three representative AI application domains, namely knowledge representation, rule-based expert systems, and vision, and their parallel hardware realizations. Three machines, covering a wide range of fundamental properties of parallel processors, namely module granularity, concurrency control, and communication geometry, are reviewed: the Connection Machine (a fine-grained SIMD hypercube), DADO (a medium-grained MIMD/SIMD/MSIMD tree machine), and the Butterfly (a coarse-grained MIMD Butterfly-switch machine).
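    The fine-grained data-parallel style of the Connection Machine, where one instruction is broadcast to a processor attached to every memory cell, can be approximated on current hardware with parallel algorithms. A minimal sketch (modern C++, not from the paper; the activation array is a hypothetical example):

        #include <algorithm>
        #include <cstdio>
        #include <execution>
        #include <vector>

        int main() {
            // One value per notional processing element; the Connection Machine
            // pairs each memory cell with its own tiny processor.
            std::vector<int> activation(1 << 16, 1);
            // Broadcast the same operation to all elements, SIMD style.
            std::transform(std::execution::par_unseq,
                           activation.begin(), activation.end(),
                           activation.begin(),
                           [](int a) { return a * 2; });
            std::printf("first element: %d\n", activation[0]);
            return 0;
        }

    (Parallel standard algorithms need C++17 and, with some toolchains, a parallel runtime such as TBB linked in.)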

    Programming Abstractions for Data Locality

    The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, but we are rapidly moving to an era in which computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high-performance computing systems), the scientific computing community needs to refactor its applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately, current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume that all processing elements are equidistant from each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data-centric and allow developers to describe how to decompose data and how to lay it out in memory. Fortunately, there are many emerging concepts for managing data locality, such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries. There is an opportunity to identify commonalities in strategy that enable us to combine the best of these concepts into a comprehensive approach to expressing and managing data locality in exascale programming systems. These programming-model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction for achieving this goal, with candidate techniques ranging from template libraries all the way to completely new languages.
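    Tiling, one of the locality constructs the report surveys, can be sketched directly in a loop nest; the abstractions under discussion aim to let programmers state such decompositions declaratively instead of hand-coding them. A generic illustration (not an API from any surveyed model):

        #include <cstdio>
        #include <vector>

        constexpr int N = 1024, B = 64;   // matrix size and tile size (illustrative)

        // Tiled transpose: each B x B tile is read and written while it is
        // still cache-resident, reducing movement between memory levels.
        void transpose_tiled(const std::vector<double>& in, std::vector<double>& out) {
            for (int ii = 0; ii < N; ii += B)
                for (int jj = 0; jj < N; jj += B)
                    for (int i = ii; i < ii + B; ++i)      // work within one tile
                        for (int j = jj; j < jj + B; ++j)
                            out[j * N + i] = in[i * N + j];
        }

        int main() {
            std::vector<double> a(N * N, 1.0), t(N * N);
            transpose_tiled(a, t);
            std::printf("t[0] = %g\n", t[0]);
            return 0;
        }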

    Flexible and Efficient Control of Data Transfers for Loosely Coupled Components

    Allowing loose coupling between the components of complex applications has many advantages, such as flexibility in the components that can participate and easier modeling of multiscale physical phenomena. To support coupling of parallel and sequential application components, I have designed and implemented a loosely coupled framework with the following characteristics: (1) connections between participating components are identified separately from the individual components, (2) all data transfers between data-exporting and data-importing components are determined by a low-overhead runtime method (approximate match), (3) two runtime optimization approaches, collective buffering and the inverse-match cache, are applied to speed up applications in many common coupling modes, and (4) a multi-threaded, multi-process control protocol can be systematically constructed by composing sub-task protocols. The proposed framework has been applied to two real-world applications, and the deployment approach and runtime performance are also studied. Currently the framework runs on x86 Linux clusters; porting strategies for multicore x86 processors and advanced high-performance computer architectures are also explored.
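    The approximate-match idea can be pictured as pairing each import request with the closest exported snapshot in time. The following sketch is purely illustrative; the function name and the nearest-timestamp rule are assumptions, not the dissertation's actual interface.

        #include <cstdio>
        #include <iterator>
        #include <map>

        // Snapshots exported by one component, keyed by simulation timestamp.
        std::map<double, int> exported = {{0.0, 10}, {0.5, 11}, {1.0, 12}};

        // Hypothetical approximate match: serve the snapshot whose timestamp
        // is nearest to the importer's requested time t.
        int import_nearest(double t) {
            auto hi = exported.lower_bound(t);
            if (hi == exported.begin()) return hi->second;
            if (hi == exported.end()) return std::prev(hi)->second;
            auto lo = std::prev(hi);
            return (t - lo->first <= hi->first - t) ? lo->second : hi->second;
        }

        int main() {
            std::printf("import at t=0.7 -> %d\n", import_nearest(0.7));  // matches t=0.5
            return 0;
        }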

    GPRM: a high performance programming framework for manycore processors

    Processors with large numbers of cores are becoming commonplace. In order to utilise the available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge. In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks and their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it depends heavily on tasks rather than threads. We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, merely by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called "Global Sharing", which improves performance in multiprogramming situations. We use OpenMP, the most popular model for shared-memory parallel programming, as the main GPRM competitor for solving three well-known problems on both platforms: LU factorisation of sparse matrices, image convolution, and linked list processing. We focus on proposing solutions that best fit GPRM's model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for LU factorisation yields notable performance improvements for sparse matrices with large numbers of small blocks. We investigate the overhead of GPRM's task creation and distribution for very short computations using the image convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for linked list processing and performs better than OpenMP implementations on the Xeon Phi. The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without sacrificing productivity.
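    The granularity point, combining many small units of work into fewer, larger tasks so that useful work dominates creation and scheduling overhead, can be illustrated with OpenMP tasks (a generic sketch; GPRM's own task constructs differ):

        #include <cstdio>
        #include <vector>

        int main() {
            const int n = 1 << 20, chunk = 1 << 14;   // chunk size sets task granularity
            std::vector<float> a(n, 1.0f);
            #pragma omp parallel
            #pragma omp single
            for (int lo = 0; lo < n; lo += chunk) {
                // One task per chunk rather than per element: fewer, larger
                // tasks amortize the cost of task creation and dispatch.
                #pragma omp task firstprivate(lo) shared(a)
                for (int i = lo; i < lo + chunk; ++i)
                    a[i] = a[i] * 2.0f + 1.0f;
            }
            std::printf("a[0] = %g\n", a[0]);   // all tasks done: parallel region ended
            return 0;
        }

    Halving chunk doubles the number of tasks; once per-task work shrinks below the scheduling cost, performance drops, which is exactly the trade-off the image convolution experiments quantify.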

    Data layout types: a type-based approach to automatic data layout transformations for improved SIMD vectorisation

    The increasing complexity of modern hardware requires sophisticated programming techniques for programs to run efficiently. At the same time, the increased power of modern hardware enables more advanced analyses to be included in compilers. This thesis focuses on one particular optimisation technique that improves the utilisation of vector units. The foundation of this technique is the ability to choose memory mappings for the data structures of a given program. Usually, programming languages use a fixed layout for logical data structures in physical memory. Such a static mapping often has a negative effect on the usability of vector units. In this thesis we consider a compiler for a programming language that allows every data structure in a program to have its own data layout. We make sure that data layouts across the program are sound, and, most importantly, we solve the problem of automatic data layout reconstruction. To do this consistently, we formulate it as a type inference problem, where a type encodes the data layout for a given structure as well as the implied program transformations. We prove that type-implied transformations preserve the semantics of the original programs, and we demonstrate significant performance improvements when targeting SIMD-capable architectures.
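    A concrete instance of such a layout choice is the array-of-structures versus structure-of-arrays decision: the logical data structure is the same, but only one mapping gives vector units unit-stride access. A generic illustration of the kind of transformation the thesis infers automatically:

        #include <cstdio>

        constexpr int N = 1024;

        // Array of structures: x and y interleave in memory, so a loop over
        // all x values has stride 2 and vectorises poorly.
        struct PointAoS { float x, y; };

        // Structure of arrays: each field is contiguous, so the same loop
        // maps directly onto SIMD lanes.
        struct PointsSoA { float x[N]; float y[N]; };

        PointsSoA soa;   // zero-initialised at namespace scope

        int main() {
            for (int i = 0; i < N; ++i) soa.x[i] = 1.0f;
            float sum = 0.0f;
            for (int i = 0; i < N; ++i)   // unit-stride, vectorisable
                sum += soa.x[i];
            std::printf("sum = %g\n", sum);
            return 0;
        }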

    Efficient Precise Dynamic Data Race Detection for CPU and GPU

    Data races are notorious bugs. They introduce non-determinism into program behavior and complicate program semantics, making parallel programs challenging to debug. To make parallel programming easier, efficient data race detection has been an active research topic for decades. However, existing data race detectors either sacrifice precision or incur high overhead, limiting their applicability to real-world applications and scenarios. This dissertation proposes approaches to improve the performance of dynamic data race detection without undermining precision, by identifying and removing metadata redundancy dynamically. This dissertation also explores ways to make dynamic data race detection practical for GPU programs, whose programming and execution model differs markedly from that of CPU workloads. Further, this dissertation shows how the structured synchronization model of GPU programs can simplify the algorithmic design of data race detection for GPUs, and how the unique patterns in GPU workloads enable an efficient implementation of the algorithm, yielding a high-performance dynamic data race detector for GPU programs.
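    A minimal example of the kind of bug these detectors target: two threads update a shared counter with no synchronization, so the conflicting accesses are unordered by happens-before and a precise dynamic detector reports a race (illustrative C++, not tied to the dissertation's tool):

        #include <cstdio>
        #include <thread>

        int counter = 0;   // shared and unprotected

        void work() {
            for (int i = 0; i < 100000; ++i)
                ++counter;   // unsynchronized read-modify-write: a data race
        }

        int main() {
            std::thread t1(work), t2(work);
            t1.join();
            t2.join();
            // The total is non-deterministic; a dynamic detector such as
            // ThreadSanitizer flags the unordered conflicting accesses
            // rather than waiting for a wrong result to surface.
            std::printf("counter = %d (200000 expected)\n", counter);
            return 0;
        }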

    Application of General Semi-Infinite Programming to Lapidary Cutting Problems

    We consider a volume maximization problem arising in the gemstone cutting industry. The problem is formulated as a general semi-infinite program (GSIP) and solved using an interior-point method developed by Stein. It is shown that the convexity assumption needed for the convergence of the algorithm can be satisfied by appropriate modelling. Clustering techniques are used to reduce the number of container constraints, which is necessary to make the subproblems practically tractable. An iterative process consisting of GSIP optimization and adaptive refinement steps is then employed to obtain an optimal solution that is also feasible for the original problem. Some numerical results based on real-world data are also presented.
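    In standard notation, a GSIP takes the form

        \max_{x \in X} \; f(x)
        \quad \text{s.t.} \quad g(x, y) \le 0 \;\; \text{for all } y \in Y(x),

    where, in the lapidary application, f(x) is the volume of the cut design, the infinitely many constraints keep the design inside the rough stone for every boundary point y, and the index set Y(x) itself depends on the design variables x, which is what makes the program general (rather than standard) semi-infinite.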

    Robustness against Relaxed Memory Models

    Sequential Consistency (SC) is the memory model traditionally assumed by programmers and verification tools in the analysis of multithreaded programs. SC guarantees that the instructions of each thread are executed atomically and in program order. Modern CPUs implement memory models that relax the SC guarantees: threads can execute instructions out of order, and stores to memory can be observed by different threads in different orders. As a result of these relaxations, multithreaded programs can show unexpected, potentially undesired behaviors when run on real hardware. The robustness problem asks whether a program has the same behaviors under SC and under a relaxed memory model. Behaviors are formalized in terms of happens-before relations: dataflow and control-flow relations between executed instructions. Programs that are robust against a memory model produce the same results under that memory model and under SC. This means they only need to be verified under SC, and the verification results will carry over to the relaxed setting. Interestingly, robustness is a suitable correctness criterion not only for multithreaded programs, but also for parallel programs running on computer clusters. Parallel programs written in the Partitioned Global Address Space (PGAS) programming model, when executed on a cluster, consist of multiple processes, each running on its own cluster node. These processes can directly access each other's memories over the network, without the need for explicit synchronization. Reorderings and delays introduced at the network level, just like the reorderings done by CPUs, may result in unexpected behaviors that are hard to reproduce and fix. Our first contribution is a generic approach for solving robustness against relaxed memory models. The approach involves two steps: a combinatorial analysis, followed by an algorithmic development. The aim of the combinatorial analysis is to show that among the program computations violating robustness there is always a computation in a certain normal form, where reorderings are applied in a restricted way. In the algorithmic development we work out a decision procedure for checking whether a program has violating normal-form computations. Our second contribution is an application of the generic approach to widely implemented memory models, including Total Store Order (TSO), used in Intel x86 and Sun SPARC architectures, the memory model of the Power architecture, and the PGAS memory model. We reduce robustness against TSO to SC state reachability for a modified input program. Robustness against Power and PGAS is reduced to language emptiness for a novel class of automata: multiheaded automata. The reductions lead to new decidability results; in particular, robustness is PSPACE-complete for all the considered memory models.
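    The classic store-buffering litmus test gives a compact picture of non-robustness against TSO: a behavior impossible under SC becomes reachable once stores may wait in per-thread store buffers. A sketch using C++ relaxed atomics, which permit the same reordering:

        #include <atomic>
        #include <cstdio>
        #include <thread>

        std::atomic<int> x{0}, y{0};
        int r1 = 0, r2 = 0;

        int main() {
            // Each thread writes one flag, then reads the other.
            std::thread t1([] {
                x.store(1, std::memory_order_relaxed);
                r1 = y.load(std::memory_order_relaxed);
            });
            std::thread t2([] {
                y.store(1, std::memory_order_relaxed);
                r2 = x.load(std::memory_order_relaxed);
            });
            t1.join(); t2.join();
            // Under SC at least one thread must observe the other's store,
            // so r1 == 0 && r2 == 0 is impossible; TSO's store buffers allow
            // it, hence this program is not robust against TSO.
            std::printf("r1=%d r2=%d\n", r1, r2);
            return 0;
        }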