63 research outputs found
Contributions to the development of an integrated toolbox of solvers in Derivative-Free Optimization
This dissertation is framed on the ongoing research project BoostDFO - Improving
the performance and moving to newer dimensions in Derivative-Free Optimization. The final
goal of this project is to develop efficient and robust algorithms for Global and/or
Multiobjective Derivative-free Optimization. This type of optimization is typically required
in complex scientific/industrial applications, where the function evaluation is
time-consuming and derivatives are not available for use, neither can be numerically
approximated. Often problems present several conflicting objectives or users aspire to
obtain global solutions.
Inspired by successful approaches used in single objective local Derivative-free Optimization,
we intend to address the inherent problem of the huge execution times by
resorting to parallel/cloud computing and carrying a detailed performance analysis. As
result, an integrated toolbox for solving single/multi objective, local/global Derivativefree
Optimization problems is made available, with recommendations for taking advantage
of parallelization and cloud computing, providing easy access to several efficient and
robust algorithms and allowing to tackle harder Derivative-free Optimization problems.Esta dissertação insere-se no projecto científico BoostDFO - Improving the performance
and moving to newer dimensions in Derivative-Free Optimization. O objectivo final desta
investigação é desenvolver algoritmos robustos e eficientes para problemas de Optimização
Sem Derivadas Globais e/ou Multiobjectivo. Este tipo de optimização é tipicamente
requerido em aplicações científicas/industriais complexas, onde a avaliação da função é
bastante demorada e as derivadas não se encontram disponíveis, nem podem ser aproximadas
numericamente. Os problemas apresentam frequentemente vários objectivos
divergentes ou os utilizadores procuram obter soluções globais.
Tendo por base abordagens prévias bem-sucedidas utilizadas em Optimização Sem
Derivadas local e uniobjectivo, pretende-se abordar o problema inerente aos grandes tempos
de execução, recorrendo ao paralelismo/computação em cloud e efectuando uma
detalhada análise de desempenho. Como resultado, é disponibilizada uma ferramenta
integrada destinada a problemas de Optimização Sem Derivadas uni/multiobjectivo, com
optimização local/global, incluindo recomendações que permitam tirar partido do paralelismo
e computação em cloud, facilitando o acesso a vários algoritmos robustos e eficientes
e permitindo abordar problemas mais difíceis nesta classe
Three Highly Parallel Computer Architectures and Their Suitability for Three Representative Artificial Intelligence Problems
Virtually all current Artificial Intelligence (AI) applications are designed to run on sequential (von Neumann) computer architectures. As a result, current systems do not scale up. As knowledge is added to these systems, a point is reached where their performance quickly degrades. The performance of a von Neumann machine is limited by the bandwidth between memory and processor (the von Neumann bottleneck). The bottleneck is avoided by distributing the processing power across the memory of the computer. In this scheme the memory becomes the processor (a smart memory ).
This paper highlights the relationship between three representative AI application domains, namely knowledge representation, rule-based expert systems, and vision, and their parallel hardware realizations. Three machines, covering a wide range of fundamental properties of parallel processors, namely module granularity, concurrency control, and communication geometry, are reviewed: the Connection Machine (a fine-grained SIMD hypercube), DADO (a medium-grained MIMD/SIMD/MSIMD tree-machine), and the Butterfly (a coarse-grained MIMD Butterflyswitch machine)
Programming Abstractions for Data Locality
The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, we are rapidly moving to an era that computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor their applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on the hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant to each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data centric and allow to describe how to decompose and how to layout data in the memory.Fortunately, there are many emerging concepts such as constructs for tiling, data layout, array views, task and thread affinity, and topology aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy to enable us to combine the best of these concepts to develop a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages to achieve this goal
Flexible and Efficient Control of Data Transfers for Loosely Coupled Components
Allowing loose coupling between the components of complex applications has many
advantages, such as flexibility in the components that can participate and
making it easier to model multiscale physical phenomena.
To support coupling of parallel and sequential application components,
I have designed and implemented a loosely coupled framework which has the
following characteristics: (1) connections between participating components
are separately identified from the individual components, (2) all data
transfers between data exporting and importing components are determined by
a runtime-based low overhead method (approximate match), (3) two
runtime-based optimization approaches, collective buffering and inverse-match
cache, are applied to speed up the applications in many common coupling modes,
and (4) a multi-threaded multi-process control protocol that can be
systematically constructed by the composition of sub-tasks protocols.
The proposed framework has been applied to two real world applications,
and the deployment approach and runtime performance are also studied.
Currently the framework runs on x86 Linux clusters, and porting strategies
for multicore x86 processors and advanced high performance computer architectures
are also explored
GPRM: a high performance programming framework for manycore processors
Processors with large numbers of cores are becoming commonplace. In order to utilise the
available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining
parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread
and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge.
In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks, their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks, rather than threads.
We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, only by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called “Global Sharing”, which improves performance in multiprogramming situations.
We use OpenMP, as the most popular model for shared-memory parallel programming as the main GPRM competitor for solving three well-known problems on both platforms: LU factorisation of Sparse Matrices, Image Convolution, and Linked List Processing. We focus on proposing solutions that best fit into the GPRM’s model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for the LU Factorisation results in notable performance improvement for sparse matrices with large numbers of small blocks. We investigate the overhead of GPRM’s task creation and distribution for very short computations using the Image Convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for Linked List processing and performs better than OpenMP implementations on the Xeon Phi.
The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without
sacrificing productivity
Data layout types : a type-based approach to automatic data layout transformations for improved SIMD vectorisation
The increasing complexity of modern hardware requires sophisticated programming
techniques for programs to run efficiently. At the same time, increased power of
modern hardware enables more advanced analyses to be included in compilers. This
thesis focuses on one particular optimisation technique that improves utilisation
of vector units. The foundation of this technique is the ability to chose memory
mappings for data structures of a given program.
Usually programming languages use a fixed layout for logical data structures
in physical memory. Such a static mapping often has a negative effect on usability
of vector units. In this thesis we consider a compiler for a programming language
that allows every data structure in a program to have its own data layout. We
make sure that data layouts across the program are sound, and most importantly
we solve a problem of automatic data layout reconstruction. To consistently do this,
we formulate this as a type inference problem, where type encodes a data layout
for a given structure as well as implied program transformations. We prove that
type-implied transformations preserve semantics of the original programs and we
demonstrate significant performance improvements when targeting SIMD-capable
architectures
Efficient Precise Dynamic Data Race Detection For Cpu And Gpu
Data races are notorious bugs. They introduce non-determinism in programs behavior, complicate programs semantics, making it challenging to debug parallel programs. To make parallel programming easier, efficient data race detection has been a research topic in the last decades. However, existing data race detectors either sacrifice precision or incur high overhead, limiting their application to real-world applications and scenarios. This dissertation proposes approaches to improve the performance of dynamic data race detection without undermining precision, by identifying and removing metadata redundancy dynamically. This dissertation also explores ways to make it practical to detect data races dynamically for GPU programs, which has a disparate programming and execution model from CPU workloads. Further, this dissertation shows how the structured synchronization model in GPU programs can simplify the algorithm design of
data race detection for GPU, and how the unique patterns in GPU workloads enable an efficient implementation of the algorithm, yielding a high-performance dynamic data race detector for GPU programs
Application of general semi-infinite Programming to Lapidary Cutting Problems
We consider a volume maximization problem arising in gemstone cutting industry. The problem is formulated as a general semi-infinite program (GSIP) and solved using an interiorpoint method developed by Stein. It is shown, that the convexity assumption needed for the convergence of the algorithm can be satisfied by appropriate modelling. Clustering techniques are used to reduce the number of container constraints, which is necessary to make the subproblems practically tractable. An iterative process consisting of GSIP optimization and adaptive refinement steps is then employed to obtain an optimal solution which is also feasible for the original problem. Some numerical results based on realworld data are also presented
Robustness against Relaxed Memory Models
Sequential Consistency (SC) is the memory model traditionally applied by programmers and verification tools for the analysis of multithreaded programs.
SC guarantees that instructions of each thread are executed atomically and in program order.
Modern CPUs implement memory models that relax the SC guarantees: threads can execute instructions out of order, stores to the memory can be observed by different threads in different order.
As a result of these relaxations, multithreaded programs can show unexpected, potentially undesired behaviors, when run on real hardware.
The robustness problem asks if a program has the same behaviors under SC and under a relaxed memory model.
Behaviors are formalized in terms of happens-before relations — dataflow and control-flow relations between executed instructions.
Programs that are robust against a memory model produce the same results under this memory model and under SC.
This means, they only need to be verified under SC, and the verification results will carry over to the relaxed setting.
Interestingly, robustness is a suitable correctness criterion not only for multithreaded programs, but also for parallel programs running on computer clusters.
Parallel programs written in Partitioned Global Address Space (PGAS) programming model, when executed on cluster, consist of multiple processes, each running on its cluster node.
These processes can directly access memories of each other over the network, without the need of explicit synchronization.
Reorderings and delays introduced on the network level, just as the reorderings done by the CPUs, may result into unexpected behaviors that are hard to reproduce and fix.
Our first contribution is a generic approach for solving robustness against relaxed memory models.
The approach involves two steps: combinatorial analysis, followed by an algorithmic development.
The aim of combinatorial analysis is to show that among program computations violating robustness there is always a computation in a certain normal form, where reorderings are applied in a restricted way.
In the algorithmic development we work out a decision procedure for checking whether a program has violating normal-form computations.
Our second contribution is an application of the generic approach to widely implemented memory models, including Total Store Order used in Intel x86 and Sun SPARC architectures, the memory model of Power architecture, and the PGAS memory model.
We reduce robustness against TSO to SC state reachability for a modified input program.
Robustness against Power and PGAS is reduced to language emptiness for a novel class of automata — multiheaded automata.
The reductions lead to new decidability results.
In particular, robustness is PSPACE-complete for all the considered memory models
- …