64 research outputs found
Development and application of real-time and interactive software for complex system
Soft materials have attracted considerable interest in recent years for predicting the characteristics of phase separation and self-assembly in nanoscale structures. A popular method for demonstrating and simulating the dynamic behaviour of particles (e.g. particle tracking) and to consider effects of simulation parameters is cell dynamic simulation (CDS). This is a cellular computerisation technique that can be used to investigate different aspects of morphological topographies of soft material systems. The acquisition of quantitative data from particles is a critical requirement in order to obtain a better understanding and of characterising their dynamic behaviour. To achieve this objective particle tracking methods considering quantitative data and focusing on different properties and components of particles is essential. Despite the availability of various types of particle tracking used in experimental work, there is no method available to consider uniform computational data. In order to achieve accurate and efficient computational results for cell dynamic simulation method and particle tracking, two factors are essential: computing/calculating time-scale and simulation system size. Consequently, finding available computing algorithms and resources such as sequential algorithm for implementing a complex technique and achieving precise results is critical and rather expensive. Therefore, it is highly desirable to consider a parallel algorithm and programming model to solve time-consuming and massive computational processing issues. Hence, the gaps between the experimental and computational works and solving time consuming for expensive computational calculations need to be filled in order to investigate a uniform computational technique for particle tracking and significant enhancements in speed and execution times.
The work presented in this thesis details a new particle tracking method for integrating diblock copolymers in the form of spheres with a shear flow and a novel designed GPU-based parallel acceleration approach to cell dynamic simulation (CDS). In addition, the evaluation of parallel models and architectures (CPUs and GPUs) utilising the mixtures of application program interface, OpenMP and programming model, CUDA were developed. Finally, this study presents the performance enhancements achieved with GPU-CUDA of approximately ~2 times faster than multi-threading implementation and 13~14 times quicker than optimised sequential processing for the CDS computations/workloads respectively
Multilayered abstractions for partial differential equations
How do we build maintainable, robust, and performance-portable scientific
applications? This thesis argues that the answer to this software engineering
question in the context of the finite element method is through the use of
layers of Domain-Specific Languages (DSLs) to separate the various concerns in
the engineering of such codes.
Performance-portable software achieves high performance on multiple diverse
hardware platforms without source code changes. We demonstrate that finite
element solvers written in a low-level language are not performance-portable,
and therefore code must be specialised to the target architecture by a code
generation framework. A prototype compiler for finite element variational forms
that generates CUDA code is presented, and is used to explore how good
performance on many-core platforms in automatically-generated finite element
applications can be achieved. The differing code generation requirements for
multi- and many-core platforms motivates the design of an additional
abstraction, called PyOP2, that enables unstructured mesh applications to be
performance-portable.
We present a runtime code generation framework comprised of the Unified Form
Language (UFL), the FEniCS Form Compiler, and PyOP2. This toolchain separates
the succinct expression of a numerical method from the selection and
generation of efficient code for local assembly. This is further decoupled from
the selection of data formats and algorithms for efficient parallel
implementation on a specific target architecture.
We establish the successful separation of these concerns by demonstrating the
performance-portability of code generated from a single high-level source code
written in UFL across sequential C, CUDA, MPI and OpenMP targets. The
performance of the generated code exceeds the performance of comparable
alternative toolchains on multi-core architectures.Open Acces
Consumo energético de métodos iterativos para sistemas dispersos en procesadores gráficos
La resolución de sistemas de ecuaciones lineales dispersos de gran dimensión es una de las operaciones más comunes en aplicaciones científicas y de ingeniería. El aumento de sus tamaños propicia el desarrollo de técnicas de Green Computing, que permiten diseñar aplicaciones conscientes de la energía, en las que la eficiencia energética es el objetivo prioritario.
En este Tesis Doctoral se ha diseñado una metodología basada en “técnicas de fusionado de kernels CUDA” que reduce el número de kernels, y con ello, costes de lanzamiento y transferencias de información. Su uso, junto con la sincronización de las GPUs en modo blocking, permite reducir el consumo energético en sistemas de cómputo heterogéneo, CPU-GPU. Estas técnicas tienen especial interés en GPUs que soporten paralelismo dinámico. La aplicación de esta metodología en la resolución de sistemas de ecuaciones lineales dispersos muestra mejoras destacables en eficiencia energética, obteniendo un compromiso entre rendimiento y consumo energético
A metadata-enhanced framework for high performance visual effects
This thesis is devoted to reducing the interactive latency of image processing computations in
visual effects. Film and television graphic artists depend upon low-latency feedback to receive
a visual response to changes in effect parameters. We tackle latency with a domain-specific optimising
compiler which leverages high-level program metadata to guide key computational and
memory hierarchy optimisations. This metadata encodes static and dynamic information about
data dependence and patterns of memory access in the algorithms constituting a visual effect –
features that are typically difficult to extract through program analysis – and presents it to the
compiler in an explicit form. By using domain-specific information as a substitute for program
analysis, our compiler is able to target a set of complex source-level optimisations that a vendor
compiler does not attempt, before passing the optimised source to the vendor compiler for
lower-level optimisation.
Three key metadata-supported optimisations are presented. The first is an adaptation of
space and schedule optimisation – based upon well-known compositions of the loop fusion and
array contraction transformations – to the dynamic working sets and schedules of a runtimeparameterised
visual effect. This adaptation sidesteps the costly solution of runtime code generation
by specialising static parameters in an offline process and exploiting dynamic metadata to
adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second
optimisation comprises a set of transformations to generate SIMD ISA-augmented source code.
Our approach differs from autovectorisation by using static metadata to identify parallelism, in
place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable
parameters for optimal aligned memory access. The third optimisation comprises a related set
of transformations to generate code for SIMT architectures, such as GPUs. Static dependence
metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads.
Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by identifying
inter-thread and intra-core data sharing opportunities in memory access metadata.
A detailed performance analysis of these optimisations is presented for two industrially developed
visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD
multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations
of these two effects. Programmability is enhanced by automating the generation of
SIMD and SIMT implementations from a single programmer-managed scalar representation
Master of Science
thesisTensors are mathematical representations of physical entities that have magnitude with multiple directions. Tensor contraction is a form of creating these objects using the Einstein summation equation. It is commonly used in physics and chemistry for solving problems like spectral elements and coupled cluster computation. Mathematically, tensor contraction operations can be reduced to expressions similar to matrix multiplications. However, linear algebra libraries (e.g., BLAS and LAPACK) perform poorly on the small matrix sizes that commonly arise in certain tensor contraction computations. Another challenge seen in the computation of tensor contraction is the dierence between the mathematical representation and an ecient implementation. This thesis proposes a framework that allows users to express a tensor contraction problem in a high-level mathematical representation and transform it into a linear algebra expression that is mapped to a high-performance implementation. The framework produces code that takes advantage of the parallelism that graphics processing units (GPUs) provide. It relies on autotuning to nd the preferred implementation that achieves high performance on the available device. Performance results from the benchmarks tested, nekbone and NWChem, show that the output of the framework achieves a speedup of 8.56x and 14.25x, respectively, on an NVIDIA Tesla C2050 GPU against the sequential version; while using an NVIDIA Tesla K20c GPU it achieved speedups of 8.87x and 17.62x. The parallel decompositions found by the tool were also tested with an OpenACC implementation and achieved a speedup of 8.87x and 10.42x for nekbone, while NWChem obtained a speedup of 7.25x and 10.34x compared to the choices made by default in the OpenACC compiler. The contributions of this work are: (1) a simplied interface that allows the user to express tensor contraction using a high-level representation and transform it into high-performance code; (2) a decision algorithm that explores a set of optimization strategies for achieving performance; and, (3) a demonstration that this approach can achieve better performance than OpenACC and can be used to accelerate OpenACC
BRUISE DETECTION IN APPLES USING 3D INFRARED IMAGING AND MACHINE LEARNING TECHNOLOGIES
Bruise detection plays an important role in fruit grading. A bruise detection system capable of finding and removing damaged products on the production lines will distinctly improve the quality of fruits for sale, and consequently improve the fruit economy. This dissertation presents a novel automatic detection system based on surface information obtained from 3D near-infrared imaging technique for bruised apple identification. The proposed 3D bruise detection system is expected to provide better performance in bruise detection than the existing 2D systems.
We first propose a mesh denoising filter to reduce noise effect while preserving the geometric features of the meshes. Compared with several existing mesh denoising filters, the proposed filter achieves better performance in reducing noise effect as well as preserving bruised regions in 3D meshes of bruised apples. Next, we investigate two different machine learning techniques for the identification of bruised apples. The first technique is to extract hand-crafted feature from 3D meshes, and train a predictive classifier based on hand-crafted features. It is shown that the predictive model trained on the proposed hand-crafted features outperforms the same models trained on several other local shape descriptors. The second technique is to apply deep learning to learn the feature representation automatically from the mesh data, and then use the deep learning model or a new predictive model for the classification. The optimized deep learning model achieves very high classification accuracy, and it outperforms the performance of the detection system based on the proposed hand-crafted features. At last, we investigate GPU techniques for accelerating the proposed apple bruise detection system. Specifically, the dissertation proposes a GPU framework, implemented in CUDA, for the acceleration of the algorithm that extracts vertex-based local binary patterns. Experimental results show that the proposed GPU program speeds up the process of extracting local binary patterns by 5 times compared to a single-core CPU program
AIMES: advanced computation and I/O methods for earth-system simulations
Dealing with extreme scale Earth-system models is challenging from the computer science perspective, as the required computing power and storage capacity are steadily increasing.
Scientists perform runs with growing resolution or aggregate results from many similar smaller-scale runs with slightly different initial conditions (the so-called ensemble runs).
In the fifth Coupled Model Intercomparison Project (CMIP5), the produced datasets require more than three Petabytes of storage and the compute and storage requirements are increasing significantly for CMIP6.
Climate scientists across the globe are developing next-generation models based on improved numerical formulation leading to grids that are discretized in alternative forms such as an icosahedral (geodesic) grid.
The developers of these models face similar problems in scaling, maintaining and optimizing code.
Performance portability and the maintainability of code are key concerns of scientists as, compared to industry projects, model code is continuously revised and extended to incorporate further levels of detail.
This leads to a rapidly growing code base that is rarely refactored.
However, code modernization is important to maintain productivity of the scientist working
with the code and for utilizing performance provided by modern and future architectures.
The need for performance optimization is motivated by the evolution of the parallel architecture landscape from
homogeneous flat machines to heterogeneous combinations of processors with deep memory hierarchy.
Notably, the rise of many-core, throughput-oriented accelerators, such as GPUs, requires non-trivial code changes at minimum and, even worse, may necessitate a substantial rewrite of the existing codebase.
At the same time, the code complexity increases the difficulty for computer scientists and vendors to understand and optimize the code for a given system.
Storing the products of climate predictions requires a large storage and archival system which is expensive.
Often, scientists restrict the number of scientific variables and write interval to keep the costs
balanced.
Compression algorithms can reduce the costs significantly but can also increase the scientific yield of simulation runs.
In the AIMES project, we addressed the key issues of programmability, computational efficiency and I/O limitations that are common in next-generation icosahedral earth-system models.
The project focused on the separation of concerns between domain scientist, computational scientists, and computer scientists
- …