61 research outputs found
Increasing the performance of the Wetland DEM Ponding Model using multiple GPUs
Due to the lack of conventional drainage systems on the Canadian Prairies, when excess water runs off the landscape because of the snow-melt and heavy rainfall, the water may be trapped in surface depressions ranging in size from puddles to permanent wetlands and may cause local flooding. Hydrological processes play an important role in the Canadian Prairies regions, and using hydrological simulation models helps people understand past hydrological events and predict future ones. In order to obtain an accurate simulation, higher-resolution systems and larger simulation areas are introduced, and those lead to the need to solve larger-scale problems. However, the size of the problem is often limited by available computational resources, and solving large systems results in unacceptable simulation durations. Therefore, improving the computational efficiency and taking advantage of available computational resources is an urgent task for hydrological researchers and software developers. The Wetland DEM Ponding Model (WDPM) was developed to model the distribution of runoff water on the Canadian Prairies. It helps determine the fraction of Prairie basins contributing flows to stream while these change dynamically with water storage in the depressions. In the WDPM, the water redistribution module is the most computationally intensive part. Previously, the WDPM has been developed to run in parallel with one CPU or one GPU that makes the water redistribution module more efficient. Multi-device parallel computing is a common method to increase the available computation resources and could effectively speed up the application with an appropriate parallel algorithm.
This thesis develops a multiple-GPU parallel algorithm and investigates efficient data transmission methods compared to the CPU parallel and one-GPU parallel algorithm. A technique that overlaps communication with computation is applied to optimize the parallel computing process. Then the thesis evaluates the new implementation from several aspects. In the first step, the output summary and the output system are compared between the new implementation and the initial one. The solution shows significant convergence as the simulation processes, verifying the new implementation produces the correct result. In the second step, the multiple-GPU code is profiled, and it is verified that the algorithm can be re-organized to take advantage of multiple GPUs and carry out efficient data synchronization through optimized techniques. Finally, by means of numerical experiments, the new implementation shows performance improvement when using multiple GPUs and demonstrates good scaling. In particular, when working with a large system, the multiple-GPU implementation produces correct output and shows that there is around 2.35 times improvement in the performance compared using four GPUs with using one GPU
Evaluating the performance of legacy applications on emerging parallel architectures
The gap between a supercomputer's theoretical maximum (\peak")
oatingpoint
performance and that actually achieved by applications has grown wider
over time. Today, a typical scientific application achieves only 5{20% of any
given machine's peak processing capability, and this gap leaves room for significant
improvements in execution times.
This problem is most pronounced for modern \accelerator" architectures
{ collections of hundreds of simple, low-clocked cores capable of executing the
same instruction on dozens of pieces of data simultaneously. This is a significant
change from the low number of high-clocked cores found in traditional CPUs,
and effective utilisation of accelerators typically requires extensive code and
algorithmic changes. In many cases, the best way in which to map a parallel
workload to these new architectures is unclear.
The principle focus of the work presented in this thesis is the evaluation
of emerging parallel architectures (specifically, modern CPUs, GPUs and Intel
MIC) for two benchmark codes { the LU benchmark from the NAS Parallel
Benchmark Suite and Sandia's miniMD benchmark { which exhibit complex
parallel behaviours that are representative of many scientific applications. Using
combinations of low-level intrinsic functions, OpenMP, CUDA and MPI, we
demonstrate performance improvements of up to 7x for these workloads.
We also detail a code development methodology that permits application developers
to target multiple architecture types without maintaining completely
separate implementations for each platform. Using OpenCL, we develop performance
portable implementations of the LU and miniMD benchmarks that are
faster than the original codes, and at most 2x slower than versions highly-tuned
for particular hardware.
Finally, we demonstrate the importance of evaluating architectures at scale
(as opposed to on single nodes) through performance modelling techniques,
highlighting the problems associated with strong-scaling on emerging accelerator
architectures
Code generation for 3D partial differential equation models from a high-level functional intermediate language
Partial Differential Equation (PDE) modelling is an important tool in scientific domains for bridging
theory with reality; however, they can be complex to program and even more difficult to abstract. The
evolving parallel computing landscape is also making it increasingly difficult to write and maintain codes
(such as PDE models) which retain performance across different parallel platforms. Computational
scientists should be able to focus on their science instead of also having to become high performance
computing experts in order to take advantage of faster parallel hardware. Current methods targeting this
problem either concentrate on very niche applications, are too simplistic for real world problems or are
too low-level to be easily programmable. Domain Specific Languages (DSLs) are a popular approach,
but they have two opposing goals: improving programmability, while also providing high performance.
This thesis presents a solution for developing performance portable 3D PDE models, using room
acoustics simulations as a case study, by raising the abstraction level in the existing hardware-agnostic,
intermediary language LIFT. This functional language and compiler is designed for DSLs to compile into
and provides a separation of concerns for developing parallel applications. This separation enables DSL
writers to focus on developing high-level abstractions providing productivity to the user, while LIFT turns
the intermediary parallel representation these abstractions compile down to into hardware-optimised
code. A suite of composable, algorithmic primitives enables LIFT to reuse functionality across domains
and an exploratory search space provides a way to find the best optimisations for a given platform.
As this thesis shows, room acoustic simulations are expressible in LIFT with only a few small
changes to the framework. These expressions are able to achieve comparable or better performance
to original hand-written benchmarks. Furthermore, such expressions enable room acoustics models to
run across multiple platforms and easily swap in optimisations. Being able to test out what optimisations
give the best performance for a given platform — without rewriting or retuning — allows computational
scientists to focus on their own work.
Optimisations previously inaccessible in LIFT are developed that target 3D stencils generally, including 3D PDE models. In particular, 2.5D Tiling and compiler passes to inline private arrays and structs
are added to the LIFT ecosystem, giving high performance to various 3D stencil codes. The 2.5D Tiling
optimisation is coded functionally for the first time in LIFT and is selected automatically by additional
rewrite rules. These rewrite rules, such as the one for 2.5D Tiling, are explored in a search space to find
the best set of optimisations for an application on a given platform.
Building on previous work, LIFT is extended to enable complex boundary conditions and room
shapes for room acoustics models. This is the first intermediate representation in a high-level code generator to do so. Additionally, it is also the first high-level framework to support frequency-dependent
boundary handling for room acoustics simulations. Combined, these contributions show that high-level
abstractions for 3D PDE models are possible, enabling computational scientists to optimise and parallelise their codes more easily across different parallel platforms
Reconfigurable Antenna Systems: Platform implementation and low-power matters
Antennas are a necessary and often critical component of all wireless systems, of which they share the ever-increasing complexity and the challenges of present and emerging trends. 5G, massive low-orbit satellite architectures (e.g. OneWeb), industry 4.0, Internet of Things (IoT), satcom on-the-move, Advanced Driver Assistance Systems (ADAS) and Autonomous Vehicles, all call for highly flexible systems, and antenna reconfigurability is an enabling part of these advances. The terminal segment is particularly crucial in this sense, encompassing both very compact antennas or low-profile antennas, all with various adaptability/reconfigurability requirements. This thesis work has dealt with hardware implementation issues of Radio Frequency (RF) antenna reconfigurability, and in particular with low-power General Purpose Platforms (GPP); the work has encompassed Software Defined Radio (SDR) implementation, as well as embedded low-power platforms (in particular on STM32 Nucleo family of micro-controller). The hardware-software platform work has been complemented with design and fabrication of reconfigurable antennas in standard technology, and the resulting systems tested. The selected antenna technology was antenna array with continuously steerable beam, controlled by voltage-driven phase shifting circuits. Applications included notably Wireless Sensor Network (WSN) deployed in the Italian scientific mission in Antarctica, in a traffic-monitoring case study (EU H2020 project), and into an innovative Global Navigation Satellite Systems (GNSS) antenna concept (patent application submitted). The SDR implementation focused on a low-cost and low-power Software-defined radio open-source platform with IEEE 802.11 a/g/p wireless communication capability. In a second embodiment, the flexibility of the SDR paradigm has been traded off to avoid the power consumption associated to the relevant operating system. Application field of reconfigurable antenna is, however, not limited to a better management of the energy consumption. The analysis has also been extended to satellites positioning application. A novel beamforming method has presented demonstrating improvements in the quality of signals received from satellites. Regarding those who deal with positioning algorithms, this advancement help improving precision on the estimated position
- …