829 research outputs found

    Improved GPU near neighbours performance for multi-agent simulations

    Get PDF
    Complex systems simulations are well suited to the SIMT paradigm of GPUs, enabling millions of actors to be processed in fractions of a second. At the core of many such simulations, fixed radius near neighbours (FRRN) search provides the actors with spatial awareness of their neighbours. The FRNN search process is frequently the limiting factor of performance, due to the disproportionate level of scattered memory reads demanded by the query stage, leading to FRNN search runtimes exceeding that of simulation logic. In this paper, we propose and evaluate two novel optimisations (Strips and Proportional Bin Width) for improving the performance of uniform spatially partitioned FRNN searches and apply them in combination to demonstrate the impact on the performance of multi-agent simulations. The two approaches aim to reduce latency in search and reduce the amount of data considered (i.e. more efficient searching), respectively. When the two optimisations are combined, the peak obtained speedups observed in a benchmark model are 1.27x and 1.34x in two and three dimensional implementations, respectively. Due to additional non FRNN search computation, the peak speedup obtained when applied to complex system simulations within FLAMEGPU is 1.21x

    Working With Incremental Spatial Data During Parallel (GPU) Computation

    Get PDF
    Central to many complex systems, spatial actors require an awareness of their local environment to enable behaviours such as communication and navigation. Complex system simulations represent this behaviour with Fixed Radius Near Neighbours (FRNN) search. This algorithm allows actors to store data at spatial locations and then query the data structure to find all data stored within a fixed radius of the search origin. The work within this thesis answers the question: What techniques can be used for improving the performance of FRNN searches during complex system simulations on Graphics Processing Units (GPUs)? It is generally agreed that Uniform Spatial Partitioning (USP) is the most suitable data structure for providing FRNN search on GPUs. However, due to the architectural complexities of GPUs, the performance is constrained such that FRNN search remains one of the most expensive common stages between complex systems models. Existing innovations to USP highlight a need to take advantage of recent GPU advances, reducing the levels of divergence and limiting redundant memory accesses as viable routes to improve the performance of FRNN search. This thesis addresses these with three separate optimisations that can be used simultaneously. Experiments have assessed the impact of optimisations to the general case of FRNN search found within complex system simulations and demonstrated their impact in practice when applied to full complex system models. Results presented show the performance of the construction and query stages of FRNN search can be improved by over 2x and 1.3x respectively. These improvements allow complex system simulations to be executed faster, enabling increases in scale and model complexity

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Get PDF
    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some applications paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica- tions using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and o er some suggested areas for language development and integration between coarse-grained and ne-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial di erential equations; graph cluster metric calculations and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scienti c applications developers

    A formula-driven scalable benchmark model for ABM, applied to FLAME GPU

    Get PDF
    Agent Based Modelling (ABM) systems have become a popular technique for describing complex and dynamic systems. ABM is the simulation of intelligent agents and how these agents communicate with each other within the model. The growing number of agent-based applications in the simulation and AI fields led to an increase in the number of studies that focused on evaluating modelling capabilities of these applications. Observing system performance and how applications behave during increases in population size is the main factor for benchmarking in most of these studies. System scalability is not the only issue that may affect the overall performance, but there are some issues that need to be dealt with to create a standard benchmark model that meets all ABM criteria. This paper presents a new benchmark model and benchmarks the performance characteristics of the FLAME GPU simulator as an example of a parallel framework for ABM. The aim of this model is to provide parameters to easily measure the following elements: system scalability, system homogeneity, and the ability to handle increases in the level of agent communications and model complexity. Results show that FLAME GPU demonstrates near linear scalability when increasing population size and when reducing homogeneity. The benchmark also shows a negative correlation between increasing the communication complexity between agents and execution time. The results create a baseline for improving the performance of FLAME GPU and allow the simulator to be contrasted with other multi-agent simulators

    Agent-based modelling and Swarm Intelligence in systems engineering

    Get PDF
    El objetivo de la tesis doctoral es evaluar la utilidad de las técnicas Modelado Basado en Agentes, algoritmos de optimización Swarm Intelligence y programación paralela sobre tarjeta gráfica en el campo de la Ingeniería de Sistemas y Automática. Se ha realizado un revisión bibliográfica y desarrollado un marco de desarrollo de la técnica de Modelado Basado en Agentes. Esta técnica se ha empleado para realizar un modelo de un reactor de fangos activados (que se engloba dentro del proceso de depuración de aguas residuales). Se ha desarrollado una notación complementaria para la descripción de modelos basados en agentes desde el punto de vista de la ingeniería de sistemas. Se ha presentado asimismo un algoritmo de optimización basado en agentes bajo la filosofía Swarm Intelligence. Se han trabajado con las técnicas de paralelización sobre tarjeta gráfica para reducir los tiempos de simulación de modelos y algoritmos. Se trata por lo tanto de un tesis de integración de varias tecnologías.Departamento de Ingeniería de Sistemas y Automátic

    Constraint-based simulation of virtual crowds

    Get PDF
    Central to simulating pedestrian crowds is their motion and behaviour. It is required to understand how pedestrians move to simulate and predict scenarios with crowds of people. Pedestrian behaviours enhance the range of motions people can demonstrate, resulting in greater variety, believability, and accuracy. Models with complex computations and motion have difficulty in being extended with additional behaviours. This is because the structure of these models are not designed in a way that is generally compatible with collision avoidance behaviours. To address this issue, this thesis will research a possible pedestrian model that can simulate collision response with a wide range of additional behaviours. The model will do so by using constraints, a limit on the velocity of a person's movement. The proposed model will use constraints as the core computation. By describing behaviours in terms of constraints, these behaviours can be combined with the proposed model. Pedestrian simulations strike a balance between model complexity and runtime speed. Some models focus entirely on the complexity and accuracy of people, while other models focus on creating believable yet lightweight and performant simulations. Believable crowds look realistic to human observation, but do not match up to numerical analysis under scrutiny. The larger the population, and the more complex the motion of people, the slower the simulation will run. One route for improving performance of software is by using Graphical Processing Units (GPUs). GPUs are devices with theoretical performance that far outperforms equivalent multi-core CPUs. Research literature tends to focus on either the accuracy, or the performance optimisations of pedestrian crowd simulations. This suggests that there is opportunity to create more accurate models that run relatively quickly. Real time is a useful measure of model runtime. A simulation that runs in real time can be interactive and respond live to user input. By increasing the performance of the model, larger and more complex models can be simulated. This in turn increases the range of applications the model can represent. This thesis will develop a performant pedestrian simulation that runs in real time. It will explore how suitable the model is for GPU acceleration, and what performance gains can be obtained by implementing the model on the GPU

    Evaluating the performance of legacy applications on emerging parallel architectures

    Get PDF
    The gap between a supercomputer's theoretical maximum (\peak") oatingpoint performance and that actually achieved by applications has grown wider over time. Today, a typical scientific application achieves only 5{20% of any given machine's peak processing capability, and this gap leaves room for significant improvements in execution times. This problem is most pronounced for modern \accelerator" architectures { collections of hundreds of simple, low-clocked cores capable of executing the same instruction on dozens of pieces of data simultaneously. This is a significant change from the low number of high-clocked cores found in traditional CPUs, and effective utilisation of accelerators typically requires extensive code and algorithmic changes. In many cases, the best way in which to map a parallel workload to these new architectures is unclear. The principle focus of the work presented in this thesis is the evaluation of emerging parallel architectures (specifically, modern CPUs, GPUs and Intel MIC) for two benchmark codes { the LU benchmark from the NAS Parallel Benchmark Suite and Sandia's miniMD benchmark { which exhibit complex parallel behaviours that are representative of many scientific applications. Using combinations of low-level intrinsic functions, OpenMP, CUDA and MPI, we demonstrate performance improvements of up to 7x for these workloads. We also detail a code development methodology that permits application developers to target multiple architecture types without maintaining completely separate implementations for each platform. Using OpenCL, we develop performance portable implementations of the LU and miniMD benchmarks that are faster than the original codes, and at most 2x slower than versions highly-tuned for particular hardware. Finally, we demonstrate the importance of evaluating architectures at scale (as opposed to on single nodes) through performance modelling techniques, highlighting the problems associated with strong-scaling on emerging accelerator architectures

    Collective Gradient Sensing in Fish Schools

    Get PDF
    Throughout the animal kingdom, animals frequently benefit from living in groups. Models of collective behaviour show that simple local interactions are sufficient to generate group morphologies found in nature (swarms, flocks and mills). However, individuals also interact with the complex noisy environment in which they live. In this work, we experimentally investigate the group performance in navigating a noisy light gradient of two unrelated freshwater species: golden shiners (Notemigonuscrysoleucas) and rummy nose tetra (Hemigrammus bleheri). We find that tetras outperform shiners due to their innate individual ability to sense the environmental gradient. Using numerical simulations, we examine how group performance depends on the relative weight of social and environmental information. Our results highlight the importance of balancing of social and environmental information to promote optimal group morphologies and performance

    Real-time hybrid cutting with dynamic fluid visualization for virtual surgery

    Get PDF
    It is widely accepted that a reform in medical teaching must be made to meet today's high volume training requirements. Virtual simulation offers a potential method of providing such trainings and some current medical training simulations integrate haptic and visual feedback to enhance procedure learning. The purpose of this project is to explore the capability of Virtual Reality (VR) technology to develop a training simulator for surgical cutting and bleeding in a general surgery

    Parallel simulation methods for large-scale agent-based predator-prey systems : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand

    Get PDF
    The Animat is an agent-based artificial-life model that is suitable for gaining insight into the interactions of autonomous individuals in complex predator-prey systems and the emergent phenomena they may exhibit. Certain dynamics of the model may only be present in large systems, and a large number of agents may be required to compare with macroscopic models. Large systems can be infeasible to simulate on single-core machines due to processing time required. The model can be parallelised to improve the performance; however, reproducing the original model behaviour and retaining the performance gain is not straightforward. Parallel update strategies and data structures for multi-core CPU and graphical processing units (GPUs) are developed to simulate a typical predator-prey Animat model with improved perfor- mance while reproducing the behaviour of the original model. An analysis is presented of the model to identify dependencies and conditions the parallel update strategy must satisfy to retain original model behaviour. The parallel update strategy for multi-core CPUs is constructed using a spatial domain decomposition approach and supporting data structure. The GPU implementation is developed with a new update strategy that consists of an iterative conflict resolution method and priority number system to simultaneously update many agents with thousands of GPU cores. This update method is supported by a compressed sparse data structure developed to allow for efficient memory transactions. The performance of the Animat simulation is improved with parallelism and without a change in model behaviour. The simulation usability is considered, and an internal agent definition system using a CUDA device Lambda feature is developed to improve the ease of configuring agents without significant changes to the program and loss of performance
    corecore