Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of autonomy, frequently provide ample opportunities for parallelisation. Thus, a vast variety of approaches proposed in the literature demonstrated considerable performance gains using hardware platforms such as many-core CPUs and GPUs, merged CPU-GPU chips as well as Field Programmable Gate Arrays. Typically, a combination of techniques is required to achieve high performance for a given simulation model, putting substantial burden on modellers. To the best of our knowledge, no systematic overview of techniques for agent-based simulations on hardware accelerators has been given in the literature. To close this gap, we provide an overview and categorisation of the literature according to the applied techniques. Since, at the current state of research, challenges such as the partitioning of a model for execution on heterogeneous hardware are still addressed in a largely manual process, we sketch directions for future research towards automating the hardware mapping and execution. This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.
such as FLAME GPU [35] and MCMAS [103] have been proposed that focus on the execution on specific accelerators such as graphics cards.
In this survey, we structure the complex landscape of agent-based simulation on heterogeneous hardware. We give an overview of existing types of hardware that have been employed to accelerate agent-based simulations and discuss past developments and current trends. While some surveys exist that present generic high-performance computing techniques using heterogeneous hardware [50, 120, 176] , we highlight the specific challenges of ABS on heterogeneous hardware and categorise an ample body of related work along these challenges. For each challenge, we discuss in detail how existing literature has contributed to solving them. This overview allows us to identify research gaps that need to be filled to establish heterogeneous accelerators in the simulation domain and make them applicable to a wider range of problems-ideally by providing an automated process to support the modeller.
The remainder of this survey is structured as follows: In Section 2, we characterise the main classes of hardware accelerators for general-purpose computations. Section 3 provides an overview of agent-based simulation concepts and outlines the computational challenges of executing agentbased simulations on hardware accelerators. In Section 4, we systematise and survey the existing works according to the identified challenges and according to the techniques used to do so. In Section 5, we discuss unresolved challenges and outline how a system tackling these challenges could look like, thus sketching avenues for future work. Section 6 summarises our findings and concludes the survey.
HARDWARE PLATFORMS
In this section, we describe the technical characteristics of hardware platforms that have been used to accelerate agent-based simulations, covering many-core CPUs, Graphics Processing Unit (GPUs), Accelerated Processing Units (APUs), Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), and System on Chips (SoCs). Readers familiar with these platforms may skip this section and continue to Section 3.
Many-Core CPUs
Architecture: A many-core (or many integrated core, MIC) CPU contains a group of CPU cores on a single chip. One of the well-known many-core CPUs, the Intel Xeon Phi, is equipped with up to 72 x86-compatible CPU cores communicating via an internal Network-on-Chip that enables fast and parallel data transfer between the cores. A many-core CPU can be connected to the host machine via PCI-E or can be a standalone CPU with direct access to the system memory.
In the past years, a number of non-x86 many-core CPUs have emerged, such as the Parallella Board [4] , the Epiphany-V [135] , and the Kalray MPPA (Massively Parallel Processor Array) [43] .
Benefits: A notable advantage of some many-core CPUs over GPUs and FPGAs is their capability to execute largely unmodified parallelised code written for regular CPUs [101] . Since the individual cores support out-of-order execution, employ deep instruction pipelines, and have access to comparatively large caches, the need to adapt a program's control flow to the hardware is less pressing than with, e.g., GPUs [24] . Further parallelism may be extracted using a singleinstruction, multiple-data (SIMD) style of programming using instruction set extensions such as AVX-512 [86] .
Recent work showed that many-core CPUs can substantially accelerate Discrete Event Simulation (DES) [89, 184] . A number of authors also evaluated the acceleration of various types of simulations such as fluid dynamics and seismic wave propagation using non-x86 many-cores [29, 148] .
Limitations: In light of the comparatively high cost of recent many-core CPUs (≈ US$3368.00 as of 03/2018 for an Intel Xeon Phi Processor 7290F) compared to other accelerators, the performance gains compared to traditional multi-core CPUs have frequently been relatively low [154] ): In the Sense stage, an agent detects and analyses its neighbours as well as the environment in which it resides. In the Think stage, an agent makes judgement based on the information collected during the Sense stage. The update of states takes place in the Act stage. The simulation time is typically advanced in fixed time steps at which all agents update their states. However, if a model requires agents to update their states at variable points in simulation time, time advancement using a discrete-event simulation (DES) approach may be more appropriate. In DES, state updates are performed through events scheduled for execution at discrete points in simulation time. The simulation proceeds by iteratively executing the earliest remaining event, potentially scheduling new events in the process.
Independent of the time advancement mechanism, a defining characteristic of ABS distinguishing it from other simulation techniques is the autonomy of agents, i.e., "agents are endowed with behaviours that allow them to make independent decisions" [115] . Since the focus of this survey paper is on ABS, we exclude simulation domains such as physics and chemistry, which usually consider sets of entities that are passively affected by their environment. However, we do discuss a number of methods proposed outside of the ABS domain with direct applications to ABS, e.g., GPU-based priority queues in the context of DES.
Computational Aspects
The literature on executing ABS using heterogeneous hardware can be organised according to the challenges addressed by the individual works. In our literature review, we identified five such challenges, which are consequences of features shared by most ABS models.
Crooks and Heppenstall summarise the literature on definitions of the term "agent" using the core features of autonomy (independent decision-making), heterogeneity (opportunity for different attributes per agent), and being active (independent influence on the simulation) [40] . The feature of "being active" is further described using a number of sub-features including mobility (ability to move in the simulation space) and interactivity (ability to communicate extensively).
From this definition, we identify a number of features with direct implications on the execution of agent-based simulations on parallel hardware as follows:
• Autonomy, Heterogeneity, and Mobility: The independent decision-making per agent provides opportunities for parallelisation across the computations involved in the individual agents' Sense-Think-Act cycle. However, since agents move and act according to their individual attributes as well as their internal states and current environment, the actions of an agent population are commonly diverse both in time and space. For instance, at certain points in simulated time there may be areas of the simulation space with little activity and thus low computational demands, while areas populated with highly active agents may require intense computations. Thus, the heterogeneity and mobility of agents make it challenging to balance the workload among the available processing elements. Further difficulties are given by the diversity in the type of action performed, since some hardware accelerators achieve highest performance when executing large numbers of identical computations in parallel. • Interactivity: The communication among agents is typically reflected by memory accesses when the communicating agents are simulated on the same device, or by data transfers over a bus like PCI-E when crossing device boundaries. Since the agents' communication patterns are usually not fully predictable, exploiting regularities in the communication to increase performance, e.g., by storing messages in low-latency memory shared by two communicating agents, is non-trivial. Further, given autonomous and mobile agents, even under a highly suitable and dynamic hardware assignment there will typically still be a need for substantial amounts of data transfers among processing elements.
When considering the modelling and simulation life cycle, another important property is given by the frequent changes to the model code in the iterative process commonly applied when developing simulation models. As a property of the agent-based modelling and simulation life cycle, it has been stated that modellers iteratively extend and refine models to "balance the expressive power of more details with the cost of additional complications" and that in fact "iterative model development and use is the ABMS paradigm" [128] . Given the resulting frequent changes to the models and the difficulty predicting the runtime behaviour resulting from the features above, there is insufficient information about the data dependencies, instruction mix or control flow to develop an overall simulation program optimised for a particular model. This is in contrast to well-defined tasks such as linear algebra operations, which allow for the development of highly optimised libraries targeting specific hardware platforms. Instead, there have been attempts to provide efficient facilities for tasks common to many agent-based simulations, allowing developers to specify models without the need to consider low-level specifics of the given hardware platform.
We summarise the above discussion by identifying five challenges for agent-based simulations on heterogeneous hardware platforms along which the remainder of the survey will be organised. While the challenges are shared by other types of workloads, the features discussed above make the challenges particularly common and pressing in agent-based simulations.
(1) Hardware assignment: A well-known challenge in parallel and distributed simulation lies in partitioning the simulation workload among the processing elements. Generally, there are two dimensions according to which a simulation can be partitioned [122] : Domain decomposition partitions according to the simulation space (e.g., different roads in a traffic simulation), while functional decomposition partitions according to different models (e.g., different layers of the network stack in a computer network simulation). In ABS on heterogeneous hardware, the hardware assignment is further complicated by the shifting workload due to the agents' autonomy and mobility, and by the heterogeneity of the hardware platform, in which devices may differ in their suitability for certain types of computations. Existing techniques to approach this challenge either attempt to determine static boundaries within the simulation that allow for an efficient partitioning without further adaptation, or dynamic hardware assignments that are updated at simulation runtime. (2) Data transfer overhead: As a result of the agents' mobility and interactivity, even under an efficient static or dynamic partitioning, frequent data transfers are usually necessary among the hardware devices to migrate agents or to reflect inter-agent communications. Techniques have been proposed to exploit the limited agent velocity and interaction range to reduce the impact of the data transfers on the simulation performance. (3) Scattered memory accesses: The dynamic and largely unpredictable agent movement and interaction translates to memory access patterns that are in conflict with the regular, i.e., linear, accesses preferred by common accelerators. The literature proposes representations of irregular structures using regular memory layouts and caching heuristics to improve the efficiency when accessing the simulation data. (4) Maximisation of parallelism: A defining characteristic of simulations is the notion of simulated time, which puts constraints on the simulation progress during parallel execution. Although there may be a substantial amount of computation scheduled, processing elements may frequently be blocked waiting to maintain synchronisation with the other processing elements. The field of parallel and distributed simulation proposes a wide range of algorithms to extract as much parallelism as possible while maintaining [1, 47, 74, 76, 95, 107, 145, 151-153, 169, 199] [41] Ecology [104, 192] [178] Social [101] [92, 93, 106, 108, 177, 185, 196] [59] Physics and Chemistry [17, 69, 98, 118, 143, 159, 192] [121] Networks [7, 8, 22, 100, 139, 141, 157, 169 ] Domain-independent Simulation Framework [103] [ 39, 87, 88, 103, 105, 151, 153, 190] the synchronisation of simulated time [53] . In the past years, a number of works have re-evaluated and adapted these approaches targeting the execution of ABS on accelerators. (5) Abstraction from hardware specifics: Given the frequent adaptations and extensions commonly made during the development of an ABS model and during the verification and validation process, it is necessary to provide frameworks to simplify the modellers' implementation work while still maintaining high performance. In the past years, a number of frameworks that abstract from the hardware details as well as libraries for unified access to the memory of different devices have been presented in the literature.
ADDRESSING THE CHALLENGES OF AGENT-BASED SIMULATION ON
ACCELERATORS Agent-based simulation on hardware accelerators started to receive attention from the research community from the early 2000s onwards. The vast majority of these works focused on GPUs (see Figure 1 for an overview of the number of publications since 2002), mainly because they are comparatively inexpensive and because, in recent years, the ease of programming of GPUs is slowly approaching that of CPUs. Furthermore, well-established programming frameworks such as OpenCL enable the formulation of models in a less hardware-specific manner. For publications that considered specific simulation models, Table 1 shows the simulation domains and hardware platforms, providing researchers with pointers to relevant works in their respective domain.
Our survey of the literature is organised along the key challenges we identified in Section 3, that is, hardware assignment, data transfer overheads, scattered memory accesses, maximisation of parallelism, and abstraction from hardware specifics. In the following, we discuss the techniques from 
Hardware assignment
Static assignment by type of computation Many-Core [101] , GPU [8, 15, 22, 75, 80, 118, 142, 163, 189] [74, 76, 193] , APU [180] , FPGA [41, 59, 175, 178] Dynamic assignment based on runtime measurements GPU [19, 63, 66, 96, 182, 193] , FPGA [19] Data transfer overheads
Overlapping of communication and computation GPU [16, 17, 100] Computation replication at partition boundaries GPU [1, 199] Scattered memory accesses
Manual caching in shared memory GPU [107, 151, 199] Heuristics for agent update order GPU [9, 68, 92, 93] Representation of irregular data structures by arrays and grids APU [180] , GPU [47, 69, 98, 114, 144-146, 152, 166, 177] [7, 14, 95, 109, 140, 141, 159, 168, 183, 196] , FPGA [121, 149] 
Maximisation of parallelism
Multiple replications in parallel GPU [100, 104, 108, 142, 160, 192] Window-based event execution GPU [7, 26, 139, 141, 143, 157, 169, 196] Speculative execution GPU [106, 109] , FPGA [121] Computation sorting GPU [95, 100, 169] Abstraction from hardware specifics Frameworks to support simulation development Many-Core [103] , GPU [39, 79, 103, 114, 151, 153] Unified memory access GPU [87, 88, 105, 190] the literature applicable to these key challenges in agent-based simulation. Table 2 summarises the systematisation of knowledge presented in this survey. It contains our classification of challenges, techniques, publications, and types of accelerators.
Hardware Assignment
One of the main challenges in parallel and distributed computations in heterogeneous hardware environments lies in finding a suitable partitioning, i.e., assignment of a given problem to the available hardware [56] . We discuss techniques that have been used to address this problem according to two different, yet interrelated, aspects: First, we consider techniques to select suitable hardware for sub-tasks according to their ability to efficiently execute certain types of computations. The minimisation of data transfers among the partitions running on separate devices will be considered in the next subsection. The existing approaches can be roughly categorised as follows:
1. Static assignment: If the simulation model involves different types of computations that clearly suggest a certain hardware mapping, then it may be sufficient to partition the model prior to a simulation run without any adaptation during runtime. For instance, model segments involving large numbers of independent floating point operations may be well-suited for execution on a GPU, whereas segments with highly data-dependent control flow suggest the execution on a CPU. 2. Dynamic assignment: Frequently, the dynamic behaviour of a simulated system at runtime translates to unpredictable computational patterns. In such cases, maintaining high performance may require an adaptation of the hardware mapping based on performance measurements at runtime. An inherent challenge of dynamic assignment is the trade-off between the performance increase through an improved assignment and the costs of runtime measurements and re-assignment.
An ample body of research has considered the parallelisation of general programs onto heterogeneous platforms, which is an enormous challenge due to the arbitrary control flows and memory access patterns that can be present in general programs. Thus, typically, the approaches limit themselves to program portions that are particularly amenable to parallelisation on accelerators. In the case of ABS, constraints such as the separation of data into a per-agent state and the limited sensing range of agents somewhat simplify the problem of parallelisation, potentially enabling a higher degree of automation in the hardware mapping. In Section 5, we outline the vision of an automated approach and the required building blocks towards an automated hardware mapping for heterogeneous ABS.
Static Assignment.
The simplest hardware assignment is to execute the entire simulation on a single device. This approach is common in the existing work on FPGA-based ABS. For instance, Vourkas and Sirakoulis [178] , who implement an environmental model simulation based on cellular automata (CA). The authors note the structural similarity between a two-dimensional cellular automaton and an FPGA and assign one cell to each Configurable Logic Block (CLB). If the number of cells exceeds the number of CLBs, then the simulation lattice is partitioned into several layers, which are processed one after the other. Similarly, Cui et al. [41] and Georgoudas et al. [59] show high performance when assigning ABS models operating on cellular grids to a single FPGA. A number of works on GPU-based ABS take the same approach of assigning the entire simulation to the accelerator [1, 79, 80, [144] [145] [146] . In these works, the computations associated with the Sense-Think-Act cycle of the individual agents are often assigned to one GPU thread each.
Often, the available hardware devices lend themselves to specific types of computations, or the memory consumption of the simulation exceeds the capacities of an individual device. Then, it is necessary to find a partitioning for the ABS, which may follow either a domain decomposition or a functional decomposition [122] .
Domain decomposition: When using domain decomposition, the simulation space is partitioned and each partition is assigned to a separate processing element. An example is given by the work by Lai et al. [101] , who implement Game of Life [37] and a simulation of urban sprawl processes using cellular automata [186] . The authors compare the performance achieved when using one CPU per execution node, one GPU per node and 60 cores per CPU-based many-core accelerator, using MPI for inter-node communication in each instance. The authors conclude that the use of accelerators provides a performance benefit over the purely CPU-based execution. Given a sufficiently large number of assigned processors, using the CPU-based many-core accelerator with fully device-based simulation achieves similar performance as the GPU-based acceleration.
Generally, a domain decomposition of an ABS targeting hardware accelerators is suited to assign different parts of the simulation to devices of the same type. Since different types of hardware device typically favour certain types of computations, when assigning computations to different types of hardware, functional decomposition is commonly applied instead. It is thus natural that most of the literature on hardware assignment of ABS to heterogeneous hardware has focused on functional decomposition.
Functional decomposition: An efficient functional decomposition identifies static functional boundaries in the considered simulation to maximise the computational performance on each device, while minimising data transfers. Pavlov and Müller [142] discuss approaches for the hardware assignment of ABS and conclude that an approach in which the CPU and the GPU hold duplicated or partial agent and environment data is the most promising. An overall development process for a GPU-accelerated ABS starting from a CPU-based implementation is proposed in [75, 118] . After the decomposition of the simulation into small task modules, modules suitable for execution on a GPU such as loops are identified heuristically and manually replaced with GPU-executable counterparts. Several case studies [74, 76, 118] show substantial speedup when employing this method. Fig. 2 . Four CPU-device simulation schemes [8] . Devices can be GPUs or many-core CPUs.
Some works have focused on isolating simulation code that heavily relies on floating point arithmetic to be executed on a GPU. Bauer et al. [15] assign the discrete part of a combined continuous-discrete simulation to the CPU and the continuous part, which relies on floating point arithmetic, to the GPU. The authors conclude that while keeping the GPU fully utilised poses a challenge, models with large numbers of floating point operations can benefit from GPU acceleration. Andelfinger et al. [8] compare different GPU/CPU simulator architectures aiming to offload events associated with floating point operations to a GPU. In a basic CPU/GPU hybrid scheme (cf. Figure 2 (a)), the CPU offloads each event to the GPU individually. The required data transfers can be reduced by aggregating independent events to execute them in a single step (cf. Figure 2 (b)) and by leaving computation results required by subsequent events in graphics memory (cf. Figure 2(c) ). Finally, if the entire simulation is ported to the GPU, data transfers are only required at the start of the simulation and once the simulation terminates (cf. Figure 2(d) ). While the simulation performance increases with each of the above optimisations, the developer is burdened with some additional complexity.
Since floating point arithmetic is a natural fit to the GPU's capabilities, models heavily relying on such operations are likely to benefit from a functional decomposition focusing on this aspect. Examples of such models include kinematic equations as in microscopic traffic simulation or crowd simulation, as well as models of wireless communication and of biological or chemical processes.
In another approach to functional decomposition, the agent behaviours are left to the CPU while environment dynamics are handled by the GPU [118] . With this approach, the author aims to reduce the impact of the GPU acceleration on the maintainability of the simulator code. To increase performance, portions of the agent behaviour that do not depend on the agent state, e.g., perception of the environment, are carried out on the GPU independently of individual agents for all locations.
In contrast to the above works, a number of approaches partition the simulation along functional boundaries specific to the given simulation models. For instance, when the underlying simulation can be clearly separated into model computation and management tasks, a master-worker scheduling approach as shown by Bilel et al. [22] in the context of large-scale mobile networks simulation can be applied. In the proposed design, the model is executed on the GPU, while the CPU orchestrates the event scheduling, simulation status monitoring, and memory allocation.
Finally, the nature of traffic simulation allows for a relatively straight-forward functional decomposition according to different simulation aspects. Xu et al. [189] and Song et al. [163] assign the agent mobility in a mesoscopic traffic simulation to the GPU, whereas the route calculation, agent generation, and file reading and writing remain on the CPU. The two parts run asynchronously to hide data transfer latencies. In the traffic simulation on an APU presented by Wang et al. [180] , sorting of agent states is required to locate each agent's neighbours. To reduce synchronisation overheads on the GPU portion of the APU, the GPU portion only performs state updates and local sorting, whereas the sorting across GPU blocks is handled by the CPU resources. The work separation can be carried out efficiently using zero-copy memory accesses. Considering FPGAs, Tripp et al. [175] showed how the movement of agents on individual lanes can be computed on an FPGA, while the agents' transitions from one road to another as well as the behaviour at intersections are computed on the CPU.
In summary, most approaches relying on static hardware assignment split the simulation workload into coarse-grained functional tasks so that some tasks are clearly suited for a certain hardware device. To minimise trial-and-error, heuristics may be applied to identify a suitable mapping of tasks to the hardware. Tasks involving large numbers of parallel floating point operations are among the most common portions of simulations offloaded to accelerators.
Dynamic Assignment.
While a wide range of literature has considered the problem of dynamically adapting a partitioning of agent-based simulations to multiple CPUs (e.g., References [38, 111, 188] ), we are not aware of such works that specifically target heterogeneous hardware environments. In the following, we outline recent works on dynamic assignment of general computational workloads to heterogeneous hardware. Since these works are generic, they cannot rely on knowledge of the general structure of ABS simulators or on model knowledge. Still, the proposed methods to determine suitable hardware platforms for given segments of code can be applied to ABS as well.
Belviranli et al. [19] propose a self-scheduling scheme for partitioning generic application workloads into blocks and assigning them to CPUs, GPUs, and FPGAs. The proposed system consists of two phases: In the first phase, the system performs an online training with a small amount of data to estimate the maximum workload capacity size of each hardware device. Fast convergence is achieved by fitting four sampled data points to a logarithmic function. Once the capacity is determined, the processing unit's performance can be inferred from the same data. When the change of processing speed between two samples drops below a threshold, it is used as the final estimated value. In the second phase, the remaining workload is partitioned based on the relative processing speeds of the available processing units, assigning a large portion of the workload to faster processing units.
Some authors use machine-learning techniques such as support vector machines, artificial neural networks, and decision trees to distribute the workload of OpenCL programs to CPUs and GPUs. For example, Grasso et al. [63, 96] and Zhang et al. [193] translate a single-device OpenCL program to a multiple-device program, while Wen et al. [182] focus on scheduling multiple OpenCL functions to run in parallel on CPU/GPU. They train a machine-learning algorithm according to a set of typical OpenCL programs and benchmarks. The prediction generated by the machinelearning algorithm guides the assignment of a portion of the computation to CPU or GPU. Their results show that the above three machine-learning approaches outperform purely CPU-or GPUbased approaches. The scheduling scheme by Wen et al. achieves a performance improvement compared to a first-come, first-served scheme and a scheme where computation-heavy tasks are handled by the GPU.
To automate the compilation of sequential programs for parallelised execution on heterogeneous hardware, Grosser and Groesslinger [66] present a compiler that generates CPU and GPU code. Regions with mostly static control flow and sufficient computational intensity are detected and transformed to a formal representation to facilitate program transformations [65] . After optimisations have been performed to increase memory access locality and parallelism, CUDA code for GPUs is generated from the formal representation. A runtime library eliminates repeated memory allocations and unnecessary data transfers between CPU and GPU. The decision whether a region is compute-intensive enough for execution on the GPU is made statically or at runtime using heuristics based on metrics such as the number of instructions. The authors conclude that the compiler is able to translate CPU code into cross-platform code with no performance penalty. For some computations, such as the correlation benchmark from polybench [147] , significant speedup of up to two orders of magnitude can be achieved.
The main difficulty in automated hardware mapping lies in determining the control flow and data dependencies of the original program. Current approaches either rely on the program code being formulated in languages such as OpenCL that express independent control flows explicitly, or only consider specific portions of programs such as loops with largely static control flow. In ABS, however, most of the available parallelism may exist across the update routines of separate agents. Thus, without semantic information describing the code structure, automatic detection of the parallelism is challenging. In Section 5, we sketch how the common structure of many ABS may be utilised to support the extraction of parallelism.
Minimisation of Data Transfer Overheads
Since most hardware accelerators are equipped with their own memory, simulations making use of accelerators typically require data transfers between host and accelerator memory. Even with a high-quality domain decomposition or functional decomposition of the simulation, agent mobility and communication or the data dependencies between the different functional aspects make data transfers unavoidable. In this section, we survey works that focus on minimising the cost of such data transfers. The existing approaches can be categorised according to the following techniques:
1. Overlapping of communication and computation: Some authors proposed techniques to hide communication overheads by transferring data while independent computations are performed. Sometimes, the technique has been referred to as latency hiding (e.g., [25] ).
Computation replication at partition boundaries:
Another technique to address communication overheads is to increase the amount of computation performed before synchronisation among processing elements is required. This is achieved by duplicating some computations on multiple processing elements, thus delaying the need to resolve data dependencies across processing elements.
Overlapping of Communication and Computation.
One way of mitigating the overhead from data transfers between the host and an accelerator is to execute computations at the same time as data are being transferred. In the approach described by Kunz et al. [100] , event computations are overlapped with data transfers across the CPU-GPU boundary, thus hiding data transfer latencies in a pipelined fashion. Since events from multiple simulation instances are considered concurrently, there are substantial opportunities for overlapping these steps. While their approach is applied to a discrete-event simulation, it can be applied to timestepped ABS by initiating the transfer of output data of some agents' state updates at a given timestep, while computations for other agents are still in progress.
Bauer et al. [16, 17] propose a generic API to optimise the data transfer between global memory and shared memory of CUDA GPUs using so-called warp specialisation. The warps within one cooperative thread array are split into two groups: Dedicated memory warps are in charge of data transfer between the on-chip and off-chip memory. Compute warps process the data. The approach improves performance over thread-level separation between communication and computation, since separate warps can follow divergent control flows without any performance penalty. While their general idea can be applied to other types of independent processing elements, the warpbased implementation is specific to GPUs.
Computation Replication at Partition Boundaries.
In timestepped ABS, at model time t each agent updates its state based on the states of its neighbours at time t − 1. If the simulation is distributed across multiple processing elements, then synchronisation and data transfers are required to provide this information at each timestep. The associated overhead may make up a substantial portion of the simulation runtime. Thus, some authors have proposed methods to reduce synchronisation by replicating some computations on multiple processing elements, similarly to performance optimisations in numerical computing [45] . The main challenge when applying this approach to ABS is the consideration of the model-specific sensing range of agents and the speed according to which the effect of an agent's actions can propagate throughout the simulation space.
Aaby et al. [1] present a multi-level data partitioning scheme for cellular simulations on multi-CPU/GPU clusters. The simulation state is partitioned into blocks and each block is executed by a thread, a core, or a node, depending on the configured granularity. In contrast to the traditional data partitioning into blocks of B × B cells and synchronisation at each timestep, their approach partitions the data into several overlapping
cells form the overlapping area. The computation in the overlapping area is performed redundantly by multiple processing units. Thus, assuming that at each timestep, a cell can only affect its immediate neighbours, R timesteps are required for a cell in the inner block to be affected by cells in another processing element. Therefore, synchronisation is only required every R timesteps. Between synchronisation points, an error propagates inwards within the overlapping areas, but does not affect the inner B × B cells before a new synchronisation occurs. This partitioning approach is further employed in multi-GPU clusters on the node-, GPU-, block-, and thread-level, and for multi-CPU clusters at the node-, socket-, core-, and thread-level.
While Aaby et al. illustrate the idea based on cellular grids, the approach applies to general ABS. The sensing range of agents is generally limited and provides an upper bound on the propagation of the effects of an agent's actions within a timestep. As long as overlapping segments of the simulation space can be distributed to the processing elements in a manner so that an effect requires at least R > 1 timesteps, some synchronisations can be avoided. The generality of the approach is illustrated by Zou et al. [199] , who extend the idea of computation replication to graph-based topologies in a GPU-accelerated epidemic ABS.
Scattered Memory Accesses
Throughout the past decades, the increase in computational performance has outpaced the decrease in memory access latencies, leading to modern hardware designs towards large caches and deep memory hierarchies. In the context of ABS, the issue of memory access latencies is particularly pressing: Due to the autonomous decision-making of agents, the runtime interactions among agents and their environment cannot easily be predicted before executing the simulation, significantly limiting the opportunities for a priori optimisation of data access patterns. However, commonalities between different simulation models can be exploited to propose data structures supporting efficient simulation of an entire range of models on a specific type of accelerator.
Since dynamic memory allocation on GPUs is costly [54] , most GPU-based simulators allocate memory for data such as the agent states statically (e.g., Reference [107] ). Another approach is to determine after each timestep the required amount of memory and perform allocations accordingly [141] .
We categorise the existing approaches to address scattered memory accesses as follows:
1. Manual caching in shared memory: Although the support for transparent caching has improved in recent years, achieving highest performance frequently still requires manual caching in low-latency memory. In ABS, agents often influence and are influenced by their direct neighbours. This fact can be exploited when arranging the simulation data in memory, reducing high-latency memory accesses during state updates.
Heuristics for agent update order:
Since the data dependencies between agent state updates are typically not known prior to the execution of the simulation, minimising cache misses during the state updates is non-trivial. Heuristics have been proposed, aiming to favour sequences of computations acting on the same agent data. 3. Representation of irregular data structures by arrays and grids: The hardware architecture of GPUs and FPGAs is designed so that highest performance is achieved when acting on regular data structures such as arrays and grids. Thus, efforts are taken to represent highly irregular data structures in a regular fashion. When covering the techniques from the literature, we first cover model-specific data structures such as graph representations of a simulated road network. Subsequently, we discuss works covering two generic building blocks commonly required as part of ABS engines: priority queues and sorting.
Manual Caching in Shared Memory.
Richmond et al. [151] propose to utilise the shared memory of the GPU as a manual cache. In their agent-based simulation framework for cellular models in biology based on FLAME GPU [35] , they copy sets of messages to be transferred between agents into shared memory. Each thread within a block can then efficiently iterate through the messages and identify those pertaining to the local agent. Once all threads have iterated through the messages, the next sets of messages are loaded into shared memory. Recently, Heywood et al. [78] specialised their messaging method for traffic simulations on graph-based road networks. Messages are sorted by the edge (road segment) or vertex (intersection) so that each agent only considers messages that pertain to its immediate neighbourhood in the road network.
Similarly, Zou et al. [199] implement a manual software cache in shared memory to increase the performance of their graph-based epidemic simulation on GPU clusters. Before the simulation commences on the GPU, the CPU sorts the edges of the directed graph by the source vertex. Each thread block's shared memory stores edges originating from one specific node. Since each block processes only edges originating from this node, a cache hit rate of at least 50% is ensured.
In the GPU-based ABS by Li et al. [107] , assuming a constant number of agents, each agent is assigned to a GPU thread and its state data are permanently kept in global memory. The simulation space is partitioned into a grid of rectangles. Once a search for the neighbours within a circle around an agent is required, a search rectangle that encloses the searching circle is created, so only agents inside the search rectangle have to be considered. Two approaches to utilise the GPU's shared memory are proposed: In the first approach, one block manages the searching process for a chunk C of close-by agents. Per-block shared-memory loads the data of the agent and the agent's neighbours. Each agent in C has a high probability of being in the other agents' neighbourhoods, so that these agents can frequently be accessed through the current block's low-latency shared memory. However, since the limited shared memory capacity allows only for small numbers of agents to be stored, it is still likely that some neighbours are managed by another block and thus have to be accessed through global memory. In the second approach, the shared memory loads the data of agents located in the union of all search rectangles of the agents' handled by the current block. If the shared memory is not sufficient to hold all agents' data, then the data are loaded as a sequence of chunks. Of course, the increase in the search space given by the union of search rectangles leads to a higher number of unnecessary agent accesses through shared memory. To address this problem, the union rectangle can be constructed on the warp level instead of the block level.
Heuristics for Agent Update
Order. The order in which agent updates are performed must adhere to the causal dependencies between the agent states and behaviours, e.g., in road traffic simulation, vehicles in direct proximity must be at the same point in simulated time to be able to interact according to the model specification.
Typically, this is achieved by a strictly timestepped scheme in which agents always reside at the same timestep, after which conflicts in the resulting agent states are resolved [191] . However, since in a typical simulation not all agents interact at each point in time, some agents may be updated further into the simulated future than others without affecting the simulation results [9] . Harris and Scheutz have shown that distributed agent-based simulations can be accelerated by favouring agent updates that resolve dependencies across multiple processing elements [68] . This way, processing elements waiting for others to proceed can be unblocked, decreasing the amount of idle time. Their approach can be applied independently of the underlying hardware platform, but requires bounds on the sensing range and the agent movement per timestep.
Jin et al. [92] present an information propagation simulation supporting execution on HPC systems and single GPUs and extend it to run on multiple GPUs [93] . Their focus lies on maximising the cache hit rate when traversing a graph according to rules defined by the simulation models. Two categories of approaches are developed for the cascade model [60] and the threshold model [62] , which both simulate the propagation of information among nodes in a graph: vertexoriented processing and edge-oriented processing. For the vertex-oriented approach, the authors further describe two agent update orders: One iterates starting from active vertices, i.e., those that already have the information, and the other from inactive vertices. Since the costs depend on the portion of active nodes, the simulation can switch dynamically between the two vertex-oriented approaches. Finally, the edge-oriented approach iterates over the connecting edges between two vertices. Since the number of edges is constant over a simulation run, the cost of the edge-oriented approach is less variable than that of the vertex-oriented approaches. The authors achieved the highest performance when dynamically switching between the two vertex-oriented approaches.
Representation of Irregular Data Structures by Arrays and Grids.
GPUs and FPGAs are particularly suited for operations on regularly structured data. However, many model types specify topologies that are more naturally expressed in terms of irregular structures such as graphs. Further, execution of the simulator core itself may require operations on irregular data structures.
A basic optimisation commonly applied in works on GPU-based computing to improve memory access patterns is the transformation of the data layout in memory from arrays of structures (AoS) to structures of arrays (SoA) (e.g., References [151, 166] ). Commonly, sequential programs store data in an AoS representation. Since AoS bundles the properties associated with each object in object-oriented programming, or the states of agents in agent-based simulations, it is a natural way to represent data within these paradigms. However, with an AoS data layout, parallel operations on the same property across many objects results in scattered memory accesses. An SoA data layout bundles the same property across all objects, which can increase cache hits rates and opportunities for memory access coalescing, thus improving performance substantially.
Beyond this simple optimisation, the data representation can be specialised for a given model to further improve performance. In the following, we give an overview of methods applicable to ABS to achieve high performance by translating irregular data structures to a more regular form.
Model-specific data structures
Early works on executing ABS using GPUs frequently focused on cellular grids and translated the required computations into the graphics processing domain. In a pioneering work done by Harris et al. [69] , GPU shaders are used for implementing computations on the RGBA values in a texture that holds the agents' states. The same idea is employed by Lysenko et al. [114] , Perumalla and Aaby [145] , and Kolb et al. [98] .
Perumalla et al. [145] evaluate the performance of running agent-based simulation entirely on a GPU. They ported the cellular models Mood Diffusion [77, 125] , Game of Life [58] and Schelling Segregation [158] . Through the Open Graphics Library (OpenGL), individual agent states are mapped to pixel colour values. The authors report a speedup of 15 to 40 compared to CPU-based sequential execution. Kolb et al. [98] develop a particle simulation and a GPU-based collision detection mechanism built on the authors' previous work [97] . Similarly, Richmond et al. [152] utilise the GPU's texture processing ability and map agent states onto texture data. To accelerate the neighbourhood detection, the simulation space is partitioned dynamically according to the agents' current states. The algorithm to generate partitions is borrowed from the particle pinning problem in rigid body particles physics [64, 67] . Identification of the start and end of the partition boundary is performed similarly to the method described in Reference [134] . Textures are used to represent the agents' states and vertex texture fetching enables the search for the start and end of the partition boundary by comparing the partition value to the previous agent's state.
To enable traffic simulations on GPUs, Perumalla [144] (and Perumalla and Aaby [146] ) proposes to model the road network as a grid made up of cells. A road network in Cartesian coordinates is translated to a grid representation overlaying the network: A cell in the grid is marked as occupied when an edge of the original road network starts in the cell, passes the cell, or ends in the cell. In graphics memory, the cells' properties such as turning probabilities and length are stored in texture buffers. Simulation is carried out by performing operations on the texture buffers.
A different method for traffic simulation on GPUs is presented by Strippgen and Nagel [166] , who propose a queue-based approach using CUDA. Each road is represented as a single first-in, first-out (FIFO) queue stored in memory in the form of a ring buffer. With the ring buffer, insertion of a vehicle entering a road and removal of a vehicle exiting a road is achieved with constant time complexity. Coalesced memory access can be achieved by processing adjacent roads using adjacent threads. Since the vehicles' mobility is modelled by a fixed per-link velocity, their approach can be considered mesoscopic. The representation of lanes as ring buffers relies on changes of the relative position of vehicles being rare. Behaviours such as overtaking or lane-changing are not modelled and would require random insertions and removals from the ring buffers, which are associated with linear time complexity.
Other domains in which agent-based simulations have been successfully ported to GPUs using model-specific data structures include collision detection [177] and a simulation study of tuberculosis [47] . In the former, a grid is split into tiles and data at the boundary of the tiles is replicated so that a consecutive space is occupied in the global memory of the GPU. In the latter, the authors propose to use a sorted array according to the liveness status of agents, so that the state of a new agent can be stored in a memory location previously occupied by one of the dead agents.
Sorting and priority queues Full or partial sorting is frequently required in agent-based simulations, e.g., for neighbourhood discovery or to implement priority queues (PQ) if time advancement is performed in a discreteevent manner. These operations can involve large amounts of data-dependent and scattered memory accesses and are therefore challenging to implement efficiently on hardware accelerators. Since this operation can occupy a substantial portion of the simulation runtime [155] , a number of works have focused on memory layouts and algorithms for sorting and priority queues on accelerators.
As building blocks for time advancement in a discrete-event fashion, parallel reduction and bitonic sorting are commonly used in GPU-and FPGA-based simulation [95, 140, 159, 180, 196] . We discuss these two operations jointly due to their structural similarities. In both cases, an input array is split into chunks, each chunk being handled by one thread. At each cycle, the sorted arrays/minimum values of two threads are then merged to form a new input array. Thus, at each cycle, the number of chunks and active threads is cut into half. The algorithm iterates until only one thread is active, leaving a sorted array or the global minimum value, respectively.
He et al. [71] propose a parallel heap-based PQ on GPU based on a previous CPU-based design [44] . The data structure resembles a binary min-heap, but stores r items per heap node. Items are inserted and extracted in a joint bulk operation that inserts up to k ≤ 2r and extracts up to r elements. At any time the root node is guaranteed to hold the highest-priority elements, while elements of lower priority are gradually inserted into deeper levels of the tree over the course of multiple insert-extract operations. Parallelism can be exploited across the sorting operations on the items within a tree node, across the nodes on one level of the tree, and by processing all evennumbered and odd-numbered levels of the tree in parallel. The costs of the queue operations can be hidden by performing them in parallel with the processing of extracted items.
Similarly, the FPGA-based DES simulator by Rahman et al. [149] relies on a pipelined heap [21] for storing events. In contrast to the parallel heap by He et al., the pipelined heap is designed to achieve near-constant access times, but does not provide bulk operations.
A number of works avoid the need for a global PQ holding all future events. Instead, the set of events is considered jointly in an unsorted fashion [168] , split by model segment [159] or simulated entity [7, 109, 196] , split according to a fixed policy [141, 183] , or split randomly [121] . To determine the events that can be executed without violating the simulation correctness, a parallel reduction is performed to determine the minimum timestamp among the events.
Baudis et al. [14] evaluate the performance of PQs on a GPU implemented as a single parallel heap or as a set of ring buffers, implicit binary heaps, and splay trees [162] in the context of DES and path finding on grids. Their results indicate that for up to about 500 elements per PQ, ring buffers achieve the highest performance. At larger element counts, implicit heaps outperform the other approaches in their study. Their results suggest that higher performance is achieved by relying on multiple PQs, one for each agent or set of agents, compared to a single PQ holding all events.
Maximisation of Parallelism
The autonomous decision-making and mobility of agents can limit the exploitable parallelism in a simulation in two ways: first, variations in the computational intensity among the model segments may leave some processing elements idle. Second, the single-instruction multiple-thread execution model of GPUs requires divergent operations within a warp to be serialised.
The existing techniques to maximise the parallelism of ABS using accelerators can be roughly categorised as follows:
1. Multiple replications in parallel: Full utilisation of a massively parallel accelerator requires large numbers of computations that are independent and can thus be executed in parallel. If a simulation involves a sequence of mostly dependent computations, then the overheads for communication may outweigh the gains from parallelisation. Thus, techniques have been proposed to perform computations from multiple simulation runs in parallel.
Window-based event execution:
In simulations involving a discrete-event mechanism, only a proper subset of the simulated entities may require an update at a certain point in simulation time. Multiple authors have proposed gathering events across a window in simulated time, and executing these events in parallel. In effect, this approach forces a discreteevent approach into a timestepped execution. A key difference among the techniques lies in whether the simulation correctness is strictly maintained. 3. Speculative execution: As in general optimistic parallel and distributed simulation [53] , computations may be performed speculatively to improve hardware utilisation. A rollback mechanism is required to revert to a correct simulation state after erroneous computations. 4. Computation sorting: On a GPU, threads within a warp following divergent branches of the control flow are serialised. Since each agent performs actions based on its attributes, its current state as well as its environment and neighbouring agents, the computations at each simulation step are often diverse across agents. Some authors have proposed sorting of computations to minimise the serialisation resulting from branch divergence.
Multiple Replications in Parallel.
If an individual simulation run does not provide sufficient parallelism to fully utilise the available hardware, then a Multiple Replications in Parallel (MRIP) approach [142] can be applied, as shown by Shen et al. [160] : In their approach, multiple replications of a traffic simulation [181] are executed in parallel on a GPU. Thus, both the parallelism among agents as well as the parallelism across replications can be exploited. Laville et al. [104] implement a multi-agent simulation of microorganisms in soil for CPU/GPU in OpenCL. Each GPU thread manages one agent and each block is responsible for one simulation instance so that multiple simulation instances can run concurrently on one graphics card. The idea is applied to discrete-event simulations by Kunz et al. [100] , focusing on executing parameter studies comprised of multiple replications on a GPU.
In addition to exploiting the parallelism across replications, Li et al. [108] aim to avoid unnecessary redundant computations common to multiple replications. They propose a cloning mechanism for ABS on the GPU: In an ensemble simulation run comprised of multiple simulation instances, the computations that are common to multiple instances are only performed once. When the behaviour of an agent diverges between two simulation instances, a clone of the agent is created. Since the agent may affect other agents, cloning is performed according to the propagation of the effects of the original change in agent behaviour. Similarly to the technique "computation replication at partition boundaries" (cf. Section 4.2.2), cloning exploits the limited propagation speed of agent updates due to the limited sensing ranges and movement speeds of agents. If agents move and communicate arbitrarily across the entire simulation space, then the required number of clones is too large to achieve a performance benefit. Across cloned simulation instances, neighbour detection can be aggregated to improve the utilisation of the GPU resources. The benefit of cloning is limited when simulation runs diverge strongly, e.g., across multiple runs of a stochastic simulation using different seeds for random number generation. Recently, the cloning approach has been applied to large-scale cellular simulations on GPU clusters [192] .
Window-based Event Execution.
On a GPU, all threads in a warp execute the same sequence of instructions on different elements of data. If no input data are available for some of the threads within a warp, then the hardware utilisation is reduced. In ABS, this issue is particularly obvious when time advancement is performed in a discrete-event fashion to accommodate varying state update intervals among the agents. Then, the probability that many events share the same timestamp may be low. Thus, a simple parallelisation across the events at a certain point in model time may be insufficient. An approach to address this problem is to execute DES models in a timestepped fashion: All events within a certain time interval are executed in parallel. The lower bound of this time interval is usually referred to as Lower Bound on Time Stamp (LBTS), which is similar to Global Virtual Time in optimistically synchronised parallel and distributed simulation [90] . With a sufficiently large timestep size, hardware utilisation is increased. However, since dependencies between events are not considered, the simulation results may differ from a sequential execution.
A study comparing the performance of time advancement mechanisms for simulations on the CPU and the GPU is presented by Perumalla [143] . They study diffusion simulations running in a timestepped, discrete-event, and hybrid fashion. The GPU variant is implemented in the GPU programming language Brook [26] . While the GPU outperforms the CPU in the timestepped variant, it does not perform as well as the discrete-event implementation on the CPU. However, high speedup is achieved using the hybrid approach, where at each cycle, the minimum gap between two events is used as a timestep. The simulation time then advances according to this timestep.
Park and Fishwick [139, 141] present a method for queuing network simulation that executes a DES model in a timestepped fashion. The simulation time advances according to a fixed timestep size, but skips periods where no events occur. All events within the current timestep are executed without considering potential dependencies. Although the results are affected by their approach, the authors show that for a queueing network simulation, error bounds can be given. Other works assume a minimum time delta between an event and its creation (lookahead) to guarantee the correctness of the simulation results [7, 157, 196] . If lookahead is available, then a window can be determined within which events are independent, allowing for parallel execution without affecting the results.
The current time window is extended dynamically in work by Tang and Yao [169] to allow more events to be executed in parallel. After executing all events within the current window, their algorithm evaluates the first event in the event queue with a timestamp larger than the LBTS that can still safely be executed according to the lookahead.
Speculative Execution.
To maintain the correctness of the simulation results when executing in parallel on an accelerator, the simulator must consider the dependencies between state updates. In some of the approaches described above, a time window is determined where state updates cannot affect each other. If it is difficult to determine a time window of sufficient size to extract substantial parallelism, then a speculative (or optimistic) approach can be employed: State updates are performed without regard for correctness and rolled back if errors are detected.
Speculative execution of simulations on FPGAs has been first demonstrated by Model and Herbordt [121] . They make use of predictions of the interaction between particles, generating new events accordingly. Events may later be cancelled as a consequence of a false prediction.
Targeting GPUs, Li et al. [106] present an execution model that achieves high parallelism by speculative event execution. In an initial step, all events that may occur in the simulation are created. Subsequently, all events are executed in parallel. A scanning process detects and revokes causally invalid event executions: If an event leaves the simulation in an incorrect state according to a model-specific criterion, then the erroneous event and all events created by it are revoked recursively.
A more general approach for GPU-based discrete-event simulation is presented by Liu and Andelfinger [109] . An optimistic execution scheme based on the Time Warp algorithm [90] implemented in CUDA is shown to be beneficial at low event density in simulated time. To support rollbacks in case of erroneous computations, the authors show how the default random number generator in CUDA can be reversed computationally without storing additional data.
Computation
Sorting. The individual decision-making of the autonomous and heterogeneous agents often leads to diverse computations being executed at the same simulation step. Some approaches attempt to arrange the assignment of computations to the available threads on a GPU so that the serialisation caused by branch divergence within a warp is minimised.
In their DES engine on the GPU, Tang and Yao [169] sort events by type before execution, i.e., by the code associated with the event.
The idea is applied to GPU-based execution of multiple simulation instances at the same time by Kunz et al. [100] (cf. Section 4.4.1). If the simulation instances do not diverge too strongly, then many events of the same type are available across multiple instances, enabling efficient parallel execution.
Kofler et al. apply computation sorting to their ABS of mosquitoes [95] . In their simulator, a oneto-one mapping between agents and threads is used. Depending on their current state, agents may perform different operations, which can result in taking different control flow branches during the state updates. Thus, to reduce divergence among threads within a warp, agents are sorted by their current state, so that the state updates of adjacent agents share the same control flow.
In a recent work, Chimeh et al. [32] provide guidelines on formulating models to be executed in FLAME GPU so that branch divergence is minimised. They suggest modifying the state machines defining the agent behaviour to eliminate conditional branches by creating a new state or even a new agent type for each branch. The state updates of agents currently in the same state can then be executed without divergence.
Abstraction from Hardware Specifics
Compared to model development in CPU-based environments, development for accelerators can be cumbersome and error-prone. To avoid the need for modellers to gain deep expertise in programming for specific accelerators, several frameworks have been proposed that enable the specification of parts of the model structure and behaviour in a hardware-agnostic fashion. Since ABS models are commonly developed, modified, and extended in an iterative process, it is critical to avoid the need for modellers to consider low-level aspects of accelerators. The following works address the abstraction from hardware specifics:
1. Frameworks to support simulation development: Some authors have proposed generating partial model code to be executed on accelerators from domain-specific languages or the reliance on a library of pre-defined implementations of common simulation tasks and models. However, in these approaches, developing a full ABS will typically still require manual implementation work using comparatively low-level languages such as CUDA. Further, workload partitioning and assignment to different hardware devices is currently not considered by these approaches. 2. Unified memory access: Since in most cases, the CPU and hardware accelerators involved in a simulation operate on separate memory, resolving data dependencies may involve cumbersome explicit data transfers. A number of authors have proposed techniques to transparently access data in programs executed on heterogeneous hardware.
Frameworks to Support Simulation Development.
In the Flexible Large Scale Agent Modelling Environment (FLAME GPU) [151, 153] , agent states are specified using the state machine model X-Machine [48, 82] . Modellers define agent states in an XML-based format, while state transitions, i.e., the code segments describing the state updates, have to be manually specified as CUDA code. Generic facilities for exchanging messages between agents are provided by the framework. Traffic simulation has been presented as one of the use cases of FLAME GPU [79] .
The domain-specific language OpenABL [39] enables the specification of ABS models in a compact and platform-independent fashion. The OpenABL code is translated to an intermediate representation, from which code targeting different backends for sequential, parallelised, as well as GPU-based execution can be generated.
Another framework for GPUs and other many-core architectures is called Many-Core Multi-Agent System (MCMAS) [103] . MCMAS provides a high-level Java interface to OpenCL code as well as a set of pre-defined data structures and functions called plugins. To implement agent models, users either rely on plugins or define their own plugins as OpenCL code that can be called from Java code. The authors state that unlike FLAME GPU, in which models are targeted exclusively at the framework, the models defined in MCMAS can be reused by other agent-based simulators.
While FLAME and MCMAS both reduce the implementation work required to develop agentbased simulations targeting accelerators, these frameworks do not provide guidance or automation in distributing the simulation workload to the available hardware. Thus, manual experimentation is required to determine a suitable hardware mapping.
Unified Memory
Access. GPGPU frameworks such as OpenCL or CUDA require the user to either explicitly trigger data transfers between host and device memory, to explicitly select certain variables or memory regions for access from both CPU and GPU code [133] , or to annotate the program to manage data transfers [105, 190] . These manual steps complicate the development of agent-based simulations in heterogeneous environments. Some works aim to improve on this situation by transparently transferring required data between host and graphics memory. However, in languages based on C or C++, static alias analysis, i.e., determining which pointers refer to the same memory regions, is known to be undecidable [88] .
Jablin et al. [87, 88] presented the first fully automated data management system based on compilation steps and a runtime library. The developer formulates his program and GPU code as if all data reside in host memory and can be accessed both from the CPU and GPU. The proposed approach instruments the code to track accesses to different memory regions using code instrumentation and trapping of system calls. To avoid the need for static pointer analysis, memory accesses through pointers are tracked by the runtime library. In addition to transparently handling data transfers, CPU-GPU communication is optimised during compile time by re-ordering the program flow to reduce the alternation between computations and data transfers. Unnecessary data transfers are avoided by leaving data in the GPU memory until they are accessed from the host.
While the work of Jablin et al. could be applied to automate data transfers in heterogeneous ABS, the detection of parallelism is not covered. In Section 5, we sketch research directions towards automation in porting ABS to accelerators.
TOWARDS AN AUTOMATED OFFLOADING PROCEDURE
From the observations in the previous section, we can state that there is a vast range of techniques covering the main challenges of high-performance ABS on hardware accelerators. However, there exist only few ABS frameworks that support such accelerators. Since existing agent-based simulation and model implementations typically target purely CPU-based environments, there is a clear need for processes and tools to support the transition to an execution on accelerators [187] . More specifically, modellers and simulationists should be supported in the parallelisation and hardware mapping as much as possible. While methodologies have been proposed to systematise the steps of porting a simulation to a GPU [76, 118] , there is a lack of automated tools to support this process.
The problem of automatic parallelisation of general programs is a broad and active field of research [54] . Substantial successes have been achieved with respect to parallelisation of computationally intensive loops with predictable and mostly static control flow [66] , whereas the extraction of parallelism across complex and irregular programs is still a largely manual process. Common approaches include specifying software systems using formalisms that express parallelism explicitly [81, 102, 116] or annotating programs with parallelisation hints [42] . In essence, these approaches provide the compiler or parallelisation middleware with a dependency graph of the statements or code blocks within the original program.
Fortunately, many agent-based simulators and models roughly follow a common set of properties that simplify the extraction of parallelism. We identify the following constraints that can be leveraged to support the parallelisation process: performed to the agents' states and the environment state at t − 1, and only write accesses to the states at t. Thus, within an update, there are no read-after-write dependencies across agents. (3) Sense-Think-Act cycle: We assume that agent updates follow the well-known Sense-Think-Act cycle (cf. Section 3.1), with one such cycle per model.
With these constraints, a natural approach to parallelisation is to offload individual stages of a model's Sense-Think-Act cycle to an accelerator. For instance, in crowd simulations using the social force model, the Think stage comprised of the computation of the force affecting an agent may be performed by one thread of a GPU per agent.
In the following, we sketch an envisioned workflow and the required tools to support users in porting an existing CPU-based ABS to a system equipped with hardware accelerators. For the targeted simulator architecture, we assume a traditional master-worker scheme, with the host CPU acting as the master and assigning work to the available accelerators at each timestep.
Proposed Work Flow
The proposed semi-automated process is visualised in Figure 3 . To facilitate the automatic partitioning of the simulation source code into segments that can be outsourced to various types of hardware, we suggest manually annotating the source code according to the Sense-Think-Act paradigm. From that it follows that the smallest unit that can be offloaded to a hardware accelerator in our proposed framework is one of these three stages. Each of the stages is profiled in terms of memory and computational requirements. According to the gathered requirements, an optimisation problem is solved to generate a hardware assignment (rightmost part of Figure 3 ).
For simplicity, we assume that all data required by the stage fits into one of the accelerator's memory entirely. Otherwise, agents could be distributed across multiple accelerators or processed in batches, both implying additional communication costs.
Input.
The source code is annotated manually to signify the stages of the Sense-Think-Act cycle, e.g., in the form #pragma sense_begin, #pragma sense_end, and so forth. A simple example for a crowd simulation is given in Algorithm 1. In addition to the manual annotations, this clear separation may require refactoring of the simulation code. By parsing the annotated source code, the framework obtains a mapping between code and stages that will later be enriched with data from measurements. The second input is a specification of the available hardware. Each hardware device is characterised by its available memory, computational performance, and hostdevice data transfer overhead. The computational performance can be stated in terms of singlethreaded performance on CPUs, many-core CPUs, GPUs, and APUs. We assume that for an FPGA, ALGORITHM 1: Example for model code annotated with the stages of an agent update. only model stages for which implementations already exist are eligible for offloading. Thus, the computational performance of an FPGA is given with respect to specific model stages.
Memory Access
Profiling. Now that the source code is partitioned into offloadable stages and the capabilities of all the hardware components are known, the data dependencies of each stage are determined. Assuming a node in a graph represents one stage, then an edge in this graph represents a data dependency between these stages. The dependency can refer to either agent or environment data. The weight of the edge is the volume of the data that are accessed in the CPUbased simulator, i.e., that has to be transferred during offloading. Usually, the Think stage only has a dependency on the Sense stage within the same model and agent (intra-agent dependency), whereas the Sense stage might depend on the environment and on other agents' states (interagent dependency). Although we assume that an individual stage is not partitioned across multiple hardware devices, the amount of data gathered during the sense stage may vary over the course of the simulation. For instance, if agents form clusters in the simulation space, the number of neighbours per agent may increase over time. Thus, the data dependencies should be measured with respect to typical scenario conditions. To avoid exceeding the memory capacity of one of the considered hardware devices, the profiling can be repeated for a worst-case scenario.
Tools exist that are able to ascribe memory accesses performed during a program run to the source functions, data structures or threads [10] . For instance, the tool PinComm constructs a dynamic dataflow graph from instrumented program executions [73] . The annotations shown in Algorithm 1 allow us to map function names to the separate agent update stages. Thus, it is possible to obtain the amount of memory accessed within each stage. Once the graph describing the amount of memory accesses across stages is created, the implications in terms of memory copying of moving a certain stage to a hardware device can directly be evaluated. For example, if the Think stage is moved to the GPU and the Sense and Act stage remains on the host CPU, then the edges entering and leaving the Think node determine the data transfer overhead. The actual cost of this copy procedure can be obtained from the device specification or through measurements.
Computational Profiling.
In addition to the memory requirements of each stage, information about the computational characteristics of each stage is required. The estimated runtime could be inferred from hardware performance models [11, 31, 161, 195] . Approaches as those described in Section 4.1.2 can be applied to estimate the suitability of different agent update stages for execution on a certain accelerator. By characterising the workload incurred by each stage in terms of instruction mix and memory accesses as well as the number of agents, the performance of executing the full-scale simulation can be estimated [63, 96, 182, 193] . Alternatively, if the runtime of a stage is dominated by a sub-task that can easily be ported to an accelerator, measurements with respect to this task can be performed directly on the accelerator [19] .
Optimisation Problem.
Building on the graph that represents data dependencies, an optimisation problem of assigning stages to hardware types can be formulated, similar to the approach targeting embedded systems by Zhang et al. [194] . In essence, constraints are formulated so that each stage is assigned to the host or a device, resulting in an overall simulation schedule. Importantly, the optimisation problem must reflect the data location after each stage or timestep (e.g., Reference [116] ). For instance, to avoid data transfers, it may be more efficient to execute two subsequent stages on the same accelerator. The objective function of the optimisation problem is the overall runtime, i.e., the sum of all estimated execution times on the respective device and the incurred communication costs by distributing nodes of the dependency graph that are connected by an edge.
Output.
The output of the optimisation steps is a recommendation of which stages should be executed on which hardware device. It is then the task of the user to port the code of each stage so it can be executed on the assigned device. This might require specific knowledge, e.g., programming in VHDL or OpenCL and can therefore be an obstacle to some researchers. Given that some established simulation models are used by many researchers (e.g., a CSMA/CA model in network simulation or different car-following models in traffic simulation), a public repository of common simulation models could be created, similarly to the plugin approach used in MCMAS [103] . Researchers could download these crowd-sourced simulation models to enable parts of their own simulations to be run on heterogeneous hardware environments, and contribute their own model implementations. Such a repository would also reduce the need to estimate execution times and improve the optimisation results by allowing direct measurements on the potential target devices. Similarly, after porting a specific model stage, new measurements may be performed to provide the optimisation process with more accurate performance data.
Discussion.
In our approach, we take a pragmatic perspective: While the envisioned workflow is achievable based on existing building blocks, our assumptions may leave substantial performance potentials unexplored. In particular, by assuming that models and their stages are both executed as a series of dependent steps, we only exploit the inter-agent parallelism within each stage, while any parallelism across stages is not considered. In the following, we revisit the key challenges of ABS using hardware accelerators and sketch techniques from the literature that could be applied to maximise the performance benefits given our assumptions.
The hardware assignment (cf. Section 4.1) is the main focus of the proposed work flow. Above, we describe a static assignment using a functional decomposition. Still, the optimisation problem that determines the hardware mapping could be updated according to runtime measurements.
To minimise data transfer overheads that cannot be avoided (cf. Section 4.2), a bulk execution of multiple simulation runs would be feasible. The optimisation problem could be adapted so that the computational and memory requirements reflect those of each stage executed within multiple simulations runs at the same time. The output of the optimisation process would then be a schedule for an execution in a multiple replications in parallel (MRIP) fashion [100, 142] .
The technique of overlapping computations with data transfers seems challenging in our approach, since we assume a serialisation of the agent update stages. However, pre-fetching across stages may be performed by commencing data transfers once some agents have finished a stage.
Scattered memory accesses and the maximisation of parallelism (cf. Sections 4.3 and 4.4) could be addressed by providing a library of optimised functions and data structures for operations such as inter-agent communication or neighbour search (e.g., References [35, 103] ).
A certain degree of abstraction from hardware specifics (cf. Section 4.5) is achieved by the automated profiling and hardware mapping of our proposed workflow. Since each stage is executed on a single accelerator, facilities for unified memory access across all devices are not required. Instead, all agent data are updated locally on the accelerator and transferred automatically according to the schedule determined in the optimisation process.
Overall, the envisioned workflow is intended to rely on existing tools and techniques to allow researchers to exploit the hardware at their disposal with reasonable performance gains, while avoiding the need for costly and time-consuming manual optimisation steps as much as possible.
CONCLUSIONS
We presented a survey of the literature on agent-based simulation using hardware accelerators. We categorised existing approaches according to the key challenges of hardware assignment, minimisation of data transfer overheads, scattered memory accesses, maximisation of parallelism, and the abstraction from hardware specifics. Our survey provides modellers with an overview of techniques to execute a certain class of models on the available hardware. Methodology researchers are given a summary of the existing work, pointing out research gaps where further exploration is required. Our main observations are twofold: First, most of the literature in the past years has focused on GPUs. We expect a significant amount of work exploring agent-based simulations on FPGAs to appear in the near future. Second, while a vast amount of work has proposed techniques that allow for efficient execution of agent-based simulations, only few techniques have found their way into a unified framework. Thus, the burden of developing a simulation executable in a heterogeneous environment is carried by the modeller. Aiming to reduce the need for expertise in the programming for accelerators, we sketched our vision of a framework for automated hardware mapping and performance optimisation based on building blocks from the literature.
