Performing a tsunami or storm surge simulation in real time is a highly challenging research topic that calls for collaboration between mathematicians and computer scientists. One must combine mathematical models with numerical methods and rely on computational performance and code parallelization to produce accurate simulation results as fast as possible. Traditional modeling approaches require a lot of computing power and significant amounts of electrical energy; they also depend heavily on uninterrupted access to a reliable power supply. This paper presents a concept for developing suitable low power hardware architectures for tsunami and storm surge simulations based on cooperative software and hardware simulation. The main goal is to be able, if necessary, to perform simulations in situ and on battery power. For flood warning systems installed in regions with weak or unreliable power and computing infrastructure, this would significantly decrease the risk of failure at the most critical moments.
INTRODUCTION
Accurately predicting floods in endangered coastal regions precipitated by catastrophic geophysical events such as earthquakes, landslides, or hurricanes requires accurate numerical simulations that utilize two- or three-dimensional regional or global ocean models. The computational resources needed to run these models at a sufficient spatial resolution and in real time, i.e., with a hard upper bound for the time allocated to produce a usable solution, greatly exceed the capabilities of conventional workstations. This currently precludes the use of accurate simulation software in tsunami and flood warning systems installed in areas with weak or unreliable computational, communication, and power infrastructure, which is the case in many of the affected regions.
Currently, a number of less sophisticated approaches are offered for installation in flood warning systems. These solutions rely on one of the following techniques: (i) simulation at a coarser grid resolution, (ii) using simpler or less accurate numerical methods, (iii) searching in a database of precomputed scenarios, or (iv) running the simulation remotely (e.g., in a cloud). All these alternatives increase the risk of the flood prediction being either too late or too inaccurate, incurring a potentially high cost in terms of human lives and property damage.
In this paper, we present a concept for an affordable, reliable, and energy-efficient flood simulation system designed to mitigate the aforementioned problems of current systems. We analyze the requirements for such a system in terms of performance, power efficiency, and reliability with the ultimate goal of designing a combined hardware/software system capable of carrying out a flood simulation in parallel, entirely on battery power, using low power compute units.
In order to speed up the development of such a system, we rely on its simulation as a first step. This simulation must not only cover the functional properties but also provide an estimate of the system's energy consumption and other relevant non-functional properties. Even though our algorithm is highly scalable (Aizinger et al., 2013), we consider neither dedicated GPU-based compute units nor GPGPUs, as we currently do not have any reliable power models or simulation tools for them. In addition, we want to focus our study on processor architectures available in open-source form, which also precludes the use of commercially available GPUs.
Our simulation relies on the 2D/3D shallow-water solvers UTBEST/UTBEST3D, which are based on the discontinuous Galerkin finite element method (Aizinger and Dawson, 2002; Dawson and Aizinger, 2005). We decided on this type of numerical algorithm for a number of reasons: it runs on unstructured meshes, which allows the use of computational grids with optimal spatial resolution in the areas of interest; the code has shown excellent parallel scalability; the implementation possesses adaptive refinement capabilities and is thus capable of automatically increasing the mesh resolution in critical locations not known in advance; and the user can choose between different approximation orders on the same mesh, which provides a simple way to get the optimal accuracy for a given performance/energy cost.
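To illustrate how the approximation order affects the computational cost on a fixed mesh, the sketch below counts the unknowns of a discontinuous Galerkin discretization. It assumes triangular elements with a full polynomial space of degree p and three unknowns per element for the 2D shallow-water equations; the function names and the mesh size are hypothetical, and the actual UTBEST data layout may differ.

```python
def dofs_per_triangle(p: int) -> int:
    """Number of basis functions of the full polynomial space of degree p on a triangle."""
    if p < 0:
        raise ValueError("polynomial degree must be non-negative")
    return (p + 1) * (p + 2) // 2


def total_dofs(num_elements: int, p: int, num_unknowns: int = 3) -> int:
    """Total number of DG unknowns on the mesh.

    The default of three unknowns is an assumption for the 2D shallow-water
    equations (water height and two momentum components).
    """
    return num_elements * dofs_per_triangle(p) * num_unknowns


# Hypothetical mesh with 10,000 triangles, degrees 0 to 2.
for p in range(3):
    print(p, dofs_per_triangle(p), total_dofs(num_elements=10_000, p=p))
```

Since the number of unknowns grows quadratically with the degree while the mesh stays the same, the approximation order acts as a single knob for trading accuracy against time and energy.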
The rest of the paper is organized as follows: Section 2 provides an overview of the state of the art in the use of low power architectures for high performance computing and energy optimization. Section 3 presents our concept for the simulation-based evaluation of suitable low power architectures. A preliminary evaluation of available multi- and many-core hardware simulation environments and their suitability for our purposes is given in Section 4. The paper concludes with a summary and an outlook on future work.
RELATED WORK
There is a significant body of research in the field of utilizing low power architectures for high performance computing (HPC) and in the optimization of energy efficiency for HPC applications.
Rajovic et al. investigated the usage of low power ARM architectures and SoCs (Systems on Chip) as a means to reduce the cost of high performance computing (Rajovic et al., 2013a). They conclude that low power ARM-based SoCs have promising characteristics for high performance computing.
In 2013, Göddeke et al. presented a paper on energy-to-solution comparisons between different architectures for different classes of numerical methods for partial differential equations (Göddeke et al., 2013).
They showed that improvements in energy to solution and energy per time step of up to a factor of three are possible when using the ARM-based Tibidabo cluster (Rajovic et al., 2014) instead of an x86-based cluster. The x86 cluster used for the reference measurements was the Nehalem sub-cluster of the LiDOng machine provided by TU Dortmund (ITMC TU Dortmund, 2015).
A study comparing the performance as well as the energy consumption of different low power and general-purpose architectures was published by Castro et al. (Castro et al., 2013). Based on the Traveling Salesman problem (Applegate et al., 2011), they investigated time to solution and energy to solution for an Intel Xeon E5-4640 Sandy Bridge-EP, a low power Kalray MPPA-256 many-core processor (KALRAY Corporation, 2015), as well as for a low power CARMA board from SECO (NVIDIA Corporation, 2015a). The results show that the CARMA board and the MPPA-256 many-core processor achieve better results than the Xeon in terms of energy to solution. Concerning the time to solution, the Xeon performed better than the CARMA board but not as well as the low power MPPA-256 many-core processor.
A study of low power processors and accelerators in the context of energy-aware high performance computing was published in 2013 (Rajovic et al., 2013b). There, a number of different HPC microbenchmarks were used to determine the energy to solution. The architectures evaluated were the NVIDIA Tegra 2 (NVIDIA Corporation, 2015b) and Tegra 3 (NVIDIA Corporation, 2015c) SoCs. The results show that drastic energy-to-solution improvements are possible on the newer Tegra 3 SoC in comparison to the Tegra 2 SoC (a reduction of 67% on average). Furthermore, the authors conclude that the usage of integrated GPUs in low power architectures, such as Tegra 2 and Tegra 3, can improve the overall energy efficiency.
CONCEPT
Our goal is to develop an integrated hardware/software system that can satisfy (i) the functional requirements, i.e., computational performance, accuracy, and efficiency as well as (ii) the non-functional requirements, such as the energy efficiency and cost effectiveness. The project requirements are formulated in this slightly unusual vein, where the non-functional requirements are given comparable importance to the functional ones. This is caused by the fact that the operating environment of the system sets a number of rather rigid constraints for the entire solution.
The flow of events in this setting provides the most important boundary condition. As soon as remote sensors provide a warning and supply data for a geophysical event (e.g., an earthquake) capable of causing flooding, the time remaining until the landfall of the wave is clearly defined. The available information then needs to be processed as fast as possible in order to predict the magnitude of the flood and identify the affected areas. In addition, from this moment on, the availability of an uninterrupted power supply is critical in order to complete the flood simulation. Note that this hard upper bound for the time-to-solution can vary strongly on a case-by-case basis depending on a variety of factors (distance from shore, wave speed, area and topography of affected regions, etc.). If one also includes other non-functional requirements such as the size and speed of the available computational hardware and the power source (external or battery), this clearly motivates the need for highly adaptive flood modeling software in the warning system. Many of the regions at risk from such catastrophic events have neither good power infrastructure nor reliable communication networks. Thus the minimum set of requirements must include the following two: (i) the hardware platform must be able, if such a need arises, to complete the simulation on battery power, and (ii) the computation must be carried out locally.
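How strongly this upper bound depends on distance and water depth can be illustrated with the long-wave speed c = sqrt(g*h) of linear shallow-water theory. The following sketch assumes a constant depth along the propagation path, which real bathymetry does not satisfy; the function names and the scenario values are hypothetical and serve only as an illustration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def long_wave_speed(depth_m: float) -> float:
    """Phase speed of a linear shallow-water (long) wave over constant depth."""
    if depth_m <= 0.0:
        raise ValueError("depth must be positive")
    return math.sqrt(G * depth_m)


def arrival_time_s(distance_m: float, depth_m: float) -> float:
    """Travel time to shore, assuming a constant depth along the whole path."""
    return distance_m / long_wave_speed(depth_m)


# Hypothetical scenario: source 100 km offshore, uniform depth of 4000 m.
deadline_s = arrival_time_s(distance_m=100_000.0, depth_m=4_000.0)
print(f"upper bound for time-to-solution: {deadline_s / 60.0:.1f} min")
```

The usable computation budget is smaller than this travel time, since the warning must still be issued and acted upon before landfall.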
Since the algorithm can be executed in a highly parallel manner, we will focus on parallel computer architectures. As mentioned in the introduction, GPGPU computing is not in the scope of our work. We will first simulate the functional as well as non-functional properties of the flood simulation on different virtual hardware platforms. This enables us to make a preliminary decision on which hardware to use, thus saving resources and time during the development phase. With regard to the non-functional properties of the system, we need to know the total computation time and to obtain an estimate of the energy consumption for the entire run.
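As a minimal sketch of how such estimates could feed a preliminary hardware decision, the following fragment filters candidate platforms by a time-to-solution deadline and a battery energy budget and then picks the feasible candidate with the lowest energy. The platform names, the numbers, and the function `select_platform` are hypothetical placeholders, not results of our simulations.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PlatformEstimate:
    """Simulated non-functional properties of one candidate platform (hypothetical)."""
    name: str
    time_to_solution_s: float    # estimated total computation time
    energy_to_solution_j: float  # estimated energy for the entire run


def select_platform(
    candidates: Sequence[PlatformEstimate],
    deadline_s: float,
    battery_budget_j: float,
) -> Optional[PlatformEstimate]:
    """Return the feasible candidate with the lowest energy, or None if none is feasible."""
    feasible = [
        c for c in candidates
        if c.time_to_solution_s <= deadline_s
        and c.energy_to_solution_j <= battery_budget_j
    ]
    return min(feasible, key=lambda c: c.energy_to_solution_j, default=None)


# Illustrative numbers only; they are not measurements.
candidates = [
    PlatformEstimate("platform_a", time_to_solution_s=900.0, energy_to_solution_j=40_000.0),
    PlatformEstimate("platform_b", time_to_solution_s=1500.0, energy_to_solution_j=25_000.0),
    PlatformEstimate("platform_c", time_to_solution_s=600.0, energy_to_solution_j=90_000.0),
]

choice = select_platform(candidates, deadline_s=1200.0, battery_budget_j=50_000.0)
print(choice.name if choice else "no feasible platform")
```

Treating the deadline and the battery capacity as hard constraints and the energy as the quantity to minimize is one possible design choice; other objectives, such as the shortest time among feasible candidates, fit the same structure.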
In order to obtain optimal results for our simulations, we need to limit the number of possible computer architectures. For example, although the Intel x86 architecture is known to deliver high computational performance, it is not in our scope due to its high power consumption under heavy computational loads. In contrast, along with the advances in smartphones, the ARM CPU platform has evolved strongly and turned into a high performance, low power architecture. It enables today's devices to run on battery power for a long time under relatively high computational loads.
Another possible architecture is the well-known SPARC V8 architecture. Here we choose the LEON processor (an open-source SPARC processor developed by AEROFLEX Gaisler), as there have been successful investigations into the estimation of non-functional properties for it. Furthermore, the SPARC V8-based LEON processor is well suited to our purposes due to its fault tolerance, high configurability, and relatively cost-effective licensing. Therefore, we concentrate our investigations on the promising ARM platform as well as the LEON processor family.
PRELIMINARY EVALUATION OF SIMULATION ENVIRONMENTS
We intend to focus on building heterogeneous as well as homogeneous low power multi- and many-core architectures. Therefore, virtual environments that enable the simulation of such systems are of interest to us and have to be evaluated with respect to their suitability for our purposes. Three key aspects are important for our choice: simulation performance, the capability of power estimation and modeling, and, last but not least, the availability of low power processor models. The next paragraphs briefly describe available multi- and many-core simulation environments and discuss their suitability for our purposes; Table 1 gives an overview of the chief parameters of each simulation environment relevant in the context of our project.
Graphite (Miller et al., 2010) is an open-source multi- and many-core simulator that offers the possibility to simulate hundreds or even thousands of cores. Graphite is not a complete cycle-accurate simulator; instead, it uses different techniques to provide accurate performance results. The simulation environment offers processors, a memory subsystem, cache models, and a network for realizing interconnections. All these models use further analytical timing models to provide accurate results. Two processor models are supported: iocoom (an in-order core model with out-of-order memory completion) and simple (an in-order core model that adds all latencies). Power modeling for the processor, the caches, and the network (Kurian et al., 2014) is also supported. However, the focus of Graphite is not on embedded systems and low power architectures.
Based on the Graphite simulation infrastructure, Carlson et al. (Carlson et al., 2011) developed Sniper. Sniper enhances Graphite with an interval simulation approach, a more detailed timing model, and improvements in operating system modeling. It allows faster and more precise simulations than Graphite for exploring homogeneous and heterogeneous multi- and many-core architectures. Sniper supports power modeling and estimation using McPAT (Li et al., 2013) and custom DRAM power models (Heirman et al., 2012). Unfortunately, Sniper supports the x86 architecture only.
An environment that focuses on virtual prototyping of multi-processor systems on chip (MP-SoC) is SoCLib (SocLib Project, 2015). SoCLib provides a wide range of processor and peripheral models, for example MIPS32 and ARM. Furthermore, the usage of real-time operating systems like eCos is supported. This environment enables simulations at the cycle- and bit-accurate level. Since all models are written in SystemC (Accellera Systems Initiative, 2015), the ability to simulate at transaction level is provided as well. To supply power and energy estimations, Atitallah et al. (Atitallah et al., 2006) developed energy models for different hardware components that can be used in conjunction with SoCLib. In terms of simulation speed, the cycle-accurate level is very slow in comparison to the instruction-accurate level: Weaver and McKee (Weaver and McKee, 2008) showed that discrepancies of hours up to days are possible. As a consequence, cycle-accurate simulations are not an option for the simulation of large many-core systems in the context of our project.
A cycle-level multi- and many-core simulator for Network on Chip (NoC) architectures is HORNET (Lis et al., 2011). The simulator provides a variety of memory hierarchies and interconnect geometries as well as accurate power modeling. HORNET can operate in full multi-core mode, i.e., using a built-in MIPS core simulator in addition to the network model. Unfortunately, HORNET only offers one single-cycle in-order MIPS core. To increase simulation performance, a loose synchronization mechanism is supported; as a result, however, the accuracy of performance measurements suffers.
The gem5 simulation environment (Binkert et al., 2011) combines the benefits of the M5 (Binkert et al., 2006) and the GEMS (GEMS Development Team, 2015) environments. M5 is a configurable simulation environment offering multiple ISAs (instruction set architectures) as well as various CPU models. The CPUs can be configured to operate at different levels of detail and accuracy. In combination with GEMS, gem5 provides a detailed and flexible memory system as well as interconnection models. A wide range of instruction set architectures (e.g., x86, ALPHA, ARM, or MIPS) is supported by gem5. Recently, power modeling and estimation for low power ARM architectures has also become possible (Endo et al., 2015). However, this simulation environment is not designed to be purely instruction-accurate and targets low power architectures only partially.
QEMU (Bellard, 2005) is an emulator and virtual machine (VM) for the Intel x86 architecture that can also emulate and virtualize a variety of systems of differing architectures. When used as an emulator, QEMU operates at an instruction-accurate level. QEMU is typically used as a VM in hosting centers, but it can also be used as a debugging platform for embedded ARM systems. QEMU is not meant to be an extensible framework, even though it is possible to implement new platforms. The emulated ARM platforms include, e.g., the Nokia N810 tablet and the ARM Versatile Express for Cortex-A9. QEMU does not support power measurements or estimations.
The instruction-accurate simulation technology from Open Virtual Platforms (OVP) was developed for high performance simulation of low power multi- and many-core architectures. Simulations can run at hundreds of MIPS, often faster than real time. It is possible to debug applications that run on the virtual hardware and to analyze virtual platforms containing multiple processor and peripheral models. A wide range of older and current processor models is available, e.g., for the ARM, MIPS, Renesas, or PowerPC processor families. A number of predefined platforms (Altera Cyclone V SoC, MIPS Malta, etc.) are also available. Furthermore, the OVP simulation technology offers the ability to create new processor architectures and other platform components (Imperas Software Limited, 2014). The OVP simulator supports measuring instruction counts within a program, thus permitting in-depth analysis of specific code fragments. Power modeling for selected processors is also possible using OVP, as introduced in (Rosa et al., 2014).
CONCLUSION
In this paper, we present a concept that enables the determination of suitable low power multi- and many-core architectures for tsunami and storm surge simulation. The realization of the concept relies on a virtual environment that enables emulation and simulation of different low power multi- and many-core hardware architectures. For that reason, we conducted a preliminary investigation of available multi- and many-core simulation environments. Three aspects played an important role in our choice: simulation performance, the ability to estimate and model power consumption, and the availability of low power processor models in the simulation system. As can be seen from Table 1, OVP appears to be the best solution for the realization of our concept.
FUTURE WORK
Preliminary estimations of time-to-solution and energy consumption are necessary to improve design decisions when developing hardware architectures. This is especially true in the case of a power-aware flood warning system. As already discussed, cycle-accurate simulation is not an option for developing a many-core system due to the poor simulation speed. However, some work has already been done on utilizing high-level functional simulation enhanced by a mechanistic extension, thus including non-functional properties such as time and energy consumption on given hardware. Functional simulators, such as Instruction Set Simulators (ISS), can be considered interpreters of binary executables that simulate the internal registers of a given system. Our general approach is to estimate the energy- and time-to-solution by classifying a given instruction set into instruction categories according to their non-functional characteristics. This can be done by using microbenchmarks as well as existing data (Berschneider et al., 2014). The process of categorizing and collecting the information can be regarded as an initial training phase. Once it is complete, the information obtained for each category can be used to analyze an instruction mix with regard to its mean energy consumption, computation time, or other non-functional properties. The ISS is used to extract an instruction mix from a compiled binary executable for a given toolchain. This approach works well for simple embedded architectures with a simple pipeline design, no caches, and in-order execution. Evaluations were already performed for different, mostly compute-bound, algorithms, and the results have been very promising so far (mean relative estimation errors of under 5%). Since the flood simulation algorithm is also compute bound, this approach appears promising for our system as well.
However, further research is needed to include the effects of more complicated pipeline structures, one or more data/instruction caches, and possible out-of-order execution.
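The instruction-mix estimate described above can be sketched as a count-weighted sum of per-category costs. The categories, mnemonics, and cost values below are hypothetical stand-ins for the outcome of a training phase, and the linear model deliberately ignores pipeline, cache, and out-of-order effects; it is an illustration of the idea, not the implementation used in the cited work.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple

# Hypothetical per-instruction mean cost per category:
# (energy in nanojoules, time in nanoseconds).
CATEGORY_COST: Dict[str, Tuple[float, float]] = {
    "integer": (0.5, 10.0),
    "float": (1.2, 20.0),
    "memory": (2.0, 30.0),
    "branch": (0.8, 15.0),
}

# Hypothetical mapping from instruction mnemonics to categories.
MNEMONIC_CATEGORY: Dict[str, str] = {
    "add": "integer", "sub": "integer",
    "fadd": "float", "fmul": "float",
    "ld": "memory", "st": "memory",
    "ba": "branch", "bne": "branch",
}


def instruction_mix(trace: Iterable[str]) -> Counter:
    """Count executed instructions per category (the instruction mix)."""
    return Counter(MNEMONIC_CATEGORY[m] for m in trace)


def estimate(mix: Mapping[str, int]) -> Tuple[float, float]:
    """Return (energy in nJ, time in ns) as count-weighted sums of category costs."""
    energy = sum(n * CATEGORY_COST[c][0] for c, n in mix.items())
    time = sum(n * CATEGORY_COST[c][1] for c, n in mix.items())
    return energy, time


# A short made-up trace, as an instruction set simulator might report it.
trace = ["ld", "ld", "fmul", "fadd", "st", "add", "bne"]
mix = instruction_mix(trace)
energy_nj, time_ns = estimate(mix)
print(dict(mix), energy_nj, time_ns)
```

Because the estimate only needs instruction counts per category, it can be evaluated at the speed of the functional simulation.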
