Abstract. Driven by the emerging requirements of High Performance Computing (HPC) architectures, this work focuses on the interplay between the computational and energetic aspects of a Four Dimensional Variational (4DVAR) Data Assimilation algorithm based on Domain Decomposition (named DD-4DVAR). We report first results on the energy consumption of the DD-4DVAR algorithm on embedded processors, together with a mathematical analysis of the energy behaviour of the algorithm in which the architecture characteristics are treated as variables of the model. The main objective is to capture the essential operations of the algorithm that exhibit a direct relationship with the measured energy. The experimental evaluation is carried out on a set of mini-clusters made available by the Barcelona Supercomputing Center.
Introduction and Motivations
Data assimilation (DA) is an uncertainty quantification technique by which measurements and model predictions are combined to obtain an accurate representation of the state of the modeled system [5, 7]. Due to the scale of the forecasting area and the number of state variables used to describe the ocean or the atmosphere for climate or weather predictions, DA applications are large scale problems that should be solved in near real-time. This requires designing and developing DA algorithms that exploit High Performance Computing (HPC) environments. During the last 20 years, parallel algorithms for DA have been investigated by a number of federal research institutes and universities. Up to now, the main efforts towards the development of parallel 4DVAR DA systems have been made in numerical weather prediction applications, namely by the ECMWF (European Centre for Medium-Range Weather Forecasts) in Reading (UK) and by the NCAR (National Center for Atmospheric Research) in Colorado (USA). In this paper, we employ a 4DVAR algorithm described in [1, 6], named DD-4DVAR, based on a Domain Decomposition approach. Different approaches to take full advantage of emerging HPC architectures are described in [3, 9, 10, 11]. In the model we employ, parallelism is achieved by dividing the global problem into multiple local 4DVAR DA sub-problems solved across processors. The global solution is obtained by collecting the local minima. The sub-problems are handled by a slightly modified 4DVAR algorithm, custom implemented on an ARM-based low-energy node with the aim of minimizing the overall energy-to-solution experienced by the application. The performance and energy cost of a parallel algorithm executing on HPC systems involve different trade-offs, depending on how many processors the algorithm uses, what characteristics these processors have, and the structure of the algorithm.
Due to the interest of the HPC community in low-power architectures such as the ones used in smartphones and tablets [12], we report in this paper the first results on the energy consumption of the DD-4DVAR algorithm on embedded processors. Note that our approach addresses the problem in the spirit of scalability analysis of parallel algorithms, as distinct from practical performance analysis on a specific architecture. We provide a mathematical analysis of the energy behaviour of the DD-4DVAR algorithm as a function of the architecture characteristics of the platforms on which it is executed. The main objective is to capture the essential operations in the algorithm that exhibit a direct relationship with the measured energy. Such an analysis will enable predicting the energy requirements of the DD-4DVAR code, provided that a set of architecture-dependent parameters is available, as well as understanding its energy breakdown, which may in turn underpin a systematic approach to combined performance/energy optimization. The experimental evaluation is carried out on a set of ARM-based platforms made available by the Barcelona Supercomputing Center in the context of the Mont-Blanc European project [13]. The evaluation, aimed at understanding the energy breakdown and the related scalability issues, points out the importance of the interplay between parallel performance and energy optimization.
The DD-4DVAR Computational Kernel
Hereafter we provide a concise formalization of the DD-4DVAR model we implemented in Algorithm 1 [1]. Let $t_k$, $k = 0, 1, \ldots, n$, be a sequence of observation times and, for each $k$, let
$x_k \in \mathbb{R}^N$ (1)
be the vector denoting the state of a sea system, evolving according to a forecasting model that maps $\mathbb{R}^N \to \mathbb{R}^N$. At each time step $t_k$, let
$y_k = H_k(x_k)$ (2)
be the observations vector, where $H_k : \mathbb{R}^N \to \mathbb{R}^p$ is a non-linear interpolation operator collecting the observations at time $t_k$.
The aim of the DA problem is to find an optimal tradeoff between the current estimate of the system state (background) defined in (1) and the available observations $y_k$ defined in (2). Let
$\Omega = \bigcup_{i=1}^{N_{sub}} \Omega_i$ (3)
be an overlapping decomposition of the physical domain $\Omega$ such that $\Omega_i \cap \Omega_j = \Omega_{ij} \neq \emptyset$ if $\Omega_i$ and $\Omega_j$ are adjacent, where $\Omega_{ij}$ is called the overlapping region [1].
For a fixed time $t_k = t_0$, according to this decomposition, the DD-4DVAR computational model is a system of $N_{sub}$ non-linear least squares problems described in (4)-(5), where $J_i$ in (5) is called the cost function and $x^{DA}_{0i}$ in (4) is the analysis (i.e. the estimate of the vector $x_{0i}$ at time $t_0$). The variables $x_{0i}$ and $y_{ki}$ are the vectors $x_0$ and $y_k$ in (1) and (2) restricted to the subdomain $\Omega_i$; $R_i$ and $B_i$ are the covariance matrices whose elements provide the estimates of the errors on $y_{ki}$ and on $x_{0i}$, respectively. Let $d = [y_k - H(x_k)]$ be the misfit. By using the linearization of $H$ such that $H(x + \delta x) \simeq H(x) + \mathbf{H}\,\delta x$, where $\mathbf{H}$ is the matrix obtained by the first order approximation of the Jacobian of $H$, and by a suitable change of variable, the cost function in (5) is rewritten as a function $J_i(v_i)$ in the form (6).
The minimum of the cost function $J_i$ in (6) is computed by the L-BFGS method [14], a quasi-Newton method. This requires computing the gradient $\nabla J_i(v_i)$, in which $G^T_{ki}$ denotes the adjoint operator of $G_{ki}$.
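A quasi-Newton method such as L-BFGS needs a routine that returns both the cost and its gradient, with the gradient built from the adjoint operators. The following is a minimal sketch under stated assumptions: it uses the standard preconditioned incremental form $J(v) = \frac{1}{2} v^T v + \frac{1}{2} \sum_k (G_k v - d_k)^T R_k^{-1} (G_k v - d_k)$, which may differ from (6) (for instance, it has no overlap terms), it takes each $R_k$ to be diagonal, and it stores the operators as small dense matrices. The function names `matvec`, `rmatvec` and `cost_and_gradient` are hypothetical and are not taken from the DD-4DVAR code.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Product with the transpose (adjoint) of A: returns A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def cost_and_gradient(v, G, d, r_diag):
    """Cost J(v) and its gradient for the assumed incremental form.

    G, d and r_diag are lists over the observation times k: G[k] is a matrix,
    d[k] is the misfit vector and r_diag[k] is the diagonal of R_k
    (R_k is assumed diagonal in this sketch).
    """
    J = 0.5 * sum(vi * vi for vi in v)
    grad = list(v)
    for G_k, d_k, r_k in zip(G, d, r_diag):
        residual = [gv - dk for gv, dk in zip(matvec(G_k, v), d_k)]
        weighted = [res / r for res, r in zip(residual, r_k)]  # R_k^{-1} residual
        J += 0.5 * sum(res * w for res, w in zip(residual, weighted))
        adjoint_term = rmatvec(G_k, weighted)                  # G_k^T R_k^{-1} residual
        grad = [g + a for g, a in zip(grad, adjoint_term)]
    return J, grad


# Placeholder data for two observation times and a two-component control vector.
G = [[[1.0, 2.0], [0.0, 1.0]], [[0.5, 0.0], [1.0, 1.0]]]
d = [[1.0, -1.0], [0.5, 2.0]]
r_diag = [[1.0, 2.0], [0.5, 1.0]]
v = [0.3, -0.2]
J, grad = cost_and_gradient(v, G, d, r_diag)
print(J, grad)
```

Returning the cost and the gradient from a single routine lets the residuals be computed once per evaluation, which is the usual interface expected by L-BFGS implementations.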
Algorithm 1 The DD-4DVAR algorithm on each subdomain
% compute the misfit
4: Define $R_{ki}$ starting from the observed data $y_{ki}$
5: Define $V_i$ starting from a temporal sequence of historical data {x
Energy analysis of the algorithm
In this section we set a DD-4DVAR algorithm configuration and we perform a mathematical analysis of the energy behaviour of the algorithm. For the DD-4DVAR algorithm configuration we assume:
- $N$, defined in (1), is the dimension of the problem, with $N = n_x \times n_y \times n_z = n \times n \times 3$ (this does not affect the generality), where $n \in \mathbb{N}$, $n > 1$;
- a 2D decomposition along the x-axis and the y-axis is used, such that each subdomain has dimension $\frac{n}{p} \times \frac{n}{p} \times 3$, where $p \in \mathbb{N}$, $p > 1$. Then $N_{sub}$, the number of subdomains in (3) (which constitutes the domain decomposition), is $N_{sub} = p^2$ (9);
- the algorithm is implemented on a parallel architecture employing $nproc$ processors such that $nproc = N_{sub}$, i.e. from (9), we are assuming $nproc = p^2$.
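As an illustration of this configuration, the following minimal sketch computes the problem dimension, the subdomain size and the number of subdomains for given $n$ and $p$. It assumes that $p$ divides $n$, that each subdomain has $n/p$ points along x and along y and keeps all 3 levels along z, that overlap regions are ignored, and that one processor is assigned to each subdomain. The function name `dd_configuration` is hypothetical and is not part of the DD-4DVAR code.

```python
def dd_configuration(n, p):
    """Return (N, subdomain_shape, n_sub) for an n x n x 3 grid split p x p.

    Assumptions of this sketch: p divides n, overlap regions are ignored,
    and one processor is assigned per subdomain (nproc = n_sub).
    """
    if n <= 1 or p <= 1:
        raise ValueError("n and p must be integers greater than 1")
    if n % p != 0:
        raise ValueError("this sketch assumes that p divides n")
    N = n * n * 3                          # global problem dimension
    subdomain_shape = (n // p, n // p, 3)  # points per subdomain along x, y, z
    n_sub = p * p                          # 2D decomposition along x and y
    return N, subdomain_shape, n_sub


N, shape, n_sub = dd_configuration(64, 4)
print(N, shape, n_sub)
```

For the fixed size $n = 64$ used in the experiments, varying $p$ in this sketch shows how quickly the local problem shrinks as processors are added.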
Concerning the energy model, we assume that [8] :
-the energy consumption is additive and it is essentially proportional to the respective activity intensity in each component of the computing architecture, in terms of compute operation count, exchanged messages, memory accesses, plus a static energy contribution which is not affected by the activity and only depends on the considered time interval.
Based on the above assumption, we can write the energy breakdown as:
$E^{HC}(p, n) = E_{comp}(p, n) + E_{mem}(p, n) + E_{mes}(p, n) + E_{static}(p, n)$ (10)
where the superscript HC denotes the dependency on the computing architecture, and
- $E_{comp}(p, n)$ is the energy for computation, defined in (11), where $E_d$ is a hardware constant [4], $\mu_{comp}(p, n)$ is the number of computations and $f$ is the frequency;
- $E_{mem}(p, n)$ is the energy for memory accesses, defined in (12), where $E_m$ is the energy consumed for a single memory access (both read and write) and $\mu_{mem}(p, n)$ is the number of memory accesses;
- $E_{mes}(p, n)$ is the energy for message transfers, defined in (13), where $E_t$ is the energy consumed for a single message transfer between the processors and $\mu_{mes}(p, n)$ is the number of message transfers at all processors;
- $E_{static}(p, n)$ is the static energy, defined in (14), where $E_l$ is a hardware constant [4] and $T_{active}(p, n)$ is the execution time for performing the whole algorithm.
Under condition (15), which requires that the run stays within a single computational node ($p < p_{max}$) and that the whole data fit in cache, and by analyzing the time complexity of Algorithm 1, we can estimate the order of magnitude of the energy consumption by the following result.
Theorem 1. Under assumptions (10), (11)-(14) and (15), the estimate (16) holds, where $E^{HC}(p, n)$ denotes the energy consumption defined in (10), $C^{HC}(p)$ is defined in (17), and $t_{flop}$ denotes the unit time required by each processor to execute one floating point operation.
Proof: Let $S_i(p, n)$ and $V_i(p, n)$ denote the number of floating point values exchanged and the number of floating point computations at each iteration of Algorithm 1, respectively; they are proportional to the surface area and to the volume of each subdomain, as given in (18)-(19).
Then $\mu_{comp}(p, n)$, $\mu_{mem}(p, n)$ and $\mu_{mes}(p, n)$ are given by (20)-(22).
We also assume that $T_{active}(p, n)$ is the execution time for performing $V_i^2(p, n)$ floating point operations.
Then, from (10), (18)-(19) and (20)-(22), $E^{HC}(p, n)$ is obtained as a sum of terms ordered as in (10).
Since we run on a single computational node (i.e. $p < p_{max}$, as expressed in (15)), no communications are involved, so the third term can be neglected. From qualitative observations, we can assume that the second term can also be neglected because the whole data fit in cache (as expressed in (15)), and therefore a negligible number of accesses to the main memory is performed. Then (16) follows.
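The additive breakdown and the simplification used in the proof can be sketched as follows. This is an illustrative model only: each dynamic term is taken as a unit energy times an activity count, the static term as $E_l$ times the execution time, and the computation term uses a single per-operation energy `e_op`, a hypothetical parameter that stands in for whatever combination of $E_d$ and $f$ appears in (11). The function name, the two boolean flags and all numerical values below are placeholders, not measurements.

```python
def energy_breakdown(mu_comp, mu_mem, mu_mes, t_active,
                     e_op, E_m, E_t, E_l,
                     single_node=False, fits_in_cache=False):
    """Additive energy model: computation + memory + messages + static.

    single_node=True drops the message term (no inter-node communication);
    fits_in_cache=True drops the memory term (main memory accesses are
    treated as negligible). Both simplifications mirror the proof's argument.
    """
    e_comp = e_op * mu_comp
    e_mem = 0.0 if fits_in_cache else E_m * mu_mem
    e_mes = 0.0 if single_node else E_t * mu_mes
    e_static = E_l * t_active
    terms = {"comp": e_comp, "mem": e_mem, "mes": e_mes, "static": e_static}
    terms["total"] = sum(terms.values())
    return terms


# Placeholder activity counts and hardware constants (not measured values).
counts = dict(mu_comp=1e6, mu_mem=1e5, mu_mes=1e3, t_active=0.5)
hw = dict(e_op=2e-9, E_m=1e-8, E_t=1e-6, E_l=2.0)
full = energy_breakdown(**counts, **hw)
reduced = energy_breakdown(**counts, **hw, single_node=True, fits_in_cache=True)
print(full["total"], reduced["total"])
```

Keeping the four terms separate makes it easy to see which contribution dominates for a given platform before any term is dropped.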
Definition 1 (Energy Variation parameter) We denote with Energy Variation parameter the ratio
The following result holds:
Proposition 1 For a fixed architecture and under the hypotheses of Theorem 1, (26) holds for $p_2 \geq p_1$.
Proof: From (24) and (16) for a fixed value of n, it is
We observe that, from (27), it is
which gives:
From (28) and (17) it is
Since, for a fixed architecture, the values of $E_d$, $E_l$ and $t_{flop}$ are also fixed, it is
Due to the better conditioning of the smaller problems, it is $N_{L-BFGS,p_1} > N_{L-BFGS,p_2}$ [2]. Then (26) holds.
Remark 1
We observe that, if (15) is not satisfied, then $C^{HC}(p)$ also includes $E_{mes}$, which increases as the number of processors increases. In that case, for $p_2 > p_1$, it is:
which gives
Experimental results
The proposed approach is validated on a case study based on the linear Shallow Water Equation (SWE) for n = 64, i.e. we consider a fixed size configuration of the DD-4DVAR algorithm and we discuss results obtained by varying p.
The experiments have been conducted on architectures available at the Barcelona Supercomputing Center (BSC), and the power measurements have been enabled by the Mont-Blanc computing environment [13].
In Table 2 .
Due to the time complexity of the computation, for each Megabyte of cache, the value of $n_C$, which is independent of the computing architecture, is $n_{C,1} = \left[ \left( \frac{1048576}{8 \cdot 3} \right)^{1/6} \right] = 6$, where $[\cdot]$ denotes rounding to the nearest integer (the unrounded value is about 5.93). The JetsonTx1 and Mont-Blanc platforms instead, with 2 Megabytes and 1 Megabyte of cache respectively (see Table 1), do not satisfy (15). In fact, $n^{JT}_C = 2 \cdot n_{C,1} = 12$ and $n^{MB}_C = 1 \cdot n_{C,1} = 6$ for the JT and MB respectively, both smaller than $n = 64$. In these cases, the upper bound in (30) holds, as confirmed by the results in Table 3 and Table 4.
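The arithmetic behind the per-Megabyte bound can be checked with a short sketch. Reading the factor 8 as the number of bytes per stored value and the factor 3 as the number of levels along z is an assumption, as are the function and parameter names; only the numbers 1048576, 8, 3 and the exponent 1/6 come from the text.

```python
import math


def cache_bound(cache_bytes, bytes_per_value=8, n_levels=3, exponent=6):
    """Real-valued n such that bytes_per_value * n_levels * n**exponent equals cache_bytes."""
    return (cache_bytes / (bytes_per_value * n_levels)) ** (1.0 / exponent)


one_mb = cache_bound(1048576)
print(one_mb, math.floor(one_mb), round(one_mb))
```

The unrounded value is about 5.93, so the value 6 quoted for one Megabyte corresponds to rounding to the nearest integer, while truncation would give 5. In either case the bound is far below the problem size $n = 64$.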
Conclusions
We introduced an energy analysis of the DD-4DVAR algorithm for data assimilation problems. An implementation of the algorithm was evaluated on some prototype ARM-based platforms made available by the Barcelona Supercomputing Center. We analyzed the energy behaviour of the algorithm as a function of several architecture characteristics. A preliminary experimental evaluation confirmed the estimates provided by our analysis on a fixed size problem, varying the number of processors. As a future development, we aim at scaling up the methodology by demonstrating energy-driven parallelization approaches on production-grade ARM-based HPC clusters.
