Abstract-The paper presents modeling and simulation of energy consumption of two types of parallel applications: geometric Single Program Multiple Data (SPMD) and divide-and-conquer (DAC). Simulation is performed in a new MERPSYS (Modeling Efficiency, Reliability and Power consumption of multilevel parallel HPC SYStems using CPUs and GPUs) environment. The model of an application uses the Java language with extensions representing message exchange between processes working in parallel. Simulation is performed by running threads representing distinct process codes of an application, with consideration of process counts. Instead of running time-consuming calculations, their times are simulated using functions representing computational time dependent on input data sizes. The simulator considers performance and power consumption values for compute devices stored in its database. We verified the approach by running the two applications on up to 512 and 1024 processes, respectively, on a large cluster at the Academic Computer Center in Gdansk, demonstrating close agreement between simulated and measured results.
I. INTRODUCTION
IN TODAY'S High Performance Computing (HPC) landscape, performance and power consumption are key factors, and both are key concerns in the design of future systems. As of today, Tianhe-2 is the most powerful computing cluster on the TOP500 list, with performance of over 33 PFlop/s at 17.8 MW of power consumption. Tianhe-2 uses a hybrid architecture that couples multicore CPUs and accelerators within a single node. Examples of accelerators used today are GPUs or coprocessors such as Intel Xeon Phi. These are used in the top high performance clusters listed on the TOP500 list. The recently announced Tesla P100 offers 5.3 TFlop/s double-precision performance with a Thermal Design Power (TDP) of 300 Watts 1 . The Intel(R) Xeon Phi(TM) Coprocessor 7120P offers 1.2 TFlop/s theoretical peak double-precision performance 2 with a TDP of 300 Watts 3 . As the computational power of HPC systems comes from engaging more and more processing cores, and consequently from increasing the sizes of compute devices and the number of compute devices within a system, there is often a need to assess not only performance but also power consumption of such systems. A typical use case is when the user or system owner already knows several applications that are run in their contexts or environments and needs to assess performance and power consumption of an HPC system after an upgrade or when a new HPC system is to be purchased.
This paper focuses on a model and methodology for assessment of power and energy consumption of parallel applications adopted in the MERPSYS simulation environment 4 . This work follows modeling execution time of parallel applications in MERPSYS that is presented in [1] .
II. RELATED WORK
In terms of applications, energy consumption and its reduction is very important. Proper techniques involving load shifting and machine management may result in energy bill savings [2] . Paper [3] analyzes optimization of energy consumption for large virtualized service centers.
In work [4] , authors present a workflow that allows prediction of energy and power consumption of HPC applications using available data for a given application regarding power and energy consumption for specific values of nodes used. Then, based on a predictor, that uses the available data and proper interpolation, predicted values can be found. The paper shows a high degree of accuracy of the predictor for Hydro (computational fluid dynamics) and EPOCH (plasma physics simulation) benchmarks executed on the SuperMUC (near Munich, hence MUC) HPC platform.
In paper [5] , experiments with Co-Design Molecular Dynamics (CoMD) and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) codes were performed on a system with host Xeon E5 CPUs and Xeon Phi 5110P coprocessors, with measurements of energy and power for the whole system, the CPUs and the Xeon Phis. Results were used to obtain coefficients of a theoretical model with high confidence (R^2 coefficient). Results were presented for host frequency scaling as well as problem size scaling.
In paper [6] authors used neural networks to train models that predicted power and energy consumption when running high performance computing codes. It has been shown that after training, using various versions of codes, it is possible to predict power consumption and energy usage of CPUs and DIMMs with less than 5.5% error for LU factorization, Jacobi and matrix multiplication.
In work [7] authors have presented a detailed energy usage model for parallel master-slave applications, including modeling energy consumption of communication operations, based on execution times. Furthermore, the model was verified in a real environment with a master and 4 or 6 slaves for single or dual core configurations with error rate lower than 4% across the tested configurations.
In paper [8] authors investigated execution times and energy used when running MPI-only and hybrid MPI with OpenMP codes such as Parallel Multiblock Lattice Boltzmann or Gyrokinetic Toroidal Code. In particular, on the largest configurations tested with 8 nodes and a total of 32 cores, hybrid versions showed better execution times and energy consumption than MPI-only codes. Energy used was broken into CPU, memory, disk and motherboard energy.
In paper [9] authors, following analysis performed for Amdahl's law, present a general energy speed-up model in a parallel environment, for multicore systems. Furthermore, authors present specific results for three various power consumption models for a multicore CPU, based on the number of cores used: in the first one all cores are always on, the second with consideration of active cores only and the third with base power, active and idle core power values.
Modeling of power consumption of cluster nodes depending on the number of active threads, with verification against real measurements, was presented in [10] . This showed an idle system power consumption and a non-linear increase until a saturation point. Such a model has been incorporated into the MERPSYS simulator. In work [1] , modeling and verification of performance of parallel applications in MERPSYS was presented for computation of vector similarities along with verification in a real parallel environment.
For some types of applications, such as embarrassingly parallel ones, volunteer computing may be an alternative to clusters. Clusters and volunteer systems are different in terms of locality (centralized vs distributed), payer of infrastructure and electrical bill cost, involvement (or lack thereof) of society, security. Comparison of performance and power consumption as well as computational efficiency of cluster based systems and volunteer based systems which use distributed volunteers' computers is presented in [11] . For the latter, sets of volunteer hardware configurations were taken from BOINC projects and http://cpubenchmark.net/ benchmarks for relevant CPUs and TDPs were used. On average performance/power consumption ratio for modern CPU based clusters turned out to be 2-3 times better than for machines in volunteer based systems.
In [12] authors statistically analyze average CPU utilization and draw a conclusion that in the typical operating region of between 10 and 50% of utilization, energy efficiency is low and aim at designing energy proportional machines that would consume energy proportional to the executed work.
Modelling energy consumption of distributed systems can be useful for exploring the time-energy trade-off, defined in [13] . The authors consider the relationships between execution time, energy consumption and power draw for a set of chosen applications, both on shared memory devices such as Intel Xeon Phi coprocessor and Intel Xeon processor, as well as the Vesta IBM Blue Gene/Q cluster. Formal formulation of the multi-objective code optimization problem is presented, as well as evidence that the energy-performance trade-offs exist in practice.
Paper [14] analyzes energy and makespan trade-offs as a Pareto front in heterogeneous computing systems. Pareto fronts for the multi-objective optimization problem can be determined using mathematical modeling and linear programming [15] . However, such model has to closely match the characteristics of the real executions, which can significantly vary depending on the application model (i.e. synchronization scheme, communication overlapping). Additionally, the model may require defining the execution times of each type of task on each type of hardware beforehand. Thus, for more accurate modeling of various application executions on various hardware, it is important to develop a more flexible method which can give an approximate result with a possibility to quickly modify the application and hardware models.
Paper [16] considers tuning of application execution by proper tiling in the code (cache usage) and CPU frequency. It considers impact on the execution time and energy usage using an example of Poisson's equation with stencil computations.
In work [17] a methodology and experiments were presented for a distributed KernelHive [18] system that is aimed at parallelization of computations in a heterogeneous environment that consists of potentially several clusters, each with multicore CPUs and accelerators such as GPUs. Based on an imposed power consumption limit, an optimizer is able to select compute devices such that the total power consumption does not exceed the threshold and execution time is minimized, taking into consideration application configuration (including OpenCL's kernel NDRange configurations for GPUs and CPUs), network parameters etc. Dependence of execution times on power consumption limits was shown for a real environment.
As demonstrated in paper [4] , a model for prediction of power and energy usage in an HPC system can potentially be very desirable e.g. for budget estimation and prediction of peak requirements in terms of power consumption.
In work [19] authors presented the Energy Efficient Task Duplication Schedule (EETDS) algorithm, with grouping and energy efficient group allocation schedule phases for mapping DAGs (Directed Acyclic Graphs) onto a parallel environment. The algorithm was compared, in terms of energy consumption, to the Task Duplication Scheduling algorithm (TDS), the Non-Duplication Scheduling algorithm (NDS) and Energy-Efficient Non-Duplication Scheduling (EENDS) strategies for Gaussian Elimination and FFT for various values of the communication to computation ratio (CCR), demonstrating benefits of EETDS for larger CCR values.

856 PROCEEDINGS OF THE FEDCSIS. GDAŃSK, 2016

In paper [20] optimization of hybrid MPI/OpenMP parallel application execution is considered in terms of execution efficiency. Algorithms used consider Dynamic Concurrency Throttling (DCT) and Dynamic Voltage Frequency Scaling (DVFS), also in a combined setting. It is demonstrated that the proposed approach results in savings in energy usage with little loss in performance or even gains.
III. MOTIVATIONS AND PROBLEM STATEMENT
Motivations for simulation of execution of parallel applications on large systems stated for the MERPSYS environment, involving execution time and energy consumption, include:
1) Finding good configurations for running parallel applications i.e. specific compute devices in the available environment, numbers of nodes as well as application parameters such as data packet sizes etc.
2) Testing various potential (e.g. from a database of available components) hardware configurations for running a set of applications. MERPSYS allows instant substitution of one component by another e.g. exchanging a CPU or a GPU for another CPU or GPU model, similarly for interconnects.
3) Simulations of a set of applications in a distributed multi-level system composed of clusters and volunteer based systems in order to find approximately optimal hardware allocation, task mapping and scheduling.
In view of the aforementioned works and challenges, the goal of this paper is to define and verify a model of energy consumption of a parallel application run on a parallel system that would return acceptably accurate results from fast running simulations of parallel runs. Specifically, this requires finding the following function: energy_consumption(parallel application, parallel system, input data).
It should be noted that there are two possible ways of calculating energy consumption. In the first, only the energy used during computations on particular nodes, and only while these nodes are used by the application, is accounted for within the makespan of the application. In the second, energy of all nodes is integrated over the makespan irrespective of how many application processes/threads run there, considering idle energy consumption if no processes/threads are active. MERPSYS adopted the second method.
In essence, the function mentioned above can be expressed in terms of hardware count, thread count, time of effective application execution (stress time) and time of ineffective processor work (idle time) as follows:
which considers hardware used and power consumption in idle state multiplied by execution time as well as additional power consumption under stress when running a given number of threads on particular hardware multiplied by activity period.
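One form consistent with this description can be sketched as follows. All symbols are assumptions introduced here for illustration: $h$ ranges over the hardware component types used, $n_h$ is the number of components of type $h$, $P^{\mathrm{idle}}_h$ is the idle power of one such component, $\Delta P_h(\theta_h)$ is the additional power drawn under stress when running $\theta_h$ threads, $t_{\mathrm{exec}}$ is the execution time (stress time plus idle time) and $t^{\mathrm{stress}}_h$ is the activity period.

```latex
E \;=\; \sum_{h} n_h \left( P^{\mathrm{idle}}_h \, t_{\mathrm{exec}}
      \;+\; \Delta P_h(\theta_h)\, t^{\mathrm{stress}}_h \right),
\qquad
t_{\mathrm{exec}} = t^{\mathrm{stress}}_h + t^{\mathrm{idle}}_h .
```

Under this reading, a component that is never active still contributes its idle power over the whole makespan, which matches the second accounting method adopted by MERPSYS.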
IV. MODELING ENERGY CONSUMPTION
We modeled energy consumption in the Galera Plus supercomputer located in the Academic Computer Centre in Gdansk (CI TASK). This supercomputer consists of 192 computational nodes, each containing two six-core Intel Xeon processors. We used two models of parallel applications: a Single-Program-Multiple-Data application model and a Divide-And-Conquer application model.
Before energy modeling, we modeled the dependency of application execution time on the number of processors used for calculation. We validated our timing model using the MERPSYS simulation environment (described in the next section). In our simulation, we assumed usage of 1, 8, 27, 64, 125 ... 512 processes for the SPMD application and 1, 2, 4, 8, ... 1024 processes for the DAC application. The modeled times were in close accordance with the real execution times (see Figure 1) [21] .
In the first application, all computational nodes in use are almost equally loaded during the whole time of application execution, so energy consumption should be a simple product of execution time and the power used by the computational nodes involved in calculations. However, in our testbed environment only 32 nodes were assigned to experiments. We assumed in our model that when the modeled number of processes is smaller than 32, each process runs on a separate node and only a part of the computational nodes is used. If the modeled number of processes is equal to or greater than 32, the processes are distributed among all the available computational nodes, and all the computational nodes are used. So energy consumption in the SPMD application can be expressed as a sum of energy consumed by active nodes (E_an) and inactive nodes (E_in):
where the energy consumed by active and inactive nodes is evaluated as: The number of active nodes (N_an) and the number of inactive nodes (N_in) are simply:
N_proc is the number of processes in the application, N_total is the total number of computational nodes (here 32), P_an is the modeled power usage of one active node, and P_in is the modeled power usage of an inactive node. We measured that the power usage of Intel Xeon processors in an inactive node (P_in) equals approximately 50% of the maximum power usage (P_max), which is consumed when all the cores are active [10] . We simplified the dependence of power usage on the number of active cores to a piecewise linear (broken linear) function (see Figure 2).
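The SPMD relations described above (total energy as a sum over active and inactive nodes, with node counts bounded by the 32 available nodes) can be written compactly as follows. This is a sketch derived from the surrounding definitions rather than a verbatim statement of the model; the symbol $t$ for the application execution time is an assumption introduced here.

```latex
\begin{aligned}
E      &= E_{an} + E_{in}, \\
E_{an} &= N_{an}\, P_{an}\, t, \qquad E_{in} = N_{in}\, P_{in}\, t, \\
N_{an} &= \min(N_{proc},\, N_{total}), \qquad N_{in} = N_{total} - N_{an}.
\end{aligned}
```

For example, with $N_{proc} = 8$ and $N_{total} = 32$ this gives $N_{an} = 8$ and $N_{in} = 24$, while for any $N_{proc} \ge 32$ all nodes are active and $E_{in} = 0$.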
The model of energy consumption in the DAC application is much more complex. Computational nodes are unevenly loaded in consecutive steps of application. At the beginning all needed cores are active. In following steps every second core goes to an idle state (see Figure 3 ). Thus energy consumption must be evaluated in each step separately. It means that not only the number of active/inactive cores (and active/inactive nodes), but also the time of execution in each application step must be evaluated.
In the DAC application active and inactive energy is expressed by the following formulas:
where:
N is the number of steps in the application process, k is the index of a step, t(k) is the execution time of the k-th step, P_an(k) is the power used on an active node in the k-th step, P_in(k) is the power used on an inactive node in the k-th step, N_an(k) is the number of active nodes in the k-th step, N_in(k) is the number of inactive nodes in the k-th step, E_an(k) is the energy used on all active nodes in the k-th step, and E_in(k) is the energy used on all inactive nodes in the k-th step.
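Read together, these definitions suggest a per-step accumulation: E_an(k) = N_an(k) * P_an(k) * t(k), E_in(k) = N_in(k) * P_in(k) * t(k), summed over k = 1..N. The Java sketch below illustrates that reading; it is not the MERPSYS implementation. The class and method names, the sample numbers, and in particular the saturation point of the broken-linear power function are assumptions made for illustration (only the 50% idle level comes from the text).

```java
/** Illustrative sketch of a per-step energy model; all names are hypothetical. */
final class DacEnergySketch {

    /**
     * Broken-linear node power in watts. Idle power is taken as 50% of pMax,
     * as stated in the text. Power grows linearly with the number of active
     * cores up to an ASSUMED saturation point and stays at pMax afterwards.
     */
    static double nodePower(int activeCores, int saturationCores, double pMax) {
        if (saturationCores <= 0) {
            throw new IllegalArgumentException("saturationCores must be positive");
        }
        double pIdle = 0.5 * pMax;
        int effective = Math.max(0, Math.min(activeCores, saturationCores));
        return pIdle + (pMax - pIdle) * effective / saturationCores;
    }

    /**
     * Total energy in joules: the sum over steps k of
     * t(k) * (N_an(k) * P_an(k) + N_in(k) * P_in(k)),
     * where N_in(k) = totalNodes - N_an(k).
     */
    static double totalEnergy(double[] stepTime, int[] activeNodes,
                              double[] activePower, double[] inactivePower,
                              int totalNodes) {
        int steps = stepTime.length;
        if (activeNodes.length != steps || activePower.length != steps
                || inactivePower.length != steps) {
            throw new IllegalArgumentException("per-step arrays must have equal length");
        }
        double energy = 0.0;
        for (int k = 0; k < steps; k++) {
            if (activeNodes[k] < 0 || activeNodes[k] > totalNodes) {
                throw new IllegalArgumentException("invalid active node count in step " + k);
            }
            int inactiveNodes = totalNodes - activeNodes[k];
            double eActive = activeNodes[k] * activePower[k] * stepTime[k];
            double eInactive = inactiveNodes * inactivePower[k] * stepTime[k];
            energy += eActive + eInactive;
        }
        return energy;
    }

    public static void main(String[] args) {
        double pMax = 200.0;                  // hypothetical maximum node power [W]
        double pIn = nodePower(0, 12, pMax);  // inactive node
        double pAn = nodePower(12, 12, pMax); // fully loaded node
        // Hypothetical three-step run on 4 nodes; active nodes halve each step.
        double[] t = {10.0, 5.0, 5.0};
        int[] nAn = {4, 2, 1};
        double[] pActive = {pAn, pAn, pAn};
        double[] pInactive = {pIn, pIn, pIn};
        System.out.println(totalEnergy(t, nAn, pActive, pInactive, 4) + " J");
    }
}
```

The SPMD case corresponds to a single step in which the number of active nodes is min(N_proc, N_total).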
We are aware that the above model, where the total energy used for computations depends on the number of computational nodes and their usage, ignores the energy used by the whole infrastructure (e.g. cooling system), but this payload was beyond our research at this time.
V. SIMULATION ENVIRONMENT
We modeled time of execution and energy consumption in the MERPSYS simulation environment. MERPSYS enables modeling of a calculation environment (by drawing an architecture model diagram) and simulation of application execution in this environment (by writing an application model as a simulation program).
The architecture model is a graph diagram in which nodes model key architecture components, and edges model connections between components. As we modeled the Galera Plus supercomputer with homogeneous nodes, our diagram consisted of two nodes: one node modeling all computational components (all processors) and a second node modeling the internal Infiniband network connecting the computational components (see Figure 4). Next we specified the component instance count (i.e. the number of computational nodes in the modeled supercomputer). Afterwards MERPSYS looked up its component database and assigned timing parameters to the components.
The application modeling program is written in the Java language, with the use of a special simulator interface, accessible by the sim object. We can see a sample fragment of simulation program in Figure 5 . The simulation program is not the application itself. To create the simulation program we had to translate the application written in C programming language to the simulation Java language. However, the simulation program is much simpler than the corresponding application program. All computational routines are modeled as sim.Computation method invocations. Interprocess communication is modeled as sim.p2pCommunicationSend/sim.p2pCommunicationReceive or one2oneCommunicationSend/one2oneCommunicationReceive invocations. All the researcher has to do is to determine the data count and the computational routines complexity.
VI. EXPERIMENTS AND RESULTS
As we have mentioned above we modeled and simulated time of execution and energy consumption of two parallel applications (SPMD and DAC) in the Galera Plus supercomputer. The applications were written in C using MPI library. We compared the results of simulation with real application execution measured in this supercomputer.
A. Testbed Environment
The testbed consists of a number of identical computation nodes provided by the Academic Computer Center at Gdansk University of Technology in Poland. Each node is based on two Intel(R) Xeon(TM) CPU 2.27GHz processors (EM64T) with 6 physical processing cores with HyperThreading, 12MB cache, running Linux kernel version 2.6.32. Each node has 16 Gigabytes of RAM and the nodes are organized in a cluster architecture with a fast (QDR, 40Gbps) Infiniband interconnect. The power meters of the cluster are served by specialized, autonomous management subsystems: HP Integrated Lights-Out 3 (iLO 3) 5 .
B. Testbed Applications
The first of the tested applications is a geometric SPMD application. This kind of application can be used to solve problems such as weather prediction, heat distribution or other physical phenomena. The evaluated 3D geometric space is divided into many cuboidal regions, and each region is evaluated by a separate node. The evaluation process is repeated in many iterations; between iterations the computational nodes exchange data corresponding to region borders. Data is exchanged in 4 steps for each dimension. Considering the X dimension, in the first step even nodes send data to their right neighbours, next odd nodes send data to their right neighbours, next odd nodes send data to their left neighbours, and finally even nodes send data to their left neighbours (see Figure 6 for illustration).

Fig. 6. The schematic inter-process data exchange in the SPMD application
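The four-step exchange along one dimension can be illustrated with a small Java sketch that lists, for each step, which block sends to which neighbour. The class name, the 0-based block indexing and the phase numbering 0..3 are assumptions made for illustration; blocks at the boundary simply have no partner in the corresponding step.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of the even/odd four-step border exchange in one dimension. */
final class ExchangeScheduleSketch {

    /**
     * Returns {sender, receiver} pairs for one exchange phase:
     * phase 0: even blocks send right, phase 1: odd blocks send right,
     * phase 2: odd blocks send left,   phase 3: even blocks send left.
     */
    static List<int[]> phasePairs(int phase, int blocks) {
        if (phase < 0 || phase > 3) {
            throw new IllegalArgumentException("phase must be in 0..3");
        }
        boolean evenSends = (phase == 0 || phase == 3);
        int direction = (phase < 2) ? 1 : -1;
        List<int[]> pairs = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            boolean isEven = (b % 2 == 0);
            int neighbour = b + direction;
            // Only blocks of the sending parity with an existing neighbour take part.
            if (isEven == evenSends && neighbour >= 0 && neighbour < blocks) {
                pairs.add(new int[] {b, neighbour});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        int blocks = 4; // hypothetical number of blocks along the X dimension
        for (int phase = 0; phase < 4; phase++) {
            StringBuilder line = new StringBuilder("phase " + phase + ":");
            for (int[] p : phasePairs(phase, blocks)) {
                line.append(" ").append(p[0]).append("->").append(p[1]);
            }
            System.out.println(line);
        }
    }
}
```

Because senders and receivers in each phase have opposite parity, no block has to send and receive in the same phase, which is what makes blocking sends and receives safe in this scheme.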
Sample fragments of the SPMD application are shown below. At the beginning some common data variables are defined, followed by four auxiliary routines (getdata, setdata, compute_cell, and cell_to_rank). The main simulation logic is iterated in four nested for loops. The outermost loop iterates for an arbitrary number of steps, and the inner loops iterate over the three dimensions of the geometric space. After a process has computed its associated cuboidal region in the three dimensions, the process sends data to its neighbors using the scheme shown in Figure 6.

    ...
      // computes the value of the cell
    }

    int cell_to_rank(int x, int y, int z) {
      // returns the rank of a process
      // that owns cell (x, y, z)
    }

    main(int argc, char **argv) {
      ...
      // the main simulation loop
      for (t = 0; t < steps; t++) {
        for (i = myminx; i <= mymaxx; i++)
          for (j = myminy; j <= mymaxy; j++)
            for (k = myminz; k <= mymaxz; k++) {
              setdata(i, j, k, compute_cell(i, j, k));
            }
        // exchange data in X direction
        if (myblockx % 2) {
          // receive from left
          MPI_Recv(..., YZ_wall, cell_to_rank(myminx - 1, myminy, myminz), ...);
        } else {
          // send to right
          if (myblockx + 1 < procx)
            MPI_Send(..., YZ_wall, cell_to_rank(mymaxx + 1, myminy, myminz), ...);
        }
        ...
        // do the same for Y direction
        // and Z direction
      } // end of the iteration loop
      MPI_Finalize();
      exit(0);
    }

The second test application is a Divide-and-Conquer mergesort algorithm implementation. The first node, which gets the large data set, divides the data into two parts and sends one part to its free neighbor node. This process is repeated in parallel until the size of each partition reaches its limit. Then each node sorts its part of the data and "odd" nodes return the sorted fragments to their "parent" nodes. The "parent" nodes merge two sorted data fragments, and the process repeats until all data has flowed back to the first node, where it is merged into one sorted set (see Figure 7).
The DAC application code is shown below. For simplicity it is assumed that the size of the vector to be sorted is a power of 2. The same applies to the number of processes.

    int *mergeseq(int *arrayin, long length) {
      // the main function for a sequential
      // merge (iterative)
      for (; currentlength * 2 <= length; currentlength *= 2) {
        for (i = 0; i < length; i += 2 * currentlength)
          mergelocal(...);
        ...
      }
    }

    int main(int argc, char **argv) {
      ...
      // in the first step each process needs
      // to sort its part of the array
      arrayout = mergeseq(arrayin, ...);
      // now send the data to an upper process
      if (myrank % 2) {
        // then send the data
        // to process with rank myrank-1
        MPI_Send(...);
      }
      // now each process needs to check its
      // role in the divide-and-conquer tree
      // whether it should quit or receive data
      // from another process, merge and send
      // to another process
      int currentskip = 2;
      // the current skip between process
      // ranks as in the above scheme
      for (; currentskip <= proccount; currentskip *= 2) {
        if (!(myrank % currentskip)) {
          // then I am involved in the given step
          // this means that
C. Simulation Programs
Real calculations are not performed in the simulation program. Instead we invoke the sim.computation method, passing a string that describes computational complexity. This string is composed in the simulation program as a JavaScript function that returns the number of operations. However, we had to calibrate the result to the real application execution time measured in the testbed environment. The calibration is represented by the last factor (60.94) in the computationalComplexity function expression.

    String computationalComplexity = "function "
        + ConstVar.complexityFunctionName
        + "(" + ConstVar.parameters + ") {"
        + " return " + ConstVar.getDataSize + " * 60....
In the real MPI application each process is identified by an integer "rank". In the MERPSYS simulator we cannot identify a single process. Instead we can identify a "role" of a process (with a "tag" string). So we mapped the application process algorithm, which is based on individual process numbers, to a simulation program algorithm based on process group roles. We defined 7 roles for the SPMD application using relative process positions: "Center", "Left", "Right", "Top", "Bottom", "Front", and "Back". We replaced the four-step inter-process data exchange with a send-receive simulation to the neighbor process groups (see below):
    // first send data to all neighbors
    neighborTag = "center";
    if (!tag.equals(neighborTag) && centerCount > 0)
      sim.p2pCommunicationSend(wallSize, neighborTag);
    ...
    // then receive data from all neighbors
    neighborTag = "center";
    if (!tag.equals(neighborTag) && centerCount > 0)
      sim.p2pCommunicationReceive(neighborTag);
For the second application, we defined program roles as "levels" (from L0 to L10). We also had to define three computational complexity functions: InitComplexity, MergeSeqComplexity, and MergeLocalComplexity. Instead of a peer-to-peer communication send function we used a one-to-one communication send. In peer-to-peer communication, it is assumed that all pairs of processes communicate. In the DAC application one process sends data to one other process. At the same time, half of all the processes in the lower level send data to their corresponding processes in the upper level, so the time of one pair communication must be multiplied by sendersCount/2.

    for (level = 0; level < levelCount; level++) {
      thisTag = "L" + Integer.toString(level);
      if (tag.equals(thisTag)) {
        nextTag = "L" + Integer.toString(level + 1);
D. Power Measurement
We measured power usage at a single node of a real execution cluster. Power usage was measured in Watts every 10 seconds of the application execution. We repeated the execution three times for each assumed processes count and then we averaged the measured values. The results are shown in Table I and Table II. 
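As an illustration of how such periodic power samples can be turned into an energy figure, the Java sketch below multiplies each sample by the sampling interval and averages the result over repeated runs. The rectangle-rule integration, the class and method names, and the sample values are assumptions made for illustration; the text does not state how the measured energy was derived from the readings.

```java
/** Illustrative sketch: energy of one node from periodic power readings. */
final class PowerSamplesSketch {

    /** Rectangle-rule estimate: each sample is held for one sampling interval. */
    static double energyJoules(double[] samplesWatts, double intervalSeconds) {
        if (samplesWatts.length == 0 || intervalSeconds <= 0.0) {
            throw new IllegalArgumentException("need samples and a positive interval");
        }
        double sum = 0.0;
        for (double p : samplesWatts) {
            sum += p;
        }
        return sum * intervalSeconds;
    }

    /** Mean of the per-run energy estimates over repeated executions. */
    static double meanEnergyOverRuns(double[][] runs, double intervalSeconds) {
        if (runs.length == 0) {
            throw new IllegalArgumentException("need at least one run");
        }
        double total = 0.0;
        for (double[] run : runs) {
            total += energyJoules(run, intervalSeconds);
        }
        return total / runs.length;
    }

    public static void main(String[] args) {
        // Hypothetical readings in watts, one every 10 seconds, for three runs.
        double[][] runs = {
            {100.0, 110.0, 120.0},
            {100.0, 110.0, 150.0},
            {100.0, 100.0, 100.0}
        };
        System.out.println(meanEnergyOverRuns(runs, 10.0) + " J per node");
    }
}
```

With a 10-second sampling period, short runs yield only a handful of samples, so the estimate is coarse for small execution times.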
E. Simulation Results and Comparison
Having accurate time simulation results (see Figure 1), we could base energy simulation on solid foundations. We evaluated energy consumption in the MERPSYS simulator and compared the results to the measured energy consumption. However, as we measured energy consumption at one node only, we had to recalculate the measured results according to the model of the application. In the SPMD application, as all nodes assigned to the application are active all the time, it was easy to calculate the energy consumption in the whole experimental environment. We show the compared results in Table III. In the DAC application, however, nodes are unevenly active. We applied the model of activity shown in Figure 3, recalculated the real energy consumption of active and inactive nodes in each application step separately, and then summed the results. As the results depend not only on the really Table IV.
VII. CONCLUSIONS AND FUTURE WORK
In the paper we presented a way to model parallel SPMD and divide-and-conquer applications within the MERPSYS environment, including application and system models. Next, we presented verification of results obtained from the fast MERPSYS simulator against energy consumption that stemmed from consideration of power usage of real cluster nodes. We performed tests and calculations for up to 512 and 1024 processes for SPMD and divide-and-conquer applications, respectively, reaching a high degree of accuracy. This makes it possible to obtain results for these applications for other configurations, such as different input data sizes, without rerunning the real application and much faster than the latter.
Future work should cover a wider variety of software and hardware systems. Different vendors and configurations should be tested, as well as new simulated programs.
