New multi-core CPU and GPU architectures promise high computational power at a low cost if suitable computational algorithms can be developed. However, parallel programming for such architectures is usually non-portable, low-level and error-prone. To make the computational power of new multi-core architectures more easily available to Modelica modelers, we have developed the ParModelica algorithmic language extension to the high-level Modelica modeling language, together with a prototype implementation in the OpenModelica framework. This enables the Modelica modeler to express parallel algorithms directly at the Modelica language level. The generated code is portable between several multi-core architectures since it is based on the OpenCL programming model. The implementation has been evaluated on a benchmark suite containing models with matrix multiplication, Eigen value computation, and stationary heat conduction. Good speedups were obtained for large problem sizes on both multi-core CPUs and GPUs. To our knowledge, this is the first high-performing portable explicit parallel programming extension to Modelica.
Introduction
Models of large industrial systems are becoming increasingly complex, causing long computation time for simulation. This makes is attractive to investigate methods to use modern multi-core architectures to speedup computations.
Efficient parallel execution of Modelica models has been a research goal of our group for a long time [4] , [5] , [6] , [7] , involving improvements both in the compilation process and in the run-time system for parallel execution. Our previous work on compilation of dataparallel models, [7] and [8] , has primarily addressed compilation of purely equation-based Modelica models for simulation on NVIDIA Graphic Processing Units (GPUs). Several parallel architectures have been targeted, such as standard Intel multi-core CPUs, IBM Cell B.E, and NVIDIA GPUs. All the implementation work has been done in the OpenModelica compiler framework [2] , which is an open-source implementation of a Modelica compiler, simulator, and development environment. Related research on parallel numeric solvers can for example be found in [9] .
The work presented in this paper presents an algorithmic Modelica language extension called ParModelica for efficient portable explicit parallel Modelica programming. Portability is achieved based on the OpenCL [14] standard which is available on several multi-core architectures. ParModelica is evaluated using a benchmark test suite called Modelica PARallel benchmark suite (MPAR) which makes use of these language extensions and includes models which represent heavy computations. This paper is organized as follows. Section 2 gives a general introduction to Modelica simulation on parallel architectures. Section 3 gives an overview of GPUs, CUDA and OpenCL, whereas the new parallel Modelica language extensions are presented in Section 4. Section 5 briefly describes measurements using the parallel benchmark test suite. Finally, Section 6 gives programming guidelines to use ParModelica, and Section 7 presents conclusions and future work.
GPU Architectures, CUDA, and OpenCL
Graphics Processing Units (GPUs) have recently become increasingly programmable and applicable to general purpose numeric computing. The theoretical processing power of GPUs has in recent years far surpassed that of CPUs due to the highly parallel computing approach of GPUs. However, to get good performance, GPU architectures should be used for simulation of models of a regular structure with large numbers of similar data objects. The computations related to each data object can then be executed in parallel, one or more data objects on each core, so-called data-parallel computing. It is also very important to use the GPU memory hierarchy effectively in order to get good performance.
In Section 3.1 the NVIDIA GPU with its CUDA programming model is presented as an influential example of GPU architecture, followed by the portable OpenCL parallel programming model in Section 3.2.
NVIDIA GPU CUDA -Compute Unified
Device Architecture
An important concept in NVIDIA CUDA (Computer Unified Device Architecture) for GPU programming is the distinction between host and device. The host is what executes normal programs, and the device works as a coprocessor to the host which runs CUDA threads by instruction from the host. This typically means that a CPU is the host and a GPU is the device, but it is also possible to debug CUDA programs by using the CPU as both host and device. The host and the device are assumed to have their own separate address spaces, the host memory and the device memory. The host can use the CUDA runtime API to control the device, for example to allocate memory on the device and to transfer memory to and from the device. The building block of the NVIDIA CUDA hardware architecture is the Streaming Multiprocessor (SM). In the NVIDIA Fermi-Tesla M2050 GPU, each SM contains 32 Scalar Processors (SPs). The entire GPU has 14 such SMs totaling to 448 SPs, as well as some offchip DRAM memory, see Figure 1 . This gives a scalable architecture where the performance of the GPU can be varied by having more or fewer SMs.
To be able to take advantage of this architecture a program meant to run on the GPU, known as a kernel, needs to be massively multi-threaded. A kernel is just a C-function meant to execute on the GPU. When a kernel is executed on the GPU it is divided into thread blocks, where each thread block contains an equal number of threads. These thread blocks are automatically distributed among the SMs, so a programmer need not consider the number of SMs a certain GPU has. All threads execute one common instruction at a time. If any threads take divergent execution paths, then each of these paths will be executed separately, and the threads will then converge again when all paths have been executed. This means that some SPs will be idle if the thread executions diverge. It is thus important that all threads agree on an execution path for optimal performance.
This architecture is similar to the Single Instruction, Multiple Data (SIMD) architecture that vector processors use, and that most modern general-purpose CPUs have limited capabilities for too. NVIDIA call this architecture Single Instruction, Multiple Thread (SIMT) instead, the difference being that each thread can execute independently, although at the cost of reduced performance. It is also possible to regard each SM as a separate processor, which enables Multiple Instructions, Multiple Data (MIMD) parallelism. Using only MIMD parallelism will not make it possible to take full advantage of a GPU's power, since each SM is a SIMD processor. To summarize:
 Streaming Multiprocessors (SM) can work with different code, performing different operations with entirely different data (MIMD execution, Multiple Instruction Multiple Data).  All Scalar processors (SP) in one streaming multiprocessor execute the same instruction at the same time but work on different data (SIMT/SIMD execution, Single Instruction Multiple Data).
NVIDIA GPU Memory Hierarchy
As can be seen in Figure 1 there are several different types of memory in the CUDA hardware architecture. At the lowest level each SP has a set of registers, the number depending on the GPU's capabilities. These registers are shared between all threads allocated to a SM, so the number of thread blocks that a SM can have active at the same time is limited by the register usage of each thread. Accessing a register typically requires no extra clock cycles per instruction, except for some special cases where delays may occur. Besides the registers there is also the shared (local) memory, which is shared by all SPs in a SM. The shared memory is implemented as fast on-chip memory, and accessing the shared memory is generally as fast as accessing a register. Since the shared memory is accessible to all threads in a block it allows the threads to cooperate efficiently by giving them fast access to the same data.
Most of the GPU memory is off-chip Dynamic Random Access Memory (DRAM). The amount of offchip memory on modern graphics cards range from several hundred megabytes to few gigabytes. The DRAM memory is much slower than the on-chip memories, and is also the only memory that is accessible to the host CPU, e.g. through DMA transfers. To summarize:
 Each scalar processor (SP) has a set of fast registers.
(private memory)  Each streaming multiprocessor (SM) has a small local shared memory (48KB on Tesla M2050 ) with relatively fast access.  Each GPU device has a slower off-chip DRAM (2GB on Tesla M2050) which is accessible from all streaming multiprocessors and externally e.g. from the CPU with DMA transfers.
OpenCL -the Open Computing Language
OpenCL [14] is the first open, free parallel computing standard for cross-platform parallel programming of modern processors including GPUs. The OpenCL programming language is based on C99 with some extensions for parallel execution management. By using OpenCL it is possible to write parallel algorithms that can be easily ported between multiple devices with minimal or no changes to the source code. The OpenCL framework consists of the OpenCL programming language, API, libraries, and a runtime system to support software development. The framework can be divided into a hierarchy of models: Platform Model, Memory model, Execution model, and Programming model. The memory regions can be accessed in the following way:
Memory Regions Access to Memory Constant Memory
All work-items in all work-groups Local Memory
All work-items in a work-group Private Memory
Private to a work-item Global Memory
All work-items in all work-groups
OpenCL Execution Model
The execution of an OpenCL program consists of two parts, the host program which executes on the host and the parallel OpenCL program, i.e., a collection of kernels (also called kernel functions), which execute on the OpenCL device. The host program manages the execution of the OpenCL program. Kernels are executed simultaneously by all threads specified for the kernel execution. The number and mapping of threads to Computing Units of the OpenCL device is handled by the host program.
Each thread executing an instance of a kernel is called a work-item. Each thread or work item has unique id to help identify it. Work items can have additional id fields depending on the arrangement specified by the host program.
Work-items can be arranged into work-groups. Each work-group has a unique ID. Work-items are assigned a unique local ID within a work-group so that a single work-item can be uniquely identified by its global ID or by a combination of its local ID and work-group ID. The work-items in a given work-group execute concurrently on the processing elements of a single compute unit as depicted in Figure 4 .
Several programming models can be mapped onto this execution model. OpenCL explicitly supports two of these models: primarily the data parallel programming model, but also the task parallel programming model
ParModelica: Extending Modelica for Explicit Algorithmic Parallel Programming
As mentioned in the introduction, the focus of the current work is an extension (ParModelica) of the algorithmic subset of Modelica for efficient explicit parallel programming on highly data-parallel SPMD (Single Program Multiple Data) architectures. The current ParModelica implementation generates OpenCL [14] code for parallel algorithms. OpenCL was selected instead of CUDA [15] because of its portability between several multi-core platforms. Generating OpenCL code ensures that simulations can be run with parallel support on OpenCL enabled Graphics and Central Processor Units (GPUs and CPUs). This includes many multicore CPUs from [19] and Advanced Micro Devices (AMD) [18] as well as a range of GPUs from NVIDIA [17] and AMD [18] . As mentioned earlier most previous work regarding parallel execution support in the OpenModelica compiler has been focused on automatic parallelization where the burden of finding and analyzing parallelism has been put on the compiler. In this work, however, we have decided to leave this responsibility to the end user programmer. The compiler provides additional high level language constructs needed for explicitly stating parallelism in the algorithmic part of the modeling language. These, among others, include parallel variables, parallel functions, kernel functions and paral-lel for loops indicated by the parfor keyword. There are also some target language specific constructs and functions (in this case related to OpenCL).
Parallel Variables
OpenCL code can be executed on a host CPU as well as on GPUs whereas CUDA code executes only on GPUs. Since the OpenCL and CUDA enabled GPUs use their own local (different from CPU) memory for execution, all necessary data should be copied to the specific device's memory. Parallel variables are allocated on the specific device memory instead of the host CPU. An example is shown below: The first two matrices A and B are allocated in normal host memory. The next two matrices pA and pB are allocated on the global memory space of the OpenCL device to be used for execution. These global variables can be initialized from normal or host variables. The last array pS is allocated in the local memory space of each processor on the OpenCL device. These variables are shared between threads in a single work-group and cannot be initialized from hast variables.
Copying of data between the host memory and the device memory used for parallel execution is as simple as assigning the variables to each other. The compiler and the runtime system handle the details of the operation. The assignments below are all valid in the function given above  Normal assignment -A := B  Copy from host memory to parallel execution device memory -pA := A  Copy from parallel execution device memory to host memory -B := pB  Copy from device memory to other device memory -pA := pB Modelica parallel arrays are passed to functions only by reference. This is done to reduce the rather expensive copy operations.
Parallel Functions
ParModelica parallel functions correspond to OpenCL functions defined in kernel files or to CUDA device functions. These are functions available for distributed (parallel) independent execution in each thread executing on the parallel device. For example, if a parallel array has been distributed with one element in each thread, a parallel function may operate locally in parallel on each element. However, unlike kernel functions, parallel functions cannot be called from serial code in normal Modelica functions on the host computer just as parallel OpenCL functions are not allowed to be called from serial C code on the host. Parallel functions have the following constraints, primarily since they are assumed to be called within a parallel context in workitems:
 Parallel function bodies may not contain parforloops. The reason is that the kernel containing the parallel functions is already distributed on each thread.  Explicitly declared parallel variables are not allowed since execution is already taking place on the parallel device.  All memory allocation will be on the parallel device's memory.  Nested parallelism as in NestStepModelica [10] is not supported by this implementation.  Called functions must be parallel functions or supported built-in functions since execution is on the parallel device. 
Kernel Functions
ParModelica kernel functions correspond to OpenCL kernel functions [14] or CUDA global functions [16] . They are simply functions compiled to execute on an OpenCL parallel device, typically a GPU. ParModelica kernel functions are allowed to have several return-or output variables unlike their OpenCL or CUDA counterparts. They can also allocate memory in the global address space. Kernel functions can be called from serial host code, and are executed by each thread in the 
Parallel For Loop: parfor
The iterations of a ParModelica parfor-loop are executed without any specific order in parallel and independently by multiple threads. The iterations of a parfor-loop are equally distributed among available processing units. If the range of the iteration is smaller than or equal to the number of threads the parallel device supports, each iteration will be done by a separate thread. If the number of iterations is larger than the number of threads available, some threads might perform more than one iteration. In future enhancements parfor will be given the extra feature of specifying the desired number of threads explicitly instead of automatically launching threads as described above. An example of using the parfor-loop is shown below: 
Executing User-written OpenCL Code from ParModelica.
There are also some additional ParModelica features available for directly compiling and executing userwritten OpenCL code:
 oclbuild(String) takes a name of an OpenCL source file and builds it. It returns an OpenCL program object which can be used later.
 oclkernel(oclprogram, String) takes a previously built OpenCL program and create the kernel specified by the second argument. It returns an OpenCL kernel object which can be used later.
 oclsetargs(oclkernel,...) takes a previously created kernel object variable and a variable number of arguments and sets each argument to its corresponding one in the kernel definition.
 oclexecute(oclkernel) executes the specified kernel.
All of the above operations are synchronous in the OpenCL jargon. They will return only when the specified operation is completed. Further functionality is planned to be added to these functions to provide better control over execution.
Synchronization and Thread Management
All OpenCL work-item functions [20] are available in ParModelica. They perform the same operations and have the "same" types and number of arguments. However, there are two main differences:
 Thread/work-item index ids start from 1 in ParModelica, whereas the OpenCL C implementation counts from 0.  Array dimensions start from 1 in Modelica and from 0 in OpenCL and C.
For example oclGetGlobalId(1) call in the above arrayElemWiseMultiply will return the integer ID of a work-item or thread in the first dimension of a work group. The first thread gets an ID of 1. The OpenCL C call for the same operation would be ocl_get_global_id(0) with the first thread obtaining an ID of 0.
In addition to the above features, special built-in functions for building user written OpenCL code directly from source code, creating a kernel, setting arguments to kernel and execution of kernels are also available. In addition parallel versions of some built-in algorithm functions are also available.
Benchmarking and Evaluation
To be able to evaluate the relative performance and behavior of the new language extensions described in Section 4, performing systematic benchmarking on a set of appropriate Modelica models is required. For this purpose we have constructed a benchmark test suite containing some models that represent heavy and highperformance computation, relevant for simulation on parallel architectures.
The MPAR Benchmark Suite
The MPAR benchmark test suite contains seven different algorithms from well-known benchmark applications such as the LINear equations software PACKage (LINPACK) [21] , and Heat Conduction [23] . These benchmarks have been collected and implemented as algorithmic time-independent Modelica models.
The algorithms implemented in this suite involve rather large computations and impose well defined workloads on the OpenModelica compiler and the run-time system. Moreover, they include different kinds of forloops and function calls which provide parallelism for domain and task decomposition. For space reasons we have provided results for only three models here.
Time measurements have been performed of both sequential and parallel implementations of three models: Matrix Multiplication, Eigen value computation, and Stationary Heat Conduction, on both CPU and GPU architectures. For executing sequential codes generated by the standard sequential OpenModelica compiler we have used the Intel Xeon E5520 CPU [24] which has 16 cores, each with 2.27 GHz clock frequency. For executing generated code by our new OpenCL based parallel code generator, we have used the same CPU as well as the NVIDIA Fermi-Tesla M2050 GPU [25] .
Measurements
In this section we present the result of measurements for simulating three models from the implemented benchmark suite. On each hardware configuration all simulations are performed five times with start time 0.0, stop time of 0.2 seconds and 0.2 seconds time step, measuring the average simulation time using the clock_gettime() function from the C standard library. This function is called once when the simulation loop starts and once when the simulation loop finishes. The difference between the returned values gives the simulation time.
All benchmarks have been simulated on both the Intel Xeon E5520 CPU (16 cores) and the NVIDIA Fermi-Tesla M2050 GPU (448 cores).
Simulation Results
The Matrix Multiplication model (Appendix A) produces an M×K matrix C from multiplying an M×N matrix A by an N×K matrix B. This model presents a very large level of data-parallelism for which a considerable speedup has been achieved as a result of parallel simulation of this model on parallel platforms. The simulation results are illustrated in Figure 5 and Figure 6 . The obtained speedup of matrix multiplication using kernel functions is as follows compared to the sequential algorithm on Intel Xeon E5520 CPU:
 Intel 16-core CPU -speedup 26  NVIDIA 448-core GPU -speedup 115 The measured matrix multiplication model simulation times can be found in Figure 6 . The measured simulation times for the Eigen-value model are shown in Figure 8 . The third benchmark model computes stationary heat conduction, with the following speedups:
 Intel 16-core CPU -speedup 7  NVIDIA 448-core GPU -speedup 22 The measured simulation times for the stationary heat conduction model are shown in Figure 10 . According to the results of our measurements illustrated in Figure 5 , Figure 7 , and Figure 9 , absolute speedups of 114, 48, and 22 respectively were achieved when running generated ParModelica OpenCL code on the Fermi-Tesla M2050 GPU compared to serial code on the Intel Xeon E5520 CPU with the largest data sizes. It should be noted that when the problem size is not very large the sequential execution has better performance than the parallel execution. This is not surprising since for executing even a simple code on OpenCL devices it is required to create an OpenCL context within those devices, allocate OpenCL memory objects, transfer input data from host to those memory objects, perform computations, and finally transfer back the result to the host. Consequently, performing all these operations normally takes more time compared to the sequential execution when the problem size is small.
It can also be seen that, as the sizes of the models increase, the simulations get better relative performance on the GPU compared to multi-core CPU. Thus, to fully utilize the power of parallelism using GPUs it is required to have large regular data structures which can be operated on simultaneously by being decomposed to all blocks and threads available on GPU. Otherwise, executing parallel codes on a multi-core CPU would be a better choice than a GPU to achieve more efficiency and speedup.
Guidelines for Using the New Parallel Language Constructs
The most important task in all approaches regarding parallel code generation is to provide an appropriate way for analyzing and finding parallelism in sequential codes. In automatic parallelization approaches, the whole burden of this task is on the compiler and tool developer. However, in explicit parallelization approaches as in this paper, it is the responsibility of the modeler to analyze the source code and define which parts of the code are more appropriate to be explicitly parallelized. This requires a good understanding of the concepts of parallelism to avoid inefficient and incorrect generated code. In addition, it is necessary to know the constraints and limitations involved with using explicit parallel language constructs to avoid compile time errors. Therefore we give some advice on how to use the ParModelica language extensions to parallelize Modelica models efficiently:
 Try to declare parallel variables as well as copy assignments among normal and parallel variables as less as possible since the costs of data transfers from host to devices and vice versa are very expensive.  In order to minimize the number of parallel variables as well as data transfers between host and devices, it is better not to convert forloops with few iterations over simple operations to parallel for-loops (parfor-loops).  It is not always useful to have parallel variables and parfor-loops in the body of a normal for-loop which has many iterations. Especially in cases where there are many copy assignments among normal and parallel variables.  Although it is possible to declare parallel variables and also parfor-loops in a function, there are no advantages when there are many calls to the function (especially in the body of a big for-loop). This will increase the number of memory allocations for parallel variables as well as the number of expensive copies required to transfer data between host and devices.  Do not directly convert a for-loop to a parfor-loop when the result of each iteration depends on other iterations. In this case, although the compiler will correctly generate parallel code for the loop, the result of the computation may be incorrect.  Use a parfor-loop in situations where the loop has many independent iterations and each iteration takes a long time to be completed.  Try to parallelize models using kernel functions as much as possible rather than using parfor-loops. This will enable you to explicitly specify the desired number of threads and work-groups to get the best performance.  If the global work size (total number of threads to be run in parallel) and the local work size (total number of threads in each work-group) need to be specified explicitly, then the following points should be considered. First, the work-group size (local size) should not be zero, and also it should not exceed the maximum work-group size supported by the parallel device. Second, the local size should be less or equal than the global-size. Third, the global size should be evenly divisible by the local size.  The current implementation of OpenCL does not support recursive functions; therefore it is not possible to declare a recursive function as a parallel function.
Conclusions
New multi-core CPU and GPU architectures promise high computational power at a low cost if suitable computational algorithms can be developed. The OpenCL C-based parallel programming model provides a way of writing portable parallel algorithms that perform well on a number of multi-core architectures. However, the OpenCL programming model is rather low-level and error-prone to use and intended for parallel programming specialists. This paper presents the ParModelica algorithmic language extension to the high-level Modelica modeling language together with a prototype implementation in the OpenModelica compiler. This makes it possible for the Modelica modeler to directly write efficient parallel algorithms in Modelica which are automatically compiled to efficient low-level OpenCL code. A benchmark suite called MPAR has been developed to evaluate the prototype. Good speedups have been obtained for large problem sizes of matrix multiplication, Eigen value computation, and stationary heat condition.
Future work includes integration of the ParModelica explicit parallel programming approach with automatic and semi-automatic approaches for compilation of equation-based Modelica models to parallel code. Autotuning could be applied to further increase the performance and automatically adapt it to varying problem configurations. Some of the ParModelica code needed to specify kernel functions could be automatically generated.
