Abstract-Large, complex scientific and engineering application code have a significant investment in computational kernels to implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Trilinos-Kokkos array programming model provides librarybased approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) there exists one or more manycore compute devices each with its own memory space, (2) data parallel kernels are executed via parallel for and parallel reduce operations, and (3) kernels operate on multidimensional arrays. Kernel execution performance is, especially for NVIDIA R GPGPU devices, extremely dependent on data access patterns. An optimal data access pattern can be different for different manycore devices -potentially leading to different implementations of computational kernels specialized for different devices. The Trilinos-Kokkos programming model support performance-portable kernels by separating data access patterns from computational kernels through a multidimensional array API. Through this API device-specific mappings of multiindices to device memory are introduced into a computational kernel through compile-time polymorphism; i.e., without modification of the kernel.
I. INTRODUCTION
Manycore compute devices provide a significant potential gain in computational performance with respect to both runtime and energy consumption. Porting large, complex scientific and engineering high performance computing (HPC) applications to these devices is challenging in that it introduces an additional layer (or two) of parallelism, it can be difficult to obtain good performance on these devices, and there is a diversity of device-specific programming models. Many projects have successfully addressed these challenges by writing distinct versions of their codes that are specialized for particular compute devices (e.g., CUDA TM [1] ). However, this approach incurs the cost of developing, verifying, and maintaining a specialized code base for each class of compute device. For large, complex scientific and engineering applications this may be an unacceptable cost.
The Trilinos-Kokkos array programming model provides library-based approach to implement computational kernels that are performance-portable to manycore and GPGPU accelerator devices. This approach uses C++ template metaprogramming [2] , as opposed to a new or extended language, for compile-time specialization of kernels to the target manycore device. The data parallel programming model and API focusing on two responsibilities: (1) managing multidimensional array data on the manycore device and (2) executing parallel for and parallel reduce operations on that data.
This programming model is similar to the established Thrust library [3] in that it uses C++ template metaprogramming for manycore device portability, it provides parallel for and parallel reduce operations, and it manage data on the manycore device. The Trilinos-Kokkos array programming model is unique in that it (1) provides multidimensional array data structures and indexing semantics and (2) parameterizes data structures so that computational kernels may be executed on multiple devices within the same application. Aggregation of data via multidimensional arrays is "natural" for engineering and scientific computations; as is evident by decades of FORTRAN use in this computational domain. In contrast, when using the Thrust library computations must use "zip_iterator" objects to bind individually declared one dimensional vectors into an aggregated data structure.
Performance-portability includes source code portability of a kernel implementation and performance that is commensurate with a device-specific implementation of that kernel. Memory access is the dominant constraint on performance; and memory access patterns dominate memory access performance on NVIDIA R devices. As such the Kokkos multidimensional array programming model uses compile-time polymorphism (i.e., C++ template metaprogramming) to insert device-optimal memory access patterns into computational kernels without modification of the kernel's source code.
II. PROGRAMMING MODEL
The Kokkos multidimensional array programming model is data parallel in that a computational kernel can be applied in parallel to members of a partitioned data set. This programming model consists of the following conceptual components.
1) compute device with many computational threads and shared memory, 2) multidimensional array, 3) partitioning and mapping of multidimensional arrays onto the memory space of a compute device, and 4) application of data parallel computational kernels to partitioned multidimensional arrays.
A. Manycore Compute Device
A manycore compute device provides many threads of execution and owns memory that is shared by those computational threads. Computations performed by the compute device only access and update data which are resident in the compute device's shared memory. A heterogeneous parallel application utilizes both distributed-memory process parallelism among a network of compute nodes and sharedmemory thread parallelism within a manycore compute device. In such an application it is assumed that a distributedmemory parallel process; e.g., the process associated with a message passing interface (MPI) rank, has exclusive use of at most one manycore compute device. This assumption is made to avoid the complexity associated with managing multiple compute devices within the same process. However, the abstraction for a single manycore compute device can aggregate multiple hardware devices into a single, logical device.
B. Multidimensional Array
Multidimensional arrays are intrinsic to programming models for scientific and engineering computations. For example, the following two statements declare the same double precision multidimensional array X in the FORTRAN and C languages' syntax. 
Definition 2: The rank of a multi-index is the number of indices; e.g., (1, 3, 5 ) is a multi-index of rank 3 and (7, 3, 5, 1) is a multi-index rank 4.
Definition 3: A Kokkos multi-index space I of rank R is a Cartesian product of integer ranges
The abbreviated notation of
when there is no ambiguity between denoting a multi-index versus a multi-index space.
Definition 4: The cardinality of a multi-index space I, denoted by #I, is
2) a homogeneous collection of #X I data members, and 3) bijective map between the multi-index space and the data members; X map : X I ↔ {data}.
C. Multidimensional Array Map
There are many valid bijective maps between a multiindex space and the collection of data members. Typically these data members reside in a contiguous span of memory on a compute device. The map for such a multidimensional array X can be expressed by a base location in memory and a bijective function between the multi-index space and an offset in the range [0..#X I ). For example, the FORTRAN and C language multidimensional array index spaces and offset-maps are as follows. space:
D. Data Parallel Work Partitioning
A Kokkos multivector or multidimensional array are partitioned among the threads of a compute device. Each thread then applies a given computational kernel to that thread's assigned partition of the array. Multivectors are partitioned along their length, under the assumption that operations applied to each vector are data parallel and that the length of the multivector is greater than the count of vectors. Multidimensional arrays are partitioned along exactly one dimension of the multi-index space. For Kokkos the leftmost dimension was chosen for parallel partitioning by a consensus of computational kernel developers participating in a Kokkos software design review; however, it would have been equally valid to have chosen any other dimension.
The partitioning of a multidimensional array is defined by the partitioning of its multi-index space
where the left-most N P dimension is partitioned into N P "atomic" units of parallel work. When a thread applies the computational kernel to a unit of work, denoted by i P ∈ [0..N P ), the kernel must
• only update array data members that are associated with that index (i P , * , * , · · · ) and • not query array data members that are potentially updated by another thread applying the kernel to a different unit of work.
E. Data Parallel Computational Kernels
Computational kernels are currently applied to parallel partitioned work in two ways: parallel_for and parallel_reduce. A parallel_for application is trivially parallel in that the computational kernel's work is fully disjoint. In a parallel_reduce application each application of the computational kernel generates data that must be reduced among all work items; e.g., an inner product generates N P values which must be summed to a single value.
Definition 6: A parallel_for kernel is a function that inputs a collection of parameters and data parallel partitioned multidimensional arrays and outputs a corresponding collection of partitioned arrays.
Definition 7:
A parallel_reduce kernel is a function that inputs a collection of parameters and data parallel partitioned arrays and outputs a collection of parameters and data parallel partitioned arrays. Each application of a parallel_reduce kernel to the i P unit of parallel work generates a corresponding contribution to the output parameters. These contributions are reduced by a mathematically commutative and associative parameter reduction function f Θ . Note that an implementation f Θ may be non-associative due to round-off in its floating point operations.
F. Device Polymorphic Multidimensional Array Maps
The key concept for the Kokkos multidimensional array programming model is the use of device polymorphic multidimensional array maps. Many valid multidimensional array maps may exist; however, for a domain of computational kernels a particular map may yield the best memory access performance for a particular compute device. For example, an NVIDIA R device must have a multidimensional array map that results in a coalesced global memory access pattern for parallel_for or parallel_reduce kernels. Device polymorphism is implemented in the Kokkos multidimensional array programming model through C++ template meta programming, as opposed to C++ virtual interfaces, to eliminate runtime overhead in the API. This implementation strategy allows computational kernels to have performanceportable implementations where the best performing multidimensional array map for a given device is transparently compiled into computational kernels when the kernels. This implementation strategy requires that multidimensional array maps and computational kernel execution conform to the interface patterns described in the following sections.
III. MULTIDIMENSIONAL ARRAY API
The multidimensional array API given in Figure 1 composes (1) a runtime defined multi-index space, (2) a collection of data members of a compile-time defined simple mathematical type, and (3) a compile-time device-polymorphic map from the multi-index space to data members. The member data type is restricted to the simple mathematical types of the computational languages (i.e., C++ and CUDA) so that data members can be simply and optimally mapped onto compute devices such as NVIDIA GPGPU. Each device has a default map which may be overridden to support investigation of alternative maps to improve memory access performance.
A. View and Shared Ownership Semantics
The MDArrayView C++ class API (Figure 1 ) is a multidimensional array view to member data on the manycore device. In view semantics there may exist multiple views to the same member data such that all views share ownership of that member data. Multiple views to the same member data are created via the copy constructor and assignment operator; these functions perform "shallow" copies of the input view. This in contrast to container semantics where a container exclusively owns its member data and the copy constructor and assignment operator perform a "deep" copy by allocating new member data and copying all members of the input container. Figure 1 . The Kokkos multidimensional array API composes a member data type, multi-index space, multi-index map to device memory, and view semantics.
In large complex application codes arrays are allocated on the compute device by "driver" functions, passed among driver functions, passed from driver functions to computational kernels, passed from one computational kernel to another, and eventually deallocated to reclaim memory on the compute device. Managing the complexity of numerous references to many allocated arrays requires a high degree of software design and implementation discipline to prevent memory management errors of (1) deallocation of a still used array or (2) neglecting to deallocate an array no longer in use. Thus there is a significant risk that a team of application developers will lose track of when to, or not to, deallocate a multidimensional array, and as a result will introduce one of the two memory management errors. This risk is mitigated by using view or shared ownership semantics (e.g., see shared_ptr in [4] ) for allocated Kokkos arrays. Under the shared ownership semantics multiple view to the same allocated data may exist and the last view to be cleared (see Figure 1) deallocates the allocated data.
B. Device versus Host Allocated Data
The create_mdarray function is called by the application on the host process to allocate array member data in the memory space of a designated device. The DeviceType may be a manycore device (e.g., NVIDIA R device) or the host "device"; i.e., the host process running the application code. Views to allocated member data will exist on both the host process and within kernels running on the manycore device. However, member data allocated on one device may not be directly accessible to another device; e.g., data allocated on the host is not directly accessible to an NVIDIA R device and vice-versa.
Kokkos API functions in Figure 1 are designated as available only on the device (DEVICE_FUNCTION), available on both the device and host (DEVICE_AND_HOST_FUNCTION), or available only on the host (no macro designation). Functions which describe the shape of an array (rank and dimensions) are available on both the device and host. However, functions which provide access to member data are only available on the device.
The deep_copy function is called on the host process to copy array member data between arrays allocated on the manycore device and host. These arrays must have conformal multi-index spaces; i.e., their rank and dimensions must be equal. However, the source and destination arrays may have different maps from the conformal multi-index space to the data members. Thus the deep_copy both copies member data between memory spaces and potentially permutes member data through the composition of the two array's multi-index space maps.
C. Illustrative Test Function
Array creation, view "shallow" copy, and "deep" copy semantics are illustrated in Figure 2 . In this illustrative test function an array's member data is filled on the host, copied to the device, copied between arrays on the device, and then copied back to a different array on the host. As an illustration of view semantics the view host_z is assigned to also view the array allocated for view host_y. When view host_y is destroyed the member data originally allocated for it is not deallocated because the host_z view to that data still exists. 
D. Multivector and Value Views
The MDArrayView API presented in Figure 1 supports multidimensional arrays with runtime-defined multi-index spaces of rank from 1 to 8, and device-specific mappings of their multi-index space to device-resident data. The Kokkos array API provides two additional C++ interface classes for device-resident data: ValueView and MultiVectorView. The MultiVectorView is restricted form of a multidimensional array -a collection of one dimensional vectors all of which are the same length. The multivector is at most rank-two and has a single, defined multi-index space mapping; thus a multivector has a simpler abstraction and API. The ValueView is even simpler -it is by definition rank-zero. Value views are used to create and manage persistent parameters on the multicore / manycore devices; i.e., parameters that are shared by all of the parallel threads. For example, the output parameters of a parallel reduce kernel (recall Definition 7) can be maintained in the device's memory space as a value view.
Both MultiVectorView and ValueView APIs use the same shared-ownership view semantics as the MDArrayView. This API includes corresponding C++ constructors, assignment operator, destructor, create_multivector, create_value, and deep_copy functions as were defined for the MDArrayView in Figure 1 . For brevity the simpler MultiVectorView and ValueView APIs have not been included here.
IV. COMPUTATIONAL KERNEL FUNCTOR API
Computational kernels conform to functor semantics and APIs for either parallel_for or parallel_reduce operations. A functor is the composition of a computation and the data, or views of data, to which the computation is applied (recall Definitions 6 and 7). Functor semantics are to many programming models; for example, the C++ Standard Template Library (STL) algorithms [5] , Intel Threading Building Blocks [6] , and Thrust [3] use functors.
In the Trilinos-Kokkos programming model a functor is created on the host process, copied to the device for execution, and then run thread-parallel on the device. In this progamming model a functor is a C++ class that (1) identifies a target manycore device via a typedef device_type statement and (2) provide a computational function via the operator() class member function. These functor API requirements are illustrated for a parallel_for and parallel_reduce functor in Figures 3 and 4 . A parallel_reduce functor has additional API requirements to support the thread-parallel reduction operation f Θ in Definition 7. For device-portability a functor must be (1) templated with respect to the device and (2) attribute the operator() with KOKKOS_MACRO_DEVICE_FUNCTION macro. The functor's member multidimensional arrays must use the functor's device template argument. Instantiation of the functor's member array views causes a device-specific multi-index → data member map to be compiled into the functor, resulting in an optimal memory access pattern by the operator() ( parallel_for or parallel_reduce statements. Each call to the functor is passed a unique index iP in the range [0..nP]. For thread-safe parallel execution the functor's operator() computation (1) must only access the data members of the input and output arrays associated with the parallel work index iP and (2) do not assume a particular map from multi-indices to data members. For example, in Figure 3 the functor access appropriate parallel partitioned data members of the input and output arrays; i.e., X(iP, * , * ) and Y(iP, * ).
Parallel functor performance on manycore devices has numerous considerations which are the subject of ongoing research. One consideration illustrated in Figure 3 is to minimize global-memory accesses by referencing each data member of an array exactly once. In this example functor a temporary variable is used to accumulate the X(iP,i, * ) values so that the corresponding Y(iP,i) data member is updated exactly once. This "access global-memory only once" performance guideline is motivated by the large global memory access times of some manycore devices (e.g., NVIDIA R ), sharing of global-memory access bandwidth by cores, and small overhead associated with the multi-index map operation.
A reduction functor, as illustrated in Figure 4 , has the following additional API requirements to implement the f Θ reduction function.
• 
V. PERFORMANCE OF INITIAL IMPLEMENTATION
Initial implementations of the Kokkos array programming model include a "generic" Pthread implementation using the Trilinos-ThreadPool library [7] and a CUDA TM implementation using CUDA version 4.
Performance-portability of these initial implementation is evaluated through two test cases. These performanceevaluation test cases will be used to analyze and improve the runtime performance of the programming model, and the usability of the API. The first case applies the parallel_for to a kernel which computes the gradients of a linear hexahedral finite element's basis functions. Each call to the kernel loads an element's eight vertex coordinates from global memory (24 values), performs 318 floating point operations, and writes the eight basis function gradients to global memory (24 values). The coordinate and basis function gradients data are managed via rank-three MDArrayView entities with dimensions: number of elements × spatial dimension × number of vertices (or linear basis functions) per element.
The second case implements a Modified GramSchmidt orthogonalization algorithm through a sequence of parallel_for and parallel_reduce operations on simple "level one" basic linear algebra operations (i.e.; vector scaling, addition, and inner products). In this test case the orthogonalization algorithm is applied to a multivector of dimension N ×32. Multivector and scalar data used in this test case are managed in MultiVectorView and ValueView entities.
The two test cases are instantiated for double precision data and computations, and run on NVIDIA, Opteron, and Xeon multicore / manycore devices. For the Opteron and Xeon devices a the best performance was obtained by fully subscribing the devices with 24 Pthreads. Initial results for these test cases and devices are presented in Figure 5 and Figure 6 . These initial test cases have a significant difference that impacts their potential performance on the devices. The hexahedral gradient test case executes a single pass through the data with a single kernel. In contrast the modified GramSchmidt test case iterates through a sequence of parallel for and parallel reduce kernels which make multiple passes through the multivector. This allows the Opteron and Xeon devices to reuse cached data, enabling significantly better performance for modified Gram-Schmidt test case with 32 vectors of lengths up to approximately one hundred thousand elements ( Figure 6 ). Beyond this size cache reuse degrades and performance decreases with problem size. In addition to the three multicore / manycore device test cases, "native" CUDA versions of these test cases are implemented and run on the NVIDIA device. These handwritten CUDA test cases are used to compare performance and usability with the portable array API versions. Performance of the hand-written CUDA version of the hexahedral gradient test case is about 50% faster than the array-API version for a large number of elements ( Figure 5 ). A significant difference between the array-API and handwritten version is that the hand-written version eliminates all multi-index mapping calculations by leveraging a priori coding-time knowledge of array dimensions (N, 3, 8) and NVIDIA coalesced memory access strides. In contrast the multi-index mapping of the array-API version are currently limited to runtime knowledge of array dimensions, and selects the mapping computation at compile-time. An API enhancements to support compile-time array dimensions is being evaluated.
Performance differences between the array-API version and hand-written CUDA versions of the Modified GramSchmidt test case are small ( Figure 6 ). This result is expected as the test case does not use a multi-index map calculation when indexing into the multivectors. Performances differences for smaller multivector lengths (e.g., 1e3 to 1e5) are due to differences in the GPGPU reduction algorithms.
VI. CONCLUSION
Performance results for this initial implementation demonstrate the feasibility of the Kokkos performance-portable multidimensional array programming model. This is especially significant for an NVIDIA R GPGPU device where improper memory access patterns (i.e., non-coalesced memory access) will dramatically degrade performance. The Trilinos-Kokkos array performance-portable programming model provides a "classical" multidimensional array abstraction and API for computational kernels to organize and access computational data. However, the memory management abstraction uses non-traditional sharedownership view semantics, as opposed to exclusiveownership container semantics. View semantics are used to mitigate risks of memory management errors in large complex libraries and applications. View semantics are exclusively used in the programming model, as opposed to mixing container and view semantics, to simplify the programming model and API.
Trilinos-Kokkos is available in the public domain at http://trilinos.sandia.gov.
VII. RESEARCH & DEVELOPMENT IN PROGRESS
The Trilinos-Kokkos array programming model, API, and implementations are in the first year of research & development (R&D). Current R&D activities are focused on
• analyzing and improving performance for the current devices, • evaluating functor programmability and performance with sequences of more complex finite element kernels, and • demonstrating usability through mini-applications [8] . Planned R&D activities include
• implementations for new devices such as the Intel Knights Ferry [9] , • support for concurrent execution of multiple heterogeneous kernels within parallel for and parallel reduce operations, and • integration into applications and libraries to support performance-portable multicore / manycore parallelism.
