Abstract: Programming parallel applications for heterogeneous HPC platforms is much more straightforward when using the task-based programming paradigm. This simplicity arises because a runtime takes care of many activities usually carried out by the application developer, such as task mapping, load balancing, and memory management operations. In this paper, we present a visualization-based performance analysis methodology to investigate the CPU-GPU-Disk memory management of the StarPU runtime, a popular task-based middleware for HPC applications. We detail the design of novel graphical strategies that were fundamental to recognizing performance problems in four case studies. We first identify poor management of data handles when GPU memory is saturated, leading to low application performance. Our experiments using the dense tile-based Cholesky factorization show that our fix leads to performance gains of 66% and better scalability for larger input sizes. In the other three cases, we study scenarios where the main memory is insufficient to store all the application's data, forcing the runtime to store data out-of-core. Using our methodology, we pinpoint behavioral differences among schedulers and identify a crucial problem in the application code regarding initial block placement, which leads to poor performance.
I. INTRODUCTION
A challenge found in the High-Performance Computing (HPC) domain is the complexity of programming applications. The task-based programming paradigm presents numerous benefits, and many researchers believe this is currently the optimal approach to program for modern machines. The tasking-related extensions to OpenMP (in versions 4.0 and 4.5), and the upcoming 5.0 standard with even more features, confirm this trend. In general, a task-based approach is extremely efficient for load balancing and for intelligently using all the resources' computational power in heterogeneous platforms. It transfers to a runtime some activities that are usually carried out by programmers, such as mapping compute kernels (tasks) to resources, data management, and communication. Task-based applications use a Directed Acyclic Graph (DAG) of tasks as the main application structure to schedule them on resources, considering task dependencies and data transfers. Among many alternatives like Cilk [1], Xkaapi [2], and OmpSs [3], StarPU [4] is one example of a runtime using this paradigm. Its features include the use of distinct task implementations (CPU, GPU), different task schedulers, and automatic management of data transfers between resources.
The performance analysis of task-based parallel applications is complicated due to their inherently stochastic nature regarding variable task duration and dynamic scheduling. Different performance analysis methods and tools can help in this matter, including analytical modeling of the theoretical bounds of task-based applications [5] and application-runtime simulation, which allows reproducible performance studies in a fully controlled environment [6], [7]. StarPU can also collect execution traces that describe the behavior of the application, enabling other tools to provide information for the performance analysis. The information provided by the runtime can be used in the form of performance metrics (number of ready and submitted tasks, the GFlops rate, etc.), indications of poor behavior (e.g., absence of work in the DAG critical path), or visualization techniques (panels that illustrate the application and the runtime behavior over time). The visualization-based approach can combine all these investigation methods to facilitate the analysis with graphical elements. The StarVZ workflow [8] is an example of a visualization tool that leverages application/runtime traces. It employs consolidated data science tools, most notably R scripts, to create meaningful views that enable the identification of performance problems and testing of what-if scenarios.
Interleaving data transfers with computational tasks (data prefetching) is another technique that has a significant impact on performance [9]. The goal is to efficiently manage data transfers among different memory nodes of a platform: main (RAM), accelerator (GPUs), and out-of-core (hard drive) memories. Factors like the reduction of data transfers between heterogeneous devices and host, better use of cache, and smarter block allocation strategies play an essential role in performance. Simultaneously, many applications require an amount of memory greater than the available RAM. These applications require the use of out-of-core methods, generally because disk memory is much larger than main memory [10]. Correctly handling which data blocks stay in main or disk memory is a challenge. The complexity of evaluating these memory-aware methods motivates the design of visualization-based performance analysis techniques tailored explicitly for data transfers and general memory optimizations.
In this paper, we focus on the analysis of StarPU's memory management performance using trace visualization. Trace visualizations enable a general correlation among all factors that can impact the overall performance: the application algorithm, the runtime decisions, and memory utilization. The main contributions are the following. (a) We extend the StarVZ workflow by adding new memory-aware visual elements that help to detect performance issues in the StarPU runtime and the task-based application code. (b) StarPU is augmented with extra trace information about the memory management operations, such as new memory requests, additional attributes on memory blocks and actions, and data coherency states. (c) We demonstrate the effectiveness of our methodology with four scenarios that use the dense linear algebra solver Chameleon [11]. In the first case, we show how we identified a problem inside the StarPU software, and compare the application performance after our proposed correction patch. In the second case, we analyze the idle times when using out-of-core. In the third case, we offer an alternative method for the application to allocate blocks in out-of-core memory in a more efficient way. In the last case, we study the memory/application behavior between the DMDAS and DMDAR schedulers. These methods lead to a reduction of ≈66% in the execution time when using a heterogeneous platform composed of CPUs and GPUs. Although we use the methods on StarPU, they are general and extendable to other runtimes.
The paper is structured as follows. Section II provides basic concepts on the StarPU runtime system and the dense linear algebra Cholesky factorization as implemented by Chameleon. Section III presents related work on the visualization of memory management and task-based applications. We also discuss our approach against the state-of-the-art. Section IV presents the visual-based methodology to investigate the performance of memory operations in the StarPU runtime, employing a modern data science framework. Section V details the experiments conducted in four test cases. Section VI discusses the limitations of our strategy and Section VII concludes this paper with future work. The companion material of this work is publicly available at https://doi.org/10.5281/zenodo.2605464.
II. BACKGROUND CONCEPTS
We provide a general overview of the StarPU runtime and a detailed explanation of how the Chameleon project implements a dense tiled-based Cholesky factorization using a Directed Acyclic Graph (DAG) of tasks for heterogeneous platforms.
A. The StarPU runtime
The StarPU runtime uses the Sequential Task Flow (STF) model [12], where tasks are sequentially submitted during the application execution and are dynamically scheduled to workers. In such a model, there is no need to unroll the whole Directed Acyclic Graph (DAG) of tasks before starting task execution. StarPU tasks might have multiple implementations, one for each type of resource (such as x86 CPUs, CUDA GPUs, and OpenCL devices), and must register memory handles to identify the memory blocks on which they read and write data. Depending on resource availability and the heuristic, the scheduler dynamically chooses one of the task versions and executes it. StarPU employs different heuristics to allocate tasks to resources. Classical heuristics are LWS (local work stealing) and EAGER (centralized deque). More sophisticated schedulers consider additional information.
The DMDA (deque model data aware) scheduler, for example, uses estimated task completion time and data transfer time to make its decisions [9]. Another example is the DMDAR (deque model data aware ready) scheduler, which additionally considers memory handles already present on the workers.
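To make the idea of a data-aware decision concrete, the following is a minimal sketch, not StarPU's actual scheduler code. It assumes each worker exposes three hypothetical estimates (`ready_at`, `exec_time`, `transfer_time`, all in milliseconds, with made-up values) and that a transfer is simply added to the worker's completion time, which ignores any overlap with computation.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    ready_at: float       # hypothetical: time (ms) at which the worker becomes free
    exec_time: float      # hypothetical: estimated task duration on this worker (ms)
    transfer_time: float  # hypothetical: estimated time to bring missing data (ms)


def pick_worker(workers, data_aware=True):
    """Return the worker with the smallest estimated completion time."""
    def completion(w):
        cost = w.ready_at + w.exec_time
        if data_aware:
            # Simplification: the transfer is not overlapped with other work.
            cost += w.transfer_time
        return cost
    return min(workers, key=completion)


# Illustrative numbers only.
workers = [
    Worker("CPU0", ready_at=0.0, exec_time=40.0, transfer_time=0.0),
    Worker("GPU1", ready_at=10.0, exec_time=5.0, transfer_time=30.0),
    Worker("GPU2", ready_at=12.0, exec_time=5.0, transfer_time=2.0),
]
print(pick_worker(workers).name)
print(pick_worker(workers, data_aware=False).name)
```

With these numbers, ignoring transfers favors GPU1, while accounting for them favors GPU2, which already holds most of the required data.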
The runtime is also responsible for transferring data between resources and for controlling the presence and the coherence of the memory handles. StarPU creates one memory manager for each different type of memory. For example, there is one memory manager for the RAM associated with one NUMA node (shared by all CPU cores on that socket), one for each GPU, and so on. StarPU adopts the basic MSI protocol, with the states Modified/Owned, Shared, and Invalid, to manage the state of each memory handle on the different memories. At a given moment, each memory block can assume one of the three states on the memory managers [4]. When a task is scheduled, StarPU will internally create a memory request for each of the task's memory dependencies to the chosen resource. These requests are handled by the memory managers, which are responsible for allocating the block of data and issuing the data transfer. When tasks are scheduled well in advance, StarPU prefetches data, so the transfers get overlapped with computations of the ongoing tasks [13].
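The MSI bookkeeping described above can be sketched as a small state table per memory handle. This is a simplified illustration under our own assumptions, not StarPU code: the class `Handle` and its methods `acquire_read` and `acquire_write` are hypothetical names, and the sketch ignores the actual data transfers, allocation, and read-write accesses.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"  # the only valid copy (Modified/Owned in the text)
    SHARED = "S"    # one of several valid read-only copies
    INVALID = "I"   # no valid copy on this memory node


class Handle:
    """Hypothetical per-handle state table, one entry per memory node."""

    def __init__(self, nodes, home):
        self.state = {n: State.INVALID for n in nodes}
        self.state[home] = State.MODIFIED

    def acquire_read(self, node):
        if self.state[node] is State.INVALID:
            # A transfer from a valid copy would be issued here.
            for n, s in self.state.items():
                if s is State.MODIFIED:
                    self.state[n] = State.SHARED
            self.state[node] = State.SHARED

    def acquire_write(self, node):
        # A write invalidates every other copy.
        for n in self.state:
            self.state[n] = State.INVALID
        self.state[node] = State.MODIFIED


h = Handle(["RAM", "GPU1", "GPU2"], home="RAM")
h.acquire_read("GPU1")   # RAM and GPU1 now share the block
h.acquire_write("GPU2")  # GPU2 becomes the sole owner
print({n: s.value for n, s in h.state.items()})
```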
Furthermore, recent versions of StarPU support the use of out-of-core memory (disk, i.e., HDD or SSD) when RAM occupation becomes too high. The runtime employs a Least-Recently-Used (LRU) algorithm to determine which data blocks should be transferred to disk to make room for new allocations on RAM. Interleaving such data transfers with computation and respecting data dependencies on the critical path is fundamental to good performance.
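As an illustration of the LRU eviction policy mentioned above, the sketch below models RAM as a fixed number of block slots and moves the least recently used block to disk when a new block needs room. The class `RamCache`, its `touch` method, and the capacity expressed as a block count (at least one) are assumptions made for this example; they do not reflect StarPU's internal data structures.

```python
from collections import OrderedDict


class RamCache:
    """Hypothetical model of RAM holding at most `capacity` blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # least recently used block first
        self.on_disk = set()

    def touch(self, block):
        """Access a block, evicting LRU blocks to disk if RAM is full."""
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark as most recently used
            return
        self.on_disk.discard(block)  # read back from disk if it was evicted
        while len(self.blocks) >= self.capacity:
            victim, _ = self.blocks.popitem(last=False)
            self.on_disk.add(victim)
        self.blocks[block] = True


cache = RamCache(capacity=2)
for b in ["0x0", "0x1", "0x0", "0x2"]:
    cache.touch(b)
print(list(cache.blocks), cache.on_disk)
```

In this run, block 0x1 is the least recently used one when 0x2 arrives, so it is the block written to disk.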
B. The Chameleon Package
The Chameleon package [11] contains a series of dense linear algebra solvers implemented using the sequential task-based paradigm. From the set of available solvers, we adopt the task-based solver that implements the dense linear algebra Cholesky factorization on top of the StarPU runtime, because many HPC applications use it as a computing phase. The Cholesky factorization algorithm runs over a triangular matrix divided into blocks, using four different tasks: dpotrf (Cholesky Factorization), dtrsm (Triangular Matrix Equation Solver), dsyrk (Symmetric Rank-k Update) and dgemm (Matrix Multiplication), as shown in Figure 1a. The task-based Cholesky factorization divides the input matrix into tiles (blocks), associating each task with a block. The factorization essentially begins with tasks on blocks with lower coordinates and iteratively computes all matrix blocks for all coordinates. Figure 1b demonstrates the resulting DAG for a matrix divided into 25 blocks (N = 5). The Chameleon framework generates the full matrix to conduct numerical checks. Since in our case the solver is used independently of real application code, the Chameleon testing code includes an input generation task called plgsy to create double floating-point values for the matrix tiles.
[21] focused on MPI+OpenMP or MPI+CUDA programming models can also be employed, but they lack the fundamental ability to consider critical path analysis or task dependency delays as indicated by the DAG. More recently, StarVZ [8] has been proposed as an extensible R-based framework for the performance analysis of task-based HPC applications. It includes several visualization panels enriched with theoretical bounds and task-dependency delays that correlate observed performance with the DAG. Even if some of these tools provide unwavering DAG-oriented support, they generally lack a specific methodology to analyze the impact of different block allocation policies on application performance.
More recently, Ceballos et al. [22] propose TaskInsight, a tool to evaluate how data reuse among application tasks might affect application performance. They quantify the variability of task execution by measuring data-aware hardware counters (e.g., cache misses) of some tasks when another task scheduling is being carried out. Despite their focus on this kind of memory interference, they overlook the impact of the application DAG and the effects of data prefetching and possible data transfers between different types of devices that are fundamental in current multi-GPU platforms.
Miquel et al. [23] also study data transfer operations focusing on data reuse in task-based runtimes. The authors propose the Kernel Reuse Distance (KRD) metric, which measures the amount of data reuse on caches with different sizes. They consider the reuse of multiple cores that access the same levels of caches. The KRD metric is derived from data memory access traces and can be used to understand the quality of data reuse in the applications. Although this metric can be used to explain performance differences in some situations, more events could be collected from traces to provide a better view of the application memory.
Our approach provides a multi-level performance analysis of data management operations on a heterogeneous multi-GPU and multi-core platform. We combine a high-level view of the application DAG with the low-level runtime decisions, which guides us in identifying and fixing performance problems. Instead of only using low-level metrics and comparing them across multiple executions, we focus on understanding the behavior of representative executions. We also design visualization elements specifically for the performance analysis of memory transfers in a DAG-based runtime, enriching our perception of task-based applications running on heterogeneous platforms.
IV. MEMORY-AWARE VISUALIZATION PANELS
We present our methodology to investigate the memory manager behavior and memory block allocations on resources. StarPU's data management module is responsible for all actions involving the application's memory. Because the original StarPU tracing mechanism did not record events from the data management system, we propose a set of extensions to gather the information needed for our performance analysis. First, we add the memory block identification, along with extra information, to all events to allow correlations between runtime activities and to understand the decisions behind them. Second, we trace the memory's coherence update function to keep track of the whereabouts of a memory block during the execution. Third, we track all memory requests (prefetch, fetch, allocation, sync) carried out by the runtime. Trace capture is a feature already present in the StarPU runtime; we extend it with this new information.
Our memory-aware visualization panels are designed to leverage these extra behavioral data about memory activities. The presence of memory blocks on each memory manager is used to understand the general behavior of the application. In what follows, we detail our data-aware visualization strategies.
A. Enriched Memory-Aware Space/Time View
Employing Gantt charts to analyze parallel application behavior is very common. They show the behavior of observed entities (workers, threads, nodes) over time. We have adapted and enriched this kind of view to inspect the memory manager behavior, as shown in the example of Figure 2. On the Y-axis, the figure lists the different memory managers associated with different device memories: RAM, different accelerators (memory of GPU and OpenCL devices), and permanent storage in the case of out-of-core (OOC) disk memory. In this example, we have only three memory managers: RAM, GPU1, and GPU2. The plot presents the actions of each manager over time with colored rectangles tagged with block coordinates (e.g., for GPU2: 1×3, 0×2, and so on) from the application problem. The rectangles in this figure mainly represent different Allocating states carried out by those managers, except for the RAM manager, which had no registered behavior in the depicted 10ms interval. To the right of each manager line, the panel gives the percentage of time spent in the most recurring state, using the same color. For instance, the GPU2 manager spent 75.15% of this specific time interval in the Allocating state.
B. Block Residency in Memory Nodes
A given block coordinate of an HPC application (e.g., Cholesky factorization) may reside in multiple memory nodes during the execution. For example, there can be many copies of a given block if workers executing on different devices perform read operations only. This is due to the adoption of the MSI protocol by StarPU, where multiple memory nodes can have copies of the same memory block (see Section II for details). Figure 3 represents the location of a given memory block during the execution. Each of the five facets of the figure represents one memory block, with the coordinates 0×0, 0×1, 0×2, 0×3 and 0×4 of the input matrix. For each block, the X-axis is the execution time, discretized in time intervals of 20ms. This interval is sufficiently large for the visualization and small enough to show the evolution of the application behavior. At each time interval, the Y-axis shows the percentage of time that this block is on each memory node (color). For example, if a block is first owned by RAM for 18ms and then for 2ms by GPU2, the bar will be 90% blue and 10% yellow. Since each block can be shared and hence present on multiple memory nodes, the maximum residency percentage on the Y-axis may exceed 100%. Moreover, if the block resides in memory for only a portion of the interval, the percentage will be less than 100%.
With this new visualization, we can check a summarized evolution of data movement and of each resource's memory utilization. For example, Figure 3 shows that the memory block with coordinates 0×0 stayed in RAM throughout the execution, while other blocks remain in RAM only until ≈80ms of the execution. We can quickly spot anomalies by correlating the residency of block coordinates with the application phases. Very frequently in linear algebra, a block with lower coordinates is only used at the beginning of the execution, so it should be absent after the initialization phase (which would appear as 0% occupancy of that block after it is no longer needed).
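The residency percentages used by this panel can be computed with a simple binning of presence intervals. The sketch below is only an illustration in Python (the StarVZ workflow itself relies on R scripts); the function `residency` and the `(node, start_ms, end_ms)` tuple format are hypothetical and chosen for this example.

```python
from collections import defaultdict


def residency(intervals, bin_ms=20):
    """Compute per-bin residency percentages for one memory block.

    intervals: list of (node, start_ms, end_ms) presence intervals
               (hypothetical format). Returns {bin_index: {node: percent}}.
    """
    out = defaultdict(lambda: defaultdict(float))
    for node, start, end in intervals:
        b = int(start // bin_ms)
        while b * bin_ms < end:
            lo = max(start, b * bin_ms)      # clip the interval to the bin
            hi = min(end, (b + 1) * bin_ms)
            out[b][node] += 100.0 * (hi - lo) / bin_ms
            b += 1
    return {b: dict(nodes) for b, nodes in out.items()}


# The example from the text: 18ms in RAM, then 2ms on GPU2.
print(residency([("RAM", 0, 18), ("GPU2", 18, 20)]))
# A block shared by RAM and GPU1 exceeds 100% in total for that bin.
print(residency([("RAM", 0, 40), ("GPU1", 20, 40)]))
```

The second call shows why the Y-axis may exceed 100%: in the second 20ms bin the block is fully present on two memory nodes at once.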
C. Detailed Temporal behavior of Memory Blocks
The previous panel (see Figure 3) shows where a given block is located (on which memory node) throughout the execution. Figure 4, besides showing the memory block location, additionally depicts all the runtime and application task activities that affect the block behavior. Here, we employ the traditional Gantt chart as a basis for the visualization, where the X-axis is the time in milliseconds, and the Y-axis represents the different memory managers. There are two types of states, depicted as colored rectangles. The taller ones shown in the background represent the residency of the memory block on the managers: the red color expresses when a memory node is an owner, while the blue color indicates the block is shared among different managers. The inner rectangles represent the Cholesky tasks (dpotrf, dtrsm, dsyrk, dgemm, and dplgsy) that are executing and using that memory block from that memory manager. We augmented the representation with different events associated with the memory blocks on the respective manager and time. The circles (Allocation Request, Transfer Request) are either filled or unfilled, for fetch or prefetch operations, respectively. The arrows represent data transfers between two memory nodes, and their color encodes the kind of transfer (intra-node prefetch or fetch). Finally, two vertical lines indicate the correlation (last dependency and last job on the same worker) with a specific application task under study. Here, we highlight the task ID 90 (which is a dsyrk task). The green vertical line represents the end of the last dependency that releases task 90, and the yellow one represents the end of the last task executed on the same worker.
D. Block Coordinates Animation to track Allocation History
The application running on top of StarPU determines the data and the tasks that will be used by the runtime. Instead of only considering the utilization of resources, we want to correlate the algorithm and the runtime decisions. We are then interested in a view that takes into account the coordinates of the blocks in the original data, illustrating which task is using each block, and their state on the managers (owned, private or shared). Figure 5 depicts a snapshot of all memory block locations and the running tasks at a specific time.
The visualization has three facets, one for each of the memory managers (RAM, GPU1, GPU2). Each manager has a matrix with the block coordinates on the X and Y axes. On this matrix, each colored square represents one memory block. The colored inner squares (write mode access) or circles (read) inside those blocks represent application tasks. With this visualization, it is easy to confirm how the memory data flow correlates with the block positions. In Figure 5, for example, we can see that only two blocks are in RAM and that both GPUs share the first row. Moreover, there is a dpotrf task executing over block 1×1 in RAM and a dgemm task on each GPU. On GPU1, the dgemm task has write access to block 1×2 and read access to blocks 0×1 and 0×2. By stacking consecutive snapshots, we can create an animation that shows the residency of memory blocks over time. This feature is particularly useful to understand the algorithm behavior and the data allocation policy. As of now, StarPU developers are integrating such a view for general use.
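A snapshot such as the one described above can be obtained by replaying the coherence events up to a chosen instant. The following is a minimal sketch under our own assumptions: the event tuple `(time_ms, manager, block, state)` is a hypothetical format, with `state` set to `None` when the block leaves the manager, and it does not correspond to the actual StarPU trace layout.

```python
def snapshot(events, t):
    """Replay events up to time t and return {manager: {block: state}}.

    events: list of (time_ms, manager, block, state) tuples (hypothetical
            format); state is "owned", "shared", or None when the block
            is no longer present on that manager.
    """
    view = {}
    # Sort on the timestamp only, so events at the same time keep their order.
    for time, manager, block, state in sorted(events, key=lambda e: e[0]):
        if time > t:
            break
        blocks = view.setdefault(manager, {})
        if state is None:
            blocks.pop(block, None)
        else:
            blocks[block] = state
    return view


# Illustrative events for block (0, 0).
events = [
    (0, "RAM", (0, 0), "owned"),
    (5, "GPU1", (0, 0), "shared"),
    (5, "RAM", (0, 0), "shared"),
    (9, "RAM", (0, 0), None),
]
print(snapshot(events, 6))
print(snapshot(events, 10))
```

Calling `snapshot` for a sequence of increasing timestamps yields the frames that, stacked together, form the animation.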
E. Heatmap to verify Memory Block residence along time
Apart from the previous memory snapshot visualization, we are also interested in an execution overview of the locality of the handles among the managers. Our final panel consists of a traditional heat map visualization to provide a summary of the total presence of the tiles on each manager. Figure 6 depicts an example. There is one visualization facet for each memory manager, and each square represents a memory block positioned on its application matrix coordinate. The shade of blue represents the total amount of time that the block is present on the manager. In Figure 6, for example, we can see that all blocks of the diagonal stayed longer in RAM compared to other blocks (because for Cholesky, dpotrf and dsyrk tasks on the diagonal are typically executed on CPUs, to let GPUs process mostly SYRK and GEMM tasks).
V. EXPERIMENTAL RESULTS WITH DENSE CHOLESKY
We use a host equipped with an Intel Xeon CPU E5-2620 with eight physical cores, 64 GB of DDR4 RAM, two NVIDIA
