232,684 research outputs found

    Introducing distributed dynamic data-intensive (D3) science: Understanding applications and infrastructure

    Get PDF
    A common feature across many science and engineering applications is the amount and diversity of data and computation that must be integrated to yield insights. Data sets are growing larger and becoming distributed; and their location, availability and properties are often time-dependent. Collectively, these characteristics give rise to dynamic distributed data-intensive applications. While "static" data applications have received significant attention, the characteristics, requirements, and software systems for the analysis of large volumes of dynamic, distributed data, and data-intensive applications have received relatively less attention. This paper surveys several representative dynamic distributed data-intensive application scenarios, provides a common conceptual framework to understand them, and examines the infrastructure used in support of applications.Comment: 38 pages, 2 figure

    IMPROVING THE PERFORMANCE AND TIME-PREDICTABILITY OF GPUs

    Get PDF
    Graphic Processing Units (GPUs) are originally mainly designed to accelerate graphic applications. Now the capability of GPUs to accelerate applications that can be parallelized into a massive number of threads makes GPUs the ideal accelerator for boosting the performance of such kind of general-purpose applications. Meanwhile it is also very promising to apply GPUs to embedded and real-time applications as well, where high throughput and intensive computation are also needed. However, due to the different architecture and programming model of GPUs, how to fully utilize the advanced architectural features of GPUs to boost the performance and how to analyze the worst-case execution time (WCET) of GPU applications are the problems that need to be addressed before exploiting GPUs further in embedded and real-time applications. We propose to apply both architectural modification and static analysis methods to address these problems. First, we propose to study the GPU cache behavior and use bypassing to reduce unnecessary memory traffic and to improve the performance. The results show that the proposed bypassing method can reduce the global memory traffic by about 22% and improve the performance by about 13% on average. Second, we propose a cache access reordering framework based on both architectural extension and static analysis to improve the predictability of GPU L1 data caches. The evaluation results show that the proposed method can provide good predictability in GPU L1 data caches, while allowing the dynamic warp scheduling for good performance. Third, based on the analysis of the architecture and dynamic behavior of GPUs, we propose a WCET timing model based on a predictable warp scheduling policy to enable the WCET estimation on GPUs. The experimental results show that the proposed WCET analyzer can effectively provide WCET estimations for both soft and hard real-time application purposes. Last, we propose to analyze the shared Last Level Cache (LLC) in integrated CPU-GPU architectures and to integrate the analysis of the shared LLC into the WCET analysis of the GPU kernels in such systems. The results show that the proposed shared data LLC analysis method can improve the accuracy of the shared LLC miss rate estimations, which can further improve the WCET estimations of the GPU kernels

    Performance Models for Data Transfers: A Case Study with Molecular Chemistry Kernels

    Get PDF
    With increasing complexity of hardwares, systems with different memory nodes are ubiquitous in High Performance Computing (HPC). It is paramount to develop strategies to overlap the data transfers between memory nodes with computations in order to exploit the full potential of these systems. In this article, we consider the problem of deciding the order of data transfers between two memory nodes for a set of independent tasks with the objective to minimize the makespan. We prove that with limited memory capacity, obtaining the optimal order of data transfers is a NP-complete problem. We propose several heuristics for this problem and provide details about their favorable situations. We present an analysis of our heuristics on traces, obtained by running 2 molecular chemistry kernels, namely, Hartree-Fock (HF) and Coupled Cluster Single Double (CCSD) on 10 nodes of an HPC system. Our results show that some of our heuristics achieve significant overlap for moderate memory capacities and are very close to the lower bound of makespan
    corecore