25 research outputs found

    Algorithms for Preemptive Co-scheduling of Kernels on GPUs

    Modern GPUs allow concurrent kernel execution and preemption to improve hardware utilization and responsiveness. Currently, the decision on the simultaneous execution of kernels is made by the hardware, which can lead to unreasonable use of resources. In this work, we tackle the problem of co-scheduling for GPUs in high-competition scenarios. We propose a novel graph-based preemptive co-scheduling algorithm focused on reducing the number of preemptions. We show that the optimal preemptive makespan can be computed by solving a Linear Program in polynomial time. Based on this solution, we propose a graph-theoretical model and an algorithm to build preemptive schedules that minimize the number of preemptions. We show, however, that finding the minimal number of preemptions among all preemptive solutions of optimal makespan is an NP-hard problem. We performed experiments on real-world GPU applications, and our approach can achieve the optimal makespan while preempting only 6 to 9% of the tasks.

    Independent tasks on 2 resources with co-scheduling effects

    Concurrent kernel execution is a relatively new feature in modern GPUs, designed to improve hardware utilization and overall system throughput. However, the decision on the simultaneous execution of tasks is made by the hardware with a leftover policy, which assigns as many resources as possible to one task and then assigns the remaining resources to the next. This can lead to unreasonable use of resources. In this work, we tackle the problem of co-scheduling for GPUs with and without preemption, focusing on determining the kernels' submission order so as to reduce the number of preemptions and the kernels' makespan, respectively. We propose a graph-based theoretical model to build preemptive and non-preemptive schedules. We show that the optimal preemptive makespan can be computed by solving a Linear Program in polynomial time, and we propose an algorithm based on this solution which minimizes the number of preemptions. We also propose an algorithm that transforms a preemptive solution of optimal makespan into a non-preemptive solution with the smallest possible preemption overhead. We show, however, that finding the minimal number of preemptions among all preemptive solutions of optimal makespan is an NP-hard problem, and that computing the optimal non-preemptive schedule is also NP-hard. In addition, we study the non-preemptive problem directly, without first searching for a good preemptive solution, and present a Mixed Integer Linear Program formulation for it. We performed experiments on real-world GPU applications, and our approach can achieve the optimal makespan by preempting 6 to 9% of the tasks. Our non-preemptive approach, on the other hand, obtains makespans within 2.5% of the optimal preemptive schedules, while previous approaches exceed the preemptive makespan by 5 to 12%.
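    To illustrate the kind of Linear Program involved, here is a deliberately small toy model (my own sketch, not the paper's exact formulation): each of three tasks can run alone at one speed, or co-scheduled in a pair at a lower per-task speed; the decision variables are the durations of each alone/pair configuration, each task must complete its work, and the makespan is the sum of all configuration durations, which the LP minimizes.

    ```python
    import numpy as np
    from scipy.optimize import linprog

    # Toy instance: 3 tasks, each with 10 units of work.
    # Speed 2 when a task runs alone on the GPU; speed 1.5 per task when
    # two tasks are co-scheduled (combined throughput 3 > 2, so sharing pays off).
    work = 10.0
    s_alone, s_pair = 2.0, 1.5

    # Variables: durations [x1, x2, x3, x12, x13, x23] of each configuration.
    c = np.ones(6)  # makespan = total time spent across all configurations
    A_eq = np.array([
        [s_alone, 0, 0, s_pair, s_pair, 0],   # task 1 completes its work
        [0, s_alone, 0, s_pair, 0, s_pair],   # task 2
        [0, 0, s_alone, 0, s_pair, s_pair],   # task 3
    ])
    b_eq = np.full(3, work)

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    print(f"optimal preemptive makespan: {res.fun:.4f}")  # 10.0 for this instance
    ```

    For this instance the optimum is 10: total work is 30 and the best achievable throughput is 3, reached by running all three pairs for 10/3 time units each, which beats the alone-only makespan of 15.
    
    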

    A Region-Growing Segmentation Algorithm for GPUs

    This article proposes a parallel region-growing image segmentation algorithm targeted at Graphics Processing Units (GPUs). The proposed algorithm derives from a sequential algorithm widely used by the Geographic Object-Based Image Analysis (GEOBIA) remote sensing community. Relative to the sequential version, this work proposes new attributes for characterizing the morphological heterogeneity of segments, whose computation can be performed more efficiently on GPUs. Two variants of the parallel algorithm, with different heuristics for selecting the adjacent segments to be merged at each iteration, are described. To exploit the potential of GPUs for the parallel execution of fine-grained threads, the proposed algorithm assigns one thread to each image pixel, which also contributes to a more uniform distribution of the computational load among the GPU processors. A detailed experimental analysis using a conventional GPU on four test images showed speedups above 8 relative to the sequential algorithm.
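    To make the region-growing idea concrete, here is a deliberately naive sequential sketch (my own illustration, not the paper's algorithm or its heterogeneity attributes): every pixel starts as its own segment, mirroring the one-thread-per-pixel mapping, and 4-adjacent segments are merged while the difference between their mean intensities stays below a threshold.

    ```python
    import numpy as np

    def region_growing(img, thr):
        # One segment per pixel initially (the GPU version assigns one thread per pixel).
        labels = np.arange(img.size).reshape(img.shape)
        h, w = img.shape
        changed = True
        while changed:          # repeat passes until no merge happens
            changed = False
            for y in range(h):
                for x in range(w):
                    for ny, nx in ((y, x + 1), (y + 1, x)):  # right and down neighbours
                        if ny < h and nx < w and labels[ny, nx] != labels[y, x]:
                            la, lb = labels[y, x], labels[ny, nx]
                            # merge if the segments' mean intensities are close enough
                            if abs(img[labels == la].mean() - img[labels == lb].mean()) < thr:
                                labels[labels == lb] = la
                                changed = True
        return labels

    img = np.array([[0, 0, 10, 10],
                    [0, 0, 10, 10]], dtype=float)
    seg = region_growing(img, thr=1.0)
    print(len(np.unique(seg)))  # 2 segments: the dark half and the bright half
    ```

    The quadratic rescans make this far too slow for real imagery; the point of the GPU variants described above is precisely to evaluate merge candidates for all pixels in parallel.
    
    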

    Abstract

    No full text
    Cardiovascular diseases are the leading cause of death and disability in the world. Non-invasive techniques are required to reduce the number of deaths and to improve patients' quality of life. These techniques usually rely on 3D visualization of MRI or CT data. In this work, we describe how improved volume rendering techniques, combined with graphics card programming, can provide interactive visualization of the heart's internal structures. Our main focus is to provide doctors with high-performance 3D images for evaluating a patient's heart anatomy and performance. Our idea is to take full advantage of the triangle-rendering hardware to provide interactive frame rates.

    Memory Efficient and Robust Software Implementation of the Raycast Algorithm

    In this paper we propose two novel software implementations of the ray-casting volume rendering algorithm for irregular grids, called ME-Raycast (Memory-Efficient Ray-Casting) and EME-Raycast (Enhanced Memory-Efficient Ray-Casting). Our algorithms improve on previous work by Bunyk et al. [1] in terms of complete handling of degenerate cases, memory consumption, and the types of cell allowed in the grid (tetrahedral and/or hexahedral). The use of a more compact and non-redundant data structure allowed us to achieve higher memory efficiency. Our results show consistent and significant gains in the memory usage of ME-Raycast and EME-Raycast compared to the implementation of Bunyk et al. Furthermore, our results also show that handling degenerate cases generates accurate images, correctly rendering all the pixels in the image, while the implementation of Bunyk et al. fails to render up to 38 pixels in the final image. When we compare our algorithms to another robust rendering algorithm, ZSweep [2], we obtain considerable performance gains and competitive memory consumption. We conclude that ME-Raycast and EME-Raycast are efficient ray-casting methods that allow in-core rendering of large datasets with no image errors.

    Electronic Auction with autonomous intelligent agents: finding opportunities by being there

    No full text
    The overwhelming number of options brought by the Internet's explosive growth raises new issues for users buying and/or selling goods with the net as the business medium. Goods and services can be exchanged, sold directly, or negotiated in auctions. In any of these situations, finding the required product at the right price is the big challenge for Internet users. Especially in e-auctions, timing and strategic actions are vital to a successful deal. In this paper, we propose a model for e-auctions based on intelligent agent technology. The use of agents makes it possible to better reflect what happens in real auctions. Agents act together with buyers, sellers, and auctioneers to assist them in obtaining the best deal, or at least in finding a Nash equilibrium point.

    Memory-Aware and Efficient Ray-Casting Algorithm

    No full text
    Ray-casting implementations require that the connectivity between the cells of the dataset be explicitly computed and kept in memory. This constitutes a huge obstacle to obtaining real-time rendering of very large models. In this paper, we address this problem by introducing a new implementation of the ray-casting algorithm for irregular datasets. Our implementation optimizes the memory usage of past implementations by exploiting ray coherence. The idea is to keep in main memory the information on the faces traversed by the rays cast through every pixel under the projection of a visible face. Our results show that exploiting pixel coherence considerably reduces memory usage, while keeping the performance of our algorithm competitive with the fastest previous ones.

    Performance Analysis and Optimization of the Vector-Kronecker Product Multiplication

    No full text
    The Kronecker product, also called the tensor product, is a fundamental matrix algebra operation, used to model complex systems with structured descriptions. This operation needs to be computed efficiently, since it is a critical kernel for iterative algorithms. In this work, we focus on the vector-Kronecker product operation, presenting an in-depth performance analysis of a previously proposed sequential algorithm and a parallel one. Based on this analysis, we propose three optimizations: changing the memory access pattern, reducing load imbalance, and manually vectorizing some portions of the code with Intel SSE4.2 intrinsics. The obtained results show better cache usage and load balance, thus improving performance, especially for larger matrices.
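    The abstract does not include the kernel itself, but the key trick behind efficient vector-Kronecker products is to never materialize A ⊗ B, using the standard identity (A ⊗ B) vec(X) = vec(B X Aᵀ) with column-major vec. The sketch below (a generic reformulation, not necessarily the paper's exact algorithm) shows the reshaping approach:

    ```python
    import numpy as np

    def kron_matvec(A, B, x):
        """Compute y = (A ⊗ B) x without forming the Kronecker product.

        Uses (A ⊗ B) vec(X) = vec(B X Aᵀ), with vec taken column-major,
        turning an O((mn)^2) matvec into two small matrix products.
        """
        m = A.shape[1]
        n = B.shape[1]
        X = x.reshape((n, m), order='F')   # inverse of column-major vec
        Y = B @ X @ A.T
        return Y.reshape(-1, order='F')    # column-major vec of the result

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))
    B = rng.standard_normal((4, 4))
    x = rng.standard_normal(12)

    # Agrees with the explicit (and much more expensive) Kronecker matvec.
    assert np.allclose(kron_matvec(A, B, x), np.kron(A, B) @ x)
    ```

    The two dense products touch memory in regular strides, which is exactly where access-pattern and vectorization optimizations like those described above pay off.
    
    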