    An optimized GPU-based 2D convolution implementation

    With the increasing sophistication of image processing algorithms, and because of its low computational complexity, convolution should fully benefit from the ever-increasing capacities of state-of-the-art graphics processing units, such as Nvidia's Kepler and Maxwell family cards. Currently, it tends to be used as a preprocessing stage within more intricate image manipulations, and it has recently been implemented quite efficiently by several teams. However, their implementations either do not come near the hardware's peak performance or are unable to process large mask sizes. We overcome these limitations with an original parallel, register-only implementation of two-dimensional convolution filters that can process 32-bit floating-point images on an Nvidia K40 card using mask sizes up to 127×127 while achieving pixel throughputs over 29 GP/s, which is, to our knowledge, the highest rate reported to date. These results were obtained by using registers sparingly and by designing memory access patterns that eliminate both load and store replays at the warp level, along with optimizing cache use.
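
    For readers unfamiliar with the operation being accelerated, the following is a minimal CPU reference sketch of direct two-dimensional convolution. It is not the authors' register-only GPU implementation; the function name `convolve2d`, the list-of-lists image layout, and the zero-padded border handling are assumptions made purely for illustration. It only shows what each output pixel computes: one multiply-add per mask coefficient.

```python
def convolve2d(image, mask):
    """Direct 2D convolution of a list-of-lists image with an odd-sized mask.

    Assumption for this sketch: pixels outside the image count as zero.
    """
    h, w = len(image), len(image[0])
    mh, mw = len(mask), len(mask[0])
    ry, rx = mh // 2, mw // 2  # mask radii (odd mask dimensions assumed)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            # Each output pixel costs mh * mw multiply-adds.
            for j in range(mh):
                for i in range(mw):
                    # True convolution flips the mask relative to the image.
                    sy = y + ry - j
                    sx = x + rx - i
                    if 0 <= sy < h and 0 <= sx < w:
                        acc += image[sy][sx] * mask[j][i]
            out[y][x] = acc
    return out
```

    In this sketch every output pixel is independent of the others, which is what makes the operation a natural fit for one-thread-per-pixel GPU execution. The work per pixel grows with the mask area, so a 127×127 mask requires 16,129 multiply-adds per output pixel.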