Decreasing time consumption of microscopy image segmentation through parallel processing on the GPU
The computational performance of graphics processing units (GPUs) has improved significantly. Speedup factors of more than 50x compared to single-threaded CPU execution are not uncommon thanks to parallel processing. This makes GPUs very appealing for high-throughput microscopy image analysis. Unfortunately, GPU programming is not straightforward and requires considerable programming skill and effort. Additionally, the attainable speedup factor is hard to predict, since it depends on the type of algorithm, the input data, and the way in which the algorithm is implemented. In this paper, we identify the characteristic algorithm- and data-dependent properties that significantly relate to the achievable GPU speedup. We find that the overall GPU speedup depends on three major factors: (1) the coarse-grained parallelism of the algorithm, (2) the size of the data, and (3) the computation/memory-transfer ratio. This is illustrated on two well-known types of segmentation methods that are extensively used in microscopy image analysis: SLIC superpixels and high-level geometric active contours. In particular, we find that the geometric active contour segmentation algorithm we used is very suitable for parallel processing, resulting in acceleration factors of 50x for 0.1-megapixel images and 100x for 10-megapixel images.
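A minimal sketch of factor (3): once host-to-device transfers are included, a kernel's raw speedup can collapse. The numbers and the toy model below are illustrative assumptions, not measurements from the paper.

    # Toy model of effective GPU speedup (illustrative assumption, not
    # from the paper): effective speedup = t_cpu / (t_gpu + t_transfer).
    # A high computation/memory-transfer ratio is needed for the raw
    # kernel speedup to survive the host<->device round trip.

    def effective_speedup(t_cpu, t_gpu_compute, t_transfer):
        """Overall speedup once host<->device transfers are counted."""
        return t_cpu / (t_gpu_compute + t_transfer)

    t_cpu = 10.0   # hypothetical: seconds on one CPU core
    t_gpu = 0.1    # hypothetical: pure GPU compute (100x kernel speedup)
    for t_xfer in (0.0, 0.1, 0.9):   # seconds spent copying data
        print(f"transfer={t_xfer:.1f}s -> "
              f"{effective_speedup(t_cpu, t_gpu, t_xfer):.1f}x overall")

With no transfer cost the full 100x survives; at 0.9 s of copying, the same kernel yields only 10x overall, which is why the compute/transfer ratio is one of the three factors.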
CRBLASTER: A Parallel-Processing Computational Framework for Embarrassingly-Parallel Image-Analysis Algorithms
The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then developing a parallel-processing implementation of that algorithm can be a much easier task, because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called CRBLASTER, which does cosmic-ray rejection of CCD (charge-coupled device) images using the embarrassingly parallel L.A.COSMIC algorithm. CRBLASTER is written in C using the high-performance-computing industry-standard Message Passing Interface (MPI) library. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly parallel algorithms. The CRBLASTER source code is freely available at the official application website at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800x800-pixel Hubble Space Telescope WFPC2 image takes 44 seconds with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8-GHz quad-core Intel Xeon processors. CRBLASTER processes the same image 7.4 times faster on a single core of the same machine; processing it simultaneously on all 8 cores of the same machine takes 0.875 seconds, a speedup factor of 50.3 over the IRAF script. A detailed analysis is presented of the performance of CRBLASTER using between 1 and 57 processors on a low-power Tilera 700-MHz 64-core TILE64 processor. (Comment: 8 pages, 2 figures, 1 table, accepted for publication in PASP)
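CRBLASTER itself is written in C with MPI; purely as an illustration of the embarrassingly-parallel pattern it describes, here is a minimal Python/mpi4py sketch that splits an image into row blocks, lets each rank process its block independently, and gathers the results. The median filter merely stands in for the cosmic-ray-rejection step and is an assumption for illustration.

    # Embarrassingly-parallel image processing with MPI (mpi4py sketch;
    # not CRBLASTER's code). Each rank works on its own block, so the only
    # communication is the initial scatter and the final gather.
    import numpy as np
    from mpi4py import MPI
    from scipy.ndimage import median_filter

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    blocks = None
    if rank == 0:
        image = np.random.rand(800, 800)              # stand-in for a CCD frame
        blocks = np.array_split(image, size, axis=0)  # one row block per rank

    block = comm.scatter(blocks, root=0)      # distribute the blocks
    cleaned = median_filter(block, size=3)    # independent per-rank work
    parts = comm.gather(cleaned, root=0)      # collect the processed blocks

    if rank == 0:
        print(np.vstack(parts).shape)         # (800, 800)

Run with, e.g., mpiexec -n 8 python sketch.py. A real code would overlap the blocks with halo rows so that neighborhood operations do not see artificial edges at block boundaries.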
Batch Size Influence on Performance of Graphic and Tensor Processing Units during Training and Inference Phases
The impact of the maximum possible batch size (for the best runtime) on the performance of graphics processing units (GPUs) and tensor processing units (TPUs) during the training and inference phases is investigated. Numerous runs of the selected deep neural network (DNN) were performed on the standard MNIST and Fashion-MNIST datasets. A significant speedup was obtained even for extremely low-scale usage of Google TPUv2 units (8 cores only) in comparison to the quite powerful NVIDIA Tesla K80 GPU: up to 10x for the training stage (without taking the overheads into account) and up to 2x for the prediction stage (with and without taking the overheads into account). The precise speedup values depend on the utilization level of the TPUv2 units and increase with the volume of data being processed, but for the datasets used in this work (MNIST and Fashion-MNIST, with images of size 28x28) the speedup was observed for batch sizes >512 images in the training phase and >40,000 images in the prediction phase. It should be noted that these results were obtained without detriment to the prediction accuracy and loss, which were equal for the GPU and TPU runs up to the 3rd significant digit for the MNIST dataset and up to the 2nd significant digit for the Fashion-MNIST dataset. (Comment: 10 pages, 7 figures, 2 tables)
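A minimal sketch of the kind of measurement the abstract describes: timing one MNIST training epoch at several batch sizes. The model and the timing harness are assumptions for illustration, not the authors' exact setup.

    # Time one MNIST training epoch at several batch sizes (illustrative
    # harness; the model and settings are not the authors' exact setup).
    import time
    import tensorflow as tf

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.astype("float32") / 255.0

    for batch_size in (64, 512, 4096):
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy")
        start = time.perf_counter()
        model.fit(x_train, y_train, batch_size=batch_size,
                  epochs=1, verbose=0)
        print(f"batch={batch_size:5d}: "
              f"{time.perf_counter() - start:.2f} s/epoch")

Larger batches generally raise accelerator utilization, which is consistent with the abstract's observation that the TPU only pulls ahead of the GPU beyond a batch-size threshold.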