ABSTRACT Many big data applications need to process massive numbers of images and videos, yet the performance of image processing still falls far short of requirements. This paper proposes the SunwayImg, a parallel image processing library that supports image-related applications on the Sunway many-core processor as well as the Sunway TaihuLight supercomputer. The SunwayImg integrates three kinds of image algorithms: fundamental algorithms that support basic image operations on the Sunway processor, widely used image feature extraction algorithms, and a typical neural network model, the DBN. In addition, to parallelize the various kinds of image algorithms efficiently on the Sunway processor, we propose a three-tier parallelization strategy together with fine-grained parallelization inside core-groups. Finally, we implement the SunwayImg and evaluate it on the Sunway TaihuLight supercomputer to verify its effectiveness and performance.
I. INTRODUCTION
Image processing plays an essential role in image-related applications, such as pattern recognition and content-based image retrieval. In the era of big data, we need to process massive numbers of images and videos, yet the performance of image processing still falls far short of requirements. The reason is that most image processing algorithms are both compute-intensive and data-intensive. Therefore, many investigations focus on parallelizing image processing algorithms on many-core processors, such as GPUs and the Intel Xeon Phi.
The Sunway many-core processor was released in 2016, together with the supercomputer Sunway TaihuLight [1]. With its specific architecture and 260 processor cores, the Sunway processor has strong parallel computing capabilities and is therefore suitable for parallel image processing. Furthermore, the next-generation Sunway processor will adopt a compatible architecture and be used to build future exascale supercomputers, which will provide long-term platforms for applications based on the Sunway processors.
The associate editor coordinating the review of this manuscript and approving it for publication was Mehul S. Raval.
Each CPE has a 64KB local device memory under software management. Moreover, the Sunway processor provides a different parallel programming model for the massive parallelism of the CPEs. Therefore, the limitations of the Sunway hardware and software environment prevent the naive adoption of an existing image library on Sunway. Multi-core architectures can support multithreading to speed up calculations. However, it is necessary to implement a dedicated methodology [6].
Generally, image processing algorithms have high parallelism and data locality. However, different image processing algorithms differ in complexity and parallelism. From the perspective of algorithm complexity and parallelism, it is crucial to design an appropriate parallel scheme for the heterogeneous architecture of the Sunway processor. In addition, exploiting data locality within the limited memory of the CPEs is critical for reducing the cost of data transfer from main memory to local device memory via DMA.
To overcome the above challenges, this paper proposes the SunwayImg, a parallel image processing library, to support image-related applications on the Sunway many-core processors as well as the TaihuLight supercomputer. The library provides an image processing interface to the user. The user only needs to call the interface of the library to process images in parallel, and does not need to know the underlying architecture of the Sunway processor in detail. The main characteristics of the SunwayImg are as follows:
The SunwayImg is the first image processing library for the Sunway processors. It provides various kinds of image processing algorithms to upper-level applications and accelerates compute-intensive algorithms using many-core resources.
We propose a three-tier parallelization strategy and a master-slave acceleration model to parallelize image algorithms on the Sunway processor. By using the three-tier parallelization strategy, the SunwayImg supports three kinds of parallelization schemes according to the characteristics of image algorithms: multiple-images-one-core-group, one-image-one-core-group, and one-image-multiple-core-groups. Meanwhile, by using the master-slave acceleration model, the SunwayImg achieves fine-grained parallelism and optimized data transfer inside a Sunway core-group.
The SunwayImg integrates three kinds of image algorithms: firstly, fundamental algorithms that provide basic operations on images, including codec/file access, color space transformation, rotation, normalization of images, etc.; secondly, typical and widely used computer vision algorithms for extracting image features, including LBP (Local Binary Pattern), HOG (Histogram of Oriented Gradient), SIFT (Scale-Invariant Feature Transform) and SURF (Speeded Up Robust Features); thirdly, DBN (deep belief network), a typical neural network model to support image-related deep learning applications.
The rest of this paper is organized as follows. Section 2 introduces the architecture and software environment of the Sunway processor. Section 3 gives an overview of the SunwayImg library. Section 4 describes the parallelization of image processing algorithms on the Sunway processor in detail. Section 5 introduces the parallel scheduling scheme for massive images based on the TaihuLight supercomputer. Section 6 evaluates the SunwayImg. Finally, we conclude the paper in Section 7.
II. BACKGROUND
A. THE SUNWAY MANY-CORE PROCESSOR
Figure 1 shows the general architecture of the Sunway SW26010 processor [7], which is composed of four core-groups (CGs) connected by an on-chip network. Each core-group is composed of one full-function 64-bit RISC core (management processing element, MPE) and 64 computing processing elements (CPEs) organized as an 8x8 cluster. The MPE is responsible for general management, memory management, and communication with the CPEs, while each CPE is also a 64-bit RISC core, with a 128-bit SIMD component, that supports thread-level parallelism. Each core-group has its own memory space, which is connected to the MPE and the CPE cluster via the memory controller (MC). The system interface (SI) connects the chip to the off-chip system via the PCIe interface [8]. In brief, each Sunway SW26010 processor consists of 260 processor cores organized in four core-groups, with each core-group consisting of one MPE and 64 CPEs.
The Sunway SW26010 processor uses a two-level hierarchical memory system. Each MPE has a 32 KB L1 instruction cache and a 32 KB L1 data cache, as well as a 256 KB L2 cache for both instructions and data. Each CPE has its private 16 KB L1 instruction cache and a 64 KB user-controlled scratch pad memory (SPM). Each core-group has 8 GB of memory with 34.1 GB/s bandwidth. The peak performance of the Sunway SW26010 is 3.06 TeraFlops at a clock frequency of 1.45 GHz. Furthermore, the peak performance of the Sunway SW26020, an upgrade of the SW26010 with the same architecture, will be 8-10 TeraFlops.
B. SOFTWARE ENVIRONMENT OF THE SUNWAY PROCESSOR
The software system of the Sunway processor includes essential software, a parallel operating system environment, a high-performance storage management system, a parallel programming language and compilation environment, and a parallel development environment [8]. The programming languages supported by the Sunway processor include C/C++ and Fortran. For parallel programming, MPI and OpenMP are recommended for the four core-groups in a processor, while the Sunway OpenACC and Athread APIs are used for the 64 CPEs within a core-group. The customized Sunway Athread is similar to pthreads and defines an application programming interface (API) for writing multiple threads executed on the 64 computing cores. The customized Sunway OpenACC tool is extended from the original OpenACC 2.0 standard. In addition, the Sunway processor provides the SIMD, an automatic vectorization tool, to utilize the 128-bit SIMD component of each CPE.
III. OVERVIEW OF THE SUNWAYIMG
A. ARCHITECTURE OF THE SUNWAYIMG
As an image library, the primary objective of SunwayImg is providing various kinds of APIs to image-related applications and supporting parallel image processing on the Sunway processor efficiently. To achieve this objective, our SunwayImg integrates three kinds of image algorithms:
Fundamental image processing algorithms: algorithms that provide basic operations on images, e.g., a codec for JPEG formats and image file access, color space transformation, image rotation, and normalization.
Image feature extraction algorithms: typical and widely used computer vision algorithms for extracting image features, including LBP, HOG, SIFT and SURF.
DBN: a typical neural network model to support image-related deep learning applications.
Figure 2 shows the architecture of the SunwayImg, which is designed in three layers. The bottom layer includes fundamental image algorithms, basic data structures, and memory management facilities. The middle layer comprises image feature extraction algorithms and DBNs. The top layer is the application interface for processing varying numbers of images on both a single Sunway processor and a supercomputer with multiple Sunway nodes (e.g., Sunway TaihuLight).
B. PARALLELIZATION STRATEGY OF IMAGE ALGORITHMS FOR THE SUNWAY PROCESSOR
Although most image processing algorithms are compute-intensive, their computational complexities are not as high as those of traditional engineering or scientific applications, such as computational fluid dynamics (CFD) and molecular dynamics simulation. Some image algorithms are both compute-intensive and data-intensive when massive numbers of images are involved. In particular, different image algorithms differ significantly in their computational complexities.
On the other hand, some specific characteristics of the Sunway processor need to be taken into account when parallelizing image algorithms. Firstly, the Sunway processor is composed of 4 symmetric core-groups, each of which consists of one general-purpose core (MPE) and 64 acceleration cores (CPEs), and the vast majority of the computing power comes from the CPEs. Secondly, in each core-group, the CPEs are controlled by the MPE and can only access their local private store (scratch pad memory, SPM), which has a limited size. Instead of sharing main memory, data in the private store can only be transferred to/from main memory via DMA under the control of the MPE.
Considering characteristics of image algorithms as well as the specific architecture of the Sunway processor, we propose a three-tier parallel strategy which consists of three kinds of parallelization schemes at different levels, briefly introduced as follows:
multiple-images-one-core-group scheme: this scheme processes multiple images in one core-group of the Sunway processor, and is suitable for algorithms with low computational complexity. Under this scheme, because more data need to be transferred between main memory and the local store (SPM) of the Sunway processor, the parallelization design inside core-groups should focus on i. reducing data transfers; ii. improving the efficiency of the transfers.
one-image-one-core-group scheme: this scheme processes one image in one core-group, and is suitable for algorithms with medium computational complexity. Because an entire image is processed in one core-group each time, the main focus of this scheme is how to divide the image into blocks to use the CPEs efficiently.
one-image-multiple-core-groups scheme: this scheme processes one image using multiple core-groups, and is suitable for algorithms with high computational complexity. The main concern of this scheme is how to eliminate or reduce communication among core-groups.
According to our evaluation, the appropriate scheme for some image algorithms depends not only on their computational complexity and behavior but also on the number of images to process, and the number of images can only be determined at runtime. To solve this problem, we design an adaptive mechanism in the interface layer of the SunwayImg, which selects the appropriate scheme according to the number of images to process.
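A minimal sketch of how such a runtime selection could look is given here for illustration. The function name `select_scheme`, the three complexity labels, and the rule "split one image across core-groups only when there are fewer images than core-groups" are assumptions made for this example and are not taken from the SunwayImg interface.

```python
from enum import Enum

CORE_GROUPS_PER_PROCESSOR = 4  # four core-groups per SW26010, as stated in the text


class Scheme(Enum):
    MULTI_IMAGES_ONE_CG = "multiple-images-one-core-group"
    ONE_IMAGE_ONE_CG = "one-image-one-core-group"
    ONE_IMAGE_MULTI_CG = "one-image-multiple-core-groups"


def select_scheme(complexity, num_images,
                  num_core_groups=CORE_GROUPS_PER_PROCESSOR):
    """Pick a scheme from the algorithm complexity and the runtime image count.

    Assumption: complexity is one of "low", "medium", "high"; the policy for
    high-complexity algorithms is hypothetical.
    """
    if complexity not in ("low", "medium", "high"):
        raise ValueError(f"unknown complexity class: {complexity}")
    if complexity == "low":
        return Scheme.MULTI_IMAGES_ONE_CG
    if complexity == "medium":
        return Scheme.ONE_IMAGE_ONE_CG
    # High complexity: splitting one image across core-groups is chosen only
    # when there are too few images to keep every core-group busy.
    if num_images >= num_core_groups:
        return Scheme.ONE_IMAGE_ONE_CG
    return Scheme.ONE_IMAGE_MULTI_CG
```

The point of the sketch is only that the decision takes the image count as a runtime argument, so the same algorithm can be mapped to different schemes for different workloads.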
IV. PARALLELIZATION OF IMAGE ALGORITHMS ON THE SUNWAY PROCESSOR
In this section, we describe the parallelization of image algorithms on the Sunway many-core processor in detail. Fundamental image processing algorithms consist of a series of simple steps, which are pixel-by-pixel mathematical operations and can achieve satisfactory parallelism with SIMD instructions. On the other hand, computer vision algorithms and deep learning algorithms have higher computational complexity. We parallelize these two types of image algorithms separately, according to the characteristics of the algorithms and the specific architecture of the Sunway processor.
A. PARALLELIZATION OF DEEP BELIEF NETWORKS
1) RESTRICTED BOLTZMANN MACHINES AND DEEP BELIEF NETWORKS
In order to process images in artificial neural networks, each binary image is treated as a training sample that can be modeled by a two-layer network called a Restricted Boltzmann Machine (RBM), in which binary pixels are connected to stochastic binary feature detectors through symmetrically weighted connections. One layer of the network consists of visible units, which correspond to pixels and whose states can be observed; the other layer consists of hidden units, which correspond to feature detectors. In the typical structure of an RBM, a joint configuration $(v, h)$ of the visible and hidden units has an energy defined as:
$$E(v, h) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i h_j w_{i,j} \tag{1}$$
where $v_i$, $h_j$ are the binary states of visible unit $i$ and hidden unit $j$; $a_i$, $b_j$ are their biases and $w_{i,j}$ is the weight between $v_i$ and $h_j$. The RBM is a probabilistic model on the binary variables $v_i$ and $h_j$. The probability of each configuration $(v, h)$ is defined via the energy function as:
$$p(v, h) = \frac{1}{Z} e^{-E(v, h)} \tag{2}$$
where the "partition function", $Z$, is given by:
$$Z = \sum_{v, h} e^{-E(v, h)} \tag{3}$$
Given an input configuration, i.e., a randomly selected training image $v$, hidden unit $j$ is set to 1 with the probability given by (4):
$$p(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_{i} v_i w_{i,j}\Big) \tag{4}$$
where $\sigma$ denotes the logistic sigmoid. Given a hidden vector $h$, visible unit $i$ is set to 1 with the probability given by (5):
$$p(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_{j} h_j w_{i,j}\Big) \tag{5}$$
In order to maximize the probability that the network assigns to a training image, RBMs adjust $a_i$, $b_j$ and $w_{i,j}$ to lower the energy of that image while raising the energy of other images. The gradient of the log probability of the training data with respect to a weight $w_{i,j}$, used for gradient ascent, is given as:
$$\frac{\partial \log p(v)}{\partial w_{i,j}} = \langle v_i h_j \rangle_{0} - \langle v_i h_j \rangle_{\infty} \tag{6}$$
where $\langle \cdot \rangle_{0}$ denotes the expectation with the visible units set to the training data and $\langle \cdot \rangle_{\infty}$ the expectation under the model distribution. However, the calculation of $\langle v_i h_j \rangle_{\infty}$ starts from a random state of the visible units and requires alternating Gibbs sampling for a very long time [9]. Therefore, Hinton proposed the Contrastive Divergence algorithm, or CD-K algorithm. It starts by setting the states of the visible units to a training vector; then the values of the hidden units are computed using equation (4); once the values of the hidden units are chosen, the probability of setting each $v_i$ to 1 is given by equation (5). CD-K replaces the second term in (6) with the statistics obtained after $k$ full steps of alternating Gibbs sampling. After that, the updates of the weights $W$, bias $a$ and bias $b$ are calculated as (7), (8), (9) respectively, where $\epsilon$ is the learning rate.
$$\Delta W = \epsilon \left( \langle v h^{T} \rangle_{0} - \langle v h^{T} \rangle_{k} \right) \tag{7}$$
$$\Delta a = \epsilon \left( \langle v \rangle_{0} - \langle v \rangle_{k} \right) \tag{8}$$
$$\Delta b = \epsilon \left( \langle h \rangle_{0} - \langle h \rangle_{k} \right) \tag{9}$$
Then the weights and biases are updated with $\Delta W$, $\Delta a$, and $\Delta b$ respectively, using the stochastic gradient descent (SGD) algorithm. Because the feature reconstruction ability of RBMs can be used in deep learning networks, multiple RBMs are stacked together to form Deep Belief Networks (DBNs), which have a stronger capability for feature representation. In a deep belief network, the lower RBM layers extract low-level features from the inputs (the training dataset in general), while the higher layers extract higher-level features from the low-level features. Therefore, DBNs can learn concepts by extracting features level by level and can be used for classification after adding a supervised classifier [10], such as a support vector machine or logistic regression.
Training of a DBN can be divided into two steps: greedy unsupervised layer-wise pre-training and supervised backpropagation fine-tuning. The unsupervised layer-wise pre-training method uses unlabeled data to train one layer of the DBN at a time. After that, a traditional supervised training method is used to fine-tune the DBN. Training a DBN is a computationally expensive task and requires a considerable amount of time. Therefore, much research in recent years has focused on accelerating the training process of DBNs, and the training of DBNs has achieved considerable speedups on various kinds of hardware platforms, including CPU, GPU, and FPGA [11], [12].
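As an illustration of the CD-K procedure with k = 1, the following sketch performs one update on a single binary training vector. It is a plain sequential reference, not the Sunway implementation; the function names are made up for this example, and using hidden probabilities rather than sampled hidden states in the update statistics is an assumption (a common practical choice).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(probabilities, rng):
    # Draw a binary state for every unit from its activation probability.
    return [1 if rng.random() < p else 0 for p in probabilities]


def hidden_probs(v, W, b):
    # p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * w_ij), cf. equation (4)
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    # p(v_i = 1 | h) = sigmoid(a_i + sum_j h_j * w_ij), cf. equation (5)
    return [sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(a))]


def cd1_update(v0, W, a, b, lr, rng):
    """One CD-1 step; W, a and b are updated in place."""
    ph0 = hidden_probs(v0, W, b)                # positive phase
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, a), rng)   # reconstruction
    ph1 = hidden_probs(v1, W, b)                # negative phase
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        a[i] += lr * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])
    return v1


rng = random.Random(0)
W = [[0.0] * 2 for _ in range(3)]   # 3 visible units, 2 hidden units
a, b = [0.0] * 3, [0.0] * 2
reconstruction = cd1_update([1, 0, 1], W, a, b, lr=0.1, rng=rng)
```

In a full training loop this step would be applied over mini-batches and epochs, and the matrix products would be the part offloaded to the CPE cluster.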
2) IMPLEMENTATION OF RBM AND DBNs ON THE SUNWAY PROCESSOR
As mentioned in Section 2, each core-group of the Sunway processor has an MPE for logic control and 64 CPEs for parallel computing. Based on this specific architecture, we propose a master-slave acceleration model, as shown in Fig. 3. Firstly, the image(s) are loaded on the MPE and the data are prepared for the computing functions of the CPEs; then multiple parallel threads are created on the CPEs to execute the compute-intensive steps, with each thread bound to one CPE (core) that owns 64 KB of local memory (SPM) for storing data. The image data are transferred from main memory to the SPM of the CPEs via DMA. After that, the image data are processed in parallel on the CPEs, and the results are transferred from the local store of each CPE back to main memory via DMA. Finally, the procedure returns to the MPE and closes the CPE cluster after collecting the final results.
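The data flow of this master-slave model can be mimicked with a small sequential simulation. The sketch below is not Athread code: list slicing stands in for the DMA transfers, the loop stands in for the 64 CPE threads, and the 4-byte pixel size, the even split of the SPM between input and output, and the round-robin tile assignment are assumptions made for illustration.

```python
SPM_BYTES = 64 * 1024    # local store per CPE, from the text
NUM_CPES = 64            # CPEs per core-group, from the text
BYTES_PER_PIXEL = 4      # assumption: 32-bit floating-point pixels


def process_on_core_group(pixels, kernel, spm_bytes=SPM_BYTES):
    """Simulate MPE-controlled tiling of a flat pixel list over the CPEs."""
    # Assumption: half of the SPM holds the input tile, half the output tile.
    tile_len = spm_bytes // (2 * BYTES_PER_PIXEL)
    result = [None] * len(pixels)
    tiles_per_cpe = [0] * NUM_CPES
    for tile_idx, start in enumerate(range(0, len(pixels), tile_len)):
        cpe = tile_idx % NUM_CPES                          # round-robin assignment
        local_in = pixels[start:start + tile_len]          # stands in for a DMA get
        local_out = [kernel(p) for p in local_in]          # compute on the CPE
        result[start:start + len(local_out)] = local_out   # stands in for a DMA put
        tiles_per_cpe[cpe] += 1
    return result, tiles_per_cpe


doubled, load = process_on_core_group(list(range(20000)), lambda p: p * 2)
```

With these assumptions a tile holds 8192 pixels, so the 20000-pixel example is split into three tiles handled by three different CPEs.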
Generally, most of the calculations of RBM training can be converted into matrix or vector operations. For example, calculating the values of the hidden units H from the visible input V, the transposed weight matrix $W^T$ and the bias b involves, in turn, one matrix multiplication, one vector addition and one vector sigmoid (and a vector-binomial operation in the special model). To accelerate these time-consuming operations, some deep learning frameworks (e.g., Caffe) use the Basic Linear Algebra Subprograms (BLAS) [13], [14] to achieve optimized performance on CPU and GPU. The Sunway processor also provides a similar mathematics library that can invoke the CPE cluster to support parallel matrix and vector computations. Based on the master-slave acceleration model, we use these interfaces to achieve higher efficiency for common matrix and vector computations. Moreover, to avoid being trapped in local optima and to avoid overfitting, we add momentum [15] and weight decay [16] when computing gradients.
However, the Sunway processor does not provide some of the matrix/vector operations needed by RBMs and DBNs, such as sigmoid and softmax. We design several such operations on the CPE cluster of the Sunway processor, which involve plenty of DMA transfers between the main memory of the core-group and the local memory of the CPE cluster. Data transfer via DMA is relatively inefficient, and DMA transfers occur frequently during the whole process. Therefore, we design several fast matrix/vector operations for this case. We make the best possible use of the memory space of each lightweight CPE to improve data reuse in the local memory. Fully utilizing the local memory reduces the amount of data moved by DMA and therefore improves performance.
We implement DBNs based on the RBM implementation mentioned above. In addition to the layer-wise training process, we apply a fine-tuning procedure which includes a feedforward pass through all layers, a backward pass from the top of the network to the bottom, and the computation of gradients and updating of the weights of the network. To support image classification, we add a logistic regression layer with Sunway acceleration on top of the DBN. We train DBNs with both the unsupervised pre-training and the supervised fine-tuning method. The training process of DBNs is described in Algorithm 1.
3) IMPLEMENTATION ON MULTIPLE SUNWAY PROCESSORS
Based on the implementation of DBNs on the Sunway processor, it is possible to train a DBN on a single core-group. Each Sunway processor has four core-groups, and many Sunway processors are organized together in the TaihuLight supercomputer. Thus, we propose a parallel model to train DBNs on multiple core-groups. Training DBNs on multi-core structures usually uses the data parallelism model [17] to perform the parallel training task.
In the data parallelism model, the training dataset is divided into n pieces; each training unit owns a copy of the entire network model and trains on one piece of the training dataset separately. After several rounds of training, a process sums the gradient descent matrices from all training units, updates the weight matrix of the network, and then distributes the updated weights to every training unit for further training. Compared with the single core-group version, the implementation for multiple core-groups needs to insert a weight synchronization process at each synchronization point of training. Each training unit starts training with the same model after weight synchronization. In general, the data parallelism model sets up a parameter server to organize weight synchronization, as shown in Fig. 4.
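The single parameter server step described above can be sketched as follows. The function name is hypothetical, weights are flattened into a list, and averaging the collected updates before applying them is an assumption; the text only states that the updates are summarized.

```python
def parameter_server_step(weights, unit_deltas, lr):
    """Combine the updates of all training units and broadcast new weights."""
    n_units = len(unit_deltas)
    summed = [sum(delta[k] for delta in unit_deltas) for k in range(len(weights))]
    # Assumption: the server averages the collected updates before applying them.
    new_weights = [w + lr * s / n_units for w, s in zip(weights, summed)]
    # Every training unit receives its own copy of the updated weights.
    return [list(new_weights) for _ in range(n_units)]


copies = parameter_server_step([0.0, 1.0], [[1.0, 2.0], [3.0, 4.0]], lr=0.5)
```

Each synchronization therefore moves one full matrix from every unit to the server and one full matrix back, which is the traffic that a multi-server design tries to spread out.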
Algorithm 1 DBNs Training
Input: Pre-training dataset pX_a_0
       Fine-tune dataset fX_a_0, label Y
* x_a_k: the #a mini-batch of the dataset output by the first k layers
  y_a: label of the #a mini-batch of the dataset
Pre-train:
for layer k : 1 to n do
    for each epoch do
        for each mini-batch a do
            load dataset px_a_(k − 1)
            train RBM (layer k) with input px_a_(k − 1)
        end for
    end for
    compute pX_a_k with RBM (layer k)
end for
for each epoch do
    for each mini-batch a do
        train Logistic Regression layer with px_a_n
    end for
end for
Fine-tune:
for each epoch do
    for each mini-batch a do
        load mini-batch fx_a_0, label y_a
        for layer k : 1 to n + 1 (n layers of RBMs and one classifier layer) do
            upward (active hidden units) computes fx_a_k with fx_a_(k − 1)
        end for
        compute error_(n + 1) with fx_a_(n + 1) and y_a
        for layer k : n + 1 to 1 do
            backward (active visible units) computes error_(k − 1) with error_k
            compute gradient descent with error_k and fx_a_k
        end for
        update weights with gradient descent
    end for
end for

In our implementation, each core-group is a training unit, in which a whole DBN model runs with the master-slave acceleration pattern. Based on the master-slave acceleration pattern, we parallelize most of the calculations of DBN training on the computing cores of the core-group. To utilize the computing power of the TaihuLight supercomputer, we adopt a data-parallelism strategy to train the DBN on separate data subsets in parallel. Therefore, it is critical to address the issues of data communication, synchronization, and workload balancing. For workload balancing, the large-scale training dataset is divided evenly among the training units, because every core-group has the same computation capacity. We use the MPI interface to perform data transfers between training units. However, data transfers between core-groups are time-consuming and consume a large amount of bandwidth. Therefore, reducing data transfer among training units is our primary task.
In addition, we propose a multiple-parameter-server strategy to minimize the cost of weight synchronization.
We set up multiple parameter servers, each of which is assigned one part of the parameters. Compared with using only one parameter server, this approach effectively reduces both the amount of data transmitted and the workload of each parameter server. Given the organization of multiple core-groups and the limited bandwidth, a linear topology turns out to be well suited for data transfers and shows good scalability [18]. It is easy to extend this approach to the training process of other deep learning models, such as the Convolutional Neural Network (CNN) and the Spiking Neural Network (SNN), because it is a data-parallel parallelization that does not change the structure of the models.
We set a parameter server in each training unit and divide the gradient descent matrix ($\Delta W$, $\Delta a$, $\Delta b$) into N/2 chunks (N is the number of training units). Half of the parameter servers are master parameter servers, and each chunk is updated by one master parameter server. We adopt the linear topology pattern to perform weight synchronization, and an example is shown in Fig. 5, which demonstrates how N training units (N = 4) perform weight synchronization.
There are three stages in weight synchronization. The first stage is gathering the gradient descent matrix: half of the parameter servers send a chunk to their next parameter server, and the rest of the parameter servers add the received chunk to their corresponding chunk. The second stage is updating the weight matrix: once a master parameter server #(k * 2) (k is 1 and 2 in this example) has gathered chunk #k from all other parameter servers, it adds the corresponding part of the weight matrix to chunk #k. At that time, each master parameter server has gathered a piece of the training results from all training units and generated one part of the new weight matrix. In the last stage, starting from the master parameter servers, each parameter server sends the updated chunk to its next parameter server until all parameter servers have received all fresh chunks.
After weight synchronization, every training unit has the same updated weight matrix and can perform the next round of training from the same starting point. In the first and last stages, all chunks are transferred among parameter servers during each period, and every parameter server is either sending or receiving data. This method makes full use of the bandwidth. During weight synchronization, a total of 2 * (N − 1) * matrix data is transferred among parameter servers, which is 2 * matrix less than when using only one parameter server.
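The transfer volume quoted above can be checked against a standard ring all-reduce, sketched below. This is a swapped-in, well-known technique that splits the data into N chunks; it is not the N/2-chunk master parameter server scheme of Fig. 5, but it follows the same gather, update and distribute pattern along a chain of neighbors and moves the same total of 2 * (N − 1) * matrix elements. All names are made up for this example.

```python
def ring_allreduce(vectors):
    """Sum equal-length vectors held by N nodes with a ring all-reduce.

    Returns the per-node results and the total number of elements sent.
    """
    n = len(vectors)
    size = len(vectors[0])
    assert size % n == 0, "sketch assumes the length divides evenly into N chunks"
    chunk = size // n
    data = [list(v) for v in vectors]
    sent = 0

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Stage 1 (reduce-scatter): after n - 1 steps, node r holds the complete
    # sum of chunk (r + 1) % n.
    for step in range(n - 1):
        messages = [(r, (r - step) % n) for r in range(n)]
        payloads = [[data[r][k] for k in span(c)] for r, c in messages]
        for (r, c), payload in zip(messages, payloads):
            dst = (r + 1) % n
            for k, value in zip(span(c), payload):
                data[dst][k] += value
            sent += chunk
    # Stage 2 (all-gather): the finished chunks are passed around the ring.
    for step in range(n - 1):
        messages = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [[data[r][k] for k in span(c)] for r, c in messages]
        for (r, c), payload in zip(messages, payloads):
            dst = (r + 1) % n
            for k, value in zip(span(c), payload):
                data[dst][k] = value
            sent += chunk
    return data, sent


results, elements_sent = ring_allreduce(
    [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]])
```

Under the additional assumption that a single dedicated parameter server receives one matrix from each of the N units and sends one back to each, its traffic would be 2 * N * matrix, which is consistent with the stated saving of 2 * matrix.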
B. PARALLELIZATION OF IMAGE FEATURE EXTRACTION ALGORITHMS
In the early days, many computer vision tasks applied global or local features. The most widely used local image feature extraction algorithms include LBP, HOG, SIFT, and SURF, which calculate different kinds of features from an image and represent them as vectors. Recently, along with the rapid development of deep learning techniques, the convolutional neural network (CNN) has shown impressive performance on many computer vision tasks. However, traditional image feature extraction algorithms still play an important role in some computer vision tasks. For example, the CNN approach involves training with massive numbers of samples and often faces difficulties in model adjustment; on the other hand, HOG features are suitable for training with small samples. In addition, for image retrieval, some solutions need to combine low-level SIFT-based image features with semantic-aware representations from a CNN to improve object retrieval performance [19], [20].
1) OVERVIEW OF THE ALGORITHMS
SIFT and SURF: both are traditional local descriptors for extracting local features from images. SIFT was first proposed in 2004 for extracting distinctive features from images that are invariant to image scale and rotation. Due to its relatively low performance, an improved approach, SURF, was proposed in 2006 to speed up robust feature extraction by using integral images for image convolutions and the Fast-Hessian detector. The procedure of SURF and SIFT consists of four main steps: i. scale space construction; ii. keypoint detection and localization; iii. keypoint orientation computation; and iv. keypoint description computation [21], [22].
HOG: HOG is a classical image processing algorithm that is used to capture shape and appearance features in pedestrian detection [23]. This algorithm calculates the distribution of the horizontal and vertical directions of the gradients of an image, because the magnitude of the gradients is large around edges and corners.
LBP: the algorithm labels the pixels of an image with numbers, and encodes the relation of the local structure around each pixel. LBP is widely used in image retrieval, face detection, and facial analysis [24].
The above algorithms have already been parallelized on multi-core processors and accelerators such as GPUs [25]-[27]. Nevertheless, the parallelization of these algorithms on the Sunway processor needs different considerations to match the algorithms with the architecture of the Sunway processor. Our solution combines data partitioning with multithreading on CPE clusters to realize coarse-grained parallelism at the core-group level and fine-grained parallelism at the core level.
2) PARALLELIZATION DESIGN
a: SIFT AND SURF
As shown in Fig. 6, our parallelization scheme uses multiple core-groups to execute the SIFT and SURF algorithms simultaneously on different blocks of an image, i.e., the one-image-multiple-core-groups scheme. In each core-group, the MPE controls the logic of the individual steps, while the CPEs are responsible for the compute-intensive parts of the algorithms based on the master-slave acceleration model. The parallelization scheme is described in detail as follows.
The most time-consuming step of SIFT and SURF is the scale-space construction, which is very suitable for parallelization. Besides, the subsequent phases of the algorithms all depend on the scale-space construction. Therefore, our coarse-grained parallelism design is based on the scale-space. According to the principle of SIFT and SURF, the scale-space is divided into $O$ ($O \ge 1$) octaves, with each octave further divided into $S + 3$ scale images. For SIFT, the octaves are constructed independently of each other, but every scale image of an octave is generated from the previous scale image by convolution with a Gaussian filter. For SURF, each image of the scale space is generated by convolving the original image with box filters of different sizes. In conclusion, the parallel granularity of SIFT and SURF is one octave and one layer of an octave, respectively. Therefore, we perform the scale-space construction of each octave on one core-group for SIFT, and generate each scale image on one core-group for SURF.
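The two granularities can be made concrete with a small task-assignment sketch. The function name and the round-robin mapping of tasks to core-groups are assumptions made for illustration; the text only fixes what one task is (an octave for SIFT, a single scale image for SURF).

```python
def assign_scale_space_tasks(algorithm, octaves, scales_per_octave, num_core_groups):
    """Map scale-space construction tasks to core-groups in round-robin order."""
    if algorithm == "SIFT":
        # One task per octave: scale images inside an octave depend on each other.
        tasks = [(o,) for o in range(octaves)]
    elif algorithm == "SURF":
        # One task per scale image: each is filtered from the original image.
        tasks = [(o, s) for o in range(octaves) for s in range(scales_per_octave)]
    else:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    plan = {cg: [] for cg in range(num_core_groups)}
    for idx, task in enumerate(tasks):
        plan[idx % num_core_groups].append(task)   # assumption: round-robin
    return plan


sift_plan = assign_scale_space_tasks("SIFT", octaves=4, scales_per_octave=6, num_core_groups=4)
surf_plan = assign_scale_space_tasks("SURF", octaves=2, scales_per_octave=4, num_core_groups=4)
```

SURF yields many more, smaller tasks than SIFT for the same scale-space, which is why its work can be spread more evenly over the core-groups.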
In the fine-grained parallelism design, the method of data partitioning differs between filters. Due to the limited local store of each CPE, the maximum size of a data block must be less than 64 KB. On the other hand, the size of the box filter varies so widely that a reasonable partition is needed, while the Gaussian filter is small enough for the local store. Generally, the box filter begins with a default size of 9 × 9 and reaches a maximum size of 195 × 195. If the pixels of an image are stored in floating-point type, the memory needed to calculate the response of the largest box filter is 152 KB, which is beyond the capacity of the local store of each CPE. In this situation, frequent data transfers between main memory and the local store cannot be avoided. To improve the efficiency of data transfers, we divide the image by rows so that the data are stored contiguously in memory, which is beneficial to DMA. We partition all data in a grid of 64 × 64 pixels for SIFT. As for SURF, we found that if the box filter size is less than 27, a block size of 64 × 64 pixels reaches the best performance. This design makes substantially more efficient use of the local store and reduces the cost of data transfer. When the box filter size is larger than 27 but smaller than 51, it is better to partition the image data by full rows. In the remaining cases, the data are partitioned by 1 × 1 pixel.
The subsequent steps of SIFT and SURF include key point detection, computation of key point orientation, and description. The key point detection step involves finding local extrema. For each pixel of the central S images in each octave, the values of its 26 neighbors (8 in the same level, 9 in the next higher level and 9 in the next lower level) are inspected to detect local minima or maxima [28]. The computations of key point orientation and description are independent of each other. Hence, we execute these three steps in one core-group. In the key point detection stage, we partition the image data in a grid and transfer the blocks to the CPEs; we then use one CPE to calculate the orientation and description of each key point.
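The memory arithmetic and the size-dependent partitioning rule can be written down as a short sketch. The 4-byte pixel size is an assumption (it reproduces the 152 KB figure), the function and label names are hypothetical, and the handling of the boundary sizes 27 and 51 is also an assumption because the text does not say which partition they use.

```python
SPM_BYTES = 64 * 1024    # local store per CPE, from the text
BYTES_PER_PIXEL = 4      # assumption: 32-bit floating-point pixels


def filter_footprint_bytes(filter_size):
    """Bytes covered by one filter_size x filter_size window of pixels."""
    return filter_size * filter_size * BYTES_PER_PIXEL


def surf_partition(filter_size):
    """Choose the SURF data partition from the box filter size."""
    if filter_size < 27:
        return "block_64x64"
    if filter_size < 51:       # assumption: size 27 itself uses full rows
        return "full_row"
    return "pixel_1x1"         # assumption: size 51 falls into the last case


largest = filter_footprint_bytes(195)    # 195 * 195 * 4 bytes
block = filter_footprint_bytes(64)       # one 64 x 64 block
```

Here `largest` is 152100 bytes, well above the 65536-byte local store, while a 64 × 64 block needs 16384 bytes and leaves room for the filter border and the output.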
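The 26-neighbor test used in key point detection can be sketched as below. The function name is made up, the three arguments are adjacent scale images given as lists of rows, and requiring a strict extremum is an assumption.

```python
def is_extremum(below, current, above, row, col):
    """Return True if current[row][col] is a strict extremum of its 26 neighbors.

    The pixel must not lie on the image border.
    """
    center = current[row][col]
    neighbors = []
    for level, layer in enumerate((below, current, above)):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if level == 1 and dr == 0 and dc == 0:
                    continue   # skip the center pixel itself
                neighbors.append(layer[row + dr][col + dc])
    # 9 values below, 8 in the same level and 9 above: 26 in total.
    return center > max(neighbors) or center < min(neighbors)


flat = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
peak = [[0, 0, 0], [0, 5, 0], [0, 0, 0]]
found = is_extremum(flat, peak, flat, 1, 1)
```

Because each test reads only a 3 × 3 × 3 neighborhood, a grid block plus a one-pixel border from three adjacent scale images is all a CPE needs in its local store.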
b: HOG
Unlike the complicated SIFT and SURF algorithms, the HOG feature extraction algorithm only computes the gradient orientation and gradient magnitude of the pixel values and generates histograms of the image gradients in different directions. It has a much lower ratio of computation to memory access than the SIFT and SURF algorithms. If we executed the HOG algorithm across multiple core-groups, the MPI message passing among core-groups would hurt performance. Therefore, we parallelize the HOG algorithm with the one-image-one-core-group scheme, that is, one core-group processes one image at a time, and we focus on the fine-grained parallelism inside the core-group.
HOG categorizes the gradient orientation of a cell, which has 8 × 8 pixels, into nine histogram bins starting at 0°. Then grids of 2 × 2 cells are grouped into blocks. A histogram block is therefore a 36-dimensional vector, and the HOG feature vector is formed by collecting all blocks over a detection window of size 64 × 128. HOG uses tri-linear interpolation to vote for orientations between blocks; that is, the gradient direction, the abscissa, and the ordinate together determine how much of a pixel's gradient magnitude is added to each histogram bin. Therefore, when we partition the image data, the fine-grained granularity is based on blocks of 2 × 2 cells. This ensures that the data needed to compute the final values of the HOG feature vector can be transferred to the local store in one transfer, and it reduces the time spent on data transfers between main memory and the local store.
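The cell and block sizes above fix the dimensionality of the descriptor, which the following C sketch makes explicit. Several points are assumptions made only for illustration: the function names are hypothetical, orientations are taken as unsigned angles in [0°, 180°) split into nine equal 20° bins, votes are assigned to a single bin (the tri-linear interpolation described above is omitted for brevity), and blocks are assumed to slide by one cell.

```c
#include <assert.h>

#define HOG_CELL_SIZE   8 /* a cell has 8 x 8 pixels   */
#define HOG_NUM_BINS    9 /* nine orientation bins     */
#define HOG_BLOCK_CELLS 2 /* a block has 2 x 2 cells   */

/* Accumulate the histogram of one cell from per-pixel orientation
 * (degrees, assumed in [0, 180)) and gradient magnitude.
 * Hard assignment only; no interpolation between bins. */
void hog_cell_histogram(const float *angle_deg, const float *magnitude,
                        float hist[HOG_NUM_BINS])
{
    const float bin_width = 180.0f / HOG_NUM_BINS;
    for (int b = 0; b < HOG_NUM_BINS; ++b)
        hist[b] = 0.0f;
    for (int i = 0; i < HOG_CELL_SIZE * HOG_CELL_SIZE; ++i) {
        assert(angle_deg[i] >= 0.0f && angle_deg[i] < 180.0f);
        int bin = (int)(angle_deg[i] / bin_width);
        if (bin >= HOG_NUM_BINS)
            bin = HOG_NUM_BINS - 1; /* guard against rounding */
        hist[bin] += magnitude[i];
    }
}

/* Length of one block vector: 2 x 2 cells x 9 bins = 36. */
int hog_block_length(void)
{
    return HOG_BLOCK_CELLS * HOG_BLOCK_CELLS * HOG_NUM_BINS;
}

/* Descriptor length for a detection window, assuming blocks
 * overlap and move by one cell in each direction. */
int hog_descriptor_length(int win_w, int win_h)
{
    assert(win_w % HOG_CELL_SIZE == 0 && win_h % HOG_CELL_SIZE == 0);
    const int cells_x = win_w / HOG_CELL_SIZE;
    const int cells_y = win_h / HOG_CELL_SIZE;
    const int blocks_x = cells_x - HOG_BLOCK_CELLS + 1;
    const int blocks_y = cells_y - HOG_BLOCK_CELLS + 1;
    return blocks_x * blocks_y * hog_block_length();
}
```

Under these assumptions, a 64 × 128 window contains 7 × 15 blocks, so `hog_descriptor_length(64, 128)` gives 3780 values. Partitioning by blocks of 2 × 2 cells means that one transfer brings in everything a 36-dimensional block vector depends on.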
c: LBP
Considering the relatively low computational complexity of LBP, we first parallelize the algorithm using the multiple-images-one-core-group scheme. As a typical texture feature extraction algorithm, LBP consists of many 3 × 3 template calculations, in which each pixel is compared with its eight neighbors in a 3 × 3 neighborhood by subtraction. In each core-group, we design a fine-grained parallelization scheme that divides the image data into blocks by image rows. This scheme not only improves the efficiency of data transfers between main memory and the local store but also reuses row data in a pipelined style. As shown in Fig. 7, three rows of image data are first transferred to the SPM (local store), and the LBP values of the middle row are calculated. In each subsequent step, we only need to replace one row to calculate the LBP values of the next row.
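The row-reuse pipeline can be sketched in plain C as follows. This is an illustrative sketch under stated assumptions rather than the SunwayImg implementation: `memcpy` stands in for the DMA transfer into the local store, the function names and the `LBP_MAX_WIDTH` limit are hypothetical, the comparison is written as `neighbor >= center` (equivalent to testing the sign of the subtraction), and the bit order (clockwise from the top-left neighbor) is chosen arbitrarily.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LBP_MAX_WIDTH 1024 /* hypothetical limit of the row buffers */

/* LBP code of pixel x given the rows above, at, and below it.
 * Bits are set clockwise starting from the top-left neighbour. */
uint8_t lbp_code(const uint8_t *up, const uint8_t *mid,
                 const uint8_t *down, int x)
{
    const uint8_t c = mid[x];
    unsigned code = 0;
    code |= (unsigned)(up[x - 1]   >= c) << 7;
    code |= (unsigned)(up[x]       >= c) << 6;
    code |= (unsigned)(up[x + 1]   >= c) << 5;
    code |= (unsigned)(mid[x + 1]  >= c) << 4;
    code |= (unsigned)(down[x + 1] >= c) << 3;
    code |= (unsigned)(down[x]     >= c) << 2;
    code |= (unsigned)(down[x - 1] >= c) << 1;
    code |= (unsigned)(mid[x - 1]  >= c) << 0;
    return (uint8_t)code;
}

/* Compute LBP for all interior pixels while keeping only three rows
 * in a small buffer. Border pixels of the output are set to 0. */
void lbp_image_rolling(const uint8_t *img, int w, int h, uint8_t *out)
{
    assert(w >= 3 && w <= LBP_MAX_WIDTH && h >= 3);
    uint8_t rows[3][LBP_MAX_WIDTH]; /* stand-in for the local store */

    memset(out, 0, (size_t)w * (size_t)h);
    memcpy(rows[0], img, (size_t)w);     /* row 0 */
    memcpy(rows[1], img + w, (size_t)w); /* row 1 */

    for (int y = 1; y < h - 1; ++y) {
        /* Fetch only the new bottom row; the other two are reused. */
        memcpy(rows[(y + 1) % 3], img + (size_t)(y + 1) * w, (size_t)w);
        const uint8_t *up   = rows[(y - 1) % 3];
        const uint8_t *mid  = rows[y % 3];
        const uint8_t *down = rows[(y + 1) % 3];
        for (int x = 1; x < w - 1; ++x)
            out[(size_t)y * w + x] = lbp_code(up, mid, down, x);
    }
}
```

After the initial fill, each output row costs a single row transfer, and the three buffers rotate by index instead of being copied.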
C. FURTHER DISCUSSIONS OF PARALLELIZATION STRATEGY
As mentioned in Section 3.2, our evaluation shows that the appropriate parallelization scheme for some image algorithms depends not only on their computational complexity and behavior but also on the number of images to process. This section gives evaluation results and discusses the appropriate parallelization scheme for each algorithm.
1) LBP AND HOG
Both the LBP and HOG algorithms have relatively low computational complexity, and their computing time is short. We evaluate the performance of these two algorithms under the multiple-images-one-core-group and one-image-one-core-group schemes; the results are shown in Fig. 8 (a) and (b). The figure shows that the one-image-one-core-group scheme performs better when the number of images is less than 256, and the multiple-images-one-core-group scheme performs better when the number is larger than 256. Based on these results, the interface functions of LBP and HOG select the multiple-images-one-core-group or the one-image-one-core-group scheme by checking whether the number of images is an integral multiple of 64.
2) SIFT AND SURF
Considering the higher computational complexity of SIFT and SURF, we first implement these two algorithms using the one-image-multiple-core-groups scheme. However, our evaluation shows that the one-image-one-core-group scheme performs better as the number of images increases, as shown in Fig. 8 (c) and (d). The reason is that the number of keypoints is not predictable and differs between layers of the scale space, so some core-groups may be idle during keypoint detection and during the computation of orientation and description. We therefore choose the one-image-multiple-core-groups scheme for a single image and the one-image-one-core-group scheme for massive images.
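The selection rules for the four feature extraction algorithms can be condensed into one dispatch function. The sketch below uses hypothetical type and function names. It also assumes that, for LBP and HOG, an image count that is an integral multiple of 64 selects the multiple-images-one-core-group scheme; the text only states that this multiple is the criterion, not which branch it selects.

```c
#include <assert.h>

/* Hypothetical identifiers for illustration only. */
typedef enum { ALG_LBP, ALG_HOG, ALG_SIFT, ALG_SURF } algorithm_t;

typedef enum {
    SCHEME_MULTI_IMG_ONE_CG, /* multiple-images-one-core-group */
    SCHEME_ONE_IMG_ONE_CG,   /* one-image-one-core-group       */
    SCHEME_ONE_IMG_MULTI_CG  /* one-image-multiple-core-groups */
} scheme_t;

#define IMAGE_BATCH_MULTIPLE 64u /* criterion used for LBP and HOG */

scheme_t choose_scheme(algorithm_t alg, unsigned n_images)
{
    assert(n_images > 0);

    if (alg == ALG_SIFT || alg == ALG_SURF) {
        /* A single image is spread over several core-groups;
         * many images are processed one per core-group. */
        return n_images == 1 ? SCHEME_ONE_IMG_MULTI_CG
                             : SCHEME_ONE_IMG_ONE_CG;
    }

    /* LBP and HOG. Assumption: a multiple of 64 selects the
     * multiple-images scheme, anything else the one-image scheme. */
    return (n_images % IMAGE_BATCH_MULTIPLE == 0)
               ? SCHEME_MULTI_IMG_ONE_CG
               : SCHEME_ONE_IMG_ONE_CG;
}
```

Keeping the choice inside the interface function means that callers pass only the algorithm and the image count and do not need to know the measured crossover points.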
3) DBNs
Deep learning involves training neural networks with a large amount of training data, which is usually done by combining task parallelism and data parallelism. We convert most of the DBN calculations into matrix or vector operations and parallelize them on the computing cores of a core-group with multithreading. Meanwhile, we train the DBN models on multiple core-groups of the Sunway processor based on the data parallelism model. Since we use multiple core-groups to train the DBN model on large-scale datasets, its parallelization scheme falls into the multiple-images-multiple-core-groups category.
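The data parallelism model can be illustrated with a short sequential C sketch. Both functions are hypothetical and are not the SunwayImg API. The sketch assumes that the training samples are split into contiguous shards, one per core-group, and that synchronization averages the per-group gradients before a gradient-ascent update; the actual update rule and the communication pattern are not specified here.

```c
#include <assert.h>
#include <stddef.h>

/* Contiguous range [begin, end) of training samples assigned to one
 * core-group; remainders go to the first groups. */
void shard_range(size_t n_samples, int n_groups, int group,
                 size_t *begin, size_t *end)
{
    assert(n_groups > 0 && group >= 0 && group < n_groups);
    const size_t k = (size_t)n_groups;
    const size_t g = (size_t)group;
    const size_t base = n_samples / k;
    const size_t rem = n_samples % k;
    *begin = g * base + (g < rem ? g : rem);
    *end = *begin + base + (g < rem ? 1 : 0);
}

/* Weight synchronization: average the gradients computed by all
 * core-groups and apply one update. grads holds n_groups consecutive
 * arrays of n_params values. Assumption: plain averaging with
 * learning rate lr and an ascent step (weights += lr * gradient). */
void synchronize_weights(float *weights, const float *grads,
                         int n_groups, size_t n_params, float lr)
{
    assert(n_groups > 0);
    for (size_t i = 0; i < n_params; ++i) {
        float sum = 0.0f;
        for (int g = 0; g < n_groups; ++g)
            sum += grads[(size_t)g * n_params + i];
        weights[i] += lr * sum / (float)n_groups;
    }
}
```

In a real run, each core-group would compute its gradient on its own shard with the CPE-parallel matrix operations, and the averaging loop would be replaced by message passing between core-groups.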
V. IMPLEMENTATION
We implement SunwayImg in the C language. Parallel programming inside a core-group uses the Athread programming interface. As mentioned in previous sections, we regard each core-group as a computing node to simplify the parallel hierarchy for applications that span multiple machine nodes, so parallel programming among multiple core-groups uses MPI. Table 1 lists the major APIs of SunwayImg. Applications can invoke these interface functions to execute the various kinds of image algorithms in parallel. For the image feature extraction algorithms, we implement separate interfaces for processing a single image and for processing massive images, according to the number of images. As described in Section 4, training an RBM consists mainly of four parts. Accordingly, our implementation of both the RBM and DBN algorithms relies mainly on four functions: activeHiddenUnits(), activeVisibleUnits(), computeGradients(), and updateWeights().
VI. EVALUATION
A. EXPERIMENT SETUP
We evaluate our SunwayImg library on the Sunway TaihuLight supercomputer and compare the performance of the DBN, LBP, HOG, SIFT, and SURF algorithms with their original sequential versions on a CPU. The CPU used in our experiment is a six-core Intel Xeon E5-2420 v2 @ 2.2 GHz with 16 GB of DRAM. Table 2 lists the main configuration parameters of the Sunway SW26010 processor, which is used to build the TaihuLight supercomputer. We also investigate the scalability of our library on multiple nodes.
Our experiments use two datasets. One is the Corel Image Features dataset [29] from the UCI machine learning repository, which is used to evaluate the performance of LBP, HOG, SIFT, and SURF. The other is the MNIST dataset of hand-written digits [30], which consists of 60,000 samples of 28 × 28 hand-written digit images and is used to evaluate the performance of DBN.
B. EXPERIMENTAL RESULTS

1) PERFORMANCE ON SINGLE NODE
We first test the performance of the LBP, HOG, SIFT, and SURF algorithms on different numbers of images with different resolutions, and calculate their speedup over the sequential performance on the CPU; the result is shown in Fig. 9. All sequential procedures come from the OpenCV library. As shown in the figure, our parallel implementation achieves average speedups of 63.5×, 9.45×, 27.53×, and 26.93×, respectively. Since the data partitioning schemes differ in their ratio of computing time to transmission time, the speedup also differs across image resolutions. LBP has a higher speedup than the others due to its good spatial locality. Although HOG has functional parallelism like LBP, it needs to transfer data frequently between main memory and the local store, which affects its overall performance. Figure 9 also shows that the higher the image resolution, the larger the speedup, and that the speedup increases with the number of images and eventually stabilizes. The speedup of the LBP and HOG algorithms is not significantly affected by image resolution, while SIFT and SURF perform better on high-resolution images. This is because the most time-consuming phase of SIFT and SURF is the construction of the Gaussian pyramid, which consists of many layers of images. Thus, sequential SIFT or SURF takes more time for high-resolution images, whereas our parallel implementation of SIFT and SURF uses task parallelism to process the layers of the Gaussian pyramid separately.
When evaluating our implementation of DBN, we mainly focus on the performance of DBN training and of the pre-training process. First, we compare the training performance of the CPU with that of a single core-group of the Sunway processor. The CPU version of DBN is modified from Caffe and is benchmarked on dual Intel Xeon processors. The number of parameters (size) of an RBM and the training batch size are two key factors that directly affect the size of the dataset as well as the efficiency. We therefore first test the performance with RBM sizes from 1 million to 64 million parameters and record the average training time, performing weight synchronization after every 8,000 training samples. As shown by the red curve in Fig. 10(a), a single Sunway core-group outperforms the dual Intel Xeon CPUs by at least 5.6 times. It achieves higher speedups on small RBMs because the CPU is less efficient than the Sunway processor when training small RBMs. However, the CPU becomes more efficient with larger RBMs, and the speedup of the Sunway processor stabilizes at around 5×. We then perform training with various batch sizes and network sizes. The speedup is calculated against the sequential performance on the CPU, as illustrated in Fig. 10(b). The batch size affects the performance of RBM training regardless of the RBM size: as the batch size grows, the training performance also increases.
2) PERFORMANCE ON MULTIPLE NODES
In addition to the performance evaluation on a single Sunway processor, we also test the performance of SunwayImg on multiple processors. Figure 11 shows the scalability of the four image feature extraction algorithms on multiple Sunway nodes, where the speedup is calculated by comparing the performance on various numbers of nodes with that on a single core-group, and the same 10,000 images are processed in each execution. The figure shows that our implementation scales well on Sunway processors. Choosing an appropriate parallelization scheme for processing massive images reduces the communication among core-groups, which helps to achieve almost linear speedup.
For DBN training on multiple core-groups, synchronization takes more time, whereas computation consumes less. We therefore test the efficiency of the data parallelism model by performing weight synchronization with different RBM sizes and different numbers of core-groups. In the weight synchronization stage, the throughput of data transfer is directly proportional to the size of the DBN. Furthermore, as the number of core-groups grows, a single parameter server has to handle a larger workload. Fig. 12(a) shows that, with a single parameter server, the training time increases linearly with the number of training nodes and the RBM size. Although the amount of transmitted data grows with the number of core-groups, the weight synchronization time does not grow linearly with the number of core-groups when multiple parameter servers are used, because they provide more bandwidth, as illustrated in Fig. 12(b). Multiple parameter servers achieve a 4.5× speedup on eight Sunway processors compared with a single parameter server.
Based on the above experiments, we further evaluate the training performance with different numbers of core-groups on 60,000 samples. Figure 13 shows the speedup over a single core-group; the speedup peaks at 12.0× on eight Sunway processors. With more core-groups, however, weight synchronization takes more time than computation, so the speedup flattens.
C. STATISTICAL ANALYSIS ON EXPERIMENTAL RESULTS
In the above subsection, we evaluated the performance of the parallel algorithms on the Sunway processor. The experiments show that, with task-parallelism and data-parallelism optimization, our implementation on the Sunway processor is significantly faster than the CPU version. Table 3 presents the statistical analysis for the image feature extraction algorithms. It can be found that: (a) From the task-parallelism perspective, the optimization efficiency of an algorithm is related to its computational complexity and to the image resolution. For all algorithms, images with higher resolution obtain higher speedup due to their favorable data locality. The LBP algorithm has better parallelism because it includes many simple overlapping calculations, such as template calculations. (b) For data parallelism, schemes 1, 2, and 3 represent the three tiers of the proposed parallelization strategy for processing massive images, namely multiple-images-one-core-group, one-image-one-core-group, and one-image-multiple-core-groups, respectively. The efficiency of data parallelism depends on the number of images and on the ratio of computing time to transmission time. The multiple-images-one-core-group scheme is better suited to processing a small number of images with algorithms that have a low computation-to-transmission ratio. The one-image-multiple-core-groups scheme is beneficial for parallelizing a complicated algorithm on multiple core-groups. On the other hand, the one-image-one-core-group scheme is suitable for processing massive images with various algorithms. (c) In the scalability experiment for the four image feature extraction algorithms on multiple Sunway nodes, we obtain almost linear speedup over a single core-group. This benefits from choosing an appropriate parallelization scheme for processing massive images, which reduces the communication among core-groups.
In addition, we choose three types of DBN models to evaluate the training performance on the Sunway processor. These models include the small DBN model with four-tier RBMs, 784-500-500-500-1000-10. In Table 4, we compare the performance of training the DBN models on the CPU with that on the Sunway processor under varying batch sizes. On a single core-group, the Sunway processor achieves its highest speedup when training the small DBN with a batch size of 128. Overall, the Sunway processor obtains better performance when training the small DBN model, and the performance on a single core-group is almost at its best with a batch size of 128. By combining data parallelism with weight synchronization, training on multiple computing nodes shows a consistent growth in speedup over the CPU.
VII. DISCUSSION AND CONCLUSION
In this paper, we propose SunwayImg, a parallel image processing library for the Sunway many-core processor as well as the Sunway TaihuLight supercomputer. The library integrates three kinds of image processing algorithms: fundamental algorithms that support basic image operations on the Sunway processor, widely used image feature extraction algorithms, and a typical neural network model, DBN. In addition, this paper proposes a three-tier parallelization strategy as well as fine-grained parallelization inside core-groups to parallelize various kinds of image algorithms efficiently on the Sunway processor. Finally, we evaluate SunwayImg on the Sunway TaihuLight supercomputer to verify its effectiveness and performance. For single-image processing, the LBP, HOG, SIFT, and SURF algorithms achieve speedups of 63.5×, 9.45×, 27.53×, and 26.93×, respectively. When processing massive images, the performance of the algorithms increases with the number of images and shows satisfactory scalability. As for the neural network model DBN, training a DBN with nearly one billion parameters on four nodes of the TaihuLight supercomputer obtains a speedup of more than 33 times over the CPU.
For further research, we plan to introduce more image processing algorithms into our library on the Sunway processor. Although we have implemented some useful algorithms, it is necessary to add more classic and novel algorithms for various image applications. The Support Vector Machine (SVM) is often described as an almost non-scalable algorithm in terms of both computational time and memory use [31]. SVM consists of two different tasks: the computation of the kernel matrix (KM) and a standard SVM training procedure. The parallelization and optimization of SVM models mainly follow two directions: algorithmic approaches and parallel data processing methods. In this paper, we also combine task parallelism and data parallelism to optimize algorithms on the Sunway processor. Our proposed slave acceleration model is developed to exploit the CPE cluster in the heterogeneous architecture of the Sunway processor. It can be transferred to optimize the kernel matrix computation of SVM on a core-group of the Sunway processor. Besides, the parallel data processing method that divides the training data into subsets is similar to our DBN training process on multiple core-groups. However, there also exists a new challenge that we should take into account in the future. SVM suffers not only from long computational time but also from heavy memory use with large, high-dimensional datasets, which requires a processing-intensive and memory-bandwidth-intensive procedure.
Another possible research direction is the scalability of more deep learning models on the Sunway processor. We will consider models that represent state-of-the-art approaches in the current computer vision domain, such as the Convolutional Neural Network (CNN) and the Spiking Neural Network (SNN) for image classification tasks, and region-based convolutional neural networks for object recognition tasks [32], [33]. The success of deep learning models depends on large-scale training datasets and on the reliable computing power of cloud computing, high-performance computing clusters, and supercomputing [34]. Therefore, it is critical to accelerate the training process of deep learning models on the Sunway processor with multithreading. In this paper, we adopt a linear topology pattern to perform weight synchronization across multiple nodes of the TaihuLight supercomputer. In future research, we will not only transfer our training process to other deep learning models but also investigate how to design an effective parallelization strategy on the Sunway processor according to the structure of each deep learning model.
