I. CUDA
NVIDIA introduced CUDA (Compute Unified Device Architecture) [2] in 2007, which opened a new era of GPU computing and established the GPU as an important new computing resource [3]. In general, programs for the GPU are written in C with minimal extensions, and the single-program, multiple-data style makes them easy to write. Figure 1 shows the CUDA programming model. A CUDA program integrates serial host C code with highly parallel device kernel C code. A CUDA device acts as a coprocessor to the CPU (the host) and runs a large number of threads in parallel. The execution model involves kernels, threads, blocks and grids. A CUDA kernel is a data-parallel portion of an application and is executed by an array of threads. All threads run the same code on different data. The monolithic thread array is divided into multiple blocks, and a group of blocks composes a grid. Threads within a block cooperate via shared memory, atomic operations and barrier synchronization, but threads in different blocks cannot cooperate. Every thread has an ID (1D, 2D or 3D) that is used to compute memory addresses and make control decisions. Similarly, each block has an ID (1D or 2D). A warp of 32 threads runs physically on an SM (Streaming Multiprocessor) and shares instructions. A warp is executed once its operands are ready. All warps are dynamically scheduled by the SM, which is an array of SPs (Streaming Processors). The GTX980 (GTX280) adopts the GM204 core of the Maxwell architecture (GT200) and has 2048 (240) SPs. The theoretical computation speed is 4.6 TFLOPS (933 GFLOPS) and the memory bandwidth is 224 GB/s.
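The grid, block and thread hierarchy described above can be illustrated with a small CPU-side emulation. This is only a sketch under stated assumptions: it is written in Python rather than CUDA C, it uses one-dimensional indices only, and the names `launch`, `global_thread_id`, `warps_per_block` and `square` are hypothetical helpers introduced for illustration.

```python
# CPU-side emulation of CUDA's 1D thread indexing (illustrative only).
WARP_SIZE = 32  # threads per warp, as stated in the text


def global_thread_id(block_idx, block_dim, thread_idx):
    """Mirror the CUDA expression blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


def warps_per_block(block_dim):
    """Number of warps needed to hold block_dim threads (rounded up)."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE


def launch(grid_dim, block_dim, kernel, data):
    """Run `kernel` once per (block, thread) pair, as a grid launch would."""
    out = list(data)
    for block_idx in range(grid_dim):          # blocks of the grid
        for thread_idx in range(block_dim):    # threads of one block
            i = global_thread_id(block_idx, block_dim, thread_idx)
            if i < len(out):                   # guard threads past the data
                out[i] = kernel(data[i])
    return out


def square(x):
    """Hypothetical kernel body: same code, different data per thread."""
    return x * x


result = launch(grid_dim=3, block_dim=4, kernel=square, data=list(range(10)))
```

On a real device the two loops do not exist: every (block, thread) pair runs concurrently, and the bounds check plays the same role as in the sketch.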
II. CONVOLUTIONAL NEURAL NETWORK-FIVE LAYERS
A CNN is characterized by three basic concepts: feature maps, weight sharing and subsampling [5] [6]. The five-layer CNN is applied to handwritten digit recognition, as shown in Figure 1. The first three layers are composed of feature maps. Each of these layers is smaller in scale than the previous one but contains more feature maps. The feature maps in the same layer share an acceptance domain (receptive field) and a bias.
C1 is the input layer, which accepts 29*29 handwritten digit images in the form of a 0-1 matrix. C1 therefore has 29*29 = 841 neurons.
C2 is a convolutional layer. Convolving a 5*5 acceptance domain with C1 produces six feature maps. Because C1 is subsampled, each feature map in C2 is composed of 13*13 neurons. C2 therefore has 13*13*6=1014 nodes and (5*5+1)*6=156 weights linked to C1. In total, there are 1014*(5*5+1) = 26364 connections between C1 and C2.
C3 is also a convolutional layer. It has 50 feature maps, which are smaller than those of C2 but more numerous. Convolving a 5*5 acceptance domain with C2 produces a 5*5 image in each feature map. Each pixel in a C3 feature map is generated by convolving a 5*5 acceptance domain with the combination of the corresponding regions of the six feature maps of C2. This layer has 5*5*50=1250 neurons and (5*5+1)*6*50=7800 weights linked to C2. In total, there are 1250*(5*5+1) = 32500 connections between C2 and C3.
N1 is a fully connected layer of 100 neurons. It has no feature maps, but a shared bias is used for classification. The 1250 neurons of C3 are fully connected to N1. Hence there are 100*(1250+1) = 125100 connections between C3 and N1, and N1 has (1250+1)*100=125100 weights.
N2 is the output layer, which outputs the recognized handwritten digit, 0 to 9. It is fully connected to N1 and has 10 neurons, which are used for classification and produce the result. There are (100+1)*10=1010 weights and 10*(100+1) = 1010 connections.
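The neuron, weight and connection counts quoted for the five layers can be checked with a few lines of arithmetic. The sketch below simply evaluates the formulas given in the text; the dictionary `layers` and the variable `total_weights` are names introduced here for illustration, and the overall weight total is derived from those formulas rather than stated in the text.

```python
# Weight and connection counts for the five-layer network, using the
# formulas given in the text (the +1 accounts for the bias).
K = 5 * 5  # size of the 5*5 acceptance domain

layers = {
    # name: (neurons, weights, connections)
    "C1": (29 * 29, 0, 0),
    "C2": (13 * 13 * 6, (K + 1) * 6, 13 * 13 * 6 * (K + 1)),
    "C3": (5 * 5 * 50, (K + 1) * 6 * 50, 5 * 5 * 50 * (K + 1)),
    "N1": (100, (1250 + 1) * 100, 100 * (1250 + 1)),
    "N2": (10, (100 + 1) * 10, 10 * (100 + 1)),
}

# Sum of the per-layer weight counts (derived here, not quoted in the text).
total_weights = sum(w for _, w, _ in layers.values())

for name, (neurons, weights, connections) in layers.items():
    print(f"{name}: neurons={neurons} weights={weights} connections={connections}")
```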
Moreover, every layer except N2 uses the hyperbolic tangent function [2] as its activation function:
III. PARALLEL RECOGNITION ALGORITHM FOR CONVOLUTIONAL NEURAL NETWORKS

Parallel data structure based on CUDA
The data structures used by the algorithm are described below. They are mapped onto CUDA and addressed by blockIdx (x, y) and threadIdx (x, y, z). In this algorithm, both blockIdx and threadIdx are used as one-dimensional indices.
(1) Convolution feature map
The five-layer convolutional neural network can be described by a data structure (acceptance domain, feature map), which is composed of two 2D arrays.
Here $F_{ij}$ denotes the j-th feature map of the i-th layer; $P_{ij}[N][N]$ indicates that the j-th acceptance domain of the i-th layer is an N×N 2D array $P_{ij}$; and $F_{ij}[M][M]$ indicates that the j-th feature map of the i-th layer is an M×M 2D array $F_{ij}$.
When mapped to CUDA, a single layer constitutes a set of one-dimensional arrays, given by the following formula:
Here $\_P_i\{\text{Pixel}\}_{N*N}$ and $\_F_i\{\text{NeuronCNN}\}_{M*M}$ represent the feature-map set of the i-th layer. Every set shares one acceptance domain $\_P_i$, which receives the sampled feature maps from the layer above. $\_F_{ij}$ denotes the j-th map of the i-th layer; it is a 1D array made up of M*M convolution neurons. These maps are assembled into $\_F_i$, a 1D array that holds all feature maps of the i-th layer. The range of j depends on the number of feature maps in the layer. When i=1, the layer has no acceptance set $\_P_i$. The behavior of a typical neuron can be modeled as:
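As a rough illustration of the one-dimensional layout and of a typical neuron, consider the following Python sketch. It rests on assumptions not confirmed by the text: the feature maps of a layer are taken to be stored back to back in row-major order, and the neuron is taken to be a weighted sum plus a bias passed through an unscaled hyperbolic tangent. The function names `flat_index` and `neuron_output` are hypothetical.

```python
import math


def flat_index(j, row, col, m):
    """Position of neuron (row, col) of feature map j in the 1D array _F_i,
    assuming maps are stored back to back, each in row-major order."""
    return j * m * m + row * m + col


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed by tanh
    (assumed form; the paper's own formula may include scaling)."""
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return math.tanh(s)


# Example: layer C2 has 6 feature maps of 13*13 neurons each, so the last
# neuron of the last map sits at the last slot of the 1014-element array.
M = 13
last = flat_index(j=5, row=12, col=12, m=M)

# Hypothetical inputs and weights, chosen so that the weighted sum is zero.
y = neuron_output(inputs=[1.0, 0.0, 1.0], weights=[0.5, -0.3, 0.25], bias=-0.75)
```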
Thread setting
The GTX280 has compute capability 1.3. Following that specification, the kernel thread settings between the layers, from top to bottom, are shown in Table 1.
Structural analysis of identification results
After the above process, the individual is finally classified into a similarity set $U = \{u_i\},\ i = 0, 1, \ldots, 9$, where $u_i$ represents the similarity between the identified individual and digit i. The final result is obtained as MAX{U}.
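Selecting the result with MAX{U} amounts to finding the index of the largest similarity. A minimal sketch follows; the function name `classify` is hypothetical and the similarity values are made up for illustration.

```python
def classify(similarities):
    """Return the digit whose similarity is largest, i.e. MAX{U}."""
    if len(similarities) != 10:
        raise ValueError("expected one similarity per digit 0-9")
    return max(range(10), key=lambda i: similarities[i])


# Hypothetical output of layer N2 for one image (values are made up).
U = [-0.9, -0.8, 0.1, 0.95, -0.7, -0.2, -0.85, 0.3, -0.6, -0.4]
digit = classify(U)
```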
IV. KEY ALGORITHM DESCRIPTION
Assume that dimGrid and dimBlock are the grid dimension and the block dimension, respectively.
Convolution Computation Algorithm
The convolution computation algorithm is as follows:
Initialize the result r of the convolution in the shared memory of the device
(3) __shared__ double r; r = 0;
(4) Index the data position: blockID = blockIdx.x
(5) Take an offset e in $\_W_i$
(6) Loop over steps 1) to 4) according to the size of the acceptance domain $\_P_i$
1) Sample from the matrix $\_FM_i$ according to $\_P_i$
2) Perform the convolution according to formula (4)
3) Call the data collection algorithm of Section 4.2
4) Apply the activation function of formula (1) to adjust the output amplitude of the neurons
End For
(7) Synchronize the threads: __syncthreads()
(8) End
The tests were run on an NVIDIA GeForce GTX280 with 1 GB of global memory, installed in a PC with an Intel Core2 E8400 3.0 GHz CPU.
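The loop at the heart of the convolution steps listed above can be sketched serially in Python. This is an illustration under assumptions, not the kernel itself. A stride of 2 is assumed because it reproduces the 29*29 to 13*13 and 13*13 to 5*5 reductions described earlier. The activation is an unscaled tanh, the kernel is applied without flipping, and the kernel weights are made up. The function name `conv_layer` is hypothetical.

```python
import math


def conv_layer(image, kernel, bias, stride=2):
    """Valid convolution of a square image with a square kernel,
    followed by tanh; the stride performs the subsampling."""
    n, k = len(image), len(kernel)
    m = (n - k) // stride + 1          # output side length
    out = [[0.0] * m for _ in range(m)]
    for r in range(m):
        for c in range(m):
            acc = bias                 # result of the convolution
            for i in range(k):         # loop over the acceptance domain
                for j in range(k):
                    acc += kernel[i][j] * image[r * stride + i][c * stride + j]
            out[r][c] = math.tanh(acc)
    return out


# 29*29 binary input and a hypothetical 5*5 kernel of equal weights.
image = [[1.0 if (r + c) % 2 == 0 else 0.0 for c in range(29)] for r in range(29)]
kernel = [[0.04] * 5 for _ in range(5)]
fmap = conv_layer(image, kernel, bias=0.0)
```

In a CUDA version the two outer loops would be replaced by the block and thread indices, so that each output neuron is computed by its own thread.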
Data Collection Algorithm
To make the results comparable, we use the MNIST [6] handwritten digit database and a self-built database on both the CPU and the GPU described above. The results show that the accuracy on the self-built database and on MNIST is about 93% and about 95%, respectively, as shown in Table 2. As for accuracy, CUDA and high-level languages on x86 CPUs handle floating-point arithmetic differently. On the CPU, most current high-level languages (including C) follow the IEEE-754 floating-point standard to regulate the storage format. In CUDA, computing devices follow the IEEE-754 standard for single-precision binary floating-point arithmetic, with the following exceptions (only some are cited here; see Ref. [4] for details):
(1) Addition and multiplication are usually combined into a single multiply-add instruction (FMAD);
(2) Division is implemented via the reciprocal in a non-standard way;
(3) Square root is implemented via the reciprocal in a non-standard way;
(4) Directed rounding towards plus / minus infinity is not supported;
(5) There is no dynamically configurable rounding mode;
(6) There is no mechanism for detecting floating-point exceptions; floating-point exceptions are always masked;
(7) The result of an operation involving one or more NaNs is a NaN with bit pattern 0x7FFFFFFF.
There is a slight difference between the GPU and the CPU in the final output similarities for the digits 0-9. Figure 2 shows that, over 1 000 recognition runs, the standard deviation of the per-digit similarity between the CPU and GPU outputs is on the order of $10^{-7}$. The calculation error is small enough, and whether the identification is correct depends only on the similarity of each digit. Therefore, the CPU and the GPU agree in the correctness of recognition and detection, but their speed differs by two orders of magnitude. Figure 3 compares the floating-point computing power of the CPU and the GPU; the difference in average floating-point computing power reaches 60 times at its peak. As the number of recognitions increases, floating-point performance flattens and tends toward a linear trend.
processors, do scheduling and allocate data reasonably, can be better for a variety of applications of neural network.
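The scale of the single-precision effects discussed above can be illustrated on the CPU by rounding every intermediate result to the 32-bit IEEE-754 format. The sketch below is an emulation built on Python's standard `struct` module and does not reproduce the GPU's actual FMAD or reciprocal behavior. The helper names `to_f32`, `bits_to_f32` and `dot_f32` are hypothetical, and the input values are made up.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bits_to_f32(bits):
    """Interpret a 32-bit pattern as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot_f32(xs, ws):
    """Dot product with every intermediate result rounded to binary32."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = to_f32(acc + to_f32(to_f32(x) * to_f32(w)))
    return acc


# A 25-term sum, the size of one 5*5 acceptance domain (made-up values).
xs = [0.1 * i for i in range(25)]
ws = [0.01 * (i + 1) for i in range(25)]
exact = sum(x * w for x, w in zip(xs, ws))      # binary64 reference
approx = dot_f32(xs, ws)                        # emulated single precision
rel_err = abs(approx - exact) / abs(exact)

# The bit pattern named in item (7) decodes to a NaN.
quiet_nan = bits_to_f32(0x7FFFFFFF)
```

The relative error of such a sum stays far below the gaps between the ten output similarities, which is consistent with the observation that the CPU and the GPU reach the same classification.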
