BLAS3 optimization for the Godson-3B1500 by Ming Zhang et al.
BLAS3 optimization for the 
Godson‑3B1500
Ming Zhang, Naijie Gu* and Kaixin Ren
Introduction
Basic linear algebra subprograms (BLAS) (Netlib 2016a) are basic and significant math-
ematics kernels that provide key functions for high-performance computing (HPC) 
applications. General matrix multiplication (GEMM), the kernel of level-3 BLAS, is vital 
for the numerical software Lapack (Netlib 2016b) and performance benchmark Linpack 
(Netlib 2016c). Especially in Linpack, GEMM accounts for 93% of the entire execution 
time when it is unoptimized (Zhang et al. 2004). Moreover, GEMM is representative of 
applications where both computation and memory access are in high demand. There-
fore, optimizing the performance of GEMM is significant for guiding improvements in 
the performance of other applications. Additionally, optimizing computing-intensive 
applications such as GEMM can simulate potential problems and help to find bugs in 
newly-developed hardware platforms.
Recently, numerous studies have been conducted to improve the performance of 
BLAS. Many libraries such as Intel MKL, AMD ACML, ATLAS and GotoBLAS (Goto 
and Van De Geijn 2008a, b) have been supplied by CPU vendors or HPC research-
ers. These libraries are aimed at the highest level of performance on various hardware 
Abstract 
This paper proposes a performance model for general matrix multiplication (GEMM) 
on decoupled access/execute (DAE) architecture platforms, in order to guide improve-
ments of the GEMM performance in the Godson-3B1500. This model focuses on 
the features of access processors (APs) and execute processors (EPs). To reduce the 
synchronization overhead between APs and EPs, a synchronization module selec-
tion mechanism (SMSM) is presented. Furthermore, two optimized algorithms of 
GEMM for DAE platforms based on the performance model are proposed for ideal 
performance. In the proposed algorithms, the kernel functions are optimized with 
single instruction multiple data (SIMD) vector instructions, and the overhead of AP is 
almost overlapped with EP by taking full advantage of the features of the architecture. 
Moreover, the synchronization overhead can be reduced according to the SMSM. In 
the end, the proposed algorithms are tested on the Godson-3B1500. The experimental 
results demonstrate that the computing performance of dGEMM reaches 91.9% of the 
theoretical peak performance and that zGEMM can reach 93% of the theoretical peak 
performance.
Keywords: DAE, BLAS, Performance optimization, Godson-3B1500
Open Access
© The Author(s) 2016. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License 
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, 
provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and 
indicate if changes were made.
RESEARCH
Zhang et al. SpringerPlus  (2016) 5:2014 
DOI 10.1186/s40064‑016‑3690‑3
*Correspondence:   
gunj@ustc.edu.cn 
School of Computer Science 
and Technology, University 
of Science and Technology 
of China, 508, Elec-3(Diansan) 
Building, West Campus of 
USTC, Huang Shan Road, 
Hefei, Anhui Province, China
Page 2 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
platforms. Additionally, Allen et al. (2009) described auto-tuning and optimized GEMM 
techniques for GPU. Wang et  al. (2013) have presented a template-based optimized 
framework-AUGEM that can automatically generate fully optimized assembly DLA ker-
nels. The DLA kernels generated by their template-based approach surpass the imple-
mentations of MKL and ACML libraries. Moreover, Gu et  al. (2008) have conducted 
much work for BLAS3 optimization on the Godson-2F platform. He et al. (2012) have 
carried out a study on optimization of BLAS3 on the Godson-3A. Zhang et al. (2012) 
have released a new library, OpenBLAS, which greatly improves the BLAS3 perfor-
mance on the Godson-3A.
The optimized algorithms and models described above can efficiently enhance perfor-
mance and guide users designing optimized frameworks. However, with advancements 
in the peak computing capability of processors, conventional memory access methods 
cannot satisfy computational requirements, and traditional optimization methods will 
be limited. To solve the memory wall, hardware that uses asynchronous memory access 
technologies has been developed. As a representative, decoupled access/execute archi-
tecture (DAE) (Smith 1982, 1984) was proposed by Smith in 1982. Generally, there are 
several access processors (APs) and execute processors (EPs) on DAE platforms. APs are 
accountable for memory access, and EPs are responsible for computations. These func-
tional units are independent and can work in parallel. DAE has now become a valued 
architecture for HPC applications such as BLAS and FFTW due to its superior comput-
ing ability and memory access performance.
It is difficult for applications to auto-optimize performance by making full use of EPs 
and APs, and manual optimizations are needed for the DAE architecture. The Godson-
3B series consist of DAE platforms. To improve the performance of applications for the 
Godson-3B, some studies have been conducted. Zhu (2011) has designed a new algo-
rithm of dGEMM on the Godson-3B, which has been implemented in a simulation plat-
form. Zhao et al. (2013, 2014, 2015) introduced several auto-optimization technologies 
for BLAS, and the optimizations of dGEMV and dGEMM (ATGEMM) were discussed 
in detail in the Godson-3B1000. ATGEMM was optimized by using the L2 cache as the 
intermediate storage space.
However, these studies do not give thorough consideration to the performance impacts 
of various architectures, including DAE. Therefore, in order to facilitate the optimization 
of applications in the DAE architecture, a performance model of GEMM for DAE plat-
forms is proposed in this paper. The impacts related to computation and memory access 
are parameterized in the proposed model, and the time needed by APs and EPs will be 
evaluated according to the computing account and computing power. The runtime of 
computing kernels can be preliminarily computed and presented with the features of 
EPs. Taking into account various factors in the performance model, the overall runtime 
can be preliminarily computed. The GEMM performance is improved by analyzing the 
variables that obviously influence the overall runtime in the model. Additionally, several 
optimized algorithms for ideal performance in the Godson-3B1500 are proposed based 
on the performance model.
This paper is organized as follows: second section describes the background, includ-
ing basic GEMM algorithm and the Godson-3B1500. Third section discusses the per-
formance model, followed by the optimization technologies. In fourth section, we detail 
Page 3 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
the proposed algorithms of GEMM. Fifth section presents the correctness of numeri-
cal accuracy and the performance improvements. Finally, conclusions are drawn in last 
section.
Background
This section describes the background, including the basic GEMM algorithm and the 
Godson-3B1500. To introduce the Godson-3B1500, we mainly focus on memory access 
methods and vectorization instructions.
Basic GEMM algorithm
GEMM is a basic and key algorithm in mathematics. Assuming that the dimensions of 
the matrix A, B and C are M × K , K × N  and M × N , respectively, GEMM indicates 
that C = αC + βA× B, as shown in (1).
where Am,k denotes the (m, k) entry of the matrix A, Bk ,n represents the (k, n) entry of 
the matrix B, and Cm,n represents the (m, n) entry of C. For simplification, GEMM men-
tioned above can be defined as GEMM(M, K, N).
GEMM is called dGEMM when the elements of the matrices are double-precision 
floating-point numbers. GEMM becomes zGEMM for double-precision floating-point 
complex numbers. A complex number consists of a real part and an imaginary part. 
Unlike real numbers, the multiplication of complex numbers consists of four multiplica-
tion operations and four addition/subtraction operations. Assuming that complex num-
ber x is a+ i × b (i = √−1), the result is (ac − bd)+ i × (ad + bc) when x is multiplied 
by c + id. The real and imaginary parts of the complex numbers are stored interleaved in 
the memory in BLAS.
Godson‑3B1500
The Godson (Hu et al. 2011) is a family of general-purpose MIPS64 CPU developed at 
the Institute of Computing Technology, Chinese Academy of Sciences. The Godson-3 
Multi-Processor-Chip aims at high-end desktop HPC computers, and the Godson-
3B1500 (Hu et al. 2011) is the third generation. As shown in Fig. 1, each CPU consists 
of two nodes. There are four GS464V cores, one RDMA Matrix Transposition (Trans-
DMA), four 1MB L3 cache, and one DDR3 memory controller in each node. The God-
son-3B1500 can issue four instructions in parallel within a single clock cycle, including 
two floating-point instructions and one memory access instruction.
As shown in Fig.  2, the memory subsystem of the Godson-3B1500 consists of four 
storage levels, including the L1 cache, L2 cache, shared L3 cache and memory. Com-
munication between different components in the memory subsystem occurs via the 
advanced extensible interface (AXI) protocol with cache coherence extension. The L1 
and L2 caches of each core are private, while the L3 cache is shared by cores. The caches, 
which adopt the random replacement strategy, are four-way set-associative, and the size 
of a cache line is 32B (256-bit) (Gao et  al. 2010). Each cache line corresponds to four 




Page 4 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
cache positions, and the four positions are considered as one cache group for simplicity. 
When the cache missing occurs, the corresponding cache line will be stored into the cor-
responding cache group. One cache line in the cache group will be chosen by a random 
algorithm and will be replaced by the new incoming cache line when the cache group is 
full. Since the random cache replacement strategy does not consider the temporal local-
ity of memory access, the most frequently accessed data may be replaced.
As shown in Fig. 2, TransDMA is used to transfer data between the L3 cache and the 
memory. Meanwhile, VectDMA is used to transfer data between the L3 cache/memory 
and the vector registers. Each core has one VectDMA that is configured with four chan-
nels, a, b, c and d. Channels a, b and c transfer data from the L3 cache/memory to vector 
registers, while channel d writes data back to the memory/L3 cache from vector reg-
isters. Additionally, each core holds 128-entry vector registers (VectReg) that support 
single instruction multiple data (SIMD) vector instructions. A vector register can store 
256-bit data, such as four double-precision floating-point numbers. The instruction 
Fig. 1 Architecture of the Godson-3B1500
Fig. 2 Memory hierarchies and data transfer methods in the Godson-3B1500
Page 5 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
VBCMULADDPD is representative of the SIMD vector instructions in the Godson-
3B1500, and it can launch four multiply-add operations of double-precision floating-
point numbers in one cycle. As shown in Fig.  3, VBCMULADDPD operates the vector 
registers A, B, and C, and the results are stored in C. The variable θ represents which 
number will be operated in vector register A.
The Godson-3B1500 can use the mechanism of a cache lock. When the lock window 
is configured, the cache blocks that are located in the locked L3 cache space cannot be 
replaced until they are updated manually. For computation-intensive applications, many 
data need to be accessed and computed multiple times. Cache missing brings numerous 
extra overheads, which notably influence the performance. It can be even more fatal, 
especially for platforms with a random cache replacement policy. The lock mechanism 
can keep frequently required data stored in the locked cache, which can greatly reduce 
the influence of cache missing for computation-intensive applications and enhance the 
application performance.
The Godson-3B1500 is a DAE platform. VectDMA and TransDMA can work as APs, 
and the vector function units work as EPs. The GS464V core can issue two floating-point 
vector instructions, and each instruction can launch four multiply-add operations in one 
cycle. There are two floating-point operations in the multiply-add operation. Moreover, 
there are 8 cores in the EP. When the CPU frequency is 1.5 GHz, the theoretical comput-
ing peak capacity (Perfpeak) can reach 2× 4 × 2× 8× 1.5 (192.0) GFlops. Generally, the 







Fig. 3 The SIMD vector instruction VBCMULADDPD. a Instruction VBCMULADDPD. b VBCMULADDPD opera-
tions
Page 6 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
Performance optimizations
These performance optimizations are mainly issued in the MIPS architecture. MIPS is 
a streamlined and highly scalable reduced instruction set computer (RISC) architec-
ture. It can support SIMD vector instructions and visible pipeline delay slots. There are 
large number of registers, the number and the character of the instructions in the MIPS. 
Sometimes it can support different memory access methods such as normal CPU access 
and DMA access. Generally, there are multilevel caches in the MIPS architectures.
This section presents the performance model of GEMM in DAE architecture. The 
model is developed according to the features of the DAE architecture and briefly intro-
duces the relationship between overall performance and times of EPs and APs for all 
DAE architecture. Relationship between performance and architecture, the most impor-
tant part of the performance model, in which the number of functional units, computa-
tional ability of functional units, instruction pipeline structure and capacity of the APs 
are included, focuses on the Godson-3B1500. Moreover, taking into account the perfor-
mance model, some optimization technologies are discussed for performance improve-
ments of the GEMM on the Godson-3B1500.
Performance model
There are two advantages for proposing the performance model for GEMM. The first 
advantage is to optimize the BLAS3 and guide the designing of algorithms. The second 
advantage is to offer a modeling method for other applications in the DAE architecture, 
including Godson-3B1500. The users of Godson-3B1500 can learn how to design an 
effective algorithm with the help of the GEMM performance model in our manuscript. 
For a detailed analysis of factors, several variables that represent architecture parameters 
are defined. These variables are employed into character TN , which is the lower bound 
for the overall runtime of GEMM(N, N, N). Generally, factors for GEMM performance 
in DAE architecture include the following parameters.
  • Tmem, which presents the time for data transfer between different storage hierar-
chies (e.g., memory, caches and register files) by using normal CPU memory access 
instructions, can be defined as in (2). 
where l(j, k) denotes the amount of data that are accessed from the k-th layer of 
memory in the j-th computing stage, and L(k) denotes the total amount of data that 
are accessed from the k-th layer in all computing stages. ω(k) defines the memory 
bandwidth of access the k-th layer of memory.
  • Tshuﬄe It denotes the time for reorganizing the data including changing the position 
of the data and obtaining the negative of a number in the vector registers. Some-
times, Tshuﬄe can be partially minimized or avoided by optimizing the multiplication 
of complex numbers and integrating the shuffle function into SIMD vector instruc-













Page 7 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
  • TEP It presents the required time for kernel computation of GEMM. The required 
time mainly depends on the size of GEMM, and the computing capacity of the EPs. 
For GEMM(N, N, N), TEP can be defined as in (3). 
In (3), sizeop defines the amount of operations for each operation. s defines the num-
ber of function units. v defines the average degree of parallelism for each instruction. 
g defines the number of operations for each function unit in one cycle. f defines the 
frequency of the CPU. 1 denotes the overlapping factor of time for memory access 
by EPs. The parameters g, s and f are determined by the hardware and they are fixed 
for the platform.
  • TAPi, which presents the time of data transfer for the i-th AP, can be defined as in (4). 
where Counti,j defines the size of data in the j-th stage for i-th AP. SpeedAPi,j denotes 
the memory access speed of APi for the j-stage. p denotes the amount of stages. 
There are two stages for the process of GEMM. The first stage is the transfer of data 
from the memory to the locked L3 cache, and the second stage is the transfer of data 
from the L3 cache to the vector registers.
  • Tsync It defines the overhead of the synchronization between APs and EPs, such as 
the time between computation and DMA in some architectures.
  • Textra It denotes the extra overhead of other processes, such as the computation of 
positions for data prefetching and data storing.
The above-mentioned parameters can be divided into two groups. One group is the 
time for computation, Tc. The other one is the time for memory access, Tm. Tc can be 
defined as in (5), and Tm can be defined as in (6).
The computation and memory access can be processed in parallel in the DAE architecture. 
The ratio of overlapped memory access time can be defined as .̺ TN can be derived as in (7). 
When the time of APs is overlapped with the computational overhead, the performance is 
mainly determined by the execution overhead. For a large N, the extra cost of other processes 
can be ignored compared with the time of execution, and TN can be further simplified.
(3)TEP =
N 3 × sizeop







(5)Tc = TEP + Tsync + Textra
(6)








TN = max {TEP + Tsync + Textra,Tm}
= TEP + Tsync + Textra + (1− ̺)Tm
= N
3 × sizeop
s × v × g × f + Tshuﬄe + 1Tmem
+ Tsync + (1− ̺)Tm + Textra
Page 8 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
The number of floating-point operations, Ncal, for GEMM(M, K, N) is fixed. For 
GEMM(M, K, N), MKN operations between matrix elements are needed. There are 
four multiplications and four additions/subtractions for each operation between matrix 
elements in zGEMM, and there are one multiplication and one addition in dGEMM. 
Therefore, Ncal is 8MKN for zGEMM(M, K, N) and 2MKN for dGEMM(M, K, N). The 
practical performance of GEMM, P, can be calculated as P = Ncal/TN. After substitut-
ing TN into the formula P = Ncal/TN, P can be expressed as in (8).
As shown in (8), in order to enhance the performance P, the variables Tshuﬄe, Tsync, 
Textra and 1 should be reduced, while  ̺ and v should be increased. In the DAE archi-
tecture, APs and EPs can work in parallel. To reduce the memory access overhead, APs 
accomplish most missions of memory access, and the normal memory access unit is 
responsible for the remaining missions of memory access. Most GEMM tasks are com-
putations, and extra overhead makes little difference to the overall runtime.  ̺ is influ-
enced by the computation to memory access overhead ratio, and it is mainly determined 
by the features of the algorithm and hardware. Variables  ̺ and Textra will not be dis-
cussed in this paper. In the following subsections, the optimizations of v, Tshuﬄe, 1, and 
Tsync are mainly discussed.
Vectorization
The main computational overhead is kernel computation, in which v is a decisive fac-
tor. v presents the average degree of parallelism for each instruction, and vectorization 
can exponentially increase its value. To improve the overall performance P, vectorization 
is the most important optimization technology to increase v theoretically. In GEMM, 
most computations are multiply-add and multiply-subtract operations. In the Godson-
3B1500, the vector instruction VBCMULADDPD can operate the vector registers and 
launch four multiply-add operations in one cycle. The kernels of GEMM can be vector-
ized by using vector instructions. Figures 4 and 5 show the methods to optimize GEMM 
kernels with vector instructions.
Computations of dGEMM are operations between real numbers. All computations are 




s×v×g×f + Tshuﬄe + 1Tmem + Tsync + (1− ̺)Tm + Textra
Fig. 4 Vectorization for dGEMM computing kernel
Page 9 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
sizes of matrices A and B are 4 × 1 and 1× 4, respectively, for which there exist 16 mul-
tiply-add operations. Normal instructions operate the normal registers and can launch 
1 operation in one cycle. As shown in Fig. 4, when the normal multiply-add instruction 
madd is used, there are 16 madd instructions. The original value of v equals 1. In the 
BLAS library, the data in the matrix are arranged in column-major order. When blockB 
is preloaded to the L3 cache, matrix transportation is needed to match VectDMA. 
Compared to the column-major order of the original matrix B, the data of blockB in 
the L3 cache can be seen in row-major order. Every four neighboring numbers in matri-
ces A and B can be accessed by the same vector registers. Then, the instruction VBC-
MULADDPD will be called. At the end of computations, the results are stored into four 
corresponding vector registers. In total, four vector instructions are needed for kernel 
computing. After vectorization, the number of kernel instructions will decrease from 16 
to 4, and v changes from 1 to 4. There are no shuffle operations in dGEMM, and the 
value of Tshuﬄe is 0.
Computations of zGEMM are operations between complex numbers. Unlike dGEMM, 
the operations of zGEMM consist of multiply-add and multiply-subtract operations. 
Figure 5 shows the kernel multiplication of zGEMM(2, 1, 2), and the result is a 2-by-2 
matrix. When the kernel is realized with normal instructions, such as the multiply-add 
instruction madd and multiply-subtract instruction msub, the instructions operate the 
normal floating-point registers. Normal registers rx and ix are used to store the real and 
imaginary parts of complex numbers, respectively. As shown in the middle subfigure of 
Fig. 5, 16 normal instructions are needed for zGEMM(2, 1, 2), and the original value of v 
equals 1.
As shown in the right subfigure of Fig. 5, there are 5 vector instructions to vectorize 
zGEMM(2, 1, 2). First, the data of matrices A and B are loaded into vector registers VA 
and VB, respectively, by using VectDMA. Then, the results are updated with VA and the 
real parts of VB by calling VBCMULADDPD. Next, the data in the vector register VA are 
reorganized and shuffled. The results are updated with VA and the imaginary parts of VB 
by calling VBCMULADDPD in the end. After vectorization, the value of v rises from 1 to 4. 
When the computing kernel is zGEMM(m, k, n), there are k × n/2 shuffle instructions. 
The total number of vector-computing instructions is m× k × n. The ratio of the num-
ber of shuffle instructions to the number of overall instructions is 1/(2m + 1). In other 
words, shuffle operation takes up approximately 1/(2m +  1) of the overall processing 
time.
Fig. 5 Vectorization for zGEMM computing kernel
Page 10 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
Mechanism for issuing multiple instructions
Assuming that the time of memory access, Tmem, is fixed, the 1 should be reduced to 
enhance overall performance. The Godson-3B1500 supports the mechanism for issu-
ing multiple instructions, and this mechanism can decrease the 1. There are two vector 
floating-point operation units, two fixed-point operation units, and one memory access 
unit in each core. Four instructions can be issued simultaneously in one cycle, includ-
ing two floating-point instructions, one memory access instruction and one fixed-point 
instruction. In the GEMM kernel, most instructions are computing instructions. To 
improve the performance, floating-point operation units should be kept working. Non-
blocking cache access instructions can be used for data preloading without influencing 
the efficiency of the computing instruction sequence.
As shown in Fig. 6, two computation instructions are issued in each cycle, and mem-
ory access instructions can be inserted into the instruction sequence. The overheads 
of memory access are much lower than those of computation, and most of the time of 
memory access can be concealed by computation. Using a reasonable arrangement of 
the instruction sequence, the mechanism for issuing multiple instructions can be used to 
reduce the value of 1.
Synchronization module selection mechanism (SMSM)
As shown in (8), the synchronization overhead, Tsync, should be reduced to enhance the 
overall performance. The synchronization overhead between APs and EPs takes up the 
highest portion of the Tsync. In order to reduce the synchronization overhead, R˜, which 
defines the EP-to-AP time ratio, is proposed. R˜ can be calculated as in (9), when syn-
chronizations are not considered. When the time of EPs is greater than that of APs, APs 
can be concealed by EPs. Otherwise, EPs will keep waiting until the APs are completed.
It naturally follows that synchronization is needed when EPs have to wait for APs. In 
the Godson-3B1500, the synchronization module consists of several lines of assembly 
languages. This module polls the state register for APs circularly, till the value of the 
register changes to the expected value. In order to reduce the time of synchronization 
and enhance the performance of GEMM, a SMSM mechanism is proposed. This mecha-
nism deploys the APs, EPs and synchronization module efficiently. The synchronization 
module is inserted into the computing instruction sequences and takes at least 5 cycles. 
When EPs have to wait, the overhead of the synchronization module will be larger. If the 
synchronization module can be discarded, the time of synchronization will be saved, and 
the performance will be enhanced.
(9)R˜ =
TEP + Textra + Tshuﬄe
max {TAPi}
Fig. 6 Mechanism for issuing multiple instructions
Page 11 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
When the time of EPs is less than that of APs, the synchronization module is needed. 
However, the EP and AP times are not fixed and change slightly with the change of the 
CPU execution state. If R˜ is casually calculated and determined, unfavorable scheduling 
may lead to wrong computing results when the synchronization module is not deployed. 
To solve this potential problem, R˜ should be determined cautiously. To ensure correct 
results, the EP and AP times for kernel computing are tested repeatedly, and the time 
results are recorded. R˜ is calculated with the minimal EP time and maximal AP time of 
the time results. When R˜ is larger than 1, there is no need for EPs to wait for APs.
The method for calculating the R˜ is shown below.
(a) Run the computation kernel for n times and test the time.
(b) Use (9) to calculate the R˜i for the i-th test and record the result as xi.
(c) Calculate the mathematical expectation X¯ with X¯ = 1n
∑n
i=1 xi and standard devi-
ation σ with σ =
√∑n
i=1 (xi − X¯)
2
/n.
(d) Use the one-tailed tests to test whether R˜ > 1 (or R˜ ≤ 1) can be established in 95% 
confidence level.
According to the definition of R˜, the SMSM is described as follows. If R˜ ≤ 1, it is uncer-
tain whether EPs need to wait for APs, or the synchronization module needs to be 
deployed. Otherwise, the synchronization module can be discarded. In our experiments, 
the R˜ can be determined in 95% confidence level when n is set to 200.
Optimized algorithm based on DMA
Classic block matrix multiplications form the essential basis of our algorithms. GEMM 
consists of multi-level matrix partitions, and every level follows the rules for block 
matrix multiplication, which are discussed in Goto and Van De Geijn (2008b). When the 
matrix is divided, there are many small block matrix multiplications. For minor matrix 
multiplications, the matrix can be divided iteratively. If every matrix partition is correct, 
the algorithm of GEMM can be proved to be correct.
For GEMM(M, K, N), there are three main types of classification:
(1) When the matrix A is broken into sub-matrix blocks of dimension M-by-k0 and 
B is divided into sub-matrix blocks of dimension k0-by-N, it can be described as ∑K/k0
i=1 GEMM(M, k0,N ).
(2) When the matrix B is divided into sub-matrix blocks of dimension K-
by-n0 and A is not divided, the GEMM(M, K, N) can be described as 
(GEMM1(M,K , n0), . . . ,GEMMN/n0(M,K , n0)).
(3) When the matrix A is divided into sub-matrix blocks of dimension m0
-by-K and B is not divided, the GEMM(M, K, N) can be described as 
(GEMM1(m0,K , n0), . . . ,GEMMM/m0(m0,K ,N ))
T .
As shown in Fig. 7, there are six levels of block matrix multiplications that can be sum-
marized into the above-mentioned types. The first and sixth levels of block matrix 
multiplication belong to type (1). The second, fourth, and fifth levels of block matrix 
multiplication belong to type (2). The third one belongs to type (3). TransDMA, 
Page 12 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
VectDMA and non-blocking cache access instructions are responsible for data transfer, 
and they will not affect the correctness of the algorithms when the correctness of the 
data transfer is ensured. In total, correct partition procedures lead to correct algorithms.
The ratio of computation to memory access is N:4 for basic dGEMM(N, N, N) and 
N:2 for zGEMM(N, N, N). Compared with the computational amount, the amount of 
memory access is very small. Because of the rapid computational power and slow mem-
ory access performance, the memory wall is still the bottleneck of GEMM performance. 
Many attempts have been made to optimize the BLAS3 with the normal optimization 
technologies such as loop unrolling, software pipelining or data prefetching of processor. 
Loop unrolling is used to enhance the re-use of the data in caches to reduce the accounts 
of memory access. Software pipelining is used to eliminate the correlation between the 
execution and memory access, and the execution and memory access units can progress 
in parallel. However, the theoretical peak performance is too high, and the time of mem-
ory access of processors cannot be concealed by the execution. Only approximately 35% 
of the theoretical peak performance can be obtained. Moreover, the parameters of loop 
unrolling have been adjusted, and the performance is still very low. Therefore, the bot-
tleneck cannot be solved by using normal optimization technologies.
In order to solve the memory wall and guarantee data supply, two novel algorithms 
based on DAE architecture are proposed, as shown in Algorithms 1 and 2. The comput-
ing kernels of these two algorithms utilize optimization technologies such as the vec-
torization and mechanism for issuing multiple instructions. SMSM is used to reduce the 
synchronization overhead for these two algorithms. The proposed performance model 
guides the overall design of algorithms.
Algorithm 1
As shown in Algorithm  1, there are six loops in GEMM. These loops can be divided 
into three types, including the outer-loop, middle-loop and kernel-loop. The three loops 
outside belong to the outer-loop. The fourth loop is in charge of task distribution for 
multi-threads in the node, and it forms the middle-loop. The remaining loops belong to 






Fig. 7 Matrix decomposing methods in the GEMM
Page 13 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
broken into sub-matrix blocks of dimension M-by-k0, and B is divided into sub-matrix 
blocks of dimension k0-by-N.
Algorithm 1: :Optimized Algorithm Based on DMA
Input: A,B,C, α, β
Output: C = αC + βA×B
1 for k = 0; k < K; k = k + k0 do
2 for n = 0;n < N ;n = n+ n0 do
3 Preload first blockA to the locked L3 cache by using TransDMA.
4 Load blockB to the locked L3 cache with normal memory access
instructions.
5 for m = 0;m < M ;m = m+m0 do
6 Preload blockAnext to the L3 cache by using TransDMA.
7 for Each CPU core j do
8 Preload blockC from the memory to the vector registers by using
channel c of VectDMA.
9 for nn = jn0/4; nn < (j + 1)n0/4; nn=nn+ nn0 do
10 Preload blockCnext from the memory to the vector registers by
using channel c of VectDMA.
11 Preload blockAs and blockBs from the locked L3 cache to the
vector registers by using channels a and b of VectDMA,
respectively.
12 for kk = 0; kk < k0; kk=kk+ k00 do
13 Arrange the synchronization selection mechanism SMSM.
14 Calculate blockAs, Bs and C with vector instructions, and
shuﬄe instructions are needed for zGEMM. Meanwhile,
preload blockAsnext and blockBsnext from the locked L3
cache to the vector registers by using channel a and b of
VectDMA, respectively.
15 Store blockC from the vector registers to the memory by using
channel d of VectDMA.
16 Preload blockCnext−next from the memory to the vector
registers by using channel c of VectDMA.
The algorithm mainly discusses the multiplications of M-by-k0 blockA and k0-by-
N blockB. First, blockB is divided with dimensions of k0 × n0. The processor gives the 
blockB access to the L3 cache with normal memory access instructions. At the same 
time, TransDMA is configured to preload the first m0-by-k0 blockA to the locked L3 
cache. Then, blockB, which is stored at the locked L3 cache, is successively multiplied 
by many m0-by-k0 blockAs in the locked L3 cache. The outer-loop is responsible for data 
transfer of blockA and blockB from the memory to the locked L3 cache. The middle-
loop performs an average distribution of blockB in the locked L3 cache to four threads in 
the node. Each thread calculates the corresponding k0-by-(n0/4) blockB.
VectDMA is responsible for data transfer between vector registers and memory/L3 
cache. Channels a and b preload blockAs and blockBs, respectively. Channel c is respon-
sible for preloading blockC from the memory to the vector registers, while channel d is 
in charge of writing blockC back to the memory. When m0-by-k0 blockA and k0-by-n0 
blockB are preloaded to the locked L3 cache, the kernel-loop starts to execute. Chan-
nel c preloads the m0-by-n00 blockC from the memory to the vector registers. Channel 
a preloads the m0-by-k00 blockAs, and channel b preloads the k00-by-n00 blockBs. At the 
same time, TransDMA starts to preload the next m0-by-k0 blockAnext to the locked L3 
Page 14 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
cache. When the computing kernel begins, channel c will preload the next blockCnext to 
the vector registers simultaneously.
When the kernel function of multiplication of blockAs and blockBs is called, channels a 
and b begin to preload the next blockAsnext and blockBsnext, respectively. After the kernel 
ends, the computing kernel of the next blockAsnext and blockBsnext is called successively. 
When multiplication of blockA and blockB ends, channels d and c begin to write blockC 
back and preload the next blockCnext, respectively. At the same time, the multiplication 
of the next blockAnext and blockBnext begins to execute.
The delay of memory access is very long for the Godson-3B1500, and the cache miss-
ing rate greatly influences the performance of GEMM. The Godson-3B1500 uses a ran-
dom cache replacement strategy, and the cache missing rate is significantly higher than 
those in other strategies for GEMM. For ideal performance, data that are frequently 
reused should not be replaced from the cache. A mechanism of locking cache is pro-
posed to keep some data in the cache. Experiments demonstrate that if more than half 
the cache spaces are locked, the system may be paralyzed due to a system deadlock.
In order to ensure that frequently reused data cannot be replaced from the cache, 
blockA and blockB should be stored in the locked cache. When one blockA is being 
computed, TransDMA will begin to transfer the next blockAnext. Variable sizeof(xx) is 
used to define the size of xx. For example, sizeof(matrix element) is utilized to define 
the size of the matrix element. Two block spaces in L3 cache are assigned to sub-matrix 
A, and blockA will occupy 2m0k0×sizeof (matrix element) L3 cache space. Additionally, 
only one block is assigned to B, and B will occupy k0n0×sizeof(matrix element) L3 cache 
space. Since only half of the L3 cache can be locked, parameters m0, k0 and n0 should 
meet the condition shown in (10).
In the kernel-loop, the kernel of computation is composed of vector instructions (e.g, 
multiply-add and shuffle), and the performance of EPs will be maximized. The comput-
ing data are stored in the vector registers. Since the size of the vector registers is lim-
ited, tiling parameters should be considered. As shown in Algorithm 1, blockAs, blockBs , 
and blockCs use the ping-pong processing strategy, and A, B, and C will be assigned two 
register blocks. BlockCs will occupy the 2m0n0×sizeof(matrix element) vector register 
space. To avoid interruption of the instruction pipelining by the performance jitter of 
VectDMA, several computing groups are needed for A and B. Assuming that the num-
ber of kernel computation groups is ρ, blockAs will occupy 2m0k00ρ×sizeof(matrix ele-
ment) vector register space, and blockBs will occupy 2k00n0ρ×sizeof(matrix element) 
vector register space. Parameters m0, k00, and n0 should meet the condition shown in 
(11).
 
(10)(2m0k0 + k0n0)sizeof(matrix element) ≤ 0.5× sizeof(L3 cache)
(11)2((m0k00 + k00n0)ρ +m0n0) ≤
sizeof(vector registers)
sizeof(matrix element)
Page 15 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
Algorithm 2: :Improved Optimized Algorithm Based on DMA
Input: A,B,C, α, β
Output: C = αC + βA×B
1 for k = 0; k < K; k = k + k0 do
2 Load first blockB to the locked L3 cache by using normal memory access
instructions.
3 for n = 0;n < N ;n = n+ n0 do
4 Preload blockA to the locked L3 cache with TransDMA.
5 for m = 0;m < M ;m = m+m0 do
6 Preload blockAnext from the memory to the locked L3 cache with
TransDMA.
7 for Each CPU core j do
8 Preload blockC from the memory to the vector registers by using
channel c of VectDMA.
9 for nn = jn0/4; nn < (j + 1)n0/4; nn=nn+ nn0 do
10 Preload blockCnext from the memory to the vector registers by
using channel c of VectDMA.
11 Preload blockAs and blockBs from the L3 cache to the vector
registers by using channels a and b of VectDMA, respectively.
12 for kk = 0; kk < k0; kk = kk + k00 do
13 Arrange the synchronization selection mechanism SMSM.
14 Calculate the computing kernel GEMM(m0,k00,nn0) with
vector instructions, and insert non-blocking cache access
instructions to preload blockBnext into the computing
instructions sequences. And shuﬄe instructions are
included for zGEMM. Meanwhile, preload blockAsnext and
blockBsnext from the L3 cache to vector registers by using
channels a and b of VectDMA, respectively.
15 Store blockC from the vector registers to the memory by using
channel d of VectDMA.
16 Preload blockCnext−next from the memory to the vector
registers by using channel c of VectDMA.
Algorithm 2
In Algorithm 1, most overheads of memory access are concealed by the computing time. 
However, the time of loading matrix B to the locked L3 cache cannot be concealed. The 
overhead of loading matrix B will influence overall performance. To solve this problem, 
an optimized algorithm is proposed, in which the time of preloading the matrix B can be 
concealed, as shown in Algorithm 2.
The overhead of data prefetching of blockB cannot be masked by the EPs in Algo-
rithm  1. It will interrupt the instruction pipelining of the computation kernel to wait 
for the prefetching of blockB. As discussed in section “Mechanism for issuing multiple 
instructions”, the instructions for non-blocking cache access can be inserted into the 
computational instruction sequence. The locked L3 cache blocks that are assigned to sub-
matrix B can be divided, on average, into two parts. When one blockA is being calculated, 
the next block A can preload the data to the other part space. Unlike Algorithm 1, B in 
Algorithm 2 needs two blocks, and B will occupy the 2k0 × n0×size of (matrix element) 
L3 cache space. Parameters m0, k0 and n0 should meet the condition shown in (12).
(12)2m0 × k0 + 2k0 × n0 ≤
0.5× size of (L3 cache)
size of (matrix element)
Page 16 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
The difference between Algorithms 1 and 2 focuses on the method of preloading sub-
matrix B. In Algorithm 1, sub-matrix B is loaded using normal memory access instruc-
tions, and the computing pipeline will be interrupted to wait for the end of loading B. 
In Algorithm 2, the non-blocking cache access instructions replace the normal memory 
access instructions, and the instructions can be inserted into the computing instructions 
using the mechanism for issuing multiple instructions. Figure 8 shows the process pro-
cedures of the algorithms. Most overheads caused by preloading data from the locked L3 
cache to the vector registers can be concealed by the computing time in both algorithms. 
Additionally, the overheads of preloading matrix A to the locked L3 cache can be con-
cealed. Furthermore, as shown in Fig. 8, the overheads of loading sub-matrix B in Algo-
rithm 2 can be reduced compared with Algorithm 1.
(a)
(b)
Fig. 8 The processes of the proposed algorithms. a The process of Algorithm 1, b the process of Algorithm 2
Page 17 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
Experimental results
To validate the correctness and effectiveness of the proposed algorithms, several experi-
ments were conducted. In this section, we present the experimental testbed and detail 
the experiments and results.
Experimental testbed
The kernel functions of the algorithms are mainly implemented in MIPS64 assembly lan-
guage. The hardware of the testbed is the Godson-3B1500 clocked at 800 MHz. The peak 
performance of one node is 51.2 GFlops. Experiments were tested on the Loongson-
Server Multi-libs system. The software is the GNU Compiler Collection for Godson, and 
the compile options are “-march=mips64 -mabi=64 -O2”. The compiler supports SIMD 
vector instructions of the Godson-3B1500. According to (10), (11) and (12), the param-
eters that produce the best performance are shown in Table 1. VectReg is used to define 
the number of vector registers used for the algorithms. There are 128-entry 256-bit vec-
tor registers. In all, 128 vector registers are used for zGEMM and 120 vector registers 
are used for dGEMM.
After testing, the ratio of computation to memory access, R˜dGEMM, equals 0.93 for 
dGEMM. Compared with dGEMM, zGEMM has a lower memory access ratio ,and its 
R˜zGEMM equals 1.08. According to the SMSM, the synchronization module should be 
inserted for dGEMM. On the other hand, the synchronization module can be discarded 
for zGEMM.
Results analysis
The results are analyzed from two aspects, namely, numerical accuracy and perfor-
mance. Numerical accuracy analysis is used to verify the correctness and accuracy of 
the algorithms, while performance analysis is used to calculate the improvements in effi-
ciency of the proposed algorithms.
Numerical accuracy
To verify the numerical accuracy, δ presents the relative error between the correct result 
and the results of the proposed algorithms. δ can be computed as in (13).
where R′i,j defines the (i,  j) entry of the correct result and Ri,j defines the (i,  j) entry of 








, i, j ∈ [0,N )
Table 1 Parameters for GEMM
m0 k0 n0 k00 n00 ρ sizeof(element) VectReg
dGEMMalgo1 12 512 992 4 12 2 8B 120
zGEMMalgo1 8 512 480 4 8 2 16B 128
dGEMMalgo2 12 512 480 4 12 2 8B 120
zGEMMalgo2 8 512 240 4 8 2 16B 128
Page 18 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
floating-point operations is (−10−15, 10−15). δ should be less than 10−15 when the preci-
sions of algorithms satisfy the requirements of the original libraries.
To verify the correctness and numerical accuracy, ATLAS was tested as the reference 
experiment. The matrix dimension, N, ranged from 1000 to 16,000, and experiments 
were carried out at intervals of 1000. The source data were generated randomly, and 
the input data of ATLAS and the proposed algorithms were the same. In the zGEMM 
and dGEMM, the scale of the data of the input matrices makes no difference to the cor-
rectness of algorithms, once the results are not out of bounds. The range of the random 
numbers is set to [−100,000, 100,000], and an uniform distribution is used. After experi-
ments, the relative errors were computed by using formula (13). As shown in Fig. 9, the 
relative errors of Algorithms 1 and 2 are less than 10−15. Because ATLAS satisfies the 
range of errors of double-precision floating-point numbers, the results of ATLAS are set 
to the standard results. The experiments demonstrate that the results of the proposed 
algorithms are correct and that the precisions of the proposed algorithms are equivalent 
to ATLAS.
Performance
In the Godson-3B1500, Algorithms 1 and 2 were tested. The algorithm proposed in Zhu 
(2011) was implemented in a simulation platform (single core) and did not work in the 
real chips. Therefore, Zhu (2011) was introduced as a representative of studies on the 
DAE architecture simply and were not tested. For comparison, two standard versions of 
GEMM, including ATLAS and OpenBLAS, were tested. Moreover, Algorithm 2 without 
SMSM was tested for zGEMM. These tests were performed on the node (four cores) of 
Godson-3B1500, where the theoretical computing peak capacity is 51.2 GFlops (Perfpeak
/2).
One of the most important optimization technologies of ATLAS is data tiling. Accord-
ing to the size of the L1 data cache, the block dimension is 48× 48 for dGEMM. The 
Godson-3B1500 uses the four-way set-associative cache architecture and the random 
cache replacement strategy. Therefore, there will be many cache conflicts in the matrix 
Fig. 9 The relative errors δ for the proposed algorithms
Page 19 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
blocks when the matrix size is large enough. When a conflict occurs, important data 
may be replaced out from the L1 cache. Additionally, ATLAS and OpenBLAS cannot 
fully use the vector instruction set, due to which the computing kernel performs poorly. 
Only one-fourth of the CPU execution capacity can be used. Moreover, asynchronous 
data prefetching is not used, and most of the overhead of memory access cannot be con-
cealed. As shown in Figs. 10 and 11, OpenBLAS and ATLAS perform very poorly in the 
Godson-3B1500.
In Fig. 10, Algorithm 2 performs better than the other algorithms for dGEMM when 
the size is larger than 2000. When the size is less than 4000, each core has very few 
tasks, and the power of hardware cannot be exerted. With an increasing size and ratio 
of computation, optimized algorithms are displaying promising performance. Due to 
cache missing, there are some oscillations in ATLAS and OpenBLAS with increasing 
size. However, APs access the data via the locked L3 cache rather than the L2 and L1 
caches. The size of GEMM indicates the size of total memory needed in the GEMM. It 
includes the size of matrices A, B and C. Therefore, the proposed algorithms perform 
Fig. 10 Performance comparison of dGEMM
Fig. 11 Performance comparison of zGEMM algorithms
Page 20 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
stably, and the performances do not change when the size of dGEMM exceeds the cache 
size. ATGEMM was optimized by using the L2 cache as the intermediate storage space 
for Godson-3B. It only optimized the dGEMM for Godson-3B1000, and its optimiza-
tion methods were not fit for the Godson-3B1500. ATGEMM is an automatic optimized 
algorithm for dGEMM in the Godson-3B1000 and is optimized by using the L2 cache 
as the intermediate storage space. Algorithms 1 and 2 are optimized by using the L3 
cache as the intermediate storage space. The L3 cache in Godson-3B1500 is twice than 
the L2 cache in the Godson-3B1000. Moreover, ATGEMM optimized the high levels of 
the blocking GEMM, and kernel based on the DAE processor was divided into 4 levels. 
Several levels of the ATGEMM are capable to self-adjust and the parameters are gener-
ated by using the automatic optimized algorithm. Manual adjustment to the parameters 
are needed. The parameters include the main sizes of matrix block for outer loop, the 
kernel block sizes and other parameters that influence the performance. After the man-
ual adjustments with experience, the performance improves a little. Compared with our 
approaches, ATGEMM still performs badly. In the Fig. 10, only the performance of origi-
nal ATGEMM is shown. Moreover, Algorithms 1 and 2 have adjusted the parameters for 
the Godson-3B1500, and they perform better than ATGEMM in the Godson-3B1500. 
Compared with Algorithm 1, the time of loading matrix B in Algorithm 2 is concealed by 
using the mechanism for issuing multiple instructions. Algorithm 2 performs approxi-
mately 2.5% better when the size is larger than 5200. Since the APs cannot be concealed 
by the EPs, Algorithm 2 cannot reach the theoretical peak performance. Its best perfor-
mance is 47.07 GFlops, reaching 91.9% of the theoretical peak.
As shown in Fig.  11, Algorithm  2 performs better than ATLAS and OpenBLAS for 
zGEMM when the size is larger than 1400. Due to cache missing, there are some oscil-
lations in ATLAS and OpenBLAS with increasing size. However, APs access the data 
via the locked L3 cache rather than the L2 and L1 caches. Therefore, the proposed algo-
rithms perform stably, and the performances do not change when the size of zGEMM 
exceeds the cache size. Compared with Algorithm 1, the data preloading of matrix B can 
be concealed by using non-blocking cache access instructions and the mechanism for 
issuing multiple instructions. Algorithm 2 performs approximately 3.2% better when the 
size is larger than 6000. In addition, data shuffle is required and cannot be optimized for 
zGEMM in the Godson-3B1500. The overhead of data shuffle occupies 6% (1/(2m0 + 1) ) 
of the total runtime. Therefore, Algorithm 2 cannot reach the theoretical peak perfor-
mance. Its best performance is 47.64 GFlops, reaching 93% of the theoretical computing 
peak for zGEMM.
When the overhead of APs can be almost concealed by the EPs, the synchroniza-
tion module will be redundant for zGEMM. The SMSM can reduce the overhead of the 
synchronization module and enhance the performance. For zGEMM, the ratio of com-
putation to memory access is large enough, and the synchronization module can be 
discarded. As shown in Fig.  12, Algorithm  2 without SMSM performs 3% worse than 
Algorithm 2.
Page 21 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
Conclusion
By virtue of the significance of BLAS, the performance optimization of BLAS has 
attracted attention from scholars and experts. In this paper, a GEMM performance 
model for DAE is proposed to analyze the impacts of parameters. Additionally, two opti-
mized algorithms of GEMM are proposed in the Godson-3B1500 based on the perfor-
mance model. Experiments demonstrate that these two algorithms perform better than 
other versions of GEMM. The optimized algorithm reaches 93% of the theoretical peak 
performance for zGEMM and reaches 91.9% of the theoretical peak performance for 
dGEMM.
However, the performance of GEMM cannot reach the peak performance of the God-
son-3B1500. The memory wall is still the bottleneck for HPC applications. It is necessary 
to investigate how to enhance the performance of memory access in future work. Fur-
thermore, a generic model based on a DAE architecture for BLAS will be designed.
Authors’ contributions
MZ and NG designed study and carried out the analysis. MZ and KR implemented the experiments, and contributed to 
writing and revising the paper. All authors read and approved the final manuscript.
Acknowledgements
Grateful acknowledgement is made to Mr. Yunkai Du who offered us the technologies to implement the experiments 
and helped us to revise the paper. Moreover, we thank American Journal Experts (AJE) for English language editing. 
Funding was provided by Natural Science Foundation of Anhui Province (Grant No. 1408085MKL06), and Project funding 
for academic innovation in Colleges and Universities (Grant No. B07033).
Competing interests
The authors declare that they have no competing interests.
Received: 24 March 2016   Accepted: 17 November 2016
References
Allen G et al (2009) A note on auto-tuning GEMM for GPUs. In: 9th international conference, Baton Rouge. LA, USA, pp 
884–892
Gao X, Chen YJ, Wang HD, Tang D, Hu WW (2010) System architecture of Godson-3 multi-core processors. J Comput Sci 
Technol 25(2):181–191
Goto K, Van De Geijn R (2008) High-performance implementation of the level-3 BLAS. ACM Trans Math Softw (TOMS) 
35(1):1–14
Fig. 12 Performance comparison of algorithm 2 for zGEMM
Page 22 of 22Zhang et al. SpringerPlus  (2016) 5:2014 
Goto K, Van De Geijn R (2008) Anatomy of high-performance matrix multiplication. ACM Trans Math Softw (TOMS) 
34(3):1–25
Gu NJ, Li K, Chen GL, Wu C (2008) Optimization of BLAS based on Loongson 2F architecture. Univ Sci Technol China 
38:854–859
He SS, Gu NJ, Zhu HT (2012) Optimization of BLAS for Loongson-3A architecture. J Chin Comput Syst 33(3):571–575
Hu WW, Wang R, Chen YJ et al (2011) Godson-3B: a 1 GHz 40 W 8-core 128 GFlops processor in 65 nm CMOS. In: 2011 
IEEE international solid-state circuits conference digest of technical papers (ISSCC). San Francisco, USA, pp 76–78
Hu WW, Gao YP, Chen TS, Xiao JH (2011) The Godson processors: its research, development, and contributions. J Comput 
Sci Technol 26(3):363–372
Netlib (2016a) Homepage of BLAS. http://www.netlib.org/blas/. Accessed 15 Mar 2016
Netlib (2016b) Homepage of Lapack. http://www.netlib.org/lapack/. Accessed 15 Mar 2016
Netlib (2016c) Homepage of Linpack. http://www.netlib.org/linpack/. Accessed 15 Mar 2016
Smith JE (1982) Decoupled access/execute computer architectures. In: The 9th annual symposium on computer archi-
tecture (ISCA ’82), CA, USA, pp 112–119
Smith JE (1984) Decoupled access/execute computer architectures. ACM Trans Comput Syst 2(4):289–308
Wang Q, Zhang XY, Zhang YQ, Yi Q (2013) AUGEM: automatically generate high performance dense linear algebra kernels 
on x86 CPUs. In: The international conference on high performance computing, networking, storage and analysis 
(SC ’13), New York, USA, pp 1–12
Zhang WL et al (2004) Analysis and optimization discussion on parallel Linpack. In: Institute of computing technology 
chinese academy of sciences eighth graduate symposium on computer science and technology, DaLian, China, 
2004
Zhang XY, Wang Q, Zhang YQ (2012) Model-driven level 3 BLAS performance optimization on Loongson 3A processor. In: 
The 2012 IEEE 18th international conference on parallel and distributed systems, Singapore, pp 684–691
Zhao Z, Gu NJ, Yang YZ, Ren KX (2014) Optimizing BLAS2 for a decoupled access/execute architecture processor. J Com-
put Inf Syst 10(3):1231–1241
Zhao Z, Gu NJ, Yang YZ (2015) GEMM optimization for a decoupled access/execute architecture processor. Int J Hybrid Inf 
Technol 8(7):375–388
Zhao Z, Gu NJ, Yang YZ (2013) Auto-tuning GEMM kernels for a decoupled access/execute architecture processor. In: 
2013 first international symposium on computing and networking (CANDAR), Japan, pp 233–239
Zhu HT (2011) High-density multi-core processor architecture research. Dissertation, University of Science and Technol-
ogy of China
