Common Midpoint (CMP) and Common Reflection Surface (CRS) are widely used methods for improving the signal-to-noise ratio in the field of seismic processing. These methods are computationally intensive and require high-performance computing. This article optimizes these methods on the Sunway many-core architecture and implements large-scale seismic processing on the Sunway Taihulight supercomputer. We propose the following three optimization techniques: 1) we propose a software cache method to reduce the overhead of memory accesses, and share data among CPEs via the register communication; 2) we re-design the semblance calculation procedure to further reduce the overhead of memory accesses; 3) we propose a vectorization method to improve the performance when processing the small volume of data within short loops. The experimental results show that our implementations of CMP and CRS methods on Sunway achieve 3.50Â and 3.01Â speedup on average compared to the-state-of-the-art implementations on CPU. In addition, our implementation is capable to run on more than one million cores of Sunway TaihuLight with good scalability.
the performance of seismic processing by exploiting the massive computation resources of Sunway TaihuLight at large scale.
Although existing works have explored different architectures to optimize seismic processing, it is impossible to naively adopt the existing works to Sunway due to its unique architecture design. Specifically, the following challenges need to be addressed in order to achieve good performance for the CMP and CRS methods on Sunway. First, unlike the traditional x86 processor, the design of the CPEs does not contain a cache, but a 64KB user-controlled scratch-pad memory (SPM), which means without careful management, the frequent accesses to memory could lead to severe performance degradation. Second, in order to achieve the ideal memory bandwidth on Sunway, the Direct Memory Access (DMA) transfers issued from the CPEs must contain at least 1024B data. However, in the CMP and CRS methods, only a tiny data ranging from 4B to 76B is required during each computation step, which is prohibitive to achieve the optimal performance on Sunway. Moreover, the operations applied to the tiny data within short loops make it difficult to take advantage of the vector units on Sunway processor.
In order to solve the above challenges, this paper proposes a re-design of the CMP and CRS methods on Sunway processor. In addition, several optimization techniques are also proposed to adapt to the architecture features of Sunway efficiently. The experiment results demonstrate our implementation of seismic processing achieves significant speedup when scaling to a massive number of Sunway cores. Specifically, this paper makes the following contributions:
We propose a software cache method for seismic processing on Sunway CPEs. This method utilizes the architecture features of DMA and Local Device Memory (LDM) on Sunway. When the memory access occurs, the CPE sends the data request to the software cache. After receiving the data request, the software cache retrieves data from memory through DMA, and then send the data back to the CPE. After that, the data is buffered in the software cache to alleviate the long memory access delay effectively. We re-design the Common Depth Point (CDP) procedure that dominates the performance of CMP and CRS methods to adapt to the Sunway architecture. Specifically, we combine multiple search processes onto a single CPE, and synchronize across search processes by buffering the intermediate results from each computation step. In addition, we combine the data to be accessed at each step of the search processes, and thus reduce the number of DMA accesses. We propose a vectorization method to improve the computation efficiency when processing the tiny data within short loops. We first convert the global reduction operations into several independent elementwise vector operations, and then use the vector array to perform element-wise vector operations with the ending element processed separately. The rest of this paper is organized as follows: In Section 2, we introduce the background of the CMP and CRS methods, as well as the Sunway architecture. Section 3 presents our design and optimization of seismic processing on Sunway to achieve massively scaling. The evaluation results are given in Section 4. Section 5 discusses the related work and Section 6 concludes this paper.
BACKGROUND

Common Midpoint Method
The fundamental idea of the CMP method is shown in Fig. 1a . The sound source placed at the source point S i is excited. After the sound wave is reflected by the underground reflection point R, the receiver on the surface receives the signal at point G i . Each data record captured is a seismic trace, and a group of seismic traces that share the same midpoint is called a CMP gather. When the reflection surface is horizontal, and the speed does not change horizontally, the CMP gathers are equivalent to Common Depth Point (CDP) gathers. This is the seismic data this paper deals with, so we use the term CMP and CDP interchangeably.
In the CMP method, traces belonging to the same CMP gather are corrected and stacked, which generates a stacked trace. As shown in Fig. 1b , before the traces are stacked together, a Normal Moveout (NMO) correction is applied to the reflection traveltimes according to the distances between their sources and receivers, which groups signals that are produced by the same reflectors. The quality of the stacked trace depends on the quality of the NMO correction. The NMO in the CMP method is to correct the hyperbolic curve (also known as traveltime curve), which depends on the distance between the source and the receiver as well as the average velocity in which the wave propagated during the seismic data acquisition. Although the distance is known in advance, the velocity is usually unknown. Therefore, it is necessary to find the best stacking velocity.
In order to find the best stacking velocity, the CMP method enumerates through different velocities. For each of enumerated velocities, it computes the semblance, a coherence metric that indicates whether the traveltime curve defined by a given velocity would produce a good stacking. The semblance computation is performed over a traveltime curve that intersects seismic traces. Considering that the traces are represented by discrete samples, some points of the intersections may not align with the actual elements in the dataset. Therefore, we use the interpolation of nearby samples to estimate the seismic amplitude at that point. The Equation (1) defines the computation for semblance, where S c is the semblance coefficient that represents the specific value of multiple trace stacking. In a CDP gather, we assume there are M traces, where f ij represents the jth sample of the ith trace, and the intersection of the traveltime curve of the trace is k. The semblance calculation is performed in a window of length w, which walks through the traces of the current CDP and access w samples in each intersection. The value of w is determined by the sampling interval during data acquisition. The detailed explanation of Equation (1) can refer to [14] . In the CMP method, there is no dependency between the computation of individual CDPs, so that they can be computed in parallel.
(1)
Common Reflection Surface Method
As shown in Fig. 2a , the ray from the exciting point S to the receiving point G is the central ray, whereas the ray from the exciting point S to the receiving point G is the paraxial ray. The central points m0 and m1 belong to CDP0 and CDP1, respectively. According to the paraxial ray theory, when processing the central ray SG, it requires the data of the paraxial ray S G. Therefore, when computing the CDP0, it requires the data from CDP1. In Fig. 2b , the orange point m0 represents the central point, and the neighbors within the radius r include four CDPs (m1; m2; m3; m4) that are represented with blue dots, and the remaining green dots are not in the neighborhood of m0. It also means, when processing CDP0, the data from CDP1, CDP2, CDP3 and CDP4 is required. The semblance computation in the CMP method can be easily extended to the CRS method. The only difference is to obtain the trace data of the CDPs in its neighborhood when processing the central CDP, as well as change the NMO curve to a curved surface. For large-scale processing, we partition the two-dimensional coordinates of the middle points of each CDP using grid, and map the grids to different CGs of the Sunway processor. The adjacent grids exchange data through asynchronous MPI communication.
The Sunway Many-Core Architecture
The Sunway TaihuLight supercomputer provides a theoretical peak performance of 125PFlops. It consists of 40,960 Sunway SW26010 processors with 1.4PB memory and an aggregated bandwidth of 4,473.16TB/s. The architecture of the Sunway processor is shown in Fig. 3 , which is composed of four core groups (CGs). Each CG contains a Management Processing Element (MPE) and 64 Computing Processing Elements (CPE), and each CG is attached to 8 GB DDR3 memory. The 8 GB attached memory can be accessed by both MPE and CPEs with the bandwidth of approximately 136 GB/s. The MPE has 32 KB L1 instruction cache and 32 KB L1 data cache, in addition to 256 KB L2 cache for both instruction and data. Each CPE has its own 16 KB L1 instruction cache but no data cache. However, there is a 64 KB local device memory (LDM) on each CPE that is explicitly managed by software. The CPEs can initiate a direct memory access (DMA) operation that reads data from memory to the LDM, or writes data from the LDM to memory. The CPEs in the same row or column of the CG can communicate with each other through register communication. Each CPE has a vector unit that supports 256-bit wide vector floating-point operation. The survey paper [15] has shown that the memory bandwidth of Sunway processor is quite limited compared to the massive computation power. Therefore, the most effective optimization techniques on Sunway include the rational use of LDM, data transfer through register communication between CPEs, computation acceleration through vector units and data access through DMA for higher bandwidth.
Challenges on Sunway Architecture
As the dominant computation of seismic processing with both CMP and CRS methods, accelerating the procedure of semblance calculation is critical to achieve satisfactory performance on Sunway. However, the data access pattern during the semblance calculation is prohibitive to obtain good performance on Sunway for two reasons: 1) the data accesses are random, which leads to high memory access latency due to the lack of data cache on CPEs; 2) the volume of data accesses is quite small, which is unable to fully utilize the precious DMA bandwidth as well as the vector units for performance acceleration. The specific challenges are as follows:
The random data accesses during the semblance calculation deteriorate the performance of seismic processing on Sunway. Due to the lack of data cache on CPEs, a software cache method is necessary to buffer the data accessed from memory by using the limited LDM on CPEs effectively. The semblance calculation only accesses a small volume of data, which is hard to utilize the DMA bandwidth fully. Therefore, it is necessary to re-design the process of semblance calculation by combining the computations and buffering the intermediate results on each CPE in order to improve bandwidth utilization. In addition to the low bandwidth utilization, the small volume of data during the semblance calculation also prohibits the exploration of vectorization. Therefore, to utilize the vector units on Sunway, it requires revealing the vectorization potential by adjusting the computation pattern of semblance calculation.
RE-DESIGNING THE SEISMIC PROCESSING FOR MASSIVELY SCALING
3.1 Design Overview Fig. 4 shows the design and optimization of the CDP computation of the CMP and CRS methods on Sunway architecture. First, the MPE on each CG reads the partitioned seismic data. Seismic data consists of several CDPs, and each CDP contains several traces, each of which is composed of ns samples. For the CMP method, the computation on each CDP is independent from the rest. Whereas for the CRS method, the computation of the central CDP requires data from the surrounding CDPs. In such case, the MPE calculates the twodimensional coordinates of the middle point of each CDP, and then divides the inner and outer regions according to the two-dimensional coordinates. The calculation of the outer region involves the region of the adjacent mesh, which requires the CDPs in the outer region to be sent to the adjacent mesh. As shown in Fig. 4 step 1 , the data transfer of the outer region and the calculation of the inner region is performed simultaneously through asynchronous MPI. After the central CDP receives the data from surrounding CDPs, the CDP computation is the same in both CRS and CMP methods. Therefore, we take the computation of a single CDP for an example to illustrate the optimizations we have applied on each Sunway CG. The CDP computation involves the semblance calculation that walks through the traces of the current CDP. In order to improve the performance of the CDP computation on Sunway, we propose several techniques to re-design and optimize the computation procedures. First, we use the master CPE and worker CPEs collaboratively to implement a software cache to eliminate random data accesses at the intersection (step 2 ). Second, we propose a vectorization method to improve computation efficiency when processing the tiny data within short loops (step 3 ). Third, we re-design the calculation process so that each worker CPE can process multiple sample-NMO velocity pairs simultaneously to further improve bandwidth utilization (step 4 ).
To better illustrate how our proposed techniques work together, we take the processing of one CDP as shown in the upper part of Fig. 4 for example. There are two adjacent traces (trace n and trace nþ1 ) in a single CDP stored in continuous memory region, and each trace contains three (sample, NMO velocity) pairs (e.g., P j , P jþ1 , P jþ2 ). When memory access occurs, the software cache first takes into effect (step 2 .) Every two adjacent CPEs in each row of the CPE mesh are organized into a group, with one of them serving as the master CPE and the other one serving as the worker CPE. The worker CPE first sends a data request to the master CPE in its own group through register communication. After the master CPE receives the request, it retrieves the data from memory through DMA, then sends the requested data back to the worker CPE. The requested data is buffered in the LDM of the master CPE. The vectorization method (step 3a and 3b ) converts the reduction operation into independent element-wise vector operations, then uses the vector array to perform element-wise vector operation with the ending elements processed separately. The re-designed calculation process (step 4 ) synchronizes the processing of P j , P jþ1 and P jþ2 in sequence on trace n , and buffers their intermediate results. Then, the CPE group continues to process the next trace (trace nþ1 ). The above steps are repeated until the last trace is processed. After all the (sample, NMO velocity) pairs have been processed, the computation of a single CDP completes.
Improving Parallelism Within a CG
Since the CDP computation dominates the execution time of seismic processing, the optimization of the CDP on a CG is critical to fully exploit the performance of Sunway processor. The maximal number of traces in a CDP is the fold of the dataset and the total number of CDPs in a dataset is denoted as ncdps. Each seismic trace is represented by an array, where each element is a sample. We assume that the seismic traces have the same number of samples (ns) across all CDPs, which is widely accepted in literature [16] . Fig. 5a shows that a CDP contains four traces, each of which contains several samples. In the same CDP, two adjacent traces are stored in the contiguous memory region. In addition, the samples of a single trace are also stored continuously. As shown in Fig. 5a , the center of the four colored boxes is the intersection of the traveltime curve with the four traces. The semblance is computed within a window of width w, which also represents the number of samples in each color boxes.
In order to achieve parallel processing of the CDP on a single CG, we chose a grid-based search method. The entire computation is divided into three phases, including initialization, calculation and result writing back. The initialization phase is performed on the MPE. First, the CDP data is accessed, and the NMO velocity array is generated according to the upper and lower bounds of the NMO velocity. Then the halfpoints are computed that are necessary for calculating the traveltime curve. The NMO velocity is stored in an array of size nc, whereas the halfpoints are stored in another array, each element of which corresponds to a trace in the CDP. The MPE then creates a semblance matrix S with a size of ns Â nc in memory, which is used in the semblance computation to find the most coherent NMO speed. At the same time, the MPE also creates an array with a size of ns in memory to store the most coherent NMO velocity.
The calculation phase involves finding the most coherent NMO velocity for each sample from the NMO velocity array. For each sample, we enumerate the elements in the NMO velocity array, and compute the semblance of each (sample, NMO velocity) pair by walking through the traces of the current CDP according to the traveltime curve. Each (sample, NMO velocity) pair is independent from each other and can be processed in parallel. Fig. 5b shows how the computing grid is divided among the CPEs and how the results are written back to the semblance matrix. Each point in the computing grid represents a (sample, NMO velocity) pair. In this example, there are 6 NMO speeds (nc) and five samples (ns). The enlarged area shows how the points in the computing grid are mapped to CPEs, which means each CPE is responsible for computing a (sample, NMO velocity) pair. After filling in the entire semblance matrix, we need to find the NMO velocity with the biggest semblance value for each sample. This velocity is the best coherent velocity required.
The computation procedure of a single CDP is shown in Algorithm 1. For each CDP, it enumerates the sample-NMO velocity pairs (line 2), and then finds the intersection of the traveltime curve and traces. At each intersection, it first obtains the halfpoint of the current trace (line 9-11), then accesses the data with size of w (line [12] [13] , and finally retrieves the data computed in a window of width w (line [14] [15] [16] [17] [18] [19] . Each trace has its own corresponding halfpoints, therefore the accesses to halfpoints are continuous when walking through the traces sequentially. Based on this observation, we can reserve a space h s of size size h on LDM to prefetch the halfpoints in advance and save them in h s. We can control the amount of prefetched data in LDM by adjusting size h. for
ac linear 0 7: den 0 8:
if t%size h ¼¼ 0 then 10:
prefetch size h halfpoints through DMA 11:
end if 12:
calculate index k1 of random data access 13:
get data of size w by k1 through DMA 14:
for
ac linear ac linear þ v 19:
end for 20: end for 21: end for 22: end function
Eliminating Random Memory Access
Software Cache Within a CG
Due to the limited memory bandwidth on Sunway, we propose a software cache to alleviate the long memory access delay caused by random data access. We design a software cache, that is, two adjacent CPEs in each row of the CPE mesh are organized into a group, and one CPE in each group is selected to act as the software cache of the group. The selected CPE for caching is the master CPE, and the other one is the worker CPE. When memory access occurs, the master CPE accesses the data through DMA and distributes the data to the worker CPE through register communication. Existing research [17] reveals that when the accumulative data size of the DMA accesses from the 64 CPEs within a CG is less than 1024B, the achievable DMA bandwidth is proportional to the size of data accesses. In both CMP and CRS methods, the maximum size of data access is 76B. Therefore, the proposed software cache is capable of combining multiple DMA accesses, which not only reduces the number of memory accesses, but also increases the achievable DMA bandwidth. When designing the software cache, we also consider the computation characteristics of CDP. The memory accesses at the software cache during the CDP computation are shown in Fig. 6 . We denote the processing of a trace as a phase. The master and worker CPEs calculate their corresponding memory region of data access, and the worker CPE sends the requested memory region to the master CPE. After the master CPE receives the request, it identifies the minimum and maximum memory address among the regions, and then copies the data between the minimum and the maximum address to the LDM of the master CPE. The master CPE sends back the data to the worker CPE based on the requested memory region. Then both the master and worker CPEs start their corresponding calculations. When the master and worker CPEs finish processing the current trace, they proceed to the next trace.
We implement a synchronization-free mechanism to reduce synchronization overhead for the communication between the master and worker CPEs. Fig. 7 shows the communications between the master and worker CPEs during the CDP computation process. First, a request signal is sent to the master CPE from the worker CPE. After obtaining the data from memory, the master CPE sends the data back to the worker CPE. We name the above procedure as a round of communication. After multiple rounds of communication, the worker CPE finishes the calculation and sends a fin signal to the master CPE. After the master CPE receives the fin signal, it also sends a fin signal to the worker CPE, and releases the software cache. Then the master CPE enters the calculation mode to complete the remaining calculations.
Note that the correctness of our synchronization-free mechanism does not depend on the amount of tasks assigned to each CPE. Even if the worker CPE ran slower than expected, it does not jeopardize the correctness of the results. Because if the master CPE finishes its computation tasks earlier, it still functions as the software cache for the consecutive rounds of communication, which means it continues to serve the data requests from the worker CPE through DMA operation and register communication. When the worker CPE finishes all the computations, it sends the fin message to the master CPE to enter the end phase of the synchronization. Since the register communication is implemented in blocking fashion on Sunway [18] , the master CPE does not proceed until it receives the fin message from the worker CPE as shown in Fig. 7 . The blocking communication guarantees the correctness of our synchronization-free mechanism.
Re-Designing the Computation for Semblance
Due to the poor data locality in the semblance calculation, we re-design the semblance calculation procedure and the data access pattern in order to reduce the number of memory accesses and improve the data reuse. The original procedure of the semblance calculation is shown in Fig. 8a . For each trace, the computing grid contains ns Â nc points. When performing the search, there are ns Â nc intersections randomly distributed in the trace, which leads to a large number of random memory accesses. As shown in Fig. 8b , after the calculation re-design, we process all the intersections in a trace continuously, and save the intermediate results from the computation of each intersection before moving on to the next trace. The above procedure is repeated until the last trace is processed. After the re-design, the calculation of the next trace can reuse the intermediate results from the previous trace in the LDM. In addition to the calculation re-design, we also re-design the data access pattern. Each CPE has ns Â nc Ä cores intersections of a trace. Before the re-design, each data access happens in a different time period, with no opportunity to merge data accesses or reuse the data. However, after the re-design, the intersections on each CPE are processed continuously. Based on this property, we identify the minimum (min la) and maximum (max lb) memory regions for all the samples within a trace, and prefetch the data between the memory region min la and max lb before processing the trace.
Algorithm 2 presents the re-designed procedure of the semblance calculation. Multiple sample-NMO velocity pairs are processed simultaneously on a single CPE. For each Fig. 7 . The synchronization-free mechanism for the communication between the master and worker CPEs. sample-NMO velocity pair, the num, ac linear and den variables used during the computation have been expanded with one more dimension respectively for buffering data (line 2-8), compared to Algorithm 1. After initialization, the traces in a CDP are processed in sequence (line 9) and the data halfpoints is prefetched before a new trace is processed (line [10] [11] [12] . For the current trace, the memory addresses of the data accesses are calculated for each sample-NMO velocity pair and kept in the k1 array (line [13] [14] [15] . Then, the maximum and minimum memory address in k1 array is identified (line [16] [17] [18] and used to determine the memory range (length) of data accesses (line 19). The data within the memory range is copied to LDM at one time through DMA operation (line 20). Finally, the calculation is performed in a window size of w for each sample-NMO velocity pair and the intermediate results are kept in the num, ac linear and den arrays (line [21] [22] [23] [24] [25] [26] [27] [28] [29] . After processing the current trace, the algorithm continues to process the next trace until all traces in the CDP are processed. The final results are stored in the num, ac linear and den arrays. ac linear½i 0 den½i 0 8: end for for t ¼ 0 ! ntrace do 10:
if t%size h ¼¼ 0 then prefetch size h halfpoints through DMA 12:
end if for each sample-NMO pair i do 14:
calculate index k1½i of random data access end for 16:
for each sample-NMO pair i do find the min and max val min la max lb in k1½i 18:
end for length max lb À min la 20:
get data of size length by min la through DMA for each sample-NMO pair i do 22:
k k1½i À min la
end for end for 30: end for end function
Compared to the CMP method, the CRS method is more computationally intensive. Due to the limited LDM on each CPE, we cannot prefetch the data of all merged intersections at once. Therefore it is necessary to tile the merged intersections in order to assign the computation to multiple tasks. For instance, if the original loop size is len i to process the merged intersections. After tiling, the loop is divided into two tightly nested loops. The inner loop size is tile size and the outer loop size is len i/tile size. The LDM space occupied by the merged intersections is proportional to the tile size other than the len i. Therefore, the tile operation allows the program to control the usage of LDM by the merged intersections effectively.
Exploiting Vectorization
We further exploit the opportunity for vectorization after the re-design of semblance calculation. As shown in Algorithm 1, each sample-NMO velocity pair maintains the corresponding num array and ac linear, den variables. In the innermost loop, the element-wise vector calculations are applied to the num array, whereas the reduction calculations are applied to ac linear and den variables. As shown in Fig. 9 , the randomly accessed data is only a small portion of the samples from each trace, which is recorded as sub samples. To vectorize the above calculations on Sunway, two challenges need to be addressed. First, the window size w may not be a multiple of four, which is the width of the vector unit on Sunway. Considering the reduction operation, if we directly vectorize the innermost loop, then for each sub samples, the ending data cannot be effectively vectorized which requires additional processing. In particular, if w is small, which means the innermost loop is a short loop, then the overhead of processing the ending data outweighs the benefit of vectorization. Second, sub samples may not be 32B aligned in LDM due to the random data access. On Sunway, the unaligned SIMD load/ store throws an exception and then is split into several normal load/store instructions, which fails to exploit the computation capability of the vector unit. Fig. 9 shows an example on how the vectorization method is applied to a single CDP. In order to load the unaligned sub samples into the vector register, we use the simd set floatv4 instruction that can load four unrelated float variables into the float v4 variable, without requiring these four variables to be 32B aligned. However, compared to the standard simd load instruction, it requires multiple LDM accesses. For element-wise vector operations, we use the vector array that consists of the float v4 vector variables with the length of ceilð wÂ1:0 4 Þ. Taking Fig. 9 as an example, when w equals to 11, the sub samples i contains 11 elements of s 1 $ s 11 , and the num vector array contains three float v4 variables of va, vb, and vc, altogether representing the sub samples i . Three simd set floatv4 instructions are required to load s 1 $ s 12 into the vector array in order to perform element-wise vector calculations. For the reduction calculation, the vector array consists of two float v4 variables vd and ve. The s 1 $ s 4 and s 5 $ s 8 are first loaded into vd sequentially, and then the vector calculation is performed. After that, the s 9 $ s 12 are loaded into ve for the vector calculation. When the calculation of the current trace (trace i ) completes, it proceeds to the next trace (trace iþ1 ). After all the traces in a CDP are processed, the num vector array contains the results of element-wise vector operations on all sub samples of the CDP. To derive the results of reduction calculations, the four elements in vd and the first three elements in ve need to be accumulated.
In Fig. 9 , the data in s 12 is invalid, and thus the result in this corresponding position is also invalid for both elementwise vector calculation and reduction calculation. Although it seems to consume extra space and computing resources, such design can effectively reduce the overhead of processing the data at the end of sub samples in the short loop. With the re-design of semblance calculation, the intermediate results of processing multiple sample-NMO search pairs need to be buffered on the same CPE. The intermediate results of element-wise vector calculations and reduction calculations including vector arrays num, ac linear and den are also need to be buffered.
Asynchronous Parallel Processing Among CGs
For both CMP and CRS methods, the semblance calculation for a single CDP is the same, while the calculation among CDPs is quite different. For the CMP method, there is no dependency among different CDPs. Therefore, for largescale processing, we use the CDP as the granularity of a task. We divide the data into many partitions, and each MPE reads a separate data partition and processes the CDPs within the partition by assigning the CDP computation to the CPEs. In the CRS method, each CDP calculates the two-dimensional coordinates of the middle point according to the coordinates of the source point and the receiving point. As shown in Fig. 10 , each CDP draws a circle with a radius of amp based on its two-dimensional coordinates of the middle point. The data of all CDPs in this circle is collected by the central CDP to be processed. Therefore, for the CRS method we take the computation of the central CDP as the task granularity. We divide the coordinate grid by the coordinate of the middle point.
Specifically, we pre-process the CDP data, during which the CDP data belonging to the same grid is written to the same data partition. Each MPE reads a data partition, aggregates the traces into corresponding CDPs, and then calculates the middle coordinates for each CDP. As shown in Fig. 10 , each MPE in a CG possesses the CDP data based on its grid partition. The outer boundary (the gray dash line) is naturally determined based on the grid partition. Whereas the inner boundary (the blue dash line) of a grid is determined in the following way. From the CDPs nearest to the outer boundaries of its adjacent grids, a cycle is drawn from each CDP with a radius of cvr. Then the lines connect the farthest points of the cycles that reach the target grid form the inner boundary of the target grid. The circle on the right side of Fig. 10 illustrates the above process on how the inner boundary of a grid is determined. In our implementation, we set cvr as the same value to amp. This is because the computation of each CDP only requires the data in the circle with the radius of amp.
In the CRS method, the CDPs (the gray circles) residing in the outer region (between outer and inner boundary) needs to exchange data with the adjacent grids. Whereas the CDPs (the blue circles) residing in the inner region (within the inner boundary) does not need to exchange data with the adjacent grids. Because all the CDP data required for the computation of each CDP within the inner region is already in the grid. There is a special case when exchanging the CDP data at the corners (the purple square) of each grid, because it requires the data from the CDP at the corner along the diagonal grid. The data exchange between the corner CDPs along the diagonal can be achieved by leveraging the grid partition. Based on the indexes of the upper and lower adjacent grids, a grid can determine the indexes of the diagonal grids (e.g., increase or decrease the index of the upper or lower adjacent grid by one) with which to exchange the data of the corner CDPs.
Before exchanging the data of the outer region, the MPE on each CG packs the CDP data and sends it to the adjacent and diagonal grids in all directions, as well as receives CDP data vice versa. Since the computation of the inner region does not require data from other grids, the MPE assigns the computation of the inner region to the CPEs for parallelization. The computation of the inner region and data exchange of the outer region can be performed simultaneously by using MPI asynchronous communication. After the inner region is processed, the MPE checks whether the asynchronous communication finishes. After each grid receives the outer region data sent by its adjacent and diagonal grids, each MPE calls CPEs to process the outer region of its own grid in parallel.
EVALUATION
Experimental Setup
In the experiments, we use the Sunway SW26010 processor for performance evaluation. For comparison, we use thestate-of-the-art implementations [16] of the CMP and CRS methods running on 2 Intel E5-2680 V4 processors, 2 Intel Xeon Gold 6126 processors and Nvidia K40 GPU. The reason why these three particular processor models are chosen is that they deliver similar peak performance (2.16TFLOPS, 3.99TFLOPS and 4.29TFLOPS for 2xE5-2680V4, 2xGold 6126 and GPU K40, respectively) as one Sunway processor (3.52TFLOPS). We use the -O3 and -DNDEBUG options to compile the program. We also turn on the auto-vectorization during the compilation. We generate eight diverse seismic datasets with detailed properties shown in Table 1 . In general, the number of CDPs (ncdps) is proportional to the size of the dataset. Our synthesized datasets contain the number of CDPs ranging from 65,536 to 2,648,430. For a single CDP, the fold describes the number of traces contained in a CDP, the ns describes the number of samples in each trace, and the dt determines the number of data per random data access. Since there are no public seismic datasets available, our datasets are synthesized with diverse properties that we believe to be representative. The performance metric used in the evaluation is semblance trace=s, which equals to the number of intersections produced by all semblance calculations divided by the total execution time.
In the field of seismic processing, single-precision floating point is accurate enough to derive valid results [16] . Hence, all evaluation results presented in this paper are in singleprecision floating point. In order to verify whether our approach affects the accuracy of CMP and CRS method, we provide the relative error of the results compared to the executions on CPU. In addition, we compare the relative error of our optimized parallel implementations on CPEs, the sequential implementations on MPE as well as the parallel implementations on GPU. Since the trend of the relative error is almost the same between CMP and CRS method, we only provide the relative error of CRS in Table 2 . It is clear that the relative error of CRS running on Sunway is much smaller compared to running on GPU. In addition, the relative error of the parallel implementation on CPEs is almost the same compared to the sequential implementation of MPE. This demonstrates our approach hardly affects the accuracy of the CMP and CRS method.
Note that the reason for the lower relative error of CRS on Sunway compared to GPU is that, on Sunway processor, the single-precision floating-point (32-bit) data are converted to double-precision floating-point (64-bit) data by the load and store instructions automatically. The converted data are then processed by the corresponding arithmetic instructions. For the vectorization, a single-precision floating-point vector (128-bit) is represented in 256 bits in a vector register, where each element is represented in exactly the same way as a single-precision floating-point (32-bit) representation in a 64-bit floating-point register [17] . Since the input data is in singleprecision floating-point, the CRS running on GPU is using single-precision floating-point computation for higher performance. This explains why the relative error of CRS on Sunway is lower than on GPU.
Single Node Evaluation
The performance comparison of the CMP and CRS implementations on one Sunway processor, dual CPUs and GPU K40 is shown in Fig. 11 . We scale down the ncdps of all datasets in Table 1 to 8 in order to fit the resources on a single node across all architectures. The performance on dual CPU E5-2680 V4 is chosen as the baseline. We also show the performance impact after applying our optimization techniques such as software cache, calculation re-design and vectorization (SIMD). As shown in Fig. 11 , the naive implementations on Sunway is limited by the memory bandwidth and cannot fully utilize the computation power of CPEs. It is also clear that our optimization techniques are quite effective to mitigate random memory accesses as well as exploit the vectorization for improving the performance of seismic processing on Sunway. After applying all our optimization techniques on Sunway, the CMP and CRS method achieves 3.50Â and 3.01Â speedup on average respectively across all datasets compared to the baseline. We notice that the CMP method achieves better performance in all eight datasets compared to GPU, whereas the CRS method is slightly worse than GPU on three datasets (data2, data3 and data4). This is mainly due to the limited memory bandwidth of Sunway processor (136 GB/s), whereas the memory bandwidth of GPU K40 is higher by an order of magnitude (288 GB/s). We also notice that for Xeon Gold 6126 processor, the performance of both CMP (with data1, data2, data3 and data4) and CRS (with data3, data4, data5, data6, data7 and data8) methods are no better than on Xeon E5-2680V4 processor. The reason is that the higher peak performance of Xeon Gold 6126 heavily relies on the AVX512 units, which cannot be utilized by the original implementation of CMP and CRS methods [16] . This also demonstrates the necessity of the vectorization method proposed in this paper for achieving better performance of seismic processing on processors with ever-increasing vector length.
In addition, we evaluate the cost (performance/$) and power efficiency (performance/Watt) of each processor when running the CMP and CRS methods. To measure the performance/$ and performance/Watt, we reference the market price and TDP of E5-2680V4 [19] , Xeon Gold 6126 [20] and Nvidia K40 [16] . However, there is no public information of the market price and TDP for Sunway processor. We derive the TDP of Sunway processor from [15] , which reveals that each Sunway processor provides a peak performance of 3.06TFlops, with a performance-to-power ratio of over 10GFlops/Watt. Therefore, we can infer the TDP (large scale)   data1  60  550  220  2,648,430  data2  60  550  240  2,648,430  data3  60  1,650  220  1,000,000  data4  60  1,650  240  1,000,000  data5  1,000  550  220  147,456  data6  1,000  550  240  147,456  data7 1,000 1,650 220 65,536 data8 1,000 1,650 240 65,536 of Sunway processor is about 306Watt. We estimate the market price of Sunway processor based on the Intel Knight Landing many-core processor, which contains 64 physical cores with a peak performance of 2.662TFlops [21] . The market price and TDP of Knights Landing processor is $2,147 and 230Watt [22] respectively. We scale the market price of Sunway processor in proportion of the peak performance compared to Knight Landing processor as shown in Equation (2). The market price and TDP of the three processors are listed in Table 3 .
2:662ðTFlopsÞ 2; 147ðdollarsÞ ¼ 3:06ðTFlopsÞ estimated priceðdollarsÞ :
As shown in Fig. 12 , for both CMP and CRS methods, Sunway processor exhibits better performance per price than CPUs and K40, which indicates Sunway processor is cost-effective and thus competitive in the HPC market. In terms of power efficiency, for both CMP and CRS methods, the performance per Watt of Sunway processor is higher than the CPU processors such as E5-2680V4 and Gold 6126. However, for CRS method, the performance per Watt of Sunway processor on all datasets is lower than GPU K40. This is mainly due to the much higher TDP of Sunway processor (306Watt) compared to GPU K40 (235Watt). Whereas for CMP method, the performance per Watt of Sunway processor is much higher than GPU K40 on four datasets such as data1, data2, data5 and data6. The opposite trend of power efficiency with CMP method on the above datasets can be attributed to the much lower computation intensity of CMP method (determined by the ns property of the dataset) compared to CRS. Therefore, the performance of CMP method on GPU K40 is much lower compared to CRS. However, with our optimizations such as vectorization, the performance of CMP method improves significantly on Sunway processor and thus leads to higher power efficiency compared to GPU K40. This is consistent with the performance speedup shown in Fig. 11 .
Understanding the Tradeoff in Software Cache
To better understand the tradeoffs in our implementation, we take the design of software cache for an example. The number of CPEs assigned to a group actually reflects the tradeoff between register communication delay and DMA access. Because with more CPEs in a group, we can merge the data requests in the same group to reduce the number of DMA accesses. However, more CPEs in a group also mean longer delay when sending and receiving messages to/from master CPE through register communication. This is because the master CPE can process one register communication at a time, therefore with more CPEs assigned to a group, the register communication is queued up at the master CPE that increases the communication delay significantly. The optimal number of CPEs in a group strikes a balance between register communication delay and DMA access. Therefore, we evaluate the performance of CMP and CRS when using two, four, and eight CPEs of a group in the software cache. The reason for evaluating two, four, and eight CPEs is because 1) all CPEs can be fully assigned to the groups in such setting, and 2) register communication is only performed in the same row or column, assigning more than eight CPEs to a group requires complex mechanism for communication within the group, which is detrimental to the performance. The results shown in Fig. 13 indicate the software cache achieves the best performance when each group contains two CPEs. In such setting, the register communication delay and the number of DMA accesses reach the balance in terms of performance. 
Roofline Model Analysis
We use the roofline model analysis to better understand the performance impact of our proposed optimization techniques on Sunway. Due to the similar computation pattern between CRS and CMP, we only provide the roofline model analysis of CRS for conciseness. We analyze the performance results of CRS implementation on data1 dataset. Other evaluation results show a similar tendency. As shown in Fig. 14, the operational intensity of the original program is 1.52 FLOPS/byte. In addition, the roofline model of a Sunway CG reveals that in order to fully utilize its performance, 33.84FLOPS calculations should be performed when accessing one-byte data in memory. As shown in Fig. 14, after applying our software cache, the operational intensity is doubled due to the data access by different intersections can be used by each other. In addition to the software cache, after re-designing the procedure of semblance calculation, the operational intensity I can be derived using Equation (3). For a particular dataset, w is a constant, the size of the tile is mainly determined by the size of the LDM, and size get refers to the size of the data accessed by a DMA operation on a CPE. The more intersections processed by a single CPE at a time, the more data is overlapped and can be reused for later calculation. The operational intensity after applying the calculation re-design increases to 16.96 FLOPS/byte. The roofline model analysis demonstrates our optimization techniques are effective in improving the performance of seismic processing on Sunway. 
Scalability
We evaluate both the strong and weak scalability of the CMP and CRS methods on Sunway. The performance is measured by semblance trace=s of both methods, excluding the I/O time. For strong scalability, the number of CGs used for seismic computation scales from 1,024 to 16,384 with the input dataset unchanged. For weak scalability, when the number of CGs doubles, the size of the input dataset also doubles. The detailed data sizes for each input dataset used in the weak and strong scalability experiments are listed in Table 4 . We use the performance when running on 1,024 CGs as the baseline. The evaluation results for strong scalability is shown in Fig. 15 . Since the CMP method does not exchange data between processes, it maintains good scalability in general. Whereas the CRS method exchanges the boundary data between processes, its scalability is poorer than CMP method in all cases. Fig. 16 shows the evaluation results of weak scalability. We use 16,384 CGs to process the dataset with maximum size, and scale down the size of the dataset as the number of CGs decreases. Similar to the strong scalability experiments, the CMP method achieves better scalability compared to CRS in all cases. Note that each CG contains 65 Sunway cores.
Therefore the number of cores used in the experiments ranges from 66,560 (1; 024 Â 65) to 1,064,960 (16; 384 Â 65, more than one million Sunway cores!). The scalability results demonstrate our implementations of seismic processing are capable to run in large scale on Sunway TaihuLight supercomputer.
Portability
Although the proposed optimization techniques are targeting the Sunway TaihuLight supercomputer, these techniques are also applicable to other systems [23] , [24] , [25] that adopt a similar many-core cache-less architecture. For example, the Cell processor [25] uses a PowerPC core that controls eight simple synergistic processing elements (SPEs). The Cell architecture has an explicitly controlled memory hierarchy and the parallelism between the eight SPEs and the PowerPC also needs to be managed explicitly. Specifically, these optimization techniques include: 1) the software cache method can effectively alleviate the long memory access delay during seismic processing on systems that lack L2 cache or with limited L2 cache; 2) the re-design of semblance calculation procedure increases the computing intensity of seismic processing significantly as shown in the roofline model analysis;
3) the vectorization method improves the computation efficiency when processing the tiny data within short loops. For the x86 processors with ever-increasing vector length such as Intel KNL and Skylake, the re-design of semblance calculation procedure as well as the vectorization method proposed in this paper are also useful for boosting the performance. Especially, the vectorization method is necessary for seismic processing to exploit the powerful vector units on Intel KNL and Skylake. For example, in the process of CDP computation, we can use the vector array to convert the reduction operation into element-wise operation, in order to use the AVX512 instruction efficiently.
RELATED WORK
Performance Optimization of Seismic Processing
There has been a lot of work trying to improve the CMP method. Silva et al. [26] evaluate the performance of the CMP method on different platforms. The CMP method is implemented using the SYCL programming model and compared with the implementations using OpenCL and OpenMP. However, the evaluated platforms have high memory bandwidth, which does not suffer the performance problem on Sunway due to the limited memory bandwidth. Zeng et al. [27] explore a different signal-to-noise ratio optimizer with the time-frequency domain-phase weighted stacking. They implement their method using the FFTW C library and the cuFFT CUDA library with significant performance improvement. However, these high-performance CUDA libraries do not exist on the emerging architectures such as Sunway. Lawrens et al. [28] analyze the characteristics of the CRS algorithm and applies the NUMA parallel computation scheme to optimize the CRS-Stack computation. Due to the unique architecture design of Sunway processor, existing optimization techniques of seismic processing are difficult to apply to Sunway directly. The above reasons motivate our work to redesign and optimize the CMP and CRS method to adapt to the Sunway architecture, so that they can fully exploit the massive computation power of Sunway TaihuLight. [30] implement an efficient multi-role based SpTRSV algorithm on Sunway. It leverages the unique register communication mechanism to address memory bandwidth limitations. Chen et al. [31] re-design the earthquake simulation algorithm to reduce memory access costs tailored for the heterogeneous many-core architecture of Sunway. All the above optimization works on Sunway have inspired our re-design and optimization techniques for the CMP and CRS method on Sunway. To the best of our knowledge, this paper is the first work to implement large-scale seismic data processing on the Sunway TaihuLight supercomputer with highly optimized CMP and CRS implementations targeting the Sunway architecture.
Performance Optimization on Sunway
CONCLUSION AND FUTURE WORK
In this paper, we propose efficient implementations of seismic processing using both the CMP and CRS methods on the Sunway TaihuLight supercomputer for massively scaling. Specifically, we propose a software cache to alleviate the random memory accesses during the computation. We re-design the semblance calculation procedure to improve the bandwidth utilization by combining the search processes and buffering the intermediate results on each CPE. Moreover, we propose a vectorization method to improve computation efficiency when processing tiny data within short loops. The experimental results show that our implementations of the CMP and CRS method on one Sunway processor achieve 3.50Â and 3.01Â speedup on average, respectively than the-state-of-the-art implementations on CPU. Moreover, our approach is able to scale to more than one million Sunway cores with good scalability. For the future work, we would like to explore the following directions: 1) applying auto-tuning approach in our optimization process. Currently, there are several optimization parameters in our implementation that are determined based on empirical studies, such as tile size, prefetched data size and software cache size on LDM. We would like to exploit the auto-tuning techniques, including simulated annealing, genetic algorithms, and machine learning algorithms to automatically identify the optimal parameter settings; 2) adapting our work to the incoming exascale high performance computer based on Sunway processor. There is a constantly evolving ecosystem based on Sunway processor, and we would like to apply our optimization techniques on the next generation Sunway processor and the near future exascale system, to enable even larger scale seismic processing.
Yongmin Hu is working toward the graduate degree in the School of Computer Science and Engineering, Beihang University. He is currently working on performance optimization of scientific applications. His research interests include HPC, performance analysis and optimization.
Hailong Yang received the PhD degree from the School of Computer Science and Engineering, Beihang University, in 2014. He is an assistant professor with the School of Computer Science and Engineering, Beihang University. He has been involved in several scientific projects such as performance analysis for big data systems and performance optimization for large scale applications. His research interests include parallel and distributed computing, HPC, performance optimization, and energy efficiency. He is a member of China Computer Federation (CCF).
Zhongzhi Luan received the PhD degree from the School of Computer Science of Xian Jiaotong University. He is currently an associate professor of Computer Science and Engineering, and an assistant director of the Sino-German Joint Software Institute (JSI) Laboratory at Beihang University, China, since 2003. His research interests including distributed computing, parallel computing, grid computing, HPC, and the new generation of network technology.
Lin Gan received the PhD degree in computer science from Tsinghua University. He is currently an assistant researcher with the Department of Computer Science and Technology at Tsinghua University, and an assistant director of the National Supercomputing Center in Wuxi. His research interests include high performance computing solutions based on hybrid platforms such as GPUs, FPGAs, and Sunway CPUs. He is the recipient of the 2016 ACM Gordon Bell Prize, the 2017 ACM Gordon Bell Prize Finalist, the 2018 IEEE-CS TCHPC Early Career Researchers Award for Excellence in HPC, and the Most Significant Paper Award in 25 Years awarded by FPL 2015, etc. He is a member of IEEE. 
