ABSTRACT Graphs are ubiquitous, and graph analytics has been widely adopted in many big data applications such as social computation and natural language processing, as well as web search and recommendation systems. Prior research focuses on processing large-scale graphs in distributed environments or on a single multi-core machine with several terabytes of RAM. Increasingly complex memory systems and on-chip interconnects are being developed to mitigate the data movement bottlenecks in manycore processors such as the Xeon Phi KNL CPU, which has heterogeneous memory and up to 72 dual-core tiles. This paper presents a detailed study of the characteristics of manycore memory systems and their impact on the efficiency of graph analytics. Based on this study, we introduce Ants, the first graph analytics platform on manycore memory systems. First, Ants differentially allocates graph data according to their access patterns and the behavior of heterogeneous memory. Second, to reduce excessive memory accesses and ease congestion on interconnects and memory controllers, Ants develops a fine-grained and effective task partitioning strategy for many cores. A detailed experiment on a machine with 64 dual-core tiles shows that Ants outperforms Ligra, the state-of-the-art graph analytics platform, by up to 8.97X on real-world graphs.
I. INTRODUCTION
Graphs are powerful data structures that can represent complex relationships among various entities, such as people, webpages, articles and neurons. Many graph processing systems have been developed for analyzing massive graph data and extracting valuable information.
Distributed graph processing systems, including Pregel [1], GraphLab [2], Giraph [3], GraphX [4] and Gemini [5], analyze graphs with billions of edges across multiple cluster nodes. However, it is hard to partition graphs among nodes in a way that balances load and reduces communication overhead. As an alternative, several systems for a single-server machine have been proposed. Galois [6] and Ligra [7] target multi-core, shared-memory servers and keep all graph data in memory for consistent performance. Furthermore, some researchers take advantage of HDDs/SSDs and GPUs to expand limited memory capacity and computing power. GraphChi [8], X-Stream [9] and GridGraph [10] are designed for processing large graphs by relying on secondary storage. FlashGraph [11] and Graphene [12] utilize a flash array for higher I/O efficiency. Totem [13], GTS [14] and Garaph [15] attempt to process large-scale graphs on a CPU/GPU hybrid platform. However, data transmission between memory and the secondary storage or GPU tends to emerge as the bottleneck.
Today, a commodity single server can easily aggregate hundreds of GBs to TBs of DRAM, and manycore processors, such as the Intel Xeon Phi KNL or the Oracle SPARC, provide growing computing capability on a single chip. This capability provides a good opportunity for building an efficient graph processing system on those platforms. The Knights Landing (KNL) [16] processor is an x86-64 compatible processor architecture with up to 72 dual-core tiles, and it has a heterogeneous memory that combines faster integrated MCDRAM with conventional DRAM. It has three memory models and five cluster models, giving fifteen different configurations in total.
In this paper, we first make a comprehensive study of the characteristics of the manycore memory machine and of how its configurations affect the efficiency of existing graph processing systems. MCDRAM has a higher bandwidth than regular memory, but also a longer latency and a limited capacity. KNL adopts a Cache/Home Agent to keep the L2 caches coherent across many tiles using a MESIF protocol. With so many cores computing in parallel, it is very easy to cause excessive memory accesses and congestion on interconnects and memory controllers. Based on the above observations, we propose Ants. It inherits the scatter-gather programming interface and vertex-centric iteration pattern from Ligra. The innovation of Ants lies mainly in the following two aspects. First, Ants adopts the Flat memory model for its flexibility in allocating heterogeneous memory space. Graph vertex data and application-defined data are kept on MCDRAM to exploit its high bandwidth within its limited capacity. Graph edge data are usually a few times larger than vertex data and are stored on DRAM. Graph runtime states, temporary arrays and variables are also placed on DRAM in consideration of its lower latency. Second, Ants employs a fine-grained task partitioning strategy to balance load between worker threads. At the same time, it ensures that the partition of application-defined data that each thread needs to update just fits in a CPU cache line. This mechanism greatly reduces memory accesses and congestion on interconnects, resulting in a large performance improvement.
We have implemented Ants and several typical graph applications in C++. A detailed evaluation has been made on a KNL machine with 64 dual-core tiles, running 256 worker threads simultaneously, with five real-world graphs. The experimental results show that Ants often outperforms Ligra, which is a very popular graph analytics platform. The source code of Ants can be found at https://github.com/xinghuan1990/Ants. This paper makes the following contributions:
• We present an extensive analysis that uncovers the characteristics of manycore memory systems and issues with graph analytics platforms.
• The Ants system exploits an effective data layout for the heterogeneous memory hierarchy and a fine-grained task partitioning strategy to reduce excessive memory accesses and ease congestion on interconnects and the memory controller.
• A detailed experiment proves the efficiency and expressiveness of Ants.
• The observations and optimization methods are valuable not only for graph analytics but also for other platforms on manycore/multi-core memory systems.
II. BACKGROUND
A. GRAPHS
Graphs are ubiquitous in our daily life, such as in social networks, email and instant messaging graphs, protein-protein interaction networks and web graphs. Most exhibit a skewed distribution, such as a power-law degree distribution [22]. This distribution implies that only a small fraction of vertices have a significantly large number of neighbors, while the majority of vertices have relatively few neighbors [32]. In addition, neighbors are distributed randomly, resulting in frequent random memory accesses in graph analytics. These characteristics cause great difficulty for data access, load balancing and communication.
B. GRAPH ANALYTICS
Many graph analytics systems abstract computation as a vertex-centric program that executes on each vertex in parallel [5] - [7]. The scope of computation and communication at each vertex is restricted to its neighbors.
1) IN-MEMORY GRAPH DATA STRUCTURE
This structure mainly consists of the following three parts: graph topology data, application-defined data and graph runtime states [31]. Figure 1 (a) shows a directed graph with 8 vertices and 17 edges. Most systems split the graph into separately ordered arrays for vertices and edges, as in Figure 1 (b). The Out-edges array is partitioned by source vertex and stores the target vertices. Likewise, the In-edges array is partitioned by target vertex and stores the source vertices. The Vertex array keeps the metadata of all vertices, including the start of their out-edge and in-edge partitions and their out-degree and in-degree. In Figure 1 (c), two arrays store two versions of application-defined vertex data (the page values, taking PageRank as an example): the Curr-Data array contains the values computed in the previous iteration, and the Next-Data array stores the values updated in the current computation. Many applications also need to store the graph runtime states, i.e., whether each vertex is active in the current and the next execution. In addition, many data structures are optional, depending on the specific application.
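The array layout described above can be sketched in C++ as follows. This is a minimal illustration, not the Ants implementation: the names `VertexMeta`, `Graph` and `build_graph` are hypothetical, and the three-pass construction (count degrees, prefix sums, scatter) is one common way to build such arrays from an edge list.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical names; the paper describes this layout only in prose and figures.
struct VertexMeta {
    std::uint64_t out_start = 0;   // first slot of this vertex in out_edges
    std::uint64_t in_start = 0;    // first slot of this vertex in in_edges
    std::uint32_t out_degree = 0;
    std::uint32_t in_degree = 0;
};

struct Graph {
    std::vector<VertexMeta> vertices;      // the Vertex array
    std::vector<std::uint32_t> out_edges;  // targets, partitioned by source
    std::vector<std::uint32_t> in_edges;   // sources, partitioned by target
};

using Edge = std::pair<std::uint32_t, std::uint32_t>;  // (source, target)

Graph build_graph(std::uint32_t n, const std::vector<Edge>& edges) {
    Graph g;
    g.vertices.resize(n);
    // Pass 1: count out-degrees and in-degrees.
    for (const Edge& e : edges) {
        assert(e.first < n && e.second < n);
        ++g.vertices[e.first].out_degree;
        ++g.vertices[e.second].in_degree;
    }
    // Pass 2: prefix sums give the start of each vertex's partition.
    std::uint64_t out_pos = 0, in_pos = 0;
    for (VertexMeta& v : g.vertices) {
        v.out_start = out_pos;
        v.in_start = in_pos;
        out_pos += v.out_degree;
        in_pos += v.in_degree;
    }
    // Pass 3: scatter every edge into both edge arrays.
    g.out_edges.resize(edges.size());
    g.in_edges.resize(edges.size());
    std::vector<std::uint64_t> out_fill(n, 0), in_fill(n, 0);
    for (const Edge& e : edges) {
        g.out_edges[g.vertices[e.first].out_start + out_fill[e.first]++] = e.second;
        g.in_edges[g.vertices[e.second].in_start + in_fill[e.second]++] = e.first;
    }
    return g;
}
```

With this layout, the out-neighbors of vertex `s` occupy the contiguous range starting at `vertices[s].out_start` with length `vertices[s].out_degree`, which is what makes the edge reads sequential.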
51430 VOLUME 6, 2018
2) GRAPH PARALLEL COMPUTATION
Graph analytics systems typically follow a scatter-gather iterative computation model and adopt the vertex-centric iteration pattern [20]. The scatter phase propagates the current vertex value to its neighbors along the edges, and the gather phase accumulates values from neighbors to update the current vertex value. The propagation of vertex data in scatter can be implemented in either push or pull mode [7]. In push mode, the worker threads obtain the neighbors of each active vertex from the Out-edges array (sequential read), push its current value to update the values in the Next-Data array (random write, atomic if needed), and set the Next-State array (random write) at the same time. In pull mode, for each vertex in the Vertex array, the worker thread first obtains its active neighbors through the In-edges array (sequential read) and the Curr-State array (random read). Then the thread pulls the active vertex values from the Curr-Data array (random read), computes the vertex value in the Next-Data array and sets the Next-State array in turn.
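To make the two modes concrete, the following sequential C++ sketch performs one iteration in each mode. It is illustrative only: the names `push_step`, `pull_step` and `StepResult` are hypothetical, the update rule (summing the values of active in-neighbors) is an assumed example, and a parallel push would need atomic updates where the comment indicates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Adjacency = std::vector<std::vector<std::size_t>>;

struct StepResult {
    std::vector<double> next_data;  // plays the role of the Next-Data array
    std::vector<bool> next_state;   // plays the role of the Next-State array
};

// Push: iterate over active sources and write to their out-neighbors.
StepResult push_step(const Adjacency& out_nbrs,
                     const std::vector<double>& curr_data,
                     const std::vector<bool>& curr_state) {
    assert(out_nbrs.size() == curr_data.size());
    assert(curr_data.size() == curr_state.size());
    StepResult r{std::vector<double>(curr_data.size(), 0.0),
                 std::vector<bool>(curr_data.size(), false)};
    for (std::size_t s = 0; s < out_nbrs.size(); ++s) {
        if (!curr_state[s]) continue;
        for (std::size_t d : out_nbrs[s]) {
            r.next_data[d] += curr_data[s];  // random write; atomic in parallel code
            r.next_state[d] = true;
        }
    }
    return r;
}

// Pull: iterate over all targets and read from their active in-neighbors.
StepResult pull_step(const Adjacency& in_nbrs,
                     const std::vector<double>& curr_data,
                     const std::vector<bool>& curr_state) {
    assert(in_nbrs.size() == curr_data.size());
    assert(curr_data.size() == curr_state.size());
    StepResult r{std::vector<double>(curr_data.size(), 0.0),
                 std::vector<bool>(curr_data.size(), false)};
    for (std::size_t d = 0; d < in_nbrs.size(); ++d) {
        for (std::size_t s : in_nbrs[d]) {
            if (!curr_state[s]) continue;    // random read of Curr-State
            r.next_data[d] += curr_data[s];  // random read of Curr-Data
            r.next_state[d] = true;
        }
    }
    return r;
}
```

Both functions compute the same result; they differ in which array is written randomly (push) or read randomly (pull), which is the trade-off the text describes.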
C. THE INTEL KNIGHTS LANDING ARCHITECTURE
The Knights Landing (KNL) is a new x86-based manycore processor released by Intel [16] - [18]. One of the major changes relative to its predecessor (Knights Corner, also known as MIC) is that KNL is shipped not only as a PCIe accelerator but also as a standalone processor. Moreover, KNL can reach a peak performance of 6 Tflops in single precision and 3 Tflops in double precision.
1) CORES, TILES AND MESH
The Intel Xeon Phi KNL provides up to 72 fully x86-compliant cores, and cores are organized into tiles as shown in Figure 2. Each tile has two cores, a Cache/Home Agent (CHA) and a shared 1 MB L2 cache [33]. Each core has a private 32KB L1 data cache and a 32KB L1 instruction cache, both with 64Byte cache lines. It runs at 1.3GHz with up to four threads for each tile. The L2 cache is private to the tile and shared by its two cores. The CHA acts as a distributed tag directory to keep the L2 caches coherent across tiles using a MESIF protocol. Figure 3 shows the complete Knights Landing architecture. Tiles are connected into a 2D mesh that provides cache coherence between the L2 caches. More details can be found in [19].
2) MEMORY
KNL has a heterogeneous memory hierarchy with 16GB of integrated MCDRAM and regular DRAM (DDR4, to be precise). The MCDRAM consists of 8 × 2GB Micron Multi-Channel DRAM (MCDRAM) with four memory controllers. It can reach almost 450GB/s of bandwidth, which is considerably higher than that of regular memory. The main board is configured with 6 DDR4 slots with a peak bandwidth of 90GB/s. Figure 4 shows the three memory models in which MCDRAM can be configured. In Flat Mode, MCDRAM forms an address space independent of DRAM, and the two appear similar to separate NUMA nodes. In Cache Mode, MCDRAM is configured as a cache for the DRAM, and the OS organizes the data to use the MCDRAM in a way similar to an L3 cache. The obvious benefit is increased bandwidth, because with careful application design most memory requests will hit in this huge cache. However, Cache Mode also has two downsides. First, when a request misses in MCDRAM, the miss has to be communicated back to the die and another request has to be issued to DRAM for the relevant memory, which increases latency. Second, all data need to be transferred from DRAM to MCDRAM and finally to the L2 cache. In Hybrid Mode, MCDRAM is part cache (4 or 8GB) and part flat (12 or 8GB).
3) CLUSTER MODELS
The Knights Landing interconnecting mesh operates in one of the following three clustering models: All-to-All, Quadrant and sub-NUMA (selected at boot time). In All-to-All cluster mode, the memory addresses are uniformly distributed across all tag directories in the chip. This is the most general mode with the easiest programming model, but it offers lower performance than the other modes. In Quadrant (Hemisphere) mode, the KNL chip is divided into four (two) parts, and addresses are hashed to directories in the same Quadrant (Hemisphere), resulting in spatial locality with respect to the four memory controllers. This mode provides lower latency and higher bandwidth for the cores running in each Quadrant (Hemisphere) compared with All-to-All mode. In the sub-NUMA cluster modes, the operating system exposes all four quadrants (or two hemispheres) as virtual NUMA clusters, just like four (or two) sockets on the machine. This mode provides the lowest latency if applications are NUMA-aware and therefore pin threads and memory. However, if the cache traffic crosses the NUMA boundaries, sub-NUMA clustering is less efficient than Quadrant mode.
III. CHALLENGES AND ISSUES
This section discusses the special features of the Knights Landing processor, and the opportunities and challenges they bring for graph analytics.
A. MEMORY HIERARCHY AND DATA LAYOUT
One of the advantages of KNL is the onboard MCDRAM. We have compared its characteristics with those of DRAM using the Intel Memory Latency Checker [21]. The results under Quadrant cluster mode are shown in Table 1, and the situation is similar under sub-NUMA mode. In Flat Mode, MCDRAM outperforms DRAM by up to 5.13x in bandwidth, but it also has a higher latency than DRAM. In Cache Mode, there is no significant increase in bandwidth relative to pure DRAM, and the bandwidth is even worse in the 3:1 R-W and stream-triad tests. Cache misses may cause this, because the worker thread then needs to resend the request to the DRAM. The latency is the same as that of MCDRAM because each request must first be sent to the MCDRAM in this memory mode. Graph processing usually consists of a construction stage and a computation stage. In existing frameworks, the graph topology and application data structures are initialized by multiple constructing threads and are kept until the program terminates. In contrast, graph runtime states are allocated by the main thread at the beginning of each iteration, accessed by all processing threads and freed when the iteration completes. The graph topology data consist of the vertex array and the edge array, and the edge array is usually several times larger than the other data. In the computation, worker threads read the vertex and edge arrays according to their assigned vertex IDs, and then access and update application values.
Most graphs have a power-law distribution, which results in very low spatial locality in graph processing. There will therefore be many cache misses for large graphs under Cache mode. Cache mode is questionable even for small graphs, because its latency is the same as that of MCDRAM. Compared with Cache mode, Flat mode offers better flexibility and a larger capacity. However, it achieves better performance only if data structures are allocated properly.
B. CACHE COHERENCE AND TASK PARTITIONING
The Knights Landing is an x86-based manycore processor. Our machine is configured with an Intel Xeon Phi (TM) 7210 CPU, which has 64 dual-core tiles with two cores and a shared 1 MB L2 cache on each tile, and it runs 256 worker threads simultaneously through hyper-threading technology. Tiles are connected into a 2D mesh to provide cache coherence between the L2 caches (see Figure 3). The mesh also incorporates the memory controllers and the I/O connections. If a worker thread on one core updates a value, the cache lines in the L2 caches of other cores that share this value become invalid, and those cores must obtain the entire cache line again from memory via the 2D mesh and memory controllers.
Graph analytics systems typically adopt the vertex-centric iteration pattern and employ a combined push/pull mode to update application-defined values in parallel. As a result, the same values are easily shared by the cache lines of many different cores. Suppose that 34 vertex values, each stored in 8 bytes, need to be computed by two threads. A common task partitioning is shown in Figure 5. A 64Byte cache line holds eight vertex values. Each cache line in Core0 shares vertex data with the cache lines in Core1, which means that any update made by thread0 invalidates cache lines in Core1 and vice versa. This triggers not only excessive memory accesses but also congestion on interconnects and memory controllers, especially on manycore processors.
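The sharing problem in this example can be quantified with a small C++ sketch that counts how many cache lines are written by more than one thread under a given assignment of values to threads. The helper names are hypothetical, the array is assumed to start on a cache-line boundary, and the interleaved assignment is an assumption standing in for the partitioning drawn in Figure 5.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kValueBytes = 8;
constexpr std::size_t kValuesPerLine = kCacheLineBytes / kValueBytes;  // 8

// owner[i] is the thread that updates value i. Returns the number of cache
// lines that hold values owned by more than one thread.
std::size_t count_shared_lines(const std::vector<int>& owner) {
    std::map<std::size_t, std::set<int>> writers;  // cache line -> threads
    for (std::size_t i = 0; i < owner.size(); ++i) {
        writers[i / kValuesPerLine].insert(owner[i]);
    }
    std::size_t shared = 0;
    for (const auto& line : writers) {
        if (line.second.size() > 1) ++shared;
    }
    return shared;
}

// Value i goes to thread i % threads (assumed worst case).
std::vector<int> interleaved_owner(std::size_t n, int threads) {
    assert(threads > 0);
    std::vector<int> owner(n);
    for (std::size_t i = 0; i < n; ++i) {
        owner[i] = static_cast<int>(i % static_cast<std::size_t>(threads));
    }
    return owner;
}

// Consecutive blocks of `block` values are dealt to threads round-robin.
std::vector<int> blocked_owner(std::size_t n, std::size_t block, int threads) {
    assert(block > 0 && threads > 0);
    std::vector<int> owner(n);
    for (std::size_t i = 0; i < n; ++i) {
        owner[i] = static_cast<int>((i / block) % static_cast<std::size_t>(threads));
    }
    return owner;
}
```

For 34 values and two threads, the interleaved assignment leaves all five cache lines shared, two contiguous halves of 17 values leave one line shared, and blocks of exactly eight values leave none shared.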
IV. SYSTEM DESIGN
In this section, we describe Ants, a graph analytics platform on a manycore memory system. Ants allocates graph data structures based on the characteristics of heterogeneous memory, and adopts a specific fine-grained task partitioning strategy to take full advantage of many cores and to reduce the overhead of keeping caches coherent as far as possible. Ants adopts the typical scatter-gather model and provides a vertex-centric programming interface. The major APIs are VertexMap and EdgeMap, inherited from Ligra [7]. User-defined programs in Ants run synchronously on a directed graph G=(V,E), where V is the vertex set and E is the edge set. For undirected graphs, we simply treat each undirected edge as a pair of directed edges, which turns the graph into a directed one. A VertexSubset type is used to define a subset of vertices U ⊆ V, and both VertexMap and EdgeMap return a VertexSubset value. Ants assumes that the graph topology is immutable during computation, similar to other graph analytics systems [1], [6], [7]. The main interfaces are as follows:
It applies the application-defined function F to all vertices in the active set A and returns another vertex set: R = {v ∈ A | F(v) = true}.
2) EDGEMAP(G,A,F)
It applies the application-defined function F to all edges whose source vertex belongs to the active vertex set A. EdgeMap also returns a vertex set:
The graph topology data of a vertex includes in_neighbors and out_neighbors pointers (V(s).in_neighbors and V(s).out_neighbors), in_degree and out_degree (V(s).in_degree and V(s).out_degree).
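The two interfaces can be sketched sequentially in C++ as below. This is not the Ants API: the snake_case names, the `std::function` signatures and the adjacency-list `Vertex` type are hypothetical, and the rule that `edge_map` returns the targets d of edges (s, d) for which F returns true follows the definition published for Ligra, which is assumed here because Ants inherits these interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using VertexId = std::size_t;
using VertexSubset = std::vector<VertexId>;

// Hypothetical per-vertex topology record mirroring V(s) in the text.
struct Vertex {
    std::vector<VertexId> in_neighbors;
    std::vector<VertexId> out_neighbors;
    std::size_t in_degree() const { return in_neighbors.size(); }
    std::size_t out_degree() const { return out_neighbors.size(); }
};

using Graph = std::vector<Vertex>;

// R = { v in A | F(v) = true }
VertexSubset vertex_map(const VertexSubset& active,
                        const std::function<bool(VertexId)>& f) {
    VertexSubset result;
    for (VertexId v : active) {
        if (f(v)) result.push_back(v);
    }
    return result;
}

// Applies F to every edge (s, d) with s in A. A target d joins the result
// (at most once) when F(s, d) returns true; this return rule is assumed.
VertexSubset edge_map(const Graph& g, const VertexSubset& active,
                      const std::function<bool(VertexId, VertexId)>& f) {
    std::vector<bool> in_result(g.size(), false);
    VertexSubset result;
    for (VertexId s : active) {
        assert(s < g.size());
        for (VertexId d : g[s].out_neighbors) {
            if (f(s, d) && !in_result[d]) {
                in_result[d] = true;
                result.push_back(d);
            }
        }
    }
    return result;
}
```

A BFS step, for instance, would call `edge_map` with an F that marks unvisited targets and returns true for them, so that the returned subset becomes the next frontier.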
B. DATA LAYOUT FOR HETEROGENEOUS MEMORY
As seen in Table 2, memory under Cache mode has the same latency as MCDRAM, which is a little longer than that of DRAM. In addition, its bandwidth is close to that of DRAM but far from that of MCDRAM. Although all data can be cached in MCDRAM for small graphs, Cache mode cannot exploit the latency advantage of DRAM, and the memory controllers of MCDRAM must take all the I/O pressure. For large graphs, application-defined data and runtime states may be kept in MCDRAM because they are accessed frequently. However, most graph topology data need to be fetched from DRAM, which leads to many additional requests. Under Flat mode, we can flexibly allocate space on both MCDRAM and DRAM, and the total memory space is also larger than 16GB. Thus, Ants runs in this mode.
TABLE 2. Flat memory characteristics and data layout
The data layout of Ants is shown in Table 2. Graph topology includes vertex data and edge data: the content of vertex data is presented in the previous section; the edge data usually consist of in-edges and out-edges in a contiguous virtual address space. During graph processing, worker threads obtain vertex information from both. Ants allocates vertex data on MCDRAM and edge data on DRAM for the following two reasons. First, edge data are usually several times larger than vertex data, while MCDRAM has a limited capacity of only 16GB. Second, MCDRAM and DRAM each have their own memory controllers, so the system can achieve better performance if they serve memory accesses together. During processing, more than two hundred worker threads read and write the application-defined values simultaneously, so these values are allocated on MCDRAM to benefit from its excellent bandwidth. For graph runtime states, Ants needs to allocate new memory space at the beginning of each iteration and free it at the end. If these states were placed on MCDRAM, an allocation failure could occur after several iterations because of memory fragmentation within the 16GB capacity. Temporary variables and arrays are all allocated on DRAM for its lower latency.
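The placement rules in this paragraph can be summarized as a small policy function. The C++ sketch below is an illustration only: the enum and function names are hypothetical, and the fallback to DRAM when MCDRAM has too little free space is an assumed tie-breaking rule rather than something this paragraph states. On a real KNL system in Flat mode such a policy would sit in front of a high-bandwidth memory allocator or NUMA binding, which is not shown.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical classification of the data structures named in the text.
enum class DataKind { VertexData, EdgeData, AppData, RuntimeState, Temporary };
enum class Tier { MCDRAM, DRAM };

// Preferred tier for each kind of data, following the layout described above.
Tier preferred_tier(DataKind kind) {
    switch (kind) {
        case DataKind::VertexData:    // graph vertex metadata
        case DataKind::AppData:       // application-defined values, bandwidth bound
            return Tier::MCDRAM;
        case DataKind::EdgeData:      // several times larger than vertex data
        case DataKind::RuntimeState:  // reallocated every iteration
        case DataKind::Temporary:     // latency sensitive
            return Tier::DRAM;
    }
    assert(false && "unknown DataKind");
    return Tier::DRAM;
}

// Chooses a tier for an allocation of `bytes`, given the remaining MCDRAM
// space. Falling back to DRAM when MCDRAM is full is an assumption.
Tier place(DataKind kind, std::size_t bytes, std::size_t mcdram_free_bytes) {
    if (preferred_tier(kind) == Tier::MCDRAM && bytes <= mcdram_free_bytes) {
        return Tier::MCDRAM;
    }
    return Tier::DRAM;
}
```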
C. TASK PARTITIONING FOR CACHE COHERENCE
One of the greatest strengths of a manycore architecture is its large number of cores. The KNL machine in our lab can run 256 worker threads in parallel on 64 dual-core tiles using hyper-threading technology. However, an inappropriate task partitioning strategy, such as the one in Figure 5, may cause excessive memory accesses and congestion on interconnects and memory controllers. This outcome has a serious impact on the performance of graph analytics.
A highly parallel program requires inter-thread correlation to be as small as possible. An effective solution to the issue in Figure 5 requires the task granularity of each thread to fit in one or several cache lines. OpenMP and Cilk are both popular application programming interfaces that support multi-platform shared-memory multiprocessing programming [34], [35]. OpenMP provides a flexible task partitioning mechanism to satisfy various needs. However, through extensive experiments we found it to be less effective than Cilk. In Table 3, we compare PageRank and BFS implemented with OpenMP-based and Cilk-based Ligra on two datasets [27], [28]. The Cilk version always outperforms the OpenMP version, by up to 3.9X for PageRank on Twitter. However, Cilk task partitioning is based on a bisection method, and it only keeps the task granularity approximately equal to (and no larger than) the value we set. For example, if we set 8 as the parallel granularity for a loop of 34 iterations, the final task partition will be <8, 5, 4, 8, 5, 4> across the worker threads. Ants uses Cilk for parallel graph processing because of its high efficiency. On top of its built-in task scheduling strategy, we put forward a simple but very effective mechanism for fine-grained task partitioning.
A possible partitioning result by Cilk is shown in Figure 6 (a). Although thread1 only needs to compute vertices 8∼12, it must also read vertices 13∼15 to fill the remaining space in the cache line on core1. Any update of values 13∼15 on core0 invalidates this cache line in the associated L2 cache; thread0 needs to write the updated value to memory, and then thread1 rereads the corresponding data to continue its computation. On KNL, 256 worker threads update application values in parallel, which causes a large number of memory accesses for cache coherence and puts pressure on the memory controllers and the mesh. To solve this challenge, Ants adds 30 isolated vertices (no neighbors) at the end of the graph so the task partitioning result can turn into Figure 6(
No additional memory space is required by our method, because we only need to modify the number of vertices and do not have to allocate or initialize memory for the added vertices.
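The bisection behavior described above can be reproduced with a short recursive C++ sketch. It is an approximation for illustration, not Cilk's runtime code: the function names are hypothetical, and the split point uses floor rounding, which is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Recursively halves [lo, hi) until a piece holds at most `grain` iterations,
// then records the size of that piece.
void bisect(std::size_t lo, std::size_t hi, std::size_t grain,
            std::vector<std::size_t>& sizes) {
    assert(grain > 0 && lo <= hi);
    if (hi - lo <= grain) {
        if (hi > lo) sizes.push_back(hi - lo);
        return;
    }
    const std::size_t mid = lo + (hi - lo) / 2;  // floor rounding (assumed)
    bisect(lo, mid, grain, sizes);
    bisect(mid, hi, grain, sizes);
}

// Sizes of the leaf tasks for a loop of n iterations, in index order.
std::vector<std::size_t> chunk_sizes(std::size_t n, std::size_t grain) {
    std::vector<std::size_t> sizes;
    bisect(0, n, grain, sizes);
    return sizes;
}
```

For 34 iterations and a granularity of 8, this sketch yields the sizes 8, 4, 5, 8, 4, 5. These are the same six sizes as in the example above, with each 4 and 5 swapped because of the rounding choice; either way, most task boundaries do not fall on multiples of 8.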
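The text does not spell out how the number of added vertices is chosen. One rule that reproduces the example of 30 added vertices is to round the vertex count up to the granularity times a power of two, so that repeated halving ends in chunks of exactly the granularity. The C++ sketch below implements that assumed rule; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Smallest count >= n of the form grain * 2^k. With such a count, recursive
// halving stops at pieces of exactly `grain` iterations. This rule is an
// assumption inferred from the 34 -> 64 example, not taken from the paper.
std::size_t padded_vertex_count(std::size_t n, std::size_t grain) {
    assert(grain > 0);
    const std::size_t chunks = (n + grain - 1) / grain;  // ceil(n / grain)
    std::size_t pow2 = 1;
    while (pow2 < chunks) pow2 *= 2;
    return pow2 * grain;
}
```

For 34 vertices and a granularity of 8, the padded count is 64, that is, 30 isolated vertices are appended, and every task then covers one full 64Byte cache line of 8Byte values.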
D. OTHER OPTIMIZATIONS
We provide an application programming interface that lets users conveniently allocate memory on MCDRAM or DRAM. If there is not enough memory on MCDRAM, the allocation falls back to DRAM. Frequent malloc/free calls easily cause memory fragmentation and performance degradation, especially on MCDRAM with its limited capacity. Thus, we implement a just-enough memory allocation scheme to use memory more efficiently: we make a reasonable estimate of the required memory before computation and reuse the allocation in each iteration. In addition, we let users skip the calculation that chooses between push and pull. Some applications always adopt one mode for better performance, such as pull for PageRank.
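The idea of estimating once and reusing the allocation in every iteration can be sketched in C++ as follows. The class `ReusableBuffer` is hypothetical and uses ordinary heap memory; it only illustrates the reuse pattern and does not show how Ants binds the memory to MCDRAM or DRAM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A buffer that is sized once from an up-front estimate and then handed out
// again in every iteration instead of being freed and reallocated.
class ReusableBuffer {
public:
    explicit ReusableBuffer(std::size_t estimated_bytes)
        : storage_(estimated_bytes) {}

    // Returns a zeroed region of `bytes` bytes. The buffer grows only when
    // the initial estimate turns out to be too small.
    unsigned char* acquire(std::size_t bytes) {
        if (bytes > storage_.size()) {
            storage_.resize(bytes);
            ++grow_count_;
        }
        assert(storage_.size() >= bytes);
        std::fill(storage_.begin(),
                  storage_.begin() + static_cast<std::ptrdiff_t>(bytes),
                  static_cast<unsigned char>(0));
        return storage_.data();
    }

    std::size_t capacity() const { return storage_.size(); }
    std::size_t grow_count() const { return grow_count_; }

private:
    std::vector<unsigned char> storage_;
    std::size_t grow_count_ = 0;
};
```

As long as the estimate holds, every iteration receives the same address and no allocator call is made, which avoids the fragmentation that repeated allocation and freeing can cause.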
V. APPLICATIONS
We evaluate the performance and expressiveness of Ants with both basic and complex graph algorithms. These algorithms exhibit various memory access patterns and provide a comprehensive evaluation of Ants.
A. PAGERANK (PR)
In PageRank [23], worker threads pull values from each vertex's neighbors to update the vertex's PageRank in parallel. The program finishes when it reaches the maximum number of iterations we set (20 by default) or when the vertex PR values converge.
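A sequential pull-mode PageRank can be sketched in C++ as below. The iteration limit of 20 comes from the text; the damping factor of 0.85, the L1 convergence threshold and the function name are assumptions, and the parallel loop over vertices used in practice is omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// in_nbrs[v] lists the sources of edges into v; out_degree[u] is the number
// of edges leaving u. Returns the PageRank value of every vertex.
std::vector<double> pagerank_pull(
        const std::vector<std::vector<std::size_t>>& in_nbrs,
        const std::vector<std::size_t>& out_degree,
        std::size_t max_iters = 20,   // default from the text
        double damping = 0.85,        // assumed
        double epsilon = 1e-9) {      // assumed convergence threshold
    const std::size_t n = in_nbrs.size();
    assert(out_degree.size() == n);
    if (n == 0) return {};
    std::vector<double> curr(n, 1.0 / static_cast<double>(n));
    std::vector<double> next(n, 0.0);
    for (std::size_t iter = 0; iter < max_iters; ++iter) {
        double delta = 0.0;
        for (std::size_t v = 0; v < n; ++v) {
            double sum = 0.0;
            for (std::size_t u : in_nbrs[v]) {
                // u has at least one out-edge because it points to v.
                sum += curr[u] / static_cast<double>(out_degree[u]);
            }
            next[v] = (1.0 - damping) / static_cast<double>(n) + damping * sum;
            delta += std::fabs(next[v] - curr[v]);
        }
        curr.swap(next);
        if (delta < epsilon) break;  // converged
    }
    return curr;
}
```

Each thread in a parallel version would own a range of `v` and write only `next[v]` for that range, which is why the partition of `next` across threads matters for cache-line sharing.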
B. BREADTH-FIRST SEARCH (BFS)
BFS starts with a single active vertex. In each iteration, the active vertices activate their (unvisited) neighbors for the next iteration. The algorithm proceeds until no active vertices are left.
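The frontier-based iteration can be sketched sequentially in C++ as follows. The function name and the use of -1 for unreached vertices are illustrative choices, not taken from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the BFS level of every vertex, or -1 if it is unreachable.
std::vector<int> bfs_levels(const std::vector<std::vector<std::size_t>>& out_nbrs,
                            std::size_t source) {
    assert(source < out_nbrs.size());
    std::vector<int> level(out_nbrs.size(), -1);
    std::vector<std::size_t> frontier{source};  // the active vertices
    level[source] = 0;
    int depth = 0;
    while (!frontier.empty()) {
        ++depth;
        std::vector<std::size_t> next;
        for (std::size_t s : frontier) {
            for (std::size_t d : out_nbrs[s]) {
                if (level[d] == -1) {  // unvisited: activate for next round
                    level[d] = depth;
                    next.push_back(d);
                }
            }
        }
        frontier.swap(next);
    }
    return level;
}
```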
C. TRIANGLE COUNTING (TC)
TC examines each vertex's edge list and the edge lists of the neighbors on that list to find triangles [24]. Each triangle is counted once by assigning a direction to its edges.
D. GRAPH RADII (GR)
The radius of one vertex is the shortest distance to the furthest node it can reach. The graph diameter is the maximum radius over all vertices. We run multiple BFSs from a sample of K vertices to estimate graph radii, and the algorithm iterates until none of the bit-vectors change [25] .
E. CONNECTED COMPONENTS (CC)
CC is implemented with label propagation in a directed graph [26]. All vertices are initialized with their own IDs; each vertex broadcasts its ID to all neighbors and keeps the smallest ID it has observed. CC completes when no vertex receives a smaller ID. PR and CC (only in its early iterations) need to process all vertices from beginning to end. BFS and GR perform computation on a subset of vertices in each iteration. TC needs only one iteration, but it requires each vertex to read many edge lists, resulting in a huge amount of computation.
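Label propagation for CC can be sketched in C++ as below. This is a simplified illustration: the function name is hypothetical, the input is assumed to list each edge in both directions, and labels are updated in place rather than in the synchronous Curr/Next fashion a scatter-gather system would use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// nbrs[v] lists the neighbors of v, with every edge stored in both
// directions (assumed). Returns, for each vertex, the smallest vertex ID in
// its component.
std::vector<std::size_t> connected_components(
        const std::vector<std::vector<std::size_t>>& nbrs) {
    const std::size_t n = nbrs.size();
    std::vector<std::size_t> label(n);
    for (std::size_t v = 0; v < n; ++v) label[v] = v;  // own ID as label
    bool changed = true;
    while (changed) {  // stop when no vertex receives a smaller ID
        changed = false;
        for (std::size_t v = 0; v < n; ++v) {
            for (std::size_t u : nbrs[v]) {
                assert(u < n);
                if (label[u] < label[v]) {
                    label[v] = label[u];
                    changed = true;
                }
            }
        }
    }
    return label;
}
```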
VI. EVALUATION
We evaluate Ants with the above applications on five real-world graphs. Livejournal is from the Stanford Network Analysis Project [27] and the others are from the Laboratory for Web Algorithmics [28]. Their vertex counts range from 2.18 million to 68.66 million and their edge counts from 68.99 million to 1.8 billion, as detailed in Table 4.
We compare the performance of Ants with that of two other popular processing systems, Ligra and Gemini. All experiments are conducted on a machine with the Knights Landing processor, clocked at 1.3 GHz, with 16GB of MCDRAM and 96GB of DRAM. The processor has 64 dual-core tiles. The machine runs CentOS 7.3 with a 3.10.0 kernel.
A. CLUSTER MODELS
We observed the behavior of Ants under the following Cluster Models: sub-NUMA, All-to-All and Quadrant. Figure 7 shows the runtime of PR and BFS, and the results of the other applications are very similar. It is clear that both applications perform best in Quadrant Mode on all five graphs. With the All-to-All cluster mode, memory addresses are uniformly distributed across all tag directories (TDs) on the chip, resulting in a slightly longer latency for cache hits and misses compared with the other modes. The Quadrant mode divides the tiles into four parts, while the sub-NUMA mode partitions the chip into four quadrants. The former behaves similarly to a symmetric multi-processor (SMP), and the latter exposes the quadrants as NUMA nodes, where the latency of memory accesses to remote nodes is much higher than to local ones [36]. We also compared Ants under Quadrant mode with Gemini under sub-NUMA mode. Gemini is also an excellent graph analytics platform that makes optimizations specifically for NUMA characteristics. As seen in Figure 8, Ants always outperforms Gemini on the five graphs for PR and BFS. Ants even runs 53x faster than Gemini on gsh for BFS.
Although Gemini is NUMA-aware, accessing remote data during graph processing is inevitable because of the power-law distribution of graphs. This seriously affects performance on the KNL architecture because of its complex 2D mesh. Thus, we conduct the following experiments under the Quadrant mode. Table 5 gives a detailed runtime comparison between Ants and Ligra. Ligra_C means that we run Ligra in the Cache memory mode, and Ligra_F means the Flat mode. Ants outperforms Ligra in most cases, except for BFS on twitter and gsh. For CC, GR and TC, the gap between Ligra with and without MCDRAM is very small. It seems that Ligra can always benefit from cached data in MCDRAM on BFS. On PR, Ligra_F runs faster on the hollywood and enwiki graphs, Ligra_C behaves better on the twitter and gsh graphs, and the two are similar on livejournal. There are no obvious rules for the behavior of Ligra with and without MCDRAM. Both the application access pattern and the graph data distribution have an important influence on performance.
B. OVERALL PERFORMANCE
The right column in Table 5 shows the average speedup of Ants over Ligra_C and Ligra_F individually. It seems that the speedup grows with the amount of computation. The optimization of the data layout has a similar effect on all applications, whereas the task partitioning strategy is especially optimized for updates of application-defined data. For each vertex in the graph, TC scans its neighbors and the neighbors of its neighbors to find triangles, resulting in a large amount of computation. BFS, in contrast, needs only one update for each vertex after it is activated. Thus, TC achieves up to 8.97X improvement while the improvement of BFS is 1.7X. We believe Ants will also achieve good performance on other applications, because most of them resemble BFS or PR with more calculations and Ants is especially optimized for application-defined value computations.
C. DATA LAYOUT FOR HETEROGENEOUS MEMORY
We compared the performance of Ants under the following four memory schemes. In the DRAM mode, all data are allocated on DRAM. In the MCDRAM mode, data are first allocated on MCDRAM; if the space is not enough, the data are stored on DRAM. In Cache mode, MCDRAM serves as a big cache for DRAM. In Flat mode, MCDRAM and DRAM work together. The results for PR on the five graphs are shown in Figure 9. The behavior of the other applications is very similar. It is clear that Ants achieves the best performance in Flat mode. Owing to our specific data layout, Ants can fully exploit the higher bandwidth of MCDRAM and the lower latency of DRAM. Meanwhile, the four memory controllers for MCDRAM and the two for DRAM share the memory access pressure. MCDRAM has a slight advantage over DRAM in our experiments due to the intrinsic properties of MCDRAM (see Table 2). The performance under Flat mode depends largely on the cache hit rate, which is determined by the graph distribution. Ants needs an even longer time than in DRAM mode on livejournal, which may be caused by many cache misses. Gsh is so large that Ants must allocate part of its data on DRAM under MCDRAM mode. Thus, it takes much more time relative to Cache mode.
D. TASK PARTITIONING FOR CACHE COHERENCE
In this section, we compare the effect of our task partitioning strategy with the partitioning scheme built into Cilk. PR application. We set the optimized Ants as 1 for clarity. For example, the numbers of load and store instructions are 6.47 and 6.99 times larger, respectively, than in the optimized version for PR on hollywood. Figure 10 (b) presents the speedup obtained with our task partitioning mechanism. It is clear that our strategy eliminates a large number of load/store instructions and achieves better performance than the default Cilk partitioning scheme. The results on enwiki, livejournal and twitter suggest that the performance improvement is more closely related to the reduction in loads. The reason is that the number of load instructions is usually one or several orders of magnitude greater than the number of store instructions (see Table 6). The more the number of instructions is reduced, the greater the performance improvement. We also measure the performance of Ants with different granularities. Figure 11 shows the results for the PR application with granularities from 1 to 64 (8 to 512 bytes). Each application-defined value is stored in 8 bytes, and the CPU cache line is 64 bytes. Ants runs fastest on all graphs with a granularity of 8, which exactly fits in one cache line. A smaller value causes excessive memory accesses and congestion on interconnects, while a larger value makes it hard to keep the load balanced when the number of tasks is not large enough for so many worker threads.
E. ANTS WITH OPENMP
We also implement an OpenMP version of Ants and make a detailed comparison with the native version of Ants. Figure 12 shows the results of five applications on hollywood and livejournal; the results are similar for the other graphs. It is clear that the native Ants version achieves better performance in most cases.
VII. RELATED WORKS
Ants departs directly from prior graph analytics systems such as Ligra [7], Gemini [5] and HPGraph [29]. It adopts a specific data layout for heterogeneous memory and an effective task partitioning method to reduce the excessive memory accesses and ease the congestion on interconnects and memory controllers caused by cache coherence on the manycore system.
A. DISTRIBUTED GRAPH PROCESSING SYSTEM
Pregel [1], GraphLab [2], Trinity [30], GraphX [4] and Gemini [5] are popular distributed graph analytics systems. Pregel was developed by Google; it uses message passing to update vertex states among different nodes under a Bulk Synchronous Parallel model. GraphLab has been deployed in many companies and is used for both graph analytics and machine learning applications. Gemini is based on modern multi-core processors and high-speed interconnection networks. It starts from computation-centric processing and re-designs critical system components, including graph partitioning and representation, update propagation, task scheduling and multi-level load balancing. Many techniques used in Ants are borrowed from those distributed systems.
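As a rough illustration of the message-passing, Bulk Synchronous Parallel style described for Pregel, the following single-process sketch propagates the minimum vertex label through each connected component. It does not use the Pregel API; the function name and the dictionary-based graph representation are assumptions chosen for brevity.

```python
def bsp_min_label(adj):
    """Label each vertex with the smallest vertex id in its component.

    adj: dict mapping a vertex to a list of its neighbours (undirected graph).
    Messages sent in one superstep are only visible in the next one.
    """
    label = {v: v for v in adj}

    # Superstep 0: every vertex sends its own label to its neighbours.
    inbox = {v: [] for v in adj}
    for v in adj:
        for u in adj[v]:
            inbox[u].append(label[v])

    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in adj}
        for v, msgs in inbox.items():
            if msgs and min(msgs) < label[v]:
                label[v] = min(msgs)          # update local vertex state
                for u in adj[v]:
                    outbox[u].append(label[v])  # notify neighbours
        inbox = outbox  # barrier: deliver messages at the next superstep

    return label


if __name__ == "__main__":
    graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
    print(bsp_min_label(graph))
```

In a distributed system the inbox and outbox would be exchanged between cluster nodes at each barrier; here they are plain dictionaries so that the superstep structure stays visible.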
B. SINGLE-MACHINE GRAPH PROCESSING SYSTEM
Much research is devoted to leveraging multicore platforms for graph processing. GraphChi [8] and X-Stream [9] both target disk-based graph computation. FlashGraph [11] and Graphene [12] are designed for high-speed flash arrays used as an extension of memory. Ligra [7], Galois [6] and Polymer [31] store all graph data in memory and optimize data placement, task scheduling and the use of NUMA features.
C. GPU-BASED GRAPH PROCESSING SYSTEM
Because of the strong computing power of GPUs, some researchers develop CPU/GPU heterogeneous graph processing platforms. Totem [13] distributes tasks between the CPU and the GPU based on their memory capacity and synchronizes at the end of each iteration. GTS [14] transfers graph data to the GPU for analytics in batches and overlaps data transmission with computation to hide latency. Garaph [15] adopts a balanced edge-based partition to ensure work balance among CPU threads and proposes a vertex replication scheme to maximize GPU utilization. Others also attempt to build graph analytics platforms with multiple GPUs for faster graph processing [37]-[39].
VIII. CONCLUSION
This paper described Ants, a graph analytics platform on a manycore memory system. The key to Ants is its fine-grained task partitioning strategy, which eliminates a large amount of memory access and eases the congestion on interconnects and memory controllers. Ants also adopts an effective data layout for heterogeneous memory that exploits the advantages of both MCDRAM and DRAM. This study is the first attempt at graph analytics on a manycore memory system. The above findings and optimizations are also valuable for building other platforms on manycore/multicore systems.
His research interests include parallel operating system, global file systems, and non-volatile storage.
NONG XIAO was born in Nanchang, Jiangxi, China, in 1969. He received the B.S., M.S., and Ph.D. degrees in computer science from the National University of Defense Technology.
He is currently a Professor with the School of Computer Science, National University of Defense Technology. His research interests include grid and cloud computing, and big data processing.
Dr. Xiao received awards and honors, including the Cheung Kong Scholars Chair Professor Award, the National Outstanding Young Science Fund, and the First Prize of the National Science and Technology Progress Award.
She was the Deputy Chief Designer for Tianhe-2 systems with the National Supercomputing Center, Guangzhou, where she is currently the Director. She is also a Professor with the School of Computer Science, Sun Yat-sen University. Her extensive research and development experience has spanned several generations of domestic supercomputers in China and her continuing research interests include parallel operating system, high-speed communication, global file system, and advanced programming environment.
Dr. Lu was a recipient of the National Science and Technology Progress Special Award and the New Century Excellent Talent Support Plan in China.
VOLUME 6, 2018
