In this paper, we study how the Pruned Landmark Labeling (PLL) algorithm can be parallelized in a scalable fashion while producing the same results as the sequential algorithm. More specifically, we parallelize it using a Vertex-Centric (VC) computational model on a modern SIMD-powered multicore architecture. We design a new VC-PLL algorithm that resolves the apparent mismatch between the inherent sequential dependence of the PLL algorithm and the Vertex-Centric (VC) computing model. Furthermore, we introduce a novel batch execution model for VC computation and the BVC-PLL algorithm to reduce the computational inefficiency in VC-PLL. Quite surprisingly, the theoretical analysis reveals that, under a reasonable assumption, BVC-PLL has lower computational and memory access costs than PLL, which indicates that it may run faster than PLL even as a sequential algorithm. We also demonstrate how the BVC-PLL algorithm can be extended to handle directed graphs and weighted graphs and how it can utilize the hierarchical parallelism of a modern parallel computing architecture. Extensive experiments on real-world graphs not only show that the sequential BVC-PLL can run more than two times faster than the original PLL, but also demonstrate its parallel efficiency and scalability.
INTRODUCTION
Computing the shortest path distance between any two vertices stands out as one of the most fundamental graph operators in querying and analyzing massive graphs, with applications ranging from transportation systems, social networks, software systems, the WWW, and semantic web, among others. This operation also serves as the basis for more complex graph analytics and mining operations, such as graph pattern matching [20, 86] , distance join processing [68] , and centrality computation [14] .
Distance computation on road networks has become a common service for internet map applications, such as Google Maps. However, computing shortest path distances over scale-free complex networks (for example, massive social and web graphs) remains a challenging problem [7, 38, 58] . To provide exact distance query results, the 2-hop labeling approach [21] has emerged as a major tool. Given a graph, it assigns each vertex v a label L(v) comprising a list of vertices and their distances to v. Subsequently, given any two vertices u and v, we can use only the label information, L(u) and L(v), to recover their exact distance.
Since the seminal work by Cohen [21] , numerous efforts over a ten-year period [70, 18, 37, 19, 2, 29, 69] largely failed to make 2-hop labeling practical on real-world graphs with millions of vertices and edges, until the discovery of Pruned Landmark Labeling (PLL) [7] . This new labeling approach adopts a fast greedy process that iteratively assigns each vertex (one vertex at a time, according to a certain vertex order) to the labels of other vertices, subject to a distance check criterion: a vertex u can be added into a vertex v's label L(v) only if no previously recorded hub h provides an equal or shorter distance, i.e.,
$$d(u, v) < d(u, h) + d(h, v)$$
for all h ∈ L(u) and h ∈ L(v). Here L(u) and L(v) are the partial labels constructed before the labeling of vertex u. Once the labeling process is done, the result is guaranteed to be minimal: no hub needs to be, or can be, removed while still recovering all the pairwise distance (or reachability) information.
In the past few years, a number of studies [24, 48] have further validated and confirmed the scalability of this approach. As a result, 2-hop labeling has gone from handling small graphs with thousands of vertices and edges to large graphs with millions of vertices and edges.

PLL meets Vertex Centric Computation: As modern computing architectures become increasingly parallel, with more and more powerful GPUs and multicore processors with SIMD units emerging as off-the-shelf choices for graph processing/database systems, and as the size of real-world graphs continues to grow, an important question naturally arises: can PLL take full advantage of modern parallel computing architectures to better handle massive real-world graphs? Furthermore, since vertex-centric (VC) computation [54, 55] has become the de facto standard for parallel graph processing and graph databases, can PLL be parallelized using the vertex-centric scheme? As the industry adoption of graph databases and graph analytics systems accelerates, answering these questions is becoming critical.
However, the marriage between PLL and VC seems to be quite a mismatch: the original PLL algorithm is inherently sequential; i.e., the algorithm operates one vertex at a time to label the entire graph, and the labeling of a vertex depends on the partial labeling results from earlier processed vertices. Moreover, it has been claimed that PLL does not fit into a VC model [62] .

Parallel PLL: Given the strong task dependency within a single vertex labeling process and across the labeling of different vertices, all existing attempts at parallelization rely on the computational flow of the sequential PLL. The original PLL paper [7] suggests simply parallelizing the BFS labeling process of each vertex instead of dealing with the inter-vertex labeling dependency. Clearly, without inter-vertex labeling parallelization, the parallelism is quite limited. The two recent attempts [27, 62] treat each vertex labeling process as a single task and allow multiple vertices to simultaneously traverse and label other vertices, each following the original sequential PLL logic. As a result, the pruning benefits that PLL derives from the labeling order of vertices cannot be fully maintained, and these approaches cannot produce the same compact labels as the original PLL.

Our Contributions: In this paper, we study how PLL can be effectively parallelized (more specifically, under a Vertex-Centric (VC) computational model) using a modern SIMD-powered multicore architecture. Specifically, we study the following research problems and make several interesting discoveries along the way:

1. Parallel PLL Algorithm (Section 3): To resolve the mismatch between the inherent sequential dependence of the PLL algorithm and the VC model, we introduce a new VC-PLL algorithm that utilizes VC to parallelize PLL and is guaranteed to produce the same labels as the original PLL. However, the performance of basic VC-PLL turns out to be quite disappointing compared to PLL (both using a single thread). The theoretical analysis reveals two key factors in VC-PLL, message passing and remote memory access during vertex computation, both of which introduce additional costs compared to the PLL algorithm.
2. Batched Vertex-Centric PLL (Section 4): To deal with the limitations of VC-PLL, we introduce a novel batch execution model for vertex-centric computation and a new BVC-PLL algorithm, which largely preserves the same vertex computation function while reducing the costs of message passing and remote memory access. Quite surprisingly, an in-depth, apples-to-apples cost analysis of BVC-PLL and PLL reveals that, under certain reasonable assumptions, BVC-PLL has lower computational costs and memory access costs than PLL. This indicates that BVC-PLL may run faster than PLL even as a sequential algorithm.
3. Generalization and System Optimization (Section 5): We discuss how BVC-PLL can be extended to handle directed graphs and weighted graphs. We also study how BVC-PLL can be supported by a modern parallel computing architecture using hierarchical parallelism: coarse-grained thread-level parallelism and fine-grained data-level parallelism (i.e., SIMD parallelism or vectorization).

4. Experimental Study (Section 6): Our extensive evaluation focuses on the following questions: Can the BVC-PLL algorithm using only one thread (sequential execution) run faster than the original PLL? How does BVC-PLL scale to multiple threads, and how does its parallel scalability compare to that of other parallel algorithms? What are the main factors affecting its performance? We show that the sequential BVC-PLL can run more than two times faster than the original PLL (both using a single thread). Additionally, BVC-PLL has good scalability and obtains close to linear speedup using 20 threads on several real-world datasets.
PRELIMINARIES

2-Hop Labeling and PLL
The 2-hop labeling scheme, pioneered by Cohen et al. [21] , provides an efficient way to answer distance queries. It assigns each vertex u in the (undirected) graph a label L(u) such that for any two vertices u and v, their distance can be computed using only their label information. Formally, we compute L(u) and, for each h ∈ L(u), the corresponding distance from u, i.e., d(h, u). It is also called hub labeling [3] , as the vertices in the label set L(u) are referred to as the hubs of u. Table 1 illustrates a 2-hop labeling of the undirected graph G (Figure 1a) .
Formally, the shortest path distance query Dis(·, ·) between any two vertices u and v can be answered as:
$$Dis(u, v) = \min_{h \in L(u) \cap L(v)} \{ d(u, h) + d(h, v) \}$$
Thus, 2-hop labeling can answer distance queries efficiently by traversing two lists of vertices, with an operation similar to the merge step of merge sort. Given this, 2-hop labeling aims to minimize the total labeling size, i.e., if V is the vertex set of the graph, the goal is to minimize $\sum_{u \in V} |L(u)|$. For a directed graph, each vertex u is labeled with two labels, Lout(u) (the hubs reachable from u) and Lin(u) (the hubs reaching u), together with their distances to and from the vertex u, and the objective is to minimize $\sum_{u \in V} (|L_{out}(u)| + |L_{in}(u)|)$.
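The merge-style query described above can be sketched as follows. This is a minimal illustration, assuming each label is stored as a list of (hub, distance) pairs sorted by hub id; the function name `query_distance` and the toy labels are hypothetical and not taken from any PLL implementation.

```python
def query_distance(label_u, label_v):
    """Return min over common hubs h of d(u, h) + d(h, v).

    Both labels are lists of (hub_id, distance) sorted by hub_id
    (an assumed storage layout), so one linear merge pass suffices.
    """
    i, j = 0, 0
    best = float("inf")  # stays infinite if the labels share no hub
    while i < len(label_u) and j < len(label_v):
        hub_u, dist_u = label_u[i]
        hub_v, dist_v = label_v[j]
        if hub_u == hub_v:
            best = min(best, dist_u + dist_v)
            i += 1
            j += 1
        elif hub_u < hub_v:
            i += 1
        else:
            j += 1
    return best


# Toy labels for the path graph 0 - 1 - 2, with vertex 1 as the shared hub.
L = {0: [(0, 0), (1, 1)], 1: [(1, 0)], 2: [(1, 1), (2, 0)]}
print(query_distance(L[0], L[2]))  # 2
```

The cost of one query is O(|L(u)| + |L(v)|), which is why the total label size is the quantity to minimize.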
The traditional approach employs an approximate (greedy) algorithm based on set covering, which can produce a distance oracle whose size is within a logarithmic factor of the optimal one. Conceptually, the ground set consists of all the reachable vertex pairs, i.e., {(u, v) : u reaches v}. Any subset Cv ⊆ S × S (S ⊆ V ) consisting of vertex pairs (x, y) (x, y ∈ S) that can use v to recover their shortest path distance, i.e.,
$$d(x, y) = d(x, v) + d(v, y),$$
is a candidate set. In other words, it captures the benefit of assigning v to the labels of the vertices in S. The algorithm iteratively selects a vertex v to label a subset of vertices S and cover all the vertex pairs in Cv ⊆ S × S. It continues until the entire ground set is covered. The criterion for selecting the optimal v and S is the ratio between the number of newly covered pairs in Cv, i.e., |Cv \ P | where P consists of the already covered pairs, and the label cost |S|:
$$\max_{v, S} \frac{|C_v \setminus P|}{|S|}.$$
For any given vertex v, finding the optimal subset of vertices S to be labeled is equivalent to a densest subgraph problem.
The major problem with the set-cover based 2-hop labeling approach is its high construction cost: its original complexity is as high as O(n^5) [21] , which has since been reduced to O(n^3 log n) with the latest optimization techniques [8] . A number of other improvements under the set-cover framework [70, 18, 40, 37, 76, 19, 2, 29, 69] still cannot scale to real-world graphs that have millions or even billions of vertices and edges.

Hierarchical Hub Labeling (HHL) and Canonical Hierarchical Hub Labeling (CHHL): An important direction for making 2-hop labeling feasible and scalable for large graphs is to restrict the choices of labeling (by imposing some special properties on what can be added into the labels).

Definition 1. (Hierarchical Hub Labeling) Given two distinct vertices u and v, we say u ⪯ v if u ∈ L(v) (u is a hub of v). A hub (2-hop) labeling is hierarchical if ⪯ forms a partial order.
In fact, any partial order can be extended to a total order (the order-extension principle), and for a set of vertices V , the total order is defined as a bijection π : V → {1, · · · , |V |} (π(v) is the rank of v). Given this, we can say that a labeling is hierarchical if there is a total order π which satisfies: if u ∈ L(v) for distinct u and v, then π(u) < π(v) (u ranks higher than v).

Definition 2. (Canonical Hub Labeling) Let Puv be the set of all vertices on the shortest paths between u and v (including u and v). Given a total order π on V , its canonical hub labeling is defined as follows: u ∈ L(v) if u has the highest order in Puv, i.e., there is no other vertex w in Puv such that π(w) < π(u).
An important implication of canonical hierarchical hub labeling is that it produces the minimal hierarchical hub labeling for a given order [9] . Thus, the optimal HHL problem can be transformed into two sub-problems: 1) finding the optimal order that minimizes the label size; 2) computing the canonical HHL with respect to a given vertex order.
A main breakthrough enabling efficient 2-hop labeling is the discovery of a simple, yet elegant algorithm called pruned landmark labeling (PLL) [7] . It computes the canonical HHL (the second subproblem) for a given vertex order efficiently. Independently, essentially the same style algorithm was discovered for 2-hop reachability labeling, and is called distribution labeling [39] . In the past few years, a number of studies [24, 48] have further validated and confirmed the efficiency and effectiveness of PLL style algorithms for distance labeling.
Theoretically, the optimal hierarchical hub labeling (HHL) as well as the original 2-hop labeling have recently been proved to be NP-hard [9] , which implies that the optimal order sub-problem (the first sub-problem listed above) is NP-hard as well. A few heuristics, such as the ranking by degree and betweenness, have been developed for addressing this sub-problem [48] . The second sub-problem (labeling generation) typically dominates the overall labeling computation and is thus the focus of this study.
Pruned Landmark Labeling (PLL)
Given a total order π of vertices, the pruned landmark labeling algorithm (PLL) [7] assigns each vertex, following the order (π(v1) < π(v2) < · · · < π(vn)), to the labels of other vertices in the graph using a BFS process. As it assigns the vertex u with rank π(u) to a vertex v with lower rank (π(u) < π(v)), it needs to check whether u is the highest-ranked vertex on the shortest paths between u and v (Puv). This is the canonical HHL condition, and it can be checked by determining whether the distance between u and v cannot be recovered through any higher-ranked vertex:
$$d(u, v) < \min_{h \in L(u) \cap L(v)} \{ d(u, h) + d(h, v) \}$$
When the condition does not hold, u will be pruned by v (i.e., u is not added into the label of v and will not further expand from v) during the labeling process.

Algorithm 1 PLL for G = (V, E) with Order π
1: for all u ∈ V {following order π from high to low} do
2:   Queue Q = {(u, 0)} {BFS process to use u for labeling}
3:   while Q is not empty do
4:     (v, d(u, v)) ← Q.pop()
5:     if d(u, v) < min_{h ∈ L(u) ∩ L(v)} {d(u, h) + d(h, v)} then
6:       L(v) ← L(v) ∪ {(u, d(u, v))}
7:       For all v′ of v's neighbors, when v′ is unvisited by u and π(u) < π(v′): Q.push((v′, d(u, v) + 1))
8:     end if
9:   end while
10: end for
Algorithm 1 sketches the labeling process for an undirected graph. Note that d(u, v) in the algorithm is the distance computed by the BFS process, which may not be the exact distance between u and v (due to the pruning effect). However, the distance recorded in the label (Line 6) is always exact (since u can then travel along all the shortest paths from u to v).

Small Revision: Algorithm 1 differs slightly from the standard BFS process as well as from all the previous PLL descriptions [7, 48] . In Line 7, we send the current labeling vertex u only to the neighbors v′ of v that have lower rank than u (π(u) < π(v′)) [9] . This avoids the cost of sending u to v′ when v′ has higher rank than u. Based on the canonical labeling criterion (Definition 2), u cannot be added to such a v′ and can be safely pruned without further expansion from v′.

Running Example: Figures 1b and 1c illustrate the first three vertices I, E, and D of the PLL labeling process for graph G (Figure 1a ), with the order explicitly denoted in Figure 1b and Figure 1c . For instance, I is ranked first and D is ranked second, and so on.
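To make the sequential procedure concrete, the following is a minimal Python sketch of pruned BFS labeling for an undirected, unweighted graph. It is an illustration under stated assumptions, not the authors' implementation: labels are stored as dictionaries (hub to distance), and the names `hub_distance` and `pruned_landmark_labeling` as well as the toy path graph are hypothetical.

```python
from collections import deque


def hub_distance(label_u, label_v):
    """Smallest d(u, h) + d(h, v) over common hubs h (inf if none)."""
    common = label_u.keys() & label_v.keys()
    return min((label_u[h] + label_v[h] for h in common), default=float("inf"))


def pruned_landmark_labeling(adj, order):
    """adj: dict vertex -> list of neighbours; order: vertices, highest rank first."""
    rank = {v: i for i, v in enumerate(order)}
    labels = {v: {} for v in adj}  # L(v) as {hub: distance}
    for u in order:
        visited = {u}
        queue = deque([(u, 0)])
        while queue:
            v, d = queue.popleft()
            # Distance check: prune if existing hubs already give a distance <= d.
            if hub_distance(labels[u], labels[v]) <= d:
                continue
            labels[v][u] = d
            for w in adj[v]:
                # Rank filter: u is only forwarded to lower-ranked vertices.
                if w not in visited and rank[u] < rank[w]:
                    visited.add(w)
                    queue.append((w, d + 1))
    return labels


# Path graph 0 - 1 - 2 - 3, labeled in the (arbitrary) order 1, 2, 0, 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = pruned_landmark_labeling(adj, [1, 2, 0, 3])
print(hub_distance(labels[0], labels[3]))  # 3
```

In this toy run, vertex 1 ends up in every label, while vertex 0 appears only in its own label because the rank filter stops it at its higher-ranked neighbor.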
Vertex Centric Graph Computing Models
The seminal vertex-centric programming model proposed by the Pregel paper [54] is one of the key driving forces behind recent parallel graph processing system research [32, 46, 57, 73, 51, 43, 82, 71, 79, 59, 53, 35, 23] . It is also known as the "think-like-a-vertex" model. Though other models have been used, the simplicity, wide-range applicability, and strong scalability make the vertex-centric model very appealing as the basic interface and abstraction for parallel graph processing [55] .
Simply speaking, parallel graph processing is viewed as an iterative process, where each iteration traverses/processes the current set of active vertices [77] and requires a global synchronization at the end of the iteration. The entire process terminates once the set of active vertices becomes empty.
In this paper, we focus on studying how PLL can be parallelized under vertex-centric computation. A high-level abstraction of vertex-centric computation based on a scatter-gather model [54, 64] is sketched in Algorithm 2. Each vertex computation is described through two functions: 1) the Scatter function, which describes how each vertex uses its vertex value and edge values to propagate messages to its neighbors; and 2) the Gather function, which describes how each vertex computes a new value based on its original value and all the new messages it has received. Each phase can traverse its corresponding vertex set in parallel: ActiveVertices, i.e., the vertices that need to send out messages to their neighbors, for the Scatter phase, and the vertices that have received new messages to be processed for the Gather phase. When a vertex is updated with a new value (in the Gather function), it is added to the set ActiveVertices. The process continues until there are no new active vertices.
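The scatter-gather loop described above can be sketched in a few lines of sequential Python. This is only an illustration of the execution model, assuming user-supplied `scatter` and `gather` callbacks; the driver name `vertex_centric` and the BFS-distance example are hypothetical and are not taken from Pregel or any specific system.

```python
def vertex_centric(adj, values, active, scatter, gather):
    """Run scatter/gather supersteps until no vertex is active.

    scatter(a, value_a, v) -> message for neighbour v (or None to send nothing)
    gather(v, value_v, msgs) -> new value for v
    """
    while active:
        # Scatter phase: every active vertex sends messages along its edges.
        messages = {}
        for a in active:
            for v in adj[a]:
                msg = scatter(a, values[a], v)
                if msg is not None:
                    messages.setdefault(v, []).append(msg)
        # Gather phase: every vertex with messages recomputes its value.
        active = set()
        for v, msgs in messages.items():
            new_value = gather(v, values[v], msgs)
            if new_value != values[v]:
                values[v] = new_value
                active.add(v)  # updated vertices scatter in the next superstep
        # The end of the loop body plays the role of the global synchronization.
    return values


# Example: single-source BFS distances from vertex 0 on the path 0 - 1 - 2 - 3.
INF = float("inf")
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
dist = vertex_centric(
    adj,
    values={v: (0 if v == 0 else INF) for v in adj},
    active={0},
    scatter=lambda a, value, v: value + 1,
    gather=lambda v, value, msgs: min(value, min(msgs)),
)
print(dist)  # {0: 0, 1: 1, 2: 2, 3: 3}
```

In a parallel runtime, the two inner loops would be distributed over threads or machines; the sequential version only fixes the semantics.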
Various more advanced parallel graph programming models are proposed to further refine the vertex-centric model. This includes the GAS (Gather-Apply-Scatter) [32] and the push and pull models [60, 73, 13] , where the goal is to better fit the computational and communication patterns of graph processing. There is also work on generalizing the model to finer granularity, such as the edge-centric model [64] , or to coarser granularity, such as path or subgraph-centric [63] , and k-step neighborhood [15, 44] models. However, these models do not necessarily provide more advantage/capability to support the parallelization of the PLL than the aforementioned vertex centric model. Other recent efforts like iBFS [50] , CUBE [83] , and RStream [78] target different applications or algorithms, and have distinct challenges associated with them.
BASIC VERTEX-CENTRIC ALGORITHM
Recall that the PLL algorithm (Alg. 1) iterates following the vertex rank (order): at the i-th iteration, the vertex u with rank π(u) = i is distributed to all other vertices in the graph using a BFS process. The key condition for adding u into the label of v, L(v), is the distance check for the canonical labeling criterion: the distance between u and v cannot be recovered by earlier processed vertices, i.e., vertices with rank higher than u:
$$d(u, v) < \min_{h \in L(u) \cap L(v)} \{ d(u, h) + d(h, v) \}$$
Otherwise, u will not be assigned to v and will not be sent to v's neighbors for any further expansion.
The main challenge in parallelizing PLL is that adding a vertex u of rank π(u) to another vertex v in the BFS traversal seems to depend on the completion of the labeling of all higher-ranked vertices (i.e., any vertex h such that π(h) < π(u)) in order to apply the distance check. In contrast, for parallelization with the vertex-centric model, we would like to distribute all vertices to their neighbors simultaneously for vertex labeling. The need to distribute all vertices simultaneously and the rank-based distance checking condition thus seem to be in conflict, as there is no guarantee that higher-ranked vertices finish their distribution before lower-ranked ones. Indeed, as mentioned earlier, all existing attempts have failed to parallelize inter-vertex labeling while preserving the canonical labeling criterion [7, 27, 62] .
The Algorithm
The main insight that helps us resolve the aforementioned dilemma is as follows. Assume we spread all vertices simultaneously into the graph (starting by sending each vertex to its neighbors), and we do the spreading iteration by iteration following the vertex-centric programming model. Consider a vertex u with rank π(u) that reaches the vertex v at the j-th iteration: clearly, in order to determine whether u should be added to L(v) and further spread by v, we need to verify whether there is any other vertex, say w, with higher rank than u (π(w) < π(u)) that can produce an equal or shorter distance, i.e.,
$$d(u, w) + d(w, v) \le d(u, v).$$
Our key insight is that if such a vertex w exists, then it must be able to reach both u and v by the j-th iteration (i.e., within d(u, v) steps).
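The bound behind this observation can be written out explicitly. The short derivation below is a sketch under two assumptions: the message carrying u arrives at v with distance j, and w differs from both u and v (w ≠ u because π(w) < π(u), and w ≠ v because u is only forwarded to vertices ranked lower than u).

```latex
\begin{align*}
d(u, w) + d(w, v) &\le j, \qquad d(u, w) \ge 1, \quad d(w, v) \ge 1 \\
\Rightarrow\quad d(u, w) &\le j - d(w, v) \le j - 1, \\
d(w, v) &\le j - d(u, w) \le j - 1 .
\end{align*}
```

Hence any such w lies within j − 1 hops of both u and v, so a label for it (or for an even higher-ranked hub that pruned it along the way) is already in place at both endpoints when u is checked at v in iteration j.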
In retrospect, the distance check condition for the canonical labeling criterion requires not only that the labeling of higher-ranked vertices h be completed before the distance check between u and v, but also that their distances d(u, h) and d(h, v) be smaller than d(u, v). The latter condition is the key to utilizing a VC model for PLL, and it provides a natural match between the vertex spreading process at the heart of VC computation and the central mechanism of canonical labeling in PLL: if we follow a basic label spreading process in VC, then we can prune (or accept) vertex labels in parallel at any vertex using the distance check for canonical labeling in PLL.

Algorithm Description: Algorithm 3 sketches the main process of performing PLL based on the vertex-centric computation model (Algorithm 2). In the Initialization phase, all vertices are active (ActiveVertices = V ). For each vertex v, L(v) records the partial label and δL(v) records the new labels generated at each iteration. Initially, both labels of v record v itself with distance 0 (any vertex reaches itself in zero steps). The main computation alternates between the Scatter phase and the Gather phase and continues until no new active vertices exist (Lines 2 to 17):

1) Scatter phase (Lines 3 to 7, also referred to as the push model): all active vertices with new labels perform the Scatter function, sending their new labels δL(a) to all their neighbors (Line 5) under two conditions: the rank of vertex u needs to be higher than that of v (otherwise, it will be pruned), and u has never been added to the label of v.

Algorithm 3 VC-PLL for G = (V, E) with Order π
1: ActiveVertices ← V ; ∀v ∈ V : L(v) ← {(v, 0)}, δL(v) ← {(v, 0)}
2: while ActiveVertices ≠ ∅ do
   {Scatter Phase:}
3:   for all a ∈ ActiveVertices do a.Scatter(a.edges):
4:     for all (a, v) ∈ a.edges do
5:       for all (u, d(u, a)) ∈ δL(a), when π(u) < π(v) ∧ u ∉ L(v): send (u, d(u, a) + 1) to v.messages
6:     end for
7:   end for ActiveVertices ← ∅
   {Gather Phase:}
8:   for all v ∈ V : v.messages ≠ ∅ {Received Messages} do v.Gather(v.messages):
9:     δL(v) ← ∅
10:    for all unique (u, d(u, v)) ∈ v.messages do
11:      if d(u, v) < min_{h ∈ L(u) ∩ L(v)} {d(u, h) + d(h, v)} then
12:        δL(v) ← δL(v) ∪ {(u, d(u, v))}
13:      end if
14:    end for
15:    If δL(v) ≠ ∅: L(v) ← L(v) ∪ δL(v) and add v to ActiveVertices
16:  end for
17: end while
2) Gather phase (Lines 8-16): all vertices that receive a new message (v.messages ≠ ∅) perform the vertex Gather function (Lines 9-15). A vertex v traverses all its received messages (distance labels from its neighbors), and for each unique vertex label (u, d(u, v)) across the set of messages, it applies the distance check for the canonical labeling criterion: for a distance label message (u, d(u, v)), d(u, v) must be smaller than the distances via any existing labels (L), i.e.,
$$d(u, v) < \min_{h \in L(u) \cap L(v)} \{ d(u, h) + d(h, v) \}.$$
If this is true, the label is added into δL(v). Once δL(v) is computed and is not empty, we add it into L(v) and add v to ActiveVertices (Line 15). Note that we need to identify unique vertices in the step above, because two neighbors may send the same vertex u.

Running Example: Figure 2a illustrates the iterations of label spreading, where the labels in the graph record the newly generated labels δL for all vertices. At each iteration, L(v) is simply the union of all δL(v) from all earlier iterations.
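The Scatter and Gather phases described above can be simulated sequentially as follows. This is a hedged sketch rather than the authors' code: labels are dictionaries (hub to distance), accepted labels are applied only after all checks of a superstep to mimic the global synchronization, and the names `hub_distance` and `vc_pll` and the 4-cycle example are hypothetical.

```python
def hub_distance(label_u, label_v):
    """Smallest d(u, h) + d(h, v) over common hubs h (inf if none)."""
    common = label_u.keys() & label_v.keys()
    return min((label_u[h] + label_v[h] for h in common), default=float("inf"))


def vc_pll(adj, order):
    """adj: dict vertex -> neighbours; order: vertices, highest rank first."""
    rank = {v: i for i, v in enumerate(order)}
    labels = {v: {v: 0} for v in adj}  # L(v)
    delta = {v: {v: 0} for v in adj}   # delta L(v): labels added last iteration
    active = set(adj)
    while active:
        # Scatter: push new labels to neighbours, filtered by rank and by L(v).
        messages = {}
        for a in active:
            for v in adj[a]:
                for u, d in delta[a].items():
                    if rank[u] < rank[v] and u not in labels[v]:
                        # Duplicates of u collapse into one entry per receiver.
                        messages.setdefault(v, {})[u] = d + 1
        # Gather: distance check against labels from earlier iterations.
        accepted = {}
        for v, msgs in messages.items():
            new = {u: d for u, d in msgs.items()
                   if d < hub_distance(labels[u], labels[v])}
            if new:
                accepted[v] = new
        for v, new in accepted.items():  # apply after all checks (synchronous)
            labels[v].update(new)
        delta, active = accepted, set(accepted)
    return labels


# 4-cycle 0 - 1 - 2 - 3 - 0 with rank order 0, 1, 2, 3.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
labels = vc_pll(adj, [0, 1, 2, 3])
print(labels[3])
```

In this example, vertex 1 reaches vertex 3 in the second iteration with distance 2 but is pruned, because hub 0 (already in both labels after the first iteration) recovers the same distance.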
Theoretical Properties
Correctness: Theorem 1 shows that VC-PLL produces a canonical hierarchical labeling and therefore also generates the minimum labeling size for a given vertex order. In other words, VC-PLL produces the same labels as the original PLL.

Theorem 1. VC-PLL (Algorithm 3) produces the canonical hierarchical hub labeling given a vertex order π.
Proof Sketch: Recall the shortest path vertex set Puv consists of all vertices on shortest paths between u and v (including u and v). Then, we need to prove u ∈ L(v) iff u has the highest order in Puv (Definition 2).
First (→), we can see that if u ∈ L(v), then we cannot find another vertex w with rank higher than u such that
$$d(u, w) + d(w, v) \le d(u, v).$$
Thus, u must have the highest order in Puv. If not, assume there is another vertex w ≠ u that has the highest rank in Puv. Then, based on our algorithm, w is also the highest-ranked vertex in Pwu and Pwv. Thus, w always reaches u and v before u reaches v (Figure 3 ), and it is in L(u) and L(v) when u reaches v, so u would have been pruned. Second (←), assume u has the highest order in Puv. Then, based on the same argument, u travels along a shortest path from u to v in Algorithm 3, and when it reaches v, no other vertex in L(v) (and L(u)) can prune it. □
The following corollary can be immediately obtained.

Corollary 1. For any label entry (u, d(u, v)) ∈ L(v) produced by VC-PLL, d(u, v) is the exact shortest path distance between u and v, and u has the highest rank in Puv.
Tree Width and Time Complexity: Following the approach in PLL [7] , we can obtain a theoretical upper bound on VC-PLL's time complexity.

Theorem 2. Assume graph G has a tree decomposition [1] of tree-width w. Then there is a vertex order π for which VC-PLL takes O(w|E| log |V | + w^2 |V |(log |V |)^2) time (the same as that of PLL [7] ).

The proof is omitted due to space limitations.
Limitations and Benefits of VC-PLL
We note that even though Theorem 2 provides theoretical evidence on time complexity, it does not provide a direct comparison of the computational and memory access costs of the two algorithms, PLL and VC-PLL, for a given vertex order. In the following, we perform an in-depth comparison between VC-PLL and PLL, and identify the main performance bottlenecks and potential benefits of VC-PLL. Since the cost of generating (sending) distance labels and performing distance checks dominates the total computation (as in the original PLL [7] ), we primarily focus on these two factors for computational costs. In addition, we compare how the two algorithms access the underlying graph G in memory.

Additional Cost of Distance Label Generation: For a given vertex u, PLL sends it to a vertex v only once: during the BFS, PLL flags v after one distance label (u, d(u, v)) is passed to it (Line 7 in Algorithm 1 is executed sequentially). VC-PLL, however, can send multiple (u, d(u, v)) messages to the same v, in two consecutive iterations.

Lemma 1. Given vertex u and vertex v, a distance label (u : d(u, v)) may reach v in exactly two possible, consecutive iterations. Let a be a neighbor of v with u ∈ L(a) (u is the highest-ranked vertex in Pua); then the label reaches v at iteration d(u, a) + 1, which is either: 1) equal to the shortest path distance between u and v, in which case u may or may not be added to L(v); or 2) equal to d(u, v) + 1, i.e., the path from u to a followed by the edge (a, v) is one hop longer than a shortest path between u and v.

Please see the Appendix for the proof. Consequently, the number of distance checks in VC-PLL can be higher than in PLL, as a vertex u can be sent to v in two consecutive iterations in VC-PLL.
The computational cost of an individual distance check,
$$d(u, v) < \min_{h \in L(u) \cap L(v)} \{ d(u, h) + d(h, v) \},$$
in VC-PLL is also higher than that in PLL. Assuming L(u) and L(v) are not sorted, we can first map L(u) into an array or hash table, and then check all the vertices in L(v) against this data structure. PLL [7] processes one vertex u at a time, and by the time it starts processing u, the label L(u) has already been computed. Thus, PLL can map L(u) to an array only once, at the beginning of u's BFS. The O(|L(u)|) mapping cost is therefore amortized away from each distance check, and a distance check in PLL costs only O(|L(v)|). VC-PLL cannot do this directly, as it is prohibitively expensive to keep every L(u) mapped to an array or hash table at the same time.
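The amortization argument can be illustrated with a small sketch: L(u) is mapped to a hash table once, and every subsequent check scans only L(v). The helper name `make_distance_check` and the toy labels are hypothetical, and labels are assumed to be lists of (hub, distance) pairs.

```python
def make_distance_check(label_u):
    """Map L(u) to a hash table once; the returned check scans only L(v)."""
    hub_dist = dict(label_u)  # one-time O(|L(u)|) mapping, reused for every v

    def passes(d_uv, label_v):
        """True if no common hub h gives d(u, h) + d(h, v) <= d_uv."""
        for h, d_hv in label_v:  # O(|L(v)|) per check
            d_uh = hub_dist.get(h)
            if d_uh is not None and d_uh + d_hv <= d_uv:
                return False     # u is pruned at v
        return True

    return passes


# Toy labels: u = 5 has hubs {0, 5}; v = 7 has hubs {0, 7}.
check_u = make_distance_check([(0, 1), (5, 0)])
print(check_u(2, [(0, 1), (7, 0)]))  # False: hub 0 already gives distance 2
print(check_u(1, [(0, 1), (7, 0)]))  # True: no hub recovers distance 1
```

In PLL, one such table exists at a time (for the vertex currently running its BFS), whereas a straightforward VC-PLL would need one per vertex simultaneously.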
To summarize, VC-PLL introduces redundant distance labeling messages, which may also lead to redundant distance checks for d(u, v). Furthermore, individual distance checks in PLL can be much faster due to the reuse of L(u) in an array or hash-table representation. In fact, these performance issues seem to challenge the capabilities of the Vertex-Centric (VC) computational model in supporting: 1) effective message filtering and communication, and 2) efficient remote (global) memory access.

Reduced Memory Access Cost for Graph Topology: A potential benefit of VC-PLL is that it can help reduce the total memory access for the graph topology compared with PLL. Specifically, this is the total number of edge accesses in the graph (Line 7 in PLL and Line 4 in VC-PLL) for propagating distance messages: 1) For PLL, for each new vertex label message, an edge access (v, v′) is performed for adding (v′, d(u, v) + 1) to the queue Q; thus, the total number of edge accesses is equivalent to the total number of distance labeling messages. 2) For VC-PLL, for a vertex a, all its new potential labels at an iteration, δL(a), are filtered and grouped together for one edge access (a, v). Thus, VC-PLL should have a lower total memory access cost for the edges of the graph than PLL (assuming they propagate a similar number of labeling messages).
Also, in terms of an upper-bound estimation, following the assumption in Theorem 2, there is a vertex order which has O(w|E| log |V |) complexity for the edge cost in PLL (w is the tree-width of G), whereas VC-PLL is bounded by O(D|E|), where D is the diameter of the graph. In real-world graphs, the diameter of a graph is typically much smaller than its tree-width w [5] . Finally, we note that such memory access benefits are similar to "frontier sharing" in iBFS [50] , though the latter is not based on vertex-centric computation.

Sequential Performance Comparison: We implemented Alg. 3 (VC-PLL) and tested its performance on the DBLP graph (Section 6) against PLL using a single thread. We found that it performs poorly, with a total execution time of 13,583 seconds compared to less than 100 seconds for PLL. It does not fare well against PLL on other graphs either. Basic performance analysis shows that the additional computational costs significantly outweigh the benefits of the memory access cost reduction. Now, the question we face is: can VC-PLL overcome its limitations and reduce those additional costs (message passing and remote memory access)? In the next section, we discuss how we can extend the basic VC model to achieve this, and show that the resulting batched algorithm (BVC-PLL) can be even faster than PLL sequentially, as it has lower computational as well as memory access costs.
BATCHED VERTEX-CENTRIC ALG.
Though VC-PLL can be described in a natural Vertex-Centric computational scheme, it also exposes certain limitations of the assumptions of the original vertex-centric model: 1) Typically, the vertex value (and message) in VC has a fixed size, whereas in VC-PLL, each vertex value (and message) is a continuously growing list (or set); 2) In the Gather function, the computation needs remote memory access for checking distance conditions (Line 11 in Algorithm 3): in most cases, u is not a neighbor of v, and when we use L(u) for distance checks, the memory access is remote with respect to vertex v. Indeed, the additional computational costs of VC-PLL compared with PLL (Subsection 3.3) can be traced back to these limitations.
Batched Vertex-Centric Computation
To deal with the performance inefficiency of VC-PLL and the limitations of the vertex-centric computation model, we introduce a batched strategy for the standard VC computation. Batches are processed in sequence, with the vertices within each batch being processed using vertex-centric computation. The batched strategy naturally introduces mechanisms to help handle: 1) the (continuously increasing) size of vertex values and redundant message passing, and 2) remote vertex memory access.

Algorithm 4 BVC-PLL for G = (V, E) with Order π
1: ∀v ∈ V : L(v) ← ∅, C(v) ← ∅
2: Partition V into batches B1, B2, · · · , Bk of batch_size vertices each, following order π
3: for all batches Bi {following order π from high to low} do
4:   ActiveVertices ← Bi; ∀u ∈ Bi: δL(u) ← {(u, 0)}, L(u) ← L(u) ∪ {(u, 0)}, map L(u) to H(u)
5:   while ActiveVertices ≠ ∅ do
     {Scatter Phase:}
6:     for all a ∈ ActiveVertices do a.Scatter(a.edges):
7:       for all (a, v) ∈ a.edges do
8:         for all (u, d(u, a)) ∈ δL(a), when π(u) < π(v) ∧ u ∉ C(v): flag u in C(v) and send (u, d(u, a) + 1) to v.messages
9:       end for
10:    end for ActiveVertices ← ∅
     {Gather Phase:}
11:    for all v ∈ V : v.messages ≠ ∅ {Received Messages} do v.Gather(v.messages):
12:      δL(v) ← ∅
13:      for all (u, d(u, v)) ∈ v.messages do
14:        if d(u, v) < min_{h ∈ L(v), h ∈ H(u)} {d(u, h) + d(h, v)} then
15:          δL(v) ← δL(v) ∪ {(u, d(u, v))}
16:        end if
17:      end for
18:      If δL(v) ≠ ∅: L(v) ← L(v) ∪ δL(v) and add v to ActiveVertices
19:      If v ∈ Bi: Add δL(v) to H(v)
20:    end for
21:  end while
22:  ∀v ∈ V, C(v) ← ∅
23: end for

Using Bit Operations for Efficient Message Passing and Filtering: In each batch processing step, an active vertex processes at most batch_size unique labels. Based on this important observation, we can use a compact bit-vector data structure, called the candidate bit-vector, for efficient message filtering. The basic idea is as follows. Each active vertex maintains a candidate bit-vector of batch_size bits, each bit corresponding to a vertex in the batch (e.g., if batch_size is 1K, such a candidate bit-vector takes only 128 bytes). If a vertex u in the current batch is sent to a vertex v, then its corresponding bit in the candidate bit-vector of v is set. Note that the use of bit-vectors also allows atomic compare-and-swap operations in the shared memory setting. Without batch processing, we would have to consider an expensive list merge for handling message passing and aggregation (as in the Scatter and Gather functions of VC-PLL for distance label messaging and processing, respectively).

Improving Data Locality for Remote Vertex Memory Access: Simply speaking, only the vertices in the current processing batch can be accessed remotely during the vertex-centric computation. Because the number of vertices in each processing batch is limited, we can use a compact data structure such as an array or hash table to store their labels for efficient O(1) access (similar to what is done in PLL for each processed vertex in distance checks).
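The candidate bit-vector can be sketched as a small class. This is a single-threaded illustration only: the class name `CandidateBitVector` and its methods are hypothetical, and the plain read-modify-write in `test_and_set` stands in for the atomic compare-and-swap that a shared-memory implementation would use.

```python
class CandidateBitVector:
    """One bit per vertex of the current batch, stored in a bytearray."""

    def __init__(self, batch_size):
        self.bits = bytearray((batch_size + 7) // 8)

    def test_and_set(self, index):
        """Set bit `index`; return True only for the first caller.

        Single-threaded stand-in for an atomic compare-and-swap.
        """
        byte, mask = index >> 3, 1 << (index & 7)
        was_clear = not (self.bits[byte] & mask)
        self.bits[byte] |= mask
        return was_clear

    def clear(self):
        """Reset all bits before the next batch is processed."""
        self.bits = bytearray(len(self.bits))


c_v = CandidateBitVector(1024)
print(len(c_v.bits))        # 128 bytes for a batch of 1K vertices
print(c_v.test_and_set(5))  # True: first message carrying batch vertex 5
print(c_v.test_and_set(5))  # False: a duplicate message is filtered out
```

Indexing by position within the batch (rather than by global vertex id) is what keeps the per-vertex footprint at batch_size bits.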
BVC-PLL Algorithm
Algorithm 4 sketches the batched vertex-centric algorithm for PLL, referred to as BVC-PLL. The batches of vertices are formed according to the rank of each vertex (Line 2), and batches containing higher-ranked vertices are processed earlier (Line 3). BVC-PLL labels vertices one batch at a time. To assign the labels in each batch, it follows the vertex-centric computation of VC-PLL (Lines 5-21): the Scatter Phase and Scatter function and the Gather Phase and Gather function are preserved, with only minor revisions for dealing with message passing and remote memory access.
Each vertex v is associated with a candidate bit-vector C(v) whose length equals the batch size. It is reset for each batch (Lines 1 and 22). During the Scatter phase, before a vertex a sends a message (u, d(u, a) + 1) to its neighbor v, it checks whether u has already been sent to v (u ∉ C(v), Line 8). This corresponds to the unvisited flag in the original PLL. Thanks to the atomic compare-and-swap operation, at most one message carrying u is sent to v, which resolves the redundant distance label generation problem (Subsection 3.3).
Each vertex u in the batch B_i maps its existing label L(u) to a hash table (or array) H(u) at the beginning of the vertex-centric computation (Line 4). Since new labels of a batch vertex v may be generated during the labeling process, we add the new label δL(v) to H(v) as soon as the update is available (Line 19). Given this, the distance check (Line 14) only needs to go through L(v), and thus has the same distance check cost as the original PLL (Subsection 3.3).

Correctness: BVC-PLL (Algorithm 4) produces the canonical hierarchical hub labeling for a given vertex order π: the canonical labeling criterion (u ∈ L(v) if u has the highest rank in P_uv) is maintained because, following the batch processing order, BVC-PLL correctly assigns u to L(v) in u's batch (Theorem 1).
Another interesting property is that when the batch size reduces to one, i.e., when we process one vertex at a time, then BVC-PLL behaves exactly the same as the original PLL [7] .
Finally, we note that introducing and using bit-vector C(v) for each vertex v and H(u) for each processing batch vertex u does not introduce additional time complexity compared with PLL. PLL uses only one bit for each vertex v as the visited flag and one H(u) for distance check, whereas BVC-PLL simply utilizes a group of them at the same time. Thus, the time complexity results of Theorem 2 hold for BVC-PLL as well.
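To make the batched scheme of this subsection concrete, the following is a minimal sequential sketch in Python for undirected, unweighted graphs. It is an illustration under stated assumptions, not the authors' implementation: a Python set stands in for the candidate bit-vector C(v), a dict stands in for the hash table H(u), the set of candidate flags is rebuilt for every batch for simplicity, and the names bvc_pll, query, and bfs_distances are hypothetical. Labels accepted in the same iteration are written only after all checks for that vertex, so they do not prune one another.

```python
from collections import deque

INF = float("inf")


def bvc_pll(adj, order, batch_size):
    """Label an undirected, unweighted graph batch by batch.

    adj:   dict mapping every vertex to a list of its neighbours
    order: all vertices, highest rank first
    Returns a dict v -> {hub: distance}.
    """
    rank = {v: i for i, v in enumerate(order)}
    labels = {v: {} for v in adj}
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        in_batch = set(batch)
        hub = {u: dict(labels[u]) for u in batch}  # H(u) for each batch vertex
        sent = {v: set() for v in adj}             # C(v); a set replaces the bit-vector
        delta = {}                                 # active vertices and their new labels
        for u in batch:
            labels[u][u] = 0
            hub[u][u] = 0
            delta[u] = [(u, 0)]
        while delta:
            # Scatter: forward new labels, at most once per (u, v) pair.
            inbox = {}
            for a, new_labels in delta.items():
                for v in adj[a]:
                    for u, d in new_labels:
                        if rank[u] < rank[v] and u not in sent[v]:
                            sent[v].add(u)
                            inbox.setdefault(v, []).append((u, d + 1))
            # Gather: keep (u, d) only if no hub already in L(v) gives a path <= d.
            delta = {}
            for v, messages in inbox.items():
                accepted = [
                    (u, d) for u, d in messages
                    if all(dw + hub[u].get(w, INF) > d for w, dw in labels[v].items())
                ]
                if accepted:
                    labels[v].update(accepted)
                    if v in in_batch:
                        hub[v].update(accepted)
                    delta[v] = accepted
    return labels


def query(labels, s, t):
    """2-hop distance: minimum over the hubs shared by L(s) and L(t)."""
    return min((d + labels[t][w] for w, d in labels[s].items() if w in labels[t]),
               default=INF)


def bfs_distances(adj, source):
    """Plain BFS, used only to check the labels in this example."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        a = frontier.popleft()
        for v in adj[a]:
            if v not in dist:
                dist[v] = dist[a] + 1
                frontier.append(v)
    return dist


# Small illustrative graph; vertices are ordered by degree.
adj = {0: [1, 2, 3], 1: [0, 2, 4], 2: [0, 1, 5], 3: [0, 6],
       4: [1, 5], 5: [2, 4, 6], 6: [3, 5]}
order = sorted(adj, key=lambda v: -len(adj[v]))
labels_batched = bvc_pll(adj, order, batch_size=3)
labels_sequential = bvc_pll(adj, order, batch_size=1)
```

Running the sketch with batch_size=1 processes one vertex at a time, which mirrors the observation above that BVC-PLL then behaves like the original PLL; on this toy graph the two runs can be compared label by label.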
Detailed Computational Cost Comparison
In the following, we provide an apples-to-apples computational cost analysis of BVC-PLL and PLL. Following Subsection 3.3, we focus on the cost of generating (sending) distance labels and the cost of distance checks.

Cost of Distance Label Generation: In BVC-PLL, each vertex u is sent to v exactly once. Together with Lemma 2 (the same set of vertices u reaches v), we observe:

Lemma 3. The time complexity of sending vertex label messages (u, d(u, v)) along the edges of graph G, given an order π, is the same for PLL and BVC-PLL.
Following Lemma 3, we obtain the following corollary. This is because the number of distance checks is equal to the total number of generated distance label messages, Σ_{u∈V} |reach(u)| (following the algorithm logic). Given this, let us focus on only those vertices added to L(v) in batch B_i, and denote this set as L_i(v). Next, we break the distance check cost on |L_i(v)| into two categories: 1) the positive distance check, which confirms the vertex u and adds it to the label of v; and 2) the negative distance check, which returns false and thus prunes the vertex u.

Figure 4 illustrates the key idea in the proof of Theorem 3. Assume that 9 vertices a, b, ..., i in one batch are added into L(v). In PLL labeling, the total distance check cost is 36 no matter in which order they are received (visualized as the area under the diagonal stairs). Now assume they arrive in three groups, as shown in Figure 4(a). Then in BVC-PLL, the total distance check cost is 3 × 3 + 3 × 6 = 27, a 25% reduction compared to PLL.
Theorem 3 essentially shows that BVC-PLL saves the intra-group cross-vertex comparisons in each batch: vertices that arrive at the same time have the same distance to vertex v and cannot prune one another.
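The 9-vertex example can be reproduced with a few lines of Python. The cost model below is an assumption read off the numbers in the example: each positive check is charged one unit per label of the current batch that is already in L(v), and labels arriving in the same group are not compared with one another. The function names are hypothetical.

```python
def pll_positive_cost(num_labels):
    """PLL adds labels one at a time: the k-th label added (k = 0, 1, ...)
    is checked against the k batch labels already in L(v)."""
    return sum(range(num_labels))


def bvc_positive_cost(group_sizes):
    """BVC-PLL: labels arriving in the same iteration are checked only
    against labels that arrived in earlier groups."""
    cost = 0
    already_added = 0
    for size in group_sizes:
        cost += size * already_added
        already_added += size
    return cost


pll = pll_positive_cost(9)          # 0 + 1 + ... + 8
bvc = bvc_positive_cost([3, 3, 3])  # 3*0 + 3*3 + 3*6
reduction = 1 - bvc / pll
```

When every group has size one, bvc_positive_cost reduces to the PLL cost, which matches the remark that a batch size of one recovers the original behavior.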
To compare the time complexity of PLL and BVC-PLL for the negative distance check, we introduce the following notation. For any vertex x and one of its labels u (u ∈ L_i(x)), we denote by ⟨x, u⟩ the subset:

⟨x, u⟩ = {v ∈ B_i : v ∈ L_i(y) for some y ∈ N(x), v ∉ L_i(x), π(u) < π(v) < π(x)}

Similarly, we define ⟨y, v⟩ for vertex y with its label v, v ∈ L_i(y):

⟨y, v⟩ = {u ∈ B_i : u ∈ L_i(x) for some x ∈ N(y), u ∉ L_i(y), π(u) < π(v)}

Theorem 4. (Negative Distance Check) In batch B_i, for the negative distance check, the time complexity saved by BVC-PLL compared with PLL is no higher than O(Σ_{x∈V} Σ_{u∈L_i(x)} |⟨x, u⟩|). The time complexity saved by PLL compared with BVC-PLL is no higher than O(Σ_{y∈V} Σ_{v∈L_i(y)} |⟨y, v⟩|).
Please refer to the Appendix for the proof. Theorem 4 does not identify a clear winner on the cost of the negative check. However, given the symmetric form of these two quantities, we conjecture that they should be close to one another. In Section 6, we confirm this experimentally. In addition, a negative distance check typically does not need to traverse the entire L(v) set. Indeed, the bit-parallel mechanism proposed in the original PLL paper [7] can provide almost O(1) pruning. Since the number of negative checks is the same for PLL and BVC-PLL, we expect their overall costs to be fairly close.

Putting It Together: Assuming that PLL and BVC-PLL have a similar cost for negative distance checks, BVC-PLL may theoretically have a smaller computational cost than PLL (due to the positive distance check), since both have the same cost of generating and sending distance labels! Furthermore, BVC-PLL is guaranteed to have a smaller memory access cost for graph topology than PLL, as it groups messages together for each edge access. Overall, it seems that BVC-PLL, an unexpected marriage between PLL and VC computation, can run faster than the original PLL sequentially and can also enjoy the scalability of the VC model! Indeed, Section 6 shows that it can be more than two times faster than PLL (both using one thread) on real-world graphs.
VARIANTS AND IMPLEMENTATION
Generalization
Directed Graphs: For a directed graph, each vertex v is assigned two labels, Lin(v) and Lout(v). VC-PLL and BVC-PLL can be easily extended to handle directed graphs by treating these two labels as separate computations. Specifically, in the Scatter function, the new labels δLin and δLout are sent out along the outgoing edges and incoming edges, respectively. In the Gather function, there are two message queues: one for candidate vertices in Lin, and another for those in Lout. The labels generated by this algorithm are canonical. The computational complexity analysis in Subsections 3.3 and 4.3 holds for directed graphs as well.

Weighted Graphs: Directly applying VC-PLL and BVC-PLL to weighted graphs (by changing d(u, v) + 1 to d(u, v) + w_e, where w_e is the edge weight) produces a 2-hop labeling, but it may not be a canonical labeling. This is because, unlike in unweighted graphs, the iteration number in the vertex-centric model is no longer in sync with the distance between two vertices. For instance, when vertex u reaches v in two iterations, the length of that path may be larger than that of a path via a higher-ranked vertex w, but w may take more than two iterations to reach v and u. Given this, we can no longer use the partial label L(u) at an arbitrary iteration to fully determine whether vertex v is a true (final) label for u. Thus, adding vertex v into u's partial label L(u) or δL(u) (using the partial labels in the weighted graph) may cause unnecessary vertices to be spread through the network. To deal with this problem, at the end of each batch (Line 22 in BVC-PLL), we can perform a distance recheck using only the labels from the batch. Since the hash tables of the labeling vertices in the batch are still in memory, this recheck can be quite efficient.
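For the directed case, the two labels are combined at query time in the standard 2-hop fashion: the distance from s to t is the minimum, over hubs w shared by Lout(s) and Lin(t), of d(s, w) + d(w, t). The Python sketch below illustrates this; the function name directed_query and the hand-built labels for the toy graph a → b → c (with b ranked highest) are assumptions made for this example, not output of the paper's implementation.

```python
INF = float("inf")


def directed_query(l_out, l_in, s, t):
    """Distance from s to t using Lout(s) and Lin(t).

    l_out[s] maps a hub w to d(s, w); l_in[t] maps a hub w to d(w, t).
    Returns INF when no hub is shared, i.e. t is unreachable from s.
    """
    out_s, in_t = l_out[s], l_in[t]
    return min((d + in_t[w] for w, d in out_s.items() if w in in_t), default=INF)


# Hand-built labels for the directed path a -> b -> c, hub order b, a, c.
l_out = {"a": {"a": 0, "b": 1}, "b": {"b": 0}, "c": {"c": 0}}
l_in = {"a": {"a": 0}, "b": {"b": 0}, "c": {"c": 0, "b": 1}}

dist_a_to_c = directed_query(l_out, l_in, "a", "c")
dist_c_to_a = directed_query(l_out, l_in, "c", "a")
```

Keeping the two label families in separate dictionaries mirrors the description above, where Lin and Lout are computed as separate computations with separate message queues.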
Implementation Issues
Hierarchical Parallelism: The BVC-PLL computation (as shown in Algorithm 4) is inherently parallel at the coarse-grained thread level. The computation of each batch uses vertex-centric processing (Lines 5 to 21), which consists of two parallel phases (Scatter and Gather) with an implicit synchronization between them. In each phase, each thread processes a chunk of active vertices, with dynamic scheduling to achieve load balance.
In addition, as mentioned above, BVC-PLL significantly increases the data locality of remote vertex memory access, which offers extra opportunities to exploit fine-grained data-level parallelism (i.e., SIMD parallelism or vectorization). More specifically, consider the Gather Phase in Algorithm 4, which involves an intensive label distance check kernel (Lines 13 to 17). BVC-PLL can vectorize this kernel with the help of the advanced SIMD gather/scatter and mask instructions in the latest AVX-512 intrinsic set (https://software.intel.com/sites/landingpage/IntrinsicsGuide/). Moreover, for weighted graphs, the distance recheck operation incurs extra overhead. Hierarchical parallelism is also applied to address this challenge; in particular, efficient SIMD parallelism significantly reduces the overhead of distance rechecks.

Integrated Bitmap and Queue: Much temporary data is generated for both labeling vertices and active vertices during the processing of each batch. This data must be cleared (e.g., Algorithm 4, Line 22). The cost of this clearing is significant because it occurs for each batch. Traditionally, either a bitmap or a queue is used to handle the set of active vertices. However, each becomes inefficient or insufficient for supporting BVC-PLL. For a bitmap, each clearing can take O(|V|), where |V| is the total number of vertices; a queue cannot efficiently check whether a given vertex is active. Given this, we propose a new traversal control data structure that combines the bitmap and the queue. The basic idea is that the bitmap supports fast recording and checking of visited vertices, while the queue supports fast finding and clearing of the visited vertices. Each time a vertex is processed, we add it to both the bitmap and the queue.
This approach differs from the bitmap and queue used in the push and pull strategy presented in [60, 73, 13] because we use both the bitmap and the queue simultaneously rather than in different stages of processing.

Bit-parallel Adoption: Similar to PLL [7], the bit-parallel technique is also adopted to accelerate distance checking in the implementation of BVC-PLL for unweighted graphs. Its construction is similar to multi-source BFS traversals and can be easily expressed in the vertex-centric computing model.
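The integrated bitmap and queue described earlier in this subsection can be sketched as follows. This is a minimal single-threaded illustration; the class name BitmapQueue is hypothetical, and one byte per vertex is used instead of one bit to keep the code short. The key point is that clearing walks only the queue, so its cost is proportional to the number of touched vertices rather than to |V|.

```python
class BitmapQueue:
    """Set of vertex ids in range(n) with O(1) membership tests and
    clearing cost proportional to the number of stored vertices."""

    def __init__(self, n):
        self.bitmap = bytearray(n)  # 1 marks a recorded vertex (one byte per vertex here)
        self.queue = []             # recorded vertices, in insertion order

    def add(self, v):
        """Record v in both structures; return False if it was already present."""
        if self.bitmap[v]:
            return False
        self.bitmap[v] = 1
        self.queue.append(v)
        return True

    def __contains__(self, v):
        return bool(self.bitmap[v])

    def __iter__(self):
        return iter(self.queue)

    def clear(self):
        """Reset only the entries that were touched, then empty the queue."""
        for v in self.queue:
            self.bitmap[v] = 0
        self.queue.clear()


active = BitmapQueue(10)
active.add(3)
active.add(7)
active.add(3)  # duplicate, ignored
```

After these calls, iterating over active yields 3 and 7, membership tests cost one array lookup, and clear() resets exactly two bitmap entries.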
EVALUATION
In this section, we perform a detailed evaluation of BVC-PLL, focusing on the following questions: 1) How does BVC-PLL perform against the original PLL in a sequential setting (single thread; no parallelism)? The theoretical analysis indicates it may run faster, and we conduct experiments to confirm this. 2) How does BVC-PLL scale as the number of threads increases? 3) What is the runtime breakdown of BVC-PLL and, more specifically, how does the theoretical cost analysis (positive and negative distance checks, and memory access for graph topology) align with experiments on real-world graphs? 4) How does the weighted extension of BVC-PLL perform, and how does it fare against ParaPLL [62] (the state-of-the-art parallel weighted PLL algorithm)?
Experimental Setup
Platform: We perform all experiments on an Intel Xeon Gold 6138 CPU, a Skylake processor with 20 cores running at 2.0 GHz that supports 512-bit AVX-512 intrinsics, with 27.5 MB of L3 cache and 192 GB of DDR4 memory shared among all cores. All code is compiled with the Intel icc compiler (version 19.0.2.187) with the -O3 optimization option. Hyper-threading is not used, to simplify the analysis of the experimental results.

Graph Datasets: The 10 graphs used in our evaluation are characterized in Table 2. These graphs are all unweighted. To test the performance of BVC-PLL on weighted graphs, we randomly assign weights (from 1 to 7, with a uniform distribution) to their edges. Since we only evaluate algorithms for undirected graphs, we have transformed the edges of the directed graphs in Citation and Hyperlink into undirected edges.

Benchmarks: For the sequential performance comparison on unweighted graphs, we compare BVC-PLL against the PLL implementations by the original authors [7] and by [48]. We found that these two implementations provide comparable performance, with the former being slightly faster. Given this, we report only the results of the former below. For the scalability comparison on weighted graphs, we compare the weighted BVC-PLL against the implementation of ParaPLL [62]. For the vertex order, we adopt the original and most popular method, in which the vertices are ordered by degree [7, 48].

Batch Size of BVC-PLL: Throughout the experiments, we use a batch size of 1024 for unweighted graphs and 512 for weighted graphs. In general, we observe that a larger batch size gives better performance, as long as memory can accommodate it. On our experimental platform, we found these two values to be the optimal batch sizes. Due to space limitations, we do not report performance results with respect to batch size below.
BVC-PLL vs PLL and Scalability
Table 3 shows the performance comparison between BVC-PLL as a sequential algorithm and PLL (both using a single thread and no other parallelism, such as SIMD) on all graphs. Both algorithms use the same vertex order and produce the same label size, as expected. Interestingly, BVC-PLL consistently outperforms PLL, with speedups ranging from 1.15X (YOUTUBE) to 2.46X (GNUTELLA and HOLLYWOOD) and an average speedup of 1.58X over PLL. This observation is consistent with our theoretical analysis in Subsection 4.3. In the next subsection, we perform a more detailed cost breakdown and comparison. Figure 5 shows the scalability of BVC-PLL on all graphs. Figure 5a shows its speedup over 1-thread BVC-PLL, while Figure 5b shows its speedup over the original sequential PLL. With 20 threads, BVC-PLL achieves up to 14.71X and 33.11X speedup over its 1-thread version and PLL, respectively, demonstrating good scalability.
In addition, by comparing Figure 5 with the average label size of each vertex in Table 3, we found that BVC-PLL generally scales better as the average label size increases. For example, GNUTELLA and HOLLYWOOD, with the largest average label sizes, show the best scalability, while WIKITALK, with the smallest, shows the worst. The label size provides a good indication of the total computational cost (message passing and distance checks) involved for each vertex. The better scalability at larger label sizes is consistent with the computing scalability of multi-core architectures.
Understanding the Performance
Figure 6a shows the overall running time breakdown on two graphs: GNUTELLA and TREC WT10G. Due to space limitations, we report only these two; the trends are similar on the other graphs. We can see that the Gather and Scatter phases dominate the overall computational cost. In addition, within Gather, the distance check takes about 60%-80% of the Gather phase and 30%-40% of the overall time. Table 4 shows the theoretical computational distance check cost, L(v), as defined in Subsection 4.3. We can see that the total cost of the positive distance checks of BVC-PLL is strictly smaller than that of PLL, by a factor ranging from 1.03 to 2 and 1.23 on average. Also, the theoretical negative distance check costs of the two algorithms are indeed very close to each other, thus experimentally confirming our conjecture. Figure 6b shows the total number of edge accesses for BVC-PLL and PLL on two graphs: GNUTELLA and TREC WT10G. We can see that BVC-PLL achieves 5 and 18 times reductions on these two graphs! This also confirms our theoretical analysis of the reduced memory access for graph topology. Finally, Figure 7 shows the LLC (last-level cache) miss rate and miss access count for the whole labeling process of BVC-PLL and PLL. Again, we can see that BVC-PLL has a consistently lower LLC miss rate and access count than PLL!
Extension to Weighted Graphs
A similar performance study is conducted between PLL and BVC-PLL for weighted graphs. To evaluate the sequential performance of weighted BVC-PLL against PLL, we modified the original PLL implementation as suggested in [7], changing its BFS traversal to Dijkstra's algorithm. We also extended BVC-PLL as described in Subsection 5.1. Note that both PLL and BVC-PLL are optimized with SIMD for the weighted version (for the unweighted version, we also implemented them with SIMD, but observed no obvious change in speedup).

Table 5 shows the comparison results for the 1-thread SIMD and non-SIMD versions of PLL and BVC-PLL. In all non-SIMD tests, PLL consistently performs better than BVC-PLL, while in most SIMD tests, BVC-PLL outperforms PLL. This is because the weighted BVC-PLL introduces additional distance checks (due to additional message passing) and rechecks, which significantly increase the number of instructions for BVC-PLL and degrade its performance. However, SIMD parallelism is a good remedy that can significantly reduce the number of instructions. It should be noted that BVC-PLL is able to exploit SIMD parallelism effectively because its data locality has been improved (see the performance analysis in the previous subsection). In particular, for the SIMD version, BVC-PLL outperforms PLL on 7 out of 10 graphs, resulting in 1.14X to 1.92X speedup with an average of 1.34X. In the slowdown cases, BVC-PLL's performance is degraded by at most around 10%. Note that BVC-PLL can further exploit hierarchical parallelism to extract the most out of the massive parallelism of modern processors.

Figure 8 shows the scalability of BVC-PLL on all weighted graphs: Figure 8a shows its speedup over 1-thread BVC-PLL, while Figure 8b shows its speedup over PLL. With 20 threads, BVC-PLL achieves up to 13X and 16X speedup over its 1-thread version and PLL, respectively, demonstrating good scalability.
Finally, we compare BVC-PLL with the state-of-the-art ParaPLL, a parallel PLL algorithm for weighted graphs. Unfortunately, it can only run on small graphs (this is consistent with what is reported in the original paper [62]). Figure 9 shows the performance comparison of BVC-PLL and ParaPLL on the graph GNUTELLA (the only graph on which we were able to run ParaPLL, as it fails with a segmentation fault on the other graphs). For this graph, we can see that BVC-PLL is in general more than one order of magnitude faster than ParaPLL (even for the non-SIMD version)!
RELATED WORK
Online and Parallel Shortest Path Distance Computation: The standard (single-source) shortest path computation method is Dijkstra's algorithm [26] for weighted graphs and Breadth-First Search (BFS) traversal for unweighted graphs. There have been many efforts to design parallel Dijkstra and BFS algorithms [56, 49, 47]. In particular, some recent studies focus on performing multi-source or concurrent BFS on modern multi-core or GPU architectures [75, 50]. However, it remains challenging to answer shortest path distance queries using these approaches because of the large traversal space in large graphs.

Shortest Path Computation on Road Networks: Computing shortest paths on road networks has been widely studied [41, 42, 72, 67, 11, 34, 31, 66, 69, 45, 30, 74, 12, 4, 2, 3, 6, 58], and has been applied successfully in industry practice. A more detailed review of this topic can be found in a recent survey [10]. We note that the effectiveness of these approaches relies on essential properties of road networks (they are almost planar, have low vertex degree, are weighted, are spatial, or have a hierarchical structure), and they may not apply to scale-free complex networks, such as social and web graphs [33, 58].

Theoretical Distance Labeling and Hop-based Labeling: There have also been several studies on estimating the distance between any two vertices in large (social) networks [52, 22, 33, 84, 85, 61]. These methods fall within the group of distance labeling [28], where the goal is to assign each vertex u a label (for instance, a set of vertices and the distances from u to each of them) and then estimate the shortest path distance between two vertices using the assigned labels. The pioneering 2-hop labeling method by Cohen et al. [21] provides exact distance labeling on directed graphs.
However, numerous efforts over a ten-year period [70, 18, 37, 19, 2, 29, 69] largely failed to make 2-hop labeling practical on large real-world graphs until the discovery of Pruned Landmark Labeling (PLL) [7]. In the past few years, a number of studies [24, 48] have further validated and confirmed the scalability of this approach. The idea has also been extended to road networks [6] and out-of-core graph labeling [36]. Another direction of research involves the use of tree decomposition for shortest path distance computation [80], and particularly its use for hop-based labeling [81, 16, 58]. There are also efforts that relax the distance computation to focus on cases where the distance is smaller than a certain threshold (useful for querying social networks) [17, 38].

Others: For related work on vertex-centric computation, please refer to Subsection 2.2. For recent progress on general parallel (VC-type) graph algorithms on modern computing architectures, please refer to [25, 65].
CONCLUSION
In this paper, we proposed VC-PLL, which, to the best of our knowledge, is the first scalable parallelization of Pruned Landmark Labeling (PLL) that produces the same result as the sequential method. We achieved this by mapping the algorithm to a vertex-centric model. We also introduced a new batched execution mechanism for VC-PLL to better support message filtering and remote memory access. Based on the new model, we designed the BVC-PLL algorithm, which, surprisingly, can run faster than the original PLL as a sequential algorithm (demonstrated through both theoretical analysis and experimental validation). As far as we can tell, this is the first VC graph algorithm that can inherently run faster than its original counterpart even without parallelism. Our experimental results further demonstrate the parallel efficiency and scalability of BVC-PLL and show its superiority over the recent ParaPLL algorithm on weighted graphs (using a straightforward extension of BVC-PLL). In future work, we plan to investigate how to optimize BVC-PLL on weighted graphs and how to extend it to out-of-core graphs. We also plan to investigate the possibility of applying the cost-saving mechanism of BVC-PLL to other graph algorithms.

d(u, v)) messages added into the Q (Line 7 in Algorithm 1). In VC-PLL, it corresponds to all the (u, d(u, v)) messages being sent to vertex v (Line 5 in Algorithm 3). Thus, ⋃_{u∈V} {u} × reach(u) is the set consisting of all pairs (u, v) for distance check. In PLL and VC-PLL, a vertex u is assigned to the same subset of vertices (Corollary 1). It is also sent to the same set of vertices that do not use u as a label. Thus, the set ⋃_{u∈V} {u} × reach(u) is the same for both. □

Theorem 4 Proof Sketch: To quantify the difference between the time complexities of the two algorithms, we focus on the cases where one algorithm can save computational cost because L_i(v) differs at the time of the distance check d(u, v).
For the first case, consider a vertex x with a label u ∈ L_i(x). Now, consider any vertex v ∈ B_i that reaches vertex x for a distance check and returns a negative result. If v can reach x, it must be a label of a neighbor y of x, i.e., v ∈ L_i(y), y ∈ N(x), and v ∉ L_i(x) (false distance check). When v reaches x, it also has lower rank than u but higher rank than x: π(u) < π(v) < π(x). Given this, for PLL, u is already in L(x); however, for BVC-PLL, v can reach x before u reaches x. Thus, this case introduces a gain for BVC-PLL, and such v is characterized and recorded in the set ⟨x, u⟩.
For the second case, consider a vertex y with a label v ∈ L_i(y). Now, consider any vertex u ∈ B_i that reaches vertex y for a distance check and returns a negative result. If u can reach y, it must be a label of a neighbor x of y, i.e., u ∈ L_i(x), x ∈ N(y), and u ∉ L_i(y) (false distance check). When u reaches y, it also has higher rank than v: π(u) < π(v). Given this, for BVC-PLL, v is already in L(y); however, for PLL, u can reach y before v is added into L(y). Thus, this case introduces a gain for PLL, and such u is characterized and recorded in the set ⟨y, v⟩. □
