Abstract-The Network-on-Chip (NoC) interconnect of future multi-processor systems-on-a-chip (MPSoCs) needs to be efficient in terms of energy and delay. In this paper, we propose a topology synthesis algorithm based on the shortest-path Steiner arborescence (hereafter called ATree). The concept of temporal merging is applied to allow communication flows that do not overlap in time to share the same network resource. For scalability and power minimization, we build a hybrid network that consists of routers and buses. We evaluate our ATree-based topology synthesis methodology by applying it to several benchmarks and comparing the results with two existing NoC synthesis algorithms [1], [2]. The experimental results show a significant reduction in the power-latency product. The power-latency product of the topology synthesized by our ATree-based algorithm is 47% and 51% lower than that of [1], and 10% and 17% lower than that of [2], for the cases without and with shared buses, respectively.
I. INTRODUCTION
MPSoCs are integrating more and more processors on a single die as transistor budgets grow with Moore's law. As the number of processor cores on a single die increases, power consumption and wire delay also increase significantly, which makes on-chip communication among cores a critical issue. A scalable, energy-efficient on-chip interconnect network is needed to address these difficulties and to facilitate on-chip communication.
In this paper we investigate topology synthesis for on-chip interconnect networks. The limitations of previous work are discussed in Section II. To overcome these limitations, we propose an ATree-based [3], [4] algorithm, named AT_NOC, that synthesizes the topology through a series of refinement steps starting from an initial topology in which every pair of communicating nodes is connected by a link. In each refinement, we pick the router with the highest power consumption among the unrefined routers and refine the sub-topology of this router to reduce power consumption while keeping packet latency low. Many previous topology synthesis approaches use a Steiner-tree-based topology construction (such as [2]). We use an ATree instead of a Steiner tree because an ATree is a shortest-path tree, which reduces packet latency. With the ATree-based algorithm, in the sub-topology of the chosen router, the paths from all nodes to the chosen router are shortest paths. Additional routers might be added on these shortest paths at some Steiner points. The locations of these added routers are chosen to minimize a cost function that considers both power and packet latency. In the cost function, the power of routers with different port counts is obtained from ORION [5], [6]. To predict the packet latency between any two nodes, we build a latency prediction model by noting that the latency between two routers is related to the injection rates of packets at the routers [7]. The latency prediction model is built on a latency-injection rate table, which is constructed by a simple simulation: for different injection rates, we randomly inject packets into the wire and measure the average packet latency. By looking up this latency-injection rate table, we can easily predict the packet latency under different traffic.
In our topology synthesis approach, we apply the concept of temporal merging [8]. We observe that communication traces that do not occur at the same time can share the same resource. A shared bus can be used to connect nodes whose communication does not overlap in time. For the example shown in Fig. 1(a), assume that the application execution is divided into three phases, and that the three numbers in parentheses denote the traffic volume in the three phases. Then the communication traces from node 1, node 2, and node 3 to node 4 do not overlap, nor do the communications from node 5 to node 6 and node 7. Without temporal merging, existing topology synthesis approaches (such as [1]) may generate the topology shown in Fig. 1(b) with three routers, of which R1 has four ports and R3 has three ports. With temporal merging, we can connect node 1, node 2, and node 3 with a shared bus, and connect node 6 and node 7 with another shared bus. We can then synthesize the topology shown in Fig. 1(c), in which both R1 and R3 have two ports. Compared to the synthesized topology in Fig. 1(b), the topology in Fig. 1(c) has lower power consumption. Given these observations, communication traces should be implemented by a shared bus as much as possible to save power. To preserve performance, the overlapping working period of communication traces that share the same resource should be minimized.
This paper is organized as follows. The related work and problem formulation are given in Sections II and III. An ATree-based topology synthesis method is then proposed in Section IV to minimize the combination of total power and packet latency. In Section V, the experimental results are described to show the effectiveness of our algorithm. Our conclusion and future work are provided in Section VI.
II. RELATED WORK
A custom on-chip network, which targets a given application, was shown in [2] to be more efficient than a regular-structure on-chip network. The reason is that the communication requirement for each data flow is available at design time, so power consumption and packet latency are predictable once the links of the network are determined. This knowledge makes custom on-chip networks built by topology synthesis more efficient, as shown in [1], [2], [7]-[12]. Among those, [1], [2], and [9] use a partition-based algorithm to reduce the time complexity.
Both [1] and [2] use decomposition and clustering methodologies to find the best partition of traces. For each trace partition, [1] uses a K-way merge to construct the on-chip network topology, while [2] uses a Steiner tree engine to build the topology. The solution spaces of both [1] and [2] are limited by the implementation structure that they choose. A K-way partition plus K-way merge forces the maximum hop count to be 2, while using an existing Steiner tree package limits the location of routers to the Steiner points selected by that package. Because such packages optimize for wire length, this usually results in a highly congested common backbone. The work in [9] presents a min-cut partition-based algorithm to group processing elements, and uses cross-group traffic as the edge cost. The work in [10] uses linear programming techniques to solve the topology synthesis problem. The work in [13] provides a system-level approach for on-chip networks that gives the user the most suitable network designs tailored to their performance requirements and power/area constraints. The design space of NoC configurations in [13] includes only several fixed topologies such as torus, mesh, ring, and fat tree, which might lose the opportunity to improve the performance of the on-chip network.
Hybrid architecture is also used in custom on-chip networks. In [8] , the authors observe that the power consumption of a crossbar is mainly determined by its size. By introducing the concept of temporal non-overlapping, the communication traces, which do not occur at the same time, can share the same resource. However, the topology generated by the work in [8] is limited to crossbars and buses only. In [11] and [12] , the topologies are constructed on top of a standard mesh structure. [11] adds extra long-range links to reduce the average packet latency, while [12] uses radio frequency interconnect (RFI) to simplify the topology. The works in [11] and [12] do not consider bus implementation in their algorithms.
Srinivasan et al. [7] proposed a three-phase topology synthesis method consisting of performance-aware floorplanning, core-to-router mapping, and routing of communication traces. The target of [7] is to reduce power consumption under packet latency constraints. Our approach instead optimizes power consumption and packet latency simultaneously. The work in [7] formulated the routing of communication traces as a variation of the rectilinear Steiner arborescence problem and modeled it as a linear program. In [7], the locations of the routers are already determined before the communication trace routing is solved. In our algorithm, by contrast, we determine the communication trace routing and the locations of the routers simultaneously.
This paper presents an ATree-based topology synthesis approach. The design space of synthesized topologies that we explore is larger than that of previous work. As discussed above, previous approaches such as CosiNoC [1] and the Rectilinear-Steiner-Tree-based algorithm [2] limit the structure of synthesized topologies to a K-way merge or to sets of rectilinear Steiner trees, which might lose the opportunity to synthesize a better topology in terms of power and packet latency. In our ATree-based algorithm, we iteratively refine the topology by constructing the sub-topology of the selected router in an ATree [3], [4] fashion, which is more flexible. In this way, we can achieve power-efficient topologies.
III. PROBLEM FORMULATION
Given the communication graph among processing elements and the locations of these elements, the problem of topology synthesis is to determine the number of routers, the locations of newly added routers, and the connectivity between them. Power consumption and packet latency are two competing factors that we need to consider when synthesizing the topology. The problem can be defined as follows:
Objective Function: 
Here, α is a user-controllable parameter for the trade-off between power consumption and packet latency, as we will discuss in Section V.B. Power(T) is the total power consumption of topology T, which includes the static and dynamic power of routers, links, and buses. The total packet latency of all communication flows is expressed by Latency(T, G_trace, Route), which is calculated as the sum of the packets' latencies. The packet latency is determined by 1) the route of the packet, and 2) the traffic volume of the communication traces on the route, which can be obtained from the routing table.
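As an illustration only, one weighted-sum form that is consistent with the description of α (a larger α favoring lower power, per Section V.B) would be the following. The exact form, any normalization of the two terms, and the range of α are assumptions made for this sketch, not the formula stated by the paper.

```latex
\min_{T,\ \mathit{Route}} \;\; \alpha \cdot \mathrm{Power}(T)
  \;+\; (1-\alpha)\cdot \mathrm{Latency}(T,\, G_{trace},\, \mathit{Route}),
  \qquad 0 \le \alpha \le 1
```

Under this assumed form, α = 1 would optimize power alone and α = 0 latency alone.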
IV. AT_NOC SYNTHESIS ALGORITHM
Since the topology synthesis problem was proven to be NP-hard in [1], we propose an ATree-based algorithm that synthesizes the topology through a series of refinement steps starting from an initial topology in which every pair of communicating nodes is connected by a link. The initial topology has minimum packet latency but large power consumption due to the large port counts of the routers (we assume every processing element has a local router). In our initial topology, no routers are added besides the local routers. In each refinement, we gain a power reduction at the expense of longer latency, so the power consumption after each refinement is lower than after the previous one.
A router contributes more power consumption than other components in an on-chip network, such as links and buses. Minimizing the power consumption of the on-chip network therefore needs to focus on reducing the power of the routers. The power consumption of a router is proportional to the size of the router. As a result, the objective of refining the topology becomes reducing the number of input/output ports of routers. Therefore, as shown in Algorithm 1, in each refinement we focus on optimizing a sub-topology that consists of one high-radix router and its neighborhood.
The overview of our topology synthesis method is described in Algorithm 1. We start from an initial network topology which has good performance and is discussed in Section IV.A. Then, we go through an iterative improvement. In line 2, we select a router that has the highest cost among the un-refined routers. We refine the sub-topology, which consists of the selected router and its neighborhood in line 4. Then, we label this selected router as refined in line 7. The refinement continues until all the routers are already refined or the topology meets the stopping criteria of power reduction. Since the area of the routers is much smaller than the area of cores [14] , we ignore the impact of the router area in our algorithms. 
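The loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's Algorithm 1: the callbacks `cost`, `refine`, and `good_enough` are hypothetical stand-ins for the router cost, RefineSubTopology, and the power-reduction stopping criterion, and routers added during refinement are not modeled.

```python
def synthesize(routers, cost, refine, good_enough):
    """Iteratively refine the highest-cost unrefined router.

    routers:       iterable of router ids
    cost(r):       hypothetical cost of router r (e.g. its power)
    refine(r):     hypothetical hook that rewires the sub-topology around r
    good_enough(): hypothetical stopping test on power reduction
    """
    unrefined = set(routers)
    order = []
    while unrefined and not good_enough():
        r_sel = max(unrefined, key=cost)  # pick the costliest unrefined router
        refine(r_sel)                     # refine its sub-topology
        unrefined.discard(r_sel)          # label it as refined
        order.append(r_sel)
    return order


# Hypothetical router power values, used only to show the selection order.
power = {"R1": 3.0, "R2": 1.0, "R3": 5.0}
print(synthesize(power, power.get, lambda r: None, lambda: False))
# ['R3', 'R1', 'R2']
```

The returned list is only there to make the selection order visible; a real implementation would return the refined topology.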
A. Initial Topology Construction
As mentioned in the previous section, our algorithm starts from an initial topology that is constructed from the communication graph, that is, every pair of communicating nodes is connected by a link. For each node v in the communication graph, the initial topology contains a corresponding node v' and a local router r_v' connected by a wire (v', r_v'). For each edge e(u, v) in the communication graph, the initial topology uses three links e(u', r_u'), e(r_u', r_v'), and e(r_v', v') to carry this flow. An example is shown in Fig. 2. Each node in the initial topology has a local router. A communication flow from node 1 to node 3 in Fig. 2(a) has a corresponding path: node 1, router R1, router R3, node 3, as shown in Fig. 2(b).
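The construction can be sketched in a few lines. This is an illustrative sketch that assumes directed links and represents the local router of node v as the tuple `("R", v)`; the function name `initial_topology` and the example edge list are hypothetical.

```python
def initial_topology(comm_edges):
    """Build the initial topology from directed communication edges (u, v).

    Each node v gets a local router ("R", v); each flow u -> v becomes the
    three links u -> R_u, R_u -> R_v, R_v -> v.
    """
    links = set()
    for u, v in comm_edges:
        ru, rv = ("R", u), ("R", v)
        links.add((u, ru))   # source node to its local router
        links.add((ru, rv))  # direct router-to-router link for this flow
        links.add((rv, v))   # destination router to destination node
    return links


# Hypothetical flows 1->3, 1->4, 4->3.
links = initial_topology([(1, 3), (1, 4), (4, 3)])
print((("R", 1), ("R", 3)) in links)  # True
print(len(links))                     # 7
```

Shared node-to-router links are stored once because the links are kept in a set.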
B. Refine Sub-Topology of a Selected Router
After the initial topology is constructed, it is refined by the function RefineSubTopology. The purpose of this function is to refine the topology by removing some existing edges of the selected router. Once we remove an edge, all the traces that originally went through this edge must be rerouted. In other words, if an alternative route for an edge is found, this edge can be eliminated. In this way, we can lower the number of ports by re-routing a direct path between two nodes. Take the example shown in Fig. 2(b): router R1 has direct links to routers R3 and R4, and router R4 has a direct link to router R3. Ignoring the latency penalty, it is better to re-route all the routes on edge (R1, R3) over edge (R1, R4) and edge (R4, R3), because the number of output ports of router R1 decreases by one, as does the number of input ports of router R3. Based on this observation, it is natural to separate the input and output edges when we consider re-routing, because an outgoing path cannot replace an incoming edge. In the function RefineSubTopology, we refine the topology around the selected router r_sel by refining the sub-topology of its fanin edges and the sub-topology of its fanout edges separately. In other words, two sub-topologies are constructed: one for the fanin edges of r_sel and one for the fanout edges of r_sel. For the topology shown in Fig. 2(b), the sub-topology of router R3 with only incoming edges is shown in Fig. 3(a). In Fig. 3(a), the root node is R3 and the terminal nodes are R1 and R4.
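The effect of the re-route in the Fig. 2(b) example on port counts can be illustrated with a small sketch. The helper names `fanin`, `fanout`, and `reroute` are hypothetical, links are modeled as directed pairs, and latency is ignored, as in the example above.

```python
def fanin(links, r):
    """Routers with a direct link into r."""
    return {u for u, v in links if v == r}


def fanout(links, r):
    """Routers that r has a direct link to."""
    return {v for u, v in links if u == r}


def reroute(links, edge, via):
    """Remove edge (a, b) if the detour a -> via -> b already exists."""
    a, b = edge
    if (a, via) in links and (via, b) in links:
        return links - {edge}
    return links


links = {("R1", "R3"), ("R1", "R4"), ("R4", "R3")}
after = reroute(links, ("R1", "R3"), "R4")
print(sorted(fanin(links, "R3")), sorted(fanin(after, "R3")))
# ['R1', 'R4'] ['R4']
```

After the re-route, R3 has one fewer input and R1 one fewer output, which is the port saving the text describes.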
Using the ATree algorithm proposed in [4], we compute a rectilinear minimum Steiner arborescence that connects the root node to all terminal nodes. The motivation for adopting the rectilinear minimum Steiner arborescence is to minimize both the total wire length and the distance from the root node to each terminal node. In this way, we can simultaneously reduce wire power and packet latency. The rectilinear minimum Steiner arborescence for Fig. 3(a) is shown in Fig. 3(b).
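The defining shortest-path property of an arborescence can be checked mechanically. The sketch below is not the ATree construction of [4]; it only verifies that, in a given tree, every node's path to the root has length equal to its Manhattan distance to the root. The coordinates and the parent map are hypothetical, and each tree edge is assumed to be routed rectilinearly with length equal to the Manhattan distance between its endpoints.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def is_arborescence(parent, pos, root):
    """True if every node's tree path to root is a rectilinear shortest path."""
    for node in parent:
        length, cur = 0, node
        while cur != root:
            nxt = parent[cur]
            length += manhattan(pos[cur], pos[nxt])
            cur = nxt
        if length != manhattan(pos[node], pos[root]):
            return False
    return True


# Hypothetical placement: st1 is a Steiner point between the terminals and R3.
pos = {"R3": (0, 0), "st1": (2, 1), "R1": (2, 4), "R4": (5, 1)}
parent = {"st1": "R3", "R1": "st1", "R4": "st1"}
print(is_arborescence(parent, pos, "R3"))  # True
```

A tree in which R1 reaches the root through R4 would fail this check for the same coordinates, because that path is a detour.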
DEFINITION 2
A Steiner point candidate of a sub-topology satisfies the following properties: 1) it is an intersection point of the sub-topology; 2) it is located on the rectilinear minimum Steiner arborescence. In Fig. 3(a), only st1 is a Steiner point candidate.
The algorithm of refining sub-topology goes through the following steps:
1) Refine Sub-Topology by Using ATree-Based Method
The details of refining a sub-topology are shown in Algorithm 2. The first step of the function RefineSubTopology is to build the sub-topology of the selected router as discussed in Definition 1. The set N contains all terminal nodes in the sub-topology. The set SV contains all the Steiner point candidates, the terminals, and the root node. In line 3, we sort all the vertices in SV in descending order of their distance to the root node r_sel. Then we build sub-problem solutions by traversing each vertex. In the end, we reconstruct the topology by back-tracing the sub-problem solution of the root node. For all the routes that go through root vertices, our method explores possible new routes in order to reduce power consumption and improve performance. The possible routes might consist of original links, new links between local routers and newly added routers at Steiner point candidates, and new shared buses that combine original links or new links. Our algorithm explores these possibilities and chooses the one with minimal cost. When the final topology is determined, the route for each communication flow is also determined. The functions used to calculate cost(v, W) are given below. In Function 2, cost(v, W) is the minimum cost among the possible topologies that connect v to all terminals in W.
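The traversal order used by the dynamic program can be sketched as a single sort. This only illustrates the ordering step, not Algorithm 2 itself; the coordinates are hypothetical and the distance is assumed to be the Manhattan distance to the root.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def traversal_order(sv, pos, root):
    """Sort vertices by descending distance to root, so the root comes last."""
    return sorted(sv, key=lambda v: manhattan(pos[v], pos[root]), reverse=True)


# Hypothetical coordinates for the root R3, Steiner candidate st1, and terminals.
pos = {"R3": (0, 0), "st1": (2, 1), "R1": (2, 4), "R4": (4, 1)}
print(traversal_order(["R3", "st1", "R1", "R4"], pos, "R3"))
# ['R1', 'R4', 'st1', 'R3']
```

Processing vertices in this order means that every vertex farther from the root has already been evaluated when a vertex is reached.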
DEFINITION 3 A vertex v is reachable from u if v is in one of

Given v and W, we need to determine Sp(v, W). Sp(v, W) is a set of nodes such that, for every node n in Sp(v, W), every terminal node in W has a (directed) path to v through node n.
[Equations: Function 1, TopoCost(v, C); Function 2, cost(v, W), a minimum over candidate topologies]
Fig. 4 demonstrates how to calculate cost(v, W) for the sub-topology of R3 shown in Fig. 3. Let v = R3; then W = {R1, R4} and Sp(v, W) = {st1, R4, R3}. There are three possible topologies: 1) R1 and R4 are connected to st1, and st1 is connected to R3; 2) R1 and R4 are directly connected to R3; 3) R1 is connected to R4, and R4 is connected to R3. In the first case, st1 could be a newly added router that reduces the number of input ports of R3. Once we add st1 as a new router, we might have higher traffic on the link between st1 and R3, which is used to re-route all the paths that are on R1→R3 and R4→R3. In the second case, R1 and R4 are directly connected to R3. The packet latency will be smaller since there is no contention between packets from R1 to R3 and packets from R4 to R3. However, the port count of R3 is larger, which increases the power of R3. In the third case, all the packets on edge (R1, R3) are rerouted over edge (R1, R4) and edge (R4, R3), which reduces the port count of R3 but increases the traffic on the link between R4 and R3. Therefore, we not only consider the power reduction but also calculate the performance degradation, which is captured in Function 1.
Function 1, TopoCost(v, C), evaluates the cost of the case in which v connects with all vertices in C. The cost is calculated as a combination of power consumption and total packet latency. In Function 1, v is the node where we place a router, and C is the set of vertices that have direct links to v. Therefore, we can easily obtain the number of input/output ports, which determines the power consumption of the router. The latency cost is the estimated packet latency. To predict the packet latency between any two connected nodes, we build a latency prediction model by noting that the latency between two routers is related to the injection rates of packets at the routers [7].
We construct a latency-injection rate table by running simulations. By looking up this latency-injection rate table, we can easily predict the packet latency based on the traffic of the direct links between v and C.
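A table lookup of this kind might look as follows. The table values are hypothetical placeholders, not measured data, and the linear interpolation between table entries is an assumption of this sketch; the text only states that the table is looked up.

```python
from bisect import bisect_left

# Hypothetical (injection rate, average latency in cycles) samples, sorted by rate.
TABLE = [(0.0, 2.0), (0.25, 3.0), (0.5, 5.0), (1.0, 11.0)]


def predict_latency(rate, table=TABLE):
    """Linearly interpolate the latency for `rate`, clamping to the table range."""
    rates = [r for r, _ in table]
    if rate <= rates[0]:
        return table[0][1]
    if rate >= rates[-1]:
        return table[-1][1]
    i = bisect_left(rates, rate)          # first entry with rate >= query
    (r0, l0), (r1, l1) = table[i - 1], table[i]
    return l0 + (l1 - l0) * (rate - r0) / (r1 - r0)


print(predict_latency(0.75))  # 8.0
```

In this sketch the traffic on the direct links between v and C would first be converted to an injection rate and then passed to `predict_latency`.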
In Algorithm 2 we traverse each vertex in SV in order of distance to the root node. Therefore, when evaluating the cost for v, the costs of all of v's predecessors have already been evaluated (because v's predecessors have a longer distance to the root node). In other words, our algorithm constructs solutions from the bottom up and eliminates duplicated calculation by dynamic programming. The time complexity of Algorithm 2 will be discussed in Section IV.D. In the following sections, we propose some heuristics to reduce the time complexity.
2) Refine Sub-Topology by Region
From Fig. 4, we can see that as the number of terminals in W increases, the number of possible topologies can become large. To reduce the complexity, we partition the edges of a router into four quadrants according to the locations of the terminals. The groups are southeast, northwest, southwest, and northeast. Because our algorithm is ATree-based [3], [4], an edge in one group cannot find an alternative re-routing path using edges in a different group, except for edges that are shared by two groups. For example, a horizontal edge on the east may be shared by the northeast and southeast groups. Therefore, we propose an algorithm that first refines the sub-topology of each group and then refines the whole sub-topology of the router. This heuristic reduces the time complexity without losing quality.
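The quadrant grouping can be sketched as below. This is an illustrative sketch only: the function names and coordinates are hypothetical, x is assumed to grow eastward and y northward, and a terminal lying on an axis is placed in both adjacent groups to mirror the shared-edge case mentioned in the text.

```python
def quadrant(root, p):
    """Return the quadrant labels of point p relative to root.

    Points on an axis belong to both adjacent quadrants.
    """
    dx, dy = p[0] - root[0], p[1] - root[1]
    ew = ["E"] if dx > 0 else ["W"] if dx < 0 else ["E", "W"]
    ns = ["N"] if dy > 0 else ["S"] if dy < 0 else ["N", "S"]
    return {n + e for n in ns for e in ew}


def partition(root, terminals):
    """Group terminal names by quadrant around the root."""
    groups = {"NE": [], "NW": [], "SE": [], "SW": []}
    for name, p in terminals.items():
        for q in quadrant(root, p):
            groups[q].append(name)
    return groups


g = partition((0, 0), {"A": (2, 3), "B": (-1, 4), "C": (5, 0)})
print(g)  # {'NE': ['A', 'C'], 'NW': ['B'], 'SE': ['C'], 'SW': []}
```

Terminal C sits due east of the root, so it appears in both the northeast and southeast groups.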
3) Sub-Topology Pruning
We propose a heuristic method to further reduce the number of possible topologies in each quadrant by introducing the concept of "dominance relation."
DEFINITION 4 For a vertex v, one predecessor v_a of v is said to be dominant over another predecessor v_b if the transmission cycle of the direct link between v_a and v is the same as the transmission cycle of the direct link between v_b and v, and v_b is closer to v than v_a.
We calculate the transmission cycle between two nodes by using ORION [5], [6]. By definition, for a vertex v, a dominant node v_a is the farthest node that v can reach in t cycles. We call v_a t-cycle dominant. Let v_b be a node that is not t-cycle dominant but also takes t cycles to reach from v. Let u be a terminal node such that the transmission cycle of the direct link between u and v is larger than t cycles. Since v_a is the farthest node that v can reach in t cycles, the distance from u to v_b must be larger than the distance from u to v_a. Both v_a and v_b take t cycles to v. Therefore, the time needed to transmit a packet from v to u via v_a should be less than that from v to u via v_b. Consequently, compared with non-dominant nodes, choosing a dominant node as a stop on the re-routing path gives a smaller latency. Considering only these dominant nodes reduces the number of possible topologies without degrading the solution quality, which also reduces the complexity of our topology synthesis algorithm.
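The pruning rule of Definition 4 can be sketched as keeping, for each transmission-cycle count, only the farthest predecessor. The function name, the input layout, and the numbers are hypothetical; in the paper the cycle counts come from ORION.

```python
def dominant_predecessors(preds):
    """preds maps a predecessor name to (cycles, distance) relative to vertex v.

    For each transmission-cycle count t, keep only the farthest predecessor.
    """
    best = {}
    for name, (t, dist) in preds.items():
        if t not in best or dist > preds[best[t]][1]:
            best[t] = name
    return sorted(best.values())


# Hypothetical predecessors of v: a and b take 1 cycle, c and d take 2 cycles.
preds = {"a": (1, 4.0), "b": (1, 2.5), "c": (2, 7.0), "d": (2, 6.0)}
print(dominant_predecessors(preds))  # ['a', 'c']
```

Nodes b and d are dropped because a and c are farther from v at the same cycle count.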
C. Temporal Merging with Shared Buses
Because a shared bus is useful for reducing network resources, we use it in our topology refinement methodology by making slight changes to Algorithm 2. In Function 2, we evaluate cost(v, W) by assuming that links are used to connect vertex v and all vertices in W. To support a shared-bus architecture, when evaluating cost(v, W) we also consider merging several direct links into shared buses. As discussed in Section I, shared buses can be used to connect nodes whose communication does not overlap in time. By using shared buses, we can reduce the power consumption, which lowers the value of cost(v, W). An example is shown in Fig. 5, where R3 is the root node and R1 and R4 are connected to R3 by direct links. By analyzing the communication graph G_trace, we find that the traffic from R1 and from R4 does not overlap in time. In Fig. 5(b), we replace the direct links from R1 and R4 to R3 with a shared bus, which might lower the value of cost(v, W).
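One simple way to find links that can be merged is a greedy grouping over per-phase traffic volumes. This is a sketch under assumptions, not the paper's procedure: the phase-based traffic representation follows the Fig. 1 example, while the greedy first-fit strategy, the function names, and the volumes are hypothetical.

```python
def overlap(a, b):
    """True if two links both carry traffic in at least one common phase."""
    return any(x > 0 and y > 0 for x, y in zip(a, b))


def group_into_buses(traffic):
    """Greedily place each link into the first bus whose members it never overlaps."""
    buses = []
    for link, phases in traffic.items():
        for bus in buses:
            if not any(overlap(phases, traffic[m]) for m in bus):
                bus.append(link)
                break
        else:
            buses.append([link])  # no compatible bus found: start a new one
    return buses


# Hypothetical traffic volumes in three execution phases.
traffic = {"R1->R3": (6, 0, 0), "R4->R3": (0, 3, 0), "R5->R3": (2, 0, 1)}
print(group_into_buses(traffic))
# [['R1->R3', 'R4->R3'], ['R5->R3']]
```

A group with a single member would stay a direct link; whether a merged group actually lowers cost(v, W) still has to be decided by the cost function.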
D. Complexity Analysis
In this section, we analyze the complexity of our algorithm as shown in Algorithm 1. Assume the number of processing elements on the chip is n. Each processing element has a local router, so the number of local routers is n. The routers in the synthesized topology include the local routers and the routers added at the selected Steiner points. Since our AT_NOC algorithm is based on ATree, we can expect the number of added routers to be less than the number of local routers. So the total number of routers in the synthesized topology is Θ(n), as is the number of iterations of the while-loop. In the while-loop, the function RefineSubTopology, shown in Algorithm 2, is executed. The main part of RefineSubTopology is the for-loop, which executes Θ(n) iterations. In each iteration, cost(v, W) is calculated. The complexity of computing cost(v, W) is proportional to the size of Sp(v, W). The maximum size of Sp(v, W) is the number of terminals and Steiner points, which is Θ(n). So the complexity of calculating cost(v, W) is Θ(n). Multiplying these three Θ(n) factors gives the following theorem.
Theorem 1 The complexity of our ATree-based algorithm is Θ(n^3), where n is the number of processing elements on a chip.
We compare the runtime of our algorithm with those of [1] and [2] . The results are shown in Section V. The results show that the runtime of our ATree-based algorithm is acceptable.
V. EXPERIMENTAL RESULTS
A. Experimental Setup
In our experiments the processing elements include 8 cores with L1 caches, 8 L2 cache banks, and 8 memory interfaces. Each element has a local router to distribute the outgoing packets. Many previous research efforts have explored multiple router implementations. The single-stage router was proposed in [15] . Since we are focusing on the topology synthesis, we would like to measure the impact of the synthesized topology on power and packet latency. So, the router implementation used in this paper is a single-stage router, which might minimize the impact of the router on power and packet latency.
The benchmarks that we use are from PARSEC [16] and medical imaging [17] including denoising, registration, and segmentation. All of our benchmarks are written in parallel fashion. Since the processing elements have 8 cores, the benchmarks are parallelized with 8 threads. For each benchmark, we run the parallel program on SIMICS [18] to obtain the communication graph. In the communication graph, each processing element is considered a node. The communication graph contains the information of all communication messages. The information includes the size of the message, the source node of the message, the destination node of the message, the time when the message is sent out. With the information of all communication messages, we can obtain the traffic volume between any pair of nodes in the communication graph. Then, we use the communication graph as input for the floorplanner to locate all the processing elements. The floorplanner that we use is Parquet [19] , [20] . With the communication graph and locations of processing elements, we use our ATree-based topology synthesis method to generate the topology. After topologies are synthesized, we use ORION [5] , [6] to measure the total power consumption of the synthesized on-chip network. The power consumption includes the static power and the dynamic power of routers, links, and buses based on 45-nm technology. For wires and buses, the power and the latency are measured by using the model in [21] . Based on the synthesized topologies, we run the benchmarks on a trace-driven cycle-accurate simulator to measure the average packet latency.
To evaluate the effectiveness of our ATree-based topology synthesis algorithm, two topology synthesis methodologies are used as baselines. The first one is the released version of CosiNoC, which can be downloaded from [22] . The second one is the Rectilinear-Steiner-Tree (RST) based algorithm proposed in [2] with our own implementation.
B. The Trade-Off between Power and Latency
We use the benchmark Blackscholes as an example to show the trade-off between power and packet latency of the topologies synthesized by our algorithm. From Fig. 6, we can clearly see that power and packet latency trade off against each other, and that the topology with buses performs better than the topology without buses at every design point. This shows that using shared buses can effectively reduce the packet latency. The trade-off between power consumption and packet latency is set by the user through the parameter α, as mentioned in Section III. Increasing the value of α leads to lower power consumption and higher packet latency, and vice versa. By tuning α, we can obtain either lower packet latency or lower power consumption.
C. Comparisons on the Results
We compare the topologies of different benchmarks synthesized by four different methods: the algorithm proposed in CosiNoC [1], the Rectilinear Steiner Tree (RST) based algorithm proposed in [2], and the algorithms proposed in this paper with and without a shared bus. Table I shows the power consumption and average packet latency of the topologies synthesized by these four topology synthesis engines. The total power consumption of the synthesized topology is measured in watts. The average packet latency is measured in cycles. Table I shows that our algorithm performs better than CosiNoC and the RST-based algorithm on all benchmarks in terms of power consumption. Without shared buses, our ATree-based algorithm achieves on average 32% and 37% reduction in power over CosiNoC and the RST-based algorithm, respectively. With shared buses, our ATree-based algorithm achieves on average 36% and 41% reduction in power over CosiNoC and the RST-based algorithm, respectively. For packet latency, our ATree-based algorithm achieves 22% and 23% reduction over CosiNoC for the case without shared buses and the case with shared buses, respectively. Our ATree-based algorithm has a larger average packet latency than the RST-based algorithm. The explanation is that in the RST-based algorithm, the traffic flows are partitioned into several groups. For each group, a Rectilinear-Steiner-Tree-based network topology is used to connect the communicating modules in the group. The solutions of the RST-based algorithm might therefore comprise multiple custom network topologies, so the RST-based algorithm has higher power consumption and lower packet latency.
To show the effectiveness of our algorithms, we compute the power-latency product over all the benchmarks, as shown in Table I . Fig. 7 shows that our ATree based algorithm has a significant reduction in terms of the power-latency product. Without considering shared buses, our ATree-based algorithm achieves an average 47% reduction and 10% reduction over CosiNoC and the RST-based algorithm, respectively. By using shared buses, our ATree-based algorithm achieves an average 51% and 17% reduction over CosiNoC and the RST-based algorithm, respectively.
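The power-latency product improvement can be computed as a relative reduction against a baseline. As a rough consistency check, the sketch below plugs in the average power and latency reductions over CosiNoC reported in this section, with the baseline normalized to 1; since a product of averages is not the same as an average of per-benchmark products, this is only an approximation of how the Table I figures are obtained, and the function name is hypothetical.

```python
def pl_improvement(p_base, l_base, p_new, l_new):
    """Relative reduction of the power-latency product versus a baseline."""
    return 1.0 - (p_new * l_new) / (p_base * l_base)


# Versus CosiNoC, without buses: 32% less power, 22% less latency.
print(round(pl_improvement(1.0, 1.0, 1 - 0.32, 1 - 0.22), 2))  # 0.47
# Versus CosiNoC, with buses: 36% less power, 23% less latency.
print(round(pl_improvement(1.0, 1.0, 1 - 0.36, 1 - 0.23), 2))  # 0.51
```

Both values agree with the 47% and 51% average reductions over CosiNoC quoted above.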
The execution times of these four topology synthesis engines over all benchmarks are shown in Fig. 8 . For all benchmarks, CosiNoC has the shortest execution time, and the RST-based method has the longest execution time. The execution time of our ATree-based method is between the execution time of CosiNoC and the RST-based method. For most benchmarks, the execution time of the ATree-based method without shared buses is smaller than the execution time of the ATree-based method with shared buses. The explanation for this is that when using shared buses to connect the nodes that are not temporal overlapping, we need to analyze the communication graph, which will take extra time.
VI. CONCLUSION AND FUTURE WORK
The experimental results show that our algorithm performs better than CosiNoC and the Rectilinear-Steiner-Tree-based method in most of the cases studied. The likely explanations are: 1) the solution space of our algorithm is not limited by a fixed type of structure; 2) a power reduction is guaranteed after each refinement; 3) our algorithm is ATree-based, which results in smaller latency.
Future work includes enhancing the latency prediction model. In this paper we use traffic volume to predict the packet latency. It is difficult to capture the complexity of a real traffic pattern. We believe that in order to have an accurate latency prediction model, we need to study the traffic patterns in the communication graph.
Notes (Table I): P stands for power (unit is watt). L stands for latency (unit is cycle). PL stands for power-latency product. PLI stands for power-latency product improvement.
