ABSTRACT New network services such as the Internet of Things and edge computing are accelerating the increase in traffic volume, the number of connected devices, and the diversity of communication. Next-generation carrier network infrastructure should be much more scalable and adaptive to the rapid increase and divergence in network demand, at much lower cost. A more virtualization-aware, flexible, and inexpensive system based on general-purpose hardware is necessary to transform the traditional carrier network into a more adaptive, next-generation network. In this paper, we propose an architecture for carrier-scale packet processing that is based on interleaved 3-dimensional (3D)-stacked dynamic random access memory (DRAM) devices. The proposed architecture enhances memory access concurrency by leveraging vault-level parallelism and bank interleaving of 3D-stacked DRAM. The proposed architecture uses hash-function-based distribution of memory requests to each set of a vault and a bank, each of which holds a significant portion of the full carrier-scale tables. We introduce an analytical model of the proposed architecture for two traffic patterns: one with random memory request arrivals and one with bursty arrivals. By using the model, we calculate the performance of a typical Internet protocol routing application as a benchmark of carrier-scale packet processing wherein main memory accesses are inevitable. The evaluation shows that the proposed architecture achieves around 80 Gbps for carrier-scale packet processing involving both random and bursty request arrivals.
A carrier network must accommodate a large number of subscribers across extremely wide areas such as a whole country, i.e. the carrier-scale network. Moreover, multiple grades and types of network services are required to support each subscriber's communication demand, which makes a carrier network more complex than typical data center networks.
Regarding packet processing performance, the current COTS server architecture has a fatal bottleneck: main memory access is degraded by the frequent cache misses caused by the insufficient CPU cache size. The poor main memory access capability of x86 CPU-based COTS servers is the dominant barrier to carrier-scale packet processing performance, as discussed in our previous study [11].
This paper proposes a packet processing architecture that realizes carrier-scale applications without any dedicated hardware. The proposed architecture uses Hybrid Memory Cube (HMC), a type of 3-dimensional (3D)-stacked Dynamic Random Access Memory (DRAM), to achieve acceptable memory access performance. We introduce an analytical model of the proposed architecture for two traffic patterns wherein the memory requests are either random or bursty.
As an example of carrier-scale applications, we evaluate the performance of Internet Protocol (IP) address table lookup. The table is held in the HMC instead of CPU cache memory. The evaluations show that the proposed architecture can achieve around 80 Gbps for both random arrival of requests and bursty arrival of requests in carrier-scale packet processing, in which main memory access is inevitable, since CPU cache memory is insufficient to accommodate the huge tables. This proposed architecture achieves both high performance and versatility for carrier network virtualization.
This paper is an extended version of the work in [12] . We detail the background of DRAM and 3D-stacked DRAM devices such as HMC and High Bandwidth Memory (HBM). We extensively describe an analytical model of our proposed architecture. We detail the states to describe the proposed architecture and the state transitions and we formulate the equilibrium equations for random arrivals of requests. We consider bursty arrivals of requests where memory requests concentrate on a particular partial table. We detail the states and the state transitions and we formulate the equilibrium equations for bursty arrivals of requests. We present extensive performance evaluations of the proposed architecture based on our analytical model. We describe the related work on packet processing and present a direction for expanding our analytical model to a general case.
The rest of this paper is organized as follows. Section II provides the background of this work. Section III presents the proposed architecture and its modeling. Sections IV and V provide analyses of the proposed architecture for random and bursty request arrivals, respectively. Section VI presents performance evaluations of the proposed architecture. Section VII describes related work. Section VIII describes a direction in which to expand our analytical model. Finally, Section IX concludes this paper. 
II. BACKGROUND
A. DRAM MEMORY SYSTEM IN COTS SERVERS
The DRAM memory system in COTS servers consists of a memory controller and memory devices as shown in Figure 1 . The memory controller handles memory access requests from requestors such as CPUs or Direct Memory Accesses (DMAs) to read the data from memory devices or write the data to memory devices. Note that the memory controller logic is usually integrated inside the latest generation of CPUs. Memory controller and memory devices are connected by a command bus and a data bus. Both buses are accessible in parallel, which means that one requestor can use the command bus while another requestor uses the data bus at the same time. However, no more than one requestor can use the same bus simultaneously.
Modern DRAM systems have a Dual Inline Memory Module (DIMM) interface with multiple channels which allows requestors to access multiple DIMMs simultaneously using multiple command bus and data bus units. Note that multiple DIMMs might be attached to a channel to share the buses in the channel among the DIMMs, which means that the DIMMs attached to the channel cannot be accessed at the same time. A DIMM is organized into ranks, and only one rank can be accessed at a time. We only consider DRAMs with only one rank for simplicity.
VOLUME 7, 2019
Each rank consists of multiple DRAM chips. Furthermore, each DRAM chip comprises banks that can be accessed in parallel if there are no collisions on either bus. Each bank has a row buffer and an array of storage cells organized in rows and columns as shown in Figure 2. Requestors can only access the content of the row buffer, not the data in the storage array. To access a specific memory location, the row that contains the desired data must be loaded into the row buffer by an Activate command. When the controller wishes to load a different row, the current row buffer has to be written back to the array by a Precharge command in advance. The actual Read or Write commands only handle the data in the row buffer. A row that is cached in the row buffer is usually referred to as an open row. On the other hand, a row that is not cached in the row buffer is considered a closed row. Figure 3 shows the schematic diagram of DRAM bank interleaving. By issuing read commands to open rows in one bank after another, the banks can be interleaved to increase memory access performance with no additional hardware modification.
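The row-buffer behavior described above can be sketched as a small state machine. This is only an illustration under simplifying assumptions: it ignores all timing parameters, and the class name `Bank` and the command strings are hypothetical names chosen for this sketch, not part of any DRAM interface.

```python
class Bank:
    """Toy model of one DRAM bank with a single row buffer (hypothetical names)."""

    def __init__(self):
        self.open_row = None  # no row is cached in the row buffer initially

    def read(self, row):
        """Return the command sequence needed to read from `row`."""
        commands = []
        if self.open_row != row:
            if self.open_row is not None:
                # A different row is open: write it back before switching.
                commands.append("PRECHARGE")
            commands.append("ACTIVATE")  # load the requested row into the row buffer
            self.open_row = row
        commands.append("READ")  # reads only touch the row buffer
        return commands


bank = Bank()
first = bank.read(3)   # closed row: the row must be activated first
second = bank.read(3)  # open row: the read is served from the row buffer
third = bank.read(7)   # another row is open: precharge, then activate
```

Reads to an open row need a single command, which is why issuing reads to open rows across several banks in turn keeps the data bus busy.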
B. 3D-STACKED DRAM
3D-stacked DRAM is a memory device that vertically stacks traditional DRAM devices by using Through Silicon Via (TSV) technology. There are several commercially available 3D-stacked DRAM devices such as Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM). Since 3D-stacked DRAMs are based on general-purpose DRAMs, they remain general-purpose devices while offering higher performance. The next level in performance is achieved by expensive dedicated devices such as Ternary Content Addressable Memory (TCAM), which is the de facto choice for network search engines [13], [14].
HMC comprises several DRAM layers on top of the bottom layer, the logic base [15] . Figure 4 shows the schematic structure of HMC. The vertical units called vaults correspond to memory channels in the traditional DRAM memory system, and are accessible in parallel. Inside a vault, each DRAM layer has several DRAM banks as with traditional DRAMs.
HBM also has several DRAM layers, channels, banks and a logic layer, similar to HMC [16] . The major differences between HMC and HBM are the number of channels, banks and memory bus width. HMC has more channels and banks than HBM, which means that HMC has more units that can be accessed in parallel. HBM has a wider memory bus, which makes it better-suited for applications such as graphic or image processing [17] , [18] and deep neural networks [19] , while HMC has packet-based high-speed serial links. Table 1 describes the device specifications of DRAM, HMC, and HBM, as available at the time of writing [15] , [16] , [20] . We can see that HMC has the most channels and banks per device. Therefore, we use HMC to accommodate the huge tables used by carrier-scale networks.
III. PROPOSED ARCHITECTURE AND MODELING
A. PROPOSED ARCHITECTURE
Figure 5(a) shows the schematic view of our proposed architecture. It consists of a multicore CPU, an FPGA, an HMC, a DRAM and network interfaces. Although the proposed architecture can have several CPUs, FPGAs, HMCs, and DRAMs, we describe and model the proposed architecture depicted in Figure 5(a) for simplicity.
Basically, incoming packets are processed as follows. (1) Packets entering the network interfaces are directly sent to and buffered in the DRAM by using DMA. (2) A CPU core reads the header information of a packet buffered in the DRAM and looks up tables held in the HMC to determine the next action for the packet. A memory access request is generated for each packet. (3) After finishing lookup and determining the next action, the CPU core sends the packet outside the proposed architecture via the network interface that corresponds to the action.
A key to the proposed architecture is step (2) above. The HMC must hold huge numbers of table entries. The original table is divided into partial tables, each held inside a set of a vault and a bank, so that the original table comprises the partial tables in a vault. Then, the whole table data in a vault is copied to the other vaults. The number of partial tables equals the number of banks in each vault, and the number of copies equals the total number of banks of the HMC.
An FPGA is used to connect the CPUs to the HMC as well as to distribute, using a hash function, memory requests to the appropriate vault/bank sets. The FPGA contains two custom circuits: the distributor and an HMC controller. The CPU and the FPGA are linked via inter-chip connections such as Intel Quick Path Interconnect (QPI) or Ultra Path Interconnect (UPI), which are now practical given recent CPU + FPGA device architectures [21]-[23].
As shown in Table 1 , an HMC has up to 32 channels. However, the conventional system architecture such as the latest generation Intel Xeon CPU has up to only six channels [24] . In addition, the conventional system architecture has little room to increase the number of channels for DRAM devices due to complex electrical wiring between a CPU and DRAM devices. Therefore, the major advantage of the proposed architecture over the conventional systems is the number of memory channels, which enhances packet processing performance by simultaneously accessing partial tables through multiple vaults of the HMC. The proposed architecture is based on common characteristics among DRAM devices. Thus other types of DRAM devices such as HBM can be used to accommodate tables in the proposed architecture.
B. SYSTEM MODELING
We describe the behavior of the proposed architecture. When a memory request enters the distributor, the hash function in the distributor classifies the request into one of N queues by using packet information, such as the destination IP address. The calculation in this hash function is as simple as classifying the result of a logical operation on some bits of the packet header, which usually finishes in one clock cycle in an FPGA. Requests entering queue n, where n ∈ [1, N], are served in a first-in first-out (FIFO) manner. Queue n has S servers, each of which corresponds to a vault. The sth server for queue n, where s ∈ [1, S], is denoted by server (n, s). The maximum number of requests that can be accommodated, including all the queues and servers, is K, where K ≥ NS. A request entering the table lookup subsystem is blocked if the number of requests already being handled by the subsystem is K. The packet generating the blocked request is discarded. The memory resources are shared by the N queues under the condition that the total number of accommodated requests in the subsystem does not exceed K. In the worst case, K − S requests are waiting for service in one particular queue and S requests are served by the corresponding S servers. The memory access rate by queue n to bank n at vault s is the service rate of server (n, s); each server serves one request at a time. When more than one server at vault s, each of which corresponds to a different bank, is active, that is, when more than one request is being served at the vault, bank interleaving is performed among them. Otherwise, no bank interleaving is performed. When bank interleaving is performed using w banks at vault s, we call the interleaving w-degree bank interleaving; no bank interleaving is performed when w = 1.
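The admission behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the FPGA circuit: the class name `Distributor`, its methods, and the choice of hash (the destination address modulo N) are assumptions made for this sketch, and queues are indexed from 0 rather than from 1.

```python
from collections import deque


class Distributor:
    """Toy hash-based distributor with N FIFO queues and a shared capacity K."""

    def __init__(self, num_queues, capacity):
        self.queues = [deque() for _ in range(num_queues)]
        self.capacity = capacity  # K: maximum number of requests in the subsystem
        self.in_service = 0       # requests currently held by servers
        self.dropped = 0          # packets discarded because of blocking

    def queue_index(self, dst_ip):
        # Assumption: the hash keeps only low-order information of the
        # destination address; the text only states that it is a simple
        # logical operation on some header bits.
        return dst_ip % len(self.queues)

    def occupancy(self):
        return self.in_service + sum(len(q) for q in self.queues)

    def enqueue(self, dst_ip):
        """Admit a request unless K requests are already in the subsystem."""
        if self.occupancy() >= self.capacity:
            self.dropped += 1  # the packet that generated the request is discarded
            return None
        n = self.queue_index(dst_ip)
        self.queues[n].append(dst_ip)
        return n


d = Distributor(num_queues=2, capacity=3)
results = [d.enqueue(ip) for ip in (10, 11, 12, 13)]
```

In this usage example the fourth request finds K = 3 requests already in the subsystem, so it is blocked regardless of which queue it hashes to, which mirrors the shared-capacity rule of the model.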
We describe our analytical model. Each server is in one of three states: idle, busy without bank interleaving, and busy with bank interleaving. A server in idle state does not serve any request. A server in busy state without bank interleaving serves a request without bank interleaving. A server in busy state with bank interleaving serves a request with bank interleaving. When at least one server in idle state exists for queue n, the request at the head of the line is served according to the following server selection rule. If there is any server in idle state that would move to busy state without bank interleaving, it is selected. Otherwise, the server in idle state that would move to busy state with the least degree of bank interleaving is selected. Figure 6(a) shows a state transition diagram for each server. When server (n, s) in idle state serves a request, the state moves to busy state with w-degree bank interleaving so as to minimize the value of w. When server (n, s) in busy state with w-degree bank interleaving finishes serving a request and does not serve any new request, it enters idle state. When server (n′, s) in idle state starts to serve a request, server (n, s) (n ≠ n′) in busy state with w-degree bank interleaving moves to busy state with (w + 1)-degree bank interleaving. When server (n′, s) finishes serving a request and does not serve any new request, server (n, s) (n ≠ n′) in busy state with w-degree bank interleaving enters busy state with (w − 1)-degree bank interleaving.
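The server selection rule can be written compactly by observing that preferring a server without interleaving is the same as minimizing the resulting degree w. The following sketch assumes a boolean occupancy matrix `busy` and a tie-break toward the lowest vault index; both the function name `select_server` and the tie-break are assumptions of this illustration, and indices start at 0.

```python
def select_server(busy, n):
    """Pick a vault for the head-of-line request of queue n.

    busy[m][s] is True when server (m, s), i.e. bank m at vault s, is serving
    a request. Returns (s, w), where w is the interleaving degree at vault s
    once the request is started, or None if queue n has no idle server.
    """
    best = None
    for s, is_busy in enumerate(busy[n]):
        if is_busy:
            continue
        # Degree of interleaving at vault s if this server starts serving.
        w = 1 + sum(1 for m in range(len(busy)) if m != n and busy[m][s])
        if best is None or w < best[1]:
            best = (s, w)  # ties go to the lowest vault index (assumption)
    return best


# N = 2 banks, S = 3 vaults: bank 0 is busy at vault 0, bank 1 at vault 1.
busy = [[True, False, False],
        [False, True, False]]
choice = select_server(busy, 0)
```

Here vault 2 is chosen for queue 0 because it is the only idle server that can start without bank interleaving (w = 1).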
We assume that requests arrive at the table lookup subsystem following a Poisson arrival process with an average rate of $\lambda$, and the distributor based on a hash function distributes the requests among the N queues. Therefore, requests are assumed to arrive at each queue following a Poisson arrival process with an average rate of $\lambda/N$. We assume that the service time of server (n, s) follows an exponential distribution with an average service rate of $\mu_w$ for w-degree bank interleaving, where
A theoretical model can provide an accurate performance evaluation of the proposed architecture. The results from the theoretical model can also serve as a reference for those from a simulator. In this paper, we focus on analyzing the case of N = 2, as it is the simplest case that includes bank interleaving. The introduced methods can be used to build analysis models for other cases with N > 2. This paper is the first work that builds two queueing models to analyze the performance of the proposed architecture under two types of traffic models. For each traffic model, we describe all feasible states of the system with the proposed architecture and analyze the transitions between them for the case of N = 2. Figure 6(b) shows a state transition diagram for each server. When server (n, s) in idle state serves a request, the state moves to busy state with or without bank interleaving. When server (n, s) in busy state with bank interleaving finishes serving a request and does not serve any new request, it enters idle state. When server (n′, s) in idle state starts to serve a request, server (n, s) (n ≠ n′) in busy state without bank interleaving moves to busy state with bank interleaving. When server (n′, s) finishes serving a request and does not serve any new request, server (n, s) (n ≠ n′) in busy state with bank interleaving moves to busy state without bank interleaving.
IV. ANALYSIS OF PROPOSED ARCHITECTURE FOR RANDOM ARRIVAL OF REQUESTS
A. STATES FOR SUBSYSTEM DESCRIPTION
We describe the analytical model of the proposed architecture with N = 2 to analyze its performance. Since we assume a Markov process for request arrivals and service times with and without bank interleaving, a state in the subsystem is expressed by (i, j, p), where i ∈ [0, K] is the number of requests for bank 1, j ∈ [0, K] is the number of requests for bank 2, and p ∈ [0, S] is the number of requests being served with 2-interleaving for both banks. The service rates for requests being served without memory interleaving and with 2-interleaving are different. There can be several states with the same (i, j) but different p, for each of which the outgoing transfer rates to the states with (i − 1, j) or (i, j − 1) due to the termination of service of a request depend on the corresponding number of requests being served with 2-interleaving for both banks. Therefore, p must be included to identify a state. Since the memory resources are shared by queues 1 and 2, i + j ≤ K holds.
We describe all feasible states to derive the number of states. The states are divided into three cases according to the values of i and j, two of which are further divided into several sub cases considering the range of p. In case 1, i, j, and S are not equal to each other. In case 2, only two of them are equal. In case 3, all of them are equal. $\gamma_1$, $\gamma_2$, and $\gamma_3$ denote the number of feasible states for case 1, case 2, and case 3, respectively. Γ denotes the total number of feasible states in the subsystem, where $\Gamma = \gamma_1 + \gamma_2 + \gamma_3$.
In case 1, i, j, and S are not equal to each other (i ≠ j, i ≠ S, j ≠ S). As the range of p depends on i, j, i + j, and S, case 1 is divided into three sub cases, case 1a, case 1b, and case 1c, which depend on the range of i. $\gamma_1^a$, $\gamma_1^b$, and $\gamma_1^c$ denote the number of feasible states for case 1a, case 1b, and case 1c, respectively.
In case 1a, i ∈ [0, ⌊S/2⌋], where ⌊x⌋ denotes the maximum integer that does not exceed x. When i = 0, we have j ∈ (0, S) ∪ (S, K] and p = 0. Therefore, there are
Therefore, the total number of feasible states for case 1 is given by
In case 2, only two of them are equal. There are six sub cases, which are i = j < S for case 2a, S < i = j for case 2b, i < j = S for case 2c, j = S < i for case 2d, j < i = S for case 2e, and i = S < j for case 2f. In case 3, all of i, j, and S are equal (i = j = S). p is always equal to S and there is just one feasible state for case 3, that is, $\gamma_3 = 1$.
Therefore, by summing all the number of states for each case, the total number of feasible states in the subsystem is given by,
Based on the discussion on feasible states in the subsystem, we obtain the range of p, whose lower bound is min(max(0, i + j − S), i, j, S).
Figure 7 shows the state transitions incoming to and outgoing from state (i, j, p), with eight incoming and eight outgoing transitions. Table 2 describes the rate and condition for each transition. We number the cases from 1 to 16.
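The case analysis of feasible states can be cross-checked by brute-force enumeration. The sketch below uses the lower bound of p stated in the text and assumes min(i, j, S) as the upper bound, since a request pair can be interleaved only at a vault where both banks are busy; the upper bound and the function name `feasible_states` are assumptions of this sketch. Under these assumptions the count for S = 32 and K = 100 agrees with the 10607 states reported in Section VI.

```python
def feasible_states(S, K):
    """Enumerate states (i, j, p) for N = 2 with i + j <= K."""
    states = []
    for i in range(K + 1):
        for j in range(K + 1 - i):
            lower = min(max(0, i + j - S), i, j, S)  # lower bound from the text
            upper = min(i, j, S)                     # assumed upper bound
            for p in range(lower, upper + 1):
                states.append((i, j, p))
    return states


gamma = len(feasible_states(S=32, K=100))
```

When either queue holds at least S requests, the two bounds coincide and p is fully determined by (i, j), so only the states with i < S and j < S contribute more than one value of p.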
C. EQUILIBRIUM STATES
Let P(i, j, p) be the probability that the subsystem is in state (i, j, p). Let U be the set of states (i, j, p), where i ∈ X, j ∈ X, and p ∈ Y(i, j). In the equilibrium state, the total incoming flows to state (i, j, p) are equal to the total outgoing flows from state (i, j, p). The equilibrium equations for (i, j, p) ∈ U are given by
$$(q_1 + q_2 + q_3 + q_4 + q_5 + q_6 + q_7 + q_8)\, P(i, j, p) = q_9 P(i - 1, j, p) + q_{10} P(i - 1, j, p - 1) + \cdots \quad (7)$$
where $q_c$, c ∈ [1, 16], equals the transfer rate of case c if the conditions of case c are satisfied and 0 otherwise. The condition that the sum of all state probabilities equals one is given by
$$\sum_{(i, j, p) \in U} P(i, j, p) = 1. \quad (8)$$
We can compute the probability P(i, j, p) of each state (i, j, p) ∈ U by solving the simultaneous equations (7) and (8).
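Solving equilibrium equations together with the normalization condition amounts to solving a linear system. The following sketch shows the idea on a generic small Markov chain with dense Gaussian elimination; it is not the solver used for the reported results, the function name `stationary_distribution` and the `rates` dictionary format are assumptions, and a model with about ten thousand states would in practice call for a sparse or iterative solver.

```python
def stationary_distribution(rates, num_states):
    """Solve the balance equations of a small continuous-time Markov chain.

    rates maps (src, dst) -> transition rate. One balance equation is replaced
    by the normalization condition so that the system has a unique solution.
    """
    n = num_states
    # Row d of A encodes: inflow into state d minus outflow from d equals zero.
    A = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rates.items():
        A[dst][src] += rate
        A[src][src] -= rate
    b = [0.0] * n
    A[n - 1] = [1.0] * n  # normalization: probabilities sum to one
    b[n - 1] = 1.0

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]

    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (b[r] - tail) / A[r][r]
    return x


# Two-state example: rate 2 from state 0 to 1, rate 1 from state 1 to 0.
pi = stationary_distribution({(0, 1): 2.0, (1, 0): 1.0}, 2)
```

For the two-state example the balance equation 2·π0 = 1·π1 and the normalization give π = (1/3, 2/3).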
D. BLOCKING PROBABILITY AND AVERAGE WAITING TIME
We define the blocking probability $P_b^R$ as the probability that a request incoming to the table lookup subsystem is blocked under the condition i + j = K, that is, the request is not able to enter the queue. $P_b^R$ is given by
We define the average waiting time at the subsystem, $W^R$, as the average duration from when a request enters the subsystem until the request exits the subsystem. The average number of requests in the subsystem, $L^R$, is given by
The first and second terms on the right-hand side of the first equality indicate the average number of requests waiting at or being served from queue 1 and queue 2, respectively. The right-hand side of the second equality is derived by using $\sum_{i \in X} \sum_{j \in X} \cdots$
By using Little's formula [25], $W^R = L^R / \lambda$. Let $\lambda_e^R$ be the throughput, defined by $\lambda_e^R = \lambda (1 - P_b^R)$, and let $W_e^R$ be the average effective waiting time.
V. ANALYSIS OF PROPOSED ARCHITECTURE FOR BURSTY ARRIVAL OF REQUESTS
For bursty arrivals, requests follow an interrupted Poisson process (IPP) with ON and OFF states (Figure 8). We assume that, in the ON state, packets are consecutively destined to the same bank until the ON state finishes. Let k ∈ [0, N] denote the state of the arrival of a packet; k is set to n ∈ [1, N] when the arrival process is in the ON state in which packets are destined to bank n, and to zero otherwise. Consequently, for the table lookup subsystem with N = 2, a state in the subsystem is expressed as (i, j, p, k).
The total number of feasible states of (i, j, p, k) is 3Γ, where Γ, given by Equation (6), is the total number of feasible states of (i, j, p). In IPP, for each (i, j, p), there are three states with k = 0, 1, 2: (i, j, p, 0), (i, j, p, 1), and (i, j, p, 2).
B. STATE TRANSITION FOR (i, j, p, k)
Figure 9 shows the state transitions incoming to and outgoing from states (i, j, p, 0), (i, j, p, 1), and (i, j, p, 2). Tables 3, 4 and 5 describe the rates and conditions for states (i, j, p, 0), (i, j, p, 1), and (i, j, p, 2), respectively. We number the cases from 1 to 22.
C. EQUILIBRIUM STATES
Let P(i, j, p, k) be the probability that the subsystem is in state (i, j, p, k). Let V be the set of states (i, j, p, k), where i ∈ X, j ∈ X, and p ∈ Y(i, j). In the equilibrium state, the total incoming flows to state (i, j, p, k) are equal to the total outgoing flows from state (i, j, p, k). The equilibrium equations for (i, j, p, k) ∈ V are given by
$$(r_1 + r_2 + r_3 + r_4 + r_{10} + r_{11})\, P(i, j, p, 0) = \cdots \quad (11a)$$
$$(r_1 + r_2 + r_3 + r_4 + r_5 + r_6 + r_9)\, P(i, j, p, 1) = \cdots \quad (11b)$$
$$(r_1 + r_2 + r_3 + r_4 + r_7 + r_8 + r_9)\, P(i, j, p, 2) = \cdots \quad (11c)$$
where $r_c$, c ∈ [1, 22], equals the transfer rate of case c if the conditions of case c are satisfied and 0 otherwise. The condition that the sum of all state probabilities equals one is given by
$$\sum_{(i, j, p, k) \in V} P(i, j, p, k) = 1. \quad (12)$$
By considering the symmetric feature of states (i, j, p, 1) and (j, i, p, 2),
$$P(i, j, p, 1) = P(j, i, p, 2) \quad (13)$$
is satisfied. In (11a) and (12), P(j, i, p, 2) is substituted by P(i, j, p, 1) with (13). Then, (11c) can be omitted. The number of decision variables to be solved is reduced from 3Γ to 2Γ.
D. BLOCKING PROBABILITY AND AVERAGE WAITING TIME
The blocking probability $P_b^B$, which is the probability that a request incoming to the table lookup subsystem is blocked with i + j = K, that is, the request is not able to enter the queue, under the condition of ON state, is given by the following conditional probability.
Note that $\beta/(\alpha + \beta)$ is the probability of ON state. The average waiting time at the subsystem, $W^B$, is the average duration from when a request enters the subsystem until the request exits the subsystem. The average number of requests in the subsystem, $L^B$, is given by
The right-hand side of the second equality is derived by using
VI. EVALUATION
Based on the analytical results calculated with the model and its analyses shown in Sections IV and V, we observe performance dependency on each system parameter and arrival pattern of requests of the proposed architecture. The model can incorporate system parameters that reflect a real implementation.
A. NUMERICAL RESULTS FOR RANDOM ARRIVAL OF REQUESTS
We evaluate $P_b^R$, $\lambda_e^R$, and $W_e^R$ of the proposed architecture and investigate their dependency on $\rho^R$ and $\mu_2$ by using the analysis presented in Section IV. As a reference model, we use the M/M/S/K model. We set K = 100, and the arrival rate λ is the same for both models. We set S = 32 for both models unless otherwise stated. In M/M/S/K, the service rate is µ = 1, and, in the proposed architecture, $\mu_1 = \mu = 1$. Let $\rho^R$ be the traffic load, defined by $\rho^R = \lambda / (S\mu)$. The analytical results are obtained by using a computer with a 3.60 GHz Intel Core i7-7700 CPU and 32 GB of memory. In the case of S = 32 and K = 100, the average computation time to obtain $P_b^R$ for each set of $\mu_2$ and $\rho^R$ in the proposed architecture is 252 s, where the number of states in the analysis is 10607. Figure 10(a) shows the blocking probability dependency on $\rho^R$ with different $\mu_2$. In the proposed architecture, the blocking probability increases with $\rho^R$ and decreases as $\mu_2$ increases. The blocking probability of the proposed architecture with $\mu_2 = 0.5$ is close to, but slightly higher than, that of M/M/S/K. This is explained by the observation that, when two subsystems have the same value of the product of the number of servers and the service rate, the one with the larger service rate outperforms the other. In addition, 32 servers with service rate µ = 1 that serve all requests queued in the subsystem outperform 64 (= 32 × 2) servers with service rate $\mu_2 = 0.5$, each half of which serves only the requests queued in one of the two separate queues; the former has a greater statistical multiplexing effect than the latter. Figure 10(b) shows the blocking probability dependency on $\mu_2$ with $\rho^R = 1.2$. In the proposed architecture, the blocking probability decreases as $\mu_2$ increases. The blocking probability of the proposed architecture with $\mu_2 = 1$ is close to, but slightly higher than, that of M/M/S/K with S = 64.
This is explained by comparing 64 servers with service rate µ = 1 and 64 (= 32 × 2) servers with service rate $\mu_2 = 1$, as with the observation on Figure 10(a). Figure 11 shows the throughput dependency on $\rho^R$ with different $\mu_2$. The throughput in M/M/S/K increases with $\rho^R$ for $\rho^R \le 1$ and saturates for $\rho^R > 1$. On the other hand, the throughput of the proposed architecture saturates at a larger traffic load than that of M/M/S/K. The saturated throughput increases with $\mu_2$.
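For readers who wish to reproduce the reference curve, the blocking probability of the M/M/S/K model follows from the standard birth-death product form. The sketch below is a textbook calculation rather than the code used for the figures, and the function name `mmsk_blocking` is an assumption.

```python
def mmsk_blocking(lam, mu, S, K):
    """Blocking probability of an M/M/S/K queue (K counts servers plus waiting room)."""
    # Unnormalized birth-death probabilities: state k has k requests in the system.
    weights = [1.0]
    for k in range(1, K + 1):
        service = min(k, S) * mu  # at most S servers are busy
        weights.append(weights[-1] * lam / service)
    # Poisson arrivals see time averages, so blocking equals P(system is full).
    return weights[K] / sum(weights)


# Reference setting of this section: S = 32, K = 100, mu = 1, traffic load 1.2.
rho, S, K, mu = 1.2, 32, 100, 1.0
p_block = mmsk_blocking(rho * S * mu, mu, S, K)
```

With a traffic load above one and a long queue, the result stays close to 1 − 1/ρ, which is consistent with the throughput of M/M/S/K saturating once the load exceeds one.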
B. NUMERICAL RESULTS FOR BURSTY ARRIVAL OF REQUESTS
We evaluate $P_b^B$, $\lambda_e^B$, and $W_e^B$ of the proposed architecture for IPP and investigate their dependency on $\rho^B$ and $\mu_2$ by using the analysis presented in Section V. We compare the proposed architecture for IPP with that for a Poisson arrival process. We set K = 100 and S = 32 unless otherwise stated. Let $\rho^B$ be the traffic load, defined by $\rho^B = \lambda\beta / ((\alpha + \beta) S \mu)$, that is, the average arrival rate $\lambda\beta/(\alpha + \beta)$ divided by the total service rate $S\mu$. For performance comparison, we use the same $\rho^B$ for different models. The analytical results are obtained by using a computer with a 3.60 GHz Intel Core i7-7700 CPU and 32 GB of memory. In the case of S = 32 and K = 100, the average computation time to obtain $P_b^B$ for each set of $\mu_2$ and $\rho^B$ in the proposed architecture is 853 s, where the number of states in the analysis is 21214.
We introduce parameters h > 0 and l > 0, which characterize the ON and OFF behavior of the IPP. Figure 13(a) shows the blocking probability dependency on l and h with $\rho^B = 1.2$ for $\mu_2 = 0.7$. As h becomes large, the blocking probability increases. As l becomes large, the blocking probability decreases. We observe that, when l → ∞, the blocking probability of IPP approaches that of the Poisson arrival process presented in Section IV for any h. The lower h is, the faster the blocking probability of IPP approaches that of the Poisson arrival process. h → 0 indicates that the ON state probability is close to 1, and l → ∞ indicates that the ON state period is close to zero. Each situation is equivalent to the Poisson arrival process, which is a special case of IPP. Figure 13(b) shows the blocking probability dependency on $\rho^B$ and h with l = 1.0 for $\mu_2 = 0.7$. Figures 14 and 15 show the throughput and effective waiting time dependency on l and h with $\rho^B = 1.2$ for $\mu_2 = 0.7$, respectively.
C. PACKET PROCESSING PERFORMANCE
We can calculate the packet processing performance of the proposed architecture by using the numerical results shown in Sections VI-A and VI-B, assuming that current carrier-scale packet processing performance is bounded by memory access performance. In the following evaluation of packet processing performance, we focus on IP routing as an example of carrier-scale packet processing. By using an IP address lookup algorithm such as DIR-24-8-BASIC [26], packet processing of IP routing finishes within one or two memory accesses. Its lookup table inherently has $2^{32}$ entries, which correspond to the whole IPv4 address space. The content of each entry is the next hop information corresponding to the prefix of the entry. In detail, based on real traffic patterns, the lookup table is split into two tables called TBL24 and TBLlong, whose entries correspond to the upper 24 bits and the lower 8 bits of the address, respectively, so that the longest prefix match finishes within two memory accesses. Therefore, we can calculate the packet processing performance of IP routing by using the number of achievable memory accesses per unit of time as derived from the numerical results in Sections VI-A and VI-B, the number of memory accesses required by the IP address lookup per packet, and the additional delay of the hash-function-based distributor. The packet processing latency is obtained as the number of required memory accesses times the sum of the average effective waiting time and the additional delay of the distributor. We assume that the hash function works as a pipeline, so that it does not affect any performance metric of the proposed architecture other than the latency.
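The one-or-two-access property of DIR-24-8-BASIC can be illustrated with a short sketch. In the original algorithm TBL24 is a flat array of 2^24 16-bit entries whose top bit tells whether the remaining 15 bits hold a next hop or an index into TBLlong; the sketch below keeps that entry format but stores the tables in Python dictionaries to stay small. The helper names (`ip`, `lookup`, `DEFAULT_HOP`) and the example routes are hypothetical.

```python
DEFAULT_HOP = 0  # hypothetical next hop for addresses without a matching route


def ip(a, b, c, d):
    """Pack a dotted-quad IPv4 address into a 32-bit integer."""
    return (a << 24) | (b << 16) | (c << 8) | d


def lookup(tbl24, tbllong, addr):
    """Return (next_hop, memory_accesses) for a 32-bit IPv4 address."""
    entry = tbl24.get(addr >> 8, DEFAULT_HOP)  # first access: upper 24 bits
    if entry & 0x8000 == 0:                    # flag clear: entry is the next hop
        return entry, 1
    block = entry & 0x7FFF                     # flag set: index of a 256-entry block
    return tbllong[block * 256 + (addr & 0xFF)], 2  # second access: lower 8 bits


tbl24, tbllong = {}, {}
# Route 10.1.2.0/24 -> next hop 5: fits entirely in TBL24.
tbl24[ip(10, 1, 2, 0) >> 8] = 5
# Routes 10.1.3.0/24 -> next hop 7 and 10.1.3.128/25 -> next hop 9:
# the longer prefix forces a TBLlong block (block 0 here).
tbl24[ip(10, 1, 3, 0) >> 8] = 0x8000 | 0
for low in range(256):
    tbllong[0 * 256 + low] = 9 if low >= 128 else 7

short_prefix = lookup(tbl24, tbllong, ip(10, 1, 2, 77))
long_prefix = lookup(tbl24, tbllong, ip(10, 1, 3, 200))
```

Only addresses covered by a prefix longer than 24 bits need the second access, which is why the evaluation in this section treats two accesses per lookup as the upper bound.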
In NFV-aware carrier network systems, multiple types of carrier-scale packet processing applications run in the same system simultaneously. The DIR-24-8-BASIC algorithm requires 33 MB of memory for its routing table, which corresponds to the whole IPv4 address space [26]. Therefore, cache memories inside the latest generation of CPUs, such as Intel Skylake, cannot accommodate the large tables for multiple carrier-scale packet processing applications.
For example, let service rate µ be 8 M services per second, which is an estimate since typical latency inside the HMC itself is usually taken to be between 100-180 ns with average of 125 ns [27] , [28] , and let the traffic load be 1.2. Table 6 lists the calculation results of M/M/S/K, proposed architecture with Poisson arrival, and that with IPP arrival. In this example, we assume the service rate of interleaved bank µ 2 = 0.7µ, l = 0.1, h = 0.5, and the additional delay of the distributor is 10 ns which corresponds to one clock cycle at 100 MHz circuits in the FPGA. We also assume that each IP address lookup requires two memory accesses each of which is associated with a request distribution based on hash function.
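The arithmetic behind this example can be sketched as follows. The service rate follows from the 125 ns average latency, and the latency formula is the one stated in Section VI-C. The number of vaults S = 32 is taken from Section VI-A, while the effective request rate, effective waiting time, and packet size passed to the helper functions below are placeholder values chosen only to exercise the formulas; they are not results from Table 6, and all function names are hypothetical.

```python
def offered_request_rate(rho, S, mu):
    """Offered memory request rate from the traffic load rho = lambda / (S * mu)."""
    return rho * S * mu


def packet_latency(accesses, effective_wait, distributor_delay):
    """Latency = accesses x (average effective waiting time + distributor delay)."""
    return accesses * (effective_wait + distributor_delay)


def throughput_gbps(effective_request_rate, accesses, packet_bytes):
    """Packet throughput in Gbps when each packet needs `accesses` requests."""
    packets_per_second = effective_request_rate / accesses
    return packets_per_second * packet_bytes * 8 / 1e9


mu = 1 / 125e-9                          # about 8 M services per second
lam = offered_request_rate(1.2, 32, mu)  # offered requests per second at load 1.2

# Placeholder inputs (assumptions, not reported results):
example_latency = packet_latency(2, 150e-9, 10e-9)  # two accesses per lookup
example_rate = throughput_gbps(200e6, 2, 64)        # 64-byte packets assumed
```

The effective request rate that enters `throughput_gbps` is the throughput $\lambda(1 - P_b)$ obtained from the model, so the blocking probability directly scales the achievable packet rate.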
In the proposed architecture, memory access requests are served simultaneously by using multiple vaults of HMC.
This may change the order of egress packets from the processor, which affects the performance of upper layers such as Transmission Control Protocol (TCP) [29]. To eliminate misordered packets, there are several approaches: exchanging signals among multiple processes or threads so that every packet can be served in order, or buffering the packets and sorting them before they are transmitted from the processor [30], [31].
VII. RELATED WORK
Several software-based packet processing schemes for COTS server implementation have been proposed [6] - [8], [32]. RouteBricks [6] is the first software-based router application to leverage the parallel processing offered by modern multi-core CPUs. Lagopus [7] is a DPDK-enabled OpenFlow switch that can achieve over 10 Gbps performance with more than 1 M flow entries without any hardware modification. These approaches significantly improved packet processing performance compared to previous software schemes running on single-core CPUs and DPDK. However, their performance directly depends on the high-speed cache memory of the CPUs, which unfortunately is too small to support carrier-scale packet processing given the multiple huge tables involved. PacketShader [32] consolidates parallel processing by using a Graphics Processing Unit (GPU) to achieve nearly 40 Gbps packet processing performance. However, this work makes the constraining assumption of homogeneous packet processing in order to leverage the GPU's Single Instruction Multiple Data (SIMD) performance, and so does not suit carrier-scale packet processing. Poptrie [8] is the latest and fastest software IP routing table lookup; it offers over 200 M lookups per second with just a single CPU core. The IP address lookup performance itself is sufficient for carrier-scale packet processing. However, this software also depends on the small cache memories inside the CPU.
A packet matching system using HMC was studied in [33]. However, that work does not discuss leveraging the vault-level parallelism and bank interleaving of HMC, since its main target is implementing a fast packet matching circuit in an FPGA. There is a study that utilizes 3D-stacked DRAM devices, including HMC, as the main memory of a system [34]. That work mainly details the production process of 3D-stacked DRAM devices and does not evaluate system performance. CasHMC [35] is a cycle-accurate simulator for HMC. This simulator does not consider bank interleaving, and its simulation results are valid only for current HMC devices.
The thermal feasibility of a system with 3D-stacked DRAM is studied in [36] - [38] . They consider thermal feasibility of Processing in Memory (PIM). PIM uses logic layer functionality more aggressively, which produces more heat and requires stronger cooling systems. They conclude that PIM use cases with 3D-stacked DRAM are feasible if the system has high-end active cooling. Therefore, the proposed architecture is more feasible since it uses HMC for simple memory access, and so can use the commodity coolers of COTS systems.
Dividing an original table into several partial tables is studied in [39] - [41] . They divide a table to make table data more memory efficient; the goal is to make the table fit inside a device with fixed memory capacity such as a TCAM, on-chip memories in FPGA, and external SRAM.
Table update schemes are discussed in the works that introduce the corresponding table data structures, such as [8], [26], [39]. In our proposed architecture, the HMC has multiple vaults to accommodate whole tables. Tables inside the HMC are updated sequentially, vault by vault, by using the table update scheme corresponding to the table data structure.
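A minimal sketch of a table partitioned across vaults and updated one vault at a time is given below. The vault and bank counts, the use of CRC32 as the hash function, and all names are assumptions made for illustration; the per-vault update here is a plain overwrite standing in for whatever update scheme the chosen table data structure prescribes.

```python
import zlib

NUM_VAULTS = 32  # hypothetical vault count
NUM_BANKS = 16   # hypothetical banks per vault


def place(key):
    """Map a lookup key (bytes) to a (vault, bank) pair with a hash function."""
    h = zlib.crc32(key)  # CRC32 is a stand-in for the distributor's hash
    return h % NUM_VAULTS, (h // NUM_VAULTS) % NUM_BANKS


class PartitionedTable:
    """Each vault holds the portion of the table whose keys hash to it."""

    def __init__(self):
        self.vaults = [dict() for _ in range(NUM_VAULTS)]

    def apply_updates(self, updates):
        # Group the updates by destination vault first ...
        per_vault = [[] for _ in range(NUM_VAULTS)]
        for key, value in updates:
            per_vault[place(key)[0]].append((key, value))
        # ... then update the vaults sequentially, one after another.
        for vault_id, batch in enumerate(per_vault):
            for key, value in batch:
                self.vaults[vault_id][key] = value

    def lookup(self, key):
        return self.vaults[place(key)[0]].get(key)
```

Because only one vault is being modified at any moment in this sketch, lookups directed to the other vaults are not blocked by the update.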
VIII. DIRECTION TO EXPANSION OF ANALYTICAL MODEL FOR GENERAL N ≥ 2
We consider the analytical model for a system with general $N \geq 2$. A state in the system is expressed by a vector that consists of the following components. First, $i_n \in [0, K]$ is the number of requests for bank $n \in [1, N]$. Second, $p_{i_n i_{n'}} \in [0, S]$, where $n, n' \in [1, N]$ and $n \neq n'$, is the number of requests being served with 2-interleaving for banks $n$ and $n'$. The number of components $p_{i_n i_{n'}}$ is ${}_N C_2$, where ${}_n C_m = \frac{n!}{(n-m)!\,m!}$. Third, $p_{nn'n''} \in [0, S]$, where $n, n', n'' \in [1, N]$, $n \neq n'$, $n' \neq n''$, and $n'' \neq n$, is the number of requests being served with 3-interleaving for banks $n$, $n'$, and $n''$. The number of components $p_{nn'n''}$ is ${}_N C_3$. In the same way, we define $p_{nn'n''n'''\cdots}$, which is the number of requests being served with $(N-1)$-interleaving. The number of components $p_{nn'n''n'''\cdots}$ for $(N-1)$-interleaving is ${}_N C_{N-1} = N$. Finally, $p_{123\cdots N} \in [0, S]$ is the number of requests being served with $N$-interleaving for all banks.
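To get a feel for how quickly the state vector grows with N, the components can be enumerated directly: one queue-length component per bank, plus one interleaving component per subset of m banks for every m from 2 to N. The function names below are hypothetical, and the sketch only counts components; it does not build the transition structure of the model.

```python
from itertools import combinations


def interleaving_components(num_banks):
    """Map each interleaving degree m (2..N) to its list of bank subsets."""
    banks = range(1, num_banks + 1)
    return {
        m: list(combinations(banks, m))
        for m in range(2, num_banks + 1)
    }


def state_vector_length(num_banks):
    """N per-bank request counts plus one count per interleaved bank subset."""
    subsets = interleaving_components(num_banks)
    return num_banks + sum(len(s) for s in subsets.values())
```

The number of subsets of size m is the binomial coefficient of N and m, so the vector has 2^N - 1 components in total, for example 7 for N = 3 and 31 for N = 5. This exponential growth is the main obstacle to solving the general model exactly.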
IX. CONCLUSION
We proposed an architecture that allows an HMC to support carrier-scale packet processing. The proposed architecture enhances memory access concurrency by leveraging the vault-level parallelism and bank interleaving offered by HMC. The architecture uses a hash-function-based distributor that spreads memory requests among the sets of vault and bank, each of which accommodates a portion of the original huge (carrier-scale) tables. We introduced an analytical model of a bank-interleaved HMC subsystem for two traffic patterns in which the arrivals of memory requests are either random or bursty. The analytical results for random arrival of requests showed the performance of the proposed architecture and its dependency on traffic load and bank interleaving. The analytical results for bursty arrival of requests detailed the performance dependency on the burstiness of the input traffic. The evaluation of packet processing performance showed that our proposed architecture achieves around 80 Gbps, even if request arrivals are bursty, in carrier-scale packet processing wherein main memory accesses are inevitable because CPU cache memory is too small to accommodate the huge tables.
