FPGA emulation is a promising approach to accelerating Network-on-Chip (NoC) modeling, which has traditionally relied on software simulators. In most early studies of FPGA-based NoC emulators, only synthetic workloads like uniform and bit permutations were considered. Although a set of carefully designed synthetic workloads can reveal a relatively thorough coverage of the characteristics of the NoC under evaluation, they alone are insufficient, especially when the NoC needs to be optimized for specific applications. In such cases, trace-driven workloads are effective. However, there is a problem with conventional trace-driven workloads that has been pointed out by some recent studies: the network load and congestion may be distorted because dependencies between packets are not considered. These studies also provide infrastructures for extending existing software simulators to enforce dependencies between packets. Unfortunately, enforcing dependencies between packets is not trivial in the FPGA emulation approach. Therefore, although there are some recent FPGA-based NoC emulators supporting trace-driven workloads, most of them ignore packet dependencies. In this paper, we first clarify the challenges of supporting trace-driven workloads with dependencies between packets taken into account in the FPGA emulation approach. We then propose efficient methods and architectures to tackle these challenges and build an FPGA-based NoC emulator, which we call DNoC, based on the proposals.
Our evaluation results show that (1) on a VC707 FPGA board, DNoC achieves an average speed of 10,753K cycles/s when emulating an 8×8 NoC with trace data collected from full-system simulation of the PARSEC benchmark suite, which is 274× higher than the speed reported in a recent related work on dependency-driven trace-based NoC emulation on FPGAs; (2) compared to BookSim, one of the most popular NoC simulators, DNoC is 395× faster while providing the same results; (3) DNoC can scale to a 4,096-node NoC on a VC707 board, and the size of the largest NoC depends on only the on-chip memory capacity of the target FPGA.
INTRODUCTION
Networks-on-Chip (NoCs) are becoming increasingly important elements in different types of computing hardware platforms, from many-core processors for supercomputers and datacenters [22, 28] to MultiProcessor Systems-on-Chip (MPSoCs) for embedded applications [1, 11]. They are also integral parts of many emerging hardware accelerators for critical applications such as deep neural networks and databases [5, 31]. In such a hardware platform, the NoC is responsible for connecting the other components together and thus has a significant impact on the overall performance. State-of-the-art studies have shown that, as the number of components that need to be interconnected increases, the overall performance becomes highly sensitive to the NoC performance. Therefore, research and development of NoCs play a key role in designing future architectures with a large number of interconnected components.
As in computer architecture research in general, simulation is the de facto evaluation method in NoC design exploration. There are two types of workloads commonly used in NoC simulation: synthetic workloads and trace-driven workloads. Synthetic workloads are those based on mathematical modeling of common traffic patterns in real applications. Although a set of carefully designed synthetic workloads can reveal a relatively thorough coverage of the characteristics of the NoC under evaluation, they alone are insufficient, especially when the NoC needs to be optimized for specific applications. In such cases, trace-driven workloads that are created based on trace data captured from either a working system or a full-system simulation are effective. As NoCs are now key elements of many application/domain-specific MPSoCs and hardware accelerators, evaluation on trace-driven workloads becomes even more important. However, there is a problem with conventional trace-driven workloads that has been pointed out by some recent studies [13, 20, 24]: the network load and congestion may be distorted because dependencies between packets are not considered. Therefore, to enhance the effectiveness of trace-driven workloads, it is necessary to enforce dependencies between packets. This is the focus of our work in this paper.
NoC designers have mainly relied on software simulators like BookSim [15] and Noxim [4]. Unfortunately, these simulators are known to be very slow, especially for large-scale NoCs. This hinders NoC designers from making good design decisions. For example, in [7], the authors report that the simulation speed of BookSim on a Core i7 4770 machine is reduced from 40K-80K cycles/s to just 60-180 cycles/s when the size of the target NoC is increased from 8×8 to 64×64. At these speeds, for example, running a simulation of one billion cycles would take around 3.5-7 hours for the 8×8 NoC and 2-6 months for the 64×64 NoC, which is not practical. Parallelizing a NoC simulator for multicore CPUs or GPUs is difficult because it would require frequent communication and synchronization between threads [26].
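The running-time estimates above follow directly from dividing the number of simulated cycles by the simulation speed. The following minimal sketch reproduces that arithmetic; the function and constant names are illustrative and do not come from any simulator.

```python
# Sketch: wall-clock time needed to simulate one billion cycles at the
# BookSim speeds quoted from [7]. Names are illustrative only.

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
ONE_BILLION_CYCLES = 1_000_000_000


def simulation_time_seconds(total_cycles: int, cycles_per_second: float) -> float:
    """Time in seconds to simulate total_cycles at a constant speed."""
    return total_cycles / cycles_per_second


# 8x8 NoC: 40K-80K cycles/s
hours_8x8_best = simulation_time_seconds(ONE_BILLION_CYCLES, 80_000) / SECONDS_PER_HOUR
hours_8x8_worst = simulation_time_seconds(ONE_BILLION_CYCLES, 40_000) / SECONDS_PER_HOUR

# 64x64 NoC: 60-180 cycles/s
days_64x64_best = simulation_time_seconds(ONE_BILLION_CYCLES, 180) / SECONDS_PER_DAY
days_64x64_worst = simulation_time_seconds(ONE_BILLION_CYCLES, 60) / SECONDS_PER_DAY

print(f"8x8:   {hours_8x8_best:.1f}-{hours_8x8_worst:.1f} hours")
print(f"64x64: {days_64x64_best:.0f}-{days_64x64_worst:.0f} days")
```

The 64×64 range of roughly 64 to 193 days corresponds to the 2-6 months stated in the text.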
FPGA emulation is becoming a promising approach to accelerating NoC simulation. We classify existing FPGA-based NoC emulators into two models: the hybrid model and the pure hardware model.
Hybrid model. FPGA-based NoC emulators built on this model consist of two parts: hardware and software. The hardware part contains a network of routers and is implemented using FPGA programmable logic. The software part is responsible for generating traffic workloads and controlling the emulation. It is implemented on a host PC or on one or several hard/soft processors on the FPGA board. Previous studies [12, 16, 17, 21, 25, 30, 32] adopt this model.
Pure hardware model. FPGA-based NoC emulators built on this model have all of their components implemented using FPGA programmable logic. Previous studies [7-9, 19, 29] adopt this model.
Although the hybrid model is much easier to implement and may give more flexibility, the achievable emulation speed is severely limited by the processing bottleneck of the software side and the communication overhead between the software side and the hardware side. For example, DuCNoC [16], a hybrid FPGA-based NoC emulator, achieves a speed of 200K-375K cycles/s when emulating a 5×5 NoC under a synthetic workload, which is only several times faster than the speed of BookSim for a larger (8×8) NoC on a Core i7 4770 machine [7]. The evaluation data reported in [16] also indicate that the emulation speed of DuCNoC decreases dramatically as the NoC size increases, at an even faster pace than that of BookSim. Therefore, we focus on the pure hardware model.
To the best of our knowledge, there has been no NoC emulator that is built on the pure hardware model and supports trace-driven workloads with dependencies between packets taken into account. In most early studies of FPGA-based NoC emulators, only synthetic workloads were considered. Some recent studies consider trace-driven workloads but most of them ignore packet dependencies. We found only one study by Drewes et al. [12] in which packet dependencies were considered. This study adopts the hybrid model where the packet dependency management is implemented in software. As a result, it achieves a low emulation speed, only 39K cycles/s for an 8×8 NoC with the PARSEC traces [2, 13]. This is only about 1.5× higher than the speed of BookSim that we measured. In contrast, we show through our experiments that a 395× speedup over BookSim is achieved by adopting the pure hardware model.
In this paper, we make the following key contributions.
1) We clarify the challenges of supporting trace-driven workloads with dependencies between packets taken into account in the pure hardware model.
2) We propose efficient methods and architectures to solve the challenges pointed out in 1).
3) Using the proposals, we build a NoC emulator called DNoC on a VC707 FPGA board. Our evaluation shows that DNoC achieves an average speed of 10,753K cycles/s when emulating an 8×8 NoC with the PARSEC traces [2, 13]. This speed is 274× higher than that reported by Drewes et al. [12] and 395× higher than that of BookSim running on a Core i9 9900K machine. We also show that DNoC can scale to NoCs with thousands of nodes and that the scalability depends on only the on-chip memory capacity of the target FPGA.
Table 1 (flattened in extraction; recoverable entries only): [32]: Hybrid, No, 64×48, N/A. Chu et al. [7]: Pure HW, No, 128×64, 16,250K (a). DuCNoC [16]: Hybrid. [17]: Hybrid, No, 32×32, 30K-200K (c). DART trace [29]: Hybrid, No, 9×9, N/A. Papamichael [25]: Hybrid, No, 4×4×4×4, N/A. AcENoCs [21]: Hybrid.
RELATED WORK
Various FPGA-based NoC emulators have been proposed in the literature [7-9, 12, 16, 17, 19, 21, 25, 29, 30, 32]. Here we focus on those that support emulation with trace-driven workloads. Table 1 highlights the differences between these emulators and DNoC, our proposal. Note that the emulation speeds under trace-driven workloads of some emulators are not available. In these cases, for reference, we include in the table the emulation speeds under synthetic workloads (where available). Also, the comparison of emulation speeds is not strictly quantitative because the emulators are implemented on different FPGA boards and their modeled router architectures are not the same. As shown in Table 1, the NoC emulators built on the hybrid model are significantly slower than those built on the pure hardware model. AcENoCs [21] and AdapNoC [17] use two MicroBlaze soft processors to overlap the trace load operation with other operations, thereby improving the emulation speed. DuCNoC [16] and the emulator proposed by Drewes et al. [12] take a similar approach but use two hard processors on a SoC FPGA. Despite this, the emulation speeds of these emulators do not exceed several hundred kilocycles per second. These speeds are only slightly higher than that of BookSim that we measured. Moreover, the data reported in the DuCNoC paper indicate that the speed of a NoC emulator built on the hybrid model decreases at an even faster pace than that of BookSim when the size of the target NoC is increased. This is unacceptable since reducing simulation time is the primary reason that the FPGA emulation approach is used. Therefore, we adopt the pure hardware model.
As mentioned in Section 1, most existing FPGA-based NoC emulators that support trace-driven workloads ignore dependencies between packets. To the best of our knowledge, the study by Drewes et al. is the only one that considers packet dependencies. However, this study adopts the hybrid model and thus achieves a very low emulation speed of only 7K-83K cycles/s. In contrast, our NoC emulator is based on the pure hardware model and, as will be shown in the evaluation section, can achieve a much higher speed of 8,312K-12,979K cycles/s. We can also scale to much larger NoCs, up to 4,096 nodes on a VC707 FPGA board.
The importance of taking into account dependencies between packets when performing trace-driven simulation has been pointed out by Hestness et al. [13], Nitta et al. [24], and Liu et al. [20]. They show that a trace-driven simulation without considering packet dependencies can lead to misleading results due to the distorted network load and congestion, and that when packet dependencies are taken into account, the simulation accuracy is substantially improved. Hestness et al. and Liu et al. also provide their collected traces and infrastructures for extending existing software simulators to enforce packet dependencies. In this paper, we use Hestness et al.'s traces that are captured from a full-system simulation of the PARSEC benchmark suite [2] on the M5 simulator [3]. We utilize the provided infrastructure to convert these traces to the format used by our FPGA-based NoC emulator.
PROBLEM FORMULATION
First, we define some technical terms used in this paper. For convenience of explanation, let us assume that there are two packets A and C. Packet C is dependent (or depends) on packet A if the receipt of packet A must occur before the sending of packet C; the dependency between A and C is cleared when A arrives at its destination. There is an important note here: the dependence of packet C on packet A does not mean that the destination node of A and the source node of C are the same. In general, a packet generated at node N (source node = N) may depend on packets with destination nodes different from N.
Throughout the paper, we use the following three terms to refer to three types of packets:
• i-packet: a packet that does not depend on any others.
• d-packet: a packet that depends on some others.
• f-packet: a d-packet with all dependencies cleared.
A d-packet, say C, cannot be injected into the network until all of the packets that C depends on arrive at their destinations, that is, C becomes an f-packet.
In a system, dependencies between packets arise as a result of multiple causes: the interaction between architectural components such as cores, caches, and memory controllers; microarchitectural implementation details such as the cache coherence protocol; and program behavior. More in-depth explanations can be found in Hestness et al. [13].
Dependencies between packets can be illustrated using directed acyclic graphs where each vertex represents a packet and each edge represents a dependence relation between two packets. In the example in Figure 1 (a), A and B are i-packets while C, D, E, F, G, and H are d-packets. In general, a packet may depend on multiple other packets and have multiple dependent packets. For example, in Figure 1 (a), D depends on two packets A and B and has two dependent packets G and H. In the PARSEC traces [2, 13] that we use in this paper, a packet may depend on four other packets and have up to 255 dependent packets.
Next, we describe the mechanism of injecting packets into the network. Each packet has a timestamp indicating the first emulation cycle at which this packet can be injected. Let t be the current emulation cycle and T_i be the timestamp of packet i. In the example in Figure 1 (a), packets A and B are i-packets and thus can be injected when t ≥ T_A and t ≥ T_B, respectively. However, this is not the case for the d-packets. For example, the injection of packet D depends not only on T_D but also on when packets A and B arrive at their destinations. Let t_A and t_B be the emulation cycles at which A and B arrive at their destinations. So, the dependency between A and D is cleared at t_A and the dependency between B and D is cleared at t_B. D can be injected when
t ≥ max(T_D, max(t_A, t_B) + l)    (1)
where l is a latency representing the processing time needed to prepare for the injection. For simplicity, we use a fixed l in this paper. In our experiments in Section 6, l is set to 8 cycles.
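The injection rule can be sketched in a few lines of code. The sketch below assumes that a d-packet becomes injectable at the later of its recorded timestamp and the cycle at which its last dependency is cleared plus the latency l; this reading is inferred from the surrounding definitions, and the function name is hypothetical.

```python
from typing import Iterable


def earliest_injection_cycle(timestamp: int,
                             arrival_cycles: Iterable[int],
                             latency: int = 8) -> int:
    """Earliest emulation cycle at which a packet may be injected.

    timestamp      -- recorded timestamp T of the packet
    arrival_cycles -- arrival cycles of the packets it depends on
                      (empty for an i-packet)
    latency        -- fixed preparation latency l (8 cycles in Section 6)

    Assumed rule: t >= max(T, max(arrivals) + l).
    """
    arrivals = list(arrival_cycles)
    if not arrivals:
        # An i-packet is constrained only by its recorded timestamp.
        return timestamp
    return max(timestamp, max(arrivals) + latency)


# Packet D with T_D = 100, depending on A (arrives at 90) and B (arrives at 120).
print(earliest_injection_cycle(100, [90, 120]))  # 128
```

When the dependencies are cleared well before the recorded timestamp, the timestamp dominates and the packet is injected as in a conventional trace-driven emulation.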
In this paper, we propose efficient methods and architectures to build an FPGA-based NoC emulator that satisfies the specifications described above.
CHALLENGES OF ADOPTING THE PURE HARDWARE MODEL AND OUR APPROACH
Since a trace is often much larger than the FPGA on-chip memory capacity, it must be stored in off-chip memory. In this paper, we assume that traces are stored in DRAM. We define a trace as a set of packet descriptors. Each packet descriptor encodes a message sent from one node to another in the system from which the trace was recorded. During an emulation, a packet is generated based on information encoded in its descriptor.
The first challenge of adopting the pure hardware model is how to track dependencies between packets. A natural approach is to keep the list of the IDs of all packets that a packet depends on in its descriptor. For example, in Figure 1 (a), since packet D depends on two packets A and B, the descriptor of D contains the IDs of A and B along with two dependency status bits indicating whether the dependencies between D and A and between D and B have been cleared. When a packet, say i, arrives at its destination, the dependency manager searches the trace for descriptors of all packets that depend on i and updates the dependency status bits to mark that the dependencies between i and these packets have been cleared.
The above approach has two major problems. First, the packet descriptor size is not fixed since each packet depends on a different number of packets. This makes the hardware design difficult. Second, searching the trace is a time-consuming operation in which all packet descriptors need to be examined.
Our approach to tracking dependencies between packets is as follows. For each packet i, we maintain a dependency list that consists of the IDs of all packets depending on i and a dependency counter for tracking the number of dependencies of i that have not yet been cleared. Figure 1 (b) shows how the dependency counters change during the emulation. A and B do not depend on any other packets and thus their dependency counters are always 0. The dependency counters of C, D, E, F, G, and H are initialized with 1, 2, 2, 1, 3, and 3, respectively. When a packet arrives at its destination, first, the dependency list of this packet is checked; then, the dependency counters of all packets in this list are decremented. For example, when A arrives at its destination, since the dependency list of A is {C, D, E}, the dependency counters of C, D, and E are decremented. A d-packet becomes an f-packet when its dependency counter reaches 0.
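The counter-based bookkeeping can be illustrated with a small software model. Since Figure 1 is not reproduced here, the graph below is a hypothetical one chosen only to match the facts stated in the text (the dependency list of A, the dependents of D, and the initial counter values); the function names are likewise illustrative.

```python
from typing import Dict, List

# Hypothetical dependency lists: packet -> packets that depend on it.
DEPENDENCY_LIST: Dict[str, List[str]] = {
    "A": ["C", "D", "E"],
    "B": ["D", "E", "F"],
    "C": ["G"],
    "D": ["G", "H"],
    "E": ["G", "H"],
    "F": ["H"],
    "G": [],
    "H": [],
}


def initial_counters(dep_list: Dict[str, List[str]]) -> Dict[str, int]:
    """Dependency counter of a packet = number of packets it depends on."""
    counters = {packet: 0 for packet in dep_list}
    for dependents in dep_list.values():
        for dependent in dependents:
            counters[dependent] += 1
    return counters


def on_arrival(packet: str,
               dep_list: Dict[str, List[str]],
               counters: Dict[str, int]) -> List[str]:
    """Decrement the counters of all dependents; return new f-packets."""
    freed = []
    for dependent in dep_list[packet]:
        counters[dependent] -= 1
        if counters[dependent] == 0:
            freed.append(dependent)
    return freed


counters = initial_counters(DEPENDENCY_LIST)
print(counters)                                      # C=1, D=2, E=2, F=1, G=3, H=3
print(on_arrival("A", DEPENDENCY_LIST, counters))    # ['C']
print(on_arrival("B", DEPENDENCY_LIST, counters))    # ['D', 'E', 'F']
```

No search over the trace is needed: each arrival touches only the counters of the packets named in its own dependency list.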
To make the packet descriptor size fixed, if a packet has two or more dependent packets, we include in its descriptor only the ID of one dependent packet and an address from which the IDs of the remaining dependent packets, which are stored in a different DRAM region, can be loaded. In the example in Figure 1 (a), packet A has three dependent packets C, D, and E. Assume that the ID of C is contained within A's descriptor and the IDs of D and E are stored in a different DRAM region. During the emulation, packet A will carry three pieces of information: the number of dependent packets (3 in this case), the ID of C, and the address to load the IDs of the remaining dependent packets (D and E) from DRAM when A arrives at its destination. In this way, not only is the packet descriptor size fixed, but the size of a packet is also independent of the number of dependent packets that it has. This helps to reduce the hardware overhead of implementing the NoC emulator. However, the emulation performance may be degraded substantially if it is frequently necessary to load packet IDs from DRAM.
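The fixed-size encoding can be sketched as follows. This is a software illustration only: the record layout, the use of a Python list to stand in for the separate DRAM region, and all names are assumptions rather than the actual descriptor format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DependentInfo:
    """Fixed set of fields carried by a packet, whatever its fan-out."""
    num_dependents: int
    first_dependent_id: Optional[int]   # ID kept inside the descriptor
    overflow_address: Optional[int]     # where the remaining IDs start


def pack_dependents(dependent_ids: List[int],
                    overflow_region: List[int]) -> DependentInfo:
    """Keep one ID inline and spill the others to the overflow region."""
    if not dependent_ids:
        return DependentInfo(0, None, None)
    if len(dependent_ids) == 1:
        # The common case reported for the PARSEC traces: no spill needed.
        return DependentInfo(1, dependent_ids[0], None)
    address = len(overflow_region)
    overflow_region.extend(dependent_ids[1:])
    return DependentInfo(len(dependent_ids), dependent_ids[0], address)


overflow: List[int] = []
info_a = pack_dependents([10, 11, 12], overflow)   # e.g. A -> C, D, E
info_x = pack_dependents([7], overflow)            # single dependent
print(info_a, info_x, overflow)
```

Only packets with two or more dependents consume space in the overflow region, so the extra DRAM reads are limited to that minority of packets.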
Fortunately, we find that in most cases, the vast majority of packets have no more than one dependent packet. Figure 2 shows the statistics of the PARSEC traces that we use in this paper. Although a packet may have up to 255 dependent packets as mentioned in Section 3, 63%-93% of packets have no more than one dependent packet. Also, in every application, less than 0.04% of packets have 10 or more dependent packets. Our design is based on this observation. It balances between keeping small packet sizes, that is, a low requirement for FPGA on-chip memory, and reducing the number of DRAM accesses.
Figure 2: Breakdown of packets by the number of dependent packets in the PARSEC traces [2, 13] (categories: packets with zero, one, and two or more dependent packets; axis from 0% to 100%).
We now describe our idea for updating dependency counters with low time overhead. In our design, dependency counters are contained within packet descriptors. As described above, when packet i arrives at its destination, if the ID of packet j is in the dependency list of i, then the dependency counter of j will be decremented. Thus, we need to perform the operation of searching for the corresponding packet descriptor with a given packet ID. A simple way to achieve this is to include packet IDs in packet descriptors and compare the given packet ID with the packet ID stored in each of the packet descriptors. However, in this way, we have to look at all packet descriptors in the worst case, and thus the time complexity of the search operation is O(n). Our idea is to design packet IDs so that they can be used as addresses to access packet descriptors in DRAM, thereby reducing the time complexity of the search operation from O(n) to O(1). The proposed packet ID structure will be described below in this section.
The second challenge of adopting the pure hardware model is how to inject packets into the network. For conventional traces, at each node, packets are injected according to their recorded timestamps that are predetermined and do not change during the emulation. Specifically, if the timestamp of packet A is smaller than that of packet B, then packet A will be injected before packet B regardless of how the network behaves. However, this is not true when we consider dependencies between packets.
Out-of-order injections occur in two cases. First, an i-packet, say i, with recorded timestamp T_i will be injected earlier than a d-packet, say j, with recorded timestamp T_j (T_i > T_j) if the current emulation cycle t is larger than or equal to T_i and j has not yet become an f-packet. Second, let k and l be two d-packets with recorded timestamps T_k and T_l (T_k > T_l); k will be injected earlier than l if k becomes an f-packet earlier. These out-of-order injections make the hardware design difficult. At each node, we cannot simply use an injection FIFO and load packet descriptors from DRAM to this FIFO in the recorded timestamp order as in conventional trace-driven emulation. We have to track when a d-packet becomes an f-packet and perform the injection of this f-packet in a proper order with respect to the timestamps of the i-packets.
Figure 3 shows an overview of our approach to the out-of-order injection problem. For simplicity, here we focus on only one node of the network (node 0). To prevent d-packets from blocking i-packets, we store their descriptors in two different DRAM regions. Similarly, to prevent a d-packet with dependencies not yet cleared from blocking f-packets, we store their descriptors in two different DRAM regions. Since we have another DRAM region for storing packet IDs as described before, there are a total of four DRAM regions in our design. They are summarized below.
The first DRAM region stores descriptors of i-packets. Packet descriptors with the same source address are stored contiguously from a fixed DRAM address and in the order of the recorded timestamp.
The second DRAM region stores descriptors of d-packets. As in the first region, packet descriptors with the same source address are stored contiguously from a fixed DRAM address and in the order of the recorded timestamp.
The third DRAM region stores IDs of d-packets that are used to track dependencies between packets. Figure 4 shows our proposed packet ID structure consisting of two parts: source address and sequence number. For example, if i is a packet generated at node 9 and the packet descriptor of i is the 100th packet descriptor with source address = 9 stored in the second DRAM region, then the ID of i will be {9, 100}. This design has two major advantages. First, packet IDs can be sent to where they will be processed without any additional information. Second, we can use packet IDs as addresses to load packet descriptors from the second DRAM region.
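Using a packet ID as an address amounts to a constant-time address computation. The sketch below assumes fixed-size descriptors and an equal-sized slot per source node in the second DRAM region; the base address, descriptor size, per-node capacity, and the treatment of the sequence number as an index are all hypothetical values chosen for illustration.

```python
# Hypothetical layout parameters (not taken from the actual design).
REGION2_BASE = 0x4000_0000          # start of the second DRAM region
DESCRIPTOR_BYTES = 16               # size of one packet descriptor
MAX_D_PACKETS_PER_NODE = 1 << 20    # descriptors reserved per source node


def descriptor_address(source: int, sequence: int) -> int:
    """Map a packet ID {source, sequence} to a DRAM address in O(1)."""
    if not 0 <= sequence < MAX_D_PACKETS_PER_NODE:
        raise ValueError("sequence number out of range")
    # Each source node owns a contiguous slot starting at a fixed address.
    node_base = REGION2_BASE + source * MAX_D_PACKETS_PER_NODE * DESCRIPTOR_BYTES
    return node_base + sequence * DESCRIPTOR_BYTES


# Packet ID {9, 100} from the example above.
print(hex(descriptor_address(9, 100)))
```

Because no comparison against stored IDs is involved, the cost of locating a descriptor does not grow with the trace size.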
The fourth DRAM region stores descriptors of f-packets. Unlike the previous three regions, this region is empty at initialization.
During an emulation, as shown in Figure 3, descriptors of i-packets in the first DRAM region are sequentially loaded to a FIFO. Upon the arrival of a packet, say i, at its destination, the dependency tracker decrements the dependency counters (stored in the packet descriptors) of all packets that depend on i. To do this, it needs to (1) access the third DRAM region for the IDs of all packets that depend on i and (2) use these packet IDs to load the corresponding packet descriptors from the second DRAM region. When a dependency counter reaches 0, the dependency tracker updates the timestamp of the corresponding packet descriptor and writes it to the fourth DRAM region (see formula (1) in Section 3 for the timestamp update rule).
To improve the emulation performance, packet descriptors loaded from the second DRAM region are cached for future updates. A writeback occurs when the dependency tracker needs to access a packet descriptor that is not in the cache.
Figure 5: The TDM method proposed by Chu et al. [9].
When available, descriptors of f-packets in the fourth DRAM region are sequentially loaded to a FIFO that is different from the FIFO for descriptors of i-packets. When neither FIFO is empty, the timestamps of the two packet descriptors at the front of these FIFOs are compared and the packet descriptor with the smaller timestamp is selected; if the timestamps are the same, the descriptor of the i-packet is selected. If the selected timestamp is smaller than or equal to the current emulation cycle, the corresponding packet will be generated and injected into the network.
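The selection between the two FIFOs can be modeled as below. The text specifies the case in which both FIFOs hold a descriptor; taking the head of the only non-empty FIFO otherwise is an assumption of this sketch, as are the tuple-based descriptor representation and the function name.

```python
from collections import deque
from typing import Deque, Optional, Tuple

Descriptor = Tuple[int, str]  # (timestamp, label); a stand-in for a real descriptor


def select_for_injection(i_fifo: Deque[Descriptor],
                         f_fifo: Deque[Descriptor],
                         current_cycle: int) -> Optional[Descriptor]:
    """Pop and return the descriptor to inject this cycle, or None."""
    if i_fifo and f_fifo:
        # Smaller timestamp wins; the i-packet wins a tie.
        chosen = i_fifo if i_fifo[0][0] <= f_fifo[0][0] else f_fifo
    elif i_fifo:
        chosen = i_fifo          # assumption: only i-packets are pending
    elif f_fifo:
        chosen = f_fifo          # assumption: only f-packets are pending
    else:
        return None
    if chosen[0][0] <= current_cycle:
        return chosen.popleft()
    return None                  # head is not yet due


i_fifo = deque([(5, "i0"), (9, "i1")])
f_fifo = deque([(5, "f0"), (7, "f1")])
for cycle in range(4, 10):
    print(cycle, select_for_injection(i_fifo, f_fifo, cycle))
```

In this model at most one descriptor is selected per call, which mirrors a node injecting at most one packet at a time.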
In the proposed architecture described in Section 5, the dependency tracker described above is split into two separate units: the dependency checker and the dependency manager.
PROPOSED ARCHITECTURE
5.1 Time-Division Multiplexing (TDM)
To scale to large NoCs, we adopt the time-division multiplexing (TDM) method proposed by Chu et al. [9] . Our dependency-driven, trace-based emulation architecture is tightly coupled to this method. Therefore, first, we briefly describe it.
As shown in Figure 5, a network is emulated using a few nodes called physical nodes. The collection of physical nodes is called the physical cluster. The physical cluster size in Figure 5 is 2×1 while the network size is 4×4. So, each physical node is responsible for emulating eight nodes of the network. Specifically, physical node P0 is in charge of emulating nodes 0, 2, 4, 6, 8, 10, 12, and 14 while physical node P1 is in charge of emulating nodes 1, 3, 5, 7, 9, 11, 13, and 15; nodes 0 and 1 are emulated in parallel, after that nodes 2 and 3 are emulated, and so on. A collection of nodes emulated in parallel by the physical cluster is called a logical cluster. The physical cluster emulates different logical clusters by switching its state to the states of these logical clusters that are stored in a memory called the state memory. The in buffer and the out buffer store communication data between logical clusters. Let N_phy and N_log be the numbers of physical nodes and logical clusters, respectively. The network size is N_node = N_phy × N_log. In Figure 5, we have that N_phy = 2, N_log = 8, and N_node = 16. As in [9], we emulate one network cycle using 2N_log FPGA cycles, 2 FPGA cycles for each logical cluster. It, however, takes more than 2N_log FPGA cycles when we need to stall the emulation to access DRAM and manage dependencies between packets. With a fixed N_node, increasing N_phy will decrease N_log, that is, reduce the number of FPGA cycles required for emulating each network cycle. Thus, it is desirable that the physical cluster size be increased.
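The node-to-physical-node mapping and the resulting stall-free emulation speed can be sketched as follows. The sketch assumes row-major node numbering and that the network is tiled by the physical cluster, which reproduces the 4×4 example above; the general tiling rule, the function names, and the 100 MHz clock in the usage lines are assumptions made for illustration.

```python
from typing import Tuple


def tdm_mapping(node: int, net_width: int,
                phys_width: int, phys_height: int) -> Tuple[int, int]:
    """Return (physical node index, logical cluster index) for a node."""
    x, y = node % net_width, node // net_width
    physical_node = (y % phys_height) * phys_width + (x % phys_width)
    clusters_per_row = net_width // phys_width
    logical_cluster = (y // phys_height) * clusters_per_row + (x // phys_width)
    return physical_node, logical_cluster


def network_cycles_per_second(fpga_hz: float, n_log: int) -> float:
    """Upper bound on emulation speed: 2 * N_log FPGA cycles per network
    cycle, ignoring stalls for DRAM access and dependency management."""
    return fpga_hz / (2 * n_log)


# 4x4 network emulated by a 2x1 physical cluster, as in Figure 5.
nodes_on_p0 = [n for n in range(16) if tdm_mapping(n, 4, 2, 1)[0] == 0]
print(nodes_on_p0)                              # even-numbered nodes
print(tdm_mapping(5, 4, 2, 1))                  # node 5 -> (P1, logical cluster 2)
print(network_cycles_per_second(100e6, 8))      # hypothetical 100 MHz clock
```

Doubling N_phy for a fixed network size halves N_log and therefore doubles this upper bound, which is why a larger physical cluster is desirable.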
The state memory, the in buffer, and the out buffer are implemented using FPGA on-chip block RAMs (BRAMs). For simplicity, they are not included in the architecture figures below.
(7) Dependency counter: this field is initialized with the number of packets that this packet depends on and will be decremented during the emulation until it reaches zero (when all dependencies have been cleared), at which point the packet descriptor is moved to the fourth DRAM region and waits to be injected into the network.
(8) Number of packets that depend on this packet: this field is used to control the number of IDs that need to be loaded from the third DRAM region.
(9) Response latency code: this field is for enforcing latency between the time a d-packet becomes an f-packet and the time this packet can actually be generated and injected into the network.
(10) ID of a packet (if any) that depends on this packet.
(11) Address to load the IDs of the remaining packets (if any) that depend on this packet from the third DRAM region.
The two fields, DRAM region and source address, help to simplify the hardware design. With these fields, a packet descriptor loaded from DRAM can be sent to where it will be processed without any additional information. For the same reason, each packet ID stored in the third DRAM region is also equipped with these two fields. We call the combination of a packet ID with support information a packet ID descriptor.
5.2.2 High-Level Architecture.
Figure 6 shows an overview of the proposed emulation architecture. For clarity, the figure shows only main signals in an abstract way. Each physical node in the physical cluster is connected to an i-packet descriptor loader, a dependency checker, and a dependency manager.
i-packet descriptor loaders are responsible for loading descriptors of i-packets from the first DRAM region and sending them to the traffic generators of the corresponding physical nodes. Since an i-packet does not depend on any others, it can be injected into the network when the current emulation cycle becomes larger than or equal to the timestamp stored in the descriptor. In our design, we adopt the efficient loader architecture proposed in [7] .
When a packet, say X, arrives at its destination, the traffic receptor at this node extracts the number of packets that depend on X, the ID of a packet that depends on X (if any), and the address to load the IDs of the remaining packets that depend on X from DRAM (if any), and sends these data to the corresponding dependency checker. Assume that X has three dependent packets Y, Z, and T, and that X carries the ID of Y. In this case, the dependency checker receives from the traffic receptor the ID of Y and the address to load the IDs of Z and T from DRAM. After receiving these data, the dependency checker first sends the ID of Y to the dependency manager connected to the physical node that emulates the source node of Y. For example, let us consider the case in Figure 5; if the source node of Y is 5, then the ID of Y will be sent to the dependency manager connected to the physical node P1 since node 5 is emulated by physical node P1. The dependency checker then loads the IDs of Z and T from DRAM and sequentially sends them to the appropriate dependency managers.
Dependency managers are responsible for: (1) resolving dependencies between packets, and (2) loading descriptors of f-packets from the fourth DRAM region and sending them to the traffic generators of the corresponding physical nodes. As mentioned in Section 4, dependencies between packets are tracked using packet IDs. When a dependency manager receives a packet ID, it decrements the dependency counter of the packet with that ID. If the dependency counter reaches zero, that is, the packet becomes an f-packet, then the dependency manager will write the descriptor of the packet to the fourth DRAM region. This packet descriptor will be loaded to the corresponding traffic generator by a submodule of the dependency manager called the f-packet descriptor loader.
In general, packet IDs from a dependency checker may be sent to any dependency manager. Therefore, dependency checkers and dependency managers are connected by a crossbar switch.
The DRAM access requests are arbitrated in two layers: locally at each physical node by local DRAM request arbitration logic modules and globally by a global DRAM request arbitration logic module. Packet descriptors and packet ID descriptors provided by the DRAM controller are sent to appropriate i-packet descriptor loaders, dependency checkers, and dependency managers based on the source address and DRAM region information embedded inside them. For example, let us consider the case in Figure 5 again; if the source address is 5, then the destination of the packet descriptors (or packet ID descriptors) will be the i-packet descriptor loader, dependency checker, or dependency manager connected to physical node P1 since node 5 is emulated by physical node P1; the destination will be the i-packet descriptor loader if the DRAM region is the first one, the dependency checker if it is the third region, and the dependency manager otherwise.
Figure 7: Design overview of a dependency checker.
Dependency Checker.
Figure 7 shows an overview of the proposed dependency checker architecture consisting of three submodules: packet ID sending logic, DRAM read generation logic, and stall generation logic. As described in Section 5.2.2, when a packet, say X, arrives at its destination, the traffic receptor at this node sends a notification to the corresponding dependency checker with three pieces of information: the number of packets that depend on X (denoted by number of packet IDs in Figure 7), the ID of a packet that depends on X (if any), and the address to load the IDs of the remaining packets that depend on X from DRAM (if any). If the number of packet IDs is greater than 0, the packet ID sending logic first tries to send the packet ID received from the traffic receptor to the appropriate dependency manager as described in Section 5.2.2. After successfully sending this packet ID, the packet ID sending logic issues a request for packet IDs to the DRAM read generation logic if the number of packet IDs is greater than 1 and waits for the response from the DRAM controller. When the response data (packet ID descriptors) from the DRAM controller arrive, the packet ID sending logic extracts the packet IDs from these descriptors and sequentially sends them to the appropriate dependency managers. For correct timing, we stall the emulation while the packet ID sending logic and the DRAM read generation logic are busy sending packet IDs and accessing DRAM.
We design the architecture in this way because the vast majority of packets have no or only a few dependent packets, as mentioned in Section 4. Sending multiple packet IDs to multiple dependency managers at the same time would slightly reduce the number of processing cycles, but it would incur a high hardware overhead and might significantly decrease the operating frequency.
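The sequential forwarding performed by the dependency checker can be illustrated with a small software model. This is only a sketch under stated assumptions: the names `Notification`, `send_to_manager`, and `read_ids_from_dram` are hypothetical, the crossbar and DRAM controller are reduced to plain callables, and stall timing is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Notification:
    """What the traffic receptor reports when packet X arrives (hypothetical layout)."""
    num_ids: int               # number of packets that depend on X
    first_id: Optional[int]    # ID of one dependent packet, if any
    dram_addr: Optional[int]   # DRAM address of the remaining IDs, if any


def process_notification(
    note: Notification,
    send_to_manager: Callable[[int], None],
    read_ids_from_dram: Callable[[int, int], List[int]],
) -> int:
    """Forward dependent-packet IDs one at a time; return how many were sent."""
    if note.num_ids == 0:
        return 0                                  # common case: nothing depends on X
    send_to_manager(note.first_id)                # the ID delivered with the notification
    sent = 1
    if note.num_ids > 1:                          # remaining IDs must come from DRAM
        for pid in read_ids_from_dram(note.dram_addr, note.num_ids - 1):
            send_to_manager(pid)                  # sent sequentially, not in parallel
            sent += 1
    return sent
```

In this model the DRAM read is issued only when more than one packet depends on X, which mirrors the design choice of keeping the common case (zero or one dependent packet) free of DRAM accesses.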
Dependency Manager.
Dependency managers resolve dependencies between packets by processing packet IDs coming from dependency checkers. Our architecture is designed so that the time overhead of this operation is reduced as much as possible.
Our first idea for reducing the time overhead of resolving packet dependencies is based on the observation that we have at least a few emulation cycles (each emulation cycle is typically equivalent to $2N_{log}$ FPGA cycles) for processing each packet ID. This is because, even when a packet becomes an f-packet, there is a delay of at least a few cycles before the packet can be injected into the network, as described in Section 3. Therefore, in a dependency manager, input packet IDs from dependency checkers are stored in a FIFO and processed one by one, as shown in Figure 8. Each packet ID carries a timestamp indicating the emulation cycle at which it arrives at the dependency manager. This timestamp is used to determine whether the emulation must be stalled so that processing of the packet ID finishes before the emulation cycle from which the packet with that ID can be injected into the network. In this way, a dependency manager can accept other packet IDs while processing a packet ID. Therefore, the waiting time of dependency checkers for sending packet IDs to dependency managers is significantly reduced.
Our second idea for reducing the time overhead of resolving packet dependencies is to reduce the DRAM access time by storing descriptors of d-packets loaded from the second DRAM region in a cache called the d-packet descriptor cache, as shown in Figure 8. Recall that by designing the packet ID structure with two parts, source address and sequence number, as shown in Figure 4, we can use packet IDs as addresses to load packet descriptors from the second DRAM region. The d-packet descriptor cache is direct-mapped and consists of $N_{log}$ cache lines, one for each node of the network emulated by the physical node. For example, in Figure 5, since physical node P1 emulates nodes 1, 3, 5, 7, 9, 11, 13, and 15, the d-packet descriptor cache of the dependency manager connected to this physical node is organized as follows: line 0 for node 1, line 1 for node 3, line 2 for node 5, and so on. Which cache line to access is determined by the signal logical cluster ID, which is calculated by the address translation logic module from the source address (denoted by node ID in Figure 8) embedded in the current packet ID. The second part of the packet ID, the sequence number, is used as the tag and line offset. In the current implementation, the number of packet descriptors per cache line is four. Thus, the two least significant bits of the sequence number are used as the line offset and the remaining bits are used as the tag.
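The cache addressing described above can be made concrete with a short sketch. The function name `split_packet_id` and the `emulated_nodes` list are illustrative assumptions; in particular, the list lookup merely stands in for the address translation logic, whose actual mapping is not reproduced here.

```python
from typing import NamedTuple, Sequence

DESCS_PER_LINE = 4   # from the text: four packet descriptors per cache line
OFFSET_BITS = 2      # log2(DESCS_PER_LINE)


class CacheAddress(NamedTuple):
    line: int     # which cache line (one per emulated node)
    tag: int      # upper bits of the sequence number
    offset: int   # position of the descriptor inside the line


def split_packet_id(source: int, seq: int, emulated_nodes: Sequence[int]) -> CacheAddress:
    """Map a packet ID (source address, sequence number) to a cache location."""
    # Stand-in for the address translation logic: the position of the node in
    # the list of nodes emulated by this physical node selects the line.
    line = list(emulated_nodes).index(source)
    tag = seq >> OFFSET_BITS
    offset = seq & (DESCS_PER_LINE - 1)
    return CacheAddress(line, tag, offset)


# Physical node P1 from the Figure 5 example emulates the odd-numbered nodes.
p1_nodes = [1, 3, 5, 7, 9, 11, 13, 15]
addr = split_packet_id(source=5, seq=0b10110, emulated_nodes=p1_nodes)
```

For node 5 and sequence number `0b10110`, this yields line 2, tag `0b101`, and offset `0b10`, matching the "line 2 for node 5" layout in the example.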
The basic operation of a dependency manager in resolving dependencies between packets is as follows. First, a packet ID (along with an attached timestamp) is dequeued from the input FIFO and written into a register for later processing. For simplicity of presentation, we assume that X is the packet associated with the dequeued packet ID. Next, the packet ID is used to access the d-packet descriptor cache as described above. Upon a hit, the descriptor of X in the accessed cache line is updated. Specifically, the dependency counter of X, which tracks the number of dependencies of X that have not yet been cleared, is decremented. Upon a miss, if the accessed cache line is not valid, a read request is sent to a module called DRAM access generation logic where requests to the DRAM controller are made; in case the accessed cache line is valid, the read request is preceded by a writeback request. When the requested cache line that contains the descriptor of X comes, we update the descriptor of X (that is, decrement the dependency counter of X) and then store the line with the updated descriptor of X into the cache. At the same time, the associated tag and valid bit are also updated. After that, if the dependency counter of X has become zero, that is, X has become an f-packet, the d-packet descriptor cache sends a write request to the DRAM access generation logic to write the descriptor of X to the fourth DRAM region. The write address is composed of the following three parts in the order from the most to the least significant bits: the ID of the fourth DRAM region, X's source address, and the number of f-packets with the same source address as X until now, which is stored in one entry of the #writes history table. The DRAM access generation logic sends an update request to the #writes history table by asserting a signal called DRAM write request accepted. Finally, the dependency manager goes back to the first step and repeats the operation described above.
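The steps above can be summarized in a behavioral model. The sketch below is an assumption-laden simplification rather than the actual RTL: a descriptor is reduced to its dependency counter, both DRAM regions are plain dictionaries, the cache is keyed directly by source address, and all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

OFFSET_BITS = 2  # four descriptors per cache line, as stated in the text


@dataclass
class Line:
    valid: bool = False
    tag: int = 0
    counters: List[int] = field(default_factory=lambda: [0] * 4)


@dataclass
class DependencyManager:
    # Second DRAM region: (source, tag) -> dependency counters of four d-packets.
    region2: Dict[Tuple[int, int], List[int]]
    cache: Dict[int, Line] = field(default_factory=dict)    # one line per source node
    writes: Dict[int, int] = field(default_factory=dict)    # "#writes" history table
    # Fourth DRAM region: (source, write index) -> packet ID of the f-packet.
    region4: Dict[Tuple[int, int], Tuple[int, int]] = field(default_factory=dict)

    def resolve(self, source: int, seq: int) -> bool:
        """Clear one dependency of packet (source, seq); True if it became an f-packet."""
        tag = seq >> OFFSET_BITS
        offset = seq & ((1 << OFFSET_BITS) - 1)
        line = self.cache.setdefault(source, Line())
        if not (line.valid and line.tag == tag):             # cache miss
            if line.valid:                                   # write back the old line first
                self.region2[(source, line.tag)] = list(line.counters)
            line.counters = list(self.region2[(source, tag)])  # read the requested line
            line.valid, line.tag = True, tag
        line.counters[offset] -= 1                           # one dependency cleared
        if line.counters[offset] == 0:                       # X is now an f-packet
            n = self.writes.get(source, 0)
            self.region4[(source, n)] = (source, seq)        # address: region, source, count
            self.writes[source] = n + 1                      # update on accepted write
            return True
        return False
```

The write index taken from `writes` plays the role of the least significant part of the write address, so f-packets with the same source address are stored consecutively in the fourth region.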
As mentioned in Section 5.2.2, besides the task of resolving dependencies between packets, dependency managers have another task to perform. They are responsible for loading descriptors of f-packets from the fourth DRAM region and sending them to the traffic generators of the corresponding physical nodes. This task is performed by a module called f-packet descriptor loader. Like the i-packet descriptor loader, this module is implemented using the architecture proposed in [7].
The #reads history table records the number of packet descriptors that have been loaded from the fourth DRAM region so far. By calculating the difference between #writes[i] and #reads[i], the number of packet descriptors with the corresponding source address in the fourth DRAM region that have not yet been loaded is determined. If this number is larger than zero, the f-packet descriptor loader will be asked to issue a read request to the DRAM controller. Upon the acceptance of the read request, #reads[i] is updated. At the same time, the number of packet descriptors of the issued read request is stored into #pdescs[i]. This number is sent along with the packet descriptors to the traffic generator to let the traffic generator know how many of the incoming packet descriptors are valid.
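The bookkeeping with the two history tables can be sketched as follows. The burst size `max_per_request` and the rule that #reads[i] advances by the number of valid descriptors are assumptions made for illustration; the text does not specify either.

```python
from typing import List


def plan_f_packet_read(
    writes: List[int],
    reads: List[int],
    i: int,
    max_per_request: int = 4,   # assumed burst size, not given in the text
) -> int:
    """Return the number of valid descriptors in the next read for node i (0 if none)."""
    pending = writes[i] - reads[i]       # f-packets written but not yet loaded
    if pending <= 0:
        return 0                         # nothing to load, no read request
    n_valid = min(pending, max_per_request)
    reads[i] += n_valid                  # updated when the request is accepted
    return n_valid                       # value that would be stored into #pdescs[i]


writes_table = [0, 6]
reads_table = [0, 0]
first = plan_f_packet_read(writes_table, reads_table, 1)
second = plan_f_packet_read(writes_table, reads_table, 1)
third = plan_f_packet_read(writes_table, reads_table, 1)
```

With six pending descriptors and the assumed burst of four, the first request carries four valid descriptors, the second carries two, and the third is not issued.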
As shown in Figure 8, packet descriptors from the DRAM controller are directed to either the d-packet descriptor cache or the f-packet descriptor loader based on the DRAM region information embedded inside them. Specifically, packet descriptors of the second DRAM region are directed to the d-packet descriptor cache, while those of the fourth DRAM region are directed to the f-packet descriptor loader.
For correct timing, we stall the emulation when necessary to make sure that the processing of every packet ID is completed before the emulation cycle from which the packet associated with that ID can be injected into the network. In Figure 8, the stall generation logic module is responsible for generating this stall signal.
5.2.5 Traffic Generator.
Figure 9 shows an overview of the proposed traffic generator architecture. Descriptors of i-packets and f-packets are stored in two separate source queues. Thus, there are two packet sources, one connected to the corresponding i-packet descriptor loader (for descriptors of i-packets) and the other connected to the corresponding dependency manager (for descriptors of f-packets). At each emulation cycle, the generation timestamps of the packet descriptors at the heads of the source queues are compared. The source queue having the smaller generation timestamp is selected. In case the timestamps are equal, the source queue for descriptors of i-packets is prioritized. If the timestamp of the packet descriptor at the head of the selected source queue is smaller than or equal to the current emulation cycle, this packet descriptor can be sent to a module called flit generator where a packet will be generated and injected into the network. The flit generator tracks the status of the network based on the incoming flow control credits.
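The source queue selection can be expressed compactly in software. In this sketch, descriptors are `(timestamp, payload)` tuples and the function name `select_descriptor` is hypothetical; the handling of an empty queue (simply picking from the other one) is an assumption, and the flit generator's credit tracking is not modelled.

```python
from collections import deque
from typing import Deque, Optional, Tuple

Descriptor = Tuple[int, str]   # (generation timestamp, payload), illustrative only


def select_descriptor(
    i_queue: Deque[Descriptor],
    f_queue: Deque[Descriptor],
    now: int,
) -> Optional[Descriptor]:
    """Pick the descriptor to hand to the flit generator at emulation cycle `now`."""
    # i_queue is listed first so that it wins ties: min() returns the first
    # minimal element it encounters.
    candidates = [q for q in (i_queue, f_queue) if q]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda q: q[0][0])
    if chosen[0][0] <= now:        # head descriptor is due
        return chosen.popleft()
    return None                    # head descriptor lies in the future


i_q = deque([(5, "i0")])
f_q = deque([(5, "f0"), (9, "f1")])
picks = [select_descriptor(i_q, f_q, 5) for _ in range(3)]
late = select_descriptor(i_q, f_q, 9)
```

At cycle 5 the i-packet descriptor is taken first because of the tie-break, then the f-packet descriptor with the same timestamp; the descriptor stamped 9 is held back until cycle 9.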
For tracking the timing of the emulation, each packet source maintains a cycle counter. This cycle counter is updated each time the packet source extracts a packet descriptor from the input and puts it into the corresponding source queue. The value used to update the cycle counter is the generation timestamp of the extracted packet descriptor.
The timing of the emulation is guaranteed to be correct by a module called stall generation logic. We stall the emulation in the following four cases:
• The stall signal from the dependency checker is asserted.
• The stall signal from the dependency manager is asserted.
• Both of the following two conditions hold: (1) the source queue for descriptors of i-packets is empty, and (2) the cycle counter of the corresponding packet source is smaller than or equal to the current emulation cycle.
• All of the following three conditions hold: (1) the source queue for descriptors of f-packets is empty, (2) the cycle counter of the corresponding packet source is smaller than or equal to the current emulation cycle, and (3) the number of f-packets that have the current node as their source address and have not yet been loaded from DRAM is larger than zero.
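The four stall cases listed above combine into a single Boolean condition. The following sketch assumes hypothetical signal names; it only restates the listed conditions and does not model how the individual signals are produced.

```python
def must_stall(
    checker_stall: bool,     # stall signal from the dependency checker
    manager_stall: bool,     # stall signal from the dependency manager
    i_queue_empty: bool,     # source queue for i-packet descriptors is empty
    i_counter: int,          # cycle counter of the i-packet source
    f_queue_empty: bool,     # source queue for f-packet descriptors is empty
    f_counter: int,          # cycle counter of the f-packet source
    pending_f: int,          # f-packets of this node not yet loaded from DRAM
    now: int,                # current emulation cycle
) -> bool:
    """Return True if the emulation has to be stalled in this cycle."""
    i_starved = i_queue_empty and i_counter <= now
    f_starved = f_queue_empty and f_counter <= now and pending_f > 0
    return checker_stall or manager_stall or i_starved or f_starved
```

Note that an empty f-packet queue alone does not cause a stall: unlike i-packets, f-packets exist only once their dependencies are cleared, so the third condition on `pending_f` is needed.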
EVALUATION
Using the proposed methods and architectures, we build a NoC emulator called DNoC on a Xilinx VC707 board. Our design is written in Verilog HDL and synthesized using Vivado 2019.1. We send trace data from a host PC to the FPGA board via PCI Express. Our implementation of the PCI Express interface is based on the RIFFA framework [14]. Table 2 shows the parameters of our target NoCs. We implement a classical ASIC-style router architecture [10]. However, FPGA-friendly architectures like Hoplite [18] can also be integrated. The flit size depends on the implemented routing algorithm and the number of virtual channels (VCs) per port. All of these parameters can be changed by slightly modifying the RTL source code. We use the PARSEC [2] traces collected by Hestness et al. [13] to evaluate DNoC. These traces are for 8×8 NoCs. Although DNoC can scale to a 64×64 NoC, as will be shown in Section 6.1, we could not find appropriate traces for evaluating it at this scale. For each PARSEC trace, we focus on the parallel region of the application. The packet length varies between 4 and 36 flits.

Hardware Overhead and Scalability

Table 3 shows the hardware overhead and operating frequency of DNoC when emulating different NoC sizes using a 4×4 physical cluster. This result is obtained when the number of VCs per port is 2 and the XY routing algorithm is used. We can see that the numbers of required LUTs and registers do not change much as the NoC size increases. Emulating the 64×64 NoC requires slightly more LUTs than the other cases because we implement some memories in the i-packet descriptor loaders, dependency checkers, and dependency managers using LUTs. Although our design allows these memories to be implemented using BRAMs, we chose to implement them using LUTs because the target FPGA does not have enough BRAMs. We can also see that the number of required BRAMs does not change when the NoC size increases from 8×8 to 32×32. This is because the size of BRAMs is fixed and they are underutilized when emulating the 8×8 and 16×16 NoCs.
The result in Table 3 indicates that our emulation architecture scales well. The size of the largest NoC depends on only the on-chip memory capacity of the target FPGA. Since we use an 8-year-old FPGA board and the BRAM capacity on an FPGA is doubling every two years [27], we can expect that much larger NoCs can be supported on current devices.
DNoC can operate at a high frequency. As shown in Table 3, the operating frequency is 110 MHz when the emulated NoC size is 8×8, 16×16, or 32×32. When the NoC size is 64×64, it slightly decreases to 105 MHz.
Correctness
We verify the correctness of DNoC by comparing it with BookSim [15]. To do this, we extend BookSim to support simulation with the same traces as DNoC.
Our extensive experiments and analysis show that DNoC and BookSim are identical in both final outcome results and intermediate states. This indicates that the NoC models of DNoC and BookSim are the same and our implementation of the proposed emulation architecture in DNoC is correct. Figure 10 shows the average packet latencies obtained when emulating the 8×8 NoC shown in Table 2 with the PARSEC traces. Here we implement a minimal adaptive routing algorithm based on the odd-even turn model [6] and denote it by odd-even below. For each trace, the latency data are collected over 20 million warmup cycles and 20 million measurement cycles. We can see that, in this experiment, increasing the number of VCs and changing the routing algorithm to odd-even have only a slight impact on the result. The NoC can perfectly handle the produced workloads with the configuration of 2 VCs per port and XY routing algorithm.

As mentioned in Section 2, to the best of our knowledge, the FPGA-based NoC emulator proposed by Drewes et al. [12] is the only existing one that supports trace-driven workloads with dependencies between packets taken into account. However, this emulator has not been verified against well-known NoC simulators.
Emulation Performance
We first describe a performance model for DNoC. Let $N_{node}$, $N_{phy}$, and $N_{log}$ be the emulated NoC size, the physical cluster size, and the number of logical clusters, respectively. We define $\alpha$ ($0 < \alpha \le 1$) as a coefficient reflecting the impact of accessing DRAM and managing dependencies between packets. $\alpha$ is smaller than 1 when the network is stalled during the emulation as described in Section 5. Since emulating each logical cluster takes two FPGA cycles, the emulation speed $S$ (cycles/s) of DNoC is given by

$$S = \alpha \cdot \frac{F}{2 N_{log}}$$
where F is its operating frequency in Hz.

Figure 11 shows the speed of DNoC when emulating the 8×8 NoC shown in Table 2 (configuration: 2 VCs/port, XY routing) with the PARSEC traces using a 4×4 physical cluster. The figure also shows how the coefficient α changes with different applications. DNoC achieves an emulation speed of 8,312K-12,979K cycles/s.

We next compare DNoC's speed with that of BookSim. For this comparison, BookSim is run on a Core i9 9900K machine with 64GB DDR4 memory. Figure 12 shows the comparison result. We can see that the speedup of DNoC over BookSim varies with the applications, ranging from 191× to 1,334× with an average (geomean) of 395×. The higher the network load an application exhibits, the higher the speedup achieved.
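As a rough worked example of the performance model, the sketch below assumes that the model takes the form S = αF / (2·N_log), which follows from the statement that each logical cluster takes two FPGA cycles, and that N_log equals N_node divided by N_phy. Both the functional form and the function names are assumptions made for illustration.

```python
def logical_clusters(n_node: int, n_phy: int) -> int:
    """Number of logical clusters, assuming N_log = N_node / N_phy."""
    return n_node // n_phy


def emulation_speed(alpha: float, freq_hz: float, n_log: int) -> float:
    """Emulation speed in cycles/s under the assumed model S = alpha * F / (2 * N_log)."""
    return alpha * freq_hz / (2 * n_log)


# 8x8 NoC on a 4x4 physical cluster at the reported 110 MHz.
n_log = logical_clusters(n_node=64, n_phy=16)
peak = emulation_speed(alpha=1.0, freq_hz=110e6, n_log=n_log)   # stall-free upper bound

# Coefficients implied by the reported speed range, under the same assumptions.
alpha_low = 8_312e3 / peak
alpha_high = 12_979e3 / peak
```

Under these assumptions the stall-free upper bound is 13.75M cycles/s, and the reported range of 8,312K-12,979K cycles/s would correspond to α between roughly 0.60 and 0.94.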
Finally, we compare DNoC's speed with that of the FPGA-based NoC emulator proposed by Drewes et al. [12]. As presented in Section 2, this emulator achieves an emulation speed of just 7K-83K cycles/s, which is only slightly higher than BookSim's speed. DNoC is 274× faster than this emulator.
CONCLUSION
In this paper, we proposed efficient methods and architectures to build a fast FPGA-based NoC emulator that supports trace-driven workloads with dependencies between packets taken into account. Our NoC emulator, which we call DNoC, is the first one that is built solely on FPGA programmable logic and considers dependencies between packets. Our evaluation shows that, when emulating an 8×8 NoC with the PARSEC traces, DNoC achieves a speedup of 274× over a recently proposed FPGA-based emulator that relies on the support of the hard processors on a SoC FPGA, and 395× over a popular software simulator while providing the same results. DNoC also scales well. It can scale to a 64×64 NoC on a VC707 board, and can be expected to emulate much larger NoCs on modern FPGAs with more on-chip memory.
ACKNOWLEDGMENTS
This work was supported by JSPS KAKENHI Grant Number JP-16H02794.
