Abstract. In this paper, we present a proposal of a new internal structure for NetFPGA cards and its analysis. We propose to use a switching fabric instead of a single pipeline.
Introduction
Today's IT world moves very fast. Many technologies offer high throughput, high capabilities and high performance. Most of them are locked inside industrial chips (ASICs), and users remain only users: they have no influence on the functionality of such chips. There is, however, a technology well suited to users who want to implement their own functionality in very fast chips: the FPGA (Field Programmable Gate Array). FPGAs are programmable digital chips that offer very flexible functionality which can be implemented very quickly. Companies and chip producers have prepared many development boards that are ready to be programmed and used, so the user gets a convenient tool for prototyping and research. In this paper, we focus on NetFPGA cards. They were designed by Stanford University and the University of Cambridge and are manufactured by Digilent and HiTech Global. A NetFPGA card has a main FPGA chip and Ethernet network interfaces. Ethernet frames received on these interfaces are treated as digital data and processed in the main chip. Following this philosophy, a NetFPGA card works as a network node. Such a node is programmable over a very wide range and performs its functions very fast (in programmable hardware). In this paper, we describe our proposal of a new internal structure for NetFPGA cards. The rest of this paper is organized as follows:
Section II describes NetFPGA cards, Section III presents our motivation for the work described here, and Sections IV and V describe the current and new structures of NetFPGA cards and their analysis. Finally, we conclude the presented work and describe future work.
NetFPGA Cards
NetFPGA cards are typical PC extension cards [1]. They are widely described in the literature and on websites. As mentioned earlier, the cards were designed by members of the NetFPGA group, composed of scientists and researchers from Stanford University and the University of Cambridge [1]. The cards are offered at reasonable prices and the software is available under a BSD license. The main chip is programmed in an HDL (Hardware Description Language). Due to that fact, almost any idea can be implemented on that chip; it is only a matter of time and the designer's skills. Consequently, it is a powerful tool for scientists, researchers and students alike [2], [3].
The code base provided for a NetFPGA card contains reference projects and contributed projects [4]. By using them, a student may become familiar with the framework and with other tools, such as traffic analyzers and Perl/Python scripts.
The development of a new NetFPGA project involves a few obligatory stages. First, a functional design must be made; then the functionality is encoded in an HDL (such as Verilog or VHDL). Using modules from the NetFPGA code base is a good idea, because these pieces of code are already tested. There are many tools that support HDL code preparation and analysis. Because the main programmable device on the NetFPGA card (both 1G and 10G) is made by Xilinx, the dedicated tools also come from this company: ISE (Integrated Software Environment) [5] for 1G and Xilinx Platform Studio (XPS) [6] for 10G, because of a different code architecture.
A NetFPGA reference pipeline consists of queues associated with physical ports (4 physical ports -> 4 input queues for incoming frames and 4 output queues for outgoing frames, marked green in Fig. 3). This structure is also used to project the physical ports onto the PC side. These ports can be detected by the operating system via a driver.
Their default names are: nf2c0, nf2c1, nf2c2, nf2c3 for the 1G card or nf0, nf1, nf2, nf3 for the 10G card.
An incoming frame is placed in the corresponding input queue (RxQueue) and passed to the Input Arbiter module, which chooses one of the input queues, takes a packet from it and sends it along the pipeline. Therefore, it does not matter through which port/queue a packet arrives, because all packets are sent into a single pipeline. The packets are then processed by the modules along the pipeline. At the end, the Output Port Lookup module decides to which output queue (TxQueue) the packet is to be sent.
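The flow through the reference pipeline can be sketched in software. The following is only a minimal behavioural model, not the HDL of the reference design: the class name, the `lookup` and `modules` parameters and the round-robin arbitration policy are assumptions introduced for illustration.

```python
from collections import deque

NUM_PORTS = 4  # four physical ports, as in the reference design


class ReferencePipeline:
    """Toy model: RxQueues -> Input Arbiter -> modules -> Output Port Lookup -> TxQueues."""

    def __init__(self, lookup, modules=()):
        self.rx_queues = [deque() for _ in range(NUM_PORTS)]
        self.tx_queues = [deque() for _ in range(NUM_PORTS)]
        self.lookup = lookup          # hypothetical: (in_port, frame) -> out_port
        self.modules = list(modules)  # hypothetical per-frame processing steps
        self._next = 0                # round-robin pointer (assumed policy)

    def receive(self, port, frame):
        """A frame arrives on a physical port and waits in its RxQueue."""
        self.rx_queues[port].append(frame)

    def arbitrate(self):
        """Input Arbiter: take one frame from the next non-empty RxQueue."""
        for offset in range(NUM_PORTS):
            port = (self._next + offset) % NUM_PORTS
            if self.rx_queues[port]:
                self._next = (port + 1) % NUM_PORTS
                return port, self.rx_queues[port].popleft()
        return None

    def step(self):
        """Push exactly one frame through the single pipeline; False if idle."""
        picked = self.arbitrate()
        if picked is None:
            return False
        in_port, frame = picked
        for module in self.modules:
            frame = module(frame)
        out_port = self.lookup(in_port, frame)
        self.tx_queues[out_port].append(frame)
        return True


# Example: a made-up lookup rule that forwards each frame to the next port.
pipe = ReferencePipeline(lookup=lambda in_port, frame: (in_port + 1) % NUM_PORTS)
pipe.receive(2, "frame-A")
pipe.receive(0, "frame-B")
while pipe.step():
    pass
print([list(q) for q in pipe.tx_queues])  # [[], ['frame-B'], [], ['frame-A']]
```

Note that `step` handles one frame at a time regardless of the input port, which mirrors the single-pipeline property described above.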
A new project on NetFPGA can be started from scratch, which means preparing all the code for processing packets, or one might only prepare one module and insert it into a reference pipeline.
Reference NIC [4] (Network Interface Card) is an example of a simple project. All frames are sent along the pipeline and the Output Port Lookup only sets the output port according to the input port. For example, a packet arriving at the first physical port is sent to the nf0 logical PC port (for the 10G card), and a frame arriving from nf2 is sent to the third physical port.
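The port mapping of the Reference NIC described above can be written down as a small lookup function. This is a sketch under our own naming: `nic_lookup`, `host_name` and the `"phy"`/`"host"` labels are hypothetical and do not come from the NetFPGA code base.

```python
NUM_PORTS = 4


def nic_lookup(source_kind, index):
    """Map a frame's source to its destination in a NIC-like design.

    source_kind is "phy" (physical Ethernet port) or "host" (logical PC port).
    A frame from physical port i goes to logical port i, and vice versa.
    """
    if not 0 <= index < NUM_PORTS:
        raise ValueError(f"port index out of range: {index}")
    if source_kind == "phy":
        return ("host", index)
    if source_kind == "host":
        return ("phy", index)
    raise ValueError(f"unknown source kind: {source_kind}")


def host_name(index, card="10G"):
    """Default interface name on the PC side for the given card type."""
    return f"nf{index}" if card == "10G" else f"nf2c{index}"


kind, idx = nic_lookup("phy", 0)
print(kind, host_name(idx))   # host nf0
print(nic_lookup("host", 2))  # ('phy', 2), i.e. the third physical port
```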
NetFPGA cards, especially the ones with a Virtex-5 chip, may also be used for more general purposes (not only for networking). Algorithms that seem very complicated [9] [10] [11] [12] [13] may be implemented in the FPGA chip, and their hardware realization may be very efficient (due to the parallelization of some operations) [14] [15] [16].
Motivation
We have used, analyzed and tested many existing reference projects, and we have also prepared several of our own. We know NetFPGA cards in some detail and we are familiar with their internal structure (for both the 1 Gbps and 10 Gbps versions).
We found that the performance of data processing in the main chip can be improved. We decided to introduce relatively small changes in the way the main pipeline is realized. Through this modification, we would like to obtain the same throughput but a smaller delay of Ethernet frames. It can be said that the implementation of our ideas introduces multi-core, parallel frame processing in hardware chips. We will prepare a model that allows us to simulate, investigate and compare the performance of both versions and, more importantly, we will build and investigate a prototype of the new version in our laboratory. We are going to measure our prototype with professional industrial network analyzers. We also plan to use the results of our investigation in newer versions.
Current Structure
In reference projects, there is a single main pipeline, which is also presented in Fig. 4. When the Input Arbiter chooses frames for processing, they are sent to the next modules one by one. A frame is sent to the next module only after the current module has completely finished processing it. In the worst case, all input queues offer frames at the same moment (this situation is presented in Fig. 6). In such a situation, all frames will be processed, but the last of them has to wait until all the previous ones have been processed. Typically, a frame is processed in two logical phases: the first one for the header of the frame and the second one for the payload (the data of the frame). These phases are realized by different parts of the hardware. In such an architecture, while the header is analyzed, the hardware for data transmission is idle; analogously, while the data of the frame are transmitted, the hardware for the header is not used. The consequences are shown in Fig. 7.
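The worst case described above can be illustrated with a very simple timing model. The sketch below assumes that each frame needs a fixed header-analysis time and a fixed payload-transmission time, and that all frames are waiting at time 0; the function names and the time values (arbitrary units) are illustrative assumptions, not measurements.

```python
def serial_finish_times(frames):
    """Completion time of each frame in a single pipeline.

    frames is a list of (header_time, payload_time) pairs. A frame starts
    only after the previous one has been completely processed.
    """
    now = 0
    finish = []
    for header_time, payload_time in frames:
        now += header_time + payload_time
        finish.append(now)
    return finish


def idle_times(frames):
    """Idle time of the header hardware and of the transmission hardware."""
    total = sum(h + p for h, p in frames)
    header_busy = sum(h for h, _ in frames)
    payload_busy = sum(p for _, p in frames)
    return total - header_busy, total - payload_busy


# Four frames offered at the same moment, one per input queue.
frames = [(2, 8)] * 4
print(serial_finish_times(frames))  # [10, 20, 30, 40]
print(idle_times(frames))           # (32, 8)
```

In this toy case the last frame leaves only after 40 time units, and the header hardware is idle for 32 of them.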
Proposal of New Structure and Its Analysis
We are going to rebuild the parts responsible for header analysis and data transmission. We would like to avoid blocking the transmission while a header is analyzed, and vice versa. This is possible, but it requires a substantial modification of the pipeline. We propose to combine both functions in one, more complex, module. We will prepare a module that, just after analyzing the header of one frame, takes the next frame for analysis, while the data part of the just-analyzed frame is sent to transmission.
Importantly, in such a case the transmission has to be realized by a more complicated structure than a single pipeline. We propose to use a switching fabric which can transmit several frames at the same time (Fig. 5). We are planning to investigate different types of switching fabrics. We will start from the basic one, which can be realized with multiplexers. We are not planning to change the scheduler algorithm implemented in the reference design.
We are certain that we will obtain a smaller delay of Ethernet frames in nodes with the new structure, because we will be able to serve several frames at the same time.
This situation is presented in Fig. 8 (time table for processing four frames in parallel, modified pipeline). We are going to prove it in several ways. First, we are going to prepare a typical simulation on a PC. We have also decided to prepare an analogous simulation in hardware, i.e. to prepare a model of the old and new structures in an FPGA chip dedicated to simulations, so that the simulation runs very fast. After we find the parameters of both versions, we will implement a prototype and compare real nodes. For this analysis, we will use both our own and professional, certified network analyzers [20] [21] [22]. At this moment, we have prepared a hardware traffic generator and modules for the analysis of the served traffic.
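A PC simulation of this kind could start from a model as simple as the one below. It is only a sketch under strong assumptions: one shared header-analysis unit, fixed per-frame times in arbitrary units, all frames waiting at time 0, and a fabric in which a payload can start as soon as its header is done and its output is free. The function names and the numbers are hypothetical and are not results of our investigation.

```python
def single_pipeline_finish(frames):
    """Current structure: header and payload of each frame strictly in sequence."""
    now, finish = 0, []
    for header_time, payload_time, _out_port in frames:
        now += header_time + payload_time
        finish.append(now)
    return finish


def fabric_finish(frames):
    """Proposed structure: headers analyzed back to back, payloads sent in
    parallel through the fabric unless two frames need the same output."""
    header_done = 0
    output_free_at = {}
    finish = []
    for header_time, payload_time, out_port in frames:
        header_done += header_time                       # header unit is shared
        start = max(header_done, output_free_at.get(out_port, 0))
        output_free_at[out_port] = start + payload_time  # output busy until then
        finish.append(start + payload_time)
    return finish


# Four frames (header_time, payload_time, out_port), all to different outputs.
frames = [(2, 8, 0), (2, 8, 1), (2, 8, 2), (2, 8, 3)]
print(single_pipeline_finish(frames))  # [10, 20, 30, 40]
print(fabric_finish(frames))           # [10, 12, 14, 16]

# Two frames competing for the same output are still served one after another.
print(fabric_finish([(2, 8, 0), (2, 8, 0)]))  # [10, 18]
```

The model ignores input conflicts and queueing inside the fabric, so it should be read as an upper bound on the possible gain rather than as a prediction.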
A switching fabric is dedicated to the transmission of frames from its inputs to its outputs. It can transmit several frames simultaneously, provided they are not in conflict. It can be said that there is one hardware block dedicated to transmission for each possible input-output pair. We can also say that with such an approach, we obtain parallel transmission implemented by a multi-block structure.
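The notion of frames that are "not in conflict" can be made concrete for a basic fabric built from multiplexers, with one multiplexer per output. In the sketch below, requests are granted in the order given, which is an assumption made only for illustration (the scheduler of the reference design is not modelled here); `grant_requests` and `mux_selects` are hypothetical names.

```python
def grant_requests(requests):
    """Split (in_port, out_port) requests into a conflict-free set and the rest.

    Two requests conflict if they share an input or an output port.
    """
    busy_inputs, busy_outputs = set(), set()
    granted, deferred = [], []
    for in_port, out_port in requests:
        if in_port in busy_inputs or out_port in busy_outputs:
            deferred.append((in_port, out_port))
        else:
            busy_inputs.add(in_port)
            busy_outputs.add(out_port)
            granted.append((in_port, out_port))
    return granted, deferred


def mux_selects(granted, num_ports=4):
    """Select value for each output multiplexer; None marks an idle output."""
    selects = [None] * num_ports
    for in_port, out_port in granted:
        selects[out_port] = in_port
    return selects


requests = [(0, 2), (1, 3), (2, 2), (3, 0)]
granted, deferred = grant_requests(requests)
print(granted)               # [(0, 2), (1, 3), (3, 0)]
print(deferred)              # [(2, 2)]
print(mux_selects(granted))  # [3, None, 0, 1]
```

Here three frames are transmitted at the same time, while the request from input 2 has to wait because output 2 is already in use.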
Future Work and Conclusions
We know that we have proposed a big change in the working mechanism, but we believe the effort is worthwhile. We trust that even a small improvement of the parameters affecting the efficiency of high-throughput nodes can result in a great improvement elsewhere in the whole network.
The work and plans described in this paper will allow us to modify the existing structure and obtain nodes with better performance.
We also plan to extend our equipment with a new version of the cards and implement our ideas and solutions on them.
Acknowledgment
The work described in this paper was financed from the funds of the Ministry of Science and Higher Education for the year 2016.

M. Michalski, T. Sielach
We would also like to thank (in alphabetical order):
• 
