Abstract-A Virtual Private Network (VPN) encrypts and decrypts the private traffic it tunnels over a public network. Maximizing the available bandwidth is an important requirement for network applications, but the cryptographic operations add significant computational load to VPN applications, limiting the network throughput. This work presents a coprocessor designed to offer hardware acceleration for these encryption and decryption operations. The open-source SigmaVPN application is used as the base solution, and a coprocessor is designed for the parts of the Networking and Cryptography library (NaCl) that underlie the cryptographic operations of SigmaVPN. The hardware-software codesign of this work is implemented on a Xilinx Zynq-7000 SoC and shows a 93% reduction in the execution time of encrypting a 1024-byte frame. This improves the TCP and UDP communication bandwidths by factors of 4.36 and 5.36, respectively, compared to a pure software solution for a 1024-byte frame.
I. INTRODUCTION
VPNs provide private communication between devices over a public link, as well as confidentiality and integrity of messages, by encrypting and authenticating the traffic between communicating devices. This work targets a VPN device realized as a hardware-software codesign on a Zynq SoC, which is programmable in both software and hardware. The device is designed to perform expensive cryptographic operations in hardware for a software-based VPN and to provide higher communication bandwidth for the communicating parties. The point-to-point and client-server communication models are two common VPN architectures. Our solution follows the former, creating a direct tunnel between parties accessible to the designed VPN devices.
In this work, we introduce a VPN device design which has two ports: one is referred to as the public network port, and the other as the private network port. The main task of the device is to encrypt packets received from the private network and to send them to a predefined destination VPN device over the public network. The inverse operation is decrypting the frames received from the other VPN device over the public network and forwarding the decrypted messages to the private network. The goal is to enable the private networks of the VPN devices to communicate securely over a public network, so that no one can eavesdrop on the communication. Note that this is done completely transparently to the communicating parties behind the VPN devices. Our approach is based on modifying a software VPN solution to construct a VPN device, and then providing hardware acceleration for it.
Our work introduces a VPN device based on VPN software named SigmaVPN, in contrast to providing another hardware accelerator for IPsec VPN. SigmaVPN is an open-source Linux user space application, and uses the NaCl cryptographic library by Dan Bernstein [1] for its cryptographic operations. NaCl defines CryptoBox, which uses a subset of NaCl's operations and implements a full protocol for authenticated encryption. This work aims to provide hardware acceleration for NaCl's CryptoBox. As is true for many cryptographic algorithms, NaCl's design relies on expensive arithmetic operations. Therefore, the additional cost of encrypting and decrypting messages limits the performance of the VPN application, reducing the available communication bandwidth. A dedicated cryptographic coprocessor, however, has many advantages over executing the cryptographic operations in software.
Our VPN device was implemented and evaluated on Xilinx's all-programmable Zynq-7000 SoC. The Zynq is a highly suitable SoC device for this work. It offers great processing power on the Processing System (PS) to run Linux at high performance; in addition, it offers FPGA fabric on the Programmable Logic (PL), which is equipped with DSP48 slices [2] to execute arithmetic operations fast. SigmaVPN runs on the PS, with NaCl's CryptoBox implemented in hardware. The coprocessor executes the cryptographic operations faster and more efficiently than the software implementation, and consequently greatly increases the device's network bandwidth. Providing a low-latency communication channel between the coprocessor in hardware and the VPN software was an important task in this work, since any overhead introduced by the communication negatively affects the performance of the VPN.
The main contributions of this work are: (1) A design of an efficient hardware architecture for accelerating the cryptographic operations of a software-based VPN. (2) An evaluation of the architecture on the Zynq SoC.
II. RELATED WORK
There are many examples of VPN accelerators, but they typically focus on providing acceleration to IPsec-based VPN applications. IPsec [3] is an alternative to software-based VPN solutions for establishing confidential communication. While IPsec- and software-based solutions offer different sets of features, one does not have a significant performance advantage over the other [4].
An example implementation [5] Although IPsec is often used to build a VPN, its specification includes a wide range of protocols, making it a complicated design. Therefore, we opt to base our approach on a software-based VPN application, SigmaVPN, instead of IPsec. It has a very compact code base and a single, strong cipher suite (which makes implementation easier without sacrificing security), and it is modular and easy to integrate with hardware.
To our knowledge, "NaCl's crypto_box in hardware" [7] is the first and only paper to bring NaCl's CryptoBox into hardware. It presents a full implementation of the CryptoBox with a resource-constrained approach aimed at Internet of Things (IoT) applications. The design is based on an Application Specific Instruction Set Processor (ASIP). There exist only a few standalone hardware implementations of the underlying cryptographic operations of NaCl's CryptoBox, e.g. the Salsa20 stream cipher. These are referred to in the evaluation section, where they are compared with our implementation.
III. BACKGROUND
A. SigmaVPN
As mentioned before, a VPN connection tunnels private network traffic securely over a public network like the Internet. A typical software-based VPN takes advantage of the Operating System (OS) to establish the tunnel, by using OS functionality to create a virtual network (kernel) device. A virtual IP address and subnetwork mask are assigned to the virtual device, and the device provides access to Ethernet frames through a device file. When an application running on the OS sends an IP packet over the virtual network device, the corresponding Ethernet frame is written to the device file. The VPN application captures the Ethernet frame from the file, encrypts it, encapsulates the encrypted frame in a new packet, and sends it to the physical address of the destination node over the public network. Therefore, all the communication is done over the public channel, but the transferred data is encrypted.
On the other side of the public network, the VPN application receives the encrypted frames, decrypts them, and forwards them to its own virtual network device. This whole process ensures that, when a packet is sent to the destination's virtual IP address, it is transferred over the public network encrypted and exits the destination's virtual interface after being decrypted.
SigmaVPN is an open-source, lightweight VPN application that runs in Linux's user space, reading user-defined communication parameters from a configuration file, and implements the design described above. It enables communication between applications running on the two nodes, using the virtual network addresses. Its modularity allows users to create their own network interface or cryptographic modules. Users can then have SigmaVPN use these modules by changing the parameters in the configuration file.
SigmaVPN uses NaCl as its default cryptographic library, even though users can implement their own by exploiting the modularity. The NaCl library offers a scheme named CryptoBox, which consists of a key establishment scheme based on elliptic curve cryptography, together with authenticated encryption. The former is used to establish a shared symmetric session key, and the latter to protect the confidentiality and integrity of the messages. In this work, we focus on providing hardware acceleration for NaCl's CryptoBox, specifically for SigmaVPN.
B. Cryptographic Primitives
NaCl is a cryptographic library designed by Dan Bernstein [1], and its CryptoBox implements all operations for authenticated encryption. Operations of the protocol are shown in Fig. 2. The Curve25519-based Elliptic Curve Diffie-Hellman (ECDH) protocol establishes a session key K_S from an asymmetric key pair. Salsa20 is used to encrypt the messages symmetrically with a unique key K_D, derived with HSalsa20 from the session key K_S and a nonce N. It is combined with Poly1305 to calculate an authentication tag which protects the integrity and authenticity of the messages. Only another party knowing the key pair and the nonce can derive the same key K_S and the message-specific key K_D required to decrypt the message.
The Curve25519 ECDH operation calculates a session key K_S from the asymmetric key pairs of the source and destination. On each side, the protocol uses its own secret key K_SEC and the public key of the destination node K_PUB. The mathematical properties of elliptic curves make it possible for communicating nodes to calculate the same session key K_S using their own key pairs, so it becomes the shared secret between the key pair owners. The session key K_S is called the first-level key in the NaCl context. SigmaVPN statically manages the keys, reading them from the configuration file in plaintext format.
HSalsa20 [8] is a stream cipher, but is used here as a key derivation function that generates a message-specific key K_D from the session key K_S and a nonce N. K_D is referred to as the second-level key in the context of NaCl. Salsa20, also a stream cipher, is then executed with the derived key K_D and the nonce N as its inputs, and outputs a keystream. In fact, it does not create a continuous keystream, but outputs a 512-bit block after each run. To make each block unique, a counter value is appended to the nonce's most significant part, and it is incremented after each execution of Salsa20. The plaintext is then XOR'd with the keystream to give the ciphertext when encrypting, or the ciphertext is XOR'd with the keystream to give the plaintext when decrypting.
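The key agreement property described above can be illustrated with a short software sketch. The code below follows the X25519 Montgomery ladder as specified in RFC 7748; it is not the SigmaVPN or NaCl source, and the function names and the fixed example secrets are chosen here for illustration only. Note also that NaCl additionally passes the X25519 output through HSalsa20 to obtain the first-level key, a step that this sketch omits.

```python
# Illustrative X25519 (RFC 7748 ladder). Names and example secrets are assumptions.
P = 2**255 - 19
A24 = 121665

def clamp(secret: bytes) -> int:
    """Clamp a 32-byte secret into a Curve25519 scalar."""
    k = bytearray(secret)
    k[0] &= 248
    k[31] &= 127
    k[31] |= 64
    return int.from_bytes(k, "little")

def x25519(secret: bytes, u_coord: bytes) -> bytes:
    """Scalar multiplication on the Montgomery u-coordinate."""
    k = clamp(secret)
    x1 = int.from_bytes(u_coord, "little") & ((1 << 255) - 1)
    x2, z2, x3, z3 = 1, 0, x1, 1
    swap = 0
    for t in reversed(range(255)):
        bit = (k >> t) & 1
        swap ^= bit
        if swap:
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = bit
        a, b = (x2 + z2) % P, (x2 - z2) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        da, cb = d * a % P, c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, x3, z2, z3 = x3, x2, z3, z2
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")

BASE_POINT = (9).to_bytes(32, "little")

# Two nodes with example (non-random, illustration-only) secret keys.
k_sec_a = bytes(range(32))
k_sec_b = bytes(range(32, 64))
k_pub_a = x25519(k_sec_a, BASE_POINT)
k_pub_b = x25519(k_sec_b, BASE_POINT)

# Each side combines its own secret key with the peer's public key.
shared_a = x25519(k_sec_a, k_pub_b)
shared_b = x25519(k_sec_b, k_pub_a)
```

Both `shared_a` and `shared_b` hold the same 32-byte value, which is the property that lets the two VPN devices derive K_S without ever transmitting it.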
Poly1305 [9] , [10] is a Carter-Wegman [11] one-time authenticator used to calculate a Message Authentication Code (MAC) of the ciphertext. To calculate a 128-bit MAC, it uses the keystream's first 256 bits as its key, and processes the ciphertext in 128-bit blocks.
Transfer of the encrypted frames over the public network is done with UDP socket communication. Thus, a UDP header is always added to the encrypted frame, which reduces the Maximum Transmission Unit (MTU), i.e. the maximum length of an Ethernet frame, for the communication between the private networks of the end nodes. Furthermore, the cryptographic metadata, i.e. the nonce and MAC, must be transferred together with the encrypted frame, which reduces the MTU further.
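The size of this reduction can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation under assumptions that are not taken from the SigmaVPN source: a 1500-byte link MTU, an IPv4 header without options, and transmission of the complete 24-byte CryptoBox nonce and 16-byte Poly1305 MAC with every frame. All constant and function names are illustrative.

```python
# Illustrative per-packet overhead estimate; all values below are assumptions.
LINK_MTU = 1500      # typical Ethernet payload size in bytes
IPV4_HEADER = 20     # IPv4 header without options
UDP_HEADER = 8       # fixed UDP header size
NONCE_BYTES = 24     # CryptoBox nonce length
MAC_BYTES = 16       # Poly1305 tag length (128 bits)

def tunnel_payload_budget(link_mtu: int = LINK_MTU) -> int:
    """Bytes left for the tunneled data after headers and crypto metadata."""
    overhead = IPV4_HEADER + UDP_HEADER + NONCE_BYTES + MAC_BYTES
    return link_mtu - overhead

budget = tunnel_payload_budget()
```

Under these assumptions the overhead is 68 bytes per packet, leaving 1432 bytes of a 1500-byte MTU for tunneled data.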
IV. ARCHITECTURE
The goal of this architecture is to provide hardware-based acceleration for time-consuming cryptographic operations of a software-based VPN. In this work, we specifically address the problem of accelerating SigmaVPN.
A block diagram of our proposed architecture is shown in Fig. 3. The SigmaVPN application running on the CPU receives and sends encrypted frames via Ethernet. In order to perform cryptographic operations on a received Ethernet frame in hardware, the SigmaVPN application needs to pass the frame to the coprocessor. To maximize the performance of the system, we make use of Direct Memory Access (DMA) to transfer data between RAM and the coprocessor. This has performance advantages over using memory-mapped communication between the CPU and the coprocessor. Whenever SigmaVPN needs to send an encrypted Ethernet frame, it first configures the DMA controller to transfer the plaintext frame from RAM to the NaCl coprocessor. After the NaCl coprocessor has finished encrypting the frame, it uses DMA again to transfer the ciphertext into RAM, and the SigmaVPN application then sends out the encrypted frame via the Ethernet port. Similarly, to decrypt a received frame, SigmaVPN uses DMA to transfer the encrypted frame from RAM to the NaCl coprocessor. The coprocessor then decrypts it and uses the DMA to transfer the plaintext frame to RAM.
A. Encryption & Decryption
As mentioned in Section III-B, HSalsa20 and Salsa20 are used to perform encryption and decryption. Fig. 4 shows a block diagram of an encryption operation. Decryption follows a similar flow, except that in the final XOR operation ciphertext is used as the input and plaintext is produced. For each frame, the HSalsa20 operation is executed once to perform key derivation. Next, Salsa20 uses the derived key (K_D) to decrypt or encrypt a message. HSalsa20 uses a nonce N and the session key K_S as inputs, while Salsa20 uses K_D, a counter and the nonce N to produce a keystream which is used to generate ciphertext or plaintext.
Section III-B shows that HSalsa20 calculates the derived key from the first 128 bits of the nonce and a pre-calculated session key. Next, a new nonce value is required for consecutive Salsa20 operations to create unique keystream blocks. To do so, the last 64 bits of the nonce are concatenated with a 64-bit counter to create a new nonce for each keystream block. A new block of keystream is generated at each execution of Salsa20 with the new nonce, and these blocks are concatenated to produce a long stream. The keystream blocks are XOR'd with plaintext to create the corresponding ciphertext blocks. For the first block only, the beginning of the plaintext is padded with 256 zero bits; therefore, the output of the first XOR operation gives half a block of keystream followed by half a block of ciphertext. This half block of keystream is later used as the key for Poly1305.
HSalsa20 and Salsa20 have similar structures, which differ only in the finalization operation. In both, a round function is applied 20 times to a 4×4 matrix of 512 bits, which is created from a 128-bit nonce, a 256-bit key, and a predefined 128-bit expansion constant. The rounds consist of a column-wise and then a row-wise operation on the matrix, repeated 10 times, followed by a finalization operation to create the output from the matrix.
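The data flow of this paragraph, i.e. counter-based keystream blocks, the 256-bit zero padding, and the extraction of the Poly1305 key, can be sketched in software as follows. To keep the example short, the keystream block function is a loudly labeled stand-in built on SHA-512 (which also outputs 512 bits); it is not Salsa20, and all function names are hypothetical.

```python
import hashlib

BLOCK_BYTES = 64   # one 512-bit keystream block
PAD_BYTES = 32     # 256 zero bits prepended to the plaintext

def keystream_block(derived_key: bytes, nonce_tail: bytes, counter: int) -> bytes:
    """STAND-IN for Salsa20 (assumption): SHA-512 over key, nonce tail and counter."""
    data = derived_key + nonce_tail + counter.to_bytes(8, "little")
    return hashlib.sha512(data).digest()

def keystream(derived_key: bytes, nonce_tail: bytes, length: int) -> bytes:
    """Concatenate blocks, incrementing the 64-bit counter for each block."""
    out = b""
    counter = 0
    while len(out) < length:
        out += keystream_block(derived_key, nonce_tail, counter)
        counter += 1
    return out[:length]

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt(derived_key: bytes, nonce_tail: bytes, plaintext: bytes):
    """Return (Poly1305 key, ciphertext) for a zero-padded plaintext."""
    padded = bytes(PAD_BYTES) + plaintext
    mixed = xor_bytes(padded, keystream(derived_key, nonce_tail, len(padded)))
    # Zero padding XOR keystream leaves the first 256 keystream bits unchanged.
    return mixed[:PAD_BYTES], mixed[PAD_BYTES:]

def decrypt(derived_key: bytes, nonce_tail: bytes, ciphertext: bytes) -> bytes:
    stream = keystream(derived_key, nonce_tail, PAD_BYTES + len(ciphertext))
    return xor_bytes(ciphertext, stream[PAD_BYTES:])

key = bytes(range(32))
tail = bytes(8)
frame = b"example Ethernet frame payload" * 5
mac_key, ct = encrypt(key, tail, frame)
```

Because the padding is all zeros, `mac_key` equals the first 32 bytes of the keystream, and `decrypt(key, tail, ct)` recovers `frame`.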
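The shared round structure and the two finalization variants can be sketched in software as follows. This is a reference-style sketch based on the public Salsa20 specification, not the hardware description used in this work; the function names and the state layout helper are chosen here for illustration.

```python
import struct

MASK = 0xFFFFFFFF
SIGMA = struct.unpack("<4I", b"expand 32-byte k")  # 128-bit expansion constant

def rotl32(x: int, n: int) -> int:
    return ((x << n) & MASK) | (x >> (32 - n))

def quarterround(y0, y1, y2, y3):
    """Four dependent add-rotate-XOR steps on four 32-bit words."""
    y1 ^= rotl32((y0 + y3) & MASK, 7)
    y2 ^= rotl32((y1 + y0) & MASK, 9)
    y3 ^= rotl32((y2 + y1) & MASK, 13)
    y0 ^= rotl32((y3 + y2) & MASK, 18)
    return y0, y1, y2, y3

COLUMNS = ((0, 4, 8, 12), (5, 9, 13, 1), (10, 14, 2, 6), (15, 3, 7, 11))
ROWS = ((0, 1, 2, 3), (5, 6, 7, 4), (10, 11, 8, 9), (15, 12, 13, 14))

def rounds(state):
    """10 column rounds interleaved with 10 row rounds (20 rounds in total)."""
    x = list(state)
    for _ in range(10):
        for group in (COLUMNS, ROWS):
            for a, b, c, d in group:
                x[a], x[b], x[c], x[d] = quarterround(x[a], x[b], x[c], x[d])
    return x

def initial_state(key: bytes, input16: bytes):
    """Build the 4x4 word matrix from a 256-bit key and a 128-bit input."""
    k = struct.unpack("<8I", key)
    n = struct.unpack("<4I", input16)
    return [SIGMA[0], k[0], k[1], k[2], k[3], SIGMA[1], n[0], n[1],
            n[2], n[3], SIGMA[2], k[4], k[5], k[6], k[7], SIGMA[3]]

def salsa20_block(key: bytes, nonce8: bytes, counter: int) -> bytes:
    """Salsa20 finalization: add the input matrix to the round output."""
    state = initial_state(key, nonce8 + counter.to_bytes(8, "little"))
    x = rounds(state)
    return struct.pack("<16I", *((a + b) & MASK for a, b in zip(x, state)))

def hsalsa20(key: bytes, nonce16: bytes) -> bytes:
    """HSalsa20 finalization: select eight words, without the final addition."""
    x = rounds(initial_state(key, nonce16))
    return struct.pack("<8I", *(x[i] for i in (0, 5, 10, 15, 6, 7, 8, 9)))
```

The four quarter-rounds inside one column or row round touch disjoint words, which is what allows a hardware implementation to evaluate them side by side.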
B. Poly1305 MAC
Poly1305 is used to calculate a MAC from an arbitrary-length input. The algorithmic listing of Poly1305 is given in Alg. 1. It uses a key K and the ciphertext as inputs and processes the ciphertext in 128-bit blocks. Poly1305 evaluates a polynomial by iteratively performing an update operation with each 128-bit ciphertext block. The resulting polynomial is reduced modulo 2^130 − 5. In NaCl's CryptoBox, the key input of Poly1305 is assigned the first 256 bits of the keystream produced for the first input block, as shown in Fig. 4.
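As a software reference for the operation described above, the following sketch computes a Poly1305 tag with arbitrary-precision integers. It follows the standard formulation of Poly1305 (as in RFC 8439) rather than the hardware design of this work, and the function name is chosen here for illustration.

```python
P1305 = (1 << 130) - 5
CLAMP = 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF  # bitwise clamping applied to R

def poly1305_mac(message: bytes, key: bytes) -> bytes:
    """Compute the 128-bit Poly1305 tag of message under a 256-bit one-time key."""
    r = int.from_bytes(key[:16], "little") & CLAMP   # lower 128 bits of K
    s = int.from_bytes(key[16:32], "little")         # upper 128 bits of K
    h = 0
    for offset in range(0, len(message), 16):
        block = message[offset:offset + 16]
        # Appending a 0x01 byte adds 2^128 to a full 128-bit block.
        c = int.from_bytes(block + b"\x01", "little")
        h = (h + c) * r % P1305                      # update operation
    return ((h + s) & ((1 << 128) - 1)).to_bytes(16, "little")

tag = poly1305_mac(b"ciphertext bytes to authenticate", bytes(range(32)))
```

The loop body is the per-block update operation; everything else is key parsing and the final addition of S.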
V. IMPLEMENTATION
We implemented the design on the Zynq-7010 [12] System-on-Chip (SoC) platform. The Zynq-7000 SoC family consists of an integrated PS and a PL unit, providing an extensible and flexible SoC solution on a single die. The PS features a dual-core ARM Cortex-A9 processor and several ports to communicate with the PL. The PL is made up of reconfigurable fabric in the form of an FPGA. In addition, the Zynq has DSP48 slices [2] which offer fast arithmetic operations.
Algorithm 1 Poly1305 Operation
Input: K as 256-bit key; ciphertext in 128-bit blocks {CT_0, ..., CT_{n−1}}
Define: R ← lower 128 bits of K with bitwise operations; S ← upper 128 bits of K; for any i: CT_i ← CT_i + 2^128
We specifically use the Digilent ZYBO development board with a Zynq-7010 SoC, which was the smallest Zynq SoC at the time. To evaluate our design, we realized two VPN devices using two development boards. Our proposed design requires two Ethernet ports, and we used a USB-to-Ethernet network adapter as the second Ethernet port. In this work, the ARM processors were configured to host the Linux operating system, with the SigmaVPN application running in Linux's user space. The hardware runs on the reconfigurable logic of the PL. The ARM processor communicates with the hardware using a memory-mapped interface, and large data transfers are performed with the Xilinx DMA IP Core [13]. Communication with the hardware is performed from Linux user space using a dedicated device driver.
A. Two-Port VPN Device
The standard version of SigmaVPN uses a single Ethernet port. SigmaVPN is typically used as an application on an end node to provide point-to-point encryption. However, we want to create a standalone hardware-based VPN device that can connect to an existing network. Therefore we require a device with two Ethernet ports. The two ports divide the network into private and public networks, and the VPN device becomes an edge device connected to both networks. The device encrypts frames received from the private network and forwards them to a destination node over the public side. The destination node decrypts the frames received from the public side and forwards these frames to the private network. This allows the two private networks to seamlessly communicate with each other as if on a single network.
The original SigmaVPN application uses a virtual and a physical port. In order to utilize two physical ports, the application was modified to replace the virtual port with a physical port that connects to the private network. This is a non-trivial task, as all traffic on the private side needs to be captured in order to be forwarded to the other end node. This problem was solved by introducing a new interface module that promiscuously reads frames on the private network by means of the libpcap packet sniffing library. In addition, the new interface module is capable of sending decrypted frames to the private network without re-encapsulation.
B. Data Transfer with DMA
On the Zynq, IP cores communicate over the AMBA interconnect, conforming to the AXI protocols. The DMA IP core uses three interfaces: AXI4-Full, AXI4-Lite, and AXI4-Stream. The AXI4-Full interface has the most advanced feature set, providing burst data transfers. It is used to connect the DMA and PS, offering memory-mapped access to data in RAM. The AXI4-Lite interface offers reduced functionality, and it is used to configure the DMA controller. Finally, the AXI4-Stream interface can be used for address-less communication between IP cores, pushing one word of data in each clock cycle. Software running on the processing system starts the DMA transfer from RAM to the NaCl coprocessor, and after encrypting or decrypting the data, the coprocessor starts the DMA transfer from the coprocessor back to RAM.
We use the DMA IP core for the transfer of data between PS and PL, following the block diagram given in Fig. 5. The DMA IP can be configured either by SigmaVPN in software, or by the coprocessor in hardware. In the former case, the DMA IP reads the frame data from memory-mapped RAM and writes it to the NaCl coprocessor as a stream. In the latter case, the reverse-direction data transfer takes a stream from the coprocessor and writes it out to RAM in memory-mapped mode. Our NaCl coprocessor has an AXI4-Lite port to configure the DMA controller to transfer output data to RAM, as well as two AXI4-Stream ports to read data from and write data to the RAM.
C. Cryptographic Building Blocks
The operations of the so-called CryptoBox of the NaCl library are described in Section IV. The API separates the Curve25519 ECDH key exchange from the authenticated encryption operations, since it is used only once when the session is initialized. The coprocessor design maintains the same separation of the key establishment functionality. Furthermore, it was kept in software and not accelerated in hardware, as it is used only once, and the performance gain would therefore be limited. The other parts of the CryptoBox were implemented in hardware, which includes the following three components of the NaCl library: HSalsa20, Salsa20 and Poly1305 (see Section III-B).
In the implemented design, a single hardware module was created for the HSalsa20 and Salsa20 operations, since only their finalization operations differ. Their round function was described in Section IV-A. In both algorithms, the column-wise and row-wise round operations are each divided into four quarter-round operations. Each quarter-round operation consists of four step functions. These step functions in the quarter-round operation do not have data dependencies. In this work, we exploit this by performing the four step functions in parallel. Moreover, the step function was implemented to execute in a single clock cycle. This leads to a reduction in the number of cycles at the cost of an increased critical path. However, the critical path is still relatively small in comparison to the rest of the design. The finalization operations of HSalsa20 and Salsa20 were also implemented to execute in a single cycle.
Algorithm 2 Poly1305's Update Operation
Input: H, key R. H and R are parsed into 13-bit components such that H = Σ_{i=0..9} H_i × 2^(13i) and R = Σ_{i=0..9} R_i × 2^(13i)
Update Operation (to implement line 4 of Alg. 1):
Multiplication:
1: RES = 0
2: for i from 0 to 9 do
3:   for j from 0 to 9 do
4:     if i ≤ j then RES_j ← RES_j + R_{j−i} × H_i
5:     else RES_j ← RES_j + R_{10−i+j} × H_i × 5

A second hardware block is created for Poly1305, which requires an update operation for each 128-bit ciphertext block (see line 4 of Alg. 1). The operation requires modular multiplication of large numbers, which is a complex operation that can be simplified with a divide-and-conquer method. Such a method is also used by Bernstein and Schwabe [14], where the multiplications are performed on smaller blocks of 26 bits following the schoolbook multiplication approach. We modified their method to use 13-bit blocks in order to benefit from the DSP48 slices available in the Zynq, which provide 18×25-bit single-cycle multipliers and offer better multiplication performance than a multiplier implemented with LUTs. The DSP slices were used with 13-bit blocks instead of 18-bit blocks, since the symmetry of the multiplication structure would otherwise be lost, complicating the hardware design. The modified multiplication procedure is given in Alg. 2. By pre-calculating H × 5, the algorithm's inner loop can be executed in parallel within a single clock cycle, consuming only 10 slices. In the update operation given in Alg. 2, the multiplication is followed by a reduction operation that performs radix-2^13 addition of the RES_i values. In other words, any bits of each RES_i element above the lowest 13 are circularly added to the next element RES_{i+1}, always propagating the carry to block RES_{i+1}. Fig. 6 shows the 13-bit alignment of the RES_i values for the reduction. Each column in the figure represents a 13-bit block, and the reduction requires summing all the elements in each column, followed by carry propagation between columns. The gray parts correspond to the elements crossing the 130-bit boundary, which are multiplied by 5 and then moved from the most significant to the least significant side. This multiplication by 5 preserves the result modulo 2^130 − 5.
In the implementation, the column-wise summations of the 13-bit blocks were also parallelized. Each of the ten columns is first summed using 13-bit adders, and the resulting ten carries are then propagated in parallel to the neighboring column results. If one of the propagation additions again produces a carry, the propagation process is repeated.
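The limb-based multiplication and the repeated carry propagation can be modeled in software to check the arithmetic. The sketch below is a behavioral model under the assumptions stated in the text (ten 13-bit limbs, wrap-around terms multiplied by 5); it does not model DSP slice timing, and all names are illustrative.

```python
P1305 = (1 << 130) - 5
LIMB_BITS = 13
LIMBS = 10
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(x: int):
    """Split a value below 2^130 into ten 13-bit limbs, least significant first."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(LIMBS)]

def from_limbs(limbs) -> int:
    return sum(v << (LIMB_BITS * i) for i, v in enumerate(limbs))

def multiply(h, r):
    """Schoolbook product; terms crossing the 130-bit boundary wrap with a factor 5."""
    res = [0] * LIMBS
    for i in range(LIMBS):
        h5 = h[i] * 5                      # pre-calculated H_i x 5
        for j in range(LIMBS):
            if i <= j:
                res[j] += r[j - i] * h[i]
            else:
                res[j] += r[LIMBS - i + j] * h5
    return res

def propagate(res):
    """Repeat parallel carry propagation until every limb fits in 13 bits."""
    res = list(res)
    while any(v > LIMB_MASK for v in res):
        carries = [v >> LIMB_BITS for v in res]
        res = [v & LIMB_MASK for v in res]
        for i, carry in enumerate(carries):
            if i + 1 < LIMBS:
                res[i + 1] += carry
            else:
                res[0] += 5 * carry        # carry out of bit 130 wraps around
    return res

def mul_mod_limbs(h: int, r: int) -> int:
    return from_limbs(propagate(multiply(to_limbs(h), to_limbs(r))))

h_example = (1 << 130) - 6
r_example = 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF
product = mul_mod_limbs(h_example, r_example)
```

The result fits in 130 bits and is congruent to `h_example * r_example` modulo 2^130 − 5, although it is not necessarily the fully reduced representative.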
D. NaCl Coprocessor
In this work, the SigmaVPN application controls the NaCl coprocessor's operation using a buffer consisting of operation-specific information together with an Ethernet frame. After the SigmaVPN application has prepared the buffer, the DMA controller is used to send the buffer to the NaCl coprocessor. While receiving the buffer, the coprocessor starts calculating the message-specific key K_D, and then performs the encryption or decryption operation. In order to discuss the scheduling of operations, we use the term time slot for each operation on 512 bits of the frame. The scheduling of the coprocessor's operations is illustrated in Fig. 7. In each time slot the coprocessor performs four basic operations in parallel. In one time slot we read a 512-bit block and calculate a keystream that will be used to encrypt or decrypt the block in the next time slot. The encrypted or decrypted block is then used in the following time slot to perform the partial calculation of the MAC, together with writing the encrypted or decrypted block to RAM with the DMA.
To further increase the performance of the coprocessor two instances of both the Salsa20 and Poly1305 modules are used. The two Salsa20 modules work in parallel to yield two keystream blocks in each time slot. Each Poly1305 instance is used to perform parallel update operations on a different term of the polynomial listed in Alg. 1. The first instance handles even terms of CT while the second instance handles odd terms of CT . During the final time slot, the results of both update operations are combined in order to calculate the final MAC.
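One way to see why two Poly1305 instances can work on even and odd terms independently is to split the polynomial evaluation into two lanes that each advance by R^2 per block. The sketch below checks such a split against the sequential evaluation; the combination step shown here is an assumption made for illustration and is not taken from the hardware design.

```python
P1305 = (1 << 130) - 5

def sequential(blocks, r: int) -> int:
    """Reference evaluation: one update operation per block."""
    h = 0
    for c in blocks:
        h = (h + c) * r % P1305
    return h

def two_lane(blocks, r: int) -> int:
    """Even and odd blocks are accumulated separately with multiplier r^2."""
    r2 = r * r % P1305
    even = odd = 0
    for i, c in enumerate(blocks):
        if i % 2 == 0:
            even = (even * r2 + c) % P1305
        else:
            odd = (odd * r2 + c) % P1305
    # The lane holding the last block needs one more factor r, the other r^2.
    if len(blocks) % 2 == 0:
        return (even * r2 + odd * r) % P1305
    return (even * r + odd * r2) % P1305

r_value = 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF
# Example blocks, each with the 2^128 bit set as for full 128-bit blocks.
example_blocks = [(1 << 128) + 1000003 * (i + 1) for i in range(7)]
```

For any block count, `two_lane(example_blocks, r_value)` equals `sequential(example_blocks, r_value)`, so the two lanes have no data dependency on each other until the final combination.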
E. Linux Kernel Module
To access memory-mapped devices from an operating system that uses virtual memory, special care needs to be taken. In Linux physical memory is not accessible from user space [15] . In addition, cached memory cannot be used when performing DMA transfers from or to RAM. In this work we propose to solve the above problems by means of a Linux kernel module. The kernel module can access and configure the DMA controller with physical addresses, which is only possible from kernel mode. In addition, the kernel module can declare and access a buffer that is declared in coherent kernel memory, which is also only possible from kernel mode.
The existing SigmaVPN software cryptographic module was modified to perform tasks with the coprocessor by means of the Linux kernel module. During the initialization of SigmaVPN, the kernel module performs the configuration of the DMA and the declaration of the coherent buffer. When the software-based cryptographic module's encrypt and decrypt functions are called, the kernel module is used to transfer data to the coprocessor and to start the operation. The kernel module uses polling to wait for the hardware to complete the operation. When the coprocessor is finished with an operation, the kernel module returns to user mode with the result. The downside of using a kernel module is that switching between user mode and kernel mode has a high overhead.
F. Accelerator Coherency Port
The ARM architecture of the Zynq offers the Accelerator Coherency Port (ACP) [16], which allows DMA transfers to go through the cache controller, called the Snoop Control Unit (SCU) [17], [12], instead of going directly to RAM. The port is introduced as a low-latency communication path between the PS and PL, providing cache-coherent memory access.
We considered using this port for transferring data between the software and hardware. The goal of this attempt was to lessen the communication overhead by reducing the access time to the data buffer, maintaining cache coherence in hardware instead of relying on software. However, our measurements showed that the use of the ACP does not offer a benefit in terms of access time compared to the initial implementation utilizing the device memory. This is consistent with the result reported by Sadri et al. [18].
VI. EVALUATION
This section evaluates the implementation, its correctness and performance.
A. Implementation Results
The implemented coprocessor requires 3 333 slices, which corresponds to 75.75% of the total number of slices available in the Zynq Z-7010. Together with the DMA and the AXI4 interconnections, the total resource utilization of the hardware design reaches 4 279 slices, which corresponds to a utilization of 97.25%.
The hardware designed for HSalsa20 and Salsa20 takes 83 cycles to execute both operations, while a single update operation of the Poly1305 hardware takes 21 cycles. Since each full 512-bit Salsa20 output block requires four Poly1305 update operations on 128-bit blocks, the bottleneck of processing one full block is determined by Poly1305, which sets the length of a time slot. The execution time results show that the coprocessor requires 172 cycles before the first output block is ready, takes 90 cycles for processing the first 768-bit message block, and takes a further 90 cycles for each of the following 1024-bit message blocks. The finalization operation takes another 15 cycles for calculating the MAC and sending it to RAM. The critical path of the design passes through the Poly1305 multipliers, which sets the maximum clock frequency to 81.25 MHz.
We are aware of only one other hardware implementation of the NaCl CryptoBox [7], which executes a 1024-byte encryption operation in a minimum of 125 791 cycles at a maximum clock frequency of 12-18 MHz. Even though the same operation takes 997 cycles in our design, it is difficult to compare the designs. The design goal of their implementation is to achieve a low-resource design aimed at Internet-of-Things applications, while our design goal is to achieve the best performance, aiming at maximum bandwidth in a VPN connection.
The performance of our design is determined by the Poly1305 hardware; therefore, we looked for another performance-oriented hardware implementation of Poly1305 to compare with ours. However, we could not find such an implementation. Since we could not find another performance-oriented NaCl CryptoBox or Poly1305 implementation, we compare our hardware to related works by means of its Salsa20 implementation. Our Salsa20 hardware module has a throughput of 0.65 cycles/byte, and it achieves a maximum clock frequency of 155 MHz. There are various implementations of Salsa20 [19], [20], [21] which can be compared to our design. Some of these implementations are given in Table I with their throughput and maximum clock frequency. In the paper of Henzen et al. [19], several Salsa20 implementations are synthesized for a 0.18 µm CMOS technology. Our implementation follows the same approach as their method named 4xS-QR(), achieving a performance close to that of their VLSI circuit implementation. Moreover, their paper presents two more implementations offering higher performance, which are named 8xQR() and 4xQR(). These two implementations are unrolled versions of the previous implementations, offering a speed-up of 4 and 8 respectively in terms of their behavioral performance, but resulting in longer critical paths after synthesis. It should be stressed that although there are other implementations offering better performance than our Salsa20 implementation, and even though our Salsa20 hardware can achieve a maximum clock frequency of 155 MHz, the performance of our coprocessor is limited by the Poly1305 hardware, which limits the maximum clock frequency to 81.25 MHz and also determines the total execution time.
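The cycle counts above can be combined into a simple estimate of the coprocessor latency per frame. The sketch below is one reading of the reported figures, assuming that the 172 setup cycles, the 90-cycle time slots and the 15 finalization cycles simply add up and that a partially filled last slot costs a full slot; the function and constant names are illustrative. It also reflects that four Poly1305 updates take 4 × 21 = 84 cycles per 512-bit block.

```python
import math

SETUP_CYCLES = 172        # cycles before the first output block is ready
SLOT_CYCLES = 90          # cycles per time slot
FINAL_CYCLES = 15         # MAC finalization and transfer
FIRST_SLOT_BITS = 768     # message bits in the first slot (1024 minus 256 pad bits)
SLOT_BITS = 1024          # message bits in every following slot
CLOCK_HZ = 81.25e6        # maximum clock frequency of the design

def coprocessor_cycles(frame_bytes: int) -> int:
    """Estimated cycle count for one frame under the additive model above."""
    remaining_bits = max(0, frame_bytes * 8 - FIRST_SLOT_BITS)
    slots = 1 + math.ceil(remaining_bits / SLOT_BITS)
    return SETUP_CYCLES + slots * SLOT_CYCLES + FINAL_CYCLES

def coprocessor_seconds(frame_bytes: int) -> float:
    return coprocessor_cycles(frame_bytes) / CLOCK_HZ

cycles_1024 = coprocessor_cycles(1024)
```

Under this model a 1024-byte frame needs nine time slots and 172 + 9 × 90 + 15 = 997 cycles, which is roughly 12.3 µs at 81.25 MHz and does not include the software and DMA overheads.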
B. Performance
The execution times of encryption and decryption operations in software and using the NaCl coprocessor were measured, and the averages of the measurements for different frame lengths are given in Table II. These measurement results are given in CPU cycles, which were measured by reading the cycle-count register [16], [22] of the Zynq's ARM Cortex-A9 processor. The results show that the hardware-software codesign improves the total execution time by a factor of at least 4.88, and that the improvement grows as the frame length increases.
The results reported here include the overhead of context switches, coherent memory accesses, and communication with the DMA controller. Our measurements show that on average 800 CPU cycles are required to perform such a context switch, and at least an extra 740 cycles are required for transferring the frame between user space memory and the coherent kernel memory. The overhead amounts to 46% of the encryption operation for small frame lengths, and it reduces to 24% as the frame length increases to 1024 bytes. This affects the total execution time of the system, as each communication with the coprocessor requires a context switch. Even though such an overhead is undesired, it was found to be unavoidable.
Waiting for the NaCl coprocessor to complete an encryption or decryption operation is handled with polling in the software, because polling has less overhead than using interrupts. While the NaCl coprocessor performs the encryption or decryption operation in hardware, the software using it waits in kernel space for the completion of the DMA transfer of the encrypted or decrypted frame from the NaCl coprocessor to RAM. This wait is implemented by polling the DMA controller's configuration registers. Our tests showed that polling is roughly 1000 CPU cycles faster than using interrupts, which corresponds to 21.3% of the execution time in the worst case and reduces to 6.16% as the frame length increases.
C. Bandwidth Improvement
The effect of using the NaCl coprocessor on the communication bandwidth was measured using the test setup shown in Fig. 8. In the setup, two PCs were connected to the private ports of two different VPN devices, and the public ports of the two devices were connected to a router. The measurements were taken using the Iperf [23] network bandwidth measurement tool. The tool uses different approaches to measure the bandwidth of TCP and UDP traffic, and the results are shown in Table III. For TCP traffic, the bandwidth was measured for different frame sizes between 128 bytes and 1024 bytes, and the results were averaged for each frame size. As a reference, we also report the results for using SigmaVPN with encryption turned off. For the UDP tests, Iperf does not report the bandwidth but checks whether the network can support a user-specified bandwidth by determining whether losses occur. Therefore, the UDP tests were performed by increasing the desired bandwidth until losses appeared. The results show that the coprocessor increases bandwidth by a factor of 2.9 for frame lengths of 128 bytes when compared to using SigmaVPN in software only. The bandwidth improves by up to a factor of 4.36 when the frame size is increased to 1024 bytes. Similarly, for UDP traffic, the bandwidth is improved by a factor of 2 for 128-byte frames, and by up to a factor of 5.36 for 1024-byte frames, when compared to running SigmaVPN in software only.
VIII. CONCLUSION
In this paper we designed a hardware-based coprocessor to accelerate the cryptographic operations of SigmaVPN, an open-source software VPN solution. The design was implemented on a Xilinx Zynq-7010 SoC, with SigmaVPN running on top of Linux on the ARM processing system, and the coprocessor programmed into the programmable logic. Our evaluation shows that the coprocessor improves the performance of cryptographic operations, with increasing gains for larger Ethernet frames. Our coprocessor encrypts a 1024-byte frame in 93% less time than the software-only solution, even though it suffers from overheads in hardware-software communication. Its integration with SigmaVPN increases the TCP and UDP bandwidth by factors of 4.36 and 5.36, respectively, for 1024-byte Ethernet frames.
