The computational performance of Network-on-Chip (NoC) and Multi-Processor System-on-Chip (MPSoC) for implementing cryptographic block ciphers can be improved by exploiting parallel and pipeline execution. In this paper, we present a parallel and pipeline processing method for block cipher algorithms: Data Encryption Standard (DES), Triple-DES Algorithm (TDEA), and Advanced Encryption Standard (AES) based on pure software implementation on an NoC. The algorithms are decomposed into task loops, functions, and data flow for parallel and pipeline execution. The tasks are allocated by the proposed mapping strategy to each Processing Element (PE) which consists of a 32-bit Reduced Instruction Set Computer (RISC) core, internal memory, router, and Network Interface (NI) to communicate between PEs.
Introduction
As algorithms have become more complex and diverse, hardware IPs or SoCs consisting of devices such as CPUs, DSPs, memory, and co-processing components are used to implement them. To run a block cryptographic algorithm, which is a computationally intensive application, a dedicated hardware design is typically used to execute the algorithm rapidly and effectively. However, a hardware implementation is less flexible and less compatible than a software implementation. One way to maximize both flexibility and computational power is to use a multi-core platform such as an MPSoC or NoC. This approach, however, raises problems in software scheduling and partitioning, which are needed to exploit the application-level parallelism that increases overall performance.
In this paper, a parallel and pipeline execution methodology for a software block cipher is implemented on the NePA NoC platform [6], which has a general NoC architecture without the specialized hardware logic typically used in traditional ASIC designs for block ciphers. The proposed method supports task concurrency, balanced task distribution, and high flexibility for the NoC environment. Since this method adopts software block ciphers, it can decrease the time-to-market and turn-around time needed to implement various cryptographic algorithms.
The main contributions of this paper are as follows:
• A software approach for block cipher algorithms on an NoC platform.
• Detailed modeling of the approach on a SystemC and HDL-level hardware platform.
The organization of this paper is as follows: Section 2 reviews related work on the implementation of block cipher algorithms on various systems. Section 3 introduces the parallel and pipeline processing method for block ciphers on NePA. Section 4 describes the implementation. Section 5 presents the experimental results. Section 6 summarizes and concludes this paper.
Related work
Several implementation methods for block cipher algorithms have been proposed. One method is a fully hardwired or FPGA implementation of the block cipher algorithms [16], [12].
Another method mixes a DSP or RISC processor with dedicated hardware such as co-processing blocks, or inserts special instructions, to improve both flexibility and computing power for cryptographic functions. To improve overall performance, dedicated and specialized cryptographic hardware modules are used together with DSP or RISC processors [11], [15]. Other researchers have proposed specialized instruction set architectures to accelerate cipher algorithms on hardware and software co-design platforms [9], [7], [13].
The third method uses an MPSoC, NP (Network Processor), or NoC platform with a multi-core processing architecture. The authors of [8] have suggested maximizing the throughput of a pipelined multiprocessor system by effectively assigning flow tasks to pipeline stages on an NP platform. The authors of [14] have introduced a methodology for profiling and scheduling networking workloads and applications on a highly parallel network processor architecture.
The works in [16], [9], [8], and [14] show various parallel and pipeline task scheduling and mapping methodologies for block cipher algorithms. The proposed parallel and pipeline NoC implementation for block ciphers has been developed on their platforms.
In order to meet high-performance and low-power requirements, a scalable, flexible, and reconfigurable multiprocessor platform, the Networked Processor Array (NePA) system, which is a mesh-based multi-processor SoC, is proposed as shown in Figure 1. This reconfigurable multiprocessor platform includes multiple RISC processors, memory blocks, and several specific IPs. Each PE has a CompactOR, internal data/instruction memory, a generic network interface, and a router that supports both a normal routing algorithm and an adaptive routing algorithm. The OpenRISC core [1], an open core processor, is used as the main processor of the NePA platform. In addition, generic network interface (NI) blocks and routing units [10], [5] are integrated into the NePA platform to form a mesh-based network. The routers transport not only application data but also control data between PEs through a memory-mapped interface protocol. Each PE executes a part of the block cipher functions using the parallel and pipeline execution method, without dedicated cryptographic co-processor modules or hardwired logic. The NePA router has 2-port north/south and 1-port west/east interfaces to interconnect PEs on the 2D mesh network.
One of the main purposes of this research is to improve the computational performance of block ciphers by incorporating an NoC architecture. To achieve this goal, NePA is designed in SystemC as well as in Verilog HDL to simulate and verify the performance of the proposed parallel and pipeline processing method for block ciphers. Moreover, a tool chain is provided.
Parallel and pipeline processing for block ciphers
Overview of block cipher algorithms
Block cipher algorithms use symmetric secret keys to encrypt and decrypt a plaintext composed of fixed-length data blocks by means of mathematical transformations. Each plaintext block has the same length as all other input blocks, for example 64 or 128 bits. In the operation of a block cipher, a ciphertext is produced by encrypting a plaintext block with a key block. The benefit of block ciphers is diffusion: bits are spread throughout the ciphertext such that a change of a single bit in either the key or the plaintext causes a significant change in the ciphertext. The disadvantages are that these algorithms take a long time to produce a ciphertext compared to stream ciphers, and that a single-bit error can propagate to an entire block. The Data Encryption Standard (DES [2]), the Triple-DES Algorithm (TDEA [3]), and the Advanced Encryption Standard (AES [4]) are widely used symmetric cryptography standards released by the National Institute of Standards and Technology (NIST).
Methodology of parallel and pipeline processing for block ciphers
Figure 2 illustrates the main steps of the proposed parallel and pipeline processing method for block ciphers on NePA. The approach starts by profiling block cipher algorithms written in C. The aim of the profiling is to classify execution groups that can be performed concurrently or sequentially at the instruction, function, and round level (a round is a group of iterative functions defined in each block cipher algorithm).
Guided by the profiling results, the groups of functions and rounds are determined; these are then assigned to PEs and run as parallel and pipeline tasks. In [14], the authors use an annotated directed acyclic graph (ADAG) to generate a group of pipelined tasks through dynamic profiling and instruction tracing. In this paper, by contrast, profiling, grouping, and allocating tasks are carried out manually, taking into account the overall performance of the NePA system and the workload scheduled to the PEs.
The proposed implementation also requires a scheduling and mapping procedure to assign the tasks to PEs. The scheduling and mapping step is one of the key factors in executing the partitioned block cipher programs on NePA. It relies on the organization of the concurrent and sequential tasks generated in the profiling step. Furthermore, the scheduling and mapping procedure depends on additional information such as the number of PEs, the pipeline depth, and the number of pipelined PE groups. In summary, the method is performed in four steps as follows:
Profiling block cipher algorithms
In the first step, the block ciphers programmed in C are profiled into function-level and round-level lists to extract their sequential and concurrent features. Profiling can be carried out at the instruction level, the function level, or the round level. The DES, TDEA, and AES block ciphers are profiled into function-level and round-level modules only, in order to exploit a high-level profiling approach rather than bottom-up instruction-level profiling.
Determination of concurrent and sequential tasks
In the second step, the profiled groups of functions and iteration rounds are used to determine sequential and concurrent tasks. The tasks profiled in the previous step are classified according to their sequential and concurrent features so that they can be scheduled and mapped onto PEs. For instance, round tasks are processed in sequential order; therefore, each round task can be regarded as an element of the sequential tasks.
Scheduling and mapping tasks on NoC platforms
In the third step, the tasks classified as sequential and concurrent are scheduled and mapped onto PEs on NePA. The following parameters are required to schedule and map tasks on NePA:
• the number of PEs
• the pipeline depth
• the number of pipelined PE groups 
Figure 3. The implementation of DES, TDEA, and AES by the proposed methodology on 4x4 NePA
NePA can be configured with different numbers of PEs. For example, a 2x2 NePA platform has four PEs and an 8x8 platform has 64 PEs. In the scheduling and mapping step, one of the PEs is selected as the first PE, which serves as the main control PE. Then the neighboring PE with the shortest distance from the current PE is selected as the next PE, and the next task is scheduled and mapped to it, building up parallel and pipelined PE groups. Choosing the nearest PE in this way reduces communication between PEs.
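The nearest-neighbor selection described above can be sketched as follows. This is a minimal illustration, not the NePA tool chain: the function names, the use of Manhattan hop distance, the choice of the top-left PE as the first PE, and the rule for picking the head of each new group are all assumptions made for the example.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (row, column) of a PE in the mesh


def manhattan(a: Coord, b: Coord) -> int:
    """Hop distance between two PEs on a 2D mesh (assumed distance metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def build_pe_groups(rows: int, cols: int, depth: int, groups: int) -> List[List[Coord]]:
    """Greedily chain PEs into pipelined groups, always taking the nearest free PE."""
    if depth < 1 or depth * groups > rows * cols:
        raise ValueError("not enough PEs for the requested groups and depth")
    free = [(r, c) for r in range(rows) for c in range(cols)]
    result: List[List[Coord]] = []
    head = free.pop(0)  # first PE; assumed here to be the top-left corner
    for index in range(groups):
        chain = [head]
        while len(chain) < depth:
            # Next pipeline stage goes to the free PE closest to the current one
            nxt = min(free, key=lambda pe: manhattan(chain[-1], pe))
            free.remove(nxt)
            chain.append(nxt)
        result.append(chain)
        if index < groups - 1:
            # Assumed rule: the next group starts at the free PE closest to this head
            head = min(free, key=lambda pe: manhattan(chain[0], pe))
            free.remove(head)
    return result


if __name__ == "__main__":
    for group in build_pe_groups(4, 4, depth=4, groups=4):
        print(group)
```

On a 4x4 mesh with four groups of depth 4, this sketch assigns one mesh row to each group, so every pipeline hop is between adjacent PEs.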
Execution of the tasks
In the last step, the tasks mapped and scheduled onto the PEs are executed in a parallel and pipelined manner.
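To make the pipelined execution concrete, the following sketch pushes data blocks through a chain of stages, with one thread standing in for each PE and FIFO queues standing in for the NoC links. It is only an analogy under these assumptions: the stage functions are toy byte transformations, not DES or AES rounds, and the names `run_pipeline` and `TOY_STAGES` are hypothetical.

```python
import queue
import threading
from typing import Callable, List, Sequence

_STOP = object()  # sentinel marking the end of the block stream


def _stage_worker(func: Callable[[int], int], inbox: queue.Queue, outbox: queue.Queue) -> None:
    """One pipeline stage: transform every incoming block, then forward the sentinel."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)
            return
        outbox.put(func(item))


def run_pipeline(stages: Sequence[Callable[[int], int]], blocks: Sequence[int]) -> List[int]:
    """Push blocks through the stages; each stage runs in its own thread."""
    links = [queue.Queue() for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=_stage_worker, args=(func, links[i], links[i + 1]))
        for i, func in enumerate(stages)
    ]
    for worker in workers:
        worker.start()
    for block in blocks:
        links[0].put(block)
    links[0].put(_STOP)
    results: List[int] = []
    while True:
        item = links[-1].get()
        if item is _STOP:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


# Toy 8-bit stage functions (placeholders, not real cipher rounds)
TOY_STAGES = [
    lambda x: x ^ 0x5A,                       # mix with a constant
    lambda x: ((x << 1) | (x >> 7)) & 0xFF,   # rotate left by one bit
    lambda x: (x + 3) & 0xFF,                 # add a constant modulo 256
]

if __name__ == "__main__":
    print(run_pipeline(TOY_STAGES, [0x00, 0x01, 0xFF]))
```

Because each stage has a single worker and the queues are FIFO, blocks leave the pipeline in the order they entered, while different blocks occupy different stages at the same time.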
Implementation of block ciphers by the proposed methodology
In this section, the block cryptographic algorithms are implemented by the proposed method. As shown in Figure 3, the profiled functions, the groups of functions, the groups of rounds, and the compositions of parallel and pipelined tasks are outlined for the scheduling and mapping step on a 4x4 mesh-based NePA system.
Profiling block ciphers on NePA.
Determination of concurrent and sequential tasks.
After profiling the block ciphers, the composition of the concurrent and sequential tasks is determined at both the function level and the round level. For the DES implementation at the function level, G2 groups PC1, RL, and PC2 as the key scheduling module. G3 is composed of E and K, and G4 contains S and P. The goal of this step is to find the relationship among the groups of tasks: G1, G3, G4, and G5 are processed sequentially, whereas G2 is executed concurrently. TDEA, which is composed of DES encryption and decryption functions, uses the same groups as DES. AES has four groups: G1, G2, G3, and G4. G2 groups KE and KS, and G3 includes BS and SR. G1, G3, and G4 are executed in sequential order, whereas G2 is processed concurrently. For the determination of the concurrent and sequential tasks at the round level, three different round groups, G1, G2, and G3, are used for the block ciphers. The first group (G1) executes the routine of the initial round. The second group (G2) is annotated with the number of body rounds. For instance, G2-3 means that the task belongs to the second group and has three body rounds. The last group (G3) executes the final round to generate the encrypted output data.
Scheduling and mapping tasks on NePA.
In this scheduling and mapping stage, the groups determined in the previous step are allocated to PEs along with supplemental information such as the number of PEs, the pipeline depth, and the number of PE groups. Figure 3 illustrates an example of the implementation on a 4x4 NePA, composed of 16 PEs. In this example, at the function level, DES has 4 concurrent PE groups (PE-1, PE-2, PE-3, PE-4), and each PE group has a pipeline depth of 3 (G1→G3→G4,G5) with a concurrent key scheduling task (G2). In round-level processing, DES has 2 PE groups (PE-1, PE-2), and each PE group has a pipeline depth of 8 (G1,G2-1→G2-2→G2-2→G2-2→G2-2→G2-2→G2-2→G2-1,G3). The PEs within a PE group execute as a pipelined sequence, while the PE groups run concurrently.
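The round-level stage labels quoted above can be reproduced with a short sketch. It assumes that the rounds are split evenly across the pipeline stages, with the initial round (G1) folded into the first stage and the final round (G3) into the last; the function name `round_level_stages` is hypothetical, and the total of 16 rounds for DES comes from the DES standard rather than from this section.

```python
from typing import List


def round_level_stages(total_rounds: int, depth: int) -> List[str]:
    """Label each pipeline stage with the round groups it executes.

    G1 is the initial round, G3 the final round, and G2-n stands for
    n body rounds, following the notation used in the text.
    """
    if depth < 2 or total_rounds % depth != 0:
        raise ValueError("this sketch assumes depth >= 2 and an even split of rounds")
    per_stage = total_rounds // depth
    labels: List[str] = []
    for stage in range(depth):
        if stage == 0:
            # First stage: initial round plus the remaining body rounds, if any
            body = per_stage - 1
            parts = ["G1"] + ([f"G2-{body}"] if body else [])
        elif stage == depth - 1:
            # Last stage: remaining body rounds, if any, plus the final round
            body = per_stage - 1
            parts = ([f"G2-{body}"] if body else []) + ["G3"]
        else:
            parts = [f"G2-{per_stage}"]
        labels.append(",".join(parts))
    return labels


if __name__ == "__main__":
    # DES: 16 rounds over a pipeline depth of 8
    print("→".join(round_level_stages(16, 8)))
```

For 16 rounds and a depth of 8, the sketch yields one stage with G1 and one body round, six stages with two body rounds each, and one stage with one body round and G3, matching the DES sequence given above.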
TDEA is scheduled and mapped in the same way as the DES implementation at the function level. At the round level, TDEA consists of 48 rounds (3xG1, 42xG2, 3xG3) for each PE group. For function-level processing, AES encryption and decryption use 4 PE groups (PE-1, PE-2, PE-3, PE-4), each composed of 4 pipelined PEs. In round-level processing, AES is composed of 2 PE groups (PE-1, PE-2).
Experimental results

Simulation Environment
The proposed methodology is simulated and implemented on a cycle-accurate SystemC and HDL NoC environment called NePA. The platform is composed of a number of Compact OpenRISC processors, network interface modules, and routers. The C code written for parallel and pipelined execution on several PEs is compiled with the OpenRISC tool chain. A 100 MHz operating clock frequency is used for the simulation, and the router model used in this system is the same as in [5].
Simulation Results
Figures 4, 5, 6, and 7 illustrate the simulation results on NePA. To determine the complexity of communication on NePA, the number of bytes transferred between PEs in function-level and round-level processing is given by Equations (1) and (2), where #Pipelines is the number of pipelined PEs in a PE group, #Parallels is the number of parallel PE groups, and #Rounds is the number of rounds of the corresponding block cipher.
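$$\mathit{TranBytes}_f = (\#\mathit{Pipelines} \times \#\mathit{Rounds} - 1) \times \#\mathit{Parallels} \times \#\mathit{InputBytes} \qquad (1)$$
$$\mathit{TranBytes}_r = (\#\mathit{Pipelines} - 1) \cdots \qquad (2)$$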
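As a worked instance of Equation (1), the sketch below evaluates the function-level transfer volume for the DES example on a 4x4 NePA. The pipeline depth of 3 and the 4 parallel PE groups are taken from the mapping example in the text; the 16 rounds and the 8-byte (64-bit) block are properties of DES, and reading #InputBytes as the block size in bytes is an assumption. The function name is hypothetical.

```python
def tran_bytes_function_level(pipelines: int, rounds: int, parallels: int, input_bytes: int) -> int:
    """Equation (1): bytes transferred between PEs in function-level processing."""
    return (pipelines * rounds - 1) * parallels * input_bytes


if __name__ == "__main__":
    # DES at function level on 4x4 NePA: 3 pipelined PEs per group, 4 groups,
    # 16 rounds, and an assumed 8 input bytes (one 64-bit block)
    print(tran_bytes_function_level(pipelines=3, rounds=16, parallels=4, input_bytes=8))  # prints 1504
```

The term in parentheses is 3 × 16 − 1 = 47, and multiplying by 4 groups and 8 bytes gives 1504 bytes under these assumptions.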
Figures 4 to 7. Cycle counts and transfer bytes versus the number of PEs (1x1, 2x2, 4x4, 8x8) in function-level and round-level processing for DES, TDEA, and AES.
Conclusions
In this paper, a parallel and pipeline processing method has been presented to implement block cipher algorithms such as DES, TDEA, and AES through profiling, scheduling, and mapping exploration on an NoC platform called NePA. The method is used for the pure software implementation of block ciphers on NePA. The parallelized and pipelined tasks are allocated by the proposed mapping strategy to each PE, which consists of a 32-bit OpenRISC core, internal memory, a router, and an NI on the NePA platform. The proposed method has been developed and simulated using a cycle-accurate SystemC and HDL description model. The cycle-accurate simulation results show that the proposed method can be implemented effectively on an NoC system.
