ABSTRACT Embedded multi-core systems are implemented as systems-on-chip that rely on packet store-and-forward networks-on-chip for communications. These systems do not use buses or a global clock. Instead, routers move data between the cores, and each core uses its own local clock. This implies concurrent asynchronous computing. Implementing algorithms in such systems is greatly facilitated by dataflow concepts. In this paper, we propose a methodology for implementing algorithms on dataflow platforms. The methodology can be applied to multi-threaded platforms, multi-core platforms, or a combination of the two. It is based on a novel dataflow graph representation of the algorithm. We applied the proposed methodology to obtain a novel dataflow multi-core computing model for the secure hash algorithm-3. The resulting hardware was implemented in a field-programmable gate array to verify the performance parameters. The proposed model of computation has advantages such as flexible I/O timing in terms of scheduling policy, execution of tasks as soon as possible, and a self-timed event-driven system. In other words, I/O timing and correctness of algorithm evaluation are dissociated in this paper. The main advantage of this proposal is the ability to dynamically obfuscate algorithm evaluation to thwart side-channel attacks without having to redesign the system. This has important implications for cryptographic applications.
I. INTRODUCTION
Concurrent and asynchronous computing is the future for modern high-performance computing systems. High-performance computing for cryptographic systems is typically built as a system-on-chip (SoC) [1]. Furthermore, data communication between the modules is accomplished through a network-on-chip (NoC) [2], where data is communicated as packets in a store-and-forward fashion. Some of these systems operate in Globally Asynchronous Locally Synchronous (GALS) mode [3]. Cryptographic applications running on high-performance platforms include Secure Hash Algorithm-3 (SHA-3) and the Advanced Encryption Standard (AES). Parallel implementations of these algorithms are cumbersome on classic control-flow (von Neumann) processors. On the other hand, dataflow processing is more naturally suited to parallelizing such algorithms [4].
Design for security is mandatory for cryptographic processors to provide immunity to attacks, especially side-channel attacks [5]. Countermeasures employed for classic control-flow processors include inserting dummy instructions [6], randomizing instruction set execution [7], clock randomization [8], and power consumption randomization [9]. These countermeasure techniques require extra computing resources in terms of area, power, and time. The main advantage of using dataflow processing is the ability to frustrate side-channel attacks by randomizing the order of execution of the algorithm tasks without requiring any modifications to the software or hardware of the cryptoprocessor. This is easily accomplished by randomizing the order in which the incoming message bytes are fed to the cryptoprocessor.
Another approach for countering side-channel attacks has been applied to communication systems that use multiple antennas and multiple amplify-and-forward (AF) relays [10], [11]. There, the eavesdropper is able to overhear the communication of the relayed messages, and transmission security is achieved by proper selection of the transmit antennas and the AF relays.
In this work, we merge the traditional von Neumann and dataflow computing architectures. Our goal is to utilize the useful aspects of both computing models to achieve a cryptoprocessor design that is secure against side-channel attacks at minimal cost. The contributions of this work are as follows. We introduce a new dataflow graph (DFG) scheme that is more suitable to describe, simulate, and design concurrent asynchronous systems. We propose a novel methodology to obtain a dataflow multi-core computing (DMC) architecture for a given algorithm. This is a three-step methodology that starts with applying the DFG construction principles to the algorithm. The next two steps involve mapping the algorithm variables to memory modules and mapping the algorithm operations to the processing cores. We applied the proposed methodology to obtain a novel DMC architecture for the secure hash algorithm-3 (SHA-3). The resulting hardware was implemented in an FPGA to verify the performance parameters. The DMC architecture of the SHA-3 algorithm has advantages such as the ability to frustrate side-channel attacks, flexible I/O timing, as-soon-as-possible execution, and self-timed event-triggered operation.
This paper is organized as follows. In Section II we introduce the proposed DFG computing model. In Section III we compare classic control-flow (von Neumann) processing with dataflow processing, whereas Sections IV and V give a brief description of the SHA-3 algorithm and its functions. In Section VI we discuss how to transform a given algorithm into a DMC architecture. In Section VII we apply the DFG principles to the SHA-3 algorithm and discuss associating algorithm variables and functions with memory modules and processing elements, respectively. In Section VIII we present the implementation results. Finally, we draw our conclusions in Section IX.
II. PROPOSED DATAFLOW GRAPH COMPUTING MODEL
We introduce in this section a dataflow graph computational model that is more suitable to describe, simulate, and design asynchronous concurrent systems.
A. DATAFLOW GRAPH (DFG) CONSTRUCTION
The data dependency among the different tasks comprising an algorithm can always be represented by a directed graph (DG). A directed graph is a collection of nodes representing the algorithm variables and directed arcs representing the dependencies among the variables. The graph can be expressed as the pair G = (N, A) [12]. The operations on the variables are implied. We propose in this work a novel representation of an algorithm as a dataflow graph (DFG), which is composed of three sets, namely variables V, functions F, and directed arcs A, instead of the two sets of the usual DG.
The proposed DFG is the tuple:
$$G = (V, F, A) \tag{1}$$
The set of variables $V = (v_0, \ldots, v_{n-1})$, which stands for memories in hardware, is a finite set representing the algorithm variables, where n > 0. There are three types of variables: input, internal, and output. The three variable types are classified based on their locations in the algorithm and the number of incoming and outgoing arcs. In the DFG of Fig. 1, a variable node is represented by a circle and a function is represented by a square. Notice that our DFG is a directed acyclic graph (DAG) because most algorithms we are interested in are causal [12].
The arcs connect a variable to a function or a function to a variable. The arcs do not connect a variable to a variable or a function to a function. The start of an arc is the output of a variable or a function, and the end of an arc is the input of a variable or a function. The number of arcs leaving or entering a variable or a function is based on the following rules:
1) A variable could have one or more output arcs and must have at most one input arc. The output arcs represent sending copies of the variable to the different functions that use that variable. A single input arc from a function to the variable implies that this variable is produced by that function.
2) A function could have one or more input and output arcs. Multiple input arcs imply multiple arguments to that function. Multiple output arcs imply that the function produces more than one output variable. An example of this is the division function, where the quotient and the remainder are produced. Another example is addition, where the sum and a carry-out or overflow flag are produced.
VOLUME 6, 2018
Figure 1 shows a DFG example of an algorithm composed of 10 variables and 7 functions. The DFG illustrates the dependencies among the variables and the functions of the algorithm. The DFG of Fig. 1 indicates no information regarding:
1) Allocation of functions to hardware processors.
2) Association of variables with memories or registers.
3) The timing of availability of variables or execution of functions.
To add the notion of time to the construction of the DFG, we use tokens as in Fig. 2. On the graph, tokens are represented by black circles (•). A token is assigned to a variable when the variable is valid and can be used. Function $f_i$ is ready to be evaluated when all its input variables have tokens.
Referring to Fig. 2, we note that at a given time instance t the input variables $v_0$ and $v_1$ have tokens, so function $f_0$ can fire. When a function fires, a token is placed at the variable node associated with its output to indicate the availability of that variable. This can be seen at the internal variable $v_3$.
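The arc rules and the token-based firing rule described above can be sketched in a few lines of Python. This is only an illustrative model: the class name `DFG`, its methods, and the small five-variable graph are assumptions made for the example and do not reproduce Fig. 1 or Fig. 2.

```python
from dataclasses import dataclass, field


@dataclass
class DFG:
    """Dataflow graph G = (V, F, A); an arc always joins a variable and a function."""
    variables: set
    functions: set
    arcs: set = field(default_factory=set)    # (source, destination) pairs
    tokens: set = field(default_factory=set)  # variables that currently hold a token

    def add_arc(self, src, dst):
        var_to_fn = src in self.variables and dst in self.functions
        fn_to_var = src in self.functions and dst in self.variables
        if not (var_to_fn or fn_to_var):
            raise ValueError("arcs must join a variable and a function")
        if fn_to_var and any(d == dst for _, d in self.arcs):
            raise ValueError("a variable has at most one incoming arc")
        self.arcs.add((src, dst))

    def inputs_of(self, fn):
        return {s for s, d in self.arcs if d == fn}

    def outputs_of(self, fn):
        return {d for s, d in self.arcs if s == fn}

    def ready(self, fn):
        """A function may fire once every one of its input variables has a token."""
        return self.inputs_of(fn) <= self.tokens

    def fire(self, fn):
        if not self.ready(fn):
            raise RuntimeError(f"{fn} is missing input tokens")
        self.tokens |= self.outputs_of(fn)  # outputs become available


# Hypothetical graph: f0 consumes v0, v1 and produces v3; f1 consumes v2, v3.
g = DFG({"v0", "v1", "v2", "v3", "v4"}, {"f0", "f1"})
for arc in [("v0", "f0"), ("v1", "f0"), ("f0", "v3"),
            ("v2", "f1"), ("v3", "f1"), ("f1", "v4")]:
    g.add_arc(*arc)

g.tokens |= {"v0", "v1"}   # two inputs arrive
g.fire("f0")               # f0 is ready, so it fires and v3 receives a token
```

After `g.fire("f0")`, function `f1` is still not ready because `v2` has not arrived, which is exactly the self-timed behaviour the tokens are meant to capture.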
B. USEFUL DEFINITIONS
In this section, we define some useful terms.
Definition 1: A variable is an input variable if it has no incoming arcs. It represents one of the algorithm input variables. Figure 2 shows that the algorithm has three input variables $v_0$, $v_1$, and $v_2$.
Definition 2: A variable is an output variable if it has no outgoing arcs. It represents one of the algorithm output variables. Figure 2 shows that the algorithm has three output variables $v_6$, $v_7$, and $v_9$.
Definition 3: A variable is an internal variable if it has incoming and outgoing arcs. It represents one of the algorithm intermediate variables. Figure 2 shows that variables $v_3$-$v_5$ and $v_8$ are internal variables. Figure 2 also shows that variables $v_6$, $v_7$, and $v_9$ form a child set of $v_4$.
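Definitions 1 to 3 translate directly into a test on the arcs touching each variable. The following sketch assumes the arcs are given as (source, destination) pairs; the function name and the example arcs are hypothetical and are not taken from Fig. 2.

```python
def classify_variables(variables, arcs):
    """Label each variable as input, internal, or output from its arcs."""
    has_incoming = {dst for _, dst in arcs}
    has_outgoing = {src for src, _ in arcs}
    kinds = {}
    for v in sorted(variables):
        if v not in has_incoming:
            kinds[v] = "input"      # Definition 1: no incoming arcs
        elif v not in has_outgoing:
            kinds[v] = "output"     # Definition 2: no outgoing arcs
        else:
            kinds[v] = "internal"   # Definition 3: both kinds of arcs
    return kinds


# Hypothetical arcs: f0(v0, v1) -> v3 and f1(v3) -> v4.
example_arcs = [("v0", "f0"), ("v1", "f0"), ("f0", "v3"), ("v3", "f1"), ("f1", "v4")]
kinds = classify_variables({"v0", "v1", "v3", "v4"}, example_arcs)
print(kinds)
```

Note that a variable with no arcs at all would be reported as an input by this sketch, since Definition 1 is checked first.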
III. COMPARING CONTROL-FLOW VS. DATAFLOW PROCESSING
In classic control-flow (von Neumann) processing, the data flows across buses. Any other information, such as its type or identity, is implied by the design itself, for example through control and address buses or registers. The validity of the data is implied by the arrival of a clock edge. The operations carried out on the data are specified in the instruction register and control signals. In dataflow processing, data and operation are combined in a packet and no clock is necessary to synchronize the system components. The inclusion of a token in the packet indicates that the data is valid and ready to be processed.
In control-flow processing, the processors are always active as long as there is a clock, and it is very difficult to detect when a processor is idle. In dataflow processing, the processor is idle by default until a packet arrives. Dataflow processing is more suitable for green computing because it prevents unnecessary computations.
In traditional von Neumann processing, changes to I/O timing or inter-module synchronization require a complete redesign of hardware and software. This point can be illustrated by understanding the role of a system-wide clock in traditional systems. The presence of a clock edge implies two things:
1) The presence of the edge indicates that the data is valid.
2) The location of the clock edge along the time axis indicates the identity of the data (i.e., which data sample it is).
Parallel implementations of algorithms are difficult and error-prone when using control-flow processing. In dataflow processing, processing can commence as soon as any packet arrives with a token. The contents of the packet indicate the data and the operations to be done. Thus, data dependencies are included in the packet and correct processing is guaranteed.
In control-flow processing, words must be propagated through the system in a predetermined sequence which makes the platform vulnerable to side-channel attacks. In dataflow processing, the randomization of the order of execution of the algorithm tasks by randomizing the order of feeding the incoming message packets will thwart such attacks.
A packet transmission mode that replaces system buses, and the lack of a system-wide clock, which results in concurrent asynchronous computing, are unique features of systems-on-chip that use networks-on-chip for communication. Thus, dataflow processing is a natural extension to such systems. Table 1 summarizes the comparison between control-flow and DFG processing.
IV. SECURE HASH ALGORITHM-3
This section is a brief review of the SHA-3 algorithm with the purpose of implementing it using a dataflow computer as per Section VII. Data integrity and authenticity are a crucial part of a secure system. Technology users are vulnerable to cyber attacks at many levels. Information integrity is a security concern for all involved parties. The term data integrity describes the accuracy and reliability of information. Exchanging a piece of information between two entities goes through many phases, and the information could be altered in any phase, such as processing, transforming, or storing. Data alteration could be caused by malicious behaviour, system failure, or user error. To overcome these issues, cryptographers developed data integrity verification mechanisms. Cryptographic hash functions were developed to verify data integrity. Secure Hash Algorithm-3 (SHA-3) is the latest such verification algorithm. In 2004-2005 the National Institute of Standards and Technology (NIST) held two hash workshops after cryptanalysis raised serious concerns about the security of the government-approved hash function SHA-1. As a result of these workshops, NIST decided to develop a new cryptographic hash algorithm for standardization. In 2007 NIST released a call for candidates for a new cryptographic hash algorithm family, SHA-3 [13]. The competition ran from 2007 to 2012, and in 2012 NIST announced Keccak as the winning candidate [14].
In the SHA-3 (Keccak) [15] family there are four fixed-length hash functions and two extendable-output functions (XOFs). These six functions share a common structure, namely the sponge function [16].
A hash function operates on a binary input and generates a fixed-size output. The input to the hash function is called the message, and the output is called the digest or hash value. The SHA-3 family consists of four hash functions: SHA3-224, SHA3-256, SHA3-384, and SHA3-512. The suffix of each name gives the size of the output digest; for instance, SHA3-384 outputs a 384-bit hash value. The remaining two functions, the XOFs, are SHAKE128 and SHAKE256. The output length of these functions is flexible and can be chosen to meet the needs of the application.
The SHA-3 hash functions are designed to provide resistance against collision, preimage, and second preimage attacks [17]. Hash functions are a crucial part of many information security applications, such as digital signatures, key derivation, and pseudorandom bit generation.
All six SHA-3 functions perform the same permutation. In fact, the functions are just different modes of using that permutation, which provides flexibility in terms of security parameters and output size for potential applications and future development. SHA-3 is based on a new cryptographic hash approach, the sponge function family [16].
Two parameters are used in the KECCAK-p permutations. The first parameter is b, the fixed length in bits of the strings that are permuted. The standard calls b the width of the permutation. The second parameter is R, the number of iterations, or rounds. The KECCAK-p permutation is denoted by KECCAK-p[b, R]. The b-bit string that is permuted forms the state. The state consists of two parts, the rate λ and the capacity c. The rate defines the number of bits to be processed in each permutation block, and the capacity is the remaining bits of the state. The width of the permutation is the sum λ + c, which is restricted to seven predetermined values {25, 50, 100, 200, 400, 800, 1600} [17]. In SHA-3 the desired size of the hash output, denoted by d, determines the values of λ and c. For instance, for a 512-bit hash output, b = 1600 bits, λ = 576 bits, and c = 1024 bits, where c = 2 × d. In SHA-3 the state consists of a maximum of 1600 bits organized as a 5 × 5 × w array, where $w = 2^{\ell} = b/25$ and $\ell = \log_2(b/25)$. The seven possible values of these variables are predefined in the standard; Table 2 shows all the different values. The SHA-3 inputs and outputs can be represented in two forms. The first form represents the data as a string S of b bits indexed from 0 to b − 1. The second form represents the data as a three-dimensional array A[x, y, z] with three indices 0 ≤ x, y < 5 and 0 ≤ z < w. The mapping from S to A is given by:
$$A[x, y, z] = S[w(5y + x) + z] \tag{2}$$
Figure 3 shows a state array in three dimensions. The state can be split into planes or lanes as in Fig. 4. A plane is 5 × w bits, and a lane is a simple bit string of w bits.
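The parameter relations above are easy to check numerically. The sketch below assumes the string-to-state mapping of the SHA-3 standard, A[x, y, z] = S[w(5y + x) + z], and uses hypothetical function names; it is meant as a sanity check of the arithmetic, not as part of the proposed hardware.

```python
VALID_WIDTHS = (25, 50, 100, 200, 400, 800, 1600)


def keccak_parameters(b):
    """Return (w, ell) with w = b / 25 = 2**ell for a permutation width b."""
    if b not in VALID_WIDTHS:
        raise ValueError("b must be one of the seven permitted widths")
    w = b // 25
    ell = w.bit_length() - 1  # exact log2 because w is a power of two
    return w, ell


def rate_and_capacity(d, b=1600):
    """SHA-3 uses capacity c = 2d; the rate is the rest of the state."""
    c = 2 * d
    return b - c, c


def string_to_state(S, w):
    """Build A[x][y][z] from a flat sequence S of 25 * w bits."""
    return [[[S[w * (5 * y + x) + z] for z in range(w)]
             for y in range(5)] for x in range(5)]


w, ell = keccak_parameters(1600)
rate, capacity = rate_and_capacity(512)
# Using the bit positions themselves as data makes the mapping visible.
A = string_to_state(list(range(25 * w)), w)
print(w, ell, rate, capacity, A[1][2][3])
```

With the bit positions used as data, `A[x][y][z]` simply returns the index of the bit in S that lands at coordinates (x, y, z).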
V. SHA-3 FUNCTIONS
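The KECCAK-p permutation is iterative and each iteration is called a round. Each round consists of five step mappings: θ, ρ, π, χ, and ι.
A. THETA (θ) STEP
For all pairs (x, z) such that 0 ≤ x ≤ 4 and 0 ≤ z ≤ 63, let
$$C[x, z] = A[x, 0, z] \oplus A[x, 1, z] \oplus A[x, 2, z] \oplus A[x, 3, z] \oplus A[x, 4, z] \tag{3}$$
$$D[x, z] = C[(x - 1) \bmod 5, z] \oplus C[(x + 1) \bmod 5, (z - 1) \bmod w] \tag{4}$$
where the addition and subtraction operations on the indices x and y are done modulo 5 and those on the index z are done modulo w. This applies to all index arithmetic in the succeeding functions in the following subsections. For all triples (x, y, z) such that 0 ≤ x, y ≤ 4 and 0 ≤ z ≤ 63, the output of the θ step is given by:
$$A'[x, y, z] = A[x, y, z] \oplus D[x, z] \tag{5}$$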
B. RHO (ρ) STEP
The output state A[x, y, z] of the θ step is the input state to the ρ step. All lanes in the state are rotated left by a predefined length called the offset δ. For all pairs (x, y) such that 0 ≤ x, y ≤ 4 and all z such that 0 ≤ z ≤ 63, the output of the ρ step is given by:
$$A'[x, y, z] = A[x, y, (z - \delta[x, y]) \bmod w] \tag{6}$$
where the value of δ associated with the indices x and y can be found in the standard document [18] .
C. PI (π) STEP
The input state to the π step is the output of the ρ step. The positions of all lanes in the state are rearranged except for the lane with coordinates x = y = 0. For all triples (x, y, z) such that 0 ≤ x, y ≤ 4 and 0 ≤ z ≤ 63, the output of the π step is given by:
$$A'[x, y, z] = A[(x + 3y) \bmod 5, x, z] \tag{7}$$
D. CHI (χ) STEP
The χ step accepts the output state A of the π step. Each bit of a lane is combined with neighbouring bits along the x-axis using AND, XOR, and NOT operations. For all triples (x, y, z) such that 0 ≤ x, y ≤ 4 and 0 ≤ z ≤ 63, the output of the χ step is given by:
$$A'[x, y, z] = A[x, y, z] \oplus \left(\lnot A[(x + 1) \bmod 5, y, z] \land A[(x + 2) \bmod 5, y, z]\right) \tag{8}$$
E. IOTA (ι) STEP
The output state of the χ step is the input to the ι step. For all values of z such that 0 ≤ z ≤ 63, the output of the ι step is given by:
$$A'[0, 0, z] = A[0, 0, z] \oplus RC[z] \tag{9}$$
where RC is the round constant, whose value changes for each round as explained in the standard document [17]. The ι step is the last step in the round; its output is fed back as the input to the θ step until the final round is reached. The five step mappings are repeated 24 times over the state array A, for rounds r = 0, ..., R − 1. Figure 5 shows the 24 rounds of the KECCAK functions. The θ step is broken into three sub-steps $\theta_{1a}$, $\theta_{1b}$, $\theta_{1c}$ based on the step equations (3), (4), and (5).
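As a software cross-check of the five step mappings, the following sketch implements them on 64-bit lanes as specified in the SHA-3 standard and wraps them in a byte-oriented SHA3-256 sponge. It is a plain sequential reference, not the dataflow hardware proposed in this paper; all function names are illustrative, and the rotation offsets and round constants are generated with the recurrences of the standard rather than copied from tables.

```python
import hashlib

MASK = (1 << 64) - 1  # lane width w = 64, i.e. b = 1600


def rol(lane, n):
    """Rotate a 64-bit lane left by n positions."""
    n %= 64
    return ((lane << n) | (lane >> (64 - n))) & MASK


def theta(A):
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]  # theta_1a
    D = [C[(x - 1) % 5] ^ rol(C[(x + 1) % 5], 1) for x in range(5)]          # theta_1b
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]            # theta_1c


def rho_offsets():
    """Rotation offsets, generated with the recurrence of the standard."""
    off = [[0] * 5 for _ in range(5)]
    x, y = 1, 0
    for t in range(24):
        off[x][y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return off


def round_constants():
    """The 24 round constants, generated with an 8-bit LFSR."""
    constants, lfsr = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                rc ^= 1 << ((1 << j) - 1)
        constants.append(rc)
    return constants


OFFSETS = rho_offsets()
RC = round_constants()


def rho(A):
    return [[rol(A[x][y], OFFSETS[x][y]) for y in range(5)] for x in range(5)]


def pi(A):
    return [[A[(x + 3 * y) % 5][x] for y in range(5)] for x in range(5)]


def chi(A):
    return [[A[x][y] ^ (~A[(x + 1) % 5][y] & A[(x + 2) % 5][y])
             for y in range(5)] for x in range(5)]


def iota(A, r):
    out = [column[:] for column in A]
    out[0][0] ^= RC[r]
    return out


def keccak_f1600(A):
    for r in range(24):
        A = iota(chi(pi(rho(theta(A)))), r)
    return A


def sha3_256(message):
    """SHA3-256 of a byte string: rate 1088 bits (136 bytes), capacity 512 bits."""
    rate = 136
    padded = bytearray(message) + bytes(rate - len(message) % rate)
    padded[len(message)] ^= 0x06  # SHA-3 domain suffix and first padding bit
    padded[-1] ^= 0x80            # final padding bit
    A = [[0] * 5 for _ in range(5)]
    for start in range(0, len(padded), rate):
        for i in range(rate // 8):  # lane i sits at x = i mod 5, y = i div 5
            lane = int.from_bytes(padded[start + 8 * i:start + 8 * i + 8], "little")
            A[i % 5][i // 5] ^= lane
        A = keccak_f1600(A)
    return b"".join(A[i % 5][i // 5].to_bytes(8, "little") for i in range(4))


print(sha3_256(b"abc") == hashlib.sha3_256(b"abc").digest())
```

Comparing against `hashlib.sha3_256` from the Python standard library gives an independent check that the step functions are composed in the right order.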
VI. DESIGN SPACE EXPLORATION METHODOLOGY FOR DATAFLOW MULTICORE COMPUTING ARCHITECTURE
In this section, we discuss how to transform a given algorithm into a DMC architecture. We follow a systematic design space exploration methodology to obtain the DMC architecture. The methodology is divided into three steps:
1) Obtain the DFG associated with the given algorithm. This step is explained in Section II, Subsections VI-A and VII-A.
2) Define a memory architecture (distributed/shared) and a strategy for mapping the algorithm variables to the memory modules. This step is explained in Subsections VI-B and VII-B.
3) Define a multicore processor array architecture and a strategy for mapping the algorithm functions to the cores. This step is explained in Subsections VI-C and VII-C.
A. DERIVING THE DFG OF AN ALGORITHM
We indicated in Sec. II that an algorithm is defined through sets of functions, variables, and the dependencies between pairs of variables and functions. Deriving the DFG of an algorithm starts with identifying and classifying the algorithm variables, followed by examining the dependencies among the variables. The transformations on the variables define the algorithm dependencies and functions with their associated input and output variables. These dependencies produce the DFG discussed in detail in Section II. From the DFG one can infer algorithm properties such as workload, depth, degree of parallelism, and presence of cycles, as discussed in more detail in [12]. The DFG reveals the types of variables as input, internal, and output. This classification helps in deciding the scheduling of input data, identifying critical paths, and determining the delay of producing the outputs. In Fig. 6, the DFG is rearranged into sequential equitemporal domains, or stages of execution. The figure is obtained after making several idealized assumptions such as:
1) All inputs are available at time t = 0.
2) There are no constraints on memory and I/O bandwidths.
The functions in each domain are evaluated at the same time. For example, the functions $f_0$-$f_2$ can be evaluated concurrently when all inputs $v_0$-$v_2$ are available and can be read simultaneously by all the functions. Figure 6 is useful in determining algorithm properties such as depth and degree of parallelism. The depth of the algorithm is the number of sequential stages, which is three in our case. This implies that the fastest completion time under ideal conditions is three stage delays.
The degree of parallelism is defined as the maximum number of functions associated with any one stage. This defines the maximum number of cores that can operate simultaneously under ideal conditions. From the figure we determine that three cores can operate in parallel.
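The stage decomposition, depth, and degree of parallelism can be computed mechanically under the two idealized assumptions listed above. The sketch below uses a hypothetical seven-function graph chosen to have the same depth (three) and degree of parallelism (three) as quoted in the text; its arcs are assumed for illustration and are not read from Fig. 6.

```python
def schedule_stages(fn_inputs, fn_outputs, input_vars):
    """Group functions into equitemporal stages, firing each as soon as possible."""
    available = set(input_vars)   # assumption 1: all inputs present at t = 0
    remaining = set(fn_inputs)
    stages = []
    while remaining:
        ready = sorted(f for f in remaining if set(fn_inputs[f]) <= available)
        if not ready:
            raise ValueError("graph has a cycle or an unproduced variable")
        stages.append(ready)      # assumption 2: no memory or I/O limits
        for f in ready:
            available |= set(fn_outputs[f])
        remaining -= set(ready)
    return stages


# Hypothetical DFG with inputs v0..v2 and one output variable per function.
fn_inputs = {
    "f0": ["v0", "v1"], "f1": ["v1", "v2"], "f2": ["v2"],
    "f3": ["v3", "v4"], "f4": ["v4"], "f5": ["v4", "v5"],
    "f6": ["v8"],
}
fn_outputs = {
    "f0": ["v3"], "f1": ["v4"], "f2": ["v5"],
    "f3": ["v6"], "f4": ["v7"], "f5": ["v8"],
    "f6": ["v9"],
}

stages = schedule_stages(fn_inputs, fn_outputs, ["v0", "v1", "v2"])
depth = len(stages)
parallelism = max(len(stage) for stage in stages)
print(stages, depth, parallelism)
```

The depth is the number of stages and the degree of parallelism is the size of the widest stage, matching the definitions given in the text.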
B. MAPPING VARIABLES TO MEMORY
We have two interrelated mapping problems: mapping variables to memory modules and mapping functions to processors. Functions depend on variables, and variables are produced by functions. Because of this circular relationship, both mappings must be optimized together, taking into account hardware constraints and memory bandwidth. This is akin to the placement and routing problem in VLSI chips.
The communication capability of a memory is limited by the number of its I/O ports. A variable must be stored in memory, yet it must be accessed by one or more processors. Memories have large storage capabilities, so a many-to-one mapping between a set of variables and a single hardware memory is a suitable option. There are three memory architecture design options:
1) Allocate a single shared memory to store all the algorithm variables. This is an all-to-one mapping. This is not an attractive option, since it has the lowest memory bandwidth. Parallelism is difficult to achieve in such a design, since only one variable can be accessed at a given time.
2) Allocate a single memory module to each stage and map the output variables of each stage to the memory assigned to that stage. This design option is classified as a globally-distributed/locally-shared memory architecture. This is a many-to-one mapping. It is a suitable option that increases the memory bandwidth of the system. This option allows for parallelism among the stages. However, parallelism within a stage is limited due to the use of shared memory in each stage.
3) Allocate a memory module to each output variable in each stage. Since the stage output variables are associated with a function block, this design option is tantamount to a distributed memory architecture. This is a one-to-one mapping. This is the best option in terms of memory bandwidth. Such a design permits full parallelism.
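The three memory options differ only in how variables are grouped into modules. A minimal sketch, assuming hypothetical stage contents and module names, makes the bandwidth trade-off concrete by counting the resulting modules, each assumed here to serve one read at a time:

```python
def map_variables(stage_outputs, option):
    """Return {variable: memory module} for one of the three design options."""
    mapping = {}
    for stage, outputs in enumerate(stage_outputs):
        for v in outputs:
            if option == "shared":          # option 1: all-to-one
                mapping[v] = "M"
            elif option == "per_stage":     # option 2: one module per stage
                mapping[v] = f"M{stage}"
            elif option == "per_variable":  # option 3: one-to-one
                mapping[v] = f"M_{v}"
            else:
                raise ValueError(f"unknown option: {option}")
    return mapping


def module_count(mapping):
    """Upper bound on simultaneous reads if each module has one read port."""
    return len(set(mapping.values()))


# Hypothetical output variables of three stages.
stage_outputs = [["v3", "v4", "v5"], ["v6", "v7", "v8"], ["v9"]]
counts = {opt: module_count(map_variables(stage_outputs, opt))
          for opt in ("shared", "per_stage", "per_variable")}
print(counts)
```

The count grows from one module to one per stage to one per variable, which is the sense in which the third option offers the highest memory bandwidth.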
C. MAPPING FUNCTIONS TO PROCESSOR
Each function in Fig. 6 will be executed only once during the execution of the algorithm. Hence, a one-to-one mapping of a function to a single hardware processor is not practical in terms of the area and power needed to implement the processor. A many-to-one mapping between a set of functions and one hardware processor is more suitable for our DMC architecture. There are three design options for mapping functions to processors:
1) Associate a single processor core with all the algorithm functions. Functions of all stages will be executed sequentially. This is an all-to-one mapping. It is very efficient in terms of hardware utilization but does not allow parallelism.
2) Associate a processor core with each stage and map all functions of each stage to the processor assigned to that stage. This option allows multiple processor cores to execute functions in parallel at a given time. This is a many-to-one mapping. It shows good hardware utilization and also allows for parallelism.
3) Associate a processor core with each function of the algorithm stages. Functions that belong to a stage will be distributed among the available processing units to be executed. This is a one-to-one mapping. It shows a low degree of hardware utilization but offers the most parallelism possible.
The degree of parallelism exhibited by each processor depends on the design of that processor, e.g., whether it is superscalar or not. However, in this work we assume our processor to be capable of executing a single function at a time.
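To compare the three options, one can simulate a simple time-stepped schedule in which every function takes one time unit and each processor executes at most one function per step. The graph, the unit execution time, and the tie-breaking rule (lowest function name first) are assumptions made for this sketch only.

```python
def makespan(fn_inputs, fn_outputs, input_vars, processor_of):
    """Time steps needed when each processor runs one ready function per step."""
    available = set(input_vars)
    pending = set(fn_inputs)
    steps = 0
    while pending:
        ready = sorted(f for f in pending if set(fn_inputs[f]) <= available)
        if not ready:
            raise ValueError("graph has a cycle or an unproduced variable")
        busy, fired = set(), []
        for f in ready:
            if processor_of[f] not in busy:   # single function at a time
                busy.add(processor_of[f])
                fired.append(f)
        for f in fired:                       # outputs usable from the next step
            available |= set(fn_outputs[f])
        pending -= set(fired)
        steps += 1
    return steps


# Hypothetical DFG: three stages holding 3, 3, and 1 functions.
fn_inputs = {"f0": ["v0", "v1"], "f1": ["v1", "v2"], "f2": ["v2"],
             "f3": ["v3", "v4"], "f4": ["v4"], "f5": ["v4", "v5"], "f6": ["v8"]}
fn_outputs = {"f0": ["v3"], "f1": ["v4"], "f2": ["v5"],
              "f3": ["v6"], "f4": ["v7"], "f5": ["v8"], "f6": ["v9"]}
stage_of = {"f0": 0, "f1": 0, "f2": 0, "f3": 1, "f4": 1, "f5": 1, "f6": 2}
inputs = ["v0", "v1", "v2"]

single = makespan(fn_inputs, fn_outputs, inputs, {f: "P" for f in fn_inputs})
per_stage = makespan(fn_inputs, fn_outputs, inputs,
                     {f: f"P{s}" for f, s in stage_of.items()})
per_function = makespan(fn_inputs, fn_outputs, inputs,
                        {f: f"P_{f}" for f in fn_inputs})
print(single, per_stage, per_function)
```

For this particular graph the per-stage option finishes between the two extremes because the second-stage processor can start as soon as its first operands are produced, without waiting for the whole first stage.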
VII. CASE STUDY: DMC ARCHITECTURE FOR SHA-3 ALGORITHM
In this section, we discuss how to transform the SHA-3 algorithm operations described by equations (3)-(9) and Fig. 5 into a DMC architecture. We follow the methodology presented in Sections II and VI to obtain the DMC architecture. It starts with deriving the algorithm graph components using the DFG principles, followed by mapping the algorithm variables and functions to memory modules and processing cores, respectively.
A. OBTAINING THE SHA-3 DFG
SHA-3 is a multi-round algorithm, and each round is a sequence of five step functions. The five functions are sequential. A model of the SHA-3 algorithm using the DFG is as follows. The SHA-3 functions are divided into seven stages, where the θ step is represented by three sub-steps $\theta_{1a}$, $\theta_{1b}$, $\theta_{1c}$. Each stage consists of two types of components: variables and functions. Figure 7 shows the seven stages of the SHA-3 DFG. The SHA-3 algorithm deals with data in the form of a cube along the x-, y-, and z-axes of size C = 5 × 5 × w. The algorithm also deals with data in the form of a plane along the x- and z-axes of size P = 5 × w. The value of w is determined from Table 2.
We assume the dataflow processors use a word size of g bits. The number of input or output data items per stage depends on whether the data comes from a cube or a plane. This number can be found using the following equations:
$$n = C/g \quad \text{for a cube} \tag{10}$$
$$m = P/g \quad \text{for a plane} \tag{11}$$
A. Alzahrani, F. Gebali: Multi-Core Dataflow Design and Implementation of SHA-3
FIGURE 7. DFG of SHA-3 algorithm modeling.
As an example, Fig. 7 indicates that Stage $\theta_{1a}$ deals with a cube as its input and a plane as its output. Hence the input variables are $v_{0,i}$, the output variables are $v_{1,j}$, and the functions at that stage are $f_{1,j}$, with 1 ≤ i ≤ n and 1 ≤ j ≤ m.
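Equations (10) and (11) can be evaluated for concrete word sizes. The values of g used below (64 and 32 bits) are assumptions chosen for illustration, and the helper rejects word sizes that do not divide the cube and plane sizes evenly, a case the equations do not address.

```python
def data_items_per_stage(w, g):
    """Return (n, m): words per cube and per plane for lane width w and word size g."""
    cube_bits = 5 * 5 * w   # C
    plane_bits = 5 * w      # P
    if cube_bits % g or plane_bits % g:
        raise ValueError("g must divide both the cube and the plane size")
    return cube_bits // g, plane_bits // g


n, m = data_items_per_stage(w=64, g=64)
print(n, m)
```

With w = 64 and g = 64 each variable is exactly one lane, so stage $\theta_{1a}$ would read n = 25 input variables and produce m = 5 output variables.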
B. MAPPING SHA-3 VARIABLES TO MEMORY
Since the SHA-3 algorithm is a multi-stage algorithm, we shall follow the second alternative for mapping variables to memory modules. The globally-distributed/locally-shared mapping allows parallelism globally, between stages, but not locally: within a stage, operation is sequential. To map the SHA-3 algorithm variables we used heuristics. One heuristic is that all n or m input variables of a single stage are mapped to one memory, which means a single output port. As a result we will have memories $M_{1a}$, $M_{1b}$, $M_{1c}$, $M_2$, $M_3$, $M_4$, and $M_5$ to store the SHA-3 states of each stage. The first mapping step is to map all input data into the single memory block $M_5$. The data will be transmitted to the first processor on a single port, a packet at a time. Based on this decision, parallelization of the $\theta_{1a}$ stage is precluded. The data has to be accessed in packet-serial format, even though it is all available at $t_0$. Variables will be fed to the processor based on the scheduling policy, so the output from memory will be packet serial. No parallel outputs are permitted. The following lemmas result from the procedure that maps all variables of one stage into a single memory and from the packet-serial transmission format of all variables in memory.
Lemma 9: Only one variable at a time can be read from the input memory of a stage. Hence there is no parallelism within a stage, and the throughput of each stage is limited to one variable at a time.
Proof: The best known available memory is a dual-ported RAM that allows one read and one write operation simultaneously. Despite the large storage capability of the memory, at best one read and one write operation can be done simultaneously.
C. MAPPING SHA-3 FUNCTIONS TO PROCESSOR
We adopted the second mapping option: one processor per stage, which does not allow parallelism within a stage but allows it globally between stages.
In terms of processing capability, based on the mapping scheme that we applied to the variables, all n or m functions of each stage will be mapped to a single processor, which means a single operation at a time. As a result we will have processors $P_{1a}$, $P_{1b}$, $P_{1c}$, $P_2$, $P_3$, $P_4$, and $P_5$ to implement SHA-3. The first mapping step is to map all functions of stage $\theta_{1a}$ into a single processor $P_{1a}$. The functions will be executed by the processor sequentially, in an order determined by the scheduling of the input data. In terms of communication capabilities, the I/O limitations imply a single input and a single output at any time t. The following lemmas result from the assumption that the processor is a simple ALU processor and that no parallel operations occur while it operates.
Lemma 10: The processor of every stage will produce a variable every x clock cycles where x is the number of input variables.
Proof: Our processor is a simple single processor at every stage. Thus, the processor can execute one function out of n or m at a given time t that will only produce one variable. Also, the limitation on memory bandwidth implies a single input variable at a time will be fed to the processor.
Functions that are associated with any one of the seven SHA-3 stages will be mapped to a single processor. The outputs of those functions will be mapped to a single memory. The function operations within one stage are identical but use different sets of input arguments. Fig. 8 shows the mapping of the seven SHA-3 stages. Each stage is composed of a hardware processor and a memory block. The output arguments of the $\theta_{1a}$ stage will be the input arguments of the $\theta_{1b}$ stage, and so forth.
D. SHA-3 OPERATIONS OF DMC ARCHITECTURE
The system consists of self-timed event-triggered operations. As we mentioned earlier, processing starts after the input arguments have been written to the receiver memory. All the input arguments are available at t = 0, so the scheduling policy for reading the inputs is free of restrictions. The processor then starts reading the memory based on a specific scheduling policy. We note that each memory in this system supports three operations, two reads and one write. The processor that writes to a memory can also read it, and the other processor can only read it.
According to Section III, a packet is used to represent each algorithm variable. The representation includes the value of the variable, its unique ID, and its target functions (cf. Definition 4). These fields are illustrated in Fig. 9. Packets are propagated throughout the system between nodes. A node that outputs a packet is the parent of this packet, and the generated packet is a child of that node. The system reads the packet and extracts the identity of the variable. The extracted identity is then used to generate the identities of the child set of that variable. The source variable becomes a parent packet and the destination variable is a child packet. The child set is determined on the fly while processing. The added destination and identity bits increase the size of the data by an acceptable ratio. We minimize the header size to maintain a small packet size by adding only the necessary fields.
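A packet and the on-the-fly derivation of child identities can be modelled as follows. The field names, the (stage, index) ID scheme, and the lane numbering index = x + 5y are assumptions made for this sketch; the actual field layout is the one shown in Fig. 9. The child rules follow the θ sub-steps: a lane in column x feeds the column parity C[x], and C[x] feeds the two values D[x − 1] and D[x + 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    stage: int   # stage that produced the variable (0 = round input)
    index: int   # position of the variable within that stage
    value: int   # payload, one g-bit word


def children_theta_1a(packet):
    """A lane with index x + 5y contributes only to the column parity C[x]."""
    return [(1, packet.index % 5)]


def children_theta_1b(packet):
    """C[x] is an argument of both D[x + 1] and D[x - 1] (indices modulo 5)."""
    x = packet.index
    return [(2, (x + 1) % 5), (2, (x - 1) % 5)]


lane = Packet(stage=0, index=7, value=0x0123456789ABCDEF)
parity = Packet(stage=1, index=0, value=0x1)
print(children_theta_1a(lane), children_theta_1b(parity))
```

Because the child set is a fixed function of the parent ID for each stage, a stage processor can regenerate it locally instead of receiving it in the packet header.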
Based on the generated child ID, the processor checks the token counter, which tracks how many tokens a variable has received against the count required before the variable can fire. The token counter is a mechanism developed to keep a record of the operations of the system. One requirement of the memory in this system is to serve as temporary storage for the variables, so that whenever a processor needs to update a variable in a memory it can retrieve it and operate on it. When a processor checks the token count of a variable and finds that it has reached the threshold, it fires the variable by writing the packet to a memory and signaling to the adjacent receiving processor that the variable is ready to be read. These design criteria make our processor a special-purpose processor.
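The token-counter firing rule can be sketched in Python as follows. This is a simplified software model under stated assumptions, not the hardware design: the class name `TokenCounterNode`, the use of XOR as the update operation, and the threshold of five tokens are all hypothetical choices made for illustration.

```python
class TokenCounterNode:
    """Toy model of a processor that fires a variable at a token threshold."""

    def __init__(self, threshold: int):
        self.threshold = threshold  # tokens required before firing (assumed)
        self.partial = {}           # child ID -> partially updated value
        self.tokens = {}            # child ID -> tokens received so far
        self.fired = []             # stands in for the output memory

    def receive(self, child_id, value: int) -> bool:
        """Update a child variable with one token; return True if it fires."""
        # Retrieve the stored partial value, update it, and store it back.
        self.partial[child_id] = self.partial.get(child_id, 0) ^ value
        self.tokens[child_id] = self.tokens.get(child_id, 0) + 1
        if self.tokens[child_id] == self.threshold:
            # Threshold reached: write the finished variable to memory.
            self.fired.append((child_id, self.partial.pop(child_id)))
            return True
        return False


node = TokenCounterNode(threshold=5)
results = [node.receive(("C", 0), v) for v in [1, 2, 4, 8, 16]]
print(results)     # [False, False, False, False, True]
print(node.fired)  # [(('C', 0), 31)]
```

In this sketch the variable fires only on the fifth token, regardless of the order in which the five tokens arrive.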
E. SHA-3 PROPOSED DMC ARCHITECTURE
The best way to implement this design on a system-on-chip (SoC) is to build the cryptoprocessor around a network and memory architecture. Fig. 10 shows a high-level diagram of our SHA-3 DMC architecture. The diagram consists of seven processors, seven memories, and seven routers. The double arrows in the architecture indicate that each memory is written by only one processor and read by a pair of processors. A memory therefore does not belong to a single processor but to a pair of adjacent processors. The processors also need to communicate with each other, so on the SoC we connect them through routers arranged as the ring-shaped network-on-chip (NoC) shown in Fig. 10. We implemented this architecture with unidirectional routing. We adopt a ring architecture because it suits algorithms built from round functions: in a round-based algorithm, the output of the last function is fed back as an input to the algorithm, and the ring provides this loop-back naturally.
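A minimal Python sketch of the unidirectional ring and of the memory sharing rule is given below. The stage numbering 0 to 6 and the function names are hypothetical, and the sketch assumes that memory k is written by processor k and read by processors k and k + 1 (modulo seven), which is one reading of the description above rather than a detail confirmed by Fig. 10.

```python
NUM_STAGES = 7  # seven processors, seven memories, and seven routers


def next_hop(node: int) -> int:
    """Single-direction ring: each router forwards to its successor."""
    return (node + 1) % NUM_STAGES


def route(src: int, dst: int) -> list:
    """Return the list of routers visited from src to dst on the ring."""
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1]))
    return path


def memory_readers(mem: int) -> tuple:
    """Assumed sharing rule: the writer and its successor may read."""
    return (mem, next_hop(mem))


print(route(2, 5))        # [2, 3, 4, 5]
print(route(6, 0))        # [6, 0]
print(memory_readers(6))  # (6, 0)
```

The wrap-around from stage 6 to stage 0 is what feeds the output of one round back into the first stage of the next round.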
Since communication is packet based, we have two design alternatives at this stage. The first is to add the children's IDs to the packet, which increases the packet size. The second exploits the fact that each processor implements specific functions and therefore knows how to build the children on the fly. We chose the second option to reduce the packet size: each packet is sent in a single transmission rather than as a multi-word serial transmission.
Although the design looks like a pipelined system, it is not one, since data are exchanged through routers. A pipelined system has no routers, and data propagate without routing. The memory reading sequence of each round can be executed with a different scheduling policy. This strategy means that each round must be fully processed and stored in memory before the next round starts, which makes the design somewhat of a round-pipelined system.
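The independence of the result from the memory reading sequence can be checked with a small Python sketch. It is an illustration under assumptions, not the authors' model: it uses the column parity of a 5 × 5 state (as in the θ step of SHA-3) as the example computation, random 64-bit lanes as test data, and a hypothetical function `column_parities` that counts five tokens per column.

```python
import random


def column_parities(state, read_order):
    """XOR the lanes of each column in the given read order.

    Returns the five parities and the step at which each column fired.
    """
    parity = [0] * 5
    tokens = [0] * 5
    fire_step = {}
    for step, (x, y) in enumerate(read_order):
        parity[x] ^= state[(x, y)]
        tokens[x] += 1
        if tokens[x] == 5:  # all five lanes of column x have arrived
            fire_step[x] = step
    return parity, fire_step


rng = random.Random(0)  # fixed seed so the sketch is repeatable
state = {(x, y): rng.getrandbits(64) for x in range(5) for y in range(5)}

sequential = [(x, y) for x in range(5) for y in range(5)]
shuffled = sequential[:]
rng.shuffle(shuffled)

p_seq, t_seq = column_parities(state, sequential)
p_rnd, t_rnd = column_parities(state, shuffled)
print(p_seq == p_rnd)  # True: same values under both reading sequences
```

Only the firing times depend on the reading sequence; the computed values do not, which is the property that allows the scheduling policy to change from round to round.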
VIII. IMPLEMENTATION RESULTS AND RELATED WORK
Several state-of-the-art hardware architectures have been developed to implement the SHA-3 algorithm [19]-[30]. Hardware implementations target application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) to obtain real-time results. FPGA-based designs are preferable since their performance approaches that of ASICs while being more flexible and less costly.
Despite the lack in the literature of similar architectural techniques for implementing the SHA-3 algorithm, we provide a comparison between this work and previously reported implementations. The 3-D representation of the SHA-3 state allows hardware designers to use different approaches to carry out the computations of the algorithm. Some implementations are lane-wise, such as those in [19] and [20]. An alternative technique is slice-wise computation, which was introduced in [21]; further efforts were made by other researchers to improve the throughput and reduce the area in [23]-[25]. Another approach combines lane-wise and slice-wise computations into a unified design [25]. Other implementation approaches reported in the recent literature focus on high throughput [26], [27], or on utilizing embedded FPGA hardware resources such as look-up tables (LUTs) [20], [28], block RAM (BRAM) [19], [29], and digital signal processing (DSP) slices [30]. Owing to the flexibility of our computational model, our design could be implemented with slice-wise, lane-wise, or combined slice-wise and lane-wise computations, depending on the scheduling function.
Table 3 shows the results of the proposed DMC, described in VHDL and synthesized with the Xilinx ISE v14.3 tool. The targeted FPGA devices are from the Virtex-6 and Kintex-7 families [31], [32]. The throughput in this work is estimated according to the following equation:
$$\text{Throughput} = w \times f$$
where w is the processor word size and f is the operating frequency. In our implementation, we used w = 24 bits and f = 200 MHz for the chosen FPGA device. This gives a throughput of 4.8 Gbps.
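The reported figure can be checked with a short Python calculation. The sketch assumes that the throughput estimate is the product of the word size and the operating frequency, which is consistent with the values of w, f, and the 4.8 Gbps result quoted above; the function name is hypothetical.

```python
def throughput_bps(word_bits: int, freq_hz: int) -> int:
    """Estimated throughput in bits per second (assumed to be w * f)."""
    return word_bits * freq_hz


w = 24            # processor word size in bits
f = 200_000_000   # operating frequency in Hz (200 MHz)

print(throughput_bps(w, f) / 1e9)  # 4.8 (Gbps)
```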
The table also compares our results with published results for conventional implementations of the SHA-3 algorithm. Our implementation uses seven BRAMs and an adequate number of logic slices. The design gives decent results compared with previously reported implementations. This initial study indicates that the hardware and the clock speed can be optimized further. Nevertheless, our design choice has a significant advantage: it can randomize the execution of the operations without retiming, redesigning, or reprogramming.
This hardware implementation relies on dataflow concepts to compute concurrent asynchronous applications. We have applied this computing model to implement the SHA-3 cryptographic algorithm. Besides the hardware implementation, we developed object-oriented MATLAB models to verify and validate the correctness of this computing model; that work is currently being prepared for publication.
IX. CONCLUSION
In this work, we proposed a methodology for implementing algorithms on dataflow platforms. The methodology can be applied to multi-threaded platforms, multi-core platforms, or a combination of the two. It is based on a novel dataflow graph (DFG). The DFG is useful for:
1) Describing the algorithm by identifying its input, output, and internal variables. The description also identifies the operations or functions required to carry out the algorithm. Most importantly, it identifies the data dependences, which helps in studying the degree of parallelism of the algorithm as per [12, Ch. 8].
2) Simulating the algorithm performance under different system parameters such as the number of cores, the memory architecture (shared vs. distributed), and the input data scheduling strategies. The resulting performance parameters could include system throughput, delay, latency, and processor activity. This last feature proves very useful for studying the vulnerability or immunity of the algorithm to side-channel attacks.
3) Designing concurrent asynchronous systems. The design is guided by the DFG, which identifies the maximum number of cores, the mapping of the algorithm variables to memory modules, and the mapping of the algorithm functions to the cores. The DFG also helps identify the possible input data scheduling choices.
