I. INTRODUCTION
ALTHOUGH high performance computer technology has progressed remarkably in the last decade, computer performance is still insufficient for industrial applications of electromagnetic field simulations. Electromagnetic field simulations are frequently used for product design in industry. In that case, the computer platform for the numerical simulation is often provided together with design software, such as CAD installed on a standard PC, and therefore electromagnetic field simulations are not always carried out on high-end computers. In addition, the electromagnetic field simulations themselves require very long computation times, even on the highest-end computers such as supercomputers or graphics processing unit (GPU) clusters, when very large numerical models with a grid space of more than 1000 × 1000 × 1000 are simulated. Accordingly, for effective use of electromagnetic field simulations in industry, portable high performance computation technologies that operate alongside the standard PCs used for CAD are required.
As one solution to this requirement for portable high performance computing (HPC) in the field of microwave simulations, dedicated computers for the finite difference time domain (FDTD) method were presented [1]-[9]. By constructing hardware circuits and a memory access architecture optimized for the FDTD scheme on a reconfigurable large-scale integration (LSI) device such as the field programmable gate array (FPGA), high performance computation machines can be realized at much smaller size and lower cost than supercomputers or GPU clusters. Indeed, various hardware architectures for the FDTD scheme were proposed and implemented in real hardware such as FPGAs, and the designed hardware was shown to operate correctly. However, the FDTD dedicated hardware could not achieve sufficiently higher performance than high-end PCs or GPU computers [9]. In those FDTD dedicated computers, the FDTD scheme and the memory accesses themselves were executed efficiently by the dedicated hardware architecture, but the clock frequency of the FPGA, which was <100 MHz, was too low compared with that of PCs and GPUs. Moreover, the FDTD calculations in those dedicated computers are basically carried out grid by grid, which is intrinsically the same procedure as the software processing on PCs and GPU computers. Accordingly, to achieve a portable HPC that can be used practically in industry, we need to adopt a much more advanced architecture for the FDTD dedicated computer, one that exploits the parallelism hidden in the FDTD scheme. As one possibility for such extremely high performance computation, a dataflow architecture FDTD machine was proposed [10]. However, that dataflow architecture was designed only for the 2-D FDTD method owing to the limited circuit size of the FPGAs of those days.
In this paper, building on the recent remarkable progress of LSI technologies, a conceptual design of a 3-D FDTD dedicated computer with the dataflow architecture is presented, aiming at extremely high performance computation for microwave simulations.
II. DATAFLOW ARCHITECTURE FDTD DEDICATED COMPUTER
The dataflow architecture for the FDTD dedicated computer itself was proposed for 2-D microwave simulations in [10]. However, the hardware size and the number of I/O pins of the FPGAs of those days were insufficient for a 3-D FDTD machine. Taking advantage of the recent remarkable progress of FPGA technologies, we here present a detailed design of a 3-D FDTD dataflow machine. Fig. 1 shows the configuration of the 3-D FDTD dedicated computer with the dataflow architecture. Data registers are allocated in the 3-D grid space in the same manner as the electromagnetic field components of Yee's grid, and the registers in the lowest layer are connected to each other by digital circuits, which execute the FDTD scheme for three components in a single clock [Fig. 1(a)]. After the circuit operation of the FDTD scheme has been executed, the register data in the entire region are shifted down cyclically by one layer [Fig. 1(b)]. By repeating this process for all vertical layers for both the electric and magnetic fields, one time step of the FDTD calculation for the entire grid space is executed in 4 × Z clock cycles (Z is the number of vertical layers). In this architecture, there are no memory accesses, which are the biggest overhead in von Neumann architecture machines, and its performance is purely proportional to the number of grids in the horizontal plane and the system clock frequency.
0018-9464 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
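The update-then-shift cycle described above can be mimicked in software. The following is a minimal behavioral sketch, not the authors' circuit: the names `run_pass`, `update_layer`, and `clocks_per_time_step` are hypothetical, and the layer update is treated as a function of the bottom layer alone, whereas the real FDTD update also reads neighboring layers. The factor 4 is simply the figure quoted in the text.

```python
from collections import deque


def run_pass(layers, update_layer):
    """Apply update_layer to every layer once by cycling the stack past the bottom slot.

    layers[0] is the lowest (arithmetic) layer. Returns the updated layers in
    their original order and the number of clock cycles consumed.
    """
    stack = deque(layers)
    clocks = 0
    for _ in range(len(layers)):
        # One clock: the arithmetic circuits update every cell of the bottom layer at once.
        stack[0] = update_layer(stack[0])
        # Cyclic shift down by one layer: the bottom layer wraps around to the top.
        stack.rotate(-1)
        clocks += 1
    return list(stack), clocks


def clocks_per_time_step(z_layers, passes_per_step=4):
    """Clock cycles for one FDTD time step; the factor 4 is the figure quoted in the text."""
    return passes_per_step * z_layers
```

After Z clocks every layer has visited the bottom slot exactly once and the stack is back in its original order, which is why the cost of a pass depends only on Z and not on the number of cells in a horizontal layer.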
III. DETAILED SPECIFICATION OF FDTD DATAFLOW MACHINE FOR PRACTICAL SIMULATION
One of the most important requirements in the development of dedicated hardware is flexibility with respect to various simulation conditions. For example, the hardware must be usable in common for various numerical models, including complicated 3-D shapes and arbitrary distributions of material constants, without any modification of the machine hardware. In addition, flexible allocation of the absorbing boundary condition (ABC) should be possible without any hardware modification. To satisfy these conditions, we need to construct the digital circuit so that it operates in both modes, as a normal grid and as a perfectly matched layer (PML) grid. One more requirement for the dataflow architecture machine is reduction of the hardware size, because this architecture leads to a huge hardware size compared with other conventional architectures.
We use the following form of Maxwell's equations, which applies to the most general cases including the PML grids:
Introducing the following modified unknowns:
we obtain the following normalized Maxwell's equations:
where $c_0$, $\varepsilon_0$, $\mu_0$, $\varepsilon_r$, and $\mu_r$ are the velocity of light, permittivity, and permeability in vacuum, the relative permittivity, and the relative permeability, respectively. In addition, discretization of (3) on the grid space yields the FDTD formulation for the 
where
where $p_z$ is a power input signal and $\varepsilon_{rz}$, $\mu_{rz}$, and $\sigma_{rz}$ are the relative permittivity, relative permeability, and conductivity for $e_z$, respectively. Courant's condition is invoked in (4) as 
Note that all the terms in (4) and (5) have values of the same order owing to the normalization (2), since $C_{z1}$ and $C_{z2}$ are smaller than unity owing to (5). This implies that we can avoid the standard floating point representation of binary numbers and calculate (4) and (5) using fixed point representations, which require much less hardware for storage than floating point representations (4 bytes for single precision and 8 bytes for double precision). Accordingly, the digital circuit size can be reduced effectively.
To achieve flexible operation of the FDTD dedicated computer, the detailed circuit of the arithmetic grids of Fig. 1(a) is designed as in Fig. 2. The initial field values $e_x$, $e_y$, $e_z$, $b_x$, $b_y$, $b_z$, the material constants $C_{x1}$, $C_{x2}$, $C_{y1}$, $C_{y2}$, $C_{z1}$, $C_{z2}$, and two 1-bit flags, normal/PML and PEC (perfect electric conductor)/vacuum, for all 3-D grids are downloaded from the host PC to the FDTD machine before the calculation starts. When this downloaded information is specified appropriately for each grid, the circuit of Fig. 2 automatically executes (4) or (6). That is, when the PEC/vacuum bit is 0, the final result of the circuit is cleared, which means that the corresponding grid is a perfect electric conductor. When the PEC/vacuum bit is 1, which corresponds to a vacuum or material grid, the final result of the circuit is stored in the register. In addition, if the normal/PML bit is set to 0, the circuit of Fig. 2 executes (4); otherwise, the circuit executes (6). The conceptual operations of the normal and PML grids on the same hardware circuit are shown in Fig. 3(a) and (b). Accordingly, the FDTD dataflow machine automatically simulates microwave phenomena for PEC scatterers of any shape and for any distribution of material constants, including the PML ABC, simply by specifying the appropriate download information from the host PC, without any modification of the hardware circuit.
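The two ideas in this section, fixed point arithmetic for values of order unity and flag-controlled selection of the grid result, can be illustrated with a small software model. This is a sketch under stated assumptions rather than the circuit of Fig. 2: the 16-bit word matches the register width reported for the VHDL design in Section V, but the split into 14 fractional bits, the saturation behavior, and all function names are hypothetical. The results of (4) and (6) are passed in as plain numbers because the sketch does not reproduce those equations.

```python
WORD_BITS = 16   # register width reported for the VHDL design
FRAC_BITS = 14   # assumed number of fractional bits (not given in the text)


def saturate(value):
    """Clamp an integer to the signed WORD_BITS range."""
    low = -(1 << (WORD_BITS - 1))
    high = (1 << (WORD_BITS - 1)) - 1
    return max(low, min(high, value))


def to_fixed(x):
    """Convert a real number of order unity to the assumed fixed point format."""
    return saturate(round(x * (1 << FRAC_BITS)))


def to_float(value):
    """Convert a fixed point integer back to a real number."""
    return value / (1 << FRAC_BITS)


def fixed_mul(a, b):
    """Multiply two fixed point numbers, e.g. a coefficient below unity and a field value."""
    return saturate((a * b) >> FRAC_BITS)


def grid_result(normal_value, pml_value, normal_pml_bit, pec_vacuum_bit):
    """Select the value written back to the register from the two 1-bit flags.

    normal_pml_bit: 0 selects the normal-grid result, 1 selects the PML result.
    pec_vacuum_bit: 0 clears the result (PEC grid), 1 keeps it (vacuum or material).
    """
    selected = pml_value if normal_pml_bit else normal_value
    return selected if pec_vacuum_bit else 0
```

Because the flags are data rather than wiring, changing the geometry or the PML placement only changes what the host downloads, which is the flexibility argued for above.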
IV. MODULE STRUCTURE OF FDTD DATAFLOW MACHINE
The 3-D FDTD dataflow machine was designed in VHDL, which is a hardware description language (HDL). For flexibility in constructing the FDTD grid space, the VHDL program has the hierarchical structure shown in Fig. 4. The arithmetic grid layers, which are located at the lowest level of the 3-D grid space, consist of three layers: since the FDTD calculation uses the neighboring grids in all directions, the upper and lower grids must be closely connected with the FDTD calculation circuit [Fig. 4(a)]. The upper layers above the three arithmetic layers consist of sets of registers [Fig. 4(b)], in which the field values, material constants, and two 1-bit flags for the grid properties are stored and shifted down vertically. Combining these circuit grids, a unit vertical grid array is constructed as in Fig. 4(c), and the entire machine of Fig. 1(b) is built up by connecting this unit vertical grid array horizontally in both the x- and y-directions.
V. NUMERICAL EXAMPLES
The circuit of the 3-D FDTD dataflow machine designed in VHDL was tested using a numerical example of a rectangular waveguide that includes a metal tuning screw and a metal block (Fig. 5). The entire grid space is 12 × 7 × 40 in the x-, y-, and z-directions, respectively. The PML (four layers) is allocated at both ends in the z-direction, and the other outer boundaries are assumed to be PEC. The power input is imposed on the x-y plane at a distance of 10 grids from the PML as a continuous TE10 mode signal of the $E_y$ component. Although the numerical model is simple and small, it is sufficient to confirm the correct operation of the designed circuit, since all functions of the dedicated computer are invoked in this model. Fig. 6 shows the distributions of the electric field amplitude on a middle vertical plane (Fig. 5) at the 63rd time step, calculated by a C software simulation [Fig. 6(a)] and by virtual operation of the FDTD dataflow machine in a VHDL logic simulation [Fig. 6(b)]. In the VHDL design of the FDTD dataflow machine, all field values are stored in fixed point representation in 16-bit registers. Good agreement is found in Fig. 6 even though such a low-resolution 16-bit data format is employed, which confirms that the designed circuit will operate correctly. If we assume that the FDTD machine operates at a 50 MHz clock, the machine performance is estimated as 1 G cell/s, which exceeds the typical performance of a single GPU, 240 M cell/s [9].
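The 1 G cell/s estimate follows from the cycle count given in Section II: one time step updates every cell of the grid in 4 × Z clocks, so Z cancels and the rate depends only on the horizontal grid size and the clock. The short sketch below reproduces that arithmetic; the function name and parameter names are hypothetical.

```python
def cells_per_second(nx, ny, nz, clock_hz, clocks_per_layer=4):
    """Cell updates per second for an nx x ny x nz grid on the dataflow machine.

    One time step updates nx * ny * nz cells and takes clocks_per_layer * nz
    clock cycles, so nz cancels out of the result.
    """
    cells_per_step = nx * ny * nz
    clocks_per_step = clocks_per_layer * nz
    return cells_per_step * clock_hz / clocks_per_step
```

For the 12 × 7 × 40 test model at 50 MHz this gives 1.05 × 10^9 cell updates per second, consistent with the 1 G cell/s quoted above and about 4.4 times the 240 M cell/s GPU figure.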
VI. CONCLUSION
In this paper, a dataflow architecture dedicated computer for the 3-D FDTD scheme has been presented, aiming at portable HPC for microwave simulations. It has been shown that efficient processing of the FDTD scheme can be achieved by fully exploiting the parallelism hidden in the FDTD scheme and eliminating memory accesses. The proposed 3-D dataflow architecture machine for the FDTD scheme was designed in VHDL, and its correct operation was confirmed by logic simulations.
With a view to implementation on a small FPGA, the hardware size of a single set of three arithmetic grids [Fig. 4(a)] is roughly estimated as 3300 logic elements (LE). For example, it is estimated that a highest-end FPGA with 4 M LE can accommodate an x-y grid space of at least 32 × 32, which means that its performance will be about 12 G cell/s, 50 times higher than the typical GPU performance. For practical applications, more complicated and larger numerical models have to be simulated. For such tasks, further efforts to reduce the hardware circuit size and consideration of a parallel computation system will be undertaken in the near future.
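The resource and performance figures in this paragraph can be checked with a few lines of arithmetic. The sketch below is only a rough budget: it counts the 3300 LE arithmetic units alone and ignores the registers of the upper layers, it assumes the same 50 MHz clock as in Section V (the paragraph does not restate the clock), and the function names are hypothetical.

```python
import math

LE_PER_ARITHMETIC_UNIT = 3300   # estimate quoted in the text
GPU_CELLS_PER_SECOND = 240e6    # typical single-GPU figure quoted from [9]


def max_square_grid(total_le, le_per_unit=LE_PER_ARITHMETIC_UNIT):
    """Largest n such that an n x n array of arithmetic units fits in total_le."""
    units = total_le // le_per_unit
    return math.isqrt(units)


def horizontal_throughput(nx, ny, clock_hz, clocks_per_layer=4):
    """Cell updates per second for an nx x ny horizontal plane (4 clocks per layer)."""
    return nx * ny * clock_hz / clocks_per_layer
```

With 4 M LE the budget admits at most a 34 × 34 array of arithmetic units, so the 32 × 32 grid quoted above fits with some margin. At the assumed 50 MHz, a 32 × 32 plane gives 12.8 × 10^9 cell updates per second, roughly 53 times the GPU figure, in line with the "about 12 G cell/s" and "50 times" stated in the text.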
