The paper presents an approach to the design and construction of central processing units for programmable logic controllers implemented on an FPGA development platform. The presented units are optimised for minimum response time and throughput time. The CPU structure is based on a bit-word architecture and two methods of control data exchange: with handshaking, where control data are passed through two flip-flop units with acknowledgement, and without handshaking, where control data are passed through a dual port RAM. A third unit, a simple single-processor design, was built for comparison with the other two. The paper also presents a specific hardware solution for timers and counters, together with implementation results showing how many FPGA resources the presented units use.
INTRODUCTION
One of the main parameters of a Programmable Logic Controller (PLC) is its scan time, defined as the execution time of one thousand control commands. The CPU should therefore have an architecture that enables fast execution of the control program, which makes its design and construction a very important task. Most PLC CPUs delivered by well-known manufacturers are constructed as multiprocessor units, in which each processor executes the tasks assigned to it. In this way one obtains a unit in which several processors can operate concurrently. For such a CPU the main problems are how to assign tasks to particular processors and how to find a CPU structure able to execute the assigned tasks in practice, as shown by (Michel, 1990). Another important problem, inseparable from the hardware, is the programming tools. Those tools should enable easy and efficient creation of the control algorithm, and the programming toolbox should take advantage of all aspects of the multiprocessor unit.
Apart from instruction execution time, the access time to internal (markers, counters, timers) and external (inputs and outputs) resources is a very important parameter. Another parameter which characterises a PLC is throughput time, defined as the response time to a change of object signals. From the point of view of the controlled object this is the most important parameter, because it describes the quality of control, which derives directly from the central processing unit and the programming toolbox (Chmiel, 2008).
PLCs mainly control processes of a binary nature. In some cases they are used for mixed control that includes analogue signals (Koo et al., 1998). In many controlled objects the control can be divided into independent tasks. The boundaries of these tasks are often determined by their analogue or binary nature, as well as by the set of process signals and control conditions. This observation leads to the conclusion that a bit-word structure of the PLC CPU matches the typically processed data well. Such a CPU structure is often optimised both for very fast logic operations and for execution of complicated arithmetic operations (including floating point). To benefit from this architecture, both processors must work in parallel and as independently as possible, which requires equipping them with specific hardware and software solutions.
The most effective and natural approach to the problem of task assignment is partitioning by operation type (bit or word). The tasks operating on discrete inputs and outputs are executed by a bit processor (Getko, 1983). Nowadays such processors may be implemented in programmable structures like CPLDs or FPGAs, which shortens user program execution time (controller speed-up). The word processor, on the other hand, is built on the basis of a standard microprocessor or embedded microcontroller. It is used for word data processing in the control of analogue objects, for numeric data processing and for operating system maintenance of the PLC (networking, diagnostics, control loop) (Donandt, 1989; Aramaki et al., 1997).
As mentioned above, an efficient and very promising platform for control unit implementation is one based on programmable logic devices. This platform may be based on Field Programmable Logic (FPL), especially Field Programmable Gate Arrays (FPGAs). System architects are offered powerful tools which keep financial and time outlays acceptable in relation to the results achieved. FPL enables easy prototyping, testing and evaluation of different solutions.
FPGA PLATFORM: HARDWARE AND SOFTWARE SOLUTION
High-density FPGA devices offer a platform that enables different approaches to constructing a PLC CPU. The CPU can be constructed from off-the-shelf CPU IP cores, which can be compatible with standard microprocessors or microcontrollers. An alternative is to design a custom CPU from scratch. Such a CPU can be designed to satisfy the requirements of the application, which are reflected in the instruction set and interface operation. This approach is much more laborious, but the results appear to be closer to optimal.
The authors have decided to use a development board with a Xilinx Virtex-4 (Xilinx, 2006) to perform experiments. Virtex-4 logic resources are sufficient to implement and evaluate a dedicated PLC CPU (processors and required peripherals) in FPGA structures. It must be mentioned that the implemented central processing units work in the classical manner. The central processing unit executes instructions in a serial-cyclic manner, as opposed to the parallel, application-specific hardware processing of a ladder diagram which is possible in reconfigurable logic devices (Ichikawa et al., 2006).
In prior research a comparison of basic structures was carried out (Chmiel et al., 2010). Three different structures were designed. The VHDL hardware description language was used for the design (Skahill, 2004).
The following structures were evaluated:
 dual processor where one processor waits for the results from the other;
 dual processor with fully asynchronous operation execution (no synchronisation between processors -no waiting for each other);
 single processor executes bit and word instructions.
Ideas presented in the cited papers were used to build the bit-word structure of the CPU. Those papers also present different ideas for concurrent execution of instructions, as well as a processor synchronisation mechanism based on common data dependencies. The units with fully concurrently operating processors were used in the experiments. Information between the processors was exchanged in two alternative ways:
 by means of flags written to flip-flops equipped with a handshake mechanism; each processor writes one flag, which is made available to the opposite processor for reading (Fig. 1);
 by means of an exchange memory implemented in dual port RAM (Fig. 2); one side has full access to the memory, while the opposite side can only read it.
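The two exchange mechanisms can be illustrated with a small behavioural model. The following Python sketch is only an illustration under stated assumptions: the class and method names are hypothetical, and the exact handshake protocol (a writer is refused until the reader has acknowledged the previous value) is assumed, since the text only states that the flip-flops are equipped with acknowledgement.

```python
class HandshakeFlag:
    """One-bit flag with acknowledgement (assumed protocol, hypothetical names)."""

    def __init__(self):
        self.value = False
        self.pending = False  # set by the writer, cleared when the reader acknowledges

    def write(self, value):
        """Writer side: refuse a new value until the previous one has been read."""
        if self.pending:
            return False
        self.value = bool(value)
        self.pending = True
        return True

    def read(self):
        """Reader side: return the pending value and acknowledge it, or None."""
        if not self.pending:
            return None
        self.pending = False
        return self.value


class ExchangeMemory:
    """Dual port memory model: port A reads and writes, port B only reads."""

    def __init__(self, size):
        self._cells = [0] * size

    def port_a_write(self, addr, value):
        self._cells[addr] = value

    def port_a_read(self, addr):
        return self._cells[addr]

    def port_b_read(self, addr):
        # No write method is offered on port B, mirroring the read-only side.
        return self._cells[addr]
```

In this model the handshake variant can stall the writer, whereas the memory variant never blocks either side, which is consistent with the distinction between waiting and fully asynchronous operation drawn above.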
In order to exploit the specific features of the designed units, a specialised compiler was developed. Assembly language was recognised as the most suitable source language for the compiler. It may be compared to Siemens' STL languages for the S7-300/400 (Berger, 2001) and S7-200 (Siemens, 2009) (Chmiel et al., 2010).
From the point of view of the experiments, the possibility of introducing new commands is very important. Developing new commands affects the processor hardware, because each new command means that new functionality must be modelled in a hardware description language. Finally, the new structure has to be synthesised and implemented in the target architecture.
The assembler is able to process macros. Macros enable the creation of sequences of instructions that perform a specific operation (e.g. configuring I/O units or timer/counter units). Using macros simplifies writing programs and increases the level of abstraction.
PROCESSORS STRUCTURES
For experimental and evaluation purposes, three structures of central processing unit have been designed, described in VHDL and finally implemented. Two dedicated processors have been designed: for bit operations and for word operations. The third processor has been developed as a general processor equipped with word and bit operations. It has not been equipped with additional hardware support allowing for hybrid (bit-word) multiprocessor operation. 
Word Processor Hardware Implementation
A CPU designed around a standard word processor has reduced word-operation performance in comparison to a dedicated, custom-designed bit processor. An attempt has therefore been made to implement a word processor optimised for controller applications. For reference purposes, features implemented in Siemens' S7-300/400 PLC series have been used. In these controller families the designers have assumed that each instruction can have only one argument. This assumption (or design constraint) requires implementing two accumulator registers for word operations. One of them (ACCU_A) is the default accumulator register. The block diagram of the designed word processor, which takes into account the experience of Siemens and our own research, is depicted in Fig. 3.
 Auxiliary registers store data required for proper operation of the unit and enable co-operation of the processing unit with other system components.
Word Processor Instruction List
Fig. 4 shows schematically the data flow for arithmetic and logic operations on sets of bits.
 Counter and timer instructions allow for initialisation of counter content and for incrementing, decrementing and clearing the content of counter or timer registers;
 I/O space and process memory configuration.
Bit Processor Hardware Implementation
The bit processor has been designed to perform logic operations quickly. In order to unify the construction of the word and bit processors, some description and design concepts have been borrowed from the previously developed word processor. Using VHDL for design purposes enables flexible description and easy modification of functionality (Skahill, 2004). The general construction of the bit processor was derived from the word processor, with some simplifications made possible by the specific nature of logic bit operations. The main differences are:
 Accumulator size reduced from 16 bits to 1 bit. The Ac_a register is the default target register for all operations;
 Simplified ALU restricted to logic operations only; it is more properly called a Logic Unit (LU);
 Reduced number of auxiliary registers;
 Bit co-processor has been removed as no longer required in this structure;
 An 8-bit bus is used for data transfer instead of a 16-bit bus.
Bit Processor Instruction List
The instruction list of the bit processor covers the following operations:
 Data transfer instructions allow exchanging data between the I/O space, marker memory and inter-processor data memory;
 Logic operations are performed on the accumulator's content. The result is placed in the default location (Ac_a). An embedded result stack allows for nested operations and for easy maintenance of operation order;
 Reading of binary/Boolean output of counters and timers;
 I/O space and process image memory configuration.
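The role of the one-bit accumulator and the result stack can be illustrated with a toy interpreter. This is a minimal sketch, not the authors' instruction set: the mnemonics (`LD`, `AND`, `OR`, `NOT`, `PUSH`, `ANDS`, `ORS`, `ST`) and the bit names are hypothetical and chosen only to show how a stack supports nested expressions.

```python
def run_bit_program(program, bits):
    """Execute a list of (mnemonic, *operands) tuples on a dict of named bits.

    All mnemonics are hypothetical; only the one-bit accumulator and the
    result stack correspond to features named in the text.
    """
    acc = False   # one-bit accumulator, the default target of every operation
    stack = []    # result stack for nested expressions
    for op, *args in program:
        if op == "LD":
            acc = bool(bits[args[0]])
        elif op == "AND":
            acc = acc and bool(bits[args[0]])
        elif op == "OR":
            acc = acc or bool(bits[args[0]])
        elif op == "NOT":
            acc = not acc
        elif op == "PUSH":    # save an intermediate result
            stack.append(acc)
        elif op == "ANDS":    # combine the saved result with the accumulator
            acc = stack.pop() and acc
        elif op == "ORS":
            acc = stack.pop() or acc
        elif op == "ST":
            bits[args[0]] = acc
        else:
            raise ValueError(f"unknown instruction: {op}")
    return bits


# Q0 = (I0 OR I1) AND (I2 OR I3)
PROGRAM = [
    ("LD", "I0"), ("OR", "I1"), ("PUSH",),
    ("LD", "I2"), ("OR", "I3"), ("ANDS",),
    ("ST", "Q0"),
]
```

The `PUSH` and `ANDS` pair plays the role that parenthesised operations play in STL-like languages: the first bracket's result is parked on the stack while the second bracket is evaluated in the accumulator.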
Single Bit/Word Processor
The single bit-word processor has been implemented using components that enable implementation of word and bit operations. This architecture differs from typical microprocessors. The instruction list has been carefully selected to support operations performed by the PLC.
Timer and Counter Hardware Implementation
Timers and counters are accessed by both the bit and the word processing units. The method of access to the timers and counters unit must be considered carefully in order to achieve high-speed program execution. The typical task assignment is as follows: the word processor controls and maintains the operation of the unit, while the bit processor mainly uses the computed results (Chmiel, 2008). The proposed architecture minimises the processor load for servicing timers and counters. The FPGA enables implementation of dedicated timer and counter units, which have been designed to operate with both single and dual processor units. The timers and counters are not an integral part of either the word or the bit processor; they are autonomous units that operate under the configuration control of the word processor. They only require initialisation for proper operation, which can be performed during system start-up. The results of their operation (the actual states of timers and counters) are available to both processors through dedicated bit outputs.
The timers unit has been equipped with 16 timers called T0 to T15. The functionality available to users of the designed controller should be compatible with commercially available solutions, so the operating modes are identical to those of the timers available in the Simatic S7-200 (Siemens, 2009). The following operation modes for timers have been implemented:
 TON -on-delay timer;
 TONR -retentive on-delay timer;
 TOF -off-delay timer.
All timers can operate with resolutions of 1 s, 100 ms, 10 ms and 1 ms, programmed individually in the time base unit. The maximum count is 16383 reference signal pulses.
Each timer and counter has an individual triggering input and output. This allows for direct access to timers and counters, reducing the system bus load. It also provides simultaneous access to the timers and counters by the word and bit processing units.
All timers share common resources in the form of RAM that stores information about operation mode, time resolution and initial state. During normal operation, timers are updated sequentially in 16 cycles. Each timer update cycle consists of the following operations:
1. The configuration memory content of the processed timer is transferred to the operating register.
2. The timer content is updated only if the time base unit signals that a time interval has elapsed or the clear request flag has been set.
3. Based on the result of the update procedure, the output state of the timer is determined.
4. The timer state is written back from the operating register to the unit memory.
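The four steps of the update cycle can be sketched as a software model. This is a simplified illustration, not the VHDL implementation: the record layout and function names are hypothetical, only the TON and TONR modes are modelled (TOF is left out for brevity), and the mode behaviour follows the usual S7-200 semantics, in which TON resets when disabled and TONR retains its count.

```python
MAX_COUNT = 16383  # maximum count stated in the text


def make_timer(mode, base, preset):
    """Hypothetical per-timer record kept in the shared timer memory."""
    return {"mode": mode, "base": base, "preset": preset,
            "count": 0, "enable": False, "clear": False, "output": False}


def next_count(reg):
    """Count update for TON and TONR (assumed S7-200-like behaviour)."""
    if reg["enable"]:
        return min(reg["count"] + 1, MAX_COUNT)
    if reg["mode"] == "TON":
        return 0             # on-delay timer resets when disabled
    if reg["mode"] == "TONR":
        return reg["count"]  # retentive timer keeps its count
    raise ValueError(f"mode not modelled: {reg['mode']}")


def update_timers(memory, elapsed_bases):
    """One sequential pass over all timers, one cycle per timer."""
    for index in range(len(memory)):
        reg = dict(memory[index])        # 1. load into the operating register
        if reg["clear"]:                 # 2. update on a clear request ...
            reg["count"] = 0
            reg["clear"] = False
        elif reg["base"] in elapsed_bases:   # ... or on a time base tick
            reg["count"] = next_count(reg)
        reg["output"] = reg["count"] >= reg["preset"]   # 3. output state
        memory[index] = reg              # 4. write back to the unit memory
```

Here `elapsed_bases` is a set such as `{"10ms"}` naming the time bases that ticked since the previous pass; timers configured for other bases pass through steps 1, 3 and 4 with their count unchanged.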
Each timer unit requires configuration before it is placed in run mode. The word processor transfers the configuration word, which contains the time base information and the terminal pulse count at which the output state changes. A configuration word write operation is executed in two steps:
1. The configuration word is placed on the data bus together with the timer address.
2. The timer write enable line is activated, which transfers the configuration word from the data bus to the timer unit.
The timer configuration word (Fig. 6) consists of 18 bits. The 4 most significant bits determine the time base of the timer and its operation mode. The remaining 14 bits determine the terminal count value. After initial configuration, the timers are ready for normal operation. The word processor can read the state of a timer at any time. The read operation is performed in the same way as for other peripheral units. Both the bit and the word units can read the state of the binary output of a timer.
Apart from timers, the designed unit has been equipped with 16 counters called C0 to C15. The operation modes are defined similarly to those of counters in commercially available programmable controllers. Counters do not require a time base unit. Each counter can operate in one of three modes:
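The layout of the configuration word can be made concrete with a small packing routine. In this sketch the 4 most significant bits are treated as one opaque `control` field, because the text does not say how those bits are split between time base and operation mode; the function and constant names are hypothetical.

```python
COUNT_BITS = 14
COUNT_MASK = (1 << COUNT_BITS) - 1      # 16383, the maximum terminal count
CONTROL_BITS = 4
CONTROL_MASK = (1 << CONTROL_BITS) - 1  # time base and mode, split not specified


def pack_timer_config(control, terminal_count):
    """Build the 18-bit word: 4 control bits above 14 terminal count bits."""
    if not 0 <= control <= CONTROL_MASK:
        raise ValueError("control field must fit in 4 bits")
    if not 0 <= terminal_count <= COUNT_MASK:
        raise ValueError("terminal count must fit in 14 bits")
    return (control << COUNT_BITS) | terminal_count


def unpack_timer_config(word):
    """Split an 18-bit configuration word into (control, terminal_count)."""
    return (word >> COUNT_BITS) & CONTROL_MASK, word & COUNT_MASK
```

The 14-bit terminal count field is what limits the maximum count to 16383 reference pulses, since 2^14 - 1 = 16383.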
 CTU -counting up;
 CTD -counting down;
 CTUD -counting up and down.
Even though the counters unit is independent, it should not be considered a fast counters unit. Its input is not directly connected to the controlled object; the unit is controlled by the word processor.
Programmatic control over the counters unit and its connections to the processing units is similar to that of the previously described timers unit. Counters, like timers, require a start-up configuration. The configuration word consists of the mode selection and the initial state of the counter. Depending on the operation mode, this value is treated as the initial value for CTD mode or as the output toggle value for CTU and CTUD modes.
Dedicated counter and timer units can be integrated with the PLC CPU thanks to the use of an FPGA implementation platform, in contrast to a typical PLC, where those operations are implemented in the software layer. The presented hybrid method pushes intensive periodic operations to dedicated hardware. Only a small part of the non-periodic actions, such as initialisation, is executed by the CPU.
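The interpretation of the configured value in each mode can be sketched as follows. This is a behavioural illustration under assumptions: the class name, the pulse methods and the exact output conditions (output set at zero for CTD, at or above the configured value for CTU and CTUD) are assumed by analogy with commercial controllers and are not taken from the described hardware. Register width and overflow are ignored, and counts are held at zero rather than going negative.

```python
class Counter:
    """Hypothetical model of one counter configured with a mode and a value."""

    MODES = ("CTU", "CTD", "CTUD")

    def __init__(self, mode, value):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.value = value
        # CTD treats the value as the initial count; CTU and CTUD start at 0
        # and treat it as the count at which the output toggles.
        self.count = value if mode == "CTD" else 0

    def pulse_up(self):
        if self.mode in ("CTU", "CTUD"):
            self.count += 1

    def pulse_down(self):
        if self.mode in ("CTD", "CTUD") and self.count > 0:
            self.count -= 1

    @property
    def output(self):
        if self.mode == "CTD":
            return self.count == 0          # assumed: output set at zero
        return self.count >= self.value     # assumed: output set at the toggle value
```

For example, `Counter("CTD", 3)` sets its output after three down pulses, while `Counter("CTU", 3)` sets it after three up pulses.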
IMPLEMENTATION RESULTS
After completing the design and verification processes, the processing unit was implemented in the target device. To assess implementation quality, the number of required resources was collected for each block. The target device is a Xilinx XC4VLX25, which belongs to the Virtex-4 family (Xilinx, 2008). The entire central processing unit, together with the units required for proper operation, consumes about 17% (1841 slices) of the available logic resources. The final central processing unit contains some additional blocks which are not described here: I/O modules, a debounce unit, a Flash memory boot loader and a serial asynchronous interface. Table 1 gathers the hardware requirements of the particular blocks, listing logic resources in units such as slices (general purpose logic components) and Block RAMs (full dual port 16kb memories).
During the design process and the assembly of the entire system, it was observed that some functionality is replicated and that some optimisations can be introduced. To avoid replication, common components can be shared by the entire system. This situation has been observed in the debounce unit, which requires measuring 100 ms time intervals. The same signal is generated in the timers' prescaler unit. Instead of replicating the frequency divider, a single clock prescaling unit with multiple outputs can be implemented that satisfies the requirements of the entire system. Two processors (bit and word) were developed to test different configurations of central processing units. Most instructions of those processors are executed within 2 clock cycles. The development board was clocked by a 50 MHz oscillator, which corresponds to 40 ns per two-cycle instruction. All operations of the word processor are carried out on 16-bit data.
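The timing figure and the shared prescaler idea can be illustrated together. The sketch below is an assumption-laden model rather than the implemented circuit: the class name, the chained divide-by-10 structure and the output names are hypothetical. Only the 50 MHz clock, the two-cycle instruction time and the four time bases come from the text.

```python
CLOCK_HZ = 50_000_000        # development board oscillator
CYCLES_PER_INSTRUCTION = 2   # holds for most instructions, per the text


def instruction_time_ns(clock_hz=CLOCK_HZ, cycles=CYCLES_PER_INSTRUCTION):
    """Execution time of one instruction in nanoseconds."""
    return cycles * 1_000_000_000 / clock_hz


class Prescaler:
    """Single divider chain with one tick output per time base (assumed structure)."""

    NAMES = ("1ms", "10ms", "100ms", "1s")

    def __init__(self, clock_hz=CLOCK_HZ):
        # First stage divides the clock down to 1 ms; each later stage divides by 10.
        self.divisors = [clock_hz // 1000, 10, 10, 10]
        self.counters = [0, 0, 0, 0]

    def clock(self):
        """Advance one clock cycle and return the set of time bases that ticked."""
        ticks = set()
        for stage, divisor in enumerate(self.divisors):
            self.counters[stage] += 1
            if self.counters[stage] < divisor:
                break                      # later stages only advance on a tick
            self.counters[stage] = 0
            ticks.add(self.NAMES[stage])
        return ticks
```

With this arrangement the debounce unit could take the `100ms` output and the timers any of the four outputs, so only one divider chain is needed. At 40 ns per instruction, a program of one thousand two-cycle instructions would take about 40 µs, which is an estimate that ignores instructions needing more cycles.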
CONCLUSIONS
The research and development work resulted in a bit-word dual core central processing unit. The unit is a fully custom design created from the ground up. The purpose of the design was to compare the obtained performance with that offered by general purpose microprocessors. In the designed unit two different mechanisms of inter-processor communication have been implemented. The first is based on discrete flip-flops. The second is based on a dual port memory through which markers are exchanged between the processors. This solution allows for fully parallel operation. The experiments have been carried out on the designed units after their implementation in an FPGA. The use of VHDL and high-density programmable logic devices allows completely new constructions to be designed, built and verified. They can be verified not only in simulation but also, using FPGAs, in a real working device. The presented processing unit has its own instruction list, which also required the design of a dedicated assembler.
Future work will be carried out in the following directions:
 testing of developed units with different benchmark programs;
 improvement of the structures in terms of logic resource efficiency and performance;
 implementation of new features, such as other processor synchronisation mechanisms, improvement of the existing one, event-driven calculations, etc.
