Abstract-In security and surveillance, there is an increasing need to process image data efficiently and effectively, either at the source or in large data networks. Whilst Field Programmable Gate Arrays (FPGAs) have been seen as a key technology for enabling this, they are typically programmed using high-level and/or hardware description language synthesis approaches. This is a major disadvantage in terms of the time needed to design or program them and to verify correct operation, and it considerably reduces the programmability of any technique based on this technology. The work here proposes a different approach of using optimised soft-core processors which can be programmed in software. In particular, the paper proposes a design tool chain for programming such processors that uses the CAL Actor Language as a starting point for describing an image processing algorithm and targets its implementation to these custom-designed soft-core processors on FPGA. The main purpose is to exploit task and data parallelism in order to achieve the same parallelism as a previous HDL implementation whilst avoiding the design time, verification and debugging steps associated with such approaches.
I. INTRODUCTION
The emerging need to process larger data sets, such as high-resolution video footage, demands faster, configurable, high-throughput systems with better energy efficiency [1]. Field-Programmable Gate Arrays (FPGAs) can play a big role in this demanding market as they can provide configurability, scalability and concurrency to match the required throughput rates of the application under consideration [2]. They offer the potential to distribute image processing to a processing platform which is located as close as possible to the image source. This distributed processing can reduce the need for bandwidth and power on a large scale, which in turn reduces the communication overhead and the amount of data that needs to be stored.
Typically, FPGAs work very well with applications that require high concurrency, bandwidth and re-programmability. However, FPGA design and verification is a very delicate and time-consuming process [3] and requires that designers create descriptions of their designs in Hardware Description Languages (HDLs) such as VHDL, Verilog and SystemC. This HDL approach allows a precise description of the digital circuit to be defined and, with the performance timing defined, the design tools can then synthesise, map, place and route the HDL design accordingly. The major issue is that this design process involves numerous verification and debugging steps, which increases the time to market from weeks to months, depending on the complexity of the algorithm of interest. In order to reduce the required design time and effort, new High Level Synthesis (HLS) tools have been created which allow the designer to use high-level languages like C or OpenCL [4] to create algorithmic representations for FPGA implementation.
The paper proposes an alternative approach based on a dataflow-based toolset that targets reprogrammable, custom-designed, soft FPGA-based processors. The benefit of adopting this approach is that bespoke soft processors have guaranteed performance and resource usage; they are also easily reprogrammable and even allow potential support for runtime reconfigurability. The proposed approach uses the CAL dataflow language [5] and the respective ORCC synthesis flow [6]. The user can describe the algorithms in the CAL dataflow language and then compile them, whilst exploiting task and data parallelism [7], [8], to a series of small, reconfigurable soft-core RISC (Reduced Instruction Set Computing) architectures on FPGA called Image Processing PROcessor (IPPro) [3].
The paper is organised as follows. Section II reviews related background work in the area of FPGA design tools, concentrating on High Level Synthesis (HLS) tools. Section III briefly describes the soft-processor architecture and its capabilities. Section IV outlines the toolset and how the programming paradigm is achieved. Section V presents the case studies where the toolset is used with the designed soft-processor architecture, together with their performance comparisons. Section VI concludes and reviews the proposed approach.
II. BACKGROUND
The reprogrammable design methodology proposed in this paper aims to remove the need for the HDL design, synthesis and place and route processes by replacing the reconfigurability property of FPGAs with the proposed reprogrammable model. In order to do this, an intermediate medium formed by programmable multi-core processors is proposed. The proposed system consists of RISC architectures that support Single Instruction Multiple Data (SIMD) operation and various interprocessor communication methodologies, to provide the required flexibility and programmability. This reprogrammable medium has been designed to be as compact as possible, to use the available FPGA logic efficiently whilst still achieving high performance [3]. The toolset that supports this platform uses the CAL Dataflow Language because dataflow languages in general are able to express parallelism, and they make it easy to identify and resolve data dependencies so that concurrency can be exploited as much as possible.
A. CAL Dataflow Language
A dataflow program consists of actors and their firing rules, where every actor describes the arithmetic/mathematical operations required to process the input streams before passing the result(s) to the output streams. Dataflow programming models are represented by directed graphs in which the nodes represent computations (actors) and, in general, the arcs represent the movement of data. The main principles behind the dataflow design methodology are concurrency, scalability, modularity and data-driven execution. The term data-driven expresses that the execution of a dataflow program is controlled by the availability of the data itself. In this context, an actor is a standalone entity which defines an execution procedure. Actors communicate with other actors by passing data tokens, and execution is driven by this token passing. A set of actors combined with a set of connections between them forms a network. Within a network, communication takes place through FIFO components of unbounded size.
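The data-driven execution model described above can be illustrated with a minimal Python sketch. It is not CAL and not part of the ORCC toolset; the class names `Fifo` and `Actor`, the `run` scheduler and the two example actors are hypothetical and are chosen only to show how a firing rule (a required number of tokens on each input) controls execution.

```python
from collections import deque


class Fifo:
    """Unbounded FIFO channel carrying data tokens between actors."""

    def __init__(self):
        self._queue = deque()

    def push(self, token):
        self._queue.append(token)

    def pop(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


class Actor:
    """Actor with a single action (a simplification made for this sketch).

    Firing rule: at least `needs` tokens must be available on every input.
    """

    def __init__(self, name, inputs, outputs, needs, action):
        self.name = name
        self.inputs = inputs
        self.outputs = outputs
        self.needs = needs
        self.action = action

    def can_fire(self):
        return all(len(fifo) >= self.needs for fifo in self.inputs)

    def fire(self):
        # Consume `needs` tokens from each input, run the action and
        # produce one list of output tokens per output channel.
        args = [[fifo.pop() for _ in range(self.needs)] for fifo in self.inputs]
        results = self.action(*args)
        for fifo, tokens in zip(self.outputs, results):
            for token in tokens:
                fifo.push(token)


def run(actors):
    """Data-driven scheduler: fire enabled actors until none can fire."""
    fired = True
    while fired:
        fired = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                fired = True


def build_and_run(samples):
    """Two-actor network: scale each token by 2, then sum tokens pairwise."""
    src, mid, out = Fifo(), Fifo(), Fifo()
    for sample in samples:
        src.push(sample)
    scale = Actor("scale", [src], [mid], 1, lambda xs: [[2 * xs[0]]])
    pair_sum = Actor("pair_sum", [mid], [out], 2, lambda xs: [[xs[0] + xs[1]]])
    run([scale, pair_sum])
    return [out.pop() for _ in range(len(out))]
```

In this sketch `pair_sum` fires only once two tokens are waiting on its input, so no central controller is needed to sequence the two actors.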
B. Dataflow Development Environment - ORCC
ORCC is an open source dataflow development environment and compiler framework that allows the transcompilation of actors and generates equivalent code depending on the chosen back-end [6]. ORCC is developed within the Eclipse-based Integrated Design Environment (IDE) as a plug-in with graphical interfaces to ease the design of dataflow applications.
C. Soft-Core processors
In the state of the art, there are a number of soft-core processors for FPGA architectures. These include FlexGrip [9], IDEA [10], and a DSP48E-based MIMO (Multiple Input Multiple Output) processor [11]. FlexGrip maps pre-compiled CUDA kernels onto soft-core processors which are programmable, flexible and scalable, and can operate at 100 MHz. The IDEA processor and the MIMO processor have a similar structure to the IPPro proposed here, as both use the DSP48E1 processing unit from Xilinx FPGAs as their Arithmetic Logic Unit (ALU). The IDEA processor uses an 8-stage pipeline to achieve a clock frequency of 407 MHz, and the MIMO processor supports a very specific instruction set for MIMO communication systems and is able to work at a clock frequency of 265 MHz.
III. RISC ARCHITECTURE - IPPRO
This section describes the custom-designed, DSP48E1-based RISC architecture called IPPro, which is described in detail in a previous publication [3]. The IPPro is a hand-coded ISA architecture which uses Xilinx primitives, especially the DSP48E1 block as an Arithmetic Logic Unit (ALU), for more efficient and faster processing, and which supports a wide range of instructions and various memory accesses. It is capable of processing 16-bit operations, and uses distributed memory to build a memory hierarchy with three levels: the Register File, the Data Memory and the Kernel Memory. The IPPro architecture uses a balanced 5-stage pipeline, as shown in Fig. 1.
IPPro is capable of running at 526 MHz on a Xilinx System on Chip (SoC), specifically the XC7Z020-3, using one DSP48E1, one BRAM and 330 slice registers per processor. The main idea was to keep the processor as compact, reprogrammable and scalable as possible in order to achieve high throughput rates compared to custom-made HDL designs.
Overall IPPro has 3 addressable memory locations within the processor core:
The register file is used for regular memory locations, where separate data can be stored and processed. Data memory is the main data input and output location for IPPro, where the input and output streams are stored. Kernel memory is a specialized location for coefficient storage, designed for window and filter operations. Immediate memory is used to reduce the number of register file entries required, as well as the load and store operations needed in situations where one operand is constant.
An overview of the supported instructions can be seen in Table I. The IPPro instruction set is capable of processing basic arithmetic and logical operations for different addressing modes. In addition to unary operations, the IPPro instruction set also supports trinary expressions such as MULADD, MULSUB, MULADDK, and others shown in Table I. As will be discussed later, support for trinary expressions was added to the tool chain in order to benefit from this feature. Given the limited instruction support and the requirements of the application domain, it is envisaged that coprocessor(s) could be added to provide better support for more complex operations such as division and square root. Research is ongoing to design such a coprocessor.
This section describes the toolset that allows image processing algorithms of interest to be represented in the CAL Dataflow Language [5] and the resulting actors to be compiled to work with the custom multi-core architecture.
The overall algorithm design and compilation scheme involves the following steps:
1) Algorithm implementation in the CAL Dataflow Language.
2) Profiling the algorithm, and identifying the changes required within the CAL description of the chosen algorithm to remove bottlenecks and optimise performance.
3) Compiling the algorithm targeting the IPPro backend.
4) Loading the generated binary file onto the development board through the host operating system.
The design approach shown in Fig. 2 illustrates the mapping of actors to multi-core processors. During the compilation process, each actor is compiled and mapped to a processing element, and the interconnections between processing elements are assigned as FIFO channels. Given the architectural limitation, i.e. the lack of a stack infrastructure, support for function calls is limited by the size of the Instruction Register of the IPPro. As a consequence, lexically nested routines are not implicitly supported. The detailed limitations and assumptions are explained in the limitations section (B).
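The actor-to-core mapping described above can be sketched in a few lines of Python. This is an illustrative model only, assuming one actor per processing element; the function name `map_network`, the core numbering and the example actor names are hypothetical and do not reflect the internal data structures of the toolset.

```python
def map_network(actors, connections):
    """Assign one processing element per actor and one FIFO per connection.

    actors:      list of actor names, in the order cores are allocated
    connections: list of (source_actor, destination_actor) pairs
    Returns (core_of, channels), where `channels` lists the FIFO links
    as (source_core, destination_core) pairs.
    """
    # Assumption for this sketch: a strict one-to-one actor/core mapping.
    core_of = {actor: index for index, actor in enumerate(actors)}
    channels = []
    for src, dst in connections:
        if src not in core_of or dst not in core_of:
            raise ValueError(f"connection {src}->{dst} uses an unknown actor")
        channels.append((core_of[src], core_of[dst]))
    return core_of, channels
```

For a three-actor chain, `map_network(["grad", "mag", "bin"], [("grad", "mag"), ("mag", "bin")])` places the actors on cores 0 to 2 and links them with two FIFO channels.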
A. Compilation Flow
The compilation flow within the ORCC tools is composed of two distinct steps. The first step transcompiles each actor to its Intermediate Representation (IR). The IR is used within the compiler to preserve modularity and to allow different backends to be targeted. The second step is the conversion of the IR to IPPro assembly code. It should be noted that the transformations required for a specific backend should be carried out before this second step.
In this context, target-specific optimizations should be applied at the IR level, so that the transcompilation is optimized for a specific backend. For instance, the IPPro 
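As an illustration of a backend-specific transformation applied at the IR level before code emission, the following Python sketch fuses a multiply followed by a dependent add into a single MULADD. The tuple-based IR, the operand order (`MULADD d, a, b, c` computing `a*b + c`) and the emitted assembly syntax are assumptions made for this example; they are not the actual ORCC IR or the IPPro assembly format.

```python
def fuse_muladd(ir):
    """Fuse `MUL t,a,b` followed by `ADD d,t,c` into `MULADD d,a,b,c`.

    Each instruction is a tuple (opcode, destination, source, source, ...).
    Assumption: temporaries are local to the instruction list, so a value
    that is never read later in the list is dead.
    """
    out = []
    i = 0
    while i < len(ir):
        cur = ir[i]
        nxt = ir[i + 1] if i + 1 < len(ir) else None
        if cur[0] == "MUL" and nxt is not None and nxt[0] == "ADD":
            _, t, a, b = cur
            _, d, x, y = nxt
            # Exactly one ADD operand must be the product computed by MUL.
            if (x == t) != (y == t):
                c = y if x == t else x
                # Conservative check: the product must not be read again,
                # unless the ADD overwrites it anyway.
                read_later = any(t in ins[2:] for ins in ir[i + 2:])
                if d == t or not read_later:
                    out.append(("MULADD", d, a, b, c))
                    i += 2
                    continue
        out.append(cur)
        i += 1
    return out


def emit(ir):
    """Second step: print each IR tuple as a line of (hypothetical) assembly."""
    return [f"{ins[0]} " + ", ".join(ins[1:]) for ins in ir]
```

Applied to `[("MUL", "t0", "a", "b"), ("ADD", "r", "t0", "c")]`, the pass yields the single instruction `MULADD r, a, b, c`, whereas a sequence in which `t0` is read again later is left unchanged.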
B. Limitations
Given the lexical and semantic properties of the CAL language, it is not possible to fully support the compilation process from CAL to IPPro. Consequently, this imposes some limitations on the transcompilation process within the tool flow.
Let $A_x$ denote an actor within the network, where $0 < x \le N$, $N \in \mathbb{Z}$. Each $A_x$ may contain a variable number of actions $ac_{x,y}$, depending on the algorithm of interest, where $x$ is the actor index and $y$ is the action index. As previously stated, since a stack infrastructure is not supported, function calls are limited by the size of the instruction memory (IM), and the IM can only be accessed sequentially and in order. This limitation creates the need to refactor and serialize the actions $ac_{x,y}$ into a single action $ac_x$ per actor $A_x$. Currently, this refactoring is carried out by the algorithm designer, who must take the target hardware limitations into account when restructuring the actor.
C. Profiling
Efficient implementation of algorithms of interest on a specific computing platform requires niche expertise and knowledge. On embedded platforms in particular, the algorithm designer should be aware of the platform's capabilities and memory structure in order to achieve high-performance implementations. To realize efficient algorithm implementations, the toolset must also be aware of the target embedded platform. In our current case studies, all the algorithms are profiled by hand and optimized for the targeted platform according to its limitations and supported instruction set. There are various static and dynamic profiling tools and techniques in the open literature; Simone et al. [12] proposed a design space exploration framework for profiling and optimizing algorithms, which is also designed to work with the ORCC development environment.
V. CASE STUDIES
In this section, two case studies are used to demonstrate the applicability of our approach. The designs are described in the CAL dataflow language and then compiled using the proposed toolset.
A. Finite Impulse Response (FIR) Filter
In this case study, a 3-tap FIR filter with 8-bit fixed-point coefficients and a 16-bit datapath was chosen. Two investigations were undertaken, one with a single core and the other with multiple IPPros connected in a streaming mode.
In case 1, shown in Fig. 3a, a single-core IPPro was used. For this scenario, input values were loaded into the Register File and then processed by MUL and ADD instructions. The need for a load and store for every input reduces the throughput of the design, giving a rate of 430b/s at a clock frequency of 404MHz. In order to increase throughput, the same FIR filter was implemented for the streaming mode of the IPPro, where the Load and Store instructions are bypassed and FIFOs are used for processor communication, as shown in Fig. 3b.
In the streaming mode, 7 IPPros were used to increase the throughput of the design, where the instructions for each IPPro were transcompiled from one actor, and communication takes place through the FIFOs. In this case, it is possible to achieve a throughput rate of 6Mb/s at a clock frequency of 404MHz, with the clock limited by the achievable clock rate on the Zynq 7020 FPGA and not by the on-board processor. 
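A bit-accurate reference model helps when checking the output of either implementation. The sketch below models a 3-tap FIR filter with 8-bit signed coefficients and a 16-bit output in Python. The coefficient format (7 fractional bits), the wrap-around behaviour on overflow and the function names are assumptions made for this example and are not taken from the IPPro implementation.

```python
def wrap16(value):
    """Wrap an integer to signed 16-bit two's complement."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def fir3(samples, coeffs, frac_bits=7):
    """3-tap FIR: y[n] = (c0*x[n] + c1*x[n-1] + c2*x[n-2]) >> frac_bits.

    coeffs are signed 8-bit integers interpreted with `frac_bits`
    fractional bits (an assumption for this sketch). Samples before the
    start of the stream are taken as zero.
    """
    if len(coeffs) != 3 or not all(-128 <= c <= 127 for c in coeffs):
        raise ValueError("expected three signed 8-bit coefficients")
    x1 = x2 = 0  # delay line holding x[n-1] and x[n-2]
    out = []
    for x in samples:
        acc = coeffs[0] * x          # corresponds to a MUL
        acc = coeffs[1] * x1 + acc   # corresponds to a MULADD
        acc = coeffs[2] * x2 + acc   # corresponds to a MULADD
        out.append(wrap16(acc >> frac_bits))
        x2, x1 = x1, x
    return out
```

With coefficients `[32, 64, 32]` (0.25, 0.5, 0.25 under the assumed format), an impulse of amplitude 128 returns the coefficients themselves, which is a quick sanity check for any implementation under test.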
B. Histogram of Oriented Gradients Algorithm
The Histogram of Oriented Gradients (HOG) algorithm is a well-known algorithm for human detection, used for example in pedestrian detection, which is based on gradient orientation [13]. The HOG algorithm converts the pixel intensity information to gradient information, where each gradient consists of a magnitude and a direction. After this step, feature vectors are formed using the information extracted from the gradients. In the last stage of the algorithm, a Support Vector Machine (SVM) is used to perform human detection from the extracted vector information. The processing blocks of the overall algorithm are shown in Fig. 4, where the parts that have been chosen for IPPro implementation are identified in green and purple.
Kelly et al. [14] implemented the HOG algorithm using the IPPro platform and compared the design with a hand-written HDL design. They identified four processing blocks as candidates for IPPro implementation, namely the Gradient & Magnitude (G&M) and the Binning & Cell Histogram (B&CH) calculations. The performance metrics for these blocks, in comparison with hand-written HDL code, are shown in Table II. In this particular study, the blocks of interest were implemented for one core and the design was then scaled up according to the throughput requirements of the algorithm. This IPPro implementation was done by hand, and optimizations were made as a result of algorithm profiling. The work showed that the HOG algorithm can be accelerated by up to 3.2 times if 90 cores are used within the system. This demonstrates that our approach can be highly scalable and gives a reduction in design effort and time. 
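The first two stages described above, gradient extraction and orientation binning, can be sketched as a floating-point reference model in Python. The central-difference gradient, the nine unsigned orientation bins over 0 to 180 degrees and the hard (non-interpolated) binning are assumptions commonly made for HOG and are not taken from the IPPro implementation; the function names are hypothetical.

```python
import math


def gradients(img):
    """Gradient magnitude and direction for the interior pixels of `img`.

    `img` is a 2-D list of pixel intensities. Directions are unsigned,
    in degrees within [0, 180).
    """
    height, width = len(img), len(img[0])
    mags, angs = [], []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = img[y][x + 1] - img[y][x - 1]  # horizontal central difference
            gy = img[y + 1][x] - img[y - 1][x]  # vertical central difference
            mags.append(math.hypot(gx, gy))
            angs.append(math.degrees(math.atan2(gy, gx)) % 180.0)
    return mags, angs


def cell_histogram(mags, angs, bins=9):
    """Accumulate gradient magnitudes into orientation bins for one cell."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for mag, ang in zip(mags, angs):
        # The modulo guards against an angle that rounds up to 180.0.
        hist[int(ang // bin_width) % bins] += mag
    return hist
```

This model uses a square root and an arctangent, so it is only a reference for checking results; an implementation on a processor without division and square root instructions would need approximations for these operations.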
VI. CONCLUSION
A dataflow toolset to program FPGA-based soft-core processors has been presented. Using two examples, a simple FIR filter and a more complex HOG algorithm, we have shown its capability to reduce design time and effort. The toolset is currently operational with limited optimizations and limited memory access. Further optimization and profiling support is planned to improve the mapping of future designs and thus their efficiency, allowing performance close to that of a hand-crafted design.
Given the target application area, which is distributed image preprocessing, one of the main limitations of the current version of the processor is the lack of division and square root instructions. Support for these instructions is on our agenda for future work, along with optimization and profiling support within the design toolset.
