Introduction
The constant growth of frame rates and image resolutions, together with the increasing size of single-pixel representations, means that processing the full video bandwidth requires ever more efficient computing structures. Recent years have brought a variety of hardware architectures that allow efficient implementations [1]. The capacity of general-purpose computers is growing mostly through the use of multi-core processors. Graphics processors also offer considerable performance. Supercomputers are used mostly for offline, time-consuming calculations. Real-time tasks use signal processors or field-programmable gate array (FPGA) circuits.
Comparisons between different computer architectures indicate that each implementation platform has specific advantages and disadvantages [2]. The choice of a suitable computing architecture depends primarily on the processing performance required by the implemented algorithm, the type of computing operations, the way the system is powered, the operating conditions and economic factors.
Implementing image processing on personal and industrial computers based on general-purpose processors (GPP) is the most commonly used method. The main advantage of such solutions is the wide availability of hardware and software. Equally important are the short project execution time, which reduces implementation costs, and the availability of convenient tools for modelling, testing and final implementation. The disadvantages are high energy consumption and limited processing performance. Increasing the processing performance of such solutions by adding processing elements, e.g. by using multi-core processors, is not worthwhile when it comes at the cost of increased power consumption. Another obstacle to increasing the processing performance of modern multi-core and multiprocessor platforms is that the processing efficiency of a multiprocessor system declines as the number of processing elements grows. Effective parallelization depends on a proper algorithm decomposition that guarantees an increase in processing performance.
Signal and digital image processing systems have long used digital signal processors (DSP). Based on the Harvard architecture and equipped with an additional multiplier-accumulator (MAC), as well as a shifter or a floating-point division mechanism, they are an attractive platform for implementing digital filters and transforms. Owing to task parallelization and the separation of data and program buses, they are typically able to compete with general-purpose processors in terms of processing performance without reaching their clock frequency, and they are often used in consumer equipment. It is worth noting, though, that because of the lack of architecture standardization and the competition among manufacturers, software design tools for digital signal processors are not unified. Therefore, fully flexible software migration is not possible.
Given the limited processing performance of general-purpose computers, a methodology has been developed over several years that uses specialized graphics processors (GPU) for computing tasks related to image processing. According to Flynn's classification, graphics processors correspond architecturally to the single instruction stream, multiple data streams (SIMD) topology. They can therefore carry out certain classes of tasks whose decomposition for this type of architecture promises good acceleration [3]. Thanks to the constant development of software products that support parallel task implementation, this type of parallel architecture is easy to use, since the design style resembles general-purpose CPU programming, while the purchase of expansion cards that accelerate calculations does not necessarily mean a huge additional expense. In many cases the results are satisfactory and the processing performance is greater than that of general-purpose processors. One disadvantage of GPU-based solutions is the need to operate in a PC environment, which limits their applicability in the domain of embedded systems. A bottleneck might occur due to insufficient performance of the computer's bus or its overload, when the demand exceeds the graphics card's available memory and it is necessary to communicate with the CPU's main memory. One must also remember that motherboards have limited bus throughput, which is shared between several expansion cards.
High processing performance can also be achieved by using computing architectures based on application-specific integrated circuits (ASIC). They are especially suitable for systems that require energy savings. Their use depends on whether manufacturing a sufficiently large batch of circuits is cost-effective. Because of the very high cost of project preparation and implementation, only very large and wealthy companies (global leaders in the production of electronic systems and devices) can introduce new ICs using cutting-edge manufacturing technologies. According to Ref. 4, the cost of designing and manufacturing an ASIC in the 22 nm technology is around USD 110 million. The alternative for manufacturers is either to use somewhat cheaper and older technologies, which do not guarantee equally satisfactory IC performance (lower operating frequencies and higher energy consumption), or to give up their own IC design in favour of competitors' solutions. ASICs prove very useful in well-known and standardized procedures, such as broadcasting and video encoding or image processing in consumer equipment [5]. In the innovative market of image processing and analysis solutions, a principal drawback of ASICs is the relatively long time needed to go from concept to implementation (about 6 months). ASIC circuits are characterized by little flexibility and limited programmability.
Very promising image processing results have been achieved with FPGA circuits [6, 7]. The growing number of scientific publications on the implementation of image processing in FPGAs (Table 1) reflects an increasing interest in these solutions. The number of hardware and software platforms that support the design of image processing algorithms on these devices is also increasing [8, 9]. The most immediately noticeable advantage of FPGA technology is the user's ability to configure the computing system at the architectural level of the processing unit. Moreover, such reconfigurations may be repeated an unlimited number of times. For the implementation of image processing algorithms, this advantage is not to be underestimated. Building one's own FPGA-based system is also within reach; however, it is somewhat limited by the use of circuits in larger packages, such as ball grid array (BGA) and fine ball grid array (FBGA) [10], which require specialized equipment for assembling electronic elements and constructing multilayer printed boards because of the high density of I/O terminals on a relatively small area. One solution might be to purchase a ready-made FPGA system. The variety of such systems makes it possible to choose a platform that matches the problem to be solved. While selecting a platform, one must take several factors into consideration, including: the performance and capacity of the reprogrammable element; the number of banks, access time and capacity of the memory; the programming modes of the FPGA; the availability of a camera interface (in a particular standard); and vendor software that supports the design and execution of image processing applications. Reconfigurable platforms are able to interoperate with personal and industrial computers, acting as hardware accelerators.
Supercomputer manufacturers have also noticed the possibility of accelerating computation with FPGA circuits. FPGA accelerators are structurally tailored to supercomputer standards [11, 12] and used to implement comprehensive, complex applications for image processing and analysis. Owing to a much greater bus capacity (compared with personal computers), they achieve high transfer speeds and can be used in systems that require considerable processing performance.
Apart from accelerated computing, another important trend in the use of FPGA circuits is their use in embedded systems, including systems for image processing and analysis. The concept of embedded systems encompasses both desktop and mobile devices, including strictly industrial systems, such as smart cameras [13, 14], as well as consumer devices, e.g. gaming consoles [4]. Nowadays, the advantages of using FPGA circuits in embedded systems are more likely to be related to lower energy consumption and compactness than to processing performance, whilst maintaining the flexibility and programmability typically offered by GPP, DSP and GPU solutions. It is worth noting that some FPGA adapters comply with industrial standards, such as PC104 [15].
Parallel performance of the fine-grain pipeline FPGA image processing system
High data rate image sources
Images acquired by various types of sensing elements can be stored in a camera's memory or in computing devices. Storing an uncompressed high data rate video stream requires memory with a very short access time and a huge capacity. This sometimes limits the size of the stored image sample.
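The memory and bandwidth demand mentioned above can be estimated with simple arithmetic. The following is a minimal sketch, not taken from the paper: the function name and the example parameters (1920 x 1080 pixels, 24 bits per pixel, 60 frames per second) are assumptions chosen only for illustration.

```python
def uncompressed_data_rate(width, height, bits_per_pixel, frames_per_second):
    """Return (pixels per second, bytes per second) of an uncompressed stream."""
    # Number of pixels that must be processed or stored every second.
    pixel_rate = width * height * frames_per_second
    # Whole bytes per second (integer division rounds down).
    byte_rate = pixel_rate * bits_per_pixel // 8
    return pixel_rate, byte_rate


# Hypothetical source: Full HD, 24-bit colour, 60 frames per second.
pixel_rate, byte_rate = uncompressed_data_rate(1920, 1080, 24, 60)
print(f"{pixel_rate / 1e6:.1f} Mpixel/s, {byte_rate / 1e6:.1f} MB/s")
```

Under these assumed parameters, one second of video already occupies several hundred megabytes, which illustrates why the size of a stored sample is limited.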
Another solution is to process and analyse the pixel stream on-line. This approach is often used in embedded systems and makes it possible to avoid saving image data to memory. Real-time processing of a high data rate image stream without any loss of pixel data requires systems with sufficient processing performance for the particular algorithmic task at a predetermined signal rate. Recent image processing and analysis devices most often transmit video data using digital transfer standards. The most popular digital formats include:
The first three formats are video transmission standards in which data are not compressed. DV and HDDV (high-resolution DV) encodings both use lossy intra-frame compression. MPEG-1, -2 and -4 encodings achieve higher compression by using an inter-frame compression algorithm based on motion compensation. MPEG standards are principally used in TV signal transmission and in video recording in multimedia devices. Because of the inter-frame compression, their application in image analysis is difficult. The USB standard is sometimes used for video transmission, as it allows a plain video signal to be transmitted at high rates.
The choice of transmission standard is primarily determined by the bandwidth of the transmission channel and compatibility with the image processing system, as well as by factors that determine whether image compression is tolerable or necessary. However, one should not forget about economic factors, the availability of a specific device type, its size or the complexity of its structure.
Review of FPGA applications in image processing
Scientific publications and implementations of common image processing applications using FPGA circuits encompass the following:
Using neural networks in image processing.
An important driver behind the choice of the FPGA platform as an implementation environment is the ability to achieve real-time performance [6, 7, 16]. Real-time is often understood as processing that follows a stream of pixels at a specific rate without losing data.
The execution of basic image processing operations is well known. However, new and more efficient algorithms are being investigated for their implementation [17]. Real-time motion analysis can be based on relatively simple differential algorithms, such as the sum of absolute differences (SAD) [18], or on more sophisticated and complex ones, such as optical flow [19]. New publications keep emerging that illustrate various implementation options for frequency transforms [20, 21]. Several papers in the domain of segmentation algorithms describe the use of information from a disparity map in a stereoscopic image [22] and improvements of the classical component labeling algorithm [23, 24]. In recent years there have been many reports on the implementation of algorithms for feature extraction and object recognition [25, 26], in particular on the development of face detection methods [27]. The use of reprogrammable systems for medical signal processing has also gained considerable popularity in recent years [28-30]. Image space transformations (such as the transform into log-polar space) have been developed and refined [31]. Several papers were published on the implementation of stereoscopic methods [32-34]. Neural networks have been used for image recognition [35]. Algorithms for object tracking and target detection have been implemented [36, 37].
Opto-Electron. Rev., 20, no. 2, 2012, M. Gorgoń
Theoretical foundations for building FPGA-based image processing systems
A fundamental parameter of a computer system is the processing performance provided by a processing unit. Processing performance is defined as the ability of the processing unit to perform a specific number of operations in a given time. In the case of general-purpose processors and ASIC circuits (characterized by a specific design and mode of operation), the processing performance can be measured by estimating the maximum possible number of operations performed in time
$$PP = \frac{O}{T} \qquad (1)$$
where $PP$ is the processing performance, $O$ is the number of operations performed by the algorithm and $T$ is the execution time of the sequential solution.
The processing performance of an FPGA can be estimated for a particular configuration of the device by analysing the implemented calculation algorithm. In digital image processing, the smallest quantum of data is one pixel. The processing performance is therefore a function of the number of operations performed on a pixel per unit of time.
An analysis of initial image processing algorithms shows that for most of them the number of operations executed for each pixel can be determined accurately. Operations carried out for each line and frame of the picture, and those associated with the beginning and the conclusion of processing, should also be considered. For such algorithms, Eq. (1) takes the following form
$$PP = \left[ (N \cdot O_P + O_L) \cdot M + O_F \right] \cdot f_R + O_I + O_Z \qquad (2)$$
where $N$ is the number of pixels in an image line, $O_P$ is the number of operations for each pixel, $O_L$ is the number of additional operations for each image line, $M$ is the number of image lines, $O_F$ is the number of additional operations performed for each frame, $f_R$ is the number of image frames per second, $O_I$ is the number of additional operations related to processing initialization and $O_Z$ is the number of additional operations related to processing completion.
In a system consisting of many processing units, the processing performance can be multiplied. One of the measures often used to describe the parameters of a multiprocessor system is speedup
$$S_P = \frac{T_1}{T_n} \qquad (3)$$
where $S_P$ is the speedup, $T_1$ is the execution time of the best sequential solution and $T_n$ is the execution time of the parallel solution of the problem on $n$ processing elements. The increase in system performance that results from increasing the number of processing elements depends on a number of factors, e.g. the architecture of the entire system, the structure of the algorithm being processed and how well the algorithm matches the architecture. The degree to which the performance of a computing system consisting of $n$ processing elements increases is defined by a measure called "parallel efficiency":
$$E = \frac{S_P}{n} \qquad (4)$$
where $E$ is the parallel efficiency and $n$ is the number of processing elements.
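The quantities defined above can be evaluated numerically. The sketch below assumes that Eq. (2) has the form PP = [(N·O_P + O_L)·M + O_F]·f_R + O_I + O_Z, that the speedup is T_1/T_n and that the parallel efficiency is S_P/n. The function names and all numeric values (a 640 x 480 frame, the operation counts, the timings) are hypothetical and serve only as an illustration.

```python
def processing_performance(n, o_p, o_l, m, o_f, f_r, o_i=0, o_z=0):
    """Processing performance in the assumed form of Eq. (2)."""
    ops_per_line = n * o_p + o_l          # pixel operations plus per-line overhead
    ops_per_frame = ops_per_line * m + o_f  # all lines plus per-frame overhead
    return ops_per_frame * f_r + o_i + o_z


def speedup(t_1, t_n):
    """Speedup: best sequential time divided by parallel time."""
    return t_1 / t_n


def parallel_efficiency(s_p, n):
    """Parallel efficiency: speedup per processing element."""
    return s_p / n


# Hypothetical algorithm: 9 operations per pixel, 2 per line, 10 per frame.
pp = processing_performance(n=640, o_p=9, o_l=2, m=480, o_f=10, f_r=60)
print(f"PP = {pp} operations per second")

# Hypothetical timings: 8.0 s sequentially, 2.5 s on 4 processing elements.
s_p = speedup(8.0, 2.5)
print(f"S_P = {s_p:.2f}, E = {parallel_efficiency(s_p, 4):.2f}")
```

In this made-up case the efficiency is below 1, which corresponds to the typical situation in which adding processing elements does not increase performance proportionally.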
The possibility of implementing fine-grain parallelization in an FPGA allows for pipelined image processing [6, 12, 29, 38, 39]. According to Flynn's classification, this type of multiprocessor architecture is defined as multiple instruction streams, single data stream (MISD). Such an organization of image processing corresponds to the way image data is propagated from the CCD and CMOS sensors used in the vast majority of image acquisition devices. It would be very expensive to implement a fine-grain architecture in general-purpose processors, since each operation would require a separate processing unit, e.g. a separate CPU core. In an FPGA, however, fine-grain parallelization is much cheaper to obtain. The processing units required to perform operations on image data can be easily generated in the array of the reprogrammable circuit. It is worth noting that a single FPGA possesses the resources needed to implement a processor for pixel stream processing and analysis composed of many processing elements. An instruction stream similar to the one found in microprocessor systems does not exist in a system implemented in an FPGA. Properly configured logic resources of the circuit are responsible for performing the computational task. On its way through the circuit, data passes through subsequent processing units and the values change depending on the implemented functions.
Each processing unit acts as part of a pipeline, performing its task as soon as it receives the pixel values processed by the preceding processing unit. Because pipelines gather the local context in the fine-grain implementation, there is no need to store the entire image in memory after subsequent algorithmic tasks (such as convolution, median filtering, thresholding, etc.) are performed.
Once all the processing units are initialized, the pipeline reaches its maximum processing performance and maintains it until the first processing unit has processed the last pixel of the last image frame.
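The data flow described above can be modelled in software. The sketch below is only an analogy and is not the implementation discussed in the paper: Python generators run sequentially, whereas processing units in an FPGA operate simultaneously. It shows the pixel stream passing through two hypothetical stages (a three-pixel mean followed by thresholding) while only a small local context is ever kept in memory. All names and values are assumptions.

```python
from collections import deque
from typing import Iterable, Iterator


def window_mean3(pixels: Iterable[int]) -> Iterator[int]:
    """Stage 1: integer mean of the three most recent pixels."""
    window = deque(maxlen=3)  # local context only, never the whole image
    for p in pixels:
        window.append(p)
        if len(window) == 3:  # the stage produces output once its window is full
            yield sum(window) // 3


def threshold(pixels: Iterable[int], level: int) -> Iterator[int]:
    """Stage 2: binarize each value as soon as it arrives from stage 1."""
    for p in pixels:
        yield 255 if p >= level else 0


def pipeline(pixels: Iterable[int], level: int = 128) -> Iterator[int]:
    """Chain the stages; each pixel flows through both without buffering a frame."""
    return threshold(window_mean3(pixels), level)


# Hypothetical one-line pixel stream.
stream = iter([0, 0, 255, 255, 255, 0, 0])
result = list(pipeline(stream))
print(result)
```

The first two input pixels produce no output because the window is still filling, which mirrors the initialization phase of the pipeline; afterwards one result leaves the pipeline for every pixel that enters it.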
Since pipelined systems are especially useful in real-time processing, it may be assumed that the processing is continuous and that operation at the highest performance is a kind of steady state for this type of system. The processing performance of a pipeline (working with maximum efficiency) is equal to the sum of the maximum processing performances of all processing units within the pipeline
$$PP_P = \sum_{k=1}^{K} PP_k \qquad (5)$$
where $PP_P$ is the processing performance of the pipeline, $K$ is the number of processing elements within the pipeline and $k$ is the processing element index.
Putting expression (2) into Eq. (5) results in the following formula
$$PP_P = \sum_{k=1}^{K} \left\{ \left[ (N \cdot O_{P,k} + O_{L,k}) \cdot M + O_{F,k} \right] \cdot f_{R,k} + O_{I,k} + O_{Z,k} \right\} \qquad (6)$$
However, in accordance with the earlier assumption (a steady state of the pipeline), the initialization and completion terms are ignored, i.e. $O_{I,k} = O_{Z,k} = 0$. Additionally, all units of the pipeline run synchronously, processing the same number of frames per unit of time, $f_R$. Eq. (6) then assumes the following form
$$PP_P = f_R \cdot \sum_{k=1}^{K} \left[ (N \cdot O_{P,k} + O_{L,k}) \cdot M + O_{F,k} \right] \qquad (7)$$
The sum denotes all operations carried out by the pipeline for a single image frame.
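The steady-state pipeline performance can be illustrated numerically. The sketch assumes the form PP_P = f_R · Σ_k [(N·O_P,k + O_L,k)·M + O_F,k], i.e. the per-unit expression summed over all K processing elements with the initialization and completion terms set to zero. The function name and the three example units are hypothetical.

```python
def pipeline_performance(units, n, m, f_r):
    """Steady-state pipeline performance in operations per second.

    units: one (o_p, o_l, o_f) tuple per processing element, giving its
           operations per pixel, per line and per frame.
    n, m:  pixels per line and lines per frame.
    f_r:   frames per second, common to all synchronous units.
    """
    ops_per_frame = sum((n * o_p + o_l) * m + o_f for o_p, o_l, o_f in units)
    return f_r * ops_per_frame


# Hypothetical 3-stage pipeline on a 640 x 480 stream at 60 frames per second.
units = [(9, 2, 10), (1, 0, 0), (5, 2, 0)]
pp_pipeline = pipeline_performance(units, n=640, m=480, f_r=60)
print(f"PP_P = {pp_pipeline} operations per second")
```

Appending a further tuple to `units` adds that unit's operations to the total without changing the frame rate, which is the behaviour the formula expresses.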
Ref. 40 proposes a formula for determining the speedup that differs from the one in Eq. (3). It is known as Gustafson's Law
$$S'_P = n + (1 - n) \cdot s' \qquad (8)$$
where $s'$ represents the serial time spent on the parallel system.
The fine-grain parallelization achieved in an FPGA allows for a linear speedup in a pipeline as the number of processing units increases. This observation can be derived from Gustafson's Law under two additional, complementary assumptions:
- the propagation time of a single pixel is significantly smaller than the image frame processing time (taken into account in coarse-grain operation), which is directly related to the large number of pixels within the image frame; therefore the unsteady state of the pipeline is negligibly short and, as mentioned before, the pipeline continuously processes pixels of consecutive frames;
- when the pipeline operates in a steady state, no serial time is spent, because all processing elements in the pipeline are working simultaneously on consecutive pixels and are 100% busy.
The above statements allow one to assume that $s' = 0$; therefore the speedup expressed by Eq. (8) takes the value of $n$. This means that attaching successive processing elements to the pipeline can linearly increase the processing performance of FPGA-based image processing systems, provided that the processing elements are synchronized by the same pixel clock. In other words, the pipeline is a fully scalable architecture.
Consequently, if we substitute $S'_P$ from Eq. (8) for $S_P$ in Eq. (4), the parallel efficiency $E$ takes the value of 1, which is expressed by Eq. (9)
$$E = \frac{S'_P}{n} = \frac{n + (1 - n) \cdot s'}{n} = 1 \qquad (9)$$
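The effect of the serial term on speedup and efficiency can be checked with a few lines of code. The sketch assumes Gustafson's Law in the form S'_P = n + (1 - n)·s' and the efficiency E = S'_P / n; the function names and the sample values of n and s' are hypothetical.

```python
def gustafson_speedup(n, s):
    """Scaled speedup for n processing elements and serial term s."""
    return n + (1 - n) * s


def parallel_efficiency(speedup, n):
    """Speedup per processing element."""
    return speedup / n


# Steady-state pipeline: no serial time (s' = 0), so speedup equals n and E = 1.
steady = [(n, gustafson_speedup(n, 0.0)) for n in (1, 2, 4, 8)]
for n, s_p in steady:
    print(n, s_p, parallel_efficiency(s_p, n))

# For comparison, a hypothetical non-zero serial term lowers both measures.
s_p_serial = gustafson_speedup(8, 0.1)
print(round(s_p_serial, 3), round(parallel_efficiency(s_p_serial, 8), 3))
```

With s' = 0 the loop prints an efficiency of 1.0 for every n, in line with the linear speedup stated in the text; any positive serial term gives a speedup below n.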
Processing performance, speedup and parallel efficiency are important parameters that allow the performance of various hardware resources to be compared. Compared with general-purpose processors or signal processors, the clock frequency of fine-grain parallel FPGA architectures is often several times lower. However, owing to intelligent algorithm decomposition and parallelization of operations, the processing performance may be several times higher than that of a microprocessor system. In the light of rising expectations for processing performance in vision systems, this significantly contributes to the growing popularity of such architectures.
Conclusions
Image processing and analysis account for a significant percentage of the algorithms implemented in FPGA circuits. Technological advancement in the manufacture of semiconductor ICs offers opportunities to implement a wider range of imaging operations in real time. Implementations of existing operations are being improved. Owing to the possibility of fine-grain parallelization of imaging operations, FPGA circuits are capable of competing with other computing platforms.
Despite the relatively low clock frequency of FPGA circuits (within the 100-300 MHz range), reprogrammable systems offer intensive parallelization (impossible in conventional computing architectures) through fine-grain (pixel-level) decomposition of the calculation algorithm, together with the possibility of matching the algorithm to the processing unit and to the characteristics of the available resources. As a result, they are able to provide substantial processing performance.
The paper presents a method for calculating the processing performance of parallel image processing operations. The speedup and efficiency of the pipeline were analysed, showing that the pipeline achieves a linear speedup in parallel systems and a parallel efficiency E = 1.
