Abstract: This study focuses on the embedded realisation of adaptive vision algorithms, and illustrates the challenges using mixture of Gaussian (MoG) background subtraction. MoG is a frequently used adaptive vision kernel, for example, in surveillance applications. It involves massive computation and communication demands, which render a software approach infeasible within a 1 W power budget. To address these challenges, the authors employ a systematic system-level design approach: they first analyse the demands at a high level, explore opportunities for bandwidth reduction, and derive a customised system-level specification. Based on the system-level exploration, this study then proposes a communication-centric architecture template that simplifies implementing embedded adaptive vision algorithms. To achieve high efficiency, they propose to separate streaming and algorithm-intrinsic traffic. This allows customising the traffic handling based on the role of the data, and simplifies interconnecting multiple heterogeneous nodes. The authors demonstrate the benefits of traffic separation and the communication-centric architecture template based on MoG. They realise MoG on the Zynq-7000 SoC, processing a 1080p 30 Hz stream in real-time. The MoG processing kernel consists of 77 pipeline stages operating at 148.5 MHz. The authors' solution is more than 600× faster than an ARM Cortex-A9 running at 666 MHz. It consumes only 151 mW of on-chip power when operating in real-time.
Introduction
With significantly growing processing capabilities, vision algorithms are increasingly targeted towards embedded deployments. Embedded vision refers to deploying visual analysis and computer vision algorithms using embedded systems [1], covering a variety of markets with a notable algorithm diversity and conflicting requirements [1, 2]. Rapidly growing markets involving vision computing include advanced driver assistance systems, industrial vision and video surveillance. The surveillance market alone was estimated to more than triple from $11.5 billion in 2008 to $37.5 billion in 2015 [3]. Embedded vision applications share similar top-level requirements for real-time video processing, including modelling of captured scenes, object analysis, detection and event classification.
Embedded vision is particularly challenging because of conflicting demands for very high performance and very low power consumption. An important algorithm class contains adaptive vision algorithms, which track visual information through a continuously updated model. Adaptive vision algorithms are often based on machine-learning principles. Algorithm examples include mixture of Gaussians (MoG), Lucas-Kanade optical flow and support vector machines (SVMs), which can, thanks to their adaptive nature, tackle complex tasks (e.g. object detection, tracking and classification). The market demand for higher resolution (e.g. Full-HD 1920 × 1080) drives the compute complexity well into many billions of operations per second (GOPs). As a result, embedded architects face tremendous challenges to realise high-quality vision processing solutions operating at HD resolution while consuming very little power (often <1 W).
Adaptive vision algorithms pose tremendous challenges for their embedded implementation: (a) a huge computation demand including conditional execution; (b) large storage volume (and bandwidth) for keeping and updating an internal adaptive model; and (c) simultaneous access to the internal model aligned with the incoming pixel stream. In many cases, the internal adaptive model (more generally: the algorithm-intrinsic data) is many times larger than the streaming traffic (e.g. MoG parameters account for 7.3 GB/s for a 0.2 GB/s video stream). In addition, stream and algorithm-intrinsic data have to be aligned (simultaneously available) because of the 1:1 correlation between a stream sample and its model counterpart. The significant bandwidth requirements and computation with tens of GOPs render an SW implementation clearly infeasible.
Existing heterogeneous approaches [4, 5] mostly focus on vision filters (illumination/colour extraction, convolution), which are generally less demanding than adaptive algorithms. In particular, vision filters require fairly small internal data storage, which is realisable in HW as local storage. As such, the existing approaches are not transferable to adaptive vision algorithms, or they cripple them to very small resolutions (contrary to market desire). New efficient heterogeneous solutions are required that simultaneously offer high performance and low power, and thereby enable embedded deployments of adaptive vision algorithms. A particular challenge to overcome is the algorithm-intrinsic data, which results in tremendous memory traffic, yet requires strict alignment to the streaming traffic.
This paper introduces a systematic approach to tackle the challenges of adaptive vision algorithms, and uses MoG background subtraction as an example. The first stage is a system-level exploration to analyse the adaptive vision algorithm for its demands (computation and communication) and to identify opportunities for bandwidth reduction. This yields a customised system-level specification as a blueprint for the heterogeneous realisation. To cope with the algorithm-intrinsic data, we propose a communication-centric architecture template with two key features: (i) separating algorithm-intrinsic and streaming traffic and (ii) autonomous control and synchronisation. The traffic separation (i) allows compressing algorithm-intrinsic traffic, and with this makes adaptive vision algorithms realisable in an embedded system. In addition, the traffic separation simplifies interconnecting multiple heterogeneous nodes. The autonomous control (ii) simplifies synchronisation between streaming and algorithm-intrinsic data independent from the host ILP. It minimises the system-level synchronisation overhead, and further simplifies embedded realisation. In effect, our proposed solution operates independently (as a peer processing element) on streaming pixels in parallel to a host processor.
We demonstrate the benefits of traffic separation and the communication-centric architecture template based on MoG. We realise MoG on a Zynq-7000 SoC operating on a real-time 1080p, 30 Hz video stream. The MoG processing kernel consists of 77 pipeline stages operating at 148.5 MHz. Our solution is more than 600× faster than an ARM Cortex-A9 implementation and consumes 151 mW of on-chip power. Compared with the closest related MoG embedded implementation [6], our solution operates at higher resolution while yielding much higher quality and memory bandwidth reduction.
The remainder of this paper is organised as follows. Section 2 discusses relevant related work. Section 3 further explains the background and motivation of our work. Section 4 describes our approach in detail. Section 5 presents the experimental results. Section 6 concludes this paper and touches on future work.
Related work
The limited power budget starkly restricts deployability of embedded vision applications, especially when considering Full-HD resolutions. Therefore system architects move towards heterogeneous solutions combining embedded processors with specialised hardware accelerators. Examples include ADI ADSP-BF60x [4] and TI DaVinci [5]. They offload compute-intense kernels into specialised hardware accelerators, while control and high-level analytics execute on embedded processors. However, current HW solutions mainly focus on basic vision filters, for example, Canny edge detection, with regular computation and communication patterns [7, 8].
Few researchers [9–12] have targeted adaptive vision algorithms (e.g. MoG, KLT, optical flow) for embedded HW accelerators. None of these approaches separates traffic types; they emit everything into a common infrastructure (e.g. network-on-chip [12] or customised [9]). Thus, they either ignore the algorithm-intrinsic traffic or assume it is hidden in the hierarchy. As a result, these HW accelerators are limited to very low resolutions (300 × 200 [9]), while the market demands Full-HD or higher. Furthermore, existing HW vision accelerators are often organised as co-processors relying on frequent host ILP interaction for scheduling, synchronisation and data transfers. This burdens the ILP with significant synchronisation overhead and incurs unnecessary traffic for moving both streaming and operational data throughout the system memory hierarchy. This overhead is exacerbated with adaptive vision algorithms because of the required alignment of intrinsic and streaming traffic.
Realising MoG background subtraction (in HW) for embedded heterogeneous deployment is extremely challenging. Only a few approaches have been proposed [6, 13–16], all with specific restrictions or limitations. The architecture in [13] avoids the performance-costly square root operation, which results in a measurable quality loss. On top of that, with the small resolution (320 × 240 at 30 Hz), many challenges including high traffic volume do not appear. The approaches in [14, 15] operate with similar restrictions, being limited to very low resolutions ([14] even only 120 × 120). A moderately higher resolution (640 × 480) is targeted in [6]. Yet, Appiah and Hunter [6] use the same shortcut as in [13] and avoid the square root operation to reduce computation complexity at the cost of some quality loss. Only the approach in [16] aims for HD resolution (1080p, 30 Hz). Ratnayake and Amer [16] also point to the immense traffic for accessing Gaussian parameters and propose applying a compression. However, they remain at simulation level without an actual field programmable gate array (FPGA) execution in real-time. By focusing on simulation, they do not address major SoC integration challenges such as direct memory access (DMA) for updating Gaussian parameters, dealing with the system traffic and reaching timing closure under layout constraints. Consequently, Ratnayake and Amer [16] do not validate the applicability of their approach in a real-time environment.
Overall, current approaches accept a significant loss in quality, either because of very low resolutions or because of shortcuts taken to manage computation complexity. To overcome computation and communication challenges, our approach employs algorithm/architecture co-design. It offers a key insight of separating different traffic types, compressing one traffic type to tame communication volumes, and proposes an architecture template. With this, our approach performs MoG background subtraction at Full-HD (1080p, 30 Hz) in real-time on actual hardware. It outperforms current approaches which operate at lower resolutions [6, 13–15] and addresses the system-level integration challenges (which were not solved in [16]). More importantly, our approach presents a more general solution and an architecture template. Altogether, our work paves the path towards embedded deployment of adaptive vision algorithms by outlining a systematic approach for managing their immense communication demands, and by introducing an architecture template.
Background
To give contextual information for our approach, this section briefly introduces adaptive vision algorithm properties, and then overviews the application example for this paper.
Adaptive vision algorithms
Vision algorithms can be roughly divided into two classes: (i) filter-based and (ii) adaptive. (i) Filter-based algorithms (e.g. convolution, Canny or Sobel edge detection and Harris corner detection) keep limited algorithm-intrinsic data (e.g. few hundred bytes for convolution) and mostly focus on one frame at a time with very limited interaction across frames. In contrast, adaptive algorithms mainly work across frames, often based on machine-learning algorithms (e.g. MoG and SVM) which track frames through a continuously updated model (e.g. background classification and motion detection). The frame model is often very large (e.g. 248 MB for MoG). Updating the model with every frame causes significant memory traffic. Nevertheless, the adaptive nature enables tackling more complex tasks (e.g. object detection, tracking and classification) at high quality.
Adaptive vision algorithms have been studied at a higher level (e.g. MATLAB, OpenCV). However, their embedded realisation is very challenging because of the continuously updated frame model (large traffic, synchronisation) and more complex computation. Therefore novel solutions are required to enable efficient embedded realisations of adaptive vision algorithms.
MoG background subtraction
For this paper, we select MoG background subtraction [17] as a representative for adaptive vision algorithms. MoG isolates foreground (moving) objects from a (static) background. It is part of many vision applications, such as video surveillance, industrial vision and patient monitoring systems. MoG, visualised in Fig. 1, uses multiple Gaussian distributions, also called Gaussian components, to model a pixel's background. Each Gaussian component has its own set of Gaussian parameters: weight ω_{i,t}, intensity mean μ_{i,t} and standard deviation σ_{i,t}. Each pixel in a video stream is tracked by its own set of 3-5 Gaussian components (background model), and its Gaussian parameters are updated with every frame. The Gaussian components of a pixel differ by learning factor to account for varying permissible change rates.
For further insight, Algorithm 1 (Fig. 2) outlines the reference MoG algorithm (see further detail in [18]). The algorithm loops through all pixels in the frame (beginning in line 2). For each pixel, the algorithm first classifies the pixel's Gaussian components into 'match' or 'non-match' components (line 5). A component is a match component if the component's 'mean' is within a match threshold Γ_match of the current pixel value. Gaussian parameters are then updated based on the 'match' classification. If no component matches (line 12), a new Gaussian component (called virtual component) replaces the component with the smallest 'weight' (line 13). Then, the components are sorted based on their 'weight' over 'standard deviation' ratio (line 15). Starting with the highest ranked component, the algorithm checks if the component sufficiently expresses the current pixel (i.e. if its weight is less than the FG threshold Γ_FG and the 'mean' over 'sd' is less than the match threshold Γ_match, see line 19). Finding one component that sufficiently expresses the current pixel declares the pixel as background, and the algorithm proceeds with the next pixel. Overall, the MoG algorithm has significant computation with many control statements (e.g. if-then-else), and poses many challenges for an embedded realisation.
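To make this control flow concrete, the following Python sketch updates the Gaussian components of a single grey pixel. It is a simplified textbook-style variant and not the reference Algorithm 1: the learning rate `alpha`, the match factor `match_k`, the background weight threshold `bg_weight` and the initial values of a replaced component are assumed values chosen for illustration, and the background test is reduced to "a sufficiently heavy component matches the pixel".

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    weight: float
    mean: float
    sd: float


def mog_update_pixel(pixel, comps, alpha=0.01, match_k=2.5,
                     bg_weight=0.25, init_sd=30.0, init_weight=0.05):
    """Update one pixel's components in place; return True if foreground."""
    any_match = False
    for c in comps:
        diff = pixel - c.mean
        if abs(diff) <= match_k * c.sd:
            # Match: raise the weight and pull mean/sd towards the pixel.
            any_match = True
            c.weight = (1.0 - alpha) * c.weight + alpha
            c.mean += alpha * diff
            c.sd = math.sqrt((1.0 - alpha) * c.sd ** 2 + alpha * diff ** 2)
        else:
            # Non-match: only decay the weight.
            c.weight *= (1.0 - alpha)

    if not any_match:
        # Replace the lowest-weight component with one centred on the pixel.
        weakest = min(comps, key=lambda g: g.weight)
        weakest.weight, weakest.mean, weakest.sd = init_weight, float(pixel), init_sd

    # Normalise the weights so that they sum to one.
    total = sum(c.weight for c in comps)
    for c in comps:
        c.weight /= total

    # Rank by weight/sd, then look for a heavy component that explains the pixel.
    for c in sorted(comps, key=lambda g: g.weight / g.sd, reverse=True):
        if c.weight >= bg_weight and abs(pixel - c.mean) <= match_k * c.sd:
            return False  # background
    return True  # foreground
```

Calling `mog_update_pixel` once per pixel and frame reproduces the per-pixel independence noted above: no state is shared between pixels, only between successive frames of the same pixel.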
Approach
This section describes our approach starting from system-level exploration, computation and communication realisation, and finally system-level integration. Throughout this section, we use the earlier introduced MoG as a running example.
System-level analysis and exploration
Attempting to address all adaptive vision algorithm challenges at once proves too complex, as seen from the restrictions of previous work. To untangle the dependencies and design options, it is beneficial to start at a higher abstraction level and iteratively identify and solve problems one by one. To hierarchically address the challenges, system-level design principles can be employed, starting with an executable specification model. Such a model serves as a golden model to reference against, and simplifies exploring the effect of design decisions. As such, it enables evaluating and adapting both algorithm and architecture with respect to each other. Fig. 3 highlights the MoG specification model captured in the SpecC system-level design language (SLDL) [19] based on the reference algorithm (Algorithm 1 (Fig. 2)). The MoG specification model is part of an object tracking flow which receives a pixel stream from the camera and outputs object positions to a monitor. The specification model expresses coarse-grained parallelism and pipeline stages, and isolates communication channels. The pipeline stages are 'Gaussian update', 'weight normalisation' and 'FG detection', all operating on streaming pixels. In 'Gaussian update' (lines 3-10 of Algorithm 1 (Fig. 2)), Gaussian components execute in parallel, independent of each other. 'Weight normalisation' is a synchronisation point for normalising the updated weight parameters (line 11 of Algorithm 1 (Fig. 2)). Finally, 'FG detection' determines the FG/BG status of the pixel (lines 12-23 of Algorithm 1 (Fig. 2)).
The abstract communication channels show two types of data access: 'Gaussian parameters' (algorithm-intrinsic data) and 'Gray pixels, FG/BG mask' (streaming data). Dedicated communication channels separate the traffic based on their data type. Algorithm-intrinsic data ('Gaussian parameters') directly hits the memory hierarchy: it is read from memory, updated in the pipeline and written back to memory. Conversely, the streaming data are directly transmitted from one behaviour to another. MoG receives 'Gray pixels' from RGB2Gray, and streams 'FG/BG mask' out to ObjectTracking.
Specification profiling:
The specification model enables system-level profiling (in our case integrated into the system-on-chip environment [20]) to analyse MoG computation and communication demands, in order to identify system bottlenecks before starting implementation. [4], MoG requires 12 cores (8 cores) for a 32 bit integer computation even when using the most optimistic assumptions to obtain a lower bound on resources. MoG's computational complexity makes it prohibitively expensive for a software realisation; conversely, it is an excellent candidate for a hardware realisation.
In addition, specification profiling reveals a very high communication volume of 7440 MB/s (4360 MB/s). Reading/updating the Gaussian parameters produces about 60 times more traffic than the streaming data, which only occupies 130 MB/s (16 bit input pixel, 1 bit foreground). Storing Gaussian parameters requires 248 MB (146 MB), which are read and written once for each frame. As a result, MoG would saturate an LPDDR2 memory interface. Even the more powerful DDR3 memory interface would operate at 88% of its theoretical peak performance, making such an implementation unrealistic. Only operation at Half-HD resolution seems to be in the realm of feasibility. The very high demands of MoG at Full-HD call for more algorithm analysis and investigation of communication parameters.
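The reported volumes can be approximated with back-of-the-envelope arithmetic. The sketch below assumes Full-HD at 30 Hz, decimal megabytes, the full-length 480 bits/pixel parameter size quoted in the bandwidth/quality discussion, and that each parameter set is read once and written once per frame; the function name is illustrative only.

```python
def stream_mb_per_s(width, height, fps, bits_per_pixel):
    """Bandwidth in MB/s (1 MB = 10**6 bytes) for a per-pixel payload."""
    return width * height * fps * bits_per_pixel / 8 / 1e6


W, H, FPS = 1920, 1080, 30

# Streaming traffic: 16 bit grey pixel in, 1 bit foreground mask out.
streaming = stream_mb_per_s(W, H, FPS, 16 + 1)

# Algorithm-intrinsic traffic: full-length Gaussian parameters
# (480 bits/pixel), read once and written back once per frame.
intrinsic = 2 * stream_mb_per_s(W, H, FPS, 480)

print(f"streaming ~ {streaming:.0f} MB/s")
print(f"intrinsic ~ {intrinsic:.0f} MB/s")
print(f"ratio     ~ {intrinsic / streaming:.0f}x")
```

Under these assumptions the estimate lands close to the profiled 130 MB/s and 7440 MB/s, and the ratio is in the vicinity of the stated factor of 60.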
Bandwidth/quality trade-off:
One approach to mitigating the high computation/communication demands is to reduce the resolution, as done in previous work. However, the market demands higher resolution to improve recognising and tracking objects of different sizes. Maintaining Full-HD resolution while reducing the architecture demands requires investigating the savings potential at the cost of some quality.
Quality contributors can be roughly divided into two orthogonal axes: computation precision and communication precision. MoG computation is independent per pixel (i.e. pixel parallelisable) and not latency sensitive. Thus, it can be addressed through a customised HW implementation. MoG communication, however, is the main challenge (also in power consumption, as will be shown later). One possibility is to reduce the precision (bit-width) of data traffic (e.g. when storing Gaussian parameters) to reduce the system-level traffic. This introduces a trade-off between quality and memory bandwidth. As the Gaussian parameters contribute the most traffic, we focus on quality/bandwidth exploration in the context of Gaussian parameters. Conversely, only limited or no opportunity exists for streaming traffic, as it is driven by standardised input/output (I/O) interfaces.
To explore the bandwidth/quality trade-off, support blocks are added for precision adjustment on Gaussian parameter read/write. Fig. 4a presents the MoG specification with precision adjustment. For simplicity, precision adjustments perform simple discretisation, keeping the N most significant bits. The precision adjustment for Gaussian parameters introduces a trade-off between output quality and memory bandwidth.
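As an illustration of such a discretisation, the sketch below keeps the N most significant bits of an unsigned fixed-point word on the way to memory and pads with zeros on the way back. The function names and the default widths are assumptions for this example; the per-parameter bit allocation actually chosen is not reproduced here.

```python
def reduce_precision(value, full_bits=32, keep_bits=16):
    """Keep the keep_bits most significant bits of an unsigned full_bits word."""
    if not 0 < keep_bits <= full_bits:
        raise ValueError("keep_bits must be in 1..full_bits")
    return (value & ((1 << full_bits) - 1)) >> (full_bits - keep_bits)


def restore_precision(stored, full_bits=32, keep_bits=16):
    """Expand a stored word back to full width; dropped low bits become zero."""
    return stored << (full_bits - keep_bits)


original = 0xABCD1234
stored = reduce_precision(original)      # what would be written to memory
restored = restore_precision(stored)     # what the pipeline would read back
error = original - restored              # bounded by 2**(full_bits - keep_bits)
```

The discretisation error is always smaller than `2**(full_bits - keep_bits)`, which is what makes the quality impact a function of the number of bits kept.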
We have exhaustively explored the trade-off. Quality is evaluated using MS-SSIM [21], which focuses on structural similarity, comparing against a ground truth obtained from the reference algorithm (32 bit fixed point operation and no discretisation). Fig. 4b illustrates the trade-off with bandwidth (bits per pixel) on the x-axis and quality (MS-SSIM) on the y-axis. The red Pareto curve in Fig. 4b plots the maximum achievable quality over Gaussian bits per pixel, a metric for the volume of operational data per pixel. As an example, 70 bits/pixel achieve at most 0.64 quality, whereas 244 bits/pixel already reach maximal quality. Note that the full-length parameters require 480 bits/pixel. Fig. 4b yields two important messages. First, maximal quality is already reached with 244 bits/pixel, demonstrating a significant potential for bandwidth reduction. Second, identical bit-width sizes with different bit discretisations lead to different quality. More details of the trade-off analysis can be found in [22].
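Extracting such a Pareto curve from an exhaustive sweep is straightforward. The sketch below is a generic filter, not the authors' tooling; only the points (70, 0.64) and 244 bits/pixel at maximal quality come from the text (taking 1.0 as the maximal MS-SSIM value), while every other sample is hypothetical.

```python
def pareto_front(points):
    """Return the (bits, quality) points for which no configuration with the
    same or fewer bits reaches the same or better quality."""
    front = []
    best_quality = float("-inf")
    # Sort by bits ascending; for equal bits, best quality first.
    for bits, quality in sorted(points, key=lambda p: (p[0], -p[1])):
        if quality > best_quality:
            front.append((bits, quality))
            best_quality = quality
    return front


# (bits per pixel, MS-SSIM) pairs; mostly hypothetical sweep results.
samples = [(70, 0.64), (70, 0.41), (120, 0.60), (150, 0.83),
           (244, 1.00), (300, 1.00), (480, 1.00)]

front = pareto_front(samples)
```

The two samples at 70 bits/pixel mirror the second observation above: the same bit budget split differently across parameters yields different quality, and only the better split stays on the front.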
In summary, the high-level exploration achieved about 51% reduction in memory traffic. This moves Full-HD MoG into the realm of feasibility for heterogeneous implementation. The following sections describe how the top-level specification is converted to an actual implementation. The main focus rests on communication as it reflects the inherent challenges in the context of adaptive vision algorithms.
Computation realisation
For the MoG computation realisation, we focus on a manual RTL implementation guided by the system-level specification. Alternatively, high-level synthesis (HLS) tools, for example, Xilinx Vivado, could be employed. They are promising, especially when a system specification model is available. However, compared with a hand-crafted design, HLS results are typically less efficient. Furthermore, for the case of MoG targeted at Full-HD resolution, HLS tools are unable to meet the timing requirements for the 148 MHz clock frequency. As a result, we chose the hand-crafted approach using Verilog HDL to capture our RTL model.
Deriving from our specification model, which captured the coarse-level parallelism (Fig. 3), we iteratively refine the MoG RTL model to meet the timing properties of the design. The first stage of RTL design is a direct translation of the specification model into a behavioural RTL model. The initial RTL has only three pipeline stages with parallel execution of Gaussian updates and operates at only 9 MHz (far away from the target of 148 MHz). To realise Full-HD MoG processing, we explore three different optimisations: (i) algorithm tuning; (ii) operation width sizing; and (iii) deep pipelining. At the algorithm tuning stage (i), we modify the algorithm to match it better to parallel HW execution. As an example, we replace ranking and sorting with parallel checking of Gaussian components. At the operation width sizing stage (ii), we identify the optimal quality/width point for high-latency arithmetic (SQRT, divide and multiply). At the deep pipelining stage (iii), we further break down individual coarse-level pipeline stages into finer micro-pipeline stages to meet the timing requirements. Each optimisation has first been validated for its quality impact in the system-level specification before RTL realisation. This allows quick evaluation of the benefits (ruling out inefficient attempts), and keeps the top-level specification in sync with the realisation. The final RTL model includes 77 pipeline stages operating at a 148.5 MHz clock frequency.
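Deep pipelining trades latency for clock rate: each pixel takes more cycles to traverse the kernel, but a new pixel can enter every cycle. The short calculation below illustrates this for the final RTL figures, under the assumption (not stated explicitly in the text) that the pipeline accepts one pixel per clock cycle and with blanking intervals ignored.

```python
STAGES = 77
CLOCK_HZ = 148.5e6

# Time for one pixel to traverse all pipeline stages, in microseconds.
latency_us = STAGES / CLOCK_HZ * 1e6

# Assumed: one pixel is accepted per clock cycle once the pipeline is full.
throughput_pixels_per_s = CLOCK_HZ

# Active pixels per second for 1920 x 1080 at 30 Hz (blanking ignored).
required_pixels_per_s = 1920 * 1080 * 30

print(f"pipeline latency  ~ {latency_us:.3f} us")
print(f"throughput margin ~ {throughput_pixels_per_s / required_pixels_per_s:.2f}x")
```

The per-pixel latency of roughly half a microsecond is negligible against the frame time, which is consistent with the earlier remark that MoG computation is not latency sensitive.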
Communication realisation
The specification-level optimisations (Section 4.1) identified options to tame the memory bandwidth into an implementable range. In addition, two main traffic types were observed: streaming traffic and algorithm-intrinsic traffic. Fig. 5 generalises these types for adaptive vision algorithms, showing the separation between streaming traffic and algorithm-intrinsic traffic. Streaming traffic is data under processing (pixels in the case of vision) and is typically read from input ports and written to output ports or system memory. Streaming traffic deals with the I/O of a module, independent of the algorithm selected for realising the functionality. Conversely, algorithm-intrinsic traffic stems from data used for realising the algorithm (e.g. kernel density histogram or Gaussian parameters). Although different algorithms may achieve the same functionality, they may use vastly different internal data structures, causing different algorithm-intrinsic traffic. For background subtraction, the streaming data consist of grey pixels and the BG/FG mask. The algorithm-intrinsic data for MoG are the Gaussian parameters. A different algorithm, for example, mean shift [23], which requires a complete history of N frames, would produce different algorithm-intrinsic traffic.
The algorithm-intrinsic traffic volume of many vision applications, particularly in adaptive vision algorithms, is dominant and dwarfs the streaming traffic (e.g. 60× in MoG or 8× in component labelling). This poses significant bandwidth and consequently power challenges, sometimes even rendering an implementation infeasible. In MoG, 65% of on-chip power and more than 90% of overall power (combined on-chip and off-chip memory access) is consumed by algorithm-intrinsic data accesses. Separating traffic types enables trading off quality for bandwidth independent of the streaming pixels. This requires identifying the traffic in the specification, as well as architecture support. Therefore the underlying hardware architecture has to offer traffic separation optimising the memory bandwidth as well as managing system-level traffic based on the role and nature of data on computation.
Communication-centric architecture template:
Current related work lacks support for adaptive vision algorithms with significant algorithm-intrinsic traffic. Existing approaches either avoid this class of applications or intermix the traffic types, and are consequently limited to very small resolutions. To overcome this gap, we propose a communication-centric architecture template that (a) provides a framework for algorithm-intrinsic data access; (b) resolves data alignment between streaming and algorithm-intrinsic traffic; and (c) offers design options for trading bandwidth against quality. Fig. 6 outlines the essential components of our architecture template. It consists of two clock domains: a computation domain and a communication domain. The computation clock is driven by the streaming data (pixels) clocking the adaptive vision kernel. The communication clock is set by the bus/interconnect for accessing operational data. A different design choice is to unify both clock domains and re-time the input stream. However, the separation between computation and communication clock aids in efficiently managing the traffic of operational data access independently.
Streaming and algorithm-intrinsic data access are separated by using individual ports. Streaming data most efficiently enters the system through a system interface (e.g. HDMI from a camera). This avoids central processing unit interaction with the traffic. This also simplifies chaining across multiple instances, as streaming traffic can be directly forwarded without hitting the system memory. Our design uses this direct connection for receiving grey pixels from RGB2Gray model (see later introduced Fig. 8a ).
Using the system memory for algorithm-intrinsic data is unavoidable because of its volume (up to 248 MB for MoG). Dedicated DMA channels continuously read and write back algorithm-intrinsic data once per data frame. The DMA channels operate in parallel, but are synchronised to preserve the correct read-after-update sequence. The DMA channels connect through the system interconnect to the memory interface. They operate in circular mode (auto repeat) to restart with each frame independent of the host. This eliminates unnecessary synchronisation with the processor, freeing up computation cycles, for example, for a downstream application.
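The effect of circular mode can be pictured as an address generator that wraps at the end of the model region without any host intervention. The following sketch models only that wrap-around behaviour; the base address, region size and burst size are hypothetical, and no real DMA controller interface is implied.

```python
from itertools import islice


def circular_dma_addresses(base_addr, region_bytes, burst_bytes):
    """Yield burst start addresses forever, wrapping at the end of the region."""
    if region_bytes % burst_bytes != 0:
        raise ValueError("region must be a whole number of bursts")
    offset = 0
    while True:
        yield base_addr + offset
        # Auto repeat: return to the start of the model after the last burst.
        offset = (offset + burst_bytes) % region_bytes


# Hypothetical tiny region: four 64-byte bursts per "frame".
first_six = list(islice(circular_dma_addresses(0x1000_0000, 256, 64), 6))
```

In this picture, a write-back channel would run the same generator over the same region, trailing the read channel so that each parameter set is written only after it has been read and updated.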
By using separate ports, streaming and algorithm-intrinsic data are transferred in parallel, but need a tight alignment. With each incoming pixel, the corresponding pixel's model data (i.e. algorithm-intrinsic data) is required at the same time. Any misalignment will make the algorithm fail because it operates on the wrong model. Initial alignment is achieved by correctly configuring the DMA's start address (pointing to the model in memory). However, to guarantee continuous alignment, the model data for each new pixel needs to be delivered without any interruption. Our architecture template maintains alignment using asynchronous FIFOs that (i) bridge the clock domains and (ii) compensate for the burstiness of bus traffic.
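Why the FIFO depth matters can be seen in a toy simulation. The sketch below is deliberately simplistic and makes several assumptions not taken from the text: both clock domains are collapsed onto one time base, the bus delivers a whole burst at once at fixed intervals, a burst that does not fit is skipped until the next slot, and the kernel consumes exactly one model word per cycle. All parameter names are hypothetical.

```python
def simulate_fifo(depth, burst_len, burst_period, cycles, start_delay=0):
    """Return (underflows, peak_occupancy) for a bursty producer and a
    one-word-per-cycle consumer sharing a FIFO of the given depth."""
    occupancy = 0
    peak = 0
    underflows = 0
    for t in range(cycles):
        # Producer: a whole burst arrives at once, but only if it fits.
        if t % burst_period == 0 and occupancy + burst_len <= depth:
            occupancy += burst_len
        peak = max(peak, occupancy)
        # Consumer: one model word is needed per incoming pixel.
        if t >= start_delay:
            if occupancy > 0:
                occupancy -= 1
            else:
                underflows += 1  # pixel arrived without its model data
    return underflows, peak


deep_enough = simulate_fifo(depth=32, burst_len=32, burst_period=32, cycles=320)
too_shallow = simulate_fifo(depth=16, burst_len=32, burst_period=32, cycles=320)
```

With a FIFO at least as deep as one burst, the consumer never starves even though data arrives in bursts; with a shallower FIFO in this model, every pixel finds the FIFO empty, which corresponds to the misalignment failure described above.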
As a result of the findings in Section 4.1, precision adjustment blocks reduce the bandwidth requirement by resizing algorithm-intrinsic data (Gaussian parameters) before they are delivered to/from the vision algorithm. With a simple focus on the MSBs, their implementation is straightforward, yet has a profound effect on system performance.
In current heterogeneous systems, the host processor is responsible for controlling the vision accelerators, which imposes a considerable load on the host processor while potentially leading to low accelerator utilisation. To avoid this overhead, the architecture template employs a control unit (CU) to minimise or even eliminate the need for host processor interaction. The CU offers a set of memory-mapped registers (MMRs) to the outside for initialising and configuring the vision processing. In our architecture template, software is only responsible for initialising the DMAs and the CU's MMRs. After configuration/initialisation, the architecture template executes independently from the host processor over any number of frames. The CU's responsibilities include controlling data alignment between the streaming and algorithm-intrinsic data, quality adjustment in the precision units and keeping the read/write DMAs synchronised.
Our architecture template offers a set of configurable knobs (design choices) to the designer. The design choices are the asynchronous FIFO depth, the DMA inline buffers, the number of DMA channels, and the width and frequency of the communication bus/interconnect. The template also has a limitation: we assume the same width for the bus, the inline buffers and the FIFOs. Given a desired quality and bandwidth, as well as the interconnect parameters, the architecture template can be properly dimensioned.
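A first-order dimensioning check compares the peak rate of the bus interface with the algorithm-intrinsic traffic at the chosen precision. The sketch below uses the 256 bit, 125 MHz interface of the configuration reported in the experimental section and the 244 bits/pixel operating point; it assumes Full-HD at 30 Hz, one read and one write-back per pixel and frame, and no protocol overhead. Whether the reported configuration was dimensioned in exactly this way is not stated, so the resulting utilisation is an estimate only.

```python
def bus_peak_mb_per_s(width_bits, freq_hz):
    """Peak bus bandwidth in MB/s, one full-width transfer per cycle."""
    return width_bits / 8 * freq_hz / 1e6


def intrinsic_mb_per_s(width, height, fps, bits_per_pixel):
    """Read plus write-back traffic in MB/s for the per-pixel model."""
    return 2 * width * height * fps * bits_per_pixel / 8 / 1e6


peak = bus_peak_mb_per_s(256, 125e6)
need = intrinsic_mb_per_s(1920, 1080, 30, 244)
utilisation = need / peak

print(f"bus peak    ~ {peak:.0f} MB/s")
print(f"model need  ~ {need:.0f} MB/s")
print(f"utilisation ~ {utilisation:.0%}")
```

If the estimated utilisation exceeds one, either the bus must be widened or clocked faster, or the bits per pixel must be reduced at some cost in quality.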
System integration
We consider pairing of our architecture template with other streaming cores (either in SW or HW) to realise larger applications such as object tracking vision flow. This utilises combined strengths of high throughput low-power execution of compute-intense adaptive vision processing (e.g. MoG background subtraction) in hardware, while a processor offers top-level adaptive control and intelligence (e.g. for tracking objects across frames).
Traditionally, vision kernels have been implemented in HW as accelerators, that is, as a co-processor which is called from and synchronised by a host processor. Fig. 7a illustrates an MoG co-processor similar to [6]. In the co-processor arrangement, both streaming traffic and algorithm-intrinsic traffic occupy the interconnect and system memory. Pixels are received by the processor and forwarded to memory (In ⇒ Mem (1)). After being triggered by the processor, MoG reads the frame from memory (Mem ⇒ MoG (2)) and starts the background subtraction processing. On finishing the background subtraction, it writes back the FG mask to memory (MoG ⇒ Mem (3)). Finally, the FG mask is read by the processor for post-processing or directly forwarded to the output port (Mem ⇒ Out (4)). In total, four transfers are necessary (In ⇒ Mem ⇒ MoG ⇒ Mem ⇒ Out), buffered through the system memory. The cycle repeats for each frame. All transactions are scheduled by the processor, leading to high overhead and consequently an inefficient solution. Concurrently with streaming, algorithm-intrinsic data (Gaussian parameters) hits the system memory as well, creating contention and increasing bandwidth demands.
A more efficient solution is a peer-processor arrangement enabled by our architecture template (see Fig. 7b ). The architecture template has direct access to system I/O interfaces for input and output of data without requiring constant processor interaction. In addition, streaming nodes can be chained as illustrated by 'RGB2Gray'. Separating streaming and algorithm-intrinsic data allows keeping the streaming data on-chip without the costly memory interaction (and enables streaming node chaining). Only algorithm-intrinsic traffic (after precision adjustment based on quality constraints) hits the system memory. The host processor only performs first initialisation, after which 'RGB2Gray' and 'MoG' operate completely independently, eliminating the synchronisation overheads. Hence, more cycles remain available on the processor for higher-level processing.
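The difference between the two arrangements can be made concrete by listing the transfers of each and counting those that touch system memory. The sketch below is a simplified model: the transfer lists follow the description above, while the representation as tuples, the helper name `memory_transfers`, and the assumption of exactly one read and one write of Gaussian parameters per processing step are ours.

```python
from collections import Counter

# Each transfer is (source, destination, traffic kind).
CO_PROCESSOR = [
    ("In", "Mem", "streaming"),    # (1) pixels forwarded to memory
    ("Mem", "MoG", "streaming"),   # (2) MoG reads the frame
    ("MoG", "Mem", "streaming"),   # (3) MoG writes the FG mask
    ("Mem", "Out", "streaming"),   # (4) FG mask forwarded to output
    ("Mem", "MoG", "intrinsic"),   # Gaussian parameters read (assumed)
    ("MoG", "Mem", "intrinsic"),   # Gaussian parameters updated (assumed)
]

PEER_PROCESSOR = [
    ("In", "RGB2Gray", "streaming"),   # direct access to the input interface
    ("RGB2Gray", "MoG", "streaming"),  # chained streaming nodes, on-chip
    ("MoG", "Out", "streaming"),       # direct access to the output interface
    ("Mem", "MoG", "intrinsic"),       # Gaussian parameters read (assumed)
    ("MoG", "Mem", "intrinsic"),       # Gaussian parameters updated (assumed)
]


def memory_transfers(transfers):
    """Count transfers per traffic kind that involve system memory."""
    counts = Counter()
    for src, dst, kind in transfers:
        if "Mem" in (src, dst):
            counts[kind] += 1
    return counts


if __name__ == "__main__":
    print("co-processor:", dict(memory_transfers(CO_PROCESSOR)))
    print("peer-processor:", dict(memory_transfers(PEER_PROCESSOR)))
```

In this model the co-processor arrangement routes four streaming transfers through memory, while the peer-processor arrangement routes none; only the algorithm-intrinsic accesses remain.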
Experimental results
To demonstrate the benefits of our approach, this section describes an instance of our architecture template running MoG on a Xilinx Zynq platform. It evaluates the performance, power consumption and resource utilisation, and highlights advances over the closest related work.
MoG realisation on Zynq platform
Our realisation targets the Zynq-7000 XC7Z020-CLG484-1 SoC [24], which combines two ARM Cortex-A9 processor cores with programmable logic (Artix-7 FPGA). Processor cores and logic are interconnected through AXI and share I/Os as well as off-chip memory interfaces (DDR3, LPDDR2) with a peak bandwidth of 4.2 GB/s. We use an HDMI I/O FMC module for streaming the video into and out of the Zynq, and the Xilinx Vivado 2014.2 design suite to synthesise and implement our solution. The system processes a Full-HD stream (1920 × 1080 at 30 fps). The MoG operates on the pixel stream, receiving video input from one HDMI input interface and outputting foreground pixels to an HDMI output connected to a monitor. Our system can also operate at 1080p 60 Hz (supported by our 148 MHz MoG processing pipeline). However, because of DDR3 bandwidth limitations, operation is then limited to a window of interest (i.e. a subset of the frame) of 50%. In future work, we will improve the discretisation quality to increase the window size at constant quality.
Tables 2 and 3 list resource utilisation and on-chip power consumption for one instance of our architecture template configured to achieve 100% quality: 8 K FIFO depth, a 125 MHz, 256 bit wide bus interface and 148 MHz computation frequency (many other configurations are possible; e.g. see Fig. 9). The tables also compare with a closely related, recently published approach [16], which also targets MoG background subtraction at high resolutions (720p and 1080p). However, Ratnayake and Amer [16] have only shown simulation results without an actual FPGA implementation running in real-time (please see Section 2 for more details). As such, Ratnayake and Amer [16] do not address some system integration challenges, which makes a direct comparison of the approaches difficult. To ease comparison, we list, in addition to our complete solution, a version that does not include the DMAs for system integration.
Table 2 shows that the utilisation is dominated by the DMAs for system integration. Our implementation uses two DMAs: one two-channel DMA for Gaussian access and one DMA for the FG mask. They consume 8300 registers, 12 200 LUTs and 42 random access memory (RAM) blocks. Our system-integrated MoG uses 23 787 registers, 15 475 LUTs, 74 RAM blocks and 30 DSP slices. Ignoring the DMAs (MoG without system integration), our solution utilises far fewer resources. The approach in [16] stays at simulation level and does not consider DMAs (see Table 2 for its resource figures).
Table 3 presents the power consumption itemised per macro component. We used the Xilinx X-Power Analyser after synthesis and mapping of the entire design to the Zynq platform to determine power consumption. Our solution consumes 151.1 mW of on-chip power in total. MoG by itself consumes only 23% of the total, with 36 mW. The remaining power is attributed to communication components for accessing streaming and algorithm-intrinsic data (Gaussian parameters). The communication subsystem consumes 69% (105 mW) of on-chip power for handling the algorithm-intrinsic data. Within the communication components, DMA and FIFO have the highest contribution with 42 and 39 mW, respectively. The precision adjustment block, with its simple MSB selection, consumes only 0.1 mW yet has a tremendous effect on reducing the bandwidth. Table 3 also lists the power consumption reported in [16]. Note that Ratnayake and Amer [16] only provide a detailed power breakdown for 720p resolution. In contrast, we report 1080p results to demonstrate the potential. In both approaches, the major portion of power is consumed by system components (e.g. memory controller, I/Os, DMAs and interconnects), while the MoG kernel itself consumes only a few tens of milliwatts (36 mW in our solution, against 24 mW in [16]). Overall, our solution consumes less power while supporting a higher resolution (151.1 mW at 1080p compared with 596 mW at 720p).
It should be noted, however, that the Xilinx X-Power reports do not include the memory controller.
Resource utilisation and power consumption evaluation
To obtain an indication of total power consumption, we include an estimate of the off-chip power due to memory accesses. We base our estimate on measurements published in [25], which reports 40 pJ per transferred bit for an LPDDR2 memory interface. Combining on-chip power and off-chip power for memory accesses yields a total of 1246 mW for 1080p at 30 Hz. This is much lower than the total power of 1776 mW reported in [16] for 1080p. Although a direct comparison is difficult given the different assumptions and implementation stages, the results indicate a much better power efficiency of our approach.
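The structure of this estimate can be sketched as a short calculation: off-chip power is the memory traffic in bit/s multiplied by the energy per transferred bit. In the Python sketch below, the 40 pJ/bit figure and the 151.1 mW on-chip power come from the text, and the parameter size is taken as 51% of the 440 bits/pixel ground truth reported in the quality evaluation. The function name and the assumption of one read plus one write of the parameter set per pixel are ours, so the result should be read as an order-of-magnitude check rather than a reproduction of the reported number.

```python
def off_chip_power_w(bits_per_pixel, width=1920, height=1080, fps=30,
                     accesses_per_pixel=2, energy_pj_per_bit=40.0):
    """Estimate off-chip memory power in watts.

    accesses_per_pixel=2 assumes one read and one write of the
    Gaussian parameters per pixel (our assumption).
    """
    bits_per_second = bits_per_pixel * accesses_per_pixel * width * height * fps
    return bits_per_second * energy_pj_per_bit * 1e-12


if __name__ == "__main__":
    on_chip_w = 0.1511                  # on-chip power from Table 3
    compressed_bpp = 0.51 * 440         # traffic cut down to 51% of 440 bits/pixel
    total_w = on_chip_w + off_chip_power_w(compressed_bpp)
    print(f"estimated total power: {total_w:.2f} W")
```

With these inputs the model gives a total in the vicinity of 1.2 to 1.3 W, which is consistent in magnitude with the 1246 mW stated above.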
Quality evaluation
Our communication-centric architecture template allows the user to adjust the architecture configuration knobs (FIFO size, bus width and bus frequency) to achieve the required bandwidth and thus quality. Table 4 summarises some architecture configurations that support selected quality points. As the quality demand rises, the bus width or bus frequency has to increase as well to support the required bandwidth. In this way, the quality requirement dictates the bandwidth and thus impacts the system power.
To illustrate the quality to power correlation, Fig. 9 plots power consumption for selected quality levels. As indicated in Section 4.1.2, we use MS-SSIM [21] to quantify quality. It computes the quality by comparing the structural similarity of an image with a ground-truth image. The ground truth was obtained through a software implementation with maximal length Gaussian parameters (32 bits each), using five Gaussian components, yielding 440 bits/pixel in total. In fact, the software implementation is our MoG system specification captured in the C-based SpecC SLDL. Fig. 9 reports both on-chip power and off-chip power for memory access.
Overall, we observe a significant reduction in system power through Gaussian parameter compression. The power for off-chip memory accesses dominates and changes with quality. Even when maintaining 100% quality, the parameter compression reduces system power from an estimated 2.5 W to 1.2 W (note that operating with uncompressed parameter storage is infeasible, as it exceeds the memory interface's peak bandwidth). Gaussian parameter sizing while maintaining 100% quality cuts the traffic down to 51%, and the off-chip power is roughly halved as well. Fig. 9 shows that lowering the quality constraints reduces the volume of algorithm-intrinsic data and thus power consumption, since the power for off-chip memory accesses directly correlates with bandwidth. Relaxing the quality requirement to 90% already drops the total power to just below 1 W. Accepting a visible loss in quality, for example 70% (60%), dramatically reduces power to only 590 mW (443 mW). Overall, precision adjustment not only makes an HW implementation feasible (given the memory interface) but also significantly reduces power consumption. In contrast, on-chip power consumption stays mostly flat at about 151 mW and drops only marginally.
Fig. 9 Power/quality trade-off
To better illustrate the computation/communication challenges, we contrast against an MoG software (SW) approach. We have realised MoG in C and compiled it using GCC with −O3 and NEON optimisations enabled. One Cortex-A9 core at 666 MHz (part of the Zynq) takes 610 s (360 s) to process 60 frames in Full-HD (Half-HD). In other words, SW execution is 610× (360×) slower than real-time. Even if real-time execution could be achieved through massively parallel execution on 610 (360) A9 cores, power consumption would be unacceptably high. Assuming each core (with cache) consumes 0.6 W at 666 MHz [26], this SW approach would consume about 366 W (216 W). Even hand-crafted NEON SIMD optimisations cannot push this approach into a feasible solution envelope. An unrealistic 4× SIMD speedup would still need about 153 (90) Cortex-A9 cores, which exceeds the power constraints.
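The scaling argument above is simple arithmetic and can be checked with a few lines of Python. The slowdown factors (610× and 360×) and the 0.6 W per core are taken from the text; the function names and the idealised assumption of perfect parallel scaling across cores are ours.

```python
import math


def cores_for_real_time(slowdown, simd_speedup=1.0):
    """Cores needed for real-time, assuming perfect parallel scaling."""
    return math.ceil(slowdown / simd_speedup)


def total_power_w(cores, watts_per_core=0.6):
    """Aggregate power of the cores, using the per-core figure from [26]."""
    return cores * watts_per_core


if __name__ == "__main__":
    for label, slowdown in (("Full-HD", 610), ("Half-HD", 360)):
        for simd in (1.0, 4.0):
            n = cores_for_real_time(slowdown, simd)
            print(f"{label}, SIMD x{simd:g}: {n} cores, "
                  f"{total_power_w(n):.1f} W")
```

Even with the optimistic 4× SIMD factor, the estimate stays in the range of tens of watts, far above a 1 W budget.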
Comparison with SW and HW implementations
The proposed concept of precision adjustment can also be applied to SW. To quantify its effect, we have implemented our findings from the quality/bandwidth exploration in SW. Following Section 4.1.2, 230 bits/pixel (through precision adjustment) already achieve 100% quality. Applying the precision adjustment in SW reduces execution time from 610 to 340 s, yielding a 1.7× speedup. This illustrates that the ARM's memory interface is a contributing bottleneck. Overall, however, SW execution is still much slower than real-time.
Comparing against the closest approach with an HW implementation [6] reveals tremendous performance differences: [6] operates at 640 × 480, against 1920 × 1080 for our solution. Unfortunately, Appiah and Hunter [6] do not report power or quality, making comparison difficult. Nonetheless, the main approach of Appiah and Hunter [6] is to assume a constant standard deviation to reduce computation complexity and bandwidth. To estimate the effect of this simplification on quality, we implemented this approach at specification level. Assuming a constant standard deviation reduces bandwidth by 33%. However, the output quality drops to 40% in the MS-SSIM evaluation (the quality results vary based on the complexity of the scene under evaluation). In contrast, our proposed precision adjustment performs much better: it reduces bandwidth to half while still maintaining 100% quality. In addition to quality advantages, our solution consumes less power and is more scalable than [6], which uses a co-processor arrangement (similar to Fig. 7a). In contrast, our architecture template, thanks to its peer-processor mode (Fig. 7b), avoids memory access for streaming data and simplifies node-to-node streaming (see 'RGB2Gray'), improving scalability.
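To give an impression of what such a precision adjustment can look like in SW, the sketch below keeps only the most significant bits of a parameter before it is stored and zero-fills the dropped bits when it is loaded again. This is a minimal illustration of MSB selection, not our actual implementation: the unsigned fixed-point representation, the function names and the example widths (32 bits truncated to 16) are assumptions chosen for the example.

```python
def select_msbs(value, total_bits, keep_bits):
    """Keep the keep_bits most significant bits of an unsigned value."""
    if not 0 < keep_bits <= total_bits:
        raise ValueError("keep_bits must be in 1..total_bits")
    if not 0 <= value < (1 << total_bits):
        raise ValueError("value does not fit in total_bits")
    return value >> (total_bits - keep_bits)


def restore(stored, total_bits, keep_bits):
    """Expand a stored value back to total_bits, zero-filling the low bits."""
    return stored << (total_bits - keep_bits)


if __name__ == "__main__":
    original = 0xDEADBEEF                     # hypothetical 32 bit parameter
    stored = select_msbs(original, 32, 16)    # only 16 bits go to memory
    recovered = restore(stored, 32, 16)
    print(hex(stored), hex(recovered), hex(original - recovered))
```

The truncation error is bounded by the weight of the dropped bits, so the number of retained bits per parameter directly trades memory traffic against precision.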
Conclusions
Adaptive vision algorithms, which use machine-learning principles, are powerful and attractive for tackling complex vision tasks. However, they pose tremendous challenges for an embedded implementation (both in terms of processing and communication). Current approaches operate on low resolutions and/or make quality-impeding assumptions.
This paper introduces an approach to manage the immense communication demands of adaptive vision algorithms, and paves the path towards their embedded realisation. We introduce the key insight of separating streaming traffic from algorithm-intrinsic traffic. Streaming traffic refers to the functional inputs/outputs (e.g. image pixels). Conversely, algorithm-intrinsic traffic refers to data accesses needed by the algorithm itself, for example, for updating a machine-learning model, and is implementation dependent.
Separating these types enables application-specific traffic optimisation (e.g. compression, topology and prioritisation). We also proposed a communication-centric architecture template which takes advantage of the traffic separation and simplifies embedded realisation of adaptive vision algorithms.
We have demonstrated the benefits of traffic separation based on a MoG background subtraction. We introduced a lossy compression (precision adjustment) of its algorithm-intrinsic data: the accesses to Gaussian parameters. The lossy compression dramatically reduces traffic, and creates a trade-off between communication bandwidth and quality. For the case of MoG, we demonstrated a bandwidth reduction down to 51% without loss in quality (as quantified by MS-SSIM [21] ).
Utilising the architecture template and parameter compression, we demonstrated the benefits of our approach by realising MoG background subtraction operating on a 1920 × 1080 30 Hz stream in real-time. Mapped to the programmable logic of the Zynq-7000, it consumes 151 mW of on-chip power; including off-chip power for memory access, the total is 1246 mW. With this, our solution operates at higher resolution, consumes less power and yields much higher quality than the next comparable HW solution.
More generally, the quality/bandwidth trade-off has mainly been studied in the context of streaming data to save communication bandwidth in networked systems. This paper argues for shifting attention towards algorithm-intrinsic data to make complex algorithms (such as adaptive vision algorithms) implementable in hardware. The paper also opens new research avenues, for example, studying more complex compression schemes to maintain quality with even further reduced bandwidth. This would enable processing higher resolutions on memory-bandwidth-constrained systems.
