A Highly Integrated Hardware/Software Co-Design and Co-Verification Platform
Shufan Yang
University of Glasgow
Zheqi Yu
University of Wolverhampton
Editor's note:
This article presents a platform for hardware/software co-design and co-verification with a flexible hardware/software interface. The platform has been applied to verification of a pedestrian tracking application to demonstrate its effectiveness.
-Wen Chen, NXP Semiconductors NV
The current evolution in the semiconductor industry makes it possible to integrate significantly more complex modules onto system-on-chip design platforms. Traditional verification relies on software-based simulation or physical verification, both of which offer limited expansion and reuse capabilities [1]. However, architecture-level design space exploration in hardware/software co-design and co-evaluation based on FPGA-SoCs offers sufficient performance to run application software on a hardware system orders of magnitude faster than software simulators. FPGA-SoCs allow engineers to explore a high level of integration with quick and convenient co-verification, which helps them determine whether both software and hardware execute correctly on microcontroller processors and customized hardware.
FPGA-SoC platforms are still in the early stage of development [1]. The difficulty of integrating software and hardware lies in creating a flexible interface between the hardware implementation and the software algorithms. Because such integration has a long learning curve for engineers, design productivity suffers even when an advanced design tool is used.
In this article, we present a buffering scheme and memory mapping interfaces for a hardware/software co-design and co-verification platform. A verification process for image and video multimedia products with a universal scalable codec platform has been evaluated on this platform. The platform also supports architecture-level verification for software/hardware co-design, allowing design optimization before an embedded design is committed to chip fabrication.
Related work
FPGAs are commonly used in image and video multimedia applications to ensure that vision systems can run in real time [1]. Those applications have been reported with carefully considered hardware complexity and limited hardware/software co-verification methods.
IEEE Design&Test
General Interest
to meet the requirement of image data transferring among individual image processing modules. Image segmentation applications have been implemented using a ZYNQ chip. However, their work is directly coded in VHDL with limited configurable features to wrap a user logic function; configuring CIDA also requires experienced hardware designers. A software-and-hardware integrated design has been demonstrated by Cerezuela-Mora et al. [4], but their work is limited in that it relies on an Advanced eXtensible Interface (AXI)-stream bus interface to transmit image data in DDR system memory. Like the work of Mefenza et al. [2], it might be incompatible with the different DDR memory settings of various boards.
Although several image processing systems (PSs) can be used for real-time image processing, most current work uses a highly optimized hardware implementation that leaves little flexibility to add features. In addition, no work has considered providing a platform for hardware/software co-verification, since hardware and software are built independently with plug-and-debug methods. We present a novel framework to integrate universal video codecs with Linux-based drivers for communicating with FPGA logic. Taking advantage of the Xilinx Vivado high-level synthesis (HLS) flow, an architecture-level test bench can be integrated with embedded software and customized hardware. Developers can migrate our framework into a sophisticated image and video multimedia PS to accelerate time to market (TTM) and improve product readability.
Principles and system architectures
In this article, we developed an AXI-wrapped buffer driver and a memory mapping scheme that allow hardware and software to be designed simultaneously in an integrated flow. This method also gives designers the flexibility to integrate hard IP blocks into systems using a video streaming method. An integrated co-evaluation platform is also introduced to help our multimedia system run on the real-time hardware platform. Figure 1 shows our system overview and prototyping environment (Xilinx ZC702 board). The electrical I/O interconnections between the FPGA IP modules and the AXI system buses for both the PS and the programmable logic (PL) portion of the system platform are shown in Figure 1a. As illustrated there, a dedicated DMA controller (one of the IP cores provided by Xilinx [7]) provides memory-mapped access to the DDR memory through the AXI_MM2S and AXI_S2MM buses, where MM2S stands for "memory-mapped access to streaming access" and S2MM stands for "streaming access to memory-mapped access." This DMA controller also transfers data from the FPGA side to one of the ARM Cortex-A9 cores through the AXI-Lite bus. The AXI streaming buses, AXIS_MM2S and AXIS_S2MM, can source a continuous stream of image data into the Cortex-A9 with a configurable block size [8]. An extra AXI bus between the AXI data buffer and the Cortex-A9 is used for transferring data from the Cortex-A9 to a High-Definition Multimedia Interface (HDMI) display module for video playback.
Our device driver handles the interrupts that signal the commencement of VDMA transfers between FPGA buffers and user space, which enables the hardware IP to run as a peripheral within an operating system. The Linux kernel version used for our published source code is 3.10.0. The open-source, multiplatform U-Boot software [12] is used to load the operating system for the ARM Cortex-A9 processor, with a C++-enabled tool chain and virtual memory management support.
An HDMI monitor is used to play back the postprocessed video. Figure 1b shows a person walking through a corridor, with a green bounding box marking the pedestrian detected in the current frame.
System overview
Compared to the traditional bare-metal hardware system designs, this architecture integrates a software development cycle, including kernel configuration and compilation, boot loader, and finally the generation of the hardware description file using HLS steps with architecture-level co-verifications.
Universal codec for reconfigurable SoC systems
To enable a co-design system between hardware and software, a particular memory mapping scheme is also developed to enable soft multimedia decoding. 
The V4L2 framework and the FFmpeg library, both actively maintained by a software development community [9], are used to build a universal codec platform on reconfigurable SoCs. In this article, we use FFmpeg 3.1 to support more than 90 encoders. FFmpeg was ported from the x86 architecture to the ARM architecture with a cross-compiler tool based on GNU automakefile. All I/O routines and encoding/decoding are handled efficiently using off-the-shelf components from the V4L2 API [9]. Figure 2a shows how memory mapping is used to access the video direct memory access (DMA) directly. The system call mmap() maps the video encoding application onto the physical location of the video DMA. This increases computational efficiency, since the ARM processors do not need to copy data into kernel space. Other control signals are configured through the system call ioctl(). The Linux kernel function dma_alloc_coherent allocates a chunk of memory and binds the physical address of the buffer to its kernel virtual address; this buffer provides data to the user-space encoder running on a Cortex-A9 processor. The memory mapping is applied in the user program to correspond with the AXI streaming bus in the DMA, with transmission rates ranging from 4096 × 16 bits to 4096 × 64 bits at a time.
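The zero-copy idea behind the memory mapping can be illustrated with a minimal user-space sketch. It is not the driver code of this platform: a temporary file stands in for the video DMA device node, and the names map_buffer and mmap_roundtrip_demo as well as the 4096-byte length are assumptions made for illustration. The sketch only shows that bytes written through a shared mapping become visible on the other side without an intermediate copy by the application.

```c
#define _XOPEN_SOURCE 700   /* for fileno, ftruncate, pread */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map len bytes behind fd as a shared read/write region; NULL on failure.
 * With a real driver, fd would come from open() on the device node. */
static void *map_buffer(int fd, size_t len)
{
    assert(len > 0);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Stand-in demo: a temporary file plays the role of the DMA buffer.
 * Returns 0 if data written through the mapping is seen via the descriptor. */
static int mmap_roundtrip_demo(void)
{
    const size_t len = 4096;              /* assumed buffer size */
    FILE *f = tmpfile();
    if (!f)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, (off_t)len) != 0) {
        fclose(f);
        return -1;
    }

    unsigned char *buf = map_buffer(fd, len);
    if (!buf) {
        fclose(f);
        return -1;
    }
    memset(buf, 0xA5, len);               /* producer writes into the mapping */
    msync(buf, len, MS_SYNC);             /* flush the mapping to the file */

    unsigned char check[16];
    ssize_t n = pread(fd, check, sizeof check, 0);
    int ok = (n == (ssize_t)sizeof check) &&
             check[0] == 0xA5 && check[15] == 0xA5;

    munmap(buf, len);
    fclose(f);
    return ok ? 0 : -1;
}
```

With an actual driver, the descriptor would refer to the driver's device node and the length would match the buffer size that the driver exports.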
Processes in application user space write data into the write buffers at the source end and call the kernel-space driver to start the DMA, which then transmits the data into the FPGA-connected AXI FIFO. For the reverse direction, i.e., to read data from the FPGA, the driver is first called to start the DMA, which reads the FPGA data and writes them into the read buffer. Using this strategy, the main program can evaluate the memory requirement and dynamically allocate buffer space to facilitate data transfer between the programmable logic (FPGA) and the PS (Cortex-A9).
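The write and read paths described above can be approximated by a small software model. This is a sketch under stated assumptions, not the driver itself: the FIFO depth of 64 words and the names fifo_model, dma_mm2s, and dma_s2mm are hypothetical, and the real transfers are performed by the DMA hardware, not by loops on the processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_WORDS 64u            /* assumed depth of the modelled AXI FIFO */

typedef struct {
    uint32_t data[FIFO_WORDS];
    size_t head;                  /* index of the oldest word */
    size_t count;                 /* number of words currently stored */
} fifo_model;

/* Models "start DMA, memory to stream": move up to n words from the
 * user-space write buffer into the FIFO. Returns the words accepted. */
static size_t dma_mm2s(fifo_model *f, const uint32_t *wr_buf, size_t n)
{
    assert(f && wr_buf);
    size_t moved = 0;
    while (moved < n && f->count < FIFO_WORDS) {
        f->data[(f->head + f->count) % FIFO_WORDS] = wr_buf[moved++];
        f->count++;
    }
    return moved;
}

/* Models "start DMA, stream to memory": drain up to n words from the
 * FIFO into the user-space read buffer. Returns the words delivered. */
static size_t dma_s2mm(fifo_model *f, uint32_t *rd_buf, size_t n)
{
    assert(f && rd_buf);
    size_t moved = 0;
    while (moved < n && f->count > 0) {
        rd_buf[moved++] = f->data[f->head];
        f->head = (f->head + 1) % FIFO_WORDS;
        f->count--;
    }
    return moved;
}
```

The return values let the caller detect a full or empty FIFO, which is the information a main program would need in order to size its buffers.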
Hardware/software co-evaluation platform
In addition to our new driver buffering scheme and memory mapping approach in soft multimedia decoding, we also develop a verification flow for the integrated multimedia PS (as shown in Figure 3). We used Xilinx Vivado HLS (2015.4) to generate a hardware description for the pedestrian detection module based on the histogram of oriented gradients (HOG) method [7]. The behavioral Verilog test bench that wraps around the HOG design top level is also provided by Vivado. This test bench provides clocking and reset stimulus to the HOG design top level and runs simulations, which is useful for becoming familiar with the signaling on the FPGA core modules by observing simulation waveforms. Once the behavior of the HOG design is satisfactory, a register-transfer-level (RTL) hardware description is generated. At the next stage, a behavioral model with a C-wrapper interface is automatically generated for fast system verification to explore the feasibility of the HOG algorithm in terms of throughput and latency, which we refer to as architectural-level verification (as shown in Figure 1). This C-wrapped test bench can work with all simulation outputs, from behavioral RTL through post-implementation timing. After the HOG-based pedestrian detection module has been solidified through the Vivado physical-implementation flow, it can be integrated into the design using IP integrator (available in the Xilinx Vivado tools). Finally, the image processing results are observed on a monitor connected through the HDMI interface of the ZC702 board.
A case study of pedestrian tracking
Pedestrian tracking algorithms
We implement video pedestrian tracking based on an HOG algorithm to demonstrate the functionality of our verification platform. A feature descriptor generated by the HOG is used to describe local object appearance and shape [10]. Following the calculation of the local histograms of the image, a normalization process is used to reduce the impact of changes introduced by varying illumination and shadowing in the image. Once the HOG descriptors are calculated, they are fed into a classification system based on a supervised learning algorithm.
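The effect of the normalization step can be sketched as follows. The text does not state which norm is used, so this example assumes an L1 norm for simplicity; the function name hog_normalize_l1 and the small constant eps are likewise assumptions. After normalization, a block of histogram bins and a uniformly brighter copy of it (all bins scaled by the same gain) yield nearly the same descriptor, which is the illumination robustness the normalization aims for.

```c
#include <assert.h>
#include <stddef.h>

/* L1-normalize a block of histogram bins in place:
 * v[i] <- v[i] / (sum_j |v[j]| + eps).
 * eps avoids division by zero for empty (all-zero) blocks. */
static void hog_normalize_l1(float *v, size_t n, float eps)
{
    assert(v && eps > 0.0f);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += (v[i] < 0.0f) ? -v[i] : v[i];
    for (size_t i = 0; i < n; ++i)
        v[i] /= (sum + eps);
}
```

Other choices, such as an L2 norm with clipping, follow the same pattern with a different denominator.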
In this article, we use a support vector machine (SVM) as a baseline classifier. The SVM is trained with the INRIA training set and the classifier is based on AdaBoost. The regularization coefficient of the SVM is an important parameter, since it controls the degree of over-learning. A small coefficient allows a large separation margin between classes, which reduces over-learning and improves generalization. In our experiments, we chose 1.5 as the coefficient value.
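The role of the regularization coefficient can be seen in the standard soft-margin SVM formulation, given here as background; the article does not state its exact formulation, so the notation is an assumption. Here x_i is a HOG descriptor, y_i ∈ {−1, +1} is its label, ξ_i are slack variables, and C is the regularization coefficient (1.5 in our experiments):

```latex
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
```

A small C makes margin violations cheap, so the optimizer favors a small ‖w‖ and hence a wide margin 2/‖w‖; a large C penalizes violations heavily and fits the training data more tightly.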
Hardware implementation platform
The pedestrian tracking algorithms are implemented on a Zynq-7000 device, which contains an Artix-7 FPGA and an ARM Cortex-A9 processor on the same chip. In our hardware/software co-design platform, a small part of the computation (color to gray) is performed on the FPGA, while pedestrian detection and SVM-based classification are run on one of the ARM processor cores; the other processor core is used for universal codec functions.
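A color-to-gray conversion of the kind offloaded to the FPGA can be written in plain C that is amenable to HLS. The following is a minimal sketch, not the module used on the platform: the fixed-point weights 77, 150, and 29 (a common integer approximation of the BT.601 luma coefficients), the packed RGB layout, and the function names are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-point luma: Y = (77*R + 150*G + 29*B) >> 8.
 * The weights sum to 256, so white maps to 255 and black to 0. */
static uint8_t rgb_to_gray(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((77u * r + 150u * g + 29u * b) >> 8);
}

/* Convert a packed RGB frame (3 bytes per pixel) to 8-bit grayscale. */
static void rgb_frame_to_gray(const uint8_t *rgb, uint8_t *gray, size_t pixels)
{
    assert(rgb && gray);
    for (size_t i = 0; i < pixels; ++i)
        gray[i] = rgb_to_gray(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
}
```

Integer multiplies and a shift avoid floating-point hardware, which is one reason this kind of per-pixel kernel maps well onto programmable logic.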
The HOG-related processing module (as shown in Figure 2a) is generated by the Vivado HLS software. Although this is a computationally inexpensive function, we aim to demonstrate the facility of exploring the design space for any given image processing application and of increasing reusability. HOG and SVM-based pedestrian detection used only 29% of the LUTs and 14% of the BRAM on the Xilinx Zynq-7000 chip. Note that the hardware utilization forecast by Vivado HLS will change after RTL (generated hardware description language) generation; hence, a margin must be kept in the choice of hardware device to allow for design optimization.
Experiment setup and evaluation
The performance of our system is evaluated by measuring the encoding rates and the precision rates of pedestrian tracking tasks. All timing and profiling measurements are made at the ARM Cortex-A9 clock frequency of 800 MHz. An HOG algorithm-based pedestrian tracking system was implemented using a Xilinx ZYNQ-7000 chip. For the evaluation, we adopted the most popular pedestrian tracking data sets to evaluate our hardware/software co-design platform [11]. Three typical real-life scene sequences with four video codecs are used to evaluate encoding speed and detection precision for pedestrian tracking. The details of the image data sets can be found in Appendix A. The three video sequences with four video formats were chosen to test real-life scenarios with video resolutions ranging from QVGA (320 × 240) to VGA (640 × 480).
Although Wu et al. [11] proposed a benchmark for visual pedestrian tracking that covers precision rates, success rates, and robustness, in this article we focus on testing real-time image processing performance instead of tracker performance. Therefore, we only analyzed the precision and success rates. The precision rate is the ratio of frames whose average center location error, relative to the ground truth from the visual tracker benchmark [11], is below a threshold. The success rate is the ratio of successful frames, compared to the ground truth, as the threshold is varied from 0 to 1.
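The two metrics can be sketched in C as follows. This is a minimal sketch under assumptions: a frame is counted as successful when the overlap ratio (intersection over union) of the predicted and ground-truth boxes exceeds the threshold, as is common in tracking benchmarks; the box layout and all function names are hypothetical. The center error is compared in squared form so that no square root is needed.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, w, h; } box;   /* top-left corner, width, height */

static float min_f(float a, float b) { return a < b ? a : b; }
static float max_f(float a, float b) { return a > b ? a : b; }

/* Intersection over union of two axis-aligned boxes, in [0, 1]. */
static float overlap_ratio(box a, box b)
{
    float iw = min_f(a.x + a.w, b.x + b.w) - max_f(a.x, b.x);
    float ih = min_f(a.y + a.h, b.y + b.h) - max_f(a.y, b.y);
    float inter = (iw > 0.0f && ih > 0.0f) ? iw * ih : 0.0f;
    float uni = a.w * a.h + b.w * b.h - inter;
    return (uni > 0.0f) ? inter / uni : 0.0f;
}

/* Squared distance between the box centers, in pixels squared. */
static float center_error_sq(box a, box b)
{
    float dx = (a.x + 0.5f * a.w) - (b.x + 0.5f * b.w);
    float dy = (a.y + 0.5f * a.h) - (b.y + 0.5f * b.h);
    return dx * dx + dy * dy;
}

/* Fraction of frames whose center error is within thr_px pixels. */
static float precision_rate(const box *pred, const box *gt, size_t n, float thr_px)
{
    assert(pred && gt && n > 0);
    size_t hits = 0;
    for (size_t i = 0; i < n; ++i)
        if (center_error_sq(pred[i], gt[i]) <= thr_px * thr_px)
            ++hits;
    return (float)hits / (float)n;
}

/* Fraction of frames whose overlap ratio exceeds thr_overlap (0 to 1). */
static float success_rate(const box *pred, const box *gt, size_t n, float thr_overlap)
{
    assert(pred && gt && n > 0);
    size_t hits = 0;
    for (size_t i = 0; i < n; ++i)
        if (overlap_ratio(pred[i], gt[i]) > thr_overlap)
            ++hits;
    return (float)hits / (float)n;
}
```

Sweeping thr_px or thr_overlap over a range of values and plotting the returned rates produces precision and success curves.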
In these experiments, we compared three approaches: a software-only approach on a single ARM Cortex-A9 processor, the dual ARM Cortex-A9 processors approach, and the FPGA co-design approach. In Figure 4, the ARM15 curves indicate the experiments using a single ARM Cortex-A9 core, the Dcore curves indicate the experiments using dual ARM Cortex-A9 processors, and the Codcore curves indicate the hardware/software co-design approach for pedestrian tracking. Overall, the best result among the three experiments was from the gym sequence (Figure 4c). The pedestrian objects in the video sequences blur_body, dance2, and gym are 10% smaller than the detection window (64 pixels × 128 pixels). Note that the results for the gym and dance2 sequences have the same rate of increase on all three platforms. The precision rates of the FPGA co-design platform (legend Codcore15 in Figure 4) are better than those of the single ARM Cortex-A9 system (legend ARM15 in Figure 4) and the dual ARM Cortex-A9 processors (legend dcore15 in Figure 4) due to the advantage of FPGA hardware acceleration. The resulting success plots for the three scenes are shown in Figure 4d-f, which show that our implementation adapts to changes in both scale and aspect ratio.
Taking advantage of the HLS implementation, further optimization of memory utilization can be explored. The performance of the different video encoders is presented in Figure 5; DMA block transfer provides a further 2× speedup compared to the initial version of our FPGA co-design approach. Our measurements were carried out with four popular video codec formats: H.263/MPEG-4 Part 2, H.264/MPEG-4 AVC, Microsoft codec, and Google (On2) VP8 codec. The average decoding frame rate FPS_avg is computed as FPS_avg(n) = f_cpu/c_frame(n). Here, f_cpu is the CPU frequency, and c_frame(n) denotes the average decoding cycles per VGA frame. The frame rate reported in Figure 5 is relatively stable at 30 frames/sec, which meets the need of real-time image processing.
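As an illustrative calculation (derived only from the 800 MHz clock frequency and the 30 frames/sec figure stated in this article, not from a measurement), the frame rate relation implies a cycle budget per VGA frame:

```latex
\mathrm{FPS}_{\mathrm{avg}}(n) = \frac{f_{\mathrm{cpu}}}{c_{\mathrm{frame}}(n)}
\quad\Longrightarrow\quad
c_{\mathrm{frame}}(n) \le \frac{800 \times 10^{6}\ \text{cycles/s}}{30\ \text{frames/s}}
\approx 2.67 \times 10^{7}\ \text{cycles per frame}.
```

In other words, sustaining 30 frames/sec on one 800 MHz core leaves roughly 26.7 million cycles for each frame.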
A comparison of TTM between an event-based simulator and our co-design and co-evaluation platform shows a fivefold speedup, since our framework offers sufficient performance to run complete application software directly on top of the targeted real-time operating system on the ARM Cortex-A9 and the FPGA PL.
With a rapidly changing system-on-chip design industry, fast verification is needed to enable higher productivity for embedded system products. A hybrid FPGA-SoC co-verification architecture is an effective solution, giving designers the benefit of early-stage hardware/software co-verification and integration between hardware and software, and it can additionally be used for direct testing in the real environment.
Since the main concern of this article was the feasibility and scalability of the verification platform for image and video multimedia processing, our C-like program was not compatible with standard C compilers. Because the C language is not intrinsically concurrent, a specially designed HLS compiler is required. We also did not use a hard-wired codec for maximum performance, as it is important that our platform efficiently alleviates the design cost of video codec functions by using an interface to elegantly share video processing needs. Based on our current framework, any new codec can be added to the platform with minimal effort.
Nevertheless, this project presents a universal codec-enabled verification platform with a software and hardware co-designed framework. We used a real-time case study to demonstrate the capability of evaluating real-time image processing algorithms using open-access image data sets. While doing so, we also overcame the problem of memory mapping from the ARM Cortex-A9 processor cores to FPGA buffers and successfully ported video streaming drivers into the Linux V4L2 multimedia framework based on a Xilinx open-source Linux operating system. Although we only demonstrated pedestrian tracking as a case study, any other embedded vision algorithm can be run on this platform, since the interfaces between general POSIX software interfaces and the hardware accelerator are well defined. In this project, we also demonstrated how to take advantage of POSIX-standard Linux software, with its vast number of software libraries, together with traditional hardware to speed up TTM for system-on-chip designs.
Appendix A: Video data sets
