Video resolutions used in a variety of media are constantly rising. While manufacturers struggle to perfect their screens, it is also important to ensure high quality of the displayed image. Overall quality can be measured using the Mean Opinion Score (MOS). Video quality can be affected by miscellaneous artifacts appearing at every stage of video creation and transmission. In this paper, we present a solution that calculates four distinct video quality metrics and can be applied in a real-time video quality assessment system. Our assessment module is capable of processing 8K resolution in real time at 30 frames per second. Its throughput of 2.19 GB/s surpasses the performance of pure software solutions. To concentrate on architectural optimization, the module was created using a high level language.
Introduction
Nowadays, in addition to traditional Quality of Service (QoS), Quality of Experience (QoE) poses a real challenge for Internet audiovisual service providers, broadcasters and new Over-The-Top (OTT) services. The churn effect is linked to QoE impact; the end-user satisfaction is a real added value in this competition. However, QoE tools should be proactive and innovative solutions that are well adapted to new audiovisual technologies. Therefore, objective audiovisual metrics are frequently dedicated to monitoring, troubleshooting, investigating, and setting benchmarks of content applications working in real-time or off-line.
The so-called Full-Reference (FR), Reduced-Reference (RR) and No-Reference (NR) quality metrics are used for models standardized according to International Telecommunication Union - Telecommunication Standardization Sector (ITU-T) Recommendations. Most of the models have some limitations as they were usually validated using one of the following hypotheses: frame freezes last up to two seconds; there is no degradation at the beginning or at the end of the video sequence; there are no skipped frames; video reference is clean (no spatial or temporal distortions); there is minimum delay supported between video reference and video (sometimes with constant delay); and up or down-scaling operations are not always taken into account [1].
Table 2: Synthesis of FR, RR and NR Mean Opinion Score (MOS) models (based on [1, 8]).
Resolution   FR          RR          NR
HDTV         J.341 [6]   n/a         n/a
SDTV         J.144 [2]   n/a         n/a
VGA          J.247 [3]   J.246 [4]   n/a
CIF          J.247 [3]   J.246 [4]   n/a
QCIF         J.247 [3]   J.246 [4]   n/a
In the past, metrics based on three historical video artifacts (blockiness, jerkiness, blur) were sufficient to provide an efficient predictive result. Consequently, most models are based on measuring these artifacts for producing a predictive MOS. In other words, the majority of the algorithms generating the predicted MOS show a mix of blur, blockiness, and jerkiness metrics. The weighting between each of these Key Performance Indicators (KPIs) could be a simple mathematical function. If one of the KPIs is not correct, the global predictive score is completely wrong. Other KPIs are usually not taken into account (exposure time distortion, interlacing, etc.) in predicting MOS [1] .
The ITU-T has been working on KPI-like distortions for many years (please refer to [9] for more information). The history of the recommendations is shown in Tab. 1, while metrics based on video signal only are shown in Tab. 2, both based on [1] .
Related research in [10] addresses measuring multimedia quality in mobile networks with an objective parametric model [1] .
ITU-T Study Group 12 (SG12) is currently working on modeling standards for multimedia and Internet Protocol Television (IPTV) based on bit-stream information. Q14/12 work group is responsible for the projects provisionally known as non-intrusive parametric model for assessment of performance of multimedia streaming (P.NAMS) and non-intrusive bit-stream model for assessment of performance of multimedia streaming (P.NBAMS) [1].
P.NAMS utilizes packet-header information (e.g., from IP through MPEG2-TS), while P.NBAMS also uses the payload information (i.e., coded bit-stream) [11] . However, this work focuses on the overall quality (in MOS units), while monitoring of audio-visual quality by key indicators (MOAVI) is focused on KPIs [1] .
Most of the recommended models are based on global quality evaluation of video sequences as in the P.NAMS and P.NBAMS projects. The predictive score is correlated to subjective scores obtained with global evaluation methodologies (SAMVIQ, DSCQS, ACR, etc.). Generally, the duration of video sequences is limited to 10 or 15 s in order to avoid the forgiveness effect (the observer is unable to score the video properly after 30 s, and may give more weight to artifacts occurring at the end of the sequence). When one model is deployed for monitoring video services, the global scores are provided for fixed temporal windows and without any acknowledgment of the previous scores [1] .
Generally, the time needed to process such metrics is long even when a powerful machine is used. Hence, measurement periods have been short and never extended to longer periods. As a result, the measurements miss sporadic and erratic audiovisual artifacts.
The concept proposed here, partly based on the framework for the integrated video quality assessment published in [12] , is able to isolate and focus investigation, set up algorithms, increase the monitoring period and guarantee better prediction. Depending on the technologies used in audiovisual services, the impact of QoE can change completely. The scores are separated for each algorithm and preselected before the testing phase. Then, each KPI can be analyzed by working on the spatially and/or temporally perceived axes. The classical metric cannot provide pertinent predictive scores with certain new audiovisual artifacts such as exposure distortions. Moreover, it is important to detect the artifacts as well as the experience described and detected by the consumers. In real-life situations, when video quality of audiovisual services decreases, the customers can call a helpline and describe the annoyance and visibility problems; they are not required to provide a MOS.
There are many possible reasons for video disturbance, and they can arise at any point along the video transmission chain (from the filming stage to the end-user stage). The main concern of the authors of this paper is an efficient hardware implementation of the proposed solution. This is addressed using hardware development techniques that decrease latency and increase throughput of the system, which is a challenging task partially covered in the following papers [13] [14] [15] [16] [17].
Related work
Automated video quality assessment has been addressed in many papers in previous years. Ligang Lu et al. [18] presented a no-reference solution for MPEG video streams that measures quantization error and blocking effect. Their solution showed positive correlation with other methods. However, because of the technology available at the time of publication, their system throughput is far from modern requirements. Marcelo de Oliveira et al. [19] successfully implemented the Levenberg-Marquardt method on low-end platforms using VHDL. They showed that hardware implementation results maintain a strong correlation with the software solution, despite reduced precision due to the use of fixed-point arithmetic. Neborovski et al. [20] implemented field-offset detection, blurring and ringing measurements in a Field-Programmable Gate Array (FPGA). Their language of choice was Verilog; using a platform based on Virtex 4, they achieved real-time processing for fullHD resolutions.
Video quality assessment
This paper addresses the challenging task of building a module capable of accelerating the metrics computations. The designed module produces a video quality assessment in real time for each video frame. The following four metrics were implemented in hardware: (i) blockiness, (ii) exposure time distortion, (iii) blackout, (iv) interlace distortion.
The choice of the metrics was driven by their performance and hardware implementation feasibility.
The authors designed and implemented a single module for all four metrics. Such an approach enables sharing of hardware units among the metric architectures and boosts the overall throughput of the video quality assessment module.
Blockiness and the exposure metrics are presented in [21, 22] , respectively. This section presents an overview of all the metrics and the algorithms used in this work. Notation used in equations is presented in Tab. 3.
Blocking
Blocking is caused by independence of calculations for each block in the image. Since many compression algorithms divide frames into blocks, this is one of the most common and visible artifacts. Because of the coarse quantization, the correlation among blocks is lost, and horizontal and vertical borders appear. Another reason might be a resolution change, when a small picture is scaled up to be displayed on a larger screen.
Blockiness metric used in this work is based on [23]. This metric assumes constant block size, which was chosen to be 8 × 8 pixels. Metric value depends on two factors:
(i) InterSum is a sum of the absolute differences between pixels located on the border of two neighboring picture blocks, Eq. (1).
(ii) IntraSum is a sum of the absolute differences between pixels located directly next to the neighboring pixel of the picture block, Eq. (2).
Table 3: Notation used in the equations.
Symbol       Description
BLX          number of horizontal blocks in a frame
BLY          number of vertical blocks in a frame
sortMeanBL   ordered sequence of the average luminance of blocks
sortSumBL    ordered sequence of the luminance sums calculated for each block
Computing scheme of InterSum and IntraSum is depicted in Fig. 2, along with the pixel numeration scheme. b_{x,y}(i) used in Eq. (1) and (2) means the i-th pixel of block (x, y). Blockiness metric is the ratio of IntraSum to InterSum, as presented by Eq. (3).
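The two sums and their ratio can be illustrated with a short software sketch. Eq. (1) and (2) are not reproduced here, so the exact pixel pairing is an assumption: for every block border the inter pair straddles the border and the intra pair consists of the two pixels just before it. The function names `border_sums` and `blockiness` are hypothetical and do not come from the module's source code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK 8  /* constant block size assumed by the metric */

/* Accumulates InterSum and IntraSum over all block borders of an 8-bit
 * luminance frame stored row by row.
 * Assumed pairing (not taken from Eq. (1)-(2)): for a border between
 * positions k-1 and k, the inter pair is (k-1, k) and the intra pair is
 * (k-2, k-1). */
static void border_sums(const uint8_t *luma, int width, int height,
                        uint64_t *inter_sum, uint64_t *intra_sum)
{
    assert(width % BLOCK == 0 && height % BLOCK == 0);
    *inter_sum = 0;
    *intra_sum = 0;

    /* vertical borders: horizontally adjacent pixels */
    for (int y = 0; y < height; ++y)
        for (int x = BLOCK; x < width; x += BLOCK) {
            const uint8_t *p = luma + (size_t)y * width + x;
            *inter_sum += (uint64_t)abs(p[-1] - p[0]);
            *intra_sum += (uint64_t)abs(p[-2] - p[-1]);
        }

    /* horizontal borders: vertically adjacent pixels */
    for (int y = BLOCK; y < height; y += BLOCK)
        for (int x = 0; x < width; ++x) {
            const uint8_t *p = luma + (size_t)y * width + x;
            *inter_sum += (uint64_t)abs(p[-width] - p[0]);
            *intra_sum += (uint64_t)abs(p[-2 * width] - p[-width]);
        }
}

/* Blockiness as the ratio IntraSum / InterSum.
 * Returning 1.0 when InterSum is zero is an arbitrary choice made here. */
static double blockiness(const uint8_t *luma, int width, int height)
{
    uint64_t inter, intra;
    border_sums(luma, width, height, &inter, &intra);
    return inter == 0 ? 1.0 : (double)intra / (double)inter;
}
```

Under this pairing, a frame made of flat blocks with different luminance yields a ratio of zero (all change sits on the borders), while a smooth ramp yields a ratio of one.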
Exposure time distortions
Exposure time distortions are visible as an imbalance in brightness (frames that are too dark or too bright). They are caused by an incorrect exposure time, or recording a video without a sufficient lighting device. It is also possible to cause this distortion by improper digital enhancement. Various exposure levels for the same image are presented in Fig. 3 . Histograms of luminance for each of those images are presented in Fig. 4 .
Mean brightness of the darkest and brightest parts of the image is calculated in order to detect the distortion. Exposure metric is presented in Eq. (4), where L_d, Eq. (5), represents the three darkest blocks and L_b, Eq. (6), represents the three brightest blocks.
The results of the metrics mentioned above were mapped to the Mean Opinion Score (MOS). The thresholds were referred to the MOS scale, determining the score below which each distortion is noticeable.
Blackout
Blackout manifests as the picture disappearing, i.e. a black screen. It appears when all packets of data are lost, or as a result of incorrect video recording. Image blackout detection is independent of the frame color, i.e. the detection result is positive (equals '1') if the frame has a uniform color, otherwise the result is '0'. Comparison of all the pixels of the frame under consideration seems to be the most straightforward approach. However, this is a brute-force method which requires n comparisons, where n is the number of pixels within the frame. The authors came up with an alternative method which utilizes partial results of the exposure time distortion method. This results in a significant reduction of the metric implementation cost.
The novel metric description: A frame is split into blocks of 8 × 8 pixels. Sum of the luminance is calculated for every block. If the difference between the block of the highest luminance and the lowest is lower than the thBlout threshold the detection result equals '1', otherwise it is '0'; thBlout is set to a constant four. 
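The description above translates almost directly into software. The following is a minimal sketch in plain C rather than the module's Impulse C code; the function name `blackout` and the macro `TH_BLOUT` are hypothetical, and the frame is assumed to be an 8-bit luminance plane whose dimensions are multiples of eight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 8
#define TH_BLOUT 4u  /* thBlout, the constant threshold given in the text */

/* Returns 1 when the frame is (nearly) uniform, 0 otherwise. */
static int blackout(const uint8_t *luma, int width, int height)
{
    assert(width > 0 && height > 0);
    assert(width % BLOCK == 0 && height % BLOCK == 0);

    uint32_t min_sum = UINT32_MAX, max_sum = 0;

    for (int by = 0; by < height; by += BLOCK)
        for (int bx = 0; bx < width; bx += BLOCK) {
            /* luminance sum of one 8 x 8 block */
            uint32_t sum = 0;
            for (int y = 0; y < BLOCK; ++y)
                for (int x = 0; x < BLOCK; ++x)
                    sum += luma[(size_t)(by + y) * width + bx + x];
            if (sum < min_sum) min_sum = sum;
            if (sum > max_sum) max_sum = sum;
        }

    /* positive detection when brightest and darkest block differ
     * by less than the threshold */
    return (max_sum - min_sum) < TH_BLOUT;
}
```

Because only the extreme block sums are compared, the same running minimum and maximum that the exposure metric needs can be reused, which is the saving the text refers to.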
Interlace
Interlace is a technique where a single frame is a composition of two half-frames, each of which contains half of the information. Odd half-frames contain odd rows of pixels, while even half-frames contain even rows of pixels. Resulting frame is created by interlacing both of them. The idea of interlace is presented in Fig. 5 . Interlace distortion becomes visible when two half-frames are not properly aligned. It is especially visible for videos including motion.
The authors proposed their own solution for interlace distortion metric. It is calculated independently for each micro 4 × 4 pixel block and then subsequently combined into a complete metric. Given block is marked as a block with interlace distortion if change of luminance of the first row relative to the second row is in the same direction for all pixel pairs, change between second and third is in opposite direction, and change between third and fourth is in the same direction. All comparisons are presented in Fig. 6 .
Eq. (8) determines if a given block has interlace distortion, where d_{i,j} is the j-th difference between luminance values of the i-th micro block. Eq. (9) calculates the metric value for the whole frame. Fig. 7 illustrates detection of interlace in a sample frame. The effect is the most visible in shapes containing sharp vertical lines. Presented solution shows positive results.
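The per-block rule can be sketched in plain C as follows. Eq. (8) and (9) are not reproduced here, so two points are assumptions: both alternating patterns (brighter-darker-brighter and the reverse) are accepted, and the frame value is taken as the count of flagged micro blocks. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MICRO 4  /* micro block size: 4 x 4 pixels */

/* Returns 1 if the row-to-row luminance change alternates direction
 * (rows 1->2, 2->3, 3->4) consistently in all four columns. */
static int micro_block_interlaced(const uint8_t *blk, int stride)
{
    int up_down_up = 1, down_up_down = 1;
    for (int x = 0; x < MICRO; ++x) {
        int r0 = blk[x];
        int r1 = blk[stride + x];
        int r2 = blk[2 * stride + x];
        int r3 = blk[3 * stride + x];
        up_down_up   &= (r0 < r1) && (r1 > r2) && (r2 < r3);
        down_up_down &= (r0 > r1) && (r1 < r2) && (r2 > r3);
    }
    return up_down_up || down_up_down;
}

/* Counts the micro blocks of a frame flagged as interlaced
 * (assumed aggregation; Eq. (9) may normalize differently). */
static uint32_t interlace_metric(const uint8_t *luma, int width, int height)
{
    assert(width % MICRO == 0 && height % MICRO == 0);
    uint32_t count = 0;
    for (int by = 0; by < height; by += MICRO)
        for (int bx = 0; bx < width; bx += MICRO)
            count += (uint32_t)micro_block_interlaced(
                luma + (size_t)by * width + bx, width);
    return count;
}
```

Each micro block needs three comparisons per column, twelve in total, and no data from neighboring blocks.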
High level hardware design - tools and methodology
The module was implemented using the Impulse C language. Impulse C is a high level language based on the Streams-C compiler, which was created at the Los Alamos National Laboratory in the 1990s. The idea evolved into a corporation named Impulse Accelerated Technologies Company (2002), which is now a supporting vendor of Impulse C and holder of the Impulse C rights. The main intention of the language designers was to bridge the gap between hardware and software and facilitate the process of system level design. It was achieved through abstracting most of the language constructs, so the designer can focus on the algorithm, rather than low level details of the implementation [24].
There is a whole set of high level languages such as Dime-C, SystemC, Handel-C, Mitrion C available nowadays, which enable specification and implementation of the system at the module level. However, most of them introduce their own structures (e.g. Mitrion C), expanding or modifying existing standards of high level languages. On the one hand, such an approach helps to establish a design space by imposing a strict language expression set. On the other hand, designers have to comprehend a whole range of the language structures, along with their appropriate application schemes, which may be pretty tedious. Such an extra effort is justified in the case of people, who expect to use the tool for a reasonably long time (professional digital logic designers). Unfortunately, most of the FPGA High Level Language (HLL) users are people familiar with programming languages (e.g. C, C++, Java, Fortran), who need to port some part of their application into hardware. Therefore, it seems reasonable to leverage one of the well-known standards such as ANSI C. Moreover, ANSI C allows for access to low level details of an application, which is very useful in some cases. It can be said, that C gives the lowest possible level of abstraction among the high level languages. The aforementioned ideas prevailed in the design of Impulse C.
There are several features of the Impulse C language, which in the authors' view are superior to other currently used HLLs. First of all, Impulse C allows designers to reuse their HDL code by providing mechanisms, which facilitate the incorporation of existing modules. Furthermore, three different architectures are supported: combinational, pipelined and asynchronous, which cover a complete range of existing design scenarios. Secondly, C compatibility makes it easy for software engineers to switch from General Purpose Processor (GPP) programming to FPGA design, as well as providing a platform for software-hardware integration within one design environment. Finally, Impulse C comes with a range of Platform Support Packages (PSPs), which provide a communication interface between FPGA and GPP computational nodes. Furthermore, PSPs usage provides portability of an application across different platforms. In fact, PSPs are packs of files that describe a system's profile to the Impulse C compiler [25] . The compiler uses this information to generate interface components needed to connect hardware processes to a system bus and interconnect them together inside the FPGA, and also to establish the software side of any software/hardware stream, signal, memory, etc. connections [25] [26] [27] .
The language enables both fine-grain and coarse-grain parallelism. The former is implemented within a process, whereas the latter is built as multiple-process structures.
It is worth noting, that algorithm partitioning must be handled by a programmer: this stage is not automated by the compiler, which means that it is up to the designer to classify different sections of an application. However, due to portability of a code, it is possible to migrate between hardware and software sections, if adequate language structures are employed. With respect to this, it is recommended to avoid using language constructs which confine a given part of the code to software or hardware solely.
A designer should keep a number of control signals and branches low, since the primary goal of HLL FPGA algorithm implementation is to increase throughput at the expense of latency (trade latency for throughput). Using control signals may compromise this effort and should make a designer rethink a concept for the architecture.
The Impulse C compiler automatically generates test benches, software-hardware interfaces and synthesisable HDL code; it automatically finds parallel structures in the code as well. However, it is good coding practice to explicitly point out sections which are to be parallelized. Both hardware and software parts of the code can be compiled with GNU Compiler Collection (GCC).
Impulse C can be characterized as a stream-oriented, process based language. Processes are main building blocks interconnected using streams to form an architecture for the desired hardware module. From the hardware perspective, processes and streams are hardware modules and First In, First Out (FIFO) registers, respectively. The Impulse C programming model is based on the Communicating Sequential Processes model [27] and is illustrated in Fig. 8 . Every process must be classified as a hardware or a software process.
It is the programmer's responsibility to ensure the interprocess synchronisation. Like most of the HLLs, Impulse C does not provide access to the clock signal, which relieves the designer from implementing cycle synchronization procedures. However, it is possible to attach HDL modules and synchronize them at the level of RTL using clock signal.
FPGA-based platform
The module was implemented on Pico M503 platform [28] , connected through PCIe to server with Intel i7-950 processor and 12 GB of RAM. The Pico platform (Fig. 9) consists of two components:
(i) EX-500 board with a Gen2 PCI-Express controller, which enables connecting up to six FPGAs to the motherboard
(ii) M503 FPGA boards [28]
Communication between a CPU and the FPGA is realized with eight lanes of the PCIe interface, which provide full-duplex connection streams. If more than two boards are used, the throughput is limited to 5 GB/s; in case of using only one board the maximum throughput reaches about 3 GB/s. Another limitation is the width of the stream, which is equal to 128 bits.
Impulse C implementation of the module
This section is a description of implementation of hardware version of the video quality module. Subsection 6.1 shows a general concept of the module, next the description is divided into two parts according to two parts of projects in Impulse C language: software (6.2) and hardware (6.3). 
Architecture of the module
The block diagram of the video quality assessment module is shown in Fig. 10 . It consists of three subblocks:
(i) Producer - reads video data from a file and sends it to the vqFPGA block using the InputStream
(ii) vqFPGA - reads data from the InputStream, executes video quality metrics and sends results to the Consumer process using the OutputStream
(iii) Consumer - reads data from the OutputStream, analyses it and sends it to the standard output stream.
Width of the Input and the Output stream is 128 bits, which is the maximum width of a Pico M503 platform stream. The scheme described above is parallelized sixfold: in the real module there are six producer, six vqFPGA and six consumer processes.
Every Impulse C project is composed of a software and hardware part, and so is the video quality assessment module.
Software part of the module
Software part is composed of three functions: producer, consumer and the main program function, which is used to launch the FPGA-based accelerator and all the application-related threads. It is also responsible for programming the FPGA with a bit file. Producer function opens the input stream to the FPGA and sends pre-read video data. Pico module input stream is 128 bits wide, thus it is recommended to organize the data in chunks of that size so that the best possible throughput is achieved. Every 8 × 8 block is divided into four microblocks. Every microblock contains 16 values, eight bits each. Such a structure allows for sending a whole block in four bus clock cycles, retaining data consistency. Described scheme is presented in Fig. 11.
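The data organization described above can be illustrated with a small plain C sketch. The ordering of the four microblocks and of the bytes inside each one is defined by Fig. 11, which is not reproduced here; the quadrant order used below (top-left, top-right, bottom-left, bottom-right, each stored row by row) is therefore an assumption, and `microblock_t` and `pack_block` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* One 128-bit stream word: sixteen 8-bit luminance values. */
typedef struct {
    uint8_t v[16];
} microblock_t;

/* Splits an 8 x 8 block (row-major, 64 bytes) into four 4 x 4 microblocks.
 * Assumed order: top-left, top-right, bottom-left, bottom-right. */
static void pack_block(const uint8_t block[64], microblock_t out[4])
{
    for (int m = 0; m < 4; ++m) {
        int row0 = (m / 2) * 4;  /* first row of this quadrant    */
        int col0 = (m % 2) * 4;  /* first column of this quadrant */
        for (int r = 0; r < 4; ++r)
            memcpy(&out[m].v[r * 4], &block[(row0 + r) * 8 + col0], 4);
    }
}
```

With this layout one block occupies exactly four stream words, and every word already holds a complete 4 × 4 microblock, so a consumer that works on microblocks does not have to reassemble data across words.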
Consumer function manages the module output stream. At the end of every video frame, a valid results frame is received. Its size is also fixed to 128 bits, as this fits the hardware best. A special structure of the results frame was designed, as presented in Fig. 12. The frame contains the results of calculation of blackout, exposure and interlace distortion metrics. The last part of computing the blockiness metric is performed in software, thus the frame contains the required InterSum and IntraSum.
Hardware part of the module
Hardware part is composed of vqFPGA modules and the additional hardware which handles data fetching and sending results to the software part. Hardware part is equipped with two data streams corresponding to software streams, which are opened before the data transfer is conducted and closed once it is finished. Hardware module requires the information about video resolution to be sent in advance of the actual stream. Every 128-bit word is then arranged into a microblock. Afterwards, data is sent to the parts of hardware responsible for computing each metric. The module registers are reset after all the microblocks of a given frame are processed and a new frame comes in. The maximum number of combinational stages between registers was experimentally determined as 64 and implemented with the CO SET stageDelay Impulse C pragma. This also requires using the CO PIPELINE pragma, which implements the pipelined design approach.
Blockiness metric
For the blockiness metric, only the most computationally demanding parts were implemented in hardware. InterSum and IntraSum are calculated inside the FPGA, while the final division is done in software. As presented in Fig. 2, calculations require data from neighboring blocks, and storing all necessary data inside the FPGA would be very inconvenient. Therefore, the authors modified the data sending scheme to make it more suitable for blockiness metric calculation. The first row and first column are omitted and block boundaries are shifted as presented in Fig. 13. After such operations, all data necessary for InterSum and IntraSum calculations are available in a single block. The source code presented in Fig. 14 shows the hardware implementation of the blockiness metric. Due to the efficient data serialization, the module is implemented with few lines of code, which also results in low hardware resources consumption. It is worth noting that the source code reflects the operations described by Eq. (3).
Exposure time distortions metric
The metric is composed of three steps. In the first one, a luminance mean value of every code block is calculated. Then, six extreme values for every frame are found (three smallest and three biggest). The extreme values are used to compute the mean value.
Several modifications were introduced to adapt the metric to hardware implementation. The size of each block is constant; therefore, a sum of values may be used instead of the mean. This eliminates the division operation in the mean calculation, which is very resource demanding. It is only performed for the border blocks. The fractional part may be disregarded as of little importance. Without changing the algorithm, the mean may be computed for eight results (four biggest and four smallest). This enables the use of a bit shift operation (shift right by two bits) instead of a very hardware-expensive division.
Sum of luminance values is stored in a blockSum variable. The extreme blocks are searched for (Fig. 15) and the sums of their luminance values are stored in the following variables: blockSumMAX1, blockSumMAX2, blockSumMAX3, blockSumMAX4 and blockSumMIN1, blockSumMIN2, blockSumMIN3, blockSumMIN4.
Listing from Fig. 15 (update of the four smallest block sums):
```c
/* Insert blockSum into the ascending list MIN1 <= MIN2 <= MIN3 <= MIN4. */
if (blockSum < blockSumMIN4) {
    if (blockSum < blockSumMIN3) {
        if (blockSum < blockSumMIN2) {
            if (blockSum < blockSumMIN1) {
                /* new smallest value: shift everything down by one */
                blockSumMIN4 = blockSumMIN3;
                blockSumMIN3 = blockSumMIN2;
                blockSumMIN2 = blockSumMIN1;
                blockSumMIN1 = blockSum;
            } else {
                blockSumMIN4 = blockSumMIN3;
                blockSumMIN3 = blockSumMIN2;
                blockSumMIN2 = blockSum;
            }
        } else {
            blockSumMIN4 = blockSumMIN3;
            blockSumMIN3 = blockSum;
        }
    } else
        blockSumMIN4 = blockSum;
}
```
Result of the metric is a mean of the pixel luminance from the extreme blocks, i.e. all the blockSumMAX and blockSumMIN values are summed up and the result is shifted right by nine bits (nine because 2^9 = 512 = 8 * 64; eight is the number of extreme blocks and 64 is the number of pixels within a single block). In order to prevent data range overflow (co_uint16 is used), each datum is shifted right by two bits and the result is subsequently shifted by the remaining seven bits. microBlock and blockSumMAX are reset after all the data results are sent to the software part of the module. blockSumMIN is set to 16 384 before the next frame is taken from the input.
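The overflow argument can be checked with a short plain C sketch. It is not the module's code: `uint16_t` stands in for co_uint16, and the function name `exposure_mean` and its array parameters are hypothetical. Since a block sum is at most 64 * 255 = 16 320, eight pre-shifted terms add up to at most 8 * 4080 = 32 640, which fits in 16 bits.

```c
#include <stdint.h>

/* Mean luminance of the eight extreme blocks using shifts only.
 * max_sums / min_sums hold the four largest and four smallest 8 x 8
 * block sums (each at most 64 * 255 = 16320). */
static uint16_t exposure_mean(const uint16_t max_sums[4],
                              const uint16_t min_sums[4])
{
    uint16_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        /* pre-shift by 2 so that the eight-term sum fits in 16 bits */
        acc = (uint16_t)(acc + (max_sums[i] >> 2));
        acc = (uint16_t)(acc + (min_sums[i] >> 2));
    }
    /* remaining 7 of the 9 bits: 2^9 = 512 = 8 blocks * 64 pixels */
    return (uint16_t)(acc >> 7);
}
```

Shifting each term before the addition drops its two lowest bits, so the result may differ slightly from an exact division by 512; this matches the remark in the text that the fractional part is of little importance.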
Blackout metric
Blackout metric is implemented as four lines of Impulse C code (Fig. 16) . The module comprises one adder/subtractor and one comparator. The metric result is sent to the software part of the module as a single bit set to '1' in OutputStream which indicates that blackout occurred.
Interlace distortion metric
The way data is structured and transferred between hardware and software part is presented in subsection 6.2. It is adapted to this particular metric and improves performance of the module. A single microblock is sent and the interlace distortion detection is conducted just by examining IS_INTERLACE and IS_INTERLACE2 (Fig. 17) conditions.
If one of those conditions is met, the result of the metric is incremented by one. The maximum possible value of the result is the number of microblocks in a frame, which determined the choice of the variable type used to store it, i.e. co_uint32. After all the microblocks of the frame are received, the variable is reset. The module is composed of 12 interconnected comparators which form a single huge XNOR gate. In addition, the module comprises an adder and a 32-bit shift register for the interlaceMetric variable.
Experimental results
Several experiments were conducted to determine the performance of the module. Fig. 18 presents the performance of both the hardware and software implementations of the video quality assessment module for a variety of resolutions, starting from the very low resolutions QVGA (320 × 240) and VGA (640 × 480), through fullHD (1920 × 1080), to the UHD resolutions 4K (4096 × 2160) and 8K (7680 × 4320). Due to the variety of display aspect ratios, for 4K and 8K we chose power of two values to determine the aspect ratio. Resulting image was around 1 % larger than popular 16 : 9.
The green line indicates a real-time processing performance, assuming that the video is streamed at 30 frames-per-second rate. Hardware version of the video quality assessment module is capable of processing 8K in real-time. Fig. 19 presents acceleration results as a function of video resolution. It is worth noting that the resolution has a direct impact on the size of a single chunk of data sent over InputStream, which in turn affects transfer rate and the overall processing time.
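As a rough plausibility check of the real-time threshold, the input data rate can be estimated from the resolution and the frame rate. The sketch below assumes one 8-bit luminance sample per pixel, which is an assumption made here (chroma and stream overhead are ignored); the function name is hypothetical.

```c
#include <stdint.h>

/* Bytes per second needed to stream frames in real time, assuming one
 * 8-bit luminance sample per pixel (chroma is ignored). */
static uint64_t required_bytes_per_second(uint32_t width, uint32_t height,
                                          uint32_t fps)
{
    return (uint64_t)width * height * fps;
}
```

Under this assumption, 8K (7680 × 4320) at 30 frames per second requires 995 328 000 bytes per second, i.e. just under 1 GB/s, which is below the 2.19 GB/s throughput reported for the module.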
The acceleration (Fig. 19) is a speed-up achieved by a hardware solution compared to the software one. Tab. 4 presents hardware resources consumption of registers (#reg) and lookup tables (#lut) in Pico platform as a function of a number of vqFPGA modules implemented, as well as corresponding throughput achieved. 
Summary
Presented solution is capable of simultaneously calculating four distinct video quality assessment metrics on a single video stream. The hardware platform used allowed for real-time processing of 8K resolution. A solution based on CPU only did not meet real-time requirements for resolutions higher than fullHD. The whole project was implemented using Impulse C, a high level language that significantly reduced design time, facilitated the system integration process and enabled architectural optimization, which boosted the overall performance of the solution. Some improvements can still be made: more metrics can be added, and, due to low resource utilization, more parallel modules can be implemented inside the FPGA, which could further speed up calculations. If the theoretical highest throughput of the Pico M503 platform (3 GB/s) were reached, it would allow processing of 16K resolution at 24 fps, which is the minimum that can be considered real time. Moreover, because the Impulse C language allows for seamless moving of the design between different platforms for which a Platform Support Package is provided, the presented solution can utilize more efficient hardware to achieve even better results.
