
    Exploration Experiments in Wavelet Video Coding

    This document defines the Exploration Experiments (EEs) to be conducted by the VidWav AHG until the Bangkok meeting. It was agreed that all participants in the EEs would use the common software platform, which is already available on the CVS server provided by RWTH. During the Poznan meeting, tools were identified as potentially useful, but the associated experiments were not completed in time for this meeting. These previous EEs are therefore being extended until the next meeting.

    Description of Testing in Wavelet Video Coding

    The provided testing scenario is intended for evaluating the coding performance of the JSVM coding scheme provided by the JVT against the Exploration Wavelet Video Coding platform. The reference conditions have been derived from the JSVM testing conditions document JVT-P205, produced at the 16th JVT meeting in Poznan; further details on the JSVM configuration can be found in that document.

    The rate points to be used in all tests are derived from the combined scalability test sets and are similar to the rate points originally applied in the Munich and Palma testing conditions. The target rates are arranged systematically: for each spatio-temporal resolution, the rate triples from the lowest rate point to the highest. The settings are chosen such that an AVC-compatible base layer can be provided for the base layer of combined and spatial scalability, and for the QCIF sequences in the SNR scalability scenario.

    The encoder will generate embedded bit-streams. To extract the various bit-rates, frame-rates and resolutions specified in the tables below from the embedded bit-stream, an extractor shall be used; no transcoding is allowed when decoding at the various bit-rates, frame-rates and resolutions.

    The combined scalability settings address a broad range of scalability and are motivated by the requirements of applications such as surveillance, broadcast and storage systems. This scenario is represented by the test sequences CITY, CREW, HARBOUR and SOCCER. Visual tests shall take place for the rates shown in green in the tables below for each spatial resolution, and for the Palma test at the rate points tested at Palma. PSNR curves shall be provided for all other rate points.
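    The extraction requirement above (sub-streams pulled from one embedded bit-stream, with no transcoding) can be illustrated with a toy sketch. The packet fields (`bytes`, `res`, `fps`) and the greedy byte-budget policy are illustrative assumptions, not the actual VidWav or JSVM bitstream syntax.

```python
def extract(packets, target_rate, target_res, target_fps):
    """Toy extractor for an embedded scalable stream (hypothetical format).

    packets: list of dicts with 'bytes', 'res' and 'fps' fields, ordered by
    importance as the encoder emitted them. Extraction simply keeps packets
    that match the requested operating point until the byte budget is
    exhausted; the stream is never re-encoded, mirroring the no-transcoding
    rule in the test conditions.
    """
    out, budget = [], target_rate
    for p in packets:
        if p['res'] <= target_res and p['fps'] <= target_fps:
            if p['bytes'] <= budget:
                out.append(p)
                budget -= p['bytes']
    return out
```

    Because extraction is pure packet selection, decoding any operating point needs only this pass over the embedded stream.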

    Proposed Status Report on Wavelet Video Coding Exploration

    Current 3-D wavelet video coding schemes with Motion Compensated Temporal Filtering (MCTF) can be divided into two main categories. The first performs MCTF on the input video sequence directly in the full-resolution spatial domain, before any spatial transform, and is often referred to as spatial-domain MCTF. The second performs MCTF in the wavelet subband domain generated by a spatial transform and is often referred to as in-band MCTF. Figure 1(a) shows a general framework which can support both schemes. First, a pre-spatial decomposition can be applied to the input video sequence. Then a multi-level MCTF decomposes the video frames into several temporal subbands, such as temporal highpass subbands and temporal lowpass subbands. After temporal decomposition, a post-spatial decomposition is applied to each temporal subband to further decompose the frames spatially. In this framework, the complete spatial decomposition for each temporal subband is thus separated into two parts: pre-spatial decomposition operations and post-spatial decomposition operations. The pre-spatial decomposition can be void for some schemes and non-empty for others. Figure 1(b) shows the case of the t+2D scheme, where the pre-spatial decomposition is empty. Figure 1(c) shows the case of the 2D+t+2D scheme, where the pre-spatial decomposition is usually a multi-level dyadic wavelet transform. Depending on the result of the pre-spatial decomposition, the temporal decomposition performs different MCTF operations, either in the spatial domain or in the subband domain. (a) The general coding framework; (b) case for the t+2D scheme (pre-spatial decomposition is void); (c) case for the 2D+t+2D scheme (pre-spatial decomposition exists). Figure 1: Framework for 3-D wavelet video coding.
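    As a concrete illustration, one level of the temporal decomposition described above can be sketched as a lifting-based Haar MCTF. The sketch below is a minimal special case: it assumes zero motion (so the motion-compensation step is omitted) and the function names are illustrative, not part of the reference software.

```python
import numpy as np

def haar_mctf_level(frames):
    """One level of motion-free Haar MCTF via lifting.

    frames: list of 2-D arrays (an even number of frames).
    Returns (lowpass, highpass) lists of temporal subband frames.
    """
    low, high = [], []
    for a, b in zip(frames[0::2], frames[1::2]):
        h = b - a            # predict step: temporal highpass subband
        l = a + h / 2.0      # update step: temporal lowpass subband
        high.append(h)
        low.append(l)
    return low, high

def haar_mctf_inverse(low, high):
    """Invert one MCTF level, exactly recovering the original frames."""
    frames = []
    for l, h in zip(low, high):
        a = l - h / 2.0      # undo update
        b = h + a            # undo predict
        frames.extend([a, b])
    return frames
```

    Applying `haar_mctf_level` recursively to the lowpass list yields the multi-level temporal subband hierarchy; the lifting structure guarantees perfect reconstruction.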
    A first classification of SVC schemes, according to the order in which the spatial and temporal wavelet transforms are performed, was introduced in the first Scalable Video Models [1], [2] on the basis of the Call for Proposals responses at the Munich meeting. The so-called t+2D scheme (one example is [3]) first performs an MCTF, producing temporal subband frames; the spatial DWT is then applied to each of these frames. Alternatively, in a 2D+t scheme (one example is [4]), a spatial DWT is applied first to each video frame and the MCTF is then performed on the spatial subbands. A third approach, named 2D+t+2D, uses a first-stage DWT to produce reference video sequences at various resolutions; t+2D transforms are then performed on each resolution level of the resulting spatial pyramid. Each scheme has evidenced its pros and cons [5,6] in terms of coding performance. From a theoretical point of view, the critical aspects of the above SVC schemes mainly reside: i) in the coherence and trustworthiness of the motion estimation at various scales (especially for t+2D schemes); ii) in the difficulty of compensating for the shift-variant nature of the wavelet transform (especially for 2D+t schemes); and iii) in the performance of inter-scale prediction (ISP) mechanisms (especially for 2D+t+2D schemes). An analysis of the differences between the schemes is also reported in the sequel.
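    The difference between the t+2D and 2D+t orderings is purely the order in which the two transforms are composed. The sketch below makes that concrete with toy one-level Haar transforms (no motion compensation, 1-D horizontal spatial split only); all names are illustrative assumptions, not the actual model software.

```python
import numpy as np

def spatial_haar(x):
    """1-D horizontal Haar split (toy stand-in for a full 2-D DWT)."""
    a, b = x[..., 0::2], x[..., 1::2]
    return (a + b) / 2.0, (a - b) / 2.0

def temporal_haar(frames):
    """Haar transform along the time axis (motion-free MCTF stand-in)."""
    a, b = frames[0::2], frames[1::2]
    return (a + b) / 2.0, (a - b) / 2.0

def t_plus_2d(frames):
    # t+2D: MCTF first, then a spatial transform on every temporal subband
    t_lo, t_hi = temporal_haar(frames)
    return [spatial_haar(f) for f in t_lo] + [spatial_haar(f) for f in t_hi]

def two_d_plus_t(frames):
    # 2D+t: spatial transform first, then MCTF on each spatial subband
    s_lo, s_hi = spatial_haar(frames)   # applied frame-wise via broadcasting
    return temporal_haar(s_lo), temporal_haar(s_hi)
```

    In the 2D+t case the temporal filtering runs on subband samples, which is where the shift-variance problem noted above enters; in the t+2D case motion estimation is done only at full resolution, which is where the scaled-motion-field problem enters.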

    Wavelet Codec Reference Document and Software Manual v 2.0

    Current 3D wavelet video coding schemes with Motion Compensated Temporal Filtering (MCTF) can be divided into two main categories. The first performs MCTF on the input video sequence directly in the full-resolution spatial domain, before any spatial transform, and is often referred to as spatial-domain MCTF. The second performs MCTF in the wavelet subband domain generated by a spatial transform and is often referred to as in-band MCTF. Figure 1(a) shows a general framework which can support both schemes. First, a pre-spatial decomposition can be applied to the input video sequence. Then a multi-level MCTF decomposes the video frames into several temporal subbands, such as temporal highpass subbands and temporal lowpass subbands. After temporal decomposition, a post-spatial decomposition is applied to each temporal subband to further decompose the frames spatially. In this framework, the complete spatial decomposition for each temporal subband is thus separated into two parts: pre-spatial decomposition operations and post-spatial decomposition operations. The pre-spatial decomposition can be void for some schemes and non-empty for others. Figure 1(b) shows the case of the T+2D scheme, where the pre-spatial decomposition is empty. Figure 1(c) shows the case of the 2D+T+2D scheme, where the pre-spatial decomposition is usually a multi-level dyadic wavelet transform. Depending on the result of the pre-spatial decomposition, the temporal decomposition performs different MCTF operations, either in the spatial domain or in the subband domain. (a) The general coding framework; (b) case for the T+2D scheme (pre-spatial decomposition is void); (c) case for the 2D+T+2D scheme (pre-spatial decomposition exists). Figure 1: Framework for 3D wavelet video coding. A detailed analysis of the differences between the schemes is reported here. A simple T+2D scheme acts on the video sequence by applying a temporal decomposition followed by a spatial transform.
    The main problem with this scheme is that the inverse temporal transform is performed on the lower spatial resolution temporal subbands using the same (scaled) motion field obtained from the analysis of the higher resolution sequence. Because of the non-ideal decimation performed by the low-pass wavelet decomposition, a simply scaled motion field is in general not optimal for the low resolution level. This causes a loss in performance, and even though some means are being designed to obtain a better motion field, that field is highly dependent on the working rate of the decoding process and is thus difficult to estimate in advance at the encoding stage. Furthermore, as the allowed bit-rate for the lower resolution format is generally very restrictive, it is not possible to add corrective measures at this level to compensate for the problems caused by the inverse temporal transform. To solve the problem of motion fields at different spatial levels, a natural approach has been to consider a 2D+T scheme, where the spatial transform is applied before the temporal one. Unfortunately, this approach suffers from the shift-variant nature of the wavelet decomposition, which leads to inefficiency in motion-compensated temporal transforms on the spatial subbands. This problem has found a partial solution in schemes where motion estimation and compensation take place in an overcomplete (shift-invariant) wavelet domain. From the above discussion it becomes clear that the spatial and temporal wavelet filtering cannot be decoupled, because of the motion compensation. As a consequence, it is not possible to encode different spatial resolution levels at once with only one MCTF, and thus both the lower and higher resolution sequences must be MCTF filtered. In this perspective, one possibility for obtaining good performance in terms of bit-rate and scalability is to use an Inter-Scale Prediction scheme.
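    The scaled-motion-field issue can be made concrete with a small sketch. A t+2D decoder must derive the half-resolution motion field from the full-resolution one; the block layout and averaging policy below are illustrative assumptions, chosen only to show where the approximation enters.

```python
def scale_motion_field(mvs, factor=2):
    """Derive a lower-resolution motion field by scaling full-resolution
    vectors, as a T+2D decoder must do at the low spatial level.

    mvs: dict mapping (block_y, block_x) -> (dy, dx) in full-res pixels.
    Dividing a vector by `factor` generally lands between grid positions,
    so the result must be rounded; this rounding, on top of the non-ideal
    low-pass decimation, is why the scaled field is suboptimal.
    """
    scaled = {}
    for (by, bx), (dy, dx) in mvs.items():
        # several full-res blocks collapse onto one low-res block
        key = (by // factor, bx // factor)
        scaled.setdefault(key, []).append((dy / factor, dx / factor))
    # average the candidate vectors, then snap to the integer-pel grid
    return {k: (round(sum(d[0] for d in v) / len(v)),
                round(sum(d[1] for d in v) / len(v)))
            for k, v in scaled.items()}
```

    Nothing in this derivation looks at the low-resolution pixels themselves, which is exactly why the resulting field can diverge from the true motion of the decimated sequence.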
    What has been proposed in the literature is to use prediction between the low resolution and the higher one before applying the spatio-temporal transform. The low resolution sequence is interpolated and used as a prediction for the high resolution sequence; the residual is then filtered both temporally and spatially. This architecture is clearly based on the first hierarchical representation technique introduced for images, namely the Laplacian pyramid. So, even if the scheme seems well motivated from an intuitive point of view, it has the typical disadvantage of overcomplete transforms, namely that of producing a full-size residual image. In this way the information to be encoded as refinement is spread over a high number of coefficients, and coding efficiency is hard to achieve. A 2D+T+2D scheme that combines a layered representation with inter-band prediction in the MCTF domain now appears as a valid alternative approach. The proposed method, called “STool”, efficiently combines the idea of prediction between different resolution levels within the framework of spatial and temporal wavelet transforms. Compared with the previous schemes it has several advantages. First of all, the different spatial resolution levels have all undergone an MCTF, which avoids the problems of T+2D schemes. Furthermore, the MCTFs are applied before the spatial DWT, which solves the problem of 2D+T schemes. Moreover, the prediction is confined to the same number of transformed coefficients as exist in the lower resolution format. There is thus a clear distinction between the coefficients associated with differences between the low-pass bands of the high resolution format and those of the low resolution one, and the coefficients associated with higher resolution details. This constitutes an advantage over prediction schemes based on interpolation in the original sequence domain.
    Another important advantage is that it is possible to decide which, and how many, temporal subbands to use in the prediction. One can, for example, discard the temporal high-pass subbands when a good prediction cannot be achieved for such “quick” details. Alternatively, this allows, for example, a QCIF sequence at 15 fps to be used efficiently as a basis for prediction of a 30 fps CIF sequence.
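    The key point of the coefficient-domain prediction, that the residual is confined to the size of the lower resolution format, can be sketched numerically. This is a simplified stand-in for STool, not its actual implementation: it uses a one-level Haar DWT, ignores motion and entropy coding, and all names are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar analysis: returns (LL, (LH, HL, HH))."""
    a, b = x[0::2, :], x[1::2, :]
    lo, hi = (a + b) / 2.0, (a - b) / 2.0
    def horiz(y):
        c, d = y[:, 0::2], y[:, 1::2]
        return (c + d) / 2.0, (c - d) / 2.0
    ll, lh = horiz(lo)
    hl, hh = horiz(hi)
    return ll, (lh, hl, hh)

def stool_style_residual(high_res_subband, low_res_subband):
    """Predict only the LL band of the high-resolution temporal subband
    from the co-located low-resolution subband; the detail bands pass
    through untouched. The residual therefore has exactly as many samples
    as the low-resolution format, unlike interpolation-based prediction,
    whose residual is full size.
    """
    ll, details = haar_dwt2(high_res_subband)
    residual_ll = ll - low_res_subband   # prediction confined to LL size
    return residual_ll, details
```

    When the low-resolution subband matches the high-resolution LL band well, the residual is near zero and cheap to code, while the detail bands carry only genuinely new high-resolution information.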

    Status Report on Wavelet Video Coding Exploration


    Report on Testing in Wavelet Video Coding Exploration

    In this document, PSNR curves produced using the Exploration Wavelet Video Coding system are provided according to the reference test conditions decided at the last meeting and described in [1], concerning the evaluation of the coding performance of the Exploration Wavelet Video Coding platform with respect to the JSVM coding scheme (to be provided by the JVT).

    Draft Status Report on Wavelet Video Exploration
