IBVC: Interpolation-driven B-frame Video Compression
Learned B-frame video compression aims to adopt bi-directional motion
estimation and motion compensation (MEMC) coding for middle frame
reconstruction. However, previous learned approaches often directly extend
neural P-frame codecs to B-frame relying on bi-directional optical-flow
estimation or video frame interpolation. They suffer from inaccurate quantized
motions and inefficient motion compensation. To address these issues, we
propose a simple yet effective structure called Interpolation-driven B-frame
Video Compression (IBVC). Our approach only involves two major operations:
video frame interpolation and artifact reduction compression. IBVC introduces a
bit-rate free MEMC based on interpolation, which avoids optical-flow
quantization and additional compression distortions. Later, to reduce duplicate
bit-rate consumption and focus on unaligned artifacts, a residual guided
masking encoder is deployed to adaptively select the meaningful contexts with
interpolated multi-scale dependencies. In addition, a conditional
spatio-temporal decoder is proposed to eliminate location errors and artifacts
instead of using MEMC coding in other methods. The experimental results on
B-frame coding demonstrate that IBVC has significant improvements compared to
the relevant state-of-the-art methods. Meanwhile, our approach can save bit
rates compared with the random access (RA) configuration of H.266 (VTM). The
code will be available at https://github.com/ruhig6/IBVC.
Comment: Submitted to IEEE TCSV
Multi-Output Gaussian Processes for Crowdsourced Traffic Data Imputation
Traffic speed data imputation is a fundamental challenge for data-driven
transport analysis. In recent years, with the ubiquity of GPS-enabled devices
and the widespread use of crowdsourcing alternatives for the collection of
traffic data, transportation professionals increasingly look to such
user-generated data for many analysis, planning, and decision support
applications. However, due to the mechanics of the data collection process,
crowdsourced traffic data such as probe-vehicle data is highly prone to missing
observations, making accurate imputation crucial for the success of any
application that makes use of that type of data. In this article, we propose
the use of multi-output Gaussian processes (GPs) to model the complex spatial
and temporal patterns in crowdsourced traffic data. While the Bayesian
nonparametric formalism of GPs allows us to model observation uncertainty, the
multi-output extension based on convolution processes effectively enables us to
capture complex spatial dependencies between nearby road segments. Using 6
months of crowdsourced traffic speed data or "probe vehicle data" for several
locations in Copenhagen, the proposed approach is empirically shown to
significantly outperform popular state-of-the-art imputation methods.
Comment: 10 pages, IEEE Transactions on Intelligent Transportation Systems,
201
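To illustrate how a GP can fill gaps in a speed series while also reporting uncertainty, here is a minimal sketch. It is deliberately simplified: it fits a single-output GP with an RBF kernel to one road segment over time, not the multi-output convolution-process model the abstract describes. The function names, the kernel choice, and all hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_impute(t, y, noise=0.1, length_scale=2.0, variance=1.0):
    """Fill NaN entries of y with the GP posterior mean (hypothetical helper).

    t : time stamps, y : speeds with NaN marking missing observations.
    Returns (filled series, posterior mean, posterior latent variance).
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    t_obs, y_obs = t[observed], y[observed]

    # Use the observed average as a constant prior mean.
    mean = y_obs.mean()

    # Covariance of the observations, with observation noise on the diagonal.
    K = rbf_kernel(t_obs, t_obs, length_scale, variance)
    K += noise ** 2 * np.eye(t_obs.size)

    # Cross-covariance between every time stamp and the observed ones.
    K_star = rbf_kernel(t, t_obs, length_scale, variance)

    # Standard GP regression equations for the posterior mean and variance.
    alpha = np.linalg.solve(K, y_obs - mean)
    mu = mean + K_star @ alpha
    v = np.linalg.solve(K, K_star.T)
    var = variance - np.sum(K_star * v.T, axis=1)

    # Keep observed values as they are; impute only the missing ones.
    filled = np.where(observed, y, mu)
    return filled, mu, np.maximum(var, 0.0)

# Synthetic example data (not from the paper).
t = np.arange(12.0)
speeds = 50.0 + 5.0 * np.sin(t / 3.0)
speeds[[4, 5]] = np.nan
filled, mu, var = gp_impute(t, speeds)
```

A multi-output model would additionally share information across neighbouring road segments, which this single-segment sketch cannot do.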
Learning Hard Alignments with Variational Inference
There has recently been significant interest in hard attention models for
tasks such as object recognition, visual captioning and speech recognition.
Hard attention can offer benefits over soft attention such as decreased
computational cost, but training hard attention models can be difficult because
of the discrete latent variables they introduce. Previous work used REINFORCE
and Q-learning to approach these issues, but those methods can provide
high-variance gradient estimates and be slow to train. In this paper, we tackle
the problem of learning hard attention for a sequential task using variational
inference methods, specifically the recently introduced VIMCO and NVIL.
Furthermore, we propose a novel baseline that adapts VIMCO to this setting. We
demonstrate our method on a phoneme recognition task in clean and noisy
environments and show that our method outperforms REINFORCE, with the
difference being greater for a more complicated task.
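As a rough illustration of why VIMCO can give lower-variance gradients than plain REINFORCE, the sketch below computes the per-sample learning signals of the standard VIMCO estimator. Each sample's baseline is built from the other samples, by replacing its own log-weight with the mean of the others' log-weights (the log of their geometric mean). This is the general estimator, not the novel baseline the abstract proposes. The function names and the input `log_f` (one log importance weight per sampled alignment) are assumptions made for illustration.

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log(sum(exp(a))) for a 1-D array."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def vimco_signals(log_f):
    """Return the multi-sample bound and one learning signal per sample.

    log_f[i] is the log importance weight of the i-th sampled latent
    configuration (hypothetical input; K >= 2 samples are required).
    """
    log_f = np.asarray(log_f, dtype=float)
    K = log_f.shape[0]

    # Multi-sample objective: log of the average importance weight.
    bound = logsumexp(log_f) - np.log(K)

    signals = np.empty(K)
    for j in range(K):
        # Replace sample j's log-weight by the mean of the others,
        # i.e. the log of the geometric mean of the other weights.
        replaced = log_f.copy()
        replaced[j] = np.delete(log_f, j).mean()
        baseline = logsumexp(replaced) - np.log(K)
        # Signal: how much sample j changes the bound relative to
        # what the remaining samples would predict.
        signals[j] = bound - baseline
    return bound, signals

bound, signals = vimco_signals([0.0, 0.0, 2.0])
```

In this example the third sample has the largest weight and receives a positive signal, while the other two receive negative signals. When all samples have equal weights, every signal is zero.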
Mapping Chestnut Stands Using Bi-Temporal VHR Data
This study analyzes the potential of very high resolution (VHR) remote sensing images and extended morphological profiles for mapping chestnut stands on Tenerife Island (Canary Islands, Spain). Because chestnut stands are relevant to ecosystem services in the region (cultural and provisioning services), the public sector demands up-to-date information on chestnut, and this study presents a simple, straightforward approach. We used two VHR WorldView images (March and May 2015) to cover different phenological phases. Moreover, we included spatial information in the classification process by means of extended morphological profiles (EMPs). Random forest was used for the classification, and we analyzed the impact of the bi-temporal information as well as of the spatial information on the classification accuracies. The detailed accuracy assessment clearly reveals the benefit of bi-temporal VHR WorldView images and of spatial information derived by EMPs in terms of mapping accuracy. The bi-temporal classification outperforms, or at least performs as well as, the classification based on mono-temporal data. The inclusion of spatial information by EMPs further increases the classification accuracy by 5% and reduces the quantity and allocation disagreements on the final map. Overall, the newly proposed classification strategy proves useful for mapping chestnut stands in a heterogeneous and complex landscape, such as the municipality of La Orotava, Tenerife.
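The abstract reports quantity and allocation disagreements without defining them. The sketch below assumes the terms follow the commonly used definitions by Pontius and Millones, which split the total disagreement of a confusion matrix into a part due to mismatched class proportions (quantity) and a part due to misplaced pixels (allocation). The function name and the example matrix are hypothetical and not taken from the study.

```python
import numpy as np

def quantity_allocation_disagreement(confusion):
    """Split total disagreement into quantity and allocation components.

    confusion[i, j] counts samples mapped as class i whose reference
    class is j (assumed layout; the results do not depend on it).
    """
    p = np.asarray(confusion, dtype=float)
    p = p / p.sum()              # convert counts to proportions
    mapped = p.sum(axis=1)       # proportion mapped as each class
    reference = p.sum(axis=0)    # proportion of each class in the reference
    agree = np.diag(p)           # proportion correctly mapped per class

    # Quantity: half the summed absolute difference in class proportions.
    quantity = 0.5 * np.abs(mapped - reference).sum()
    # Allocation: per class, the smaller of commission and omission error.
    allocation = np.minimum(mapped - agree, reference - agree).sum()
    return quantity, allocation

# Made-up two-class example (e.g. chestnut vs. other), 100 samples.
q, a = quantity_allocation_disagreement([[40, 10], [5, 45]])
```

For this made-up matrix the quantity disagreement is 0.05 and the allocation disagreement is 0.10. Together they equal the total disagreement of 0.15, which is one minus the overall accuracy of 0.85.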
Fully-Coupled Two-Stream Spatiotemporal Networks for Extremely Low Resolution Action Recognition
A major emerging challenge is how to protect people's privacy as cameras and
computer vision are increasingly integrated into our daily lives, including in
smart devices inside homes. A potential solution is to capture and record just
the minimum amount of information needed to perform a task of interest. In this
paper, we propose a fully-coupled two-stream spatiotemporal architecture for
reliable human action recognition on extremely low resolution (e.g., 12x16
pixel) videos. We provide an efficient method to extract spatial and temporal
features and to aggregate them into a robust feature representation for an
entire action video sequence. We also consider how to incorporate high
resolution videos during training in order to build better low resolution
action recognition models. We evaluate on two publicly-available datasets,
showing significant improvements over the state-of-the-art.
Comment: 9 pages, 5 figures, published in WACV 201