7,775 research outputs found
Video Stabilisation Based on Spatial Transformer Networks
User-Generated Content is normally recorded with mobile phones by non-professionals, which leads to a low viewing experience due to artifacts such as jitter and blur. Other jittery videos are those recorded with mounted cameras or moving platforms. In these scenarios, Digital Video Stabilization (DVS) has been utilized, to create high quality, professional level videos. In the industry and academia, there are a number of traditional and Deep Learning (DL)-based DVS systems, however both approaches have limitations: the former struggles to extract and track features in a number of scenarios, and the latter struggles with camera path smoothing, a hard problem to define in this context. On the other hand, traditional methods have shown good performance in smoothing camera path whereas DL methods are effective in feature extraction, tracking, and motion parameter estimation. Hence, to the best of our knowledge the available DVS systems struggle to stabilize videos in a wide variety of scenarios, especially with high motion and certain scene content, such as textureless areas, dark scenes, close object, lack of depth, amongst others. Another challenge faced by current DVS implementations is the resulting artifacts that such systems add to the stabilized videos, degrading the viewing experience. These artifacts are mainly distortion, blur, zoom, and ghosting effects. In this thesis, we utilize the strengths of Deep Learning and traditional methods for video stabilization. Our approach is robust to a wide variety of scene content and camera motion, and avoids adding artifacts to the stabilized video. First, we provide a dataset and evaluation framework for Deep Learning-based DVS. Then, we present our image alignment module, which contains a Spatial Transformer Network (STN). Next, we leverage this module to propose a homography-based video stabilization system. Aiming at avoiding blur and distortion caused by homographies, our next proposal is a translation-based video stabilization method, which contains Exponential Weighted Moving Averages (EWMAs) to smooth the camera path. Finally, instead of using EWMAs, we study the utilization of filters in our approach. In this case, we compare a number of filters and choose the filters with best performance. Since the quality of experience of a viewer does not only consist of video stability, but also of blur and distortion, we consider it is a good trade off to allow some jitter left on the video while avoiding adding distortion and blur. In all three cases, we show that this approach pays off, since our systems ourperform the state-of-the-art proposals
Learning how to be robust: Deep polynomial regression
Polynomial regression is a recurrent problem with a large number of
applications. In computer vision it often appears in motion analysis. Whatever
the application, standard methods for regression of polynomial models tend to
deliver biased results when the input data is heavily contaminated by outliers.
Moreover, the problem is even harder when outliers have strong structure.
Departing from problem-tailored heuristics for robust estimation of parametric
models, we explore deep convolutional neural networks. Our work aims to find a
generic approach for training deep regression models without the explicit need
of supervised annotation. We bypass the need for a tailored loss function on
the regression parameters by attaching to our model a differentiable hard-wired
decoder corresponding to the polynomial operation at hand. We demonstrate the
value of our findings by comparing with standard robust regression methods.
Furthermore, we demonstrate how to use such models for a real computer vision
problem, i.e., video stabilization. The qualitative and quantitative
experiments show that neural networks are able to learn robustness for general
polynomial regression, with results that well overpass scores of traditional
robust estimation methods.Comment: 18 pages, conferenc
Deep Burst Denoising
Noise is an inherent issue of low-light image capture, one which is
exacerbated on mobile devices due to their narrow apertures and small sensors.
One strategy for mitigating noise in a low-light situation is to increase the
shutter time of the camera, thus allowing each photosite to integrate more
light and decrease noise variance. However, there are two downsides of long
exposures: (a) bright regions can exceed the sensor range, and (b) camera and
scene motion will result in blurred images. Another way of gathering more light
is to capture multiple short (thus noisy) frames in a "burst" and intelligently
integrate the content, thus avoiding the above downsides. In this paper, we use
the burst-capture strategy and implement the intelligent integration via a
recurrent fully convolutional deep neural net (CNN). We build our novel,
multiframe architecture to be a simple addition to any single frame denoising
model, and design to handle an arbitrary number of noisy input frames. We show
that it achieves state of the art denoising results on our burst dataset,
improving on the best published multi-frame techniques, such as VBM4D and
FlexISP. Finally, we explore other applications of image enhancement by
integrating content from multiple frames and demonstrate that our DNN
architecture generalizes well to image super-resolution
Automated Top View Registration of Broadcast Football Videos
In this paper, we propose a novel method to register football broadcast video
frames on the static top view model of the playing surface. The proposed method
is fully automatic in contrast to the current state of the art which requires
manual initialization of point correspondences between the image and the static
model. Automatic registration using existing approaches has been difficult due
to the lack of sufficient point correspondences. We investigate an alternate
approach exploiting the edge information from the line markings on the field.
We formulate the registration problem as a nearest neighbour search over a
synthetically generated dictionary of edge map and homography pairs. The
synthetic dictionary generation allows us to exhaustively cover a wide variety
of camera angles and positions and reduce this problem to a minimal per-frame
edge map matching procedure. We show that the per-frame results can be improved
in videos using an optimization framework for temporal camera stabilization. We
demonstrate the efficacy of our approach by presenting extensive results on a
dataset collected from matches of football World Cup 2014
- …