A Computational Framework for Vertical Video Editing
Vertical video editing is the process of digitally editing the image within the frame, as opposed to horizontal video editing, which arranges shots along a timeline. Vertical editing can be a time-consuming and error-prone process when done with manual key-framing and simple interpolation. In this paper, we present a general framework for automatically computing a variety of cinematically plausible shots from a single input video, suited to the special case of live performances. Drawing on working practices in traditional cinematography, the system acts as a virtual camera assistant to the film editor, who can call novel shots in the edit room with a combination of high-level instructions and manually selected keyframes.
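To make the key-framed virtual-camera idea concrete, here is a minimal sketch of the simple-interpolation baseline the framework is designed to improve on: a crop window moved through the frame by linear interpolation between manually selected keyframes. The function names, data layout, and interpolation scheme are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def interpolate_crops(keyframes, num_frames):
    """Linearly interpolate crop rectangles (x, y, w, h) between keyframes.

    `keyframes` maps frame index -> crop rectangle. This is the manual
    key-framing + simple interpolation baseline that the paper's framework
    aims to replace with automatically computed, cinematically plausible shots.
    """
    idxs = sorted(keyframes)
    rects = np.array([keyframes[i] for i in idxs], dtype=float)
    frames = np.arange(num_frames)
    # Interpolate each crop parameter (x, y, w, h) independently over time.
    crops = np.stack(
        [np.interp(frames, idxs, rects[:, k]) for k in range(4)], axis=1
    )
    return crops  # (num_frames, 4) array of per-frame crop windows

# Example: a slow reframing from a wide shot to a tighter shot over 120 frames.
crops = interpolate_crops({0: (0, 0, 1920, 1080), 120: (400, 200, 960, 540)}, 121)
```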
Automated Top View Registration of Broadcast Football Videos
In this paper, we propose a novel method to register football broadcast video
frames on the static top view model of the playing surface. The proposed method
is fully automatic in contrast to the current state of the art which requires
manual initialization of point correspondences between the image and the static
model. Automatic registration using existing approaches has been difficult due
to the lack of sufficient point correspondences. We investigate an alternate
approach exploiting the edge information from the line markings on the field.
We formulate the registration problem as a nearest neighbour search over a
synthetically generated dictionary of edge map and homography pairs. The
synthetic dictionary generation allows us to exhaustively cover a wide variety
of camera angles and positions and reduce this problem to a minimal per-frame
edge map matching procedure. We show that the per-frame results can be improved
in videos using an optimization framework for temporal camera stabilization. We
demonstrate the efficacy of our approach by presenting extensive results on a
dataset collected from matches of the football World Cup 2014.
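As a rough illustration of the dictionary-lookup formulation, the sketch below retrieves, for a query edge map, the nearest synthetic edge map and returns its paired homography. The array shapes, distance metric, and names are assumptions made for illustration; the paper's dictionary construction, matching procedure, and temporal stabilization are more elaborate.

```python
import numpy as np

def register_frame(frame_edges, edge_dict, homographies):
    """Nearest-neighbour lookup of a frame's edge map in a synthetic dictionary.

    `edge_dict` is an (N, H*W) array of binarized edge maps rendered from the
    top-view field model under N sampled camera poses, and `homographies` holds
    the corresponding (N, 3, 3) image-to-model homographies. The best-matching
    entry's homography registers the broadcast frame onto the top view.
    """
    query = frame_edges.reshape(1, -1).astype(float)
    # Plain L2 distance between binary edge maps (illustrative choice only).
    dists = np.linalg.norm(edge_dict.astype(float) - query, axis=1)
    best = int(np.argmin(dists))
    return homographies[best], dists[best]
```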
Test-Time Amendment with a Coarse Classifier for Fine-Grained Classification
We investigate the problem of reducing mistake severity for fine-grained
classification. Fine-grained classification can be challenging, mainly due to
the requirement of domain expertise for accurate annotation. However, humans
are particularly adept at performing coarse classification as it requires
relatively low levels of expertise. To this end, we present a novel approach
for Post-Hoc Correction called Hierarchical Ensembles (HiE) that utilizes label
hierarchy to improve the performance of fine-grained classification at
test-time using the coarse-grained predictions. By only requiring the parents
of leaf nodes, our method significantly reduces average mistake severity while
improving top-1 accuracy on the iNaturalist-19 and tieredImageNet-H datasets,
achieving a new state-of-the-art on both benchmarks. We also investigate the
efficacy of our approach in the semi-supervised setting. Our approach brings
notable gains in top-1 accuracy while significantly decreasing the severity of
mistakes as training data decreases for the fine-grained classes. The
simplicity and post-hoc nature of HiE renders it practical to be used with any
off-the-shelf trained model to improve its predictions further.
Comment: 8 pages, 2 figures, 3 tables, Accepted at NeurIPS 202
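One plausible way such a post-hoc amendment can be instantiated is to reweight each fine-grained class probability by the coarse classifier's probability for its parent and renormalize, as in the hypothetical sketch below; the exact HiE combination rule may differ from this simple product.

```python
import numpy as np

def amend_fine_predictions(p_fine, p_coarse, parent_of):
    """Illustrative test-time amendment of fine-grained probabilities.

    Each fine class's probability is reweighted by the coarse classifier's
    probability for its parent and the result is renormalized. `parent_of[i]`
    gives the coarse-class index of fine class i. This is one plausible
    combination rule, not necessarily the one used by HiE.
    """
    weights = p_coarse[parent_of]      # parent probability for each fine class
    amended = p_fine * weights
    return amended / amended.sum()

# Toy example: 4 fine classes grouped under 2 coarse classes.
p_fine = np.array([0.40, 0.10, 0.35, 0.15])
p_coarse = np.array([0.2, 0.8])
parent_of = np.array([0, 0, 1, 1])
print(amend_fine_predictions(p_fine, p_coarse, parent_of))
```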
RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations
Significant progress has been made in speaker dependent Lip-to-Speech
synthesis, which aims to generate speech from silent videos of talking faces.
Current state-of-the-art approaches primarily employ non-autoregressive
sequence-to-sequence architectures to directly predict mel-spectrograms or
audio waveforms from lip representations. We hypothesize that the direct
mel-prediction hampers training/model efficiency due to the entanglement of
speech content with ambient information and speaker characteristics. To this
end, we propose RobustL2S, a modularized framework for Lip-to-Speech synthesis.
First, a non-autoregressive sequence-to-sequence model maps self-supervised
visual features to a representation of disentangled speech content. A vocoder
then converts the speech features into raw waveforms. Extensive evaluations
confirm the effectiveness of our setup, achieving state-of-the-art performance
on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT
datasets. Speech samples from RobustL2S can be found at
https://neha-sherin.github.io/RobustL2S
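The modular two-stage design can be sketched roughly as follows: a non-autoregressive encoder maps self-supervised visual features to a speech-content representation, and a vocoder maps that representation to a waveform. Every module in this sketch is a simple stand-in chosen for illustration; the paper's actual feature extractors, architectures, and vocoder differ.

```python
import torch
import torch.nn as nn

class Lip2SpeechPipeline(nn.Module):
    """Schematic two-stage pipeline in the spirit of RobustL2S.

    Stage 1: a non-autoregressive encoder maps self-supervised visual features
    to a disentangled speech-content representation. Stage 2: a vocoder maps
    that representation to a raw waveform. Both stages are toy stand-ins.
    """

    def __init__(self, visual_dim=512, content_dim=256, hop=320):
        super().__init__()
        self.content_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=visual_dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.to_content = nn.Linear(visual_dim, content_dim)
        # Toy "vocoder": expand each content frame into `hop` waveform samples.
        self.vocoder = nn.Sequential(nn.Linear(content_dim, hop), nn.Tanh())

    def forward(self, visual_feats):            # (batch, frames, visual_dim)
        content = self.to_content(self.content_encoder(visual_feats))
        wav = self.vocoder(content)             # (batch, frames, hop)
        return wav.flatten(1)                   # (batch, frames * hop) waveform

# Dummy forward pass on 75 video frames of 512-d visual features.
model = Lip2SpeechPipeline()
audio = model(torch.randn(1, 75, 512))
```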