You Only Train Once: Multi-Identity Free-Viewpoint Neural Human Rendering from Monocular Videos
We introduce You Only Train Once (YOTO), a dynamic human generation
framework that performs free-viewpoint rendering of different human
identities with distinct motions via only one-time training on monocular
videos. Most prior works on this task require individualized optimization for
each input video containing a distinct human identity, demanding
significant time and resources for deployment and thereby impeding
the scalability and overall application potential of such systems. In this
paper, we tackle this problem by proposing a set of learnable identity codes to
expand the capability of the framework for multi-identity free-viewpoint
rendering, and an effective pose-conditioned code query mechanism to finely
model the pose-dependent non-rigid motions. YOTO optimizes neural radiance
fields (NeRF) by utilizing designed identity codes to condition the model for
learning various canonical T-pose appearances in a single shared volumetric
representation. Moreover, our joint learning of multiple identities within a
unified model incidentally enables flexible motion transfer in high-quality
photo-realistic renderings for all learned appearances. This capability expands
its potential use in important applications, including Virtual Reality. We
present extensive experimental results on ZJU-MoCap and PeopleSnapshot to
clearly demonstrate the effectiveness of our proposed model. YOTO shows
state-of-the-art performance on all evaluation metrics while showing
significant benefits in training and inference efficiency as well as rendering
quality. The code and model will be made publicly available soon.
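The core conditioning idea can be sketched minimally: a single shared field receives a learnable per-identity code alongside the 3D query position, so one set of weights stores all canonical appearances. The layer sizes, the concatenation scheme, and all names below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

N_IDENTITIES, CODE_DIM, POS_DIM, HIDDEN = 4, 16, 3, 32

# Hypothetical learnable identity codes, one row per subject.
identity_codes = rng.normal(size=(N_IDENTITIES, CODE_DIM))

# A toy two-layer MLP standing in for the shared volumetric representation.
W1 = rng.normal(size=(POS_DIM + CODE_DIM, HIDDEN)) * 0.1
W2 = rng.normal(size=(HIDDEN, 4)) * 0.1  # 3 RGB channels + 1 density

def query_radiance(xyz, identity_id):
    """Condition the shared field on an identity by concatenating its code."""
    code = identity_codes[identity_id]
    h = np.maximum(np.concatenate([xyz, code]) @ W1, 0.0)  # ReLU
    out = h @ W2
    rgb, density = out[:3], np.maximum(out[3], 0.0)  # density kept non-negative
    return rgb, density

rgb, density = query_radiance(np.array([0.1, -0.2, 0.5]), identity_id=2)
print(rgb.shape, density >= 0.0)  # (3,) True
```

Swapping `identity_id` reuses the same weights for a different learned appearance, which is what makes joint multi-identity training (and motion transfer between identities) possible in one model.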
Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning
Recent self-supervised video representation learning methods focus on
maximizing the similarity between multiple augmented views from the same video
and largely rely on the quality of generated views. In this paper, we propose
frequency augmentation (FreqAug), a spatio-temporal data augmentation method in
the frequency domain for video representation learning. FreqAug stochastically
removes undesirable information from the video by filtering out specific
frequency components so that learned representation captures essential features
of the video for various downstream tasks. Specifically, FreqAug pushes the
model to focus more on dynamic features rather than static features in the
video via dropping spatial or temporal low-frequency components. In other
words, learning invariance between remaining frequency components results in
high-frequency enhanced representation with less static bias. To verify the
generality of the proposed method, we experiment with FreqAug on multiple
self-supervised learning frameworks along with standard augmentations.
Transferring the improved representation to five video action recognition and
two temporal action localization downstream tasks shows consistent improvements
over baselines.
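The dropping of low-frequency components can be sketched for a single-channel clip: transform each frame with a 2D FFT, zero a low-frequency square around the DC term, and invert. The cutoff fraction and probability below are illustrative assumptions; the paper's augmentation also covers the temporal axis.

```python
import numpy as np

def freq_aug(video, cutoff=0.25, p=0.5, rng=None):
    """Drop spatial low-frequency components of a (T, H, W) clip with prob p.

    A minimal sketch of frequency-domain augmentation: zero a centered
    low-frequency square in each frame's shifted 2D spectrum so that the
    surviving high-frequency content dominates the learned invariance.
    """
    rng = rng or np.random.default_rng()
    if rng.random() > p:
        return video
    T, H, W = video.shape
    spec = np.fft.fftshift(np.fft.fft2(video, axes=(-2, -1)), axes=(-2, -1))
    ch, cw = int(H * cutoff) // 2, int(W * cutoff) // 2
    # Zero the low-frequency square (includes the DC term at the center).
    spec[:, H // 2 - ch:H // 2 + ch + 1, W // 2 - cw:W // 2 + cw + 1] = 0
    out = np.fft.ifft2(np.fft.ifftshift(spec, axes=(-2, -1)), axes=(-2, -1))
    return out.real

clip = np.random.default_rng(0).normal(size=(8, 32, 32))
aug = freq_aug(clip, cutoff=0.25, p=1.0)
print(aug.shape)  # (8, 32, 32)
```

Because the DC term is removed, each augmented frame has (near-)zero mean, i.e. the static low-frequency bias is suppressed while edges and motion-related detail survive.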
Masked Autoencoder for Unsupervised Video Summarization
Summarizing a video requires a diverse understanding of the video, ranging
from recognizing scenes to evaluating how essential each frame is to the
summary. Self-supervised learning (SSL) is acknowledged for
its robustness and flexibility across multiple downstream tasks, but video SSL
has not yet shown its value for dense understanding tasks like video summarization.
We claim an unsupervised autoencoder with sufficient self-supervised learning
does not need any extra downstream architecture design or fine-tuning weights
to be utilized as a video summarization model. The proposed method to evaluate
the importance score of each frame takes advantage of the reconstruction score
of the autoencoder's decoder. We evaluate the method in major unsupervised
video summarization benchmarks to show its effectiveness under various
experimental settings.
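The scoring idea can be sketched as follows: run each frame through the trained autoencoder and use reconstruction error as the importance score, then keep the top-scoring frames in temporal order. The `reconstruct` callable is a hypothetical stand-in for the trained masked autoencoder's encode-decode pass; the blur lambda in the demo is purely illustrative.

```python
import numpy as np

def importance_scores(frames, reconstruct):
    """Score each frame by its reconstruction error under `reconstruct`
    (a stand-in for a trained autoencoder); frames that are hard to
    reconstruct are treated as more informative for the summary."""
    recon = reconstruct(frames)
    errors = np.mean((frames - recon) ** 2, axis=tuple(range(1, frames.ndim)))
    # Normalize to [0, 1] so scores are comparable across videos.
    lo, hi = errors.min(), errors.max()
    return (errors - lo) / (hi - lo + 1e-8)

def select_summary(scores, k):
    """Pick the k highest-scoring frame indices, returned in temporal order."""
    return np.sort(np.argsort(scores)[-k:])

# Toy demo: an attenuating "decoder" makes high-energy frames score high.
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 16, 16))
scores = importance_scores(frames, lambda x: x * 0.5)
print(select_summary(scores, k=3))
```

No downstream head or fine-tuning appears anywhere above, which mirrors the abstract's claim: the summarizer is read directly off the pretrained autoencoder.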
Detection Recovery in Online Multi-Object Tracking with Sparse Graph Tracker
In existing joint detection and tracking methods, pairwise relational
features are used to match previous tracklets to current detections. However,
the features may not be discriminative enough for a tracker to identify a
target from a large number of detections. Selecting only high-scored detections
for tracking may cause low-confidence detections to be missed. Consequently,
in the online setting, this results in tracklet disconnections that cannot be
recovered. To address this, we present Sparse Graph
Tracker (SGT), a novel online graph tracker using higher-order relational
features which are more discriminative by aggregating the features of
neighboring detections and their relations. SGT converts video data into a
graph where detections, their connections, and the relational features of two
connected nodes are represented by nodes, edges, and edge features,
respectively. The strong edge features allow SGT to track targets with tracking
candidates selected by top-K scored detections with large K. As a result, even
low-scored detections can be tracked, and the missed detections are also
recovered. Robustness to the choice of K is shown through extensive
experiments. On the MOT16/17/20 and HiEve Challenge benchmarks, SGT outperforms
state-of-the-art trackers at real-time inference speed. In particular, a large
improvement in MOTA is shown on MOT20 and the HiEve Challenge. Code is
available at https://github.com/HYUNJS/SGT. Accepted to WACV 2023.
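The graph construction can be sketched minimally: keep the top-K scored detections (large K retains low-confidence boxes), then connect tracklet nodes to detection nodes with an edge feature per pair. The center-distance edge feature below is a simple stand-in for SGT's learned relational features, and all names are illustrative.

```python
import numpy as np

def topk_detections(boxes, scores, k):
    """Keep the K highest-scored detections; a large K also keeps
    low-scored boxes, so occluded targets remain trackable."""
    idx = np.argsort(scores)[::-1][:k]
    return boxes[idx], scores[idx]

def build_edges(tracklet_boxes, det_boxes):
    """Connect every tracklet node to every detection node. The edge
    feature here is a plain center distance, standing in for the
    learned relational features described in the abstract."""
    t_centers = tracklet_boxes[:, :2] + tracklet_boxes[:, 2:] / 2
    d_centers = det_boxes[:, :2] + det_boxes[:, 2:] / 2
    edges, feats = [], []
    for i, tc in enumerate(t_centers):
        for j, dc in enumerate(d_centers):
            edges.append((i, j))
            feats.append(np.linalg.norm(tc - dc))
    return edges, np.array(feats)

rng = np.random.default_rng(0)
tracklets = rng.uniform(0, 100, size=(3, 4))  # (x, y, w, h) per tracklet
boxes = rng.uniform(0, 100, size=(10, 4))
scores = rng.uniform(size=10)

det_boxes, det_scores = topk_detections(boxes, scores, k=6)
edges, feats = build_edges(tracklets, det_boxes)
print(len(edges), feats.shape)  # 18 (18,)
```

In the actual method, a graph network aggregates neighboring node and edge features into higher-order relational features before matching; the sketch only shows the sparse top-K graph those features live on.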
Out of Sight, Out of Mind: A Source-View-Wise Feature Aggregation for Multi-View Image-Based Rendering
To estimate the volume density and color of a 3D point in multi-view
image-based rendering, a common approach is to inspect whether a consensus
exists among the given source-image features, which is an informative cue for
the estimation. To this end, most previous methods use equally weighted
aggregation of the features. However, this makes it hard to check for
consensus when the source-image feature set contains outliers, which
frequently arise from occlusions. In this paper, we propose a novel
source-view-wise feature aggregation method, which finds the consensus
robustly by leveraging local structures in
the feature set. We first calculate the source-view-wise distance distribution
for each source feature for the proposed aggregation. After that, the distance
distribution is converted to several similarity distributions with the proposed
learnable similarity mapping functions. Finally, for each element in the
feature set, the aggregation features are extracted by calculating the weighted
means and variances, where the weights are derived from the similarity
distributions. In experiments, we validate the proposed method on various
benchmark datasets, including synthetic and real image scenes. The experimental
results demonstrate that incorporating the proposed features improves the
performance by a large margin, achieving state-of-the-art results.
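The pipeline described above (per-view distance distribution, similarity mapping, weighted mean and variance) can be sketched with a fixed softmax over negative squared distances standing in for the paper's learnable similarity mapping functions; the temperature and feature sizes are illustrative assumptions.

```python
import numpy as np

def aggregate_source_views(feats, temperature=1.0):
    """Source-view-wise aggregation sketch: for each view, weight all
    views by a softmax over negative pairwise squared distances (a fixed
    stand-in for the learnable similarity mappings), then compute the
    weighted mean and variance of the feature set."""
    # Pairwise squared distances between source-view features: (n, n).
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / temperature)
    w /= w.sum(axis=1, keepdims=True)      # row-normalize into weights
    mean = w @ feats                        # weighted means, per view
    var = w @ (feats ** 2) - mean ** 2      # weighted variances, per view
    return mean, var

# Demo: four consistent views plus one occluded outlier view.
feats = np.array([[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [1.0, 0.1], [9.0, 9.0]])
mean, var = aggregate_source_views(feats)
print(mean.shape, var.shape)  # (5, 2) (5, 2)
```

Because distances to the outlier are large, its softmax weight in the inlier rows is negligible, so the inliers' aggregated means stay near the consensus value; this is the robustness the paper attributes to leveraging local structure in the feature set.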
An Efficient Human Instance-Guided Framework for Video Action Recognition
In recent years, human action recognition has been studied by many computer vision researchers. Recent studies have attempted to use two-stream networks based on appearance and motion features, but most of these approaches focus on clip-level video action recognition. In contrast to traditional methods, which generally use entire images, we propose a new human instance-level video action recognition framework. In this framework, we represent instance-level features using human boxes and keypoints, and our action region features serve as the inputs of the temporal action head network, which makes our framework more discriminative. We also propose novel temporal action head networks consisting of various modules that reflect various temporal dynamics well. In experiments, the proposed models achieve performance comparable with state-of-the-art approaches on two challenging datasets. Furthermore, we evaluate the proposed features and networks to verify their effectiveness. Finally, we analyze the confusion matrix and visualize the recognized actions at the human instance level when several people are present.
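The instance-level representation can be sketched as box-normalized keypoints pooled over time; the pooling head below is a toy stand-in (the paper's temporal action head networks are learned modules), and the COCO-style 17-joint layout is an assumption for the demo.

```python
import numpy as np

def instance_features(keypoints, box):
    """Normalize 2D keypoints into their person box: a simplified stand-in
    for the instance-level action region features in the abstract."""
    x, y, w, h = box
    return ((keypoints - np.array([x, y])) / np.array([w, h])).ravel()

def temporal_head(per_frame_feats):
    """Toy temporal head: concatenate mean- and max-pooling over time
    (not the paper's learned modules)."""
    return np.concatenate([per_frame_feats.mean(axis=0),
                           per_frame_feats.max(axis=0)])

rng = np.random.default_rng(0)
T, K = 16, 17  # frames, keypoints (e.g. a COCO-style 17-joint skeleton)
clip_feats = np.stack([
    instance_features(rng.uniform(0, 50, size=(K, 2)),
                      box=(10.0, 10.0, 40.0, 80.0))
    for _ in range(T)
])
pooled = temporal_head(clip_feats)
print(clip_feats.shape, pooled.shape)  # (16, 34) (68,)
```

Running one such feature stack per person is what makes the framework instance-level: each individual in a multi-person scene gets its own pooled descriptor and its own action prediction.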
The Sixth Visual Object Tracking VOT2018 Challenge Results
The Visual Object Tracking challenge VOT2018 is the sixth annual tracker benchmarking activity organized by the VOT initiative. Results of over eighty trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis, as well as a “real-time” experiment simulating a situation where a tracker processes images as if provided by a continuously running sensor. A long-term tracking subchallenge has been added to the set of standard VOT sub-challenges. The new subchallenge focuses on long-term tracking properties, namely coping with target disappearance and reappearance. A new dataset has been compiled, and a performance evaluation methodology that focuses on long-term tracking capabilities has been adopted. The VOT toolkit has been updated to support both the standard short-term and the new long-term tracking subchallenges. Performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit, and the results are publicly available at the challenge website (http://votchallenge.net).