You Only Train Once: Multi-Identity Free-Viewpoint Neural Human Rendering from Monocular Videos
We introduce You Only Train Once (YOTO), a dynamic human generation
framework that performs free-viewpoint rendering of different human
identities with distinct motions via only one-time training on monocular
videos. Most prior works on this task require individualized optimization for
each input video containing a distinct human identity, demanding
significant time and resources for deployment and thereby impeding
the scalability and overall application potential of such systems. In this
paper, we tackle this problem by proposing a set of learnable identity codes to
expand the capability of the framework for multi-identity free-viewpoint
rendering, and an effective pose-conditioned code query mechanism to finely
model the pose-dependent non-rigid motions. YOTO optimizes neural radiance
fields (NeRF) by utilizing designed identity codes to condition the model for
learning various canonical T-pose appearances in a single shared volumetric
representation. Moreover, our joint learning of multiple identities within a
unified model incidentally enables flexible motion transfer in high-quality
photo-realistic renderings for all learned appearances. This capability expands
its potential use in important applications, including Virtual Reality. We
present extensive experimental results on ZJU-MoCap and PeopleSnapshot to
clearly demonstrate the effectiveness of our proposed model. YOTO shows
state-of-the-art performance on all evaluation metrics while showing
significant benefits in training and inference efficiency as well as rendering
quality. The code and model will be made publicly available soon.
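The core conditioning idea can be sketched minimally: a single shared field receives a learnable per-identity code alongside the 3D query position, so one set of weights stores all canonical appearances. The layer sizes, the concatenation scheme, and all names below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

N_IDENTITIES, CODE_DIM, POS_DIM, HIDDEN = 4, 16, 3, 32

# Hypothetical learnable identity codes, one row per subject.
identity_codes = rng.normal(size=(N_IDENTITIES, CODE_DIM))

# A toy two-layer MLP standing in for the shared volumetric representation.
W1 = rng.normal(size=(POS_DIM + CODE_DIM, HIDDEN)) * 0.1
W2 = rng.normal(size=(HIDDEN, 4)) * 0.1  # 3 RGB channels + 1 density

def query_radiance(xyz, identity_id):
    """Condition the shared field on an identity by concatenating its code."""
    code = identity_codes[identity_id]
    h = np.maximum(np.concatenate([xyz, code]) @ W1, 0.0)  # ReLU
    out = h @ W2
    rgb, density = out[:3], np.maximum(out[3], 0.0)  # density kept non-negative
    return rgb, density

rgb, density = query_radiance(np.array([0.1, -0.2, 0.5]), identity_id=2)
print(rgb.shape, density >= 0.0)  # (3,) True
```

Swapping `identity_id` reuses the same weights for a different learned appearance, which is what makes joint multi-identity training (and motion transfer between identities) possible in one model.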
Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning
Recent self-supervised video representation learning methods focus on
maximizing the similarity between multiple augmented views from the same video
and largely rely on the quality of generated views. In this paper, we propose
frequency augmentation (FreqAug), a spatio-temporal data augmentation method in
the frequency domain for video representation learning. FreqAug stochastically
removes undesirable information from the video by filtering out specific
frequency components so that learned representation captures essential features
of the video for various downstream tasks. Specifically, FreqAug pushes the
model to focus more on dynamic features rather than static features in the
video via dropping spatial or temporal low-frequency components. In other
words, learning invariance between remaining frequency components results in
high-frequency enhanced representation with less static bias. To verify the
generality of the proposed method, we experiment with FreqAug on multiple
self-supervised learning frameworks along with standard augmentations.
Transferring the improved representation to five video action recognition and
two temporal action localization downstream tasks shows consistent improvements
over baselines.
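The dropping of low-frequency components can be sketched for a single-channel clip: transform each frame with a 2D FFT, zero a low-frequency square around the DC term, and invert. The cutoff fraction and probability below are illustrative assumptions; the paper's augmentation also covers the temporal axis.

```python
import numpy as np

def freq_aug(video, cutoff=0.25, p=0.5, rng=None):
    """Drop spatial low-frequency components of a (T, H, W) clip with prob p.

    A minimal sketch of frequency-domain augmentation: zero a centered
    low-frequency square in each frame's shifted 2D spectrum so that the
    surviving high-frequency content dominates the learned invariance.
    """
    rng = rng or np.random.default_rng()
    if rng.random() > p:
        return video
    T, H, W = video.shape
    spec = np.fft.fftshift(np.fft.fft2(video, axes=(-2, -1)), axes=(-2, -1))
    ch, cw = int(H * cutoff) // 2, int(W * cutoff) // 2
    # Zero the low-frequency square (includes the DC term at the center).
    spec[:, H // 2 - ch:H // 2 + ch + 1, W // 2 - cw:W // 2 + cw + 1] = 0
    out = np.fft.ifft2(np.fft.ifftshift(spec, axes=(-2, -1)), axes=(-2, -1))
    return out.real

clip = np.random.default_rng(0).normal(size=(8, 32, 32))
aug = freq_aug(clip, cutoff=0.25, p=1.0)
print(aug.shape)  # (8, 32, 32)
```

Because the DC term is removed, each augmented frame has (near-)zero mean, i.e. the static low-frequency bias is suppressed while edges and motion-related detail survive.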
Masked Autoencoder for Unsupervised Video Summarization
Summarizing a video requires a diverse understanding of the video, ranging
from recognizing scenes to evaluating how essential each frame is to the
summary. Self-supervised learning (SSL) is acknowledged for
its robustness and flexibility across multiple downstream tasks, but video SSL
has not yet shown its value for dense understanding tasks like video summarization.
We claim an unsupervised autoencoder with sufficient self-supervised learning
does not need any extra downstream architecture design or fine-tuning weights
to be utilized as a video summarization model. The proposed method to evaluate
the importance score of each frame takes advantage of the reconstruction score
of the autoencoder's decoder. We evaluate the method in major unsupervised
video summarization benchmarks to show its effectiveness under various
experimental settings.
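The scoring idea can be sketched as follows: run each frame through the trained autoencoder and use reconstruction error as the importance score, then keep the top-scoring frames in temporal order. The `reconstruct` callable is a hypothetical stand-in for the trained masked autoencoder's encode-decode pass; the blur lambda in the demo is purely illustrative.

```python
import numpy as np

def importance_scores(frames, reconstruct):
    """Score each frame by its reconstruction error under `reconstruct`
    (a stand-in for a trained autoencoder); frames that are hard to
    reconstruct are treated as more informative for the summary."""
    recon = reconstruct(frames)
    errors = np.mean((frames - recon) ** 2, axis=tuple(range(1, frames.ndim)))
    # Normalize to [0, 1] so scores are comparable across videos.
    lo, hi = errors.min(), errors.max()
    return (errors - lo) / (hi - lo + 1e-8)

def select_summary(scores, k):
    """Pick the k highest-scoring frame indices, returned in temporal order."""
    return np.sort(np.argsort(scores)[-k:])

# Toy demo: an attenuating "decoder" makes high-energy frames score high.
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 16, 16))
scores = importance_scores(frames, lambda x: x * 0.5)
print(select_summary(scores, k=3))
```

No downstream head or fine-tuning appears anywhere above, which mirrors the abstract's claim: the summarizer is read directly off the pretrained autoencoder.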
Detection Recovery in Online Multi-Object Tracking with Sparse Graph Tracker
In existing joint detection and tracking methods, pairwise relational
features are used to match previous tracklets to current detections. However,
the features may not be discriminative enough for a tracker to identify a
target from a large number of detections. Selecting only high-scored detections
for tracking may cause low-confidence detections to be missed. Consequently,
in the online setting, this results in tracklet disconnections that cannot be
recovered. To address this, we present Sparse Graph
Tracker (SGT), a novel online graph tracker using higher-order relational
features which are more discriminative by aggregating the features of
neighboring detections and their relations. SGT converts video data into a
graph where detections, their connections, and the relational features of two
connected nodes are represented by nodes, edges, and edge features,
respectively. The strong edge features allow SGT to track targets with tracking
candidates selected by top-K scored detections with large K. As a result, even
low-scored detections can be tracked, and the missed detections are also
recovered. Robustness to the choice of K is shown through extensive
experiments. On the MOT16/17/20 and HiEve Challenge benchmarks, SGT outperforms
state-of-the-art trackers at real-time inference speed. In particular, a large
improvement in MOTA is shown on MOT20 and the HiEve Challenge. Code is
available at https://github.com/HYUNJS/SGT. Accepted to WACV 2023.
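The graph construction can be sketched minimally: keep the top-K scored detections (large K retains low-confidence boxes), then connect tracklet nodes to detection nodes with an edge feature per pair. The center-distance edge feature below is a simple stand-in for SGT's learned relational features, and all names are illustrative.

```python
import numpy as np

def topk_detections(boxes, scores, k):
    """Keep the K highest-scored detections; a large K also keeps
    low-scored boxes, so occluded targets remain trackable."""
    idx = np.argsort(scores)[::-1][:k]
    return boxes[idx], scores[idx]

def build_edges(tracklet_boxes, det_boxes):
    """Connect every tracklet node to every detection node. The edge
    feature here is a plain center distance, standing in for the
    learned relational features described in the abstract."""
    t_centers = tracklet_boxes[:, :2] + tracklet_boxes[:, 2:] / 2
    d_centers = det_boxes[:, :2] + det_boxes[:, 2:] / 2
    edges, feats = [], []
    for i, tc in enumerate(t_centers):
        for j, dc in enumerate(d_centers):
            edges.append((i, j))
            feats.append(np.linalg.norm(tc - dc))
    return edges, np.array(feats)

rng = np.random.default_rng(0)
tracklets = rng.uniform(0, 100, size=(3, 4))  # (x, y, w, h) per tracklet
boxes = rng.uniform(0, 100, size=(10, 4))
scores = rng.uniform(size=10)

det_boxes, det_scores = topk_detections(boxes, scores, k=6)
edges, feats = build_edges(tracklets, det_boxes)
print(len(edges), feats.shape)  # 18 (18,)
```

In the actual method, a graph network aggregates neighboring node and edge features into higher-order relational features before matching; the sketch only shows the sparse top-K graph those features live on.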
Out of Sight, Out of Mind: A Source-View-Wise Feature Aggregation for Multi-View Image-Based Rendering
To estimate the volume density and color of a 3D point in multi-view
image-based rendering, a common approach is to inspect whether a consensus
exists among the given source-image features, which is an informative cue for
the estimation. To this end, most previous methods use equally weighted
aggregation of the features. However, this makes it hard to check for
consensus when the source-image feature set contains outliers, which
frequently arise from occlusions. In this paper, we propose a novel
source-view-wise feature aggregation method, which finds the consensus
robustly by leveraging local structures in
the feature set. We first calculate the source-view-wise distance distribution
for each source feature for the proposed aggregation. After that, the distance
distribution is converted to several similarity distributions with the proposed
learnable similarity mapping functions. Finally, for each element in the
feature set, the aggregation features are extracted by calculating the weighted
means and variances, where the weights are derived from the similarity
distributions. In experiments, we validate the proposed method on various
benchmark datasets, including synthetic and real image scenes. The experimental
results demonstrate that incorporating the proposed features improves the
performance by a large margin, achieving state-of-the-art results.
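The pipeline described above (per-view distance distribution, similarity mapping, weighted mean and variance) can be sketched with a fixed softmax over negative squared distances standing in for the paper's learnable similarity mapping functions; the temperature and feature sizes are illustrative assumptions.

```python
import numpy as np

def aggregate_source_views(feats, temperature=1.0):
    """Source-view-wise aggregation sketch: for each view, weight all
    views by a softmax over negative pairwise squared distances (a fixed
    stand-in for the learnable similarity mappings), then compute the
    weighted mean and variance of the feature set."""
    # Pairwise squared distances between source-view features: (n, n).
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / temperature)
    w /= w.sum(axis=1, keepdims=True)      # row-normalize into weights
    mean = w @ feats                        # weighted means, per view
    var = w @ (feats ** 2) - mean ** 2      # weighted variances, per view
    return mean, var

# Demo: four consistent views plus one occluded outlier view.
feats = np.array([[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [1.0, 0.1], [9.0, 9.0]])
mean, var = aggregate_source_views(feats)
print(mean.shape, var.shape)  # (5, 2) (5, 2)
```

Because distances to the outlier are large, its softmax weight in the inlier rows is negligible, so the inliers' aggregated means stay near the consensus value; this is the robustness the paper attributes to leveraging local structure in the feature set.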
An Efficient Human Instance-Guided Framework for Video Action Recognition
In recent years, human action recognition has been studied by many computer vision researchers. Recent studies have attempted to use two-stream networks based on appearance and motion features, but most of these approaches focus on clip-level video action recognition. In contrast to traditional methods, which generally use entire images, we propose a new human instance-level video action recognition framework. In this framework, we represent instance-level features using human boxes and keypoints, and our action region features serve as the inputs of the temporal action head network, which makes our framework more discriminative. We also propose novel temporal action head networks consisting of various modules that reflect various temporal dynamics well. In experiments, the proposed models achieve performance comparable with state-of-the-art approaches on two challenging datasets. Furthermore, we evaluate the proposed features and networks to verify their effectiveness. Finally, we analyze the confusion matrix and visualize the recognized actions at the human instance level when several people are present.
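The instance-level representation can be sketched as box-normalized keypoints pooled over time; the pooling head below is a toy stand-in (the paper's temporal action head networks are learned modules), and the COCO-style 17-joint layout is an assumption for the demo.

```python
import numpy as np

def instance_features(keypoints, box):
    """Normalize 2D keypoints into their person box: a simplified stand-in
    for the instance-level action region features in the abstract."""
    x, y, w, h = box
    return ((keypoints - np.array([x, y])) / np.array([w, h])).ravel()

def temporal_head(per_frame_feats):
    """Toy temporal head: concatenate mean- and max-pooling over time
    (not the paper's learned modules)."""
    return np.concatenate([per_frame_feats.mean(axis=0),
                           per_frame_feats.max(axis=0)])

rng = np.random.default_rng(0)
T, K = 16, 17  # frames, keypoints (e.g. a COCO-style 17-joint skeleton)
clip_feats = np.stack([
    instance_features(rng.uniform(0, 50, size=(K, 2)),
                      box=(10.0, 10.0, 40.0, 80.0))
    for _ in range(T)
])
pooled = temporal_head(clip_feats)
print(clip_feats.shape, pooled.shape)  # (16, 34) (68,)
```

Running one such feature stack per person is what makes the framework instance-level: each individual in a multi-person scene gets its own pooled descriptor and its own action prediction.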
The Sixth Visual Object Tracking VOT2018 Challenge Results
The Visual Object Tracking challenge VOT2018 is the sixth annual tracker benchmarking activity organized by the VOT initiative. Results of over eighty trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis, as well as a “real-time” experiment simulating a situation where a tracker processes images as if provided by a continuously running sensor. A long-term tracking subchallenge has been added to the set of standard VOT sub-challenges. The new subchallenge focuses on long-term tracking properties, namely coping with target disappearance and reappearance. A new dataset has been compiled, and a performance evaluation methodology that focuses on long-term tracking capabilities has been adopted. The VOT toolkit has been updated to support both the standard short-term and the new long-term tracking subchallenges. Performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit, and the results are publicly available at the challenge website (http://votchallenge.net).