FlowLens: Seeing Beyond the FoV via Flow-guided Clip-Recurrent Transformer
Limited by hardware cost and system size, camera's Field-of-View (FoV) is not
always satisfactory. However, from a spatio-temporal perspective, information
beyond the camera's physical FoV is off-the-shelf and can actually be obtained
"for free" from the past. In this paper, we propose a novel task termed
Beyond-FoV Estimation, aiming to exploit past visual cues to bidirectionally
break through the physical FoV of a camera. We put forward a FlowLens
architecture that expands the FoV by propagating features explicitly via
optical flow and implicitly via a novel clip-recurrent transformer, which has
two appealing features: 1) FlowLens comprises a newly proposed Clip-Recurrent
Hub with 3D-Decoupled Cross Attention (DDCA) to progressively process global
information accumulated in the temporal dimension. 2) A multi-branch Mix Fusion
Feed Forward Network (MixF3N) is integrated to enhance the spatially-precise
flow of local features. To foster training and evaluation, we establish
KITTI360-EX, a dataset for outer- and inner-FoV expansion. Extensive
experiments on both video inpainting and beyond-FoV estimation tasks show that
FlowLens achieves state-of-the-art performance. Code will be made publicly
available at https://github.com/MasterHow/FlowLens.
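At a high level, the clip-recurrent transformer can be pictured as the current clip's features cross-attending to a cached memory of past-clip features, decoupled into a temporal step and a spatial step. The PyTorch sketch below is only a minimal illustration of that mechanism under our own assumptions about tensor shapes and module structure; it is not the released FlowLens implementation of DDCA or MixF3N.

import torch
import torch.nn as nn

class ClipRecurrentCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.memory = None  # features cached from the previously processed clip

    def forward(self, x):
        # x: (B, T, N, C) -- batch, frames in the current clip, tokens per frame, channels
        B, T, N, C = x.shape
        mem = x if self.memory is None else self.memory  # the first clip attends to itself

        # Temporal cross attention: each spatial token queries the cached clip across time.
        q = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        kv = mem.permute(0, 2, 1, 3).reshape(B * N, mem.shape[1], C)
        t_out, _ = self.temporal_attn(q, kv, kv)
        t_out = t_out.reshape(B, N, T, C).permute(0, 2, 1, 3)

        # Spatial self attention: tokens within each frame attend to each other.
        s_in = t_out.reshape(B * T, N, C)
        s_out, _ = self.spatial_attn(s_in, s_in, s_in)

        self.memory = x.detach()  # cache the current clip for the next call (clip recurrence)
        return s_out.reshape(B, T, N, C)

# Usage: feed clips sequentially; later clips can draw on information cached from earlier ones.
attn = ClipRecurrentCrossAttention(dim=64)
_ = attn(torch.randn(1, 5, 16, 64))      # first clip, warms up the memory
fused = attn(torch.randn(1, 5, 16, 64))  # second clip, cross-attends to the first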
Control and Analysis for Sequential Information based on Machine Learning
Sequential information is crucial for time-dependent real-world applications, much as time series are described by sequence data that follow a temporal order at regular intervals. In this thesis, we consider four major tasks involving sequential information: sequential trend prediction, control strategy optimisation, visual-temporal interpolation, and visual-semantic sequential alignment. We develop machine learning theories and provide state-of-the-art models for various real-world applications that involve sequential processes, including industrial batch processes, sequential video inpainting, and sequential visual-semantic image captioning. The ultimate goal is to design a hybrid framework that unifies diverse sequential information analysis and control systems.
For industrial processes, control algorithms rely on simulations to find the optimal control strategy. However, few machine learning techniques can control the process from raw data, although some works use ML to predict trends. Most control methods rely on large amounts of previous experience and cannot exploit future information to optimise the control strategy. To improve the effectiveness of the industrial process, we propose improved reinforcement learning approaches that can adapt the control strategy. We also propose a hybrid reinforcement virtual learning approach to optimise the long-term control strategy. This approach creates a virtual space that interacts with reinforcement learning to predict a virtual strategy without conducting any real experiments, thereby improving and optimising control efficiency.
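The virtual space described here behaves like model-based reinforcement learning against a surrogate of the plant. The sketch below is a generic illustration under assumed names (VirtualPlant, improve_policy) and a simple tabular Q-learning update; it is not the thesis's algorithm, only the basic pattern of fitting a surrogate from logged transitions and improving a policy entirely inside it.

import numpy as np

class VirtualPlant:
    """Surrogate dynamics fitted to logged (state, action, next_state, reward) tuples."""
    def __init__(self, transitions):
        # Nearest-neighbour lookup stands in for a fitted regression/neural dynamics model.
        self.transitions = transitions

    def step(self, state, action):
        query = np.append(state, action)
        dists = [np.linalg.norm(np.append(s, a) - query) for s, a, _, _ in self.transitions]
        _, _, next_state, reward = self.transitions[int(np.argmin(dists))]
        return next_state, reward

def improve_policy(plant, q_table, states, actions, episodes=50, alpha=0.1, gamma=0.95):
    """Tabular Q-learning run entirely inside the virtual plant, i.e. no real experiments."""
    for _ in range(episodes):
        s = np.random.randint(len(states))
        for _ in range(20):  # short virtual rollout
            a = np.argmax(q_table[s]) if np.random.rand() > 0.2 else np.random.randint(len(actions))
            next_state, reward = plant.step(states[s], actions[a])
            ns = int(np.argmin([np.linalg.norm(next_state - x) for x in states]))
            q_table[s, a] += alpha * (reward + gamma * q_table[ns].max() - q_table[s, a])
            s = ns
    return q_table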
For sequential visual information analysis, we propose a dual-fusion transformer model to tackle sequential visual-temporal encoding in video inpainting tasks. Our framework includes a flow-guided transformer with dual attention fusion, and we observe that the sequential information is effectively processed, resulting in promising inpainted videos.
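One way to read "dual attention fusion" is as two attention branches, one guided by flow-aligned features from neighbouring frames and one global self-attention branch, whose outputs are merged by a learned gate. The PyTorch sketch below illustrates that reading; the branch split and gating are our own assumptions rather than the exact thesis architecture.

import torch
import torch.nn as nn

class DualAttentionFusion(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, tokens, flow_aligned):
        # tokens: (B, N, C) features of the frame to inpaint
        # flow_aligned: (B, N, C) features warped from neighbouring frames by optical flow
        local, _ = self.local_attn(tokens, flow_aligned, flow_aligned)  # flow-guided branch
        globl, _ = self.global_attn(tokens, tokens, tokens)             # global self-attention branch
        g = self.gate(torch.cat([local, globl], dim=-1))                # per-token fusion weights
        return g * local + (1 - g) * globl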
Finally, we propose a cycle-based captioning model for the analysis of sequential visual-semantic information. This model augments data from two views to optimise caption generation from an image, addressing new few-shot and zero-shot settings. The proposed model generates more accurate and informative captions by leveraging sequential visual-semantic information.
Overall, the thesis contributes to analysing and manipulating sequential information in multi-modal real-world applications. Our flexible framework design provides a unified theoretical foundation for deploying sequential information systems in distinct application domains. Considering the diversity of challenges addressed in this thesis, we believe our techniques pave the way towards versatile AI in the new era.
Robust Pose Transfer with Dynamic Details using Neural Video Rendering
Pose transfer of human videos aims to generate a high fidelity video of a
target person imitating actions of a source person. A few studies have made
great progress either through image translation with deep latent features or
neural rendering with explicit 3D features. However, both of them rely on large
amounts of training data to generate realistic results, and the performance
degrades on more accessible internet videos due to insufficient training
frames. In this paper, we demonstrate that the dynamic details can be preserved
even when training from short monocular videos. Overall, we propose a neural video
rendering framework coupled with an image-translation-based dynamic details
generation network (D2G-Net), which fully utilizes both the stability of
explicit 3D features and the capacity of learning components. To be specific, a
novel texture representation is presented to encode both the static and
pose-varying appearance characteristics, which is then mapped to the image
space and rendered as a detail-rich frame in the neural rendering stage.
Moreover, we introduce a concise temporal loss in the training stage to
suppress the detail flickering that is made more visible due to high-quality
dynamic details generated by our method. Through extensive comparisons, we
demonstrate that our neural human video renderer is capable of achieving both
clearer dynamic details and more robust performance even on accessible short
videos with only 2k-4k frames. Video link: https://www.bilibili.com/video/BV1y64y1C7ge
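A temporal loss of the kind mentioned above is commonly implemented by warping the previous generated frame to the current time step with optical flow and penalising the remaining difference. The sketch below shows that standard formulation; it is not necessarily the paper's exact loss, and the optional mask (e.g. for occlusions) is an assumption.

import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (B, C, H, W) with `flow` (B, 2, H, W) given in pixel units."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=frame.device),
                            torch.arange(W, device=frame.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    grid_x = 2.0 * grid[:, 0] / max(W - 1, 1) - 1.0  # normalise coordinates to [-1, 1]
    grid_y = 2.0 * grid[:, 1] / max(H - 1, 1) - 1.0
    return F.grid_sample(frame, torch.stack((grid_x, grid_y), dim=-1), align_corners=True)

def temporal_loss(curr_gen, prev_gen, flow_prev_to_curr, mask=None):
    """L1 difference between the current frame and the flow-warped previous frame."""
    warped_prev = warp(prev_gen, flow_prev_to_curr)
    diff = (curr_gen - warped_prev).abs()
    if mask is not None:  # optional occlusion/validity mask
        diff = diff * mask
    return diff.mean()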
Neural Video Recovery for Cloud Gaming
Cloud gaming is a multi-billion dollar industry. A client in cloud gaming
sends its movement to the game server on the Internet, which renders and
transmits the resulting video back. In order to provide a good gaming
experience, a latency below 80 ms is required. This means that video rendering,
encoding, transmission, decoding, and display have to finish within that time
frame, which is especially challenging to achieve due to server overload,
network congestion, and losses. In this paper, we propose a new method for
recovering lost or corrupted video frames in cloud gaming. Unlike traditional
video frame recovery, our approach uses game states to significantly enhance
recovery accuracy and utilizes partially decoded frames to recover lost
portions. We develop a holistic system that consists of (i) efficiently
extracting game states, (ii) modifying the H.264 video decoder to generate a mask
to indicate which portions of video frames need recovery, and (iii) designing a
novel neural network to recover either complete or partial video frames. Our
approach is extensively evaluated using iPhone 12 and laptop implementations,
and we demonstrate the utility of game states in game video recovery and the
effectiveness of our overall design.
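A much-simplified way to picture the recovery network is a convolutional model that takes the partially decoded frame, the decoder-produced loss mask, and rasterised game-state features, and fills in only the masked regions. The sketch below shows this composition under assumed layer sizes and an assumed game-state encoding; the paper's actual network is more sophisticated.

import torch
import torch.nn as nn

class FrameRecoveryNet(nn.Module):
    def __init__(self, state_channels: int = 4):
        super().__init__()
        in_ch = 3 + 1 + state_channels  # RGB frame + loss mask + game-state features
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, partial_frame, loss_mask, state_feat):
        # partial_frame: (B, 3, H, W); loss_mask: (B, 1, H, W) with 1 = lost pixels;
        # state_feat: (B, state_channels, H, W) rasterised game-state cues (e.g. camera motion).
        pred = self.net(torch.cat([partial_frame, loss_mask, state_feat], dim=1))
        # Keep correctly decoded pixels and substitute only the lost regions.
        return partial_frame * (1 - loss_mask) + pred * loss_mask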
VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation
We introduce VideoFlow, a novel optical flow estimation framework for videos.
In contrast to previous methods that learn to estimate optical flow from two
frames, VideoFlow concurrently estimates bi-directional optical flows for
multiple frames that are available in videos by fully exploiting temporal
temporal cues. We first propose a TRi-frame Optical Flow (TROF) module that
estimates bi-directional optical flows for the center frame in a three-frame
manner. The information of the frame triplet is iteratively fused onto the
center frame. To extend TROF for handling more frames, we further propose a
MOtion Propagation (MOP) module that bridges multiple TROFs and propagates
motion features between adjacent TROFs. With the iterative flow estimation
refinement, the information fused in individual TROFs can be propagated into
the whole sequence via MOP. By effectively exploiting video information,
VideoFlow presents extraordinary performance, ranking 1st on all public
benchmarks. On the Sintel benchmark, VideoFlow achieves 1.649 and 0.991 average
end-point-error (AEPE) on the final and clean passes, a 15.1% and 7.6% error
reduction from the best-published results (1.943 and 1.073 from FlowFormer++).
On the KITTI-2015 benchmark, VideoFlow achieves an F1-all error of 3.65%, a
19.2% error reduction from the best-published result (4.52% from FlowFormer++).
Code is released at https://github.com/XiaoyuShi97/VideoFlow.
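For reference, the average end-point error (AEPE) quoted above is the mean Euclidean distance between predicted and ground-truth flow vectors, and the quoted percentages are relative error reductions. The small helper below reproduces those percentages from the abstract's numbers.

import numpy as np

def aepe(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and ground-truth flow fields of shape (H, W, 2)."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())

def error_reduction(new, old):
    """Relative error reduction in percent."""
    return 100.0 * (old - new) / old

# Figures from the abstract: Sintel final/clean AEPE and KITTI-2015 F1-all.
print(round(error_reduction(1.649, 1.943), 1))  # ~15.1 %
print(round(error_reduction(0.991, 1.073), 1))  # ~7.6 %
print(round(error_reduction(3.65, 4.52), 1))    # ~19.2 %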