Recommended from our members
Understanding the Dynamic Visual World: From Motion to Semantics
We live in a dynamic world, which is continuously in motion. Perceiving and interpreting the dynamic surroundings is an essential capability for an intelligent agent. Human beings have the remarkable capability to learn from limited data, with partial or little annotation, in sharp contrast to computational perception models that rely on large-scale, manually labeled data. Reliance on strongly supervised models with manually labeled data inherently prohibits us from modeling the dynamic visual world, as manual annotations are tedious, expensive, and not scalable, especially if we would like to solve multiple scene understanding tasks at the same time. Even worse, in some cases manual annotations are completely infeasible, such as the motion vector of each pixel (i.e., optical flow), since humans cannot reliably produce such labels. In fact, as we move around in a dynamic world, motion information, arising from the moving camera, independently moving objects, and scene geometry, carries abundant information revealing the structure and complexity of our dynamic visual world. As the famous psychologist James J. Gibson suggested, "we must perceive in order to move, but we also must move in order to perceive". In this thesis, we investigate how to use the motion information contained in unlabeled or partially labeled videos to better understand and synthesize the dynamic visual world.
This thesis consists of three parts. In the first part, we focus on the "move to perceive" aspect. When moving through the world, it is natural for an intelligent agent to associate image patterns with the magnitude of their displacement over time: as the agent moves, far away mountains don't move much; nearby trees move a lot. This natural relationship between the appearance of objects and their apparent motion is a rich source of information about the relationship between the distance of objects and their appearance in images. We present a pretext task of estimating the relative depth of elements of a scene (i.e., ordering the pixels in an image according to distance from the viewer) recovered from the motion field of unlabeled videos. The goal of this pretext task is to induce useful feature representations in deep Convolutional Neural Networks (CNNs). Trained on 1.1 million video frames crawled from YouTube in one hour without any manual labeling, these induced representations provide valuable starting features for the training of neural networks on downstream tasks. On tasks such as semantic image segmentation, this approach is a promising route to matching or even surpassing what ImageNet pre-training gives us today, which needs a huge amount of manual labeling, since all of our training data comes almost for free.
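The heart of this pretext task can be written as a pairwise ranking objective: given two pixels whose depth order is recovered from the motion field, the network is penalised when its predictions violate that order. A minimal sketch under that reading (the function names, margin, and equal-depth handling are illustrative, not the thesis's exact formulation):

```python
def relative_depth_ranking_loss(pred_a, pred_b, order, margin=0.0):
    """Hinge-style ranking loss for one pixel pair.

    pred_a, pred_b: predicted depth scores for pixels a and b.
    order: +1 if a should score higher than b, -1 if lower,
           0 if the motion field says they are at similar depth.
    """
    if order == 0:
        # Encourage equal predictions for pixels at similar depth.
        return (pred_a - pred_b) ** 2
    # Penalize violations of the known depth ordering.
    return max(0.0, margin - order * (pred_a - pred_b))

def batch_loss(pairs):
    """Average the pairwise losses over (pred_a, pred_b, order) tuples."""
    return sum(relative_depth_ranking_loss(*p) for p in pairs) / len(pairs)
```

In practice the pairs would be sampled densely from each frame and the loss backpropagated through the CNN that produces the per-pixel scores.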
In the second part, we study the "perceive to move" aspect. As we humans look around, we do not solve a single vision task at a time. Instead, we perceive our surroundings in a holistic manner, doing visual understanding using all visual cues jointly. By simultaneously solving multiple tasks together, one task can influence another. Specifically, we propose a neural network architecture, called SENSE, which shares common feature representations among four closely related tasks: optical flow estimation, disparity estimation from stereo, occlusion detection, and semantic segmentation. The key insight is that sharing features makes the network more compact and induces better feature representations. For real-world data, however, not all annotations for the four tasks mentioned above are available at the same time. To this end, loss functions are designed that exploit interactions of the different tasks and need no manual annotations, to better handle partially labeled data in a semi-supervised manner, leading to superior understanding performance of the dynamic visual world.
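One way to realise the semi-supervised handling of partially labeled data described above is to mask the loss per task: supervised terms are added only for tasks whose annotations exist, and label-free self-supervised terms cover the rest. A minimal sketch (the dictionary-based interface and the loss functions are illustrative, not the actual SENSE training code):

```python
def semi_supervised_loss(preds, labels, sup_loss, unsup_loss):
    """Combine per-task losses when only some tasks are labeled.

    preds:      dict task -> prediction
    labels:     dict task -> ground truth, or None when unlabeled
    sup_loss:   fn(pred, label) -> float, needs annotations
    unsup_loss: fn(pred) -> float, label-free self-supervision
    """
    total = 0.0
    for task, pred in preds.items():
        label = labels.get(task)
        if label is not None:
            total += sup_loss(pred, label)   # supervised term
        else:
            total += unsup_loss(pred)        # e.g. photometric consistency
    return total
```

A batch with flow labels but no segmentation labels would then still contribute a gradient for both heads, one supervised and one self-supervised.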
Understanding the motion contained in a video enables us to perceive the dynamic visual world in a novel manner. In the third part, we present an approach, called SuperSloMo, which synthesizes slow-motion videos from a standard frame-rate video. Converting a plain video into a slow-motion version enables us to see memorable moments in our life that are otherwise hard to see clearly with the naked eye: a difficult skateboard trick, a dog catching a ball, etc. Such a technique also has wide applications, such as generating smooth view transitions on head-mounted virtual reality (VR) devices, compressing videos, synthesizing videos with motion blur, etc.
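Frame interpolation of the kind SuperSloMo performs rests on warping both inputs toward the target time t and blending them with time-dependent weights; SuperSloMo additionally predicts visibility maps to resolve occlusions. A deliberately simplified 1-D sketch of the warp-and-blend core (integer flows and nearest-neighbour lookup are illustrative simplifications):

```python
def warp(frame, flow):
    """Backward-warp a 1-D 'frame' by a per-pixel integer flow."""
    n = len(frame)
    return [frame[min(max(i + flow[i], 0), n - 1)] for i in range(n)]

def interpolate_frame(f0, f1, flow_t0, flow_t1, t):
    """Blend frames warped to time t with weights (1 - t) and t."""
    w0, w1 = warp(f0, flow_t0), warp(f1, flow_t1)
    return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
```

Sweeping t over (0, 1) yields as many intermediate frames as desired, which is what turns a standard frame-rate clip into a slow-motion one.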
A Study on Test-Time Adaptive Methodologies for Video Frame Interpolation
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: Kyoung Mu Lee.
Computationally handling videos has been one of the foremost goals in computer vision. In particular, analyzing the complex dynamics, including motion and occlusion, between two frames is of fundamental importance in understanding the visual contents of a video. Research on video frame interpolation, a problem whose goal is to synthesize high-quality intermediate frames between two input frames, specifically investigates the low-level characteristics of consecutive frames of a video. The topic has recently been gaining popularity and can be applied to various real-world applications such as generating slow-motion effects, novel view synthesis, or video stabilization. Existing methods for video frame interpolation aim to design complex new architectures to effectively estimate and compensate for the motion between two input frames. However, natural videos contain a wide variety of scenarios, differing in foreground/background appearance and motion, frame rate, and occlusion. Therefore, even with a huge amount of training data, it is difficult for a single model to generalize well to all possible situations.
This dissertation introduces novel methodologies for test-time adaptation to tackle the problem of video frame interpolation. In particular, I propose to make three different aspects of the deep-learning-based framework adaptive: (1) feature activations, (2) network weights, and (3) architectural structures. Specifically, I first present how adaptively scaling the feature activations of a deep neural network with respect to each input frame using attention models allows for accurate interpolation. Unlike previous approaches that heavily depend on optical flow estimation models, the proposed channel-attention-based model can achieve high-quality frame synthesis without explicit motion estimation. Then, meta-learning is employed for fast adaptation of the parameter values of the frame interpolation models. By learning to adapt to each input video clip, the proposed framework can consistently improve the performance of many existing models with just a single gradient update to their parameters. Lastly, I introduce an input-adaptive dynamic architecture that can assign different inference paths with respect to each local region of the input frames. By deciding the scaling factors of the inputs and the network depth of the early exit in the interpolation model, the dynamic framework can greatly improve computational efficiency while maintaining, and sometimes even surpassing, the performance of the baseline interpolation method.
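The meta-learning idea in (2) amounts to one self-supervised gradient step at test time: frames of the test video itself supply a training triplet (predict an existing middle frame from its two neighbours) before the unseen intermediate frames are synthesized. A scalar sketch of that inner update, with a toy one-parameter model standing in for an actual interpolation network:

```python
def inner_update(theta, frames, lr=0.1):
    """One test-time gradient step on a self-supervised triplet.

    frames: (f0, f1, f2) from the test video; f1 acts as ground
    truth for interpolating between f0 and f2.
    Toy model: predict f1 as theta * (f0 + f2) / 2.
    """
    f0, f1, f2 = frames
    pred = theta * (f0 + f2) / 2
    # Gradient of the squared error w.r.t. theta.
    grad = 2 * (pred - f1) * (f0 + f2) / 2
    return theta - lr * grad

def adapted_prediction(theta, f0, f2):
    """Interpolate with the (possibly adapted) parameter."""
    return theta * (f0 + f2) / 2
```

Meta-training would choose the initial theta so that this single step helps on any input clip; here the sketch only shows that the adapted parameter moves the prediction toward the held-out middle frame.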
The effectiveness of the proposed test-time adaptation methodologies is extensively evaluated on multiple benchmark datasets for video frame interpolation. Thorough ablation studies with various hyperparameter settings and baseline networks also demonstrate the superiority of adaptation to the test-time inputs, which is a new research direction orthogonal to other state-of-the-art frame interpolation approaches.
1 Introduction 1
1.1 Motivations 1
1.2 Proposed method 3
1.3 Contributions 5
1.4 Organization of dissertation 6
2 Feature Adaptation based Approach 7
2.1 Introduction 7
2.2 Related works 10
2.2.1 Video frame interpolation 10
2.2.2 Attention mechanism 12
2.3 Proposed Method 12
2.3.1 Overview of network architecture 13
2.3.2 Main components 14
2.3.3 Loss 16
2.4 Understanding our model 17
2.4.1 Internal feature visualization 18
2.4.2 Intermediate image reconstruction 21
2.5 Experiments 23
2.5.1 Datasets 23
2.5.2 Implementation details 25
2.5.3 Comparison to the state-of-the-art 26
2.5.4 Ablation study 36
2.6 Summary 38
3 Meta-Learning based Approach 39
3.1 Introduction 39
3.2 Related works 42
3.3 Proposed method 44
3.3.1 Video frame interpolation problem set-up 44
3.3.2 Exploiting extra information at test time 45
3.3.3 Background on MAML 48
3.3.4 MetaVFI: Meta-learning for frame interpolation 49
3.4 Experiments 54
3.4.1 Settings 54
3.4.2 Meta-learning algorithm selection 56
3.4.3 Video frame interpolation results 58
3.4.4 Ablation studies 66
3.5 Summary 69
4 Dynamic Architecture based Approach 71
4.1 Introduction 71
4.2 Related works 75
4.2.1 Video frame interpolation 75
4.2.2 Adaptive inference 76
4.3 Proposed Method 77
4.3.1 Dynamic framework overview 77
4.3.2 Scale and depth finder (SD-finder) 80
4.3.3 Dynamic interpolation model 82
4.3.4 Training 83
4.4 Experiments 85
4.4.1 Datasets 85
4.4.2 Implementation details 86
4.4.3 Quantitative comparison 87
4.4.4 Visual comparison 93
4.4.5 Ablation study 97
4.5 Summary 100
5 Conclusion 103
5.1 Summary of dissertation 103
5.2 Future works 104
Bibliography 107
Abstract in Korean 120
CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking and Segmentation
The advancement of computer vision has pushed visual analysis tasks from
still images to the video domain. In recent years, video instance segmentation,
which aims to track and segment multiple objects in video frames, has drawn
much attention for its potential applications in various emerging areas such as
autonomous driving, intelligent transportation, and smart retail. In this
paper, we propose an effective framework for instance-level visual analysis on
video frames, which can simultaneously conduct object detection, instance
segmentation, and multi-object tracking. The core idea of our method is
collaborative multi-task learning, achieved by a novel structure named
associative connections, which links the detection, segmentation, and tracking
task heads in an end-to-end learnable CNN. These additional connections allow
information to propagate across multiple related tasks, so as to benefit these
tasks simultaneously. We evaluate the proposed method extensively on the KITTI
MOTS and MOTS Challenge datasets and obtain quite encouraging results.
Optimizing Magnetic Resonance Imaging for Image-Guided Radiotherapy
Magnetic resonance imaging (MRI) is playing an increasingly important role in image-guided radiotherapy. MRI provides excellent soft tissue contrast, and is flexible in characterizing various tissue properties including relaxation, diffusion and perfusion. This thesis aims at developing new image analysis and reconstruction algorithms to optimize MRI in support of treatment planning, target delineation and treatment response assessment for radiotherapy.
First, unlike Computed Tomography (CT) images, MRI cannot provide the electron density information necessary for radiation dose calculation. To address this, we developed a synthetic CT generation algorithm that generates pseudo CT images from MRI, based on tissue classification results on MRI for female pelvic patients. To improve tissue classification accuracy, we learnt a pelvic bone shape model from a training dataset and integrated the shape model into an intensity-based fuzzy c-means classification scheme. The shape-regularized tissue classification algorithm is capable of differentiating tissues that have significant overlap in MRI intensity distributions. Treatment planning dose calculations using synthetic CT image volumes generated from the tissue classification results show acceptably small variations as compared to CT volumes. As MRI artifacts such as B1 field inhomogeneity (bias field) may negatively impact the tissue classification accuracy, we also developed an algorithm that integrates the correction of the bias field into the tissue classification scheme. We modified the fuzzy c-means classification by modeling the image intensity as the true intensity corrupted by the multiplicative bias field. A regularization term further ensures the smoothness of the bias field. We solved the optimization problem using a linearized alternating direction method of multipliers (ADMM) method, which is more computationally efficient than existing methods.
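The classification scheme above extends fuzzy c-means, whose basic alternating updates — memberships from distances to centroids, then centroids from membership-weighted means — are sketched below on scalar intensities (the pelvic shape prior, bias-field model, and ADMM solver from the thesis are omitted):

```python
def fcm_memberships(xs, centers, m=2.0, eps=1e-12):
    """Fuzzy c-means membership update: u_ik proportional to d_ik^(-2/(m-1))."""
    U = []
    for x in xs:
        d = [abs(x - c) + eps for c in centers]      # eps avoids div-by-zero
        w = [dd ** (-2.0 / (m - 1)) for dd in d]
        s = sum(w)
        U.append([wi / s for wi in w])               # rows sum to 1
    return U

def fcm_centers(xs, U, m=2.0):
    """Centroid update: mean of intensities weighted by u_ik^m."""
    k = len(U[0])
    n = len(xs)
    return [sum(U[i][j] ** m * xs[i] for i in range(n)) /
            sum(U[i][j] ** m for i in range(n))
            for j in range(k)]
```

Alternating the two updates until convergence yields the soft tissue classes; the thesis additionally multiplies the intensity model by a smooth bias field and regularizes the class memberships with the learnt bone shape.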
The second part of this thesis looks at a special MR imaging technique, diffusion-weighted MRI (DWI). By acquiring a series of DWI images with a wide range of b-values, high-order diffusion analysis can be performed on the DWI image series, and new biomarkers for tumor grading, delineation, and treatment response evaluation may be extracted. However, DWI suffers from low signal-to-noise ratio at high b-values, and the multi-b-value acquisition makes the total scan time impractical for clinical use. In this thesis, we proposed an accelerated DWI scheme that sparsely samples k-space and reconstructs images using a model-based algorithm. Specifically, we built a 3D block-Hankel tensor from k-space samples and modeled both local and global correlations of the high-dimensional k-space data as a low-rank property of the tensor. We also added a phase constraint to account for large phase variations across different b-values and to allow reconstruction from partial Fourier acquisition, which further accelerates the image acquisition. We proposed an ADMM algorithm to solve the constrained image reconstruction problem. Image reconstructions using both simulated and patient data show improved signal-to-noise ratio. As compared to the clinically used parallel imaging scheme, which achieves a 4-fold acceleration, our method achieves an 8-fold acceleration. Reconstructed images show reduced reconstruction errors, as demonstrated on simulated data, and similar diffusion parameter mapping results on patient data.
Ph.D., Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/143919/1/llliu_1.pd
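A recurring subproblem in such low-rank ADMM reconstructions is singular value thresholding, the proximal operator of the nuclear norm. The sketch below applies it to a plain matrix; the thesis operates on a 3D block-Hankel tensor with an additional phase constraint, which this sketch does not reproduce:

```python
import numpy as np

def singular_value_threshold(M, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the
    singular values of M and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # shrink, never below zero
    return U @ np.diag(s_shrunk) @ Vt
```

Inside the ADMM loop, this step promotes low rank in the Hankel-structured variable, while the other subproblems enforce data consistency with the sparsely sampled k-space and the phase model.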
Fast volume reconstruction from motion corrupted stacks of 2D slices
Capturing an enclosing volume of moving subjects and organs using fast individual image slice acquisition has shown promise in dealing with motion artefacts. Motion between slice acquisitions results in spatial inconsistencies that can be resolved by slice-to-volume reconstruction (SVR) methods to provide high-quality 3D image data. Existing algorithms are, however, typically very slow, specialised to specific applications, and reliant on approximations, which impedes their potential clinical use. In this paper, we present a fast multi-GPU accelerated framework for slice-to-volume reconstruction. It is based on optimised 2D/3D registration, super-resolution with automatic outlier rejection, and an additional (optional) intensity bias correction. We introduce a novel and fully automatic procedure for selecting the image stack with the least motion to serve as an initial registration target. We evaluate the proposed method using artificially motion-corrupted phantom data as well as clinical data, including tracked freehand ultrasound of the liver and fetal Magnetic Resonance Imaging. We achieve speed-up factors greater than 30 compared to a single CPU system and greater than 10 compared to currently available state-of-the-art multi-core CPU methods. We ensure high reconstruction accuracy by exact computation of the point-spread function for every input data point, which has not previously been possible due to computational limitations. Our framework and its implementation are scalable for available computational infrastructures, and tests show a speed-up factor of 1.70 for each additional GPU. This paves the way for the online application of image-based reconstruction methods during clinical examinations. The source code for the proposed approach is publicly available.
Optical flow estimation using steered-L1 norm
Motion is a very important part of understanding the visual picture of the surrounding environment. In image processing, it involves the estimation of displacements of image points in an image sequence. In this context, dense optical flow estimation is concerned with the computation of pixel displacements in a sequence of images, and it has therefore been used widely in the fields of image processing and computer vision. A lot of research has been dedicated to enabling accurate and fast motion computation in image sequences. Despite recent advances in the computation of optical flow, there is still room for improvement, and optical flow algorithms still suffer from several issues, such as motion discontinuities, occlusion handling, and robustness to illumination changes. This thesis presents an investigation of the topic of optical flow and its applications. It addresses several issues in the computation of dense optical flow and proposes solutions. Specifically, this thesis is divided into two main parts dedicated to addressing two main areas of interest in optical flow.
In the first part, image registration using optical flow is investigated. Both local and global optical flow methods have previously been used for image registration. An image registration method based on an improved version of the combined local-global method of optical flow computation is proposed. A bilateral filter is used in this optical flow method to improve its edge-preserving performance. It is shown that image registration via this method gives more robust results than the local and global optical flow methods previously investigated.
The second part of this thesis encompasses the main contribution of this research, which is an improved total variation L1 norm. A smoothness term is used in the optical flow energy function to regularise it. The L1 norm is a plausible choice for such a term because of its performance in preserving edges; however, this term is known to be isotropic and hence decreases the penalisation near motion boundaries equally in all directions. The proposed improved L1 smoothness term (termed here the steered-L1 norm) demonstrates similar performance across motion boundaries but improves the penalisation performance along such boundaries.
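The distinction between the isotropic L1 term and a steered variant can be made concrete on a discrete flow field: the steered term reweights horizontal and vertical flow differences with direction-dependent weights, so penalisation can be relaxed across a motion boundary while being kept along it. A small sketch with the weights supplied explicitly (in practice they would be derived from image structure, which this sketch does not model):

```python
def steered_l1_smoothness(u, w_x, w_y):
    """Direction-weighted L1 penalty on a 2-D flow component u.

    w_x[i][j] weights the horizontal difference at (i, j) and
    w_y[i][j] the vertical one; w_x = w_y = 1 recovers plain TV-L1.
    """
    rows, cols = len(u), len(u[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:   # forward difference in x
                total += w_x[i][j] * abs(u[i][j + 1] - u[i][j])
            if i + 1 < rows:   # forward difference in y
                total += w_y[i][j] * abs(u[i + 1][j] - u[i][j])
    return total
```

Lowering w_y across a horizontal motion boundary halves the cost of the flow jump there without touching the penalty elsewhere, which is the behaviour the steered-L1 norm is after.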
Realtime Dynamic 3D Facial Reconstruction for Monocular Video In-the-Wild
With the increasing amount of videos recorded using 2D mobile cameras, techniques for recovering dynamic 3D facial models from these monocular videos have become a necessity for many image and video editing applications. While methods based on parametric 3D facial models can reconstruct the 3D shape in dynamic environments, large structural changes are ignored. Structure-from-motion methods can reconstruct these changes but assume the object to be static. To address this problem, we present a novel method for realtime dynamic 3D facial tracking and reconstruction from videos captured in uncontrolled environments. Our method can track the deforming facial geometry and reconstruct external objects that protrude from the face, such as glasses and hair. It also allows users to move around and perform facial expressions freely without degrading the reconstruction quality.
Beyond the pixels: learning and utilising video compression features for localisation of digital tampering.
Video compression is pervasive in digital society. With the rising usage of deep convolutional neural networks (CNNs) in the fields of computer vision, video analysis and video tampering detection, it is important to investigate how patterns invisible to human eyes may be influencing modern computer vision techniques and how they can be used advantageously. This work thoroughly explores how video compression influences the accuracy of CNNs and shows that optimal performance is achieved when compression levels in the training set closely match those of the test set. A novel method is then developed, using CNNs, to derive compression features directly from the pixels of video frames. It is then shown that these features can be readily used to detect inauthentic video content with good accuracy across multiple different video tampering techniques. Moreover, the ability to explain these features allows predictions to be made about their effectiveness against future tampering methods. The problem is motivated by a novel investigation into recent video manipulation methods, which shows that there is a consistent drive to produce convincing, photorealistic, manipulated or synthetic video. Humans, blind to the presence of video tampering, are also blind to the type of tampering. New detection techniques are required and, in order to compensate for human limitations, they should be broadly applicable to multiple tampering types. This thesis details the steps necessary to develop and evaluate such techniques.