Recommended from our members
Understanding the Dynamic Visual World: From Motion to Semantics
We live in a dynamic world, which is continuously in motion. Perceiving and interpreting the dynamic surroundings is an essential capability for an intelligent agent. Human beings have the remarkable capability to learn from limited data, with partial or little annotation, in sharp contrast to computational perception models that rely on large-scale, manually labeled data. Reliance on strongly supervised models with manually labeled data inherently prohibits us from modeling the dynamic visual world, as manual annotations are tedious, expensive, and not scalable, especially if we would like to solve multiple scene understanding tasks at the same time. Even worse, in some cases manual annotations are completely infeasible, such as the motion vector of each pixel (i.e., optical flow), since humans cannot reliably produce such labels. In fact, as we move around in a dynamic world, motion information, arising from the moving camera, independently moving objects, and scene geometry, carries abundant information revealing the structure and complexity of our dynamic visual world. As the famous psychologist James J. Gibson suggested, "we must perceive in order to move, but we also must move in order to perceive". In this thesis, we investigate how to use the motion information contained in unlabeled or partially labeled videos to better understand and synthesize the dynamic visual world.
This thesis consists of three parts. In the first part, we focus on the "move to perceive" aspect. When moving through the world, it is natural for an intelligent agent to associate image patterns with the magnitude of their displacement over time: as the agent moves, far away mountains don't move much; nearby trees move a lot. This natural relationship between the appearance of objects and their apparent motion is a rich source of information about the relationship between the distance of objects and their appearance in images. We present a pretext task of estimating the relative depth of elements of a scene (i.e., ordering the pixels in an image according to distance from the viewer) recovered from the motion field of unlabeled videos. The goal of this pretext task is to induce useful feature representations in deep Convolutional Neural Networks (CNNs). Trained on 1.1 million video frames crawled from YouTube in one hour without any manual labeling, these induced representations provide valuable starting features for the training of neural networks on downstream tasks. On tasks such as semantic image segmentation, this approach is a promising route to matching or even surpassing what ImageNet pre-training gives us today, which needs a huge amount of manual labeling, since all of our training data comes almost for free.
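The heart of this pretext task can be written as a pairwise ranking objective: given two pixels whose depth order is recovered from the motion field, the network is penalised when its predictions violate that order. A minimal sketch under that reading (the function names, margin, and equal-depth handling are illustrative, not the thesis's exact formulation):

```python
def relative_depth_ranking_loss(pred_a, pred_b, order, margin=0.0):
    """Hinge-style ranking loss for one pixel pair.

    pred_a, pred_b: predicted depth scores for pixels a and b.
    order: +1 if a should score higher than b, -1 if lower,
           0 if the motion field says they are at similar depth.
    """
    if order == 0:
        # Encourage equal predictions for pixels at similar depth.
        return (pred_a - pred_b) ** 2
    # Penalize violations of the known depth ordering.
    return max(0.0, margin - order * (pred_a - pred_b))

def batch_loss(pairs):
    """Average the pairwise losses over (pred_a, pred_b, order) tuples."""
    return sum(relative_depth_ranking_loss(*p) for p in pairs) / len(pairs)
```

In practice the pairs would be sampled densely from each frame and the loss backpropagated through the CNN that produces the per-pixel scores.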
In the second part, we study the "perceive to move" aspect. As we humans look around, we do not solve a single vision task at a time. Instead, we perceive our surroundings in a holistic manner, doing visual understanding using all visual cues jointly. By simultaneously solving multiple tasks together, one task can influence another. Specifically, we propose a neural network architecture, called SENSE, which shares common feature representations among four closely related tasks: optical flow estimation, disparity estimation from stereo, occlusion detection, and semantic segmentation. The key insight is that sharing features makes the network more compact and induces better feature representations. For real-world data, however, not all annotations for the four tasks mentioned above are available at the same time. To this end, loss functions are designed that exploit interactions of the different tasks and need no manual annotations, to better handle partially labeled data in a semi-supervised manner, leading to superior understanding performance of the dynamic visual world.
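One way to realise the semi-supervised handling of partially labeled data described above is to mask the loss per task: supervised terms are added only for tasks whose annotations exist, and label-free self-supervised terms cover the rest. A minimal sketch (the dictionary-based interface and the loss functions are illustrative, not the actual SENSE training code):

```python
def semi_supervised_loss(preds, labels, sup_loss, unsup_loss):
    """Combine per-task losses when only some tasks are labeled.

    preds:      dict task -> prediction
    labels:     dict task -> ground truth, or None when unlabeled
    sup_loss:   fn(pred, label) -> float, needs annotations
    unsup_loss: fn(pred) -> float, label-free self-supervision
    """
    total = 0.0
    for task, pred in preds.items():
        label = labels.get(task)
        if label is not None:
            total += sup_loss(pred, label)   # supervised term
        else:
            total += unsup_loss(pred)        # e.g. photometric consistency
    return total
```

A batch with flow labels but no segmentation labels would then still contribute a gradient for both heads, one supervised and one self-supervised.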
Understanding the motion contained in a video enables us to perceive the dynamic visual world in a novel manner. In the third part, we present an approach, called SuperSloMo, which synthesizes slow-motion videos from a standard frame-rate video. Converting a plain video into a slow-motion version enables us to see memorable moments in our life that are otherwise hard to see clearly with the naked eye: a difficult skateboard trick, a dog catching a ball, etc. Such a technique also has wide applications, such as generating smooth view transitions on head-mounted virtual reality (VR) devices, compressing videos, synthesizing videos with motion blur, etc.
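Frame interpolation of the kind SuperSloMo performs rests on warping both inputs toward the target time t and blending them with time-dependent weights; SuperSloMo additionally predicts visibility maps to resolve occlusions. A deliberately simplified 1-D sketch of the warp-and-blend core (integer flows and nearest-neighbour lookup are illustrative simplifications):

```python
def warp(frame, flow):
    """Backward-warp a 1-D 'frame' by a per-pixel integer flow."""
    n = len(frame)
    return [frame[min(max(i + flow[i], 0), n - 1)] for i in range(n)]

def interpolate_frame(f0, f1, flow_t0, flow_t1, t):
    """Blend frames warped to time t with weights (1 - t) and t."""
    w0, w1 = warp(f0, flow_t0), warp(f1, flow_t1)
    return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
```

Sweeping t over (0, 1) yields as many intermediate frames as desired, which is what turns a standard frame-rate clip into a slow-motion one.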
A Study on Test-Time Adaptive Methodologies for Video Frame Interpolation
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: Kyoung Mu Lee.
Computationally handling videos has been one of the foremost goals in computer vision. In particular, analyzing the complex dynamics, including motion and occlusion, between two frames is of fundamental importance in understanding the visual contents of a video. Research on video frame interpolation, a problem whose goal is to synthesize high-quality intermediate frames between two input frames, specifically investigates the low-level characteristics of consecutive frames of a video. The topic has recently been gaining popularity and can be applied to various real-world applications such as generating slow-motion effects, novel view synthesis, or video stabilization. Existing methods for video frame interpolation aim to design complex new architectures to effectively estimate and compensate for the motion between two input frames. However, natural videos contain a wide variety of scenarios, differing in foreground/background appearance and motion, frame rate, and occlusion. Therefore, even with a huge amount of training data, it is difficult for a single model to generalize well to all possible situations.
This dissertation introduces novel methodologies for test-time adaptation to tackle the problem of video frame interpolation. In particular, I propose to make three different aspects of the deep-learning-based framework adaptive: (1) feature activations, (2) network weights, and (3) architectural structures. Specifically, I first present how adaptively scaling the feature activations of a deep neural network with respect to each input frame using attention models allows for accurate interpolation. Unlike previous approaches that heavily depend on optical flow estimation models, the proposed channel-attention-based model can achieve high-quality frame synthesis without explicit motion estimation. Then, meta-learning is employed for fast adaptation of the parameter values of the frame interpolation models. By learning to adapt to each input video clip, the proposed framework can consistently improve the performance of many existing models with just a single gradient update to their parameters. Lastly, I introduce an input-adaptive dynamic architecture that can assign different inference paths with respect to each local region of the input frames. By deciding the scaling factors of the inputs and the network depth of the early exit in the interpolation model, the dynamic framework can greatly improve computational efficiency while maintaining, and sometimes even surpassing, the performance of the baseline interpolation method.
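The meta-learning idea in (2) amounts to one self-supervised gradient step at test time: frames of the test video itself supply a training triplet (predict an existing middle frame from its two neighbours) before the unseen intermediate frames are synthesized. A scalar sketch of that inner update, with a toy one-parameter model standing in for an actual interpolation network:

```python
def inner_update(theta, frames, lr=0.1):
    """One test-time gradient step on a self-supervised triplet.

    frames: (f0, f1, f2) from the test video; f1 acts as ground
    truth for interpolating between f0 and f2.
    Toy model: predict f1 as theta * (f0 + f2) / 2.
    """
    f0, f1, f2 = frames
    pred = theta * (f0 + f2) / 2
    # Gradient of the squared error w.r.t. theta.
    grad = 2 * (pred - f1) * (f0 + f2) / 2
    return theta - lr * grad

def adapted_prediction(theta, f0, f2):
    """Interpolate with the (possibly adapted) parameter."""
    return theta * (f0 + f2) / 2
```

Meta-training would choose the initial theta so that this single step helps on any input clip; here the sketch only shows that the adapted parameter moves the prediction toward the held-out middle frame.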
The effectiveness of the proposed test-time adaptation methodologies is extensively evaluated on multiple benchmark datasets for video frame interpolation. Thorough ablation studies with various hyperparameter settings and baseline networks also demonstrate the superiority of adaptation to the test-time inputs, which is a new research direction orthogonal to other state-of-the-art frame interpolation approaches.
1 Introduction 1
1.1 Motivations 1
1.2 Proposed method 3
1.3 Contributions 5
1.4 Organization of dissertation 6
2 Feature Adaptation based Approach 7
2.1 Introduction 7
2.2 Related works 10
2.2.1 Video frame interpolation 10
2.2.2 Attention mechanism 12
2.3 Proposed Method 12
2.3.1 Overview of network architecture 13
2.3.2 Main components 14
2.3.3 Loss 16
2.4 Understanding our model 17
2.4.1 Internal feature visualization 18
2.4.2 Intermediate image reconstruction 21
2.5 Experiments 23
2.5.1 Datasets 23
2.5.2 Implementation details 25
2.5.3 Comparison to the state-of-the-art 26
2.5.4 Ablation study 36
2.6 Summary 38
3 Meta-Learning based Approach 39
3.1 Introduction 39
3.2 Related works 42
3.3 Proposed method 44
3.3.1 Video frame interpolation problem set-up 44
3.3.2 Exploiting extra information at test time 45
3.3.3 Background on MAML 48
3.3.4 MetaVFI: Meta-learning for frame interpolation 49
3.4 Experiments 54
3.4.1 Settings 54
3.4.2 Meta-learning algorithm selection 56
3.4.3 Video frame interpolation results 58
3.4.4 Ablation studies 66
3.5 Summary 69
4 Dynamic Architecture based Approach 71
4.1 Introduction 71
4.2 Related works 75
4.2.1 Video frame interpolation 75
4.2.2 Adaptive inference 76
4.3 Proposed Method 77
4.3.1 Dynamic framework overview 77
4.3.2 Scale and depth finder (SD-finder) 80
4.3.3 Dynamic interpolation model 82
4.3.4 Training 83
4.4 Experiments 85
4.4.1 Datasets 85
4.4.2 Implementation details 86
4.4.3 Quantitative comparison 87
4.4.4 Visual comparison 93
4.4.5 Ablation study 97
4.5 Summary 100
5 Conclusion 103
5.1 Summary of dissertation 103
5.2 Future works 104
Bibliography 107
Abstract in Korean 120
CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking and Segmentation
The advancement of computer vision has pushed visual analysis tasks from
still images to the video domain. In recent years, video instance segmentation,
which aims to track and segment multiple objects in video frames, has drawn
much attention for its potential applications in various emerging areas such as
autonomous driving, intelligent transportation, and smart retail. In this
paper, we propose an effective framework for instance-level visual analysis on
video frames, which can simultaneously conduct object detection, instance
segmentation, and multi-object tracking. The core idea of our method is
collaborative multi-task learning, achieved by a novel structure named
associative connections, which links the detection, segmentation, and tracking
task heads in an end-to-end learnable CNN. These additional connections allow
information to propagate across multiple related tasks, so as to benefit these
tasks simultaneously. We evaluate the proposed method extensively on the KITTI
MOTS and MOTS Challenge datasets and obtain quite encouraging results.
Optimizing Magnetic Resonance Imaging for Image-Guided Radiotherapy
Magnetic resonance imaging (MRI) is playing an increasingly important role in image-guided radiotherapy. MRI provides excellent soft tissue contrast, and is flexible in characterizing various tissue properties including relaxation, diffusion and perfusion. This thesis aims at developing new image analysis and reconstruction algorithms to optimize MRI in support of treatment planning, target delineation and treatment response assessment for radiotherapy.
First, unlike Computed Tomography (CT) images, MRI cannot provide the electron density information necessary for radiation dose calculation. To address this, we developed a synthetic CT generation algorithm that generates pseudo CT images from MRI, based on tissue classification results on MRI for female pelvic patients. To improve tissue classification accuracy, we learnt a pelvic bone shape model from a training dataset and integrated the shape model into an intensity-based fuzzy c-means classification scheme. The shape-regularized tissue classification algorithm is capable of differentiating tissues that have significant overlap in MRI intensity distributions. Treatment planning dose calculations using synthetic CT image volumes generated from the tissue classification results show acceptably small variations as compared to CT volumes. As MRI artifacts such as B1 field inhomogeneity (bias field) may negatively impact the tissue classification accuracy, we also developed an algorithm that integrates the correction of the bias field into the tissue classification scheme. We modified the fuzzy c-means classification by modeling the image intensity as the true intensity corrupted by the multiplicative bias field. A regularization term further ensures the smoothness of the bias field. We solved the optimization problem using a linearized alternating direction method of multipliers (ADMM) method, which is more computationally efficient than existing methods.
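The classification scheme above extends fuzzy c-means, whose basic alternating updates — memberships from distances to centroids, then centroids from membership-weighted means — are sketched below on scalar intensities (the pelvic shape prior, bias-field model, and ADMM solver from the thesis are omitted):

```python
def fcm_memberships(xs, centers, m=2.0, eps=1e-12):
    """Fuzzy c-means membership update: u_ik proportional to d_ik^(-2/(m-1))."""
    U = []
    for x in xs:
        d = [abs(x - c) + eps for c in centers]      # eps avoids div-by-zero
        w = [dd ** (-2.0 / (m - 1)) for dd in d]
        s = sum(w)
        U.append([wi / s for wi in w])               # rows sum to 1
    return U

def fcm_centers(xs, U, m=2.0):
    """Centroid update: mean of intensities weighted by u_ik^m."""
    k = len(U[0])
    n = len(xs)
    return [sum(U[i][j] ** m * xs[i] for i in range(n)) /
            sum(U[i][j] ** m for i in range(n))
            for j in range(k)]
```

Alternating the two updates until convergence yields the soft tissue classes; the thesis additionally multiplies the intensity model by a smooth bias field and regularizes the class memberships with the learnt bone shape.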
The second part of this thesis looks at a special MR imaging technique, diffusion-weighted MRI (DWI). By acquiring a series of DWI images with a wide range of b-values, high-order diffusion analysis can be performed on the DWI image series, and new biomarkers for tumor grading, delineation, and treatment response evaluation may be extracted. However, DWI suffers from low signal-to-noise ratio at high b-values, and the multi-b-value acquisition makes the total scan time impractical for clinical use. In this thesis, we proposed an accelerated DWI scheme that sparsely samples k-space and reconstructs images using a model-based algorithm. Specifically, we built a 3D block-Hankel tensor from k-space samples and modeled both local and global correlations of the high-dimensional k-space data as a low-rank property of the tensor. We also added a phase constraint to account for large phase variations across different b-values and to allow reconstruction from partial Fourier acquisition, which further accelerates the image acquisition. We proposed an ADMM algorithm to solve the constrained image reconstruction problem. Image reconstructions using both simulated and patient data show improved signal-to-noise ratio. As compared to the clinically used parallel imaging scheme, which achieves a 4-fold acceleration, our method achieves an 8-fold acceleration. Reconstructed images show reduced reconstruction errors, as demonstrated on simulated data, and similar diffusion parameter mapping results on patient data.
Ph.D., Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/143919/1/llliu_1.pd
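A recurring subproblem in such low-rank ADMM reconstructions is singular value thresholding, the proximal operator of the nuclear norm. The sketch below applies it to a plain matrix; the thesis operates on a 3D block-Hankel tensor with an additional phase constraint, which this sketch does not reproduce:

```python
import numpy as np

def singular_value_threshold(M, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the
    singular values of M and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # shrink, never below zero
    return U @ np.diag(s_shrunk) @ Vt
```

Inside the ADMM loop, this step promotes low rank in the Hankel-structured variable, while the other subproblems enforce data consistency with the sparsely sampled k-space and the phase model.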
Fast volume reconstruction from motion corrupted stacks of 2D slices
Capturing an enclosing volume of moving subjects and organs using fast individual image slice acquisition has shown promise in dealing with motion artefacts. Motion between slice acquisitions results in spatial inconsistencies that can be resolved by slice-to-volume reconstruction (SVR) methods to provide high-quality 3D image data. Existing algorithms are, however, typically very slow, specialised to specific applications, and reliant on approximations, which impedes their potential clinical use. In this paper, we present a fast multi-GPU accelerated framework for slice-to-volume reconstruction. It is based on optimised 2D/3D registration, super-resolution with automatic outlier rejection, and an additional (optional) intensity bias correction. We introduce a novel and fully automatic procedure for selecting the image stack with the least motion to serve as an initial registration target. We evaluate the proposed method using artificially motion-corrupted phantom data as well as clinical data, including tracked freehand ultrasound of the liver and fetal Magnetic Resonance Imaging. We achieve speed-up factors greater than 30 compared to a single CPU system and greater than 10 compared to currently available state-of-the-art multi-core CPU methods. We ensure high reconstruction accuracy by exact computation of the point-spread function for every input data point, which has not previously been possible due to computational limitations. Our framework and its implementation are scalable for available computational infrastructures, and tests show a speed-up factor of 1.70 for each additional GPU. This paves the way for the online application of image-based reconstruction methods during clinical examinations. The source code for the proposed approach is publicly available.
Optical flow estimation using steered-L1 norm
Motion is a very important part of understanding the visual picture of the surrounding environment. In image processing, it involves the estimation of displacements of image points in an image sequence. In this context, dense optical flow estimation is concerned with the computation of pixel displacements in a sequence of images, and it has therefore been used widely in the fields of image processing and computer vision. A lot of research has been dedicated to enabling accurate and fast motion computation in image sequences. Despite recent advances in the computation of optical flow, there is still room for improvement, and optical flow algorithms still suffer from several issues, such as motion discontinuities, occlusion handling, and robustness to illumination changes. This thesis presents an investigation of the topic of optical flow and its applications. It addresses several issues in the computation of dense optical flow and proposes solutions. Specifically, this thesis is divided into two main parts dedicated to addressing two main areas of interest in optical flow.
In the first part, image registration using optical flow is investigated. Both local and global optical flow methods have previously been used for image registration. An image registration method based on an improved version of the combined local-global method of optical flow computation is proposed. A bilateral filter is used in this optical flow method to improve its edge-preserving performance. It is shown that image registration via this method gives more robust results than the local and global optical flow methods previously investigated.
The second part of this thesis encompasses the main contribution of this research, which is an improved total variation L1 norm. A smoothness term is used in the optical flow energy function to regularise it. The L1 norm is a plausible choice for such a term because of its performance in preserving edges; however, this term is known to be isotropic and hence decreases the penalisation near motion boundaries equally in all directions. The proposed improved L1 smoothness term (termed here the steered-L1 norm) demonstrates similar performance across motion boundaries but improves the penalisation performance along such boundaries.
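The distinction between the isotropic L1 term and a steered variant can be made concrete on a discrete flow field: the steered term reweights horizontal and vertical flow differences with direction-dependent weights, so penalisation can be relaxed across a motion boundary while being kept along it. A small sketch with the weights supplied explicitly (in practice they would be derived from image structure, which this sketch does not model):

```python
def steered_l1_smoothness(u, w_x, w_y):
    """Direction-weighted L1 penalty on a 2-D flow component u.

    w_x[i][j] weights the horizontal difference at (i, j) and
    w_y[i][j] the vertical one; w_x = w_y = 1 recovers plain TV-L1.
    """
    rows, cols = len(u), len(u[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:   # forward difference in x
                total += w_x[i][j] * abs(u[i][j + 1] - u[i][j])
            if i + 1 < rows:   # forward difference in y
                total += w_y[i][j] * abs(u[i + 1][j] - u[i][j])
    return total
```

Lowering w_y across a horizontal motion boundary halves the cost of the flow jump there without touching the penalty elsewhere, which is the behaviour the steered-L1 norm is after.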
Realtime Dynamic 3D Facial Reconstruction for Monocular Video In-the-Wild
With the increasing amount of videos recorded using 2D mobile cameras, techniques for recovering dynamic 3D facial models from these monocular videos have become a necessity for many image and video editing applications. While methods based on parametric 3D facial models can reconstruct the 3D shape in dynamic environments, large structural changes are ignored. Structure-from-motion methods can reconstruct these changes but assume the object to be static. To address this problem, we present a novel method for realtime dynamic 3D facial tracking and reconstruction from videos captured in uncontrolled environments. Our method can track the deforming facial geometry and reconstruct external objects that protrude from the face, such as glasses and hair. It also allows users to move around and perform facial expressions freely without degrading the reconstruction quality.
Beyond the pixels: learning and utilising video compression features for localisation of digital tampering.
Video compression is pervasive in digital society. With the rising usage of deep convolutional neural networks (CNNs) in the fields of computer vision, video analysis and video tampering detection, it is important to investigate how patterns invisible to human eyes may be influencing modern computer vision techniques and how they can be used advantageously. This work thoroughly explores how video compression influences the accuracy of CNNs and shows that optimal performance is achieved when compression levels in the training set closely match those of the test set. A novel method is then developed, using CNNs, to derive compression features directly from the pixels of video frames. It is then shown that these features can be readily used to detect inauthentic video content with good accuracy across multiple different video tampering techniques. Moreover, the ability to explain these features allows predictions to be made about their effectiveness against future tampering methods. The problem is motivated by a novel investigation into recent video manipulation methods, which shows that there is a consistent drive to produce convincing, photorealistic, manipulated or synthetic video. Humans, blind to the presence of video tampering, are also blind to the type of tampering. New detection techniques are required and, in order to compensate for human limitations, they should be broadly applicable to multiple tampering types. This thesis details the steps necessary to develop and evaluate such techniques.