A MAP-Estimation Framework for Blind Deblurring Using High-Level Edge Priors
In this paper we propose a general MAP-estimation framework for blind image deconvolution that allows the incorporation of powerful priors for predicting the edges of the latent image, which is known to be a crucial factor for the success of blind deblurring. This is achieved in a principled, robust and unified manner through the use of a global energy function that can take multiple constraints into account. Based on this framework, we show how to successfully exploit a particular prior of this type that is quite strong and applicable to a wide variety of cases. It relates to the strong structural regularity exhibited by many scenes, which affects the location and distribution of the corresponding image edges. We validate the excellent performance of our approach through an extensive set of experimental results and comparisons to the state of the art.
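Purely as an illustration of the kind of objective described here (a generic MAP blind-deconvolution energy, not the authors' exact formulation; the weights \lambda and the penalties \phi, \psi, \rho are hypothetical):

(\hat{x}, \hat{k}) = \arg\min_{x,k} E(x,k), \qquad
E(x,k) = \| k \ast x - y \|_2^2 + \lambda_x \,\phi(\nabla x) + \lambda_k \,\psi(k) + \lambda_e \,\rho(\nabla x, e)

where y is the blurred observation, \ast denotes convolution, and \rho penalizes latent-image edges that disagree with the high-level edge prior e (here, the structural regularity of the scene).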
Visual to Sound: Generating Natural Sound for Videos in the Wild
As two of the five traditional human senses (sight, hearing, taste, smell,
and touch), vision and sound are basic sources through which humans understand
the world. Often correlated during natural events, these two modalities combine
to jointly affect human perception. In this paper, we pose the task of
generating sound given visual input. Such capabilities could help enable
applications in virtual reality (generating sound for virtual scenes
automatically) or provide additional accessibility to images or videos for
people with visual impairments. As a first step in this direction, we apply
learning-based methods to generate raw waveform samples given input video
frames. We evaluate our models on a dataset of videos containing a variety of
sounds (such as ambient sounds and sounds from people/animals). Our experiments
show that the generated sounds are fairly realistic and have good temporal
synchronization with the visual inputs.
Comment: Project page:
http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.htm
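As a rough sketch of the kind of model described above (the paper's own architectures differ; all layer sizes and names here are hypothetical), the following PyTorch module conditions an autoregressive waveform decoder on per-frame visual features. In practice the waveform is sampled at a far higher rate than the video, so frame features would be repeated or upsampled to the audio rate:

import torch
import torch.nn as nn

class Frame2Wave(nn.Module):
    # Minimal sketch: encode video frames, then predict raw waveform
    # samples autoregressively, conditioned on the visual features.
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        # Per-frame visual encoder (a real system would use a pretrained CNN).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Autoregressive decoder: previous sample + frame feature -> next sample.
        self.rnn = nn.GRU(feat_dim + 1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, frames, wave_prev):
        # frames: (B, T, 3, H, W); wave_prev: (B, T, 1) previous samples.
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        h, _ = self.rnn(torch.cat([feats, wave_prev], dim=-1))
        return self.out(h)  # (B, T, 1) predicted next samples

model = Frame2Wave()
pred = model(torch.randn(2, 16, 3, 64, 64), torch.randn(2, 16, 1))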
Head-disk system characterization with head itself as transducer
Master's thesis (Master of Engineering)
Learning Beyond-pixel Mappings from Internet Videos
Recently in the computer vision community, there have been significant advancements in algorithms that recognize or localize visual content in both images and videos, for instance object recognition and detection. These algorithms infer information that is directly visible within the images or video frames (predicting what's in the frame). Human-level visual understanding goes much further, because humans also have insights about information 'beyond the frame'. In other words, people can reasonably infer information that is not visible in the current scene, such as possible future events. We expect computational models to possess the same capabilities one day. Learning beyond-pixel mappings can be a broad concept. In this dissertation, we carefully define and formulate the problems as specific, subdivided tasks from different aspects. In this context, a beyond-pixel mapping infers information of broader spatial or temporal context, or even information from other modalities such as text or sound. We first present a computational framework to learn the mappings between short event video clips and their intrinsic temporal sequence (which one usually happens first). We then explore the follow-up direction of directly predicting the future: specifically, we utilize generative models to predict depictions of objects in their future state. Next, we explore a related generation task, generating video frames of a target person with unseen poses guided by a random person. Finally, we propose a framework to learn the mappings between input video frames and their counterpart in the sound domain. The main contribution of this dissertation lies in exploring beyond-pixel mappings from various directions to add relevant knowledge to the next-generation AI platforms.
Doctor of Philosophy
Down Selection of Polymerized Bovine Hemoglobins for Use as Oxygen Releasing Therapeutics in a Guinea Pig Model
Editor's Highlight: The development of hemoglobin-based oxygen carriers (HBOCs) as a replacement for whole-blood transfusions has been impeded by their systemic toxicity. This paper presents data from a series of HBOCs, demonstrating one candidate that meets predetermined safety criteria. This approach may allow the development of an acceptable blood substitute for human use.
VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection
In recent years, transformer-based detectors have demonstrated remarkable
performance in 2D visual perception tasks. However, their performance in
multi-view 3D object detection remains inferior to the state-of-the-art (SOTA)
of convolutional neural network based detectors. In this work, we investigate
this issue from the perspective of bird's-eye-view (BEV) feature generation.
Specifically, we examine the BEV feature generation method employed by the
transformer-based SOTA, BEVFormer, and identify its two limitations: (i) it
only generates attention weights from BEV, which precludes the use of lidar
points for supervision, and (ii) it aggregates camera view features to the BEV
through deformable sampling, which only selects a small subset of features and
fails to exploit all information. To overcome these limitations, we propose a
novel BEV feature generation method, dual-view attention, which generates
attention weights from both the BEV and camera view. This method encodes all
camera features into the BEV feature. By combining dual-view attention with the
BEVFormer architecture, we build a new detector named VoxelFormer. Extensive
experiments are conducted on the nuScenes benchmark to verify the superiority
of dual-view attention and VoxelFormer. We observe that even when adopting only
3 encoders and 1 historical frame during training, VoxelFormer still outperforms
BEVFormer significantly. When trained in the same setting, VoxelFormer
surpasses BEVFormer by 4.9 NDS points. Code is available at:
https://github.com/Lizhuoling/VoxelFormer-public.git
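The contrast with deformable sampling can be sketched as dense cross-attention: the weights depend on both the BEV queries and the camera-view keys, and every camera feature contributes to the BEV feature. This is an illustrative approximation with hypothetical dimensions, not the authors' implementation:

import torch
import torch.nn as nn

class DualViewAttention(nn.Module):
    # Attention weights are computed from BOTH the BEV queries and the
    # camera-view features, and all camera features are aggregated into
    # the BEV (deformable sampling instead picks a sparse subset).
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # from BEV queries
        self.k = nn.Linear(dim, dim)  # from camera-view features
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, bev_queries, cam_feats):
        # bev_queries: (B, N_bev, C); cam_feats: (B, N_cam, C), flattened
        # over cameras and spatial positions.
        attn = torch.softmax(
            self.q(bev_queries) @ self.k(cam_feats).transpose(1, 2) * self.scale,
            dim=-1,
        )  # (B, N_bev, N_cam): weights depend on both views
        return attn @ self.v(cam_feats)  # every camera feature contributes

bev = torch.randn(1, 50 * 50, 256)   # toy BEV grid (real grids are larger)
cam = torch.randn(1, 6 * 100, 256)   # toy features from 6 cameras
out = DualViewAttention()(bev, cam)  # (1, 2500, 256)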
Self-appearance-aided Differential Evolution for Motion Transfer
Image animation transfers the motion of a driving video to a static object in
a source image, while keeping the source identity unchanged. Great progress has
been made in unsupervised motion transfer recently, where no labelled data or
ground truth domain priors are needed. However, current unsupervised approaches
still struggle when there are large motion or viewpoint discrepancies between
the source and driving images. In this paper, we introduce three measures that
we found to be effective for overcoming such large viewpoint changes. Firstly,
to achieve more fine-grained motion deformation fields, we propose to apply
Neural-ODEs for parametrizing the evolution dynamics of the motion transfer
from source to driving. Secondly, to handle occlusions caused by large
viewpoint and motion changes, we take advantage of the appearance flow obtained
from the source image itself ("self-appearance"), which essentially "borrows"
similar structures from other regions of an image to inpaint missing regions.
Finally, our framework is also able to leverage the information from additional
reference views which help to drive the source identity in spite of varying
motion state. Extensive experiments demonstrate that our approach outperforms
the state of the art by a significant margin (~40%), across six benchmarks
varying from human faces, human bodies to robots and cartoon characters. Model
generality analysis indicates that our approach generalises the best across
different object categories as well.Comment: 10 pages, 6 figure
A Unified Model for Tracking and Image-Video Detection Has More Power
Object detection (OD) has been one of the most fundamental tasks in
computer vision. Recent developments in deep learning have pushed the
performance of image OD to new heights by learning-based, data-driven
approaches. On the other hand, video OD remains less explored, mostly due to
much more expensive data annotation needs. At the same time, multi-object
tracking (MOT) which requires reasoning about track identities and
spatio-temporal trajectories, shares similar spirits with video OD. However,
most MOT datasets are class-specific (e.g., person-annotated only), which
constrains a model's flexibility to perform tracking on other objects. We
propose TrIVD (Tracking and Image-Video Detection), the first framework that
unifies image OD, video OD, and MOT within one end-to-end model. To handle the
discrepancies and semantic overlaps across datasets, TrIVD formulates
detection/tracking as grounding and reasons about object categories via
visual-text alignments. The unified formulation enables cross-dataset,
multi-task training, and thus equips TrIVD with the ability to leverage
frame-level features, video-level spatio-temporal relations, as well as track
identity associations. With such joint training, we can now extend the
knowledge from OD data, which comes with much richer object category
annotations, to MOT and achieve zero-shot tracking capability. Experiments
demonstrate that TrIVD achieves state-of-the-art performances across all
image/video OD and MOT tasks.
Comment: 13 pages, 4 figures
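The 'detection as grounding' formulation can be sketched as scoring region features against text embeddings of the category names rather than through a fixed classification head; this is what lets datasets with different label spaces be trained jointly and allows zero-shot transfer to categories never annotated in MOT data. An illustrative sketch with hypothetical dimensions, not TrIVD's actual head:

import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    # Classify regions by visual-text alignment: each region feature is
    # scored against an embedding of every category name, so adding a
    # dataset (or an unseen class) just adds rows of text embeddings.
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, region_feats, text_embeds):
        # region_feats: (B, N_regions, C); text_embeds: (K_categories, C),
        # e.g. from a frozen text encoder applied to the class names.
        return self.proj(region_feats) @ text_embeds.t()  # (B, N, K) scores

regions = torch.randn(2, 100, 256)         # per-frame region features
texts = torch.randn(8, 256)                # embeddings of 8 class names
scores = GroundingHead()(regions, texts)   # (2, 100, 8)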