A MAP-Estimation Framework for Blind Deblurring Using High-Level Edge Priors
In this paper we propose a general MAP-estimation framework for blind image deconvolution that allows the incorporation of powerful priors for predicting the edges of the latent image, which is known to be a crucial factor for the success of blind deblurring. This is achieved in a principled, robust and unified manner through the use of a global energy function that can take multiple constraints into account. Based on this framework, we show how to successfully exploit a particular prior of this type that is quite strong and applicable to a wide variety of cases. It relates to the strong structural regularity exhibited by many scenes, which affects the location and distribution of the corresponding image edges. We validate the excellent performance of our approach through an extensive set of experimental results and comparisons to the state of the art.
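Purely as an illustration of the kind of objective described here (a generic MAP blind-deconvolution energy, not the authors' exact formulation; the weights \lambda and the penalties \phi, \psi, \rho are hypothetical):

(\hat{x}, \hat{k}) = \arg\min_{x,k} E(x,k), \qquad
E(x,k) = \| k \ast x - y \|_2^2 + \lambda_x \,\phi(\nabla x) + \lambda_k \,\psi(k) + \lambda_e \,\rho(\nabla x, e)

where y is the blurred observation, \ast denotes convolution, and \rho penalizes latent-image edges that disagree with the high-level edge prior e (here, the structural regularity of the scene).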
Visual to Sound: Generating Natural Sound for Videos in the Wild
As two of the five traditional human senses (sight, hearing, taste, smell,
and touch), vision and sound are basic sources through which humans understand
the world. Often correlated during natural events, these two modalities combine
to jointly affect human perception. In this paper, we pose the task of
generating sound given visual input. Such capabilities could help enable
applications in virtual reality (generating sound for virtual scenes
automatically) or provide additional accessibility to images or videos for
people with visual impairments. As a first step in this direction, we apply
learning-based methods to generate raw waveform samples given input video
frames. We evaluate our models on a dataset of videos containing a variety of
sounds (such as ambient sounds and sounds from people/animals). Our experiments
show that the generated sounds are fairly realistic and have good temporal
synchronization with the visual inputs.
Comment: Project page:
http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.htm
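As a rough sketch of the kind of model described above (the paper's own architectures differ; all layer sizes and names here are hypothetical), the following PyTorch module conditions an autoregressive waveform decoder on per-frame visual features. In practice the waveform is sampled at a far higher rate than the video, so frame features would be repeated or upsampled to the audio rate:

import torch
import torch.nn as nn

class Frame2Wave(nn.Module):
    # Minimal sketch: encode video frames, then predict raw waveform
    # samples autoregressively, conditioned on the visual features.
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        # Per-frame visual encoder (a real system would use a pretrained CNN).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Autoregressive decoder: previous sample + frame feature -> next sample.
        self.rnn = nn.GRU(feat_dim + 1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, frames, wave_prev):
        # frames: (B, T, 3, H, W); wave_prev: (B, T, 1) previous samples.
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        h, _ = self.rnn(torch.cat([feats, wave_prev], dim=-1))
        return self.out(h)  # (B, T, 1) predicted next samples

model = Frame2Wave()
pred = model(torch.randn(2, 16, 3, 64, 64), torch.randn(2, 16, 1))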
Head-disk system characterization with head itself as transducer
Master's thesis (Master of Engineering)
Learning Beyond-pixel Mappings from Internet Videos
Recently in the computer vision community, there have been significant advancements in algorithms that recognize or localize visual content in both images and videos, for instance object recognition and detection. These algorithms infer information that is directly visible within the images or video frames (predicting what's in the frame). Human-level visual understanding goes much further, because humans also have insights about information 'beyond the frame'. In other words, people can reasonably infer information that is not visible in the current scene, such as possible future events. We expect computational models to possess the same capabilities one day. Learning beyond-pixel mappings can be a broad concept. In this dissertation, we carefully define and formulate the problems as specific, subdivided tasks from different aspects. In this context, a beyond-pixel mapping infers information of broader spatial or temporal context, or even information from other modalities such as text or sound. We first present a computational framework to learn the mappings between short event video clips and their intrinsic temporal sequence (which one usually happens first). We then explore the follow-up direction of directly predicting the future: specifically, we utilize generative models to predict depictions of objects in their future state. Next, we explore a related generation task, generating video frames of a target person with unseen poses guided by a random person. Finally, we propose a framework to learn the mappings between input video frames and their counterpart in the sound domain. The main contribution of this dissertation lies in exploring beyond-pixel mappings from various directions to add relevant knowledge to the next-generation AI platforms.
Doctor of Philosophy
Down Selection of Polymerized Bovine Hemoglobins for Use as Oxygen Releasing Therapeutics in a Guinea Pig Model
Editor's Highlight: The development of hemoglobin-based oxygen carriers (HBOCs) as a replacement for whole-blood transfusions has been impeded by their systemic toxicity. This paper presents data from a series of HBOCs, demonstrating one candidate that meets predetermined safety criteria. This approach may allow the development of an acceptable blood substitute for human use.
VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection
In recent years, transformer-based detectors have demonstrated remarkable
performance in 2D visual perception tasks. However, their performance in
multi-view 3D object detection remains inferior to the state-of-the-art (SOTA)
of convolutional neural network based detectors. In this work, we investigate
this issue from the perspective of bird's-eye-view (BEV) feature generation.
Specifically, we examine the BEV feature generation method employed by the
transformer-based SOTA, BEVFormer, and identify its two limitations: (i) it
only generates attention weights from BEV, which precludes the use of lidar
points for supervision, and (ii) it aggregates camera view features to the BEV
through deformable sampling, which only selects a small subset of features and
fails to exploit all information. To overcome these limitations, we propose a
novel BEV feature generation method, dual-view attention, which generates
attention weights from both the BEV and camera view. This method encodes all
camera features into the BEV feature. By combining dual-view attention with the
BEVFormer architecture, we build a new detector named VoxelFormer. Extensive
experiments are conducted on the nuScenes benchmark to verify the superiority
of dual-view attention and VoxelFormer. We observe that even when adopting only
3 encoders and 1 historical frame during training, VoxelFormer still outperforms
BEVFormer significantly. When trained in the same setting, VoxelFormer
surpasses BEVFormer by 4.9 NDS points. Code is available at:
https://github.com/Lizhuoling/VoxelFormer-public.git
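The contrast with deformable sampling can be sketched as dense cross-attention: the weights depend on both the BEV queries and the camera-view keys, and every camera feature contributes to the BEV feature. This is an illustrative approximation with hypothetical dimensions, not the authors' implementation:

import torch
import torch.nn as nn

class DualViewAttention(nn.Module):
    # Attention weights are computed from BOTH the BEV queries and the
    # camera-view features, and all camera features are aggregated into
    # the BEV (deformable sampling instead picks a sparse subset).
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # from BEV queries
        self.k = nn.Linear(dim, dim)  # from camera-view features
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, bev_queries, cam_feats):
        # bev_queries: (B, N_bev, C); cam_feats: (B, N_cam, C), flattened
        # over cameras and spatial positions.
        attn = torch.softmax(
            self.q(bev_queries) @ self.k(cam_feats).transpose(1, 2) * self.scale,
            dim=-1,
        )  # (B, N_bev, N_cam): weights depend on both views
        return attn @ self.v(cam_feats)  # every camera feature contributes

bev = torch.randn(1, 50 * 50, 256)   # toy BEV grid (real grids are larger)
cam = torch.randn(1, 6 * 100, 256)   # toy features from 6 cameras
out = DualViewAttention()(bev, cam)  # (1, 2500, 256)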
Self-appearance-aided Differential Evolution for Motion Transfer
Image animation transfers the motion of a driving video to a static object in
a source image, while keeping the source identity unchanged. Great progress has
been made in unsupervised motion transfer recently, where no labelled data or
ground truth domain priors are needed. However, current unsupervised approaches
still struggle when there are large motion or viewpoint discrepancies between
the source and driving images. In this paper, we introduce three measures that
we found to be effective for overcoming such large viewpoint changes. Firstly,
to achieve more fine-grained motion deformation fields, we propose to apply
Neural-ODEs for parametrizing the evolution dynamics of the motion transfer
from source to driving. Secondly, to handle occlusions caused by large
viewpoint and motion changes, we take advantage of the appearance flow obtained
from the source image itself ("self-appearance"), which essentially "borrows"
similar structures from other regions of an image to inpaint missing regions.
Finally, our framework is also able to leverage the information from additional
reference views which help to drive the source identity in spite of varying
motion state. Extensive experiments demonstrate that our approach outperforms
the state of the art by a significant margin (~40%), across six benchmarks
varying from human faces, human bodies to robots and cartoon characters. Model
generality analysis indicates that our approach generalises the best across
different object categories as well.Comment: 10 pages, 6 figure
A Unified Model for Tracking and Image-Video Detection Has More Power
Object detection (OD) has been one of the most fundamental tasks in
computer vision. Recent developments in deep learning have pushed the
performance of image OD to new heights by learning-based, data-driven
approaches. On the other hand, video OD remains less explored, mostly due to
much more expensive data annotation needs. At the same time, multi-object
tracking (MOT) which requires reasoning about track identities and
spatio-temporal trajectories, shares similar spirits with video OD. However,
most MOT datasets are class-specific (e.g., person-annotated only), which
constrains a model's flexibility to perform tracking on other objects. We
propose TrIVD (Tracking and Image-Video Detection), the first framework that
unifies image OD, video OD, and MOT within one end-to-end model. To handle the
discrepancies and semantic overlaps across datasets, TrIVD formulates
detection/tracking as grounding and reasons about object categories via
visual-text alignments. The unified formulation enables cross-dataset,
multi-task training, and thus equips TrIVD with the ability to leverage
frame-level features, video-level spatio-temporal relations, as well as track
identity associations. With such joint training, we can now extend the
knowledge from OD data, which comes with much richer object category
annotations, to MOT and achieve zero-shot tracking capability. Experiments
demonstrate that TrIVD achieves state-of-the-art performances across all
image/video OD and MOT tasks.
Comment: 13 pages, 4 figures
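The 'detection as grounding' formulation can be sketched as scoring region features against text embeddings of the category names rather than through a fixed classification head; this is what lets datasets with different label spaces be trained jointly and allows zero-shot transfer to categories never annotated in MOT data. An illustrative sketch with hypothetical dimensions, not TrIVD's actual head:

import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    # Classify regions by visual-text alignment: each region feature is
    # scored against an embedding of every category name, so adding a
    # dataset (or an unseen class) just adds rows of text embeddings.
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, region_feats, text_embeds):
        # region_feats: (B, N_regions, C); text_embeds: (K_categories, C),
        # e.g. from a frozen text encoder applied to the class names.
        return self.proj(region_feats) @ text_embeds.t()  # (B, N, K) scores

regions = torch.randn(2, 100, 256)         # per-frame region features
texts = torch.randn(8, 256)                # embeddings of 8 class names
scores = GroundingHead()(regions, texts)   # (2, 100, 8)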