60 research outputs found
H2-Stereo: High-Speed, High-Resolution Stereoscopic Video System
High-speed, high-resolution stereoscopic (H2-Stereo) video allows us to
perceive dynamic 3D content at fine granularity. The acquisition of H2-Stereo
video, however, remains challenging with commodity cameras. Existing spatial
super-resolution or temporal frame interpolation methods provide compromised
solutions that lack temporal or spatial details, respectively. To alleviate
this problem, we propose a dual camera system, in which one camera captures
high-spatial-resolution low-frame-rate (HSR-LFR) videos with rich spatial
details, and the other captures low-spatial-resolution high-frame-rate
(LSR-HFR) videos with smooth temporal details. We then devise a Learned
Information Fusion network (LIFnet) that exploits the cross-camera redundancies
to enhance both camera views to high spatiotemporal resolution (HSTR) for
reconstructing the H2-Stereo video effectively. We utilize a disparity network
to transfer spatiotemporal information across views even in large-disparity
scenes, based on which we propose disparity-guided flow-based warping for the
LSR-HFR view and complementary warping for the HSR-LFR view. A multi-scale
fusion method in the feature domain is proposed to minimize occlusion-induced
warping ghosts and holes in the HSR-LFR view. The LIFnet is trained in an end-to-end manner
using our collected high-quality Stereo Video dataset from YouTube. Extensive
experiments demonstrate that our model outperforms existing state-of-the-art
methods for both views on synthetic data and camera-captured real data with
large disparity. Ablation studies explore various aspects of our system,
including spatiotemporal resolution, camera baseline, camera desynchronization,
long/short exposures, and applications, to fully understand its capabilities.
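Flow-based backward warping of the kind used to align the two camera views can be sketched as follows (a minimal NumPy illustration with a hypothetical `backward_warp` helper and nearest-neighbour sampling; LIFnet's actual warping is learned, bilinear, and disparity-guided):

```python
import numpy as np

def backward_warp(image, flow):
    """Warp `image` toward a target view using a per-pixel flow field
    (nearest-neighbour sampling for brevity)."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # For each target pixel, fetch the source pixel displaced by the flow.
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]

# Zero flow reproduces the input exactly.
img = np.arange(16.0).reshape(4, 4)
assert np.allclose(backward_warp(img, np.zeros((4, 4, 2))), img)
```

Occlusions show up here as source pixels fetched by multiple (or no) targets, which is exactly the ghosting and hole problem the multi-scale feature-domain fusion is designed to suppress.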
DiffDreamer: Consistent Single-view Perpetual View Generation with Conditional Diffusion Models
Perpetual view generation -- the task of generating long-range novel views by
flying into a given image -- is a novel and promising task. We introduce
DiffDreamer, an unsupervised framework capable of synthesizing novel views
depicting a long camera trajectory while training solely on internet-collected
images of nature scenes. We demonstrate that image-conditioned diffusion models
can effectively perform long-range scene extrapolation while preserving both
local and global consistency significantly better than prior GAN-based methods.
Project page: https://primecai.github.io/diffdreamer
NIO: Lightweight neural operator-based architecture for video frame interpolation
We present NIO (Neural Interpolation Operator), a lightweight, efficient
neural operator-based architecture for video frame interpolation.
Current deep learning based methods rely on local convolutions for feature
learning and require a large amount of training on comprehensive datasets.
Furthermore, transformer-based architectures are large and need dedicated GPUs
for training. In contrast, NIO, our neural operator-based approach, learns
features in the frames by transforming the image matrix into Fourier space
using the Fast Fourier Transform (FFT). The model performs global
convolution, making it discretization-invariant. We show that NIO produces
visually smooth and accurate results and converges in fewer epochs than
state-of-the-art approaches. To evaluate the visual quality of our interpolated
frames, we calculate the structural similarity index (SSIM) and Peak Signal to
Noise Ratio (PSNR) between the generated frame and the ground truth frame. We
report the quantitative performance of our model on the Vimeo-90K, DAVIS,
UCF101, and DISFA+ datasets.
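The Fourier-space global convolution that NIO relies on can be illustrated with a minimal sketch (the helper name and shapes here are our own; the real model learns the spectral weights per channel):

```python
import numpy as np

def spectral_global_conv(frame, weights):
    """Global convolution via pointwise multiplication in Fourier space:
    equivalent to convolving with a kernel as large as the whole frame."""
    spec = np.fft.rfft2(frame)                           # FFT of a real image
    return np.fft.irfft2(spec * weights, s=frame.shape)  # filter, back to pixels

# An all-ones spectral filter is the identity operation.
frame = np.random.rand(8, 8)
identity = np.ones((8, 5), dtype=complex)  # rfft2 of an 8x8 image is 8x5
assert np.allclose(spectral_global_conv(frame, identity), frame)
```

Because the filter acts on the continuous spectrum rather than on a fixed pixel neighbourhood, the same learned weights apply at any sampling of the frame, which is what makes the operator discretization-invariant.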
Survey of image-based representations and compression techniques
In this paper, we survey the techniques for image-based rendering (IBR) and for compressing image-based representations. Unlike traditional three-dimensional (3-D) computer graphics, in which the 3-D geometry of the scene is known, IBR techniques render novel views directly from input images. IBR techniques can be classified into three categories according to how much geometric information is used: rendering without geometry, rendering with implicit geometry (i.e., correspondence), and rendering with explicit geometry (either approximate or accurate). We discuss the characteristics of these categories and their representative techniques. IBR techniques demonstrate a surprisingly diverse range in their extent of use of images and geometry in representing 3-D scenes. We explore the issues in trading off the use of images and geometry by revisiting plenoptic-sampling analysis and the notions of view dependency and geometric proxies. Finally, we highlight compression techniques specifically designed for image-based representations. Such compression techniques are important in making IBR techniques practical.
Development of optical methods for real-time whole-brain functional imaging of zebrafish neuronal activity
Each of us has, at least once in life, smelled the scent of roses, read a canto of Dante's Commedia, or listened to the sound of the sea in a shell. All of this is possible thanks to the astonishing capabilities of the brain, an organ that allows us to collect and organize perceptions coming from the sensory organs and to produce behavioural responses accordingly. Studying an operating brain in a non-invasive way is extremely difficult in mammals, and particularly in humans. In the last decade, a small teleost fish, the zebrafish (Danio rerio), has been making its way into the field of neuroscience. The brain of a larval zebrafish is made up of 'only' 100,000 neurons and is completely transparent, making it possible to access it optically. Here, taking advantage of the best currently available technology, we devised optical solutions to investigate the dynamics of neuronal activity throughout the entire brain of zebrafish larvae.
Compression and Subjective Quality Assessment of 3D Video
In recent years, three-dimensional television (3D TV) has been broadly considered the successor to existing traditional two-dimensional television (2D TV) sets. With its capability of offering a dynamic and immersive experience, 3D video (3DV) is expected to supersede conventional video in several applications in the near future. However, 3D content requires more than a single view to deliver the depth sensation to viewers, and this inevitably increases the bitrate compared to the corresponding 2D content. This need drives the research trend in the video compression field towards more advanced and more efficient algorithms.
Currently, Advanced Video Coding (H.264/AVC) is the state-of-the-art video coding standard, developed by the Joint Video Team of ISO/IEC MPEG and ITU-T VCEG. This codec has been widely adopted in various applications and products such as TV broadcasting, video conferencing, mobile TV, and Blu-ray Disc. One important extension of H.264/AVC, namely Multiview Video Coding (MVC), addressed multiview compression by taking into consideration the inter-view dependency between different views of the same scene. H.264/AVC with its MVC extension (H.264/MVC) can be used for encoding either conventional stereoscopic video, including only two views, or multiview video, including more than two views.
In spite of the high performance of H.264/MVC, a typical multiview video sequence requires a huge amount of storage space, which is proportional to the number of offered views. The available views are still limited, and research has been devoted to synthesizing an arbitrary number of views from multiview video and depth maps (MVD). This process is mandatory for auto-stereoscopic displays (ASDs), where many views are required at the viewer side and there is no way to transmit such a relatively huge number of views with currently available broadcasting technology. Therefore, to satisfy the growing demand for 3D-related applications, it is mandatory to further decrease the bitrate by introducing new and more efficient algorithms for compressing multiview video and depth maps.
This thesis tackles 3D content compression targeting different formats, i.e., stereoscopic video and depth-enhanced multiview video. The stereoscopic video compression algorithms introduced in this thesis mostly focus on proposing different types of asymmetry between the left and right views. This means reducing the quality of one view relative to the other, aiming to achieve better subjective quality than the symmetric case (the reference) under the same bitrate constraint. The proposed algorithms for optimizing depth-enhanced multiview video compression include both texture compression schemes and depth map coding tools. Some of the coding schemes proposed for this format include asymmetric quality between the views.
Since objective metrics are not able to accurately estimate the subjective quality of stereoscopic content, it is suggested to perform subjective quality assessment to evaluate the different codecs. Moreover, when the concept of asymmetry is introduced, the Human Visual System (HVS) performs a fusion process which is not completely understood. Therefore, another important aspect of this thesis is conducting several subjective tests and reporting the subjective ratings to evaluate the perceived quality of the proposed coded content against the references. Statistical analysis is carried out in the thesis to assess the validity of the subjective ratings and determine the best-performing test cases.
Implicit Object Pose Estimation on RGB Images Using Deep Learning Methods
With the rise of robotic and camera systems and the success of deep learning in computer vision,
there is growing interest in precisely determining object positions and orientations. This is crucial for
tasks like automated bin picking, where a camera sensor analyzes images or point clouds to guide a
robotic arm in grasping objects. Pose recognition has broader applications, such as predicting a
car's trajectory in autonomous driving or adapting objects in virtual reality based on the viewer's
perspective.
This dissertation focuses on RGB-based pose estimation methods that use depth information only
for refinement, which is a challenging problem. Recent advances in deep learning have made it
possible to predict object poses in RGB images, despite challenges like object overlap, object
symmetries and more.
We introduce two implicit deep learning-based pose estimation methods for RGB images, covering
the entire process from data generation to pose selection. Furthermore, theoretical findings on
Fourier embeddings are shown to improve the performance of so-called implicit neural
representations, which are then successfully utilized for the task of implicit pose estimation.
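Fourier embeddings of the kind referenced here map raw coordinates to sinusoids of increasing frequency before they enter the network, letting an MLP fit high-frequency detail; a minimal sketch (a hypothetical helper, not the dissertation's exact formulation):

```python
import numpy as np

def fourier_embed(coords, num_freqs=4):
    """Map each coordinate x to [sin(2^k * pi * x), cos(2^k * pi * x)]
    for k = 0..num_freqs-1."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi  # pi, 2pi, 4pi, 8pi, ...
    ang = coords[..., None] * freqs                # broadcast over frequencies
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

# Each scalar coordinate becomes 2 * num_freqs features.
x = np.linspace(0.0, 1.0, 5)
assert fourier_embed(x, num_freqs=4).shape == (5, 8)
```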
High Performance Multiview Video Coding
Following the standardization of the latest video coding standard, High Efficiency Video Coding (HEVC), in 2013, its multiview extension (MV-HEVC) was published in 2014, bringing significantly better compression performance of around 50% for multiview and 3D videos compared to multiple independent single-view HEVC encodings. However, the extremely high computational complexity of MV-HEVC demands significant optimization of the encoder. To tackle this problem, this work investigates the possibilities of using modern parallel computing platforms and tools such as single-instruction-multiple-data (SIMD) instructions, multi-core CPUs, massively parallel GPUs, and computer clusters to significantly enhance the performance of the multiview video coding (MVC) encoder. These computing tools have very different characteristics, and misusing them may result in poor performance improvement and sometimes even reduction. To achieve the best possible encoding performance from modern computing tools, different levels of parallelism inside a typical MVC encoder are identified and analyzed. Novel optimization techniques at various levels of abstraction are proposed: non-aggregation massively parallel motion estimation (ME) and disparity estimation (DE) at the prediction-unit (PU) level, fractional and bi-directional ME/DE acceleration through SIMD, quantization-parameter (QP)-based early termination for coding tree units (CTUs), optimized resource-scheduled wave-front parallel processing for CTUs, and workload-balanced, cluster-based multi-view parallel encoding. The results show that the proposed parallel optimization techniques, with insignificant loss of coding efficiency, significantly improve execution-time performance. This, in turn, proves that modern parallel computing platforms, with appropriate platform-specific algorithm design, are valuable tools for improving the performance of computationally intensive applications.
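The wave-front parallel processing pattern mentioned above schedules CTUs along anti-diagonals so that each CTU's left and top-right neighbours are already coded; a minimal sketch of that ordering (our own helper, not the optimized resource-scheduled version from the thesis):

```python
from collections import defaultdict

def wavefront_waves(rows, cols):
    """Group CTU coordinates into waves: a CTU (r, c) depends on (r, c-1)
    and (r-1, c+1), so all CTUs sharing d = 2*r + c are mutually
    independent and can be encoded in parallel."""
    waves = defaultdict(list)
    for r in range(rows):
        for c in range(cols):
            waves[2 * r + c].append((r, c))
    return [waves[d] for d in sorted(waves)]

# In a 2x3 CTU grid, (0, 2) and (1, 0) land in the same wave.
assert wavefront_waves(2, 3)[2] == [(0, 2), (1, 0)]
```

In practice each wave is dispatched to a pool of worker threads, and throughput is limited by the ramp-up and ramp-down of wave sizes, which is one reason resource scheduling matters.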