2,565 research outputs found
Volumetric kombat:a case study on developing a VR game with Volumetric Video
This paper presents a case study on the development of a Virtual Reality (VR) game using Volumetric Video (VV) for character animation. We delve into the potential of VV, a technology that fuses video and depth sensor data, which has progressively matured since its initial introduction in 1995. Despite its potential to deliver unmatched realism and dynamic 4D sequences, VV applications are predominantly used in non-interactive scenarios. We explore the barriers to entry such as high costs associated with large-scale VV capture systems and the lack of tools optimized for VV in modern game engines. By actively using VV to develop a VR game, we examine and overcome these constraints developing a set of tools that address these challenges. Drawing lessons from past games, we propose an open-source data processing workflow for future VV games. This case study provides insights into the opportunities and challenges of VV in game development and contributes towards making VV more accessible for creators and researchers
Bridging formal methods and machine learning with model checking and global optimisation
Formal methods and machine learning are two research fields with drastically different foundations and philosophies. Formal methods utilise mathematically rigorous techniques for software and hardware systems' specification, development and verification. Machine learning focuses on pragmatic approaches to gradually improve a parameterised model by observing a training data set. While historically, the two fields lack communication, this trend has changed in the past few years with an outburst of research interest in the robustness verification of neural networks. This paper will briefly review these works, and focus on the urgent need for broader and more in-depth communication between the two fields, with the ultimate goal of developing learning-enabled systems with excellent performance and acceptable safety and security. We present a specification language, MLS2, and show that it can express a set of known safety and security properties, including generalisation, uncertainty, robustness, data poisoning, backdoor, model stealing, membership inference, model inversion, interpretability, and fairness. To verify MLS2 properties, we promote the global optimisation-based methods, which have provable guarantees on the convergence to the optimal solution. Many of them have theoretical bounds on the gap between current solutions and the optimal solution
A deep learning-enhanced digital twin framework for improving safety and reliability in human-robot collaborative manufacturing
In Industry 5.0, Digital Twins bring in flexibility and efficiency for smart manufacturing. Recently, the success of artificial intelligence techniques such as deep learning has led to their adoption in manufacturing and especially in human–robot collaboration. Collaborative manufacturing tasks involving human operators and robots pose significant safety and reliability concerns. In response to these concerns, a deep learning-enhanced Digital Twin framework is introduced through which human operators and robots can be detected and their actions can be classified during the manufacturing process, enabling autonomous decision making by the robot control system. Developed using Unreal Engine 4, our Digital Twin framework complies with the Robotics Operating System specification, and supports synchronous control and communication between the Digital Twin and the physical system. In our framework, a fully-supervised detector based on a faster region-based convolutional neural network is firstly trained on synthetic data generated by the Digital Twin, and then tested on the physical system to demonstrate the effectiveness of the proposed Digital Twin-based framework. To ensure safety and reliability, a semi-supervised detector is further designed to bridge the gap between the twin system and the physical system, and improved performance is achieved by the semi-supervised detector compared to the fully-supervised detector that is simply trained on either synthetic data or real data. The evaluation of the framework in multiple scenarios in which human operators collaborate with a Universal Robot 10 shows that it can accurately detect the human and robot, and classify their actions under a variety of conditions. The data from this evaluation have been made publicly available, and can be widely used for research and operational purposes. Additionally, a semi-automated annotation tool from the Digital Twin framework is published to benefit the collaborative robotics community
Development and user evaluation of an immersive light field system for space exploration
This paper presents the developmental work and user evaluation results of an immersive light field system built for the European Space Agency’s (ESA) project called “Light field-enhanced immersive teleoperation system for space station and ground control.” The main aim of the project is to evaluate the usefulness and feasibility of light fields in space exploration, and compare it to other types of immersive content, such as 360° photos and point clouds. In the course of the project, light field data were captured with a robotically controlled camera and processed into a suitable format. The light field authoring process was performed, and a light field renderer capable of displaying immersive panoramic or planar light fields on modern virtual reality hardware was developed. The planetary surface points of interest (POIs) were modeled in the laboratory environment, and three distinct test use cases utilizing them were developed. The user evaluation was held in the European Astronaut Centre (EAC) in the summer of 2023, involving prospective end-users of various backgrounds. During the evaluation, questionnaires, interviews, and observation were used for data collection. At the end of the paper, the evaluation results, as well as a discussion about lessons learned and possible improvements to the light field system, are presented
A Deep Learning-enhanced Digital Twin Framework for Improving Safety and Reliability in Human-Robot Collaborative Manufacturing
In Industry 5.0, Digital Twins bring in flexibility and efficiency for smart manufacturing. Recently, the success of artificial intelligence techniques such as deep learning has led to their adoption in manufacturing and especially in human–robot collaboration. Collaborative manufacturing tasks involving human operators and robots pose significant safety and reliability concerns. In response to these concerns, a deep learning-enhanced Digital Twin framework is introduced through which human operators and robots can be detected and their actions can be classified during the manufacturing process, enabling autonomous decision making by the robot control system. Developed using Unreal Engine 4, our Digital Twin framework complies with the Robotics Operating System specification, and supports synchronous control and communication between the Digital Twin and the physical system. In our framework, a fully-supervised detector based on a faster region-based convolutional neural network is firstly trained on synthetic data generated by the Digital Twin, and then tested on the physical system to demonstrate the effectiveness of the proposed Digital Twin-based framework. To ensure safety and reliability, a semi-supervised detector is further designed to bridge the gap between the twin system and the physical system, and improved performance is achieved by the semi-supervised detector compared to the fully-supervised detector that is simply trained on either synthetic data or real data. The evaluation of the framework in multiple scenarios in which human operators collaborate with a Universal Robot 10 shows that it can accurately detect the human and robot, and classify their actions under a variety of conditions. The data from this evaluation have been made publicly available, and can be widely used for research and operational purposes. Additionally, a semi-automated annotation tool from the Digital Twin framework is published to benefit the collaborative robotics community
Fast Learning Radiance Fields by Shooting Much Fewer Rays
Learning radiance fields has shown remarkable results for novel view
synthesis. The learning procedure usually costs lots of time, which motivates
the latest methods to speed up the learning procedure by learning without
neural networks or using more efficient data structures. However, these
specially designed approaches do not work for most of radiance fields based
methods. To resolve this issue, we introduce a general strategy to speed up the
learning procedure for almost all radiance fields based methods. Our key idea
is to reduce the redundancy by shooting much fewer rays in the multi-view
volume rendering procedure which is the base for almost all radiance fields
based methods. We find that shooting rays at pixels with dramatic color change
not only significantly reduces the training burden but also barely affects the
accuracy of the learned radiance fields. In addition, we also adaptively
subdivide each view into a quadtree according to the average rendering error in
each node in the tree, which makes us dynamically shoot more rays in more
complex regions with larger rendering error. We evaluate our method with
different radiance fields based methods under the widely used benchmarks.
Experimental results show that our method achieves comparable accuracy to the
state-of-the-art with much faster training.Comment: Accepted by lEEE Transactions on lmage Processing 2023. Project Page:
https://zparquet.github.io/Fast-Learning . Code:
https://github.com/zParquet/Fast-Learnin
DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models
The recent progress in diffusion-based text-to-image generation models has
significantly expanded generative capabilities via conditioning the text
descriptions. However, since relying solely on text prompts is still
restrictive for fine-grained customization, we aim to extend the boundaries of
conditional generation to incorporate diverse types of modalities, e.g.,
sketch, box, and style embedding, simultaneously. We thus design a multimodal
text-to-image diffusion model, coined as DiffBlender, that achieves the
aforementioned goal in a single model by training only a few small
hypernetworks. DiffBlender facilitates a convenient scaling of input
modalities, without altering the parameters of an existing large-scale
generative model to retain its well-established knowledge. Furthermore, our
study sets new standards for multimodal generation by conducting quantitative
and qualitative comparisons with existing approaches. By diversifying the
channels of conditioning modalities, DiffBlender faithfully reflects the
provided information or, in its absence, creates imaginative generation.Comment: 18 pages, 16 figures, and 3 table
Scaling up GANs for Text-to-Image Synthesis
The recent success of text-to-image synthesis has taken the world by storm
and captured the general public's imagination. From a technical standpoint, it
also marked a drastic change in the favored architecture to design generative
image models. GANs used to be the de facto choice, with techniques like
StyleGAN. With DALL-E 2, auto-regressive and diffusion models became the new
standard for large-scale generative models overnight. This rapid shift raises a
fundamental question: can we scale up GANs to benefit from large datasets like
LAION? We find that na\"Ively increasing the capacity of the StyleGAN
architecture quickly becomes unstable. We introduce GigaGAN, a new GAN
architecture that far exceeds this limit, demonstrating GANs as a viable option
for text-to-image synthesis. GigaGAN offers three major advantages. First, it
is orders of magnitude faster at inference time, taking only 0.13 seconds to
synthesize a 512px image. Second, it can synthesize high-resolution images, for
example, 16-megapixel pixels in 3.66 seconds. Finally, GigaGAN supports various
latent space editing applications such as latent interpolation, style mixing,
and vector arithmetic operations.Comment: CVPR 2023. Project webpage at https://mingukkang.github.io/GigaGAN
Scalable Exploration of Complex Objects and Environments Beyond Plain Visual Replication​
Digital multimedia content and presentation means are rapidly increasing their sophistication and are now capable of describing detailed representations of the physical world. 3D exploration experiences allow people to appreciate, understand and interact with intrinsically virtual objects.
Communicating information on objects requires the ability to explore them under different angles, as well as to mix highly photorealistic or illustrative presentations of the object themselves with additional data that provides additional insights on these objects, typically represented in the form of annotations. Effectively providing these capabilities requires the solution of important problems in visualization and user interaction.
In this thesis, I studied these problems in the cultural heritage-computing-domain, focusing on the very common and important special case of mostly planar, but visually, geometrically, and semantically rich objects. These could be generally roughly flat objects with a standard frontal viewing direction (e.g., paintings, inscriptions, bas-reliefs), as well as visualizations of fully 3D objects from a particular point of views (e.g., canonical views of buildings or statues). Selecting a precise application domain and a specific presentation mode allowed me to concentrate on the well defined use-case of the exploration of annotated relightable stratigraphic models (in particular, for local and remote museum presentation).
My main results and contributions to the state of the art have been a novel technique for interactively controlling visualization lenses while automatically maintaining good focus-and-context parameters, a novel approach for avoiding clutter in an annotated model and for guiding users towards interesting areas, and a method for structuring audio-visual object annotations into a graph and for using that graph to improve guidance and support storytelling and automated tours.
We demonstrated the effectiveness and potential of our techniques by performing interactive exploration sessions on various screen sizes and types ranging from desktop devices to large-screen displays for a walk-up-and-use museum installation.
KEYWORDS - Computer Graphics, Human-Computer Interaction, Interactive Lenses, Focus-and-Context, Annotated Models, Cultural Heritage Computing
3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation
Text-guided 3D object generation aims to generate 3D objects described by
user-defined captions, which paves a flexible way to visualize what we
imagined. Although some works have been devoted to solving this challenging
task, these works either utilize some explicit 3D representations (e.g., mesh),
which lack texture and require post-processing for rendering photo-realistic
views; or require individual time-consuming optimization for every single case.
Here, we make the first attempt to achieve generic text-guided cross-category
3D object generation via a new 3D-TOGO model, which integrates a text-to-views
generation module and a views-to-3D generation module. The text-to-views
generation module is designed to generate different views of the target 3D
object given an input caption. prior-guidance, caption-guidance and view
contrastive learning are proposed for achieving better view-consistency and
caption similarity. Meanwhile, a pixelNeRF model is adopted for the views-to-3D
generation module to obtain the implicit 3D neural representation from the
previously-generated views. Our 3D-TOGO model generates 3D objects in the form
of the neural radiance field with good texture and requires no time-cost
optimization for every single caption. Besides, 3D-TOGO can control the
category, color and shape of generated 3D objects with the input caption.
Extensive experiments on the largest 3D object dataset (i.e., ABO) are
conducted to verify that 3D-TOGO can better generate high-quality 3D objects
according to the input captions across 98 different categories, in terms of
PSNR, SSIM, LPIPS and CLIP-score, compared with text-NeRF and Dreamfields
- …