CCA: Collaborative Competitive Agents for Image Editing
This paper presents a novel generative model, Collaborative Competitive
Agents (CCA), which leverages multiple Large Language Model (LLM)-based
agents to execute complex tasks. Drawing inspiration from Generative
Adversarial Networks (GANs), the CCA system employs two equal-status
generator agents and a discriminator agent. The generators independently
process user instructions and produce results, while the discriminator
evaluates the outputs and provides feedback that the generators use to
reflect on and improve their generations. Unlike previous generative
models, our system exposes the intermediate steps of generation; this
transparency allows each generator agent to learn from the other's
successful executions, enabling a collaborative competition that enhances
the quality and robustness of the results. The primary focus of this study
is image editing, where we demonstrate CCA's ability to handle intricate
instructions robustly. The paper's main contributions include the
introduction of a multi-agent-based generative model with controllable
intermediate steps and iterative optimization, a detailed examination of
agent relationships, and comprehensive experiments on image editing. Code
is available at
\href{https://github.com/TiankaiHang/CCA}{https://github.com/TiankaiHang/CCA}
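To make the described loop concrete, here is a minimal, self-contained Python sketch of the generator/discriminator interaction: two generators draft plans in parallel, a judge ranks them, and the loser sees the winner's transparent plan before the next round. The `generator`, `discriminator`, and `Verdict` names are illustrative stubs I introduce for the sketch, not the authors' actual API; the real system plans and executes image-editing tool calls with LLMs.

```python
"""Minimal sketch of a collaborative-competitive agent loop, assuming
stubbed agents. All names here are illustrative stand-ins."""
from dataclasses import dataclass

@dataclass
class Verdict:
    winner: int        # index of the better result (0 or 1)
    satisfied: bool    # does the winner fully meet the instruction?
    critique: str      # discriminator's textual feedback

def generator(agent_id: int, instruction: str, feedback: str) -> str:
    # Stand-in for an LLM-based agent that drafts an editing plan
    # (e.g. a tool-call sequence) from the instruction and feedback.
    return f"[agent {agent_id} plan for: {instruction} | {feedback}]"

def discriminator(plans: list[str], instruction: str) -> Verdict:
    # Stand-in for an LLM-based judge comparing both plans/results.
    return Verdict(winner=0, satisfied=True, critique="looks good")

def cca_edit(instruction: str, rounds: int = 3) -> str:
    feedback = ["", ""]
    plans = ["", ""]
    for _ in range(rounds):
        # Competition: both generators work independently.
        plans = [generator(i, instruction, feedback[i]) for i in range(2)]
        verdict = discriminator(plans, instruction)
        if verdict.satisfied:
            break
        # Collaboration: the loser sees the winner's intermediate steps.
        loser = 1 - verdict.winner
        feedback[loser] = f"Winner's plan: {plans[verdict.winner]}. {verdict.critique}"
        feedback[verdict.winner] = verdict.critique
    return plans[verdict.winner]

print(cca_edit("make the sky look like sunset"))
```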
Real-time smoke rendering using compensated ray marching
We present a real-time algorithm called compensated ray marching for rendering of smoke under dynamic low-frequency environment lighting. Our approach is based on a decomposition of the input smoke animation, represented as a sequence of volumetric density fields, into a set of radial basis functions (RBFs) and a sequence of residual fields. To expedite rendering, the source radiance distribution within the smoke is computed from only the low-frequency RBF approximation of the density fields, since the high-frequency residuals have little impact on global illumination under low-frequency environment lighting. Furthermore, in computing source radiances the contributions from single and multiple scattering are evaluated at only the RBF centers and then approximated at other points in the volume using an RBF-based interpolation. A slice-based integration of these source radiances along each view ray is then performed to render the final image. The high-frequency residual fields, which are a critical component in the local appearance of smoke, are compensated back into the radiance integral during this ray march to generate images of high detail. The runtime algorithm, which includes both light transfer simulation and ray marching, can be easily implemented on the GPU, and thus allows for real-time manipulation of viewpoint and lighting, as well as interactive editing of smoke attributes such as extinction cross section, scattering albedo, and phase function. Only moderate preprocessing time and storage are needed. This approach provides the first method for real-time smoke rendering that includes single and multiple scattering while generating results comparable in quality to offline algorithms like ray tracing.
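As a rough illustration of the compensation idea, the following Python sketch marches a single ray: source radiance is interpolated only from values precomputed at the RBF centers (the low-frequency part), while extinction uses the full density including the high-frequency residual. The function names, the Gaussian interpolation weights, and the simplified emission-absorption model are all assumptions made for this sketch; the paper's GPU implementation with slice-based integration and multiple scattering is considerably more involved.

```python
"""Toy compensated ray march: low-frequency radiance, full-density extinction."""
import numpy as np

def gaussian_rbf(points, centers, weights, radius):
    # Low-frequency density: a weighted sum of isotropic Gaussians.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return (weights * np.exp(-d2 / (2 * radius ** 2))).sum(-1)

def ray_march(origin, direction, density_fn, centers, radiance_at_centers,
              radius, step=0.05, n_steps=200, sigma_t=1.0):
    transmittance, color = 1.0, 0.0
    for k in range(n_steps):
        p = (origin + (k + 0.5) * step * direction)[None, :]
        rho_full = max(float(density_fn(p)), 0.0)   # includes residual detail
        # Source radiance interpolated from values precomputed at the
        # RBF centers, using normalized Gaussian weights.
        w = np.exp(-((p[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
                   / (2 * radius ** 2))
        J = float((w * radiance_at_centers).sum() / max(w.sum(), 1e-8))
        # Compensation: opacity uses the FULL density, so high-frequency
        # residuals still shape the final image even though J is smooth.
        alpha = 1.0 - np.exp(-sigma_t * rho_full * step)
        color += transmittance * alpha * J
        transmittance *= 1.0 - alpha
        if transmittance < 1e-3:   # early ray termination
            break
    return color

# Toy usage: two RBF blobs plus a sinusoidal high-frequency residual.
centers = np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 3.0]])
weights = np.array([1.0, 0.6])
radiance = np.array([0.8, 0.5])   # hypothetical precomputed source radiances
density = lambda p: gaussian_rbf(p, centers, weights, 0.5) \
                    + 0.1 * np.sin(10.0 * p[:, 2])
print(ray_march(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                density, centers, radiance, 0.5))
```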
Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding
Pretrained backbones with fine-tuning have been widely adopted in 2D vision
and natural language processing tasks and have demonstrated significant
advantages over task-specific networks. In this paper, we present a
pretrained 3D backbone, named Swin3D, which for the first time outperforms
all state-of-the-art methods on downstream 3D indoor scene understanding
tasks. Our backbone network is based on a 3D Swin transformer and is
carefully designed to efficiently conduct self-attention on sparse voxels
with linear memory complexity and to capture the irregularity of point
signals via a generalized contextual relative positional embedding. Based
on this backbone design, we pretrained a large Swin3D model on the
synthetic Structured3D dataset, which is 10 times larger than the ScanNet
dataset, and fine-tuned the pretrained model on various downstream
real-world indoor scene understanding tasks. The results demonstrate that
our model pretrained on the synthetic dataset not only exhibits good
generality in both downstream segmentation and detection on real 3D point
datasets, but also surpasses the state-of-the-art methods after
fine-tuning: +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic
segmentation, +2.1 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on
ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. Our method
demonstrates the great potential of pretrained 3D backbones with
fine-tuning for 3D understanding tasks. The code and models are
available at https://github.com/microsoft/Swin3D
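As a rough sketch of what a contextual relative positional term looks like in window self-attention over voxels, the following PyTorch module adds an attention bias that depends on the query content and the quantized voxel offset, i.e. bias_ij = q_i · r(Δij), rather than a content-independent bias table. This is a simplified illustration of the general mechanism, not the repository's generalized contextual relative signal embedding; the class and parameter names are hypothetical.

```python
"""Simplified contextual relative positional bias in window attention."""
import torch
import torch.nn as nn

class ContextualWindowAttention(nn.Module):
    def __init__(self, dim: int, window: int = 5):
        super().__init__()
        self.dim = dim
        self.window = window
        self.qkv = nn.Linear(dim, 3 * dim)
        # One learned embedding per quantized relative offset per axis;
        # offsets lie in [-window+1, window-1] along x, y, z.
        n_off = 2 * window - 1
        self.rel = nn.Embedding(n_off ** 3, dim)

    def forward(self, feats: torch.Tensor, coords: torch.Tensor):
        # feats:  (N, dim) features of the sparse voxels in one window
        # coords: (N, 3) integer voxel coordinates within the window
        q, k, v = self.qkv(feats).chunk(3, dim=-1)
        # Quantized relative offsets -> flat index into the embedding table.
        n_off = 2 * self.window - 1
        off = coords[:, None, :] - coords[None, :, :] + self.window - 1
        idx = (off[..., 0] * n_off + off[..., 1]) * n_off + off[..., 2]
        r = self.rel(idx)                              # (N, N, dim)
        # Contextual bias: query content interacts with the offset
        # embedding, instead of a content-independent bias.
        bias = torch.einsum("id,ijd->ij", q, r)
        attn = (q @ k.t() + bias) / self.dim ** 0.5
        return attn.softmax(dim=-1) @ v

# Toy usage on 8 sparse voxels inside a 5x5x5 window.
attn = ContextualWindowAttention(dim=32)
x = torch.randn(8, 32)
c = torch.randint(0, 5, (8, 3))
print(attn(x, c).shape)   # torch.Size([8, 32])
```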
- …