Diffusion-Stego: Training-free Diffusion Generative Steganography via Message Projection
Generative steganography is the process of hiding secret messages in
generated images instead of cover images. Existing studies on generative
steganography use GAN or flow models to obtain high message-hiding capacity and
anti-detection ability over cover images. However, they create relatively
unrealistic stego images because of the inherent limitations of generative
models. We propose Diffusion-Stego, a generative steganography approach based
on diffusion models which outperform other generative models in image
generation. Diffusion-Stego projects secret messages into latent noise of
diffusion models and generates stego images with an iterative denoising
process. Since naively hiding secret messages in the noise degrades visual
quality and decreases extracted message accuracy, we introduce message
projection, which hides messages in the noise space while addressing these
issues. We suggest three options for message projection to adjust the trade-off
between extracted message accuracy, anti-detection ability, and image quality.
Diffusion-Stego is a training-free approach, so we can apply it to pre-trained
diffusion models which generate high-quality images, or even large-scale
text-to-image models, such as Stable Diffusion. Diffusion-Stego achieved a high
message capacity (3.0 bpp of binary messages with 98% accuracy, and 6.0 bpp
with 90% accuracy) as well as high image quality (an FID score of 2.77 for 1.0
bpp on the FFHQ 64×64 dataset), making its stego images difficult to distinguish
from real images in the PNG format.
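As a rough illustration of the noise-level hiding idea, the sketch below maps binary message bits to the signs of Gaussian-like latent noise and reads them back by thresholding; the function names and the sign-based scheme are assumptions for illustration, not the paper's actual message projection options.

```python
# Minimal sketch: hide bits in the signs of Gaussian-like latent noise.
# In the full method, this noise would be denoised into a stego image with a
# deterministic sampler (e.g., DDIM) and re-inverted to noise before extraction.
import torch

def project_message(bits: torch.Tensor) -> torch.Tensor:
    """Map bits in {0, 1} to noise whose marginal stays close to N(0, 1)."""
    magnitudes = torch.randn(bits.shape).abs()      # half-normal magnitudes
    signs = bits.float() * 2.0 - 1.0                # 0 -> -1, 1 -> +1
    return signs * magnitudes                       # message-carrying latent noise

def extract_message(noise: torch.Tensor) -> torch.Tensor:
    """Recover the bits from the signs of the (re-inverted) latent noise."""
    return (noise > 0).long()

bits = torch.randint(0, 2, (1, 3, 64, 64))          # 3 bpp for a 3-channel 64x64 latent
noise = project_message(bits)
assert torch.equal(extract_message(noise), bits)    # lossless round trip on the noise
```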
Edit-A-Video: Single Video Editing with Object-Aware Consistency
Although text-to-video (TTV) models have recently achieved remarkable success,
few approaches have extended TTV to video editing. Motivated by approaches that
adapt TTV models from diffusion-based text-to-image (TTI) models, we propose a
video editing framework, termed Edit-A-Video, that requires only a pretrained
TTI model and a single text-video pair. The framework consists of two stages:
(1) inflating the 2D model into a 3D model by appending temporal modules and
tuning on the source video, and (2) inverting the source video into noise and
editing it with the target text prompt and attention map injection. Together,
the two stages enable temporal modeling and preservation of the semantic
attributes of the source video. One of the key challenges in video editing is
background inconsistency, where regions not targeted by the edit suffer
undesirable and temporally inconsistent alterations. To mitigate this issue, we
also introduce a novel mask blending method, termed sparse-causal blending (SC
Blending). It improves previous mask blending methods by reflecting temporal
consistency, so that the edited area exhibits smooth transitions while the
unedited regions retain spatio-temporal consistency. We present extensive
experimental results over various types of text and videos, and demonstrate the
superiority of the proposed method over baselines in terms of background
consistency, text alignment, and video editing quality.
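A minimal sketch of the kind of mask blending described here, assuming the mask at frame t also takes the first and previous frames into account; the names and the exact mask-combination rule are illustrative assumptions, not the paper's definition of SC Blending.

```python
# Blend edited and inverted-source latents frame by frame; the blend mask at frame t
# is the union of the masks from frame 0, frame t-1, and frame t ("sparse-causal"
# references), which smooths temporal changes in the edited region.
import torch

def sparse_causal_blend(edited: torch.Tensor,   # (T, C, H, W) edited latents
                        source: torch.Tensor,   # (T, C, H, W) inverted source latents
                        masks: torch.Tensor     # (T, 1, H, W) edit masks in [0, 1]
                        ) -> torch.Tensor:
    blended = []
    for t in range(edited.shape[0]):
        refs = torch.stack([masks[0], masks[max(t - 1, 0)], masks[t]])
        m = refs.amax(dim=0)                                 # union of reference masks
        blended.append(m * edited[t] + (1.0 - m) * source[t])
    return torch.stack(blended)
```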
ControlDreamer: Stylized 3D Generation with Multi-View ControlNet
Recent advancements in text-to-3D generation have significantly contributed
to the automation and democratization of 3D content creation. Building upon
these developments, we aim to address the limitations of current methods in
generating 3D models with creative geometry and styles. We introduce multi-view
ControlNet, a novel depth-aware multi-view diffusion model trained on datasets
generated from a carefully curated text corpus. Our multi-view ControlNet is
then integrated into our two-stage pipeline, ControlDreamer, enabling
text-guided generation of stylized 3D models. Additionally, we present a
comprehensive benchmark for 3D style editing, encompassing a broad range of
subjects, including objects, animals, and characters, to further facilitate
research on diverse 3D generation. Our comparative analysis reveals that this
new pipeline outperforms existing text-to-3D methods as evidenced by human
evaluations and CLIP score metrics.
Project page: https://controldreamer.github.io
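As a rough stand-in for the stylization stage, the sketch below conditions the publicly available single-view depth ControlNet on depth maps rendered from a stage-1 3D model; the paper's multi-view ControlNet, prompt, and depth renders are not specified in the abstract, so all inputs here are placeholders.

```python
# Sketch: stylize rendered views of a stage-1 3D model with a depth-conditioned
# diffusion model. The single-view depth ControlNet stands in for the paper's
# multi-view ControlNet; the prompt and depth maps are placeholders.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Placeholder depth maps standing in for renders of the stage-1 geometry.
rendered_depth_maps = [Image.new("RGB", (512, 512)) for _ in range(4)]

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

style_prompt = "a bronze statue, dramatic studio lighting"   # example style prompt
stylized_views = [
    pipe(style_prompt, image=depth).images[0]                # one stylized view per depth map
    for depth in rendered_depth_maps
]
```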
Out of Sight, Out of Mind: A Source-View-Wise Feature Aggregation for Multi-View Image-Based Rendering
To estimate the volume density and color of a 3D point in multi-view
image-based rendering, a common approach is to inspect whether a consensus
exists among the given source image features, which is an informative cue for
the estimation. To this end, most previous methods aggregate the features with
equal weights. However, this makes it hard to detect the consensus when the
source image feature set contains outliers, which frequently arise from
occlusions. In this paper, we propose a novel source-view-wise feature
aggregation method that finds the consensus robustly by leveraging local
structures in the feature set. We first compute a source-view-wise distance
distribution for each source feature. The distance distribution is then
converted into several similarity distributions with the proposed learnable
similarity mapping functions. Finally, for each element in the feature set, the
aggregation features are obtained as weighted means and variances, with the
weights derived from the similarity distributions. In experiments, we validate
the proposed method on various benchmark datasets with both synthetic and real
image scenes. The results demonstrate that incorporating the proposed features
improves performance by a large margin, achieving state-of-the-art results.
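A minimal sketch of this aggregation, assuming Euclidean distances and exponential similarity kernels with learnable scales; the exact distance metric and mapping functions are assumptions for illustration.

```python
# For one 3D point: compute pairwise distances between source-view features, map them
# to several similarity distributions with learnable scales, and use the normalized
# similarities as weights for per-view means and variances.
import torch
import torch.nn as nn

class SourceViewAggregation(nn.Module):
    def __init__(self, num_kernels: int = 4):
        super().__init__()
        self.log_scales = nn.Parameter(torch.zeros(num_kernels))  # learnable mapping scales

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, D), one feature per source view for a single 3D point.
        dist = torch.cdist(feats, feats)                   # (N, N) pairwise distances
        scales = self.log_scales.exp().view(-1, 1, 1)      # (K, 1, 1)
        sim = torch.exp(-dist.unsqueeze(0) / scales)       # (K, N, N) similarities
        weights = sim / sim.sum(dim=-1, keepdim=True)      # normalize over source views
        mean = weights @ feats                             # (K, N, D) weighted means
        var = weights @ feats.pow(2) - mean.pow(2)         # (K, N, D) weighted variances
        return torch.cat([mean, var], dim=-1)              # (K, N, 2D) aggregation features

agg = SourceViewAggregation(num_kernels=4)
features = torch.randn(8, 32)          # 8 source views, 32-dim features per view
out = agg(features)                    # shape: (4, 8, 64)
```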