AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models
Audio editing is applicable for various purposes, such as adding background
sound effects, replacing a musical instrument, and repairing damaged audio.
Recently, some diffusion-based methods have achieved zero-shot audio editing by
using a diffusion-and-denoising process conditioned on a text description of
the output audio. However, these methods still have several problems: 1) they
have not been trained on editing tasks and therefore cannot guarantee good
editing quality; 2) they can erroneously modify audio segments that do not
require editing; 3) they need a complete description of the output audio, which
is not always available or necessary in practical scenarios. In this work, we propose AUDIT, an
instruction-guided audio editing model based on latent diffusion models.
Specifically, AUDIT has three main design features: 1) we construct triplet
training data (instruction, input audio, output audio) for different audio
editing tasks and train a diffusion model conditioned on the instruction and
the input (to-be-edited) audio to generate the output (edited) audio; 2) it
automatically learns to modify only the segments that need to be edited, by
comparing the difference between the input and output audio; 3) it requires
only edit instructions, rather than full descriptions of the target audio, as text input. AUDIT
achieves state-of-the-art results in both objective and subjective metrics for
several audio editing tasks (e.g., adding, dropping, replacement, inpainting,
super-resolution). Demo samples are available at https://audit-demo.github.io/
Semantic Object Parsing with Local-Global Long Short-Term Memory
Semantic object parsing is a fundamental task for understanding objects in
detail in the computer vision community, where incorporating multi-level
contextual information is critical for achieving such fine-grained pixel-level
recognition. Prior methods often exploit contextual information only by
post-processing predicted confidence maps. In this work, we propose a novel
deep Local-Global Long Short-Term Memory (LG-LSTM) architecture to seamlessly
incorporate short-distance and long-distance spatial dependencies into the
feature learning over all pixel positions. In each LG-LSTM layer, local
guidance from neighboring positions and global guidance from the whole image
are imposed on each position to better exploit complex local and global
contextual information. Individual LSTMs for distinct spatial dimensions are
also utilized to intrinsically capture various spatial layouts of semantic
parts in the images, yielding distinct hidden and memory cells of each position
for each dimension. In our parsing approach, several LG-LSTM layers are stacked
and appended to the intermediate convolutional layers to directly enhance
visual features, allowing network parameters to be learned in an end-to-end
way. The long chains of sequential computation formed by stacked LG-LSTM layers
also enable each pixel to sense a much larger region for inference, benefiting
from the memorization of previous dependencies across all positions along all
dimensions. Comprehensive evaluations on three public datasets demonstrate
the significant superiority of our LG-LSTM over other state-of-the-art methods.
Comment: 10 pages
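The local-plus-global guidance idea above can be sketched as follows. This is an assumed simplification: a plain average stands in for the learned LSTM gates, and the function name `lg_lstm_layer_step` is hypothetical.

```python
import numpy as np

def lg_lstm_layer_step(features):
    """Toy sketch of one LG-LSTM layer's guidance. For each pixel we
    combine its own feature, local guidance from its 8 spatial
    neighbours, and global guidance pooled from the whole feature map.
    (A real LG-LSTM uses gated memory cells per spatial dimension;
    here a simple average replaces the learned gating.)"""
    H, W, C = features.shape
    global_guidance = features.mean(axis=(0, 1))  # whole-image context
    padded = np.pad(features, ((1, 1), (1, 1), (0, 0)))  # zero borders
    out = np.empty_like(features)
    for i in range(H):
        for j in range(W):
            window = padded[i:i + 3, j:j + 3].reshape(9, C)
            # sum of the 3x3 window minus the centre = the 8 neighbours
            local = (window.sum(axis=0) - features[i, j]) / 8.0
            out[i, j] = (features[i, j] + local + global_guidance) / 3.0
    return out
```

Stacking several such layers lets information propagate further at each step, which is why each pixel ends up sensing a much larger region.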
A Comprehensive Survey of Automated Audio Captioning
Automated audio captioning, a task that mimics human perception and
innovatively links audio processing with natural language processing, has
seen much progress over the last few years. Audio captioning requires
recognizing the acoustic scene, the primary audio events, and sometimes the
spatial and temporal relationships between events in an audio clip. It also
requires describing these elements in a fluent and vivid sentence. Deep
learning-based approaches are widely adopted to tackle this problem. This
paper serves as a comprehensive review covering the benchmark datasets,
existing deep learning techniques, and evaluation metrics in automated audio
captioning.