Sound Event Detection by Exploring Audio Sequence Modelling
Everyday sounds in real-world environments are a powerful source of information by which humans can interact with their environments. Humans can infer what is happening around them by listening to everyday sounds. At the same time, it is a challenging task for a computer algorithm in a smart device to automatically recognise, understand, and interpret everyday sounds. Sound event detection (SED) is the process of transcribing an audio recording into sound event tags with onset and offset time values. This involves classification and segmentation of sound events in the given audio recording. SED has numerous applications in everyday life, including security and surveillance, automation, healthcare monitoring, multimedia information retrieval, and assisted living technologies. SED is to everyday sounds what automatic speech recognition (ASR) is to speech and automatic music transcription (AMT) is to music. The fundamental questions in designing a sound recognition system are which portion of a sound event the system should analyse, and what proportion of a sound event the system should process in order to claim a confident detection of that particular sound event. While the classification of sound events has improved considerably in recent years, the temporal segmentation of sound events is considered not to have improved to the same extent. The aim of this thesis is to propose and develop methods to improve the segmentation and classification of everyday sound events in SED models. In particular, this thesis explores the segmentation of sound events by investigating audio sequence encoding-based and audio sequence modelling-based methods, in an effort to improve the overall sound event detection performance. In the first phase of this thesis, efforts are put towards improving sound event detection by explicitly conditioning the audio sequence representations of an SED model using sound activity detection (SAD) and onset detection.
To achieve this, we propose multi-task learning-based SED models in which SAD and onset detection are used as auxiliary tasks for the SED task. The next part of this thesis explores self-attention-based audio sequence modelling, which aggregates audio representations based on temporal relations within and between sound events, scored on the basis of the similarity of sound event portions in audio event sequences. We propose SED models that include memory-controlled, adaptive, dynamic, and source separation-induced self-attention variants, with the aim to improve overall sound recognition
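The output format described above (event tags with onset and offset times) can be illustrated with a minimal sketch. It assumes a model has already produced frame-wise activity probabilities for a single sound class; the function name `frames_to_events`, the fixed threshold, and the hop size are hypothetical choices for illustration, and this simple thresholding is not the method proposed in the thesis.

```python
from typing import List, Tuple


def frames_to_events(
    probs: List[float], hop_seconds: float, threshold: float = 0.5
) -> List[Tuple[float, float]]:
    """Convert frame-wise activity probabilities for one sound class
    into (onset, offset) pairs in seconds by simple thresholding."""
    events = []
    onset_frame = None  # index of the first frame of the currently open event
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and onset_frame is None:
            onset_frame = i  # event starts at this frame
        elif not active and onset_frame is not None:
            # event ended at the start of the first inactive frame
            events.append((onset_frame * hop_seconds, i * hop_seconds))
            onset_frame = None
    if onset_frame is not None:
        # event still active when the recording ends
        events.append((onset_frame * hop_seconds, len(probs) * hop_seconds))
    return events


# Hypothetical probabilities for one class, one value per 0.5 s frame.
example_probs = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1, 0.6, 0.9]
print(frames_to_events(example_probs, hop_seconds=0.5))
# → [(1.0, 2.5), (3.5, 4.5)]
```

Segmentation quality in this sketch depends entirely on how sharp the frame-wise probabilities are around event boundaries, which is the aspect the auxiliary SAD and onset detection tasks are intended to improve.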
Low- and high-resource opinion summarization
Customer reviews play a vital role in the online purchasing decisions we make. The reviews
express user opinions that are useful for setting realistic expectations and uncovering important
details about products. However, some products receive hundreds or even thousands of
reviews, making them time-consuming to read. Moreover, many reviews contain uninformative
content, such as irrelevant personal experiences. Automatic summarization offers an
alternative – short text summaries capturing the essential information expressed in reviews.
Automatically produced summaries can reflect overall or particular opinions and be tailored to
user preferences. Besides being presented on major e-commerce platforms, home assistants
can also vocalize them. This approach can improve user satisfaction by assisting in making
faster and better decisions.
Modern summarization approaches are based on neural networks, often requiring thousands of
annotated samples for training. However, human-written summaries for products are expensive
to produce because annotators need to read many reviews. This has led to annotated data
scarcity where only a few datasets are available. Data scarcity is the central theme of our
works, and we propose a number of approaches to alleviate the problem. The thesis consists
of two parts where we discuss low- and high-resource data settings.
In the first part, we propose self-supervised learning methods applied to customer reviews
and few-shot methods for learning from small annotated datasets. Customer reviews without
summaries are available in large quantities, contain a breadth of in-domain specifics, and
provide a powerful training signal. We show that reviews can be used for learning summarizers
via a self-supervised objective. Further, we address two main challenges associated with
learning from small annotated datasets. First, large models rapidly overfit on small datasets
leading to poor generalization. Second, it is not possible to learn a wide range of in-domain
specifics (e.g., product aspects and usage) from a handful of gold samples. This leads to
subtle semantic mistakes in generated summaries, such as ‘great dead on arrival battery.’ We
address the first challenge by explicitly modeling summary properties (e.g., content coverage
and sentiment alignment). Furthermore, we leverage small modules – adapters – that are
more robust to overfitting. As we show, despite their size, these modules can be used to
store in-domain knowledge to reduce semantic mistakes. Lastly, we propose a simple method
for learning personalized summarizers based on aspects, such as ‘price,’ ‘battery life,’ and
‘resolution.’ This task is harder to learn, and we present a few-shot method for training a
query-based summarizer on small annotated datasets.
In the second part, we focus on the high-resource setting and present a large dataset with
summaries collected from various online resources. The dataset has more than 33,000 human-written
summaries, each linked to up to thousands of reviews. This, however, makes it
challenging to apply an ‘expensive’ deep encoder due to memory and computational costs. To
address this problem, we propose selecting small subsets of informative reviews. Only these
subsets are encoded by the deep encoder and subsequently summarized. We show that the
selector and summarizer can be trained end-to-end via amortized inference and policy gradient
methods.
DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models
The recent progress in diffusion-based text-to-image generation models has
significantly expanded generative capabilities by conditioning on text
descriptions. However, since relying solely on text prompts is still
restrictive for fine-grained customization, we aim to extend the boundaries of
conditional generation to incorporate diverse types of modalities, e.g.,
sketch, box, and style embedding, simultaneously. We thus design a multimodal
text-to-image diffusion model, coined as DiffBlender, that achieves the
aforementioned goal in a single model by training only a few small
hypernetworks. DiffBlender facilitates a convenient scaling of input
modalities, without altering the parameters of an existing large-scale
generative model to retain its well-established knowledge. Furthermore, our
study sets new standards for multimodal generation by conducting quantitative
and qualitative comparisons with existing approaches. By diversifying the
channels of conditioning modalities, DiffBlender faithfully reflects the
provided information or, in its absence, creates imaginative generation.
Comment: 18 pages, 16 figures, and 3 tables
TaleCrafter: Interactive Story Visualization with Multiple Characters
Accurate story visualization requires several necessary elements, such as
identity consistency across frames, the alignment between plain text and visual
content, and a reasonable layout of objects in images. Most previous works
endeavor to meet these requirements by fitting a text-to-image (T2I) model on a
set of videos in the same style and with the same characters, e.g., the
FlintstonesSV dataset. However, the learned T2I models typically struggle to
adapt to new characters, scenes, and styles, and often lack the flexibility to
revise the layout of the synthesized images. This paper proposes a system for
generic interactive story visualization, capable of handling multiple novel
characters and supporting the editing of layout and local structure. It is
developed by leveraging the prior knowledge of large language and T2I models,
trained on massive corpora. The system comprises four interconnected
components: story-to-prompt generation (S2P), text-to-layout generation (T2L),
controllable text-to-image generation (C-T2I), and image-to-video animation
(I2V). First, the S2P module converts concise story information into detailed
prompts required for subsequent stages. Next, T2L generates diverse and
reasonable layouts based on the prompts, offering users the ability to adjust
and refine the layout to their preference. The core component, C-T2I, enables
the creation of images guided by layouts, sketches, and actor-specific
identifiers to maintain consistency and detail across visualizations. Finally,
I2V enriches the visualization process by animating the generated images.
Extensive experiments and a user study are conducted to validate the
effectiveness and flexibility of interactive editing of the proposed system.
Comment: GitHub repository: https://github.com/VideoCrafter/TaleCrafte
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks
The Transformer is a deep neural network that employs a self-attention mechanism
to comprehend the contextual relationships within sequential data. Unlike
conventional neural networks or updated versions of Recurrent Neural Networks
(RNNs) such as Long Short-Term Memory (LSTM), transformer models excel in
handling long dependencies between input sequence elements and enable parallel
processing. As a result, transformer-based models have attracted substantial
interest among researchers in the field of artificial intelligence. This can be
attributed to their immense potential and remarkable achievements, not only in
Natural Language Processing (NLP) tasks but also in a wide range of domains,
including computer vision, audio and speech processing, healthcare, and the
Internet of Things (IoT). Although several survey papers have been published
highlighting the transformer's contributions in specific fields, architectural
differences, or performance evaluations, there is still a significant absence
of a comprehensive survey paper encompassing its major applications across
various domains. Therefore, we undertook the task of filling this gap by
conducting an extensive survey of proposed transformer models from 2017 to
2022. Our survey encompasses the identification of the top five application
domains for transformer-based models, namely: NLP, Computer Vision,
Multi-Modality, Audio and Speech Processing, and Signal Processing. We analyze
the impact of highly influential transformer-based models in these domains and
subsequently classify them based on their respective tasks using a proposed
taxonomy. Our aim is to shed light on the existing potential and future
possibilities of transformers for enthusiastic researchers, thus contributing
to the broader understanding of this groundbreaking technology.
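The self-attention mechanism mentioned in this abstract can be sketched in a few lines. This is a minimal illustration of scaled dot-product self-attention, assuming identity projections (queries, keys, and values are all the raw input vectors); real transformer layers use learned projection matrices and multiple heads, which are omitted here. The name `self_attention` is chosen for illustration only.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """Scaled dot-product self-attention with Q = K = V = x
    (learned projections are omitted in this sketch)."""
    d = len(x[0])
    out = []
    for q in x:
        # similarity of this position to every position, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # each output is a weighted average of all input vectors
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(sequence):
    print([round(v, 3) for v in row])
```

Every position attends to every other position in one step, and the rows are independent of each other, which is why long-range dependencies and parallel processing are easier than in recurrent models.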
Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model
Text-to-image generative models have attracted rising attention for flexible
image editing via user-specified descriptions. However, text descriptions alone
are not enough to elaborate the details of subjects, often compromising the
subjects' identity or requiring additional per-subject fine-tuning. We
introduce a new framework called Paste, Inpaint and Harmonize via
Denoising (PhD), which leverages an exemplar image in addition to text
descriptions to specify user intentions. In the pasting step, an off-the-shelf
segmentation model is employed to identify a user-specified subject within an
exemplar image which is subsequently inserted into a background image to serve
as an initialization capturing both scene context and subject identity in one.
To guarantee the visual coherence of the generated or edited image, we
introduce an inpainting and harmonizing module to guide the pre-trained
diffusion model to seamlessly blend the inserted subject into the scene.
As we keep the pre-trained diffusion model frozen, we preserve its
strong image synthesis ability and text-driven ability, thus achieving
high-quality results and flexible editing with diverse texts. In our
experiments, we apply PhD to subject-driven image editing tasks and
explore text-driven scene generation given a reference subject. Both
quantitative and qualitative comparisons with baseline methods demonstrate that
our approach achieves state-of-the-art performance in both tasks. More
qualitative results can be found at
https://sites.google.com/view/phd-demo-page.
Comment: 10 pages, 12 figures
Scaling up GANs for Text-to-Image Synthesis
The recent success of text-to-image synthesis has taken the world by storm
and captured the general public's imagination. From a technical standpoint, it
also marked a drastic change in the favored architecture to design generative
image models. GANs used to be the de facto choice, with techniques like
StyleGAN. With DALL-E 2, auto-regressive and diffusion models became the new
standard for large-scale generative models overnight. This rapid shift raises a
fundamental question: can we scale up GANs to benefit from large datasets like
LAION? We find that naïvely increasing the capacity of the StyleGAN
architecture quickly becomes unstable. We introduce GigaGAN, a new GAN
architecture that far exceeds this limit, demonstrating GANs as a viable option
for text-to-image synthesis. GigaGAN offers three major advantages. First, it
is orders of magnitude faster at inference time, taking only 0.13 seconds to
synthesize a 512px image. Second, it can synthesize high-resolution images, for
example, 16-megapixel images in 3.66 seconds. Finally, GigaGAN supports various
latent space editing applications such as latent interpolation, style mixing,
and vector arithmetic operations.
Comment: CVPR 2023. Project webpage at https://mingukkang.github.io/GigaGAN
Hyperbolic Image-Text Representations
Visual and linguistic concepts naturally organize themselves in a hierarchy,
where a textual concept "dog" entails all images that contain dogs. Despite
being intuitive, current large-scale vision and language models such as CLIP do
not explicitly capture such hierarchy. We propose MERU, a contrastive model
that yields hyperbolic representations of images and text. Hyperbolic spaces
have suitable geometric properties to embed tree-like data, so MERU can better
capture the underlying hierarchy in image-text data. Our results show that MERU
learns a highly interpretable representation space while being competitive with
CLIP's performance on multi-modal tasks like image classification and
image-text retrieval.
Comment: Technical report
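To make the idea of hyperbolic representations concrete, the following sketch computes the geodesic distance between two embeddings. It assumes the Lorentz (hyperboloid) model with curvature -1; the abstract above does not state which model or curvature MERU uses, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence


def lift_to_hyperboloid(space: Sequence[float]) -> List[float]:
    """Add the time component so that the point satisfies <x, x>_L = -1."""
    time = math.sqrt(1.0 + sum(v * v for v in space))
    return [time, *space]


def lorentz_inner(x: Sequence[float], y: Sequence[float]) -> float:
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def hyperbolic_distance(x_space: Sequence[float], y_space: Sequence[float]) -> float:
    """Geodesic distance on the unit hyperboloid (assumed curvature -1)."""
    x = lift_to_hyperboloid(x_space)
    y = lift_to_hyperboloid(y_space)
    # Clamp guards against rounding pushing the argument just below 1.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = [0.0, 0.0]
far_point = [math.sinh(3.0), 0.0]
near_point = [math.sinh(1.0), 0.0]
print(hyperbolic_distance(origin, near_point))  # approximately 1.0
print(hyperbolic_distance(origin, far_point))   # approximately 3.0
```

Distances grow roughly logarithmically with the Euclidean norm of the space components, so volume expands exponentially with radius. This is the geometric property that makes hyperbolic space a natural fit for tree-like, hierarchical data.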
EQUI-VOCAL: Synthesizing Queries for Compositional Video Events from Limited User Interactions [Technical Report]
We introduce EQUI-VOCAL: a new system that automatically synthesizes queries
over videos from limited user interactions. The user only provides a handful of
positive and negative examples of what they are looking for. EQUI-VOCAL
utilizes these initial examples and additional ones collected through active
learning to efficiently synthesize complex user queries. Our approach enables
users to find events without database expertise, with limited labeling effort,
and without declarative specifications or sketches. Core to EQUI-VOCAL's design
is the use of spatio-temporal scene graphs in its data model and query language
and a novel query synthesis approach that works on large and noisy video data.
Our system outperforms two baseline systems -- in terms of F1 score, synthesis
time, and robustness to noise -- and can flexibly synthesize complex queries
that the baselines do not support.
Comment: This is an extended technical report for the following paper: "Enhao
Zhang, Maureen Daum, Dong He, Brandon Haynes, Ranjay Krishna, and Magdalena
Balazinska. EQUI-VOCAL: Synthesizing Queries for Compositional Video Events
from Limited User Interactions. PVLDB, 16(11): 2714-2727, 2023.
doi:10.14778/3611479.3611482"
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
Recent advancements in generative speech models based on audio-text prompts
have enabled remarkable innovations like high-quality zero-shot text-to-speech.
However, existing models still face limitations in handling diverse audio-text
speech generation tasks involving transforming input speech and processing
audio captured in adverse acoustic conditions. This paper introduces SpeechX, a
versatile speech generation model capable of zero-shot TTS and various speech
transformation tasks, dealing with both clean and noisy signals. SpeechX
combines neural codec language modeling with multi-task learning using
task-dependent prompting, enabling unified and extensible modeling and
providing a consistent way for leveraging textual input in speech enhancement
and transformation tasks. Experimental results show SpeechX's efficacy in
various tasks, including zero-shot TTS, noise suppression, target speaker
extraction, speech removal, and speech editing with or without background
noise, achieving comparable or superior performance to specialized models
across tasks. See https://aka.ms/speechx for demo samples.