Test-Time Training for Speech
In this paper, we study Test-Time Training (TTT) as a way to handle
distribution shifts in speech applications. In particular, we introduce
distribution shifts into the test sets of standard speech classification
tasks -- for example, speaker identification and emotion detection -- and
explore how TTT can help the model adjust to them. In experiments covering
distribution shifts due to background noise as well as natural variation in
speech, such as speaker gender and age, we identify key challenges with TTT,
including sensitivity to optimization hyperparameters (e.g., the number of
optimization steps and the subset of parameters chosen for adaptation) and
scalability (since each test example receives its own set of adapted
parameters, storage grows with the size of the test set). Finally, we propose
using BitFit -- a parameter-efficient fine-tuning algorithm, originally
proposed for text applications, that updates only the bias parameters -- as a
solution to these challenges, and demonstrate that it is consistently more
stable than fine-tuning all of the model's parameters.
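As a minimal sketch of the setup (ours, not the paper's code): at test time,
a copy of the classifier is adapted to each incoming example by updating only
its bias parameters, BitFit-style. The entropy-minimization objective and the
hyperparameters below are illustrative assumptions, since the abstract does
not fix the TTT loss.

```python
import copy
import torch

def ttt_bitfit(model, example, n_steps=10, lr=1e-3):
    """Adapt a copy of `model` to one test example, BitFit-style.

    Only bias parameters are updated, which shrinks the per-example
    parameter footprint and, per the paper, is more stable than
    adapting all weights. The entropy loss is an illustrative
    stand-in for a test-time training objective.
    """
    adapted = copy.deepcopy(model)  # each test example gets its own copy
    for p in adapted.parameters():
        p.requires_grad_(False)
    biases = [p for name, p in adapted.named_parameters()
              if name.endswith(".bias")]
    for p in biases:
        p.requires_grad_(True)

    opt = torch.optim.Adam(biases, lr=lr)
    for _ in range(n_steps):
        log_probs = adapted(example.unsqueeze(0)).log_softmax(dim=-1)
        loss = -(log_probs.exp() * log_probs).sum()  # prediction entropy
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapted
```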
TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer
In this work, we address the problem of musical timbre transfer, where the
goal is to manipulate the timbre of a sound sample from one instrument to match
another instrument while preserving other musical content, such as pitch,
rhythm, and loudness. In principle, one could apply image-based style transfer
techniques to a time-frequency representation of an audio signal, but this
depends on having a representation that allows independent manipulation of
timbre as well as high-quality waveform generation. We introduce TimbreTron, a
method for musical timbre transfer which applies "image" domain style transfer
to a time-frequency representation of the audio signal, and then produces a
high-quality waveform using a conditional WaveNet synthesizer. We show that the
Constant Q Transform (CQT) representation is particularly well-suited to
convolutional architectures due to its approximate pitch equivariance. Based on
human perceptual evaluations, we confirmed that TimbreTron recognizably
transferred the timbre while otherwise preserving the musical content, for both
monophonic and polyphonic samples.
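To make the front end concrete, here is a minimal sketch of computing a
log-magnitude CQT "image" with librosa; the sample rate, hop length, and bin
counts are illustrative choices, not the paper's configuration.

```python
import librosa
import numpy as np

# Load a sound sample and compute its constant-Q transform (CQT).
# Because CQT bins are geometrically spaced, transposing a note by an
# octave shifts its energy by a fixed number of rows (here 24), which
# is the approximate pitch equivariance that suits convolutional nets.
y, sr = librosa.load("sample.wav")          # resamples to 22050 Hz
C = librosa.cqt(y, sr=sr, hop_length=256,
                n_bins=24 * 8,              # 8 octaves
                bins_per_octave=24)
log_mag = np.log1p(np.abs(C))               # log-magnitude "image"
```

Since the magnitude CQT discards phase and is not trivially invertible, a
learned synthesizer such as the conditional WaveNet named in the abstract is
used to recover a waveform, rather than inverting the transform directly.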
Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models
Text-to-image generative models have demonstrated remarkable capabilities in
generating high-quality images based on textual prompts. However, crafting
prompts that accurately capture the user's creative intent remains challenging.
It often involves laborious trial-and-error procedures to ensure that the model
interprets the prompts in alignment with the user's intention. To address
these challenges, we present Promptify, an interactive system that supports prompt
exploration and refinement for text-to-image generative models. Promptify
utilizes a suggestion engine powered by large language models to help users
quickly explore and craft diverse prompts. Our interface allows users to
organize the generated images flexibly, and based on their preferences,
Promptify suggests potential changes to the original prompt. This feedback loop
enables users to iteratively refine their prompts and enhance desired features
while avoiding unwanted ones. Our user study shows that Promptify effectively
facilitates the text-to-image workflow and outperforms an existing baseline
tool widely used for text-to-image generation.
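A minimal sketch of the feedback loop the abstract describes;
`suggest_prompts` and `collect_feedback` are hypothetical stand-ins for
Promptify's LLM-backed suggestion engine and its image-organization step,
stubbed out here so the loop is runnable.

```python
import random

def suggest_prompts(prompt, liked=(), disliked=(), n=4):
    """Hypothetical stand-in for the LLM-backed suggestion engine.

    A real implementation would ask a large language model to rewrite
    `prompt`, keeping features from `liked` results and avoiding those
    in `disliked`; here we just tag variants to keep the sketch runnable.
    """
    return [f"{prompt}, variation {i}" for i in range(n)]

def collect_feedback(candidates):
    """Hypothetical stand-in for the user sorting generated images."""
    liked = random.sample(candidates, k=2)
    disliked = [c for c in candidates if c not in liked]
    return liked, disliked

def refine_loop(initial_prompt, rounds=3):
    """The iterative prompt-refinement loop described in the abstract."""
    prompt = initial_prompt
    for _ in range(rounds):
        candidates = suggest_prompts(prompt)
        liked, disliked = collect_feedback(candidates)
        # Fold the user's preferences back into the next suggestion round.
        prompt = suggest_prompts(prompt, liked=liked, disliked=disliked)[0]
    return prompt

print(refine_loop("a watercolor fox in a misty forest"))
```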
Logical Activation Functions: Logit-space equivalents of Probabilistic Boolean Operators
The choice of activation functions and their motivation is a long-standing
issue within the neural network community. Neuronal representations within
artificial neural networks are commonly understood as logits, representing the
log-odds score of presence of features within the stimulus. We derive
logit-space operators equivalent to the probabilistic Boolean logic gates
AND, OR, and XNOR for independent probabilities. Such theories are important
for formalizing more complex dendritic operations in real neurons, and these
operations can be used as activation functions within a neural network,
introducing probabilistic Boolean logic as the core operation of the neural
network. Since these functions involve taking multiple exponents and
logarithms, they are computationally expensive and not well suited to be
directly used within neural networks. Consequently, we construct efficient
approximations named AND_AIL (the AND operator Approximate for Independent
Logits), OR_AIL, and XNOR_AIL, which utilize only comparison and addition
operations, have well-behaved gradients, and can be deployed as activation
functions in neural networks. Like MaxOut, AND_AIL and OR_AIL are
generalizations of ReLU to two dimensions. While our primary aim is to
formalize dendritic computations within a logit-space probabilistic-Boolean
framework, we deploy these new activation functions, both in isolation and in
combination, and demonstrate their effectiveness on a variety of tasks,
including image classification, transfer learning, abstract reasoning, and
compositional zero-shot learning.
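To ground the derivation, here is a minimal sketch (ours, not the paper's
code) of the exact logit-space AND and OR for independent probabilities,
computed from log-sigmoids for numerical stability. The paper's AND_AIL and
OR_AIL are cheap comparison-based approximations that avoid the exp/log
calls this exact form requires.

```python
import torch
import torch.nn.functional as F

def logit_and(x, y):
    """Exact logit-space AND for independent probabilities.

    If p = sigmoid(x) and q = sigmoid(y), the AND event has
    probability p*q, so its logit is log(p*q) - log(1 - p*q).
    """
    log_pq = F.logsigmoid(x) + F.logsigmoid(y)         # log(p*q)
    return log_pq - torch.log1p(-torch.exp(log_pq))    # logit(p*q)

def logit_or(x, y):
    """Exact logit-space OR: 1 - (1-p)(1-q), via De Morgan duality."""
    return -logit_and(-x, -y)

x = torch.tensor([0.0, 2.0, -3.0])
y = torch.tensor([0.0, 2.0, 4.0])
print(logit_and(x, y))   # at x = y = 0: logit(0.25) = -log(3)
print(logit_or(x, y))
```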
Zero-shot Clustering of Embeddings with Self-Supervised Learnt Encoders
We explore whether self-supervised pretrained models can provide a useful
representation space for datasets they were not trained on, and whether these
representations can be used to group novel unlabelled data into meaningful
clusters. To this end, we conduct experiments using image representation
encoders pretrained on ImageNet with a variety of self-supervised training
techniques. These encoders are deployed on image datasets that were not seen
during training, without fine-tuning, and we investigate whether their
embeddings can be clustered with conventional clustering algorithms. We find
that it is possible to create well-defined clusters using self-supervised
feature encoders, especially with the Agglomerative Clustering method, and
that this holds even for very fine-grained datasets such as NABirds. We also
find indications that the Silhouette score is a good proxy for cluster
quality when no ground truth is available.
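A minimal sketch of the clustering-and-scoring step, with random unit vectors
standing in for the embeddings of a frozen self-supervised encoder, and an
illustrative cluster count.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# `embeddings` stands in for L2-normalized outputs of a frozen
# self-supervised encoder (e.g. one pretrained on ImageNet) applied
# to an unseen dataset; random data keeps the sketch self-contained.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 128))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Cluster the embeddings, then score the result without ground-truth
# labels using the Silhouette score as the quality proxy.
labels = AgglomerativeClustering(n_clusters=20).fit_predict(embeddings)
print(silhouette_score(embeddings, labels))
```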
SynthScribe: Deep Multimodal Tools for Synthesizer Sound Retrieval and Exploration
Synthesizers are powerful tools that allow musicians to create dynamic and
original sounds. Existing commercial interfaces for synthesizers typically
require musicians to interact with complex low-level parameters or to manage
large libraries of premade sounds. To address these challenges, we implement
SynthScribe -- a full-stack system that uses multimodal deep learning to let
users express their intentions at a much higher level. The system addresses
three difficulties in particular: 1) searching through existing sounds, 2)
creating completely new sounds, and 3) making meaningful modifications to a
given sound. It does so with three main features: a multimodal search engine
for a large library of synthesizer sounds; a user-centered genetic algorithm
by which completely new sounds can be created and selected according to the
user's preferences; and a sound-editing support feature that highlights and
gives examples for key control parameters with respect to a text- or
audio-based query. The results of our user studies show that SynthScribe
reliably retrieves and modifies sounds while also affording the ability to
create completely new sounds that expand a musician's creative horizons.
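As an illustration of the retrieval feature alone: a multimodal search engine
of this kind embeds the query and every preset into a shared space and ranks
by cosine similarity. The random vectors below stand in for the outputs of a
joint text-audio encoder (a CLAP-style model is our assumption; the abstract
does not name the encoder).

```python
import numpy as np

# Placeholder embeddings: in a real system a joint text-audio model
# would map the text query and every preset in the library into one
# shared embedding space.
rng = np.random.default_rng(0)
library = rng.normal(size=(10_000, 512))  # one row per synthesizer preset
query = rng.normal(size=512)              # e.g. "warm analog pad"

# Rank presets by cosine similarity to the query embedding.
library_n = library / np.linalg.norm(library, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
scores = library_n @ query_n
top10 = np.argsort(scores)[::-1][:10]
print(top10, scores[top10])
```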