Test-Time Training for Speech
In this paper, we study Test-Time Training (TTT) as a way to handle
distribution shifts in speech applications. In particular, we introduce
distribution shifts into the test sets of standard speech classification
tasks -- for example, speaker identification and emotion detection -- and
explore how TTT can help the model adjust to them. In experiments covering
distribution shifts due to background noise as well as natural variation in
speech, such as speaker gender and age, we identify key challenges with TTT,
including sensitivity to optimization hyperparameters (e.g., the number of
optimization steps and the subset of parameters chosen for adaptation) and
scalability (since each test example receives its own set of adapted
parameters, storage grows with the size of the test set). Finally, we propose
using BitFit -- a parameter-efficient fine-tuning algorithm, originally
proposed for text applications, that updates only the bias parameters -- as a
solution to these challenges, and demonstrate that it is consistently more
stable than fine-tuning all of the model's parameters.
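As a minimal sketch of the setup (ours, not the paper's code): at test time,
a copy of the classifier is adapted to each incoming example by updating only
its bias parameters, BitFit-style. The entropy-minimization objective and the
hyperparameters below are illustrative assumptions, since the abstract does
not fix the TTT loss.

```python
import copy
import torch

def ttt_bitfit(model, example, n_steps=10, lr=1e-3):
    """Adapt a copy of `model` to one test example, BitFit-style.

    Only bias parameters are updated, which shrinks the per-example
    parameter footprint and, per the paper, is more stable than
    adapting all weights. The entropy loss is an illustrative
    stand-in for a test-time training objective.
    """
    adapted = copy.deepcopy(model)  # each test example gets its own copy
    for p in adapted.parameters():
        p.requires_grad_(False)
    biases = [p for name, p in adapted.named_parameters()
              if name.endswith(".bias")]
    for p in biases:
        p.requires_grad_(True)

    opt = torch.optim.Adam(biases, lr=lr)
    for _ in range(n_steps):
        log_probs = adapted(example.unsqueeze(0)).log_softmax(dim=-1)
        loss = -(log_probs.exp() * log_probs).sum()  # prediction entropy
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapted
```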
TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer
In this work, we address the problem of musical timbre transfer, where the
goal is to manipulate the timbre of a sound sample from one instrument to match
another instrument while preserving other musical content, such as pitch,
rhythm, and loudness. In principle, one could apply image-based style transfer
techniques to a time-frequency representation of an audio signal, but this
depends on having a representation that allows independent manipulation of
timbre as well as high-quality waveform generation. We introduce TimbreTron, a
method for musical timbre transfer which applies "image" domain style transfer
to a time-frequency representation of the audio signal, and then produces a
high-quality waveform using a conditional WaveNet synthesizer. We show that the
Constant Q Transform (CQT) representation is particularly well-suited to
convolutional architectures due to its approximate pitch equivariance. Based on
human perceptual evaluations, we confirmed that TimbreTron recognizably
transferred the timbre while otherwise preserving the musical content, for both
monophonic and polyphonic samples.
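To make the front end concrete, here is a minimal sketch of computing a
log-magnitude CQT "image" with librosa; the sample rate, hop length, and bin
counts are illustrative choices, not the paper's configuration.

```python
import librosa
import numpy as np

# Load a sound sample and compute its constant-Q transform (CQT).
# Because CQT bins are geometrically spaced, transposing a note by an
# octave shifts its energy by a fixed number of rows (here 24), which
# is the approximate pitch equivariance that suits convolutional nets.
y, sr = librosa.load("sample.wav")          # resamples to 22050 Hz
C = librosa.cqt(y, sr=sr, hop_length=256,
                n_bins=24 * 8,              # 8 octaves
                bins_per_octave=24)
log_mag = np.log1p(np.abs(C))               # log-magnitude "image"
```

Since the magnitude CQT discards phase and is not trivially invertible, a
learned synthesizer such as the conditional WaveNet named in the abstract is
used to recover a waveform, rather than inverting the transform directly.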
Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models
Text-to-image generative models have demonstrated remarkable capabilities in
generating high-quality images based on textual prompts. However, crafting
prompts that accurately capture the user's creative intent remains challenging.
It often involves laborious trial-and-error procedures to ensure that the model
interprets the prompts in alignment with the user's intention. To address
these challenges, we present Promptify, an interactive system that supports prompt
exploration and refinement for text-to-image generative models. Promptify
utilizes a suggestion engine powered by large language models to help users
quickly explore and craft diverse prompts. Our interface allows users to
organize the generated images flexibly, and based on their preferences,
Promptify suggests potential changes to the original prompt. This feedback loop
enables users to iteratively refine their prompts and enhance desired features
while avoiding unwanted ones. Our user study shows that Promptify effectively
facilitates the text-to-image workflow and outperforms an existing baseline
tool widely used for text-to-image generation.
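A minimal sketch of the feedback loop the abstract describes;
`suggest_prompts` and `collect_feedback` are hypothetical stand-ins for
Promptify's LLM-backed suggestion engine and its image-organization step,
stubbed out here so the loop is runnable.

```python
import random

def suggest_prompts(prompt, liked=(), disliked=(), n=4):
    """Hypothetical stand-in for the LLM-backed suggestion engine.

    A real implementation would ask a large language model to rewrite
    `prompt`, keeping features from `liked` results and avoiding those
    in `disliked`; here we just tag variants to keep the sketch runnable.
    """
    return [f"{prompt}, variation {i}" for i in range(n)]

def collect_feedback(candidates):
    """Hypothetical stand-in for the user sorting generated images."""
    liked = random.sample(candidates, k=2)
    disliked = [c for c in candidates if c not in liked]
    return liked, disliked

def refine_loop(initial_prompt, rounds=3):
    """The iterative prompt-refinement loop described in the abstract."""
    prompt = initial_prompt
    for _ in range(rounds):
        candidates = suggest_prompts(prompt)
        liked, disliked = collect_feedback(candidates)
        # Fold the user's preferences back into the next suggestion round.
        prompt = suggest_prompts(prompt, liked=liked, disliked=disliked)[0]
    return prompt

print(refine_loop("a watercolor fox in a misty forest"))
```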
Logical Activation Functions: Logit-space equivalents of Probabilistic Boolean Operators
The choice of activation functions and their motivation is a long-standing
issue within the neural network community. Neuronal representations within
artificial neural networks are commonly understood as logits, representing the
log-odds score of presence of features within the stimulus. We derive
logit-space operators equivalent to the probabilistic Boolean logic gates
AND, OR, and XNOR for independent probabilities. Such theories are important
for formalizing more complex dendritic operations in real neurons, and these
operations can be used as activation functions within a neural network,
introducing probabilistic Boolean logic as the core operation of the neural
network. Since these functions involve taking multiple exponents and
logarithms, they are computationally expensive and not well suited to be
directly used within neural networks. Consequently, we construct efficient
approximations named AND_AIL (the AND operator Approximate for Independent
Logits), OR_AIL, and XNOR_AIL, which utilize only comparison and addition
operations, have well-behaved gradients, and can be deployed as activation
functions in neural networks. Like MaxOut, AND_AIL and OR_AIL are
generalizations of ReLU to two dimensions. While our primary aim is to
formalize dendritic computations within a logit-space probabilistic-Boolean
framework, we deploy these new activation functions, both in isolation and in
combination, and demonstrate their effectiveness on a variety of tasks,
including image classification, transfer learning, abstract reasoning, and
compositional zero-shot learning.
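To ground the derivation, here is a minimal sketch (ours, not the paper's
code) of the exact logit-space AND and OR for independent probabilities,
computed from log-sigmoids for numerical stability. The paper's AND_AIL and
OR_AIL are cheap comparison-based approximations that avoid the exp/log
calls this exact form requires.

```python
import torch
import torch.nn.functional as F

def logit_and(x, y):
    """Exact logit-space AND for independent probabilities.

    If p = sigmoid(x) and q = sigmoid(y), the AND event has
    probability p*q, so its logit is log(p*q) - log(1 - p*q).
    """
    log_pq = F.logsigmoid(x) + F.logsigmoid(y)         # log(p*q)
    return log_pq - torch.log1p(-torch.exp(log_pq))    # logit(p*q)

def logit_or(x, y):
    """Exact logit-space OR: 1 - (1-p)(1-q), via De Morgan duality."""
    return -logit_and(-x, -y)

x = torch.tensor([0.0, 2.0, -3.0])
y = torch.tensor([0.0, 2.0, 4.0])
print(logit_and(x, y))   # at x = y = 0: logit(0.25) = -log(3)
print(logit_or(x, y))
```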
Zero-shot Clustering of Embeddings with Self-Supervised Learnt Encoders
We explore whether self-supervised pretrained models can provide a useful
representation space for datasets they were not trained on, and whether these
representations can be used to group novel unlabelled data into meaningful
clusters. To this end, we conduct experiments using image representation
encoders pretrained on ImageNet with a variety of self-supervised training
techniques. These encoders are deployed on image datasets that were not seen
during training, without fine-tuning, and we investigate whether their
embeddings can be clustered with conventional clustering algorithms. We find
that it is possible to create well-defined clusters using self-supervised
feature encoders, especially with the Agglomerative Clustering method, and
that this holds even for very fine-grained datasets such as NABirds. We also
find indications that the Silhouette score is a good proxy for cluster
quality when no ground truth is available.
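A minimal sketch of the clustering-and-scoring step, with random unit vectors
standing in for the embeddings of a frozen self-supervised encoder, and an
illustrative cluster count.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# `embeddings` stands in for L2-normalized outputs of a frozen
# self-supervised encoder (e.g. one pretrained on ImageNet) applied
# to an unseen dataset; random data keeps the sketch self-contained.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 128))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Cluster the embeddings, then score the result without ground-truth
# labels using the Silhouette score as the quality proxy.
labels = AgglomerativeClustering(n_clusters=20).fit_predict(embeddings)
print(silhouette_score(embeddings, labels))
```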
SynthScribe: Deep Multimodal Tools for Synthesizer Sound Retrieval and Exploration
Synthesizers are powerful tools that allow musicians to create dynamic and
original sounds. Existing commercial interfaces for synthesizers typically
require musicians to interact with complex low-level parameters or to manage
large libraries of premade sounds. To address these challenges, we implement
SynthScribe -- a full-stack system that uses multimodal deep learning to let
users express their intentions at a much higher level. The system addresses
three difficulties in particular: 1) searching through existing sounds, 2)
creating completely new sounds, and 3) making meaningful modifications to a
given sound. It does so with three main features: a multimodal search engine
for a large library of synthesizer sounds; a user-centered genetic algorithm
by which completely new sounds can be created and selected according to the
user's preferences; and a sound-editing support feature that highlights and
gives examples for key control parameters with respect to a text- or
audio-based query. The results of our user studies show that SynthScribe
reliably retrieves and modifies sounds while also affording the ability to
create completely new sounds that expand a musician's creative horizons.
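As an illustration of the retrieval feature alone: a multimodal search engine
of this kind embeds the query and every preset into a shared space and ranks
by cosine similarity. The random vectors below stand in for the outputs of a
joint text-audio encoder (a CLAP-style model is our assumption; the abstract
does not name the encoder).

```python
import numpy as np

# Placeholder embeddings: in a real system a joint text-audio model
# would map the text query and every preset in the library into one
# shared embedding space.
rng = np.random.default_rng(0)
library = rng.normal(size=(10_000, 512))  # one row per synthesizer preset
query = rng.normal(size=512)              # e.g. "warm analog pad"

# Rank presets by cosine similarity to the query embedding.
library_n = library / np.linalg.norm(library, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
scores = library_n @ query_n
top10 = np.argsort(scores)[::-1][:10]
print(top10, scores[top10])
```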