
    Speaker Normalization for Self-supervised Speech Emotion Recognition

    Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversarial learning framework that learns a speech emotion recognition task while normalizing speaker characteristics out of the feature representation. We demonstrate the efficacy of our method in both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset. Comment: ICASSP 2
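    The standard building block for this kind of adversarial normalization is a gradient reversal layer: an identity map in the forward pass whose backward pass negates the gradient, so the encoder is trained away from features that help an auxiliary speaker classifier. The sketch below is a minimal NumPy illustration of that mechanism, not the paper's actual implementation; `lam` and the toy loss are illustrative.

```python
import numpy as np

def grl_forward(x):
    # Identity in the forward pass: features reach the speaker
    # classifier unchanged.
    return x

def grl_backward(grad_output, lam=1.0):
    # Gradients are scaled by -lambda on the way back, pushing the
    # feature encoder *away* from speaker-discriminative directions.
    return -lam * grad_output

# Toy check with loss = sum(w * x): the upstream gradient w.r.t. x is w,
# and after the reversal layer it arrives negated at the encoder.
w = np.array([1.0, 2.0, 3.0])
grad_to_encoder = grl_backward(w, lam=1.0)
print(grad_to_encoder)  # -> [-1. -2. -3.]
```

    In a full system this layer would sit between the shared encoder and the speaker-classification head, while the emotion head receives normal gradients.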

    Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

    We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets, demonstrating significant semantic diversity of audio-video samples, and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, with respect to both content and the temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse. Comment: 9 pages, 6 figure
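    The energy-peak idea behind AV-Align can be sketched in a few lines: detect local maxima in an audio energy curve and in a video "motion energy" curve, then score how many audio peaks have a video peak nearby. This is a simplified stand-in, the peak detector, window, and scoring rule are assumptions, not the paper's exact formulation.

```python
import numpy as np

def find_peaks(x, thresh=0.0):
    # Local maxima above a threshold (a crude stand-in for a proper
    # onset/peak detector).
    return [i for i in range(1, len(x) - 1)
            if x[i] > x[i - 1] and x[i] > x[i + 1] and x[i] > thresh]

def peak_alignment(audio_energy, video_energy, window=1):
    # Fraction of audio energy peaks with a video energy peak within
    # +/- `window` frames; an AV-Align-like score, not the paper's metric.
    a_peaks = find_peaks(audio_energy)
    v_peaks = find_peaks(video_energy)
    if not a_peaks:
        return 0.0
    hits = sum(any(abs(p - q) <= window for q in v_peaks) for p in a_peaks)
    return hits / len(a_peaks)

audio = np.array([0., 1., 0., 0., 2., 0., 0.])
video = np.array([0., 0., 1., 0., 2., 0., 0.])
print(peak_alignment(audio, video))  # both audio peaks matched -> 1.0
```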

    On The Robustness of Self-Supervised Representations for Spoken Language Modeling

    Self-supervised representations have been extensively studied for discriminative and generative tasks. However, their robustness capabilities have not been extensively investigated. This work focuses on self-supervised representations for spoken generative language models. First, we empirically demonstrate how current state-of-the-art speech representation models lack robustness to basic signal variations that do not alter the spoken information. To overcome this, we propose an effective and efficient method to learn robust self-supervised speech representations for generative spoken language modeling. The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines when considering encoding metrics. We additionally evaluate our method on the speech-to-speech translation task. We consider Spanish-English and French-English conversions and empirically demonstrate the benefits of following the proposed approach.
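    The core objective can be illustrated with a toy version of the pseudo-labeling setup: features are discretized into "units" by nearest-centroid assignment, and robustness is the agreement between the unit sequences of a clean and a perturbed signal. Everything below (centroids, features, the perturbation) is synthetic; the real method operates on learned speech representations and retrains the encoder iteratively.

```python
import numpy as np

def assign_units(features, centroids):
    # Nearest-centroid quantization: each frame gets a discrete
    # pseudo-label (its "unit").
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=1)

def unit_agreement(units_a, units_b):
    # Fraction of frames whose unit survives the perturbation; a robust
    # representation should keep this close to 1.
    return float(np.mean(units_a == units_b))

centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
clean = np.array([[0.1, 0.0], [0.9, 1.1], [0.0, 0.2]])
noisy = clean + 0.05  # a small signal perturbation
print(unit_agreement(assign_units(clean, centroids),
                     assign_units(noisy, centroids)))  # -> 1.0
```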

    Simple and Controllable Music Generation

    We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or via upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples while being conditioned on textual description or melodic features, allowing better control over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light on the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft
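    The interleaving trick can be sketched with the "delay" pattern used by single-stage codec LMs such as MusicGen: the tokens of codebook k are shifted right by k steps, so one transformer can model all codebook streams jointly in a single autoregressive pass. This is a simplified illustration (the released audiocraft code supports several patterns and uses real padding tokens, `PAD = -1` here is a placeholder).

```python
PAD = -1  # placeholder for the model's actual padding/special token

def delay_interleave(streams):
    # streams: list of K equal-length token lists, one per codebook.
    # Codebook k is delayed by k steps, giving K rows of length T + K - 1.
    K, T = len(streams), len(streams[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k, s in enumerate(streams):
        for t, tok in enumerate(s):
            out[k][t + k] = tok
    return out

codebooks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for row in delay_interleave(codebooks):
    print(row)
# row k holds its tokens delayed by k steps, padded with PAD elsewhere
```

    At step t the model then predicts one token per codebook, each conditioned on slightly "older" tokens from the higher codebooks, which is what removes the need for a cascade of models.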

    Statistical conservation laws in turbulent transport

    We address the statistical theory of fields that are transported by a turbulent velocity field, both in forced and in unforced (decaying) experiments. We propose that with very few provisos on the transporting velocity field, correlation functions of the transported field in the forced case are dominated by statistically preserved structures. In decaying experiments (without forcing the transported fields) we identify infinitely many statistical constants of the motion, which are obtained by projecting the decaying correlation functions on the statistically preserved functions. We exemplify these ideas and provide numerical evidence using a simple model of turbulent transport. This example is chosen for its lack of Lagrangian structure, to stress the generality of the ideas.
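    The projection idea can be stated compactly. Assuming (as a schematic of the abstract's construction) that the decaying N-point correlation function evolves linearly under a propagator,

```latex
C^{(N)}(\underline{r}, t) \;=\; \int d\underline{r}'\, P^{(N)}(\underline{r}, \underline{r}' \,|\, t)\, C^{(N)}(\underline{r}', 0),
```

    a statistically preserved structure is a left eigenfunction $Z^{(N)}$ of $P^{(N)}$ with eigenvalue 1, and the corresponding statistical constant of the motion is the projection

```latex
I^{(N)} \;=\; \int d\underline{r}\; Z^{(N)}(\underline{r})\, C^{(N)}(\underline{r}, t) \;=\; \text{const in } t,
```

    one such invariant for each $N$ and each eigenfunction, hence "infinitely many" constants. The notation here is schematic and not taken verbatim from the paper.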

    Code Llama: Open Foundation Models for Code

    We release Code Llama, a family of large language models for code based on Llama 2, providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.
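    HumanEval and MBPP scores like those quoted are conventionally reported as pass@k (typically pass@1). For reference, this is the standard unbiased estimator from the HumanEval paper (Chen et al., 2021): given n generated samples per problem of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k).

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator: probability that at least one of k
    # samples drawn (without replacement) from n generations passes,
    # given that c of the n pass.
    if n - c < k:
        return 1.0  # fewer failures than draws: a passing sample is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 5, 1))  # half the samples pass -> 0.5
```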

    “This is where it all started” – the pivotal role of PLCζ within the sophisticated process of mammalian reproduction: a systemic review

    Abstract: Mammalian reproduction is one of the most complex and fascinating biological phenomena; its purpose is to transfer maternal and paternal genetic material to the next generation. At the end of oogenesis and spermatogenesis, the two haploid gametes each contain a unique set of chromosomes ready to form a zygote, the first cell of the new developing individual. The mature oocyte and the spermatozoa remain in a quiescent state during which the oocyte is characterized by cytoplasmic and nuclear arrest, whereas spermatozoa require further maturation in the epididymis and the female genital tract before they can fertilize the oocyte. Whether in vivo or in vitro, the spermatozoon initiates a series of biochemical and physiological changes in the oocyte. The first signal detected after fertilization consists of cytosolic Ca2+ oscillations, a prerequisite for embryonic development. These oscillations release the oocyte from its arrest in the second meiosis and into embryogenesis, a phenomenon known as "oocyte activation". The phospholipase C zeta isoform (PLCζ) is the only soluble sperm protein capable of activating, in the oocyte, the inositol trisphosphate/Ca2+ signaling pathway that leads to Ca2+ oscillations and consequently to embryo development. Compared with other PLCs, the specific structure of PLCζ confers on it a specialized activity via the conserved X and Y catalytic domains, as well as distinctive characteristics such as rapid triggering, high Ca2+ sensitivity, and cessation of oscillations at zygote formation. Recent discoveries about PLCζ have prompted studies focused on the possible clinical applications of this protein in the assessment and management of male infertility during IVF/ICSI.
    Fertilization failure is attributed to the absence of resumption of the oocyte's second meiosis, suggesting that ICSI failure may be linked to defective PLCζ activity in the spermatozoon. Microinjection of recombinant human PLCζ into human oocytes after ICSI fertilization failure could trigger Ca2+ oscillations and allow successful fertilization, offering new hope to couples traditionally directed toward sperm donation. However, more studies are needed before this approach can be implemented clinically. Directions for future studies are discussed.