3 research outputs found

    BWSNet: Automatic Perceptual Assessment of Audio Signals

    Full text link
    This paper introduces BWSNet, a model that can be trained from raw human judgements obtained through a Best-Worst Scaling (BWS) experiment. It maps sound samples into an embedded space that represents the perception of a studied attribute. To this end, we propose a set of cost functions and constraints, interpreting trial-wise ordinal relations as distance comparisons in a metric learning task. We tested our proposal on data from two BWS studies investigating the perception of speech social attitudes and timbral qualities. For both datasets, our results show that the structure of the latent space is faithful to human judgements.
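    The abstract does not spell out the cost functions, so the following is only a minimal sketch of what "interpreting trial-wise ordinal relations as distance comparisons" could look like, not the method of the paper. It assumes that, within one BWS trial, the items chosen as best and worst should end up farther apart in the embedded space than any other pair of items from that trial. The function names, the Euclidean distance, the hinge form and the `margin` parameter are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def bws_trial_hinge_loss(embeddings, best, worst, margin=0.1):
    """Hypothetical hinge loss for a single BWS trial (illustration only).

    embeddings: list of vectors, one per item shown in the trial.
    best, worst: indices of the items the judge picked as best and worst.
    Assumption: the (best, worst) pair should be at least `margin` farther
    apart than every other pair of items in the same trial.
    """
    d_extreme = euclidean(embeddings[best], embeddings[worst])
    loss = 0.0
    for i, j in combinations(range(len(embeddings)), 2):
        if {i, j} == {best, worst}:
            continue  # skip the reference pair itself
        # Penalise any pair that is not closer than the extreme pair by `margin`.
        loss += max(0.0, margin + euclidean(embeddings[i], embeddings[j]) - d_extreme)
    return loss


# Four items embedded on a line; the judge picked item 3 as best and item 0 as worst.
trial = [[0.0], [1.0], [2.0], [3.0]]
consistent = bws_trial_hinge_loss(trial, best=3, worst=0)
inconsistent = bws_trial_hinge_loss(trial, best=1, worst=2)
```

    In this toy trial the first call incurs no penalty because the chosen extremes are already the most distant pair, while the second call is penalised because several other pairs are farther apart than the chosen extremes.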

    Conversion neuronale des attitudes sociales dans les signaux de parole (Neural conversion of social attitudes in speech signals)

    No full text
    As social animals, humans communicate with each other by transmitting various types of information about the world and about themselves. At the heart of this process, the voice allows the transmission of linguistic messages denoting a strict meaning that can be decoded by the interlocutor. By conveying other information, such as attitudes or emotions that connote the strict meaning, the voice enriches and enhances the communication process. In the last few decades, the digital world has become an important part of our lives. In many everyday situations, we are moving away from keyboards, mice and even touch screens towards interactions with voice assistants or even virtual agents that enable human-like communication with machines. With the emergence of a hybrid world where physical and virtual reality coexist, it becomes crucial to enable machines to capture, interpret, and replicate the emotions and attitudes conveyed by the human voice. This research focuses on speech social attitudes, which can be defined, in a context of interaction, as speech dispositions towards others, and aims to develop algorithms for their conversion. Fulfilling this objective requires data, i.e. a collection of audio recordings of utterances conveying various vocal attitudes. This research therefore builds on an initial step of gathering raw material: a dataset dedicated to speech social attitudes. Designing such algorithms involves a thorough understanding of what these attitudes are, both in terms of production - how do individuals use their vocal apparatus to produce attitudes? - and perception - how do they decode those attitudes in speech? We therefore conducted two studies: a first uncovering the production strategies of speech attitudes, and a second, based on a Best-Worst Scaling (BWS) experiment, mainly highlighting biases involved in the perception of such vocal attitudes, thus providing a twofold account of how speech attitudes are communicated by French individuals. These findings were the basis for the choice of speech signal representation as well as the architectural and optimisation choices for the design of a speech attitude conversion algorithm. To extend the knowledge on the perception of vocal attitudes gathered during this second study to the whole database, we developed a BWS-Net allowing the detection of mis-communicated attitudes, which provided clean data for conversion learning. To learn how to convert vocal attitudes, we adopted a transformer-based approach in a many-to-many conversion paradigm, with the mel-spectrogram as speech signal representation. Since early experiments revealed a loss of intelligibility in the converted utterances, we proposed a linguistic conditioning of the conversion algorithm through the incorporation of a speech-to-text module. Both objective and subjective measures have shown that the resulting algorithm achieves better performance than the baseline transformer, both in terms of intelligibility and of the attitude conveyed.

    BWSNET: automatic perceptual assessment of audio signals

    No full text
    This paper introduces BWSNet, a model that can be trained from raw human judgements obtained through a Best-Worst Scaling (BWS) experiment. It maps sound samples into an embedded space that represents the perception of a studied attribute. To this end, we propose a set of cost functions and constraints, interpreting trial-wise ordinal relations as distance comparisons in a metric learning task. We tested our proposal on data from two BWS studies investigating the perception of speech social attitudes and timbral qualities. For both datasets, our results show that the structure of the latent space is faithful to human judgements.