5 research outputs found

    Recovering implicit pitch contours from formants in whispered speech

    Whispered speech is characterised by a noise-like excitation that results in the absence of a fundamental frequency. Since prosodic phenomena such as intonation are perceived through f0 variation, whispered prosody is comparatively difficult to perceive. At the same time, studies have shown that speakers do attempt to produce intonation when whispering and that prosodic variability is still transmitted, suggesting that intonation "survives" in the whispered formant structure. In this paper, we aim to estimate how formant contours correlate with an "implicit" pitch contour in whisper, using a machine learning model. We propose a two-step method: using a parallel corpus, we first transform the whispered formants into their phonated equivalents with a denoising autoencoder; we then analyse the formant contours to predict phonated pitch contour variation. We observe that our method is effective in establishing a relationship between whispered and phonated formants and in uncovering implicit pitch contours in whisper.
    Comment: 5 pages, 3 figures, 2 tables. Accepted at ICPhS 202
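
    As a rough illustration of the two-step idea described above (not the authors' code; the architecture sizes, context window and dummy data are assumptions), the sketch below pairs a frame-wise denoising autoencoder that maps whispered formant frames to phonated-like ones with a small regressor that predicts an implicit f0 value per frame.

        # Hedged sketch in PyTorch: whispered formants -> phonated-like formants -> f0 contour.
        import torch
        import torch.nn as nn

        class FormantDAE(nn.Module):
            """Frame-wise denoising autoencoder over stacked formant context windows."""
            def __init__(self, n_formants=3, context=5, hidden=64, bottleneck=16):
                super().__init__()
                d = n_formants * context  # each input frame stacks a short context window
                self.encoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                             nn.Linear(hidden, bottleneck), nn.ReLU())
                self.decoder = nn.Sequential(nn.Linear(bottleneck, hidden), nn.ReLU(),
                                             nn.Linear(hidden, d))

            def forward(self, x):
                return self.decoder(self.encoder(x))

        class PitchRegressor(nn.Module):
            """Predicts one (normalised) pitch value per frame from the mapped formants."""
            def __init__(self, n_formants=3, context=5, hidden=32):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(n_formants * context, hidden),
                                         nn.ReLU(), nn.Linear(hidden, 1))

            def forward(self, x):
                return self.net(x).squeeze(-1)

        # Dummy usage: 200 frames of whispered formant features in, an f0 contour out.
        whispered = torch.randn(200, 15)          # 200 frames x (3 formants * 5-frame context)
        dae, reg = FormantDAE(), PitchRegressor()
        phonated_like = dae(whispered)            # step 1: whisper -> phonated-like formants
        f0_contour = reg(phonated_like)           # step 2: mapped formants -> implicit pitch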

    Speaker-independent neural formant synthesis

    We describe speaker-independent speech synthesis driven by a small set of phonetically meaningful speech parameters such as formant frequencies. The intention is to leverage deep-learning advances to provide a highly realistic signal generator that includes control affordances required for stimulus creation in the speech sciences. Our approach turns input speech parameters into predicted mel-spectrograms, which are rendered into waveforms by a pre-trained neural vocoder. Experiments with WaveNet and HiFi-GAN confirm that the method achieves our goals of accurate control over speech parameters combined with high perceptual audio quality. We also find that the small set of phonetically relevant speech parameters we use is sufficient to allow for speaker-independent synthesis (a.k.a. universal vocoding).
    Comment: 5 pages, 4 figures. Article accepted at INTERSPEECH 202
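
    To make the parameter-to-waveform pipeline concrete, here is a minimal sketch (not the paper's implementation; the parameter set, network shape and the placeholder vocoder call are assumptions): a recurrent network maps a small vector of control parameters per frame to mel-spectrogram frames, which a pre-trained neural vocoder such as HiFi-GAN would then render to audio.

        # Hedged sketch in PyTorch: control parameters -> mel-spectrogram (-> vocoder).
        import torch
        import torch.nn as nn

        class ParamsToMel(nn.Module):
            def __init__(self, n_params=7, n_mels=80, hidden=256):
                super().__init__()
                # A bidirectional GRU gives each frame temporal context that the
                # frame-local parameters (e.g. F1-F4, f0, energy, voicing) lack.
                self.rnn = nn.GRU(n_params, hidden, batch_first=True, bidirectional=True)
                self.proj = nn.Linear(2 * hidden, n_mels)

            def forward(self, params):            # params: (batch, frames, n_params)
                h, _ = self.rnn(params)
                return self.proj(h)               # (batch, frames, n_mels)

        model = ParamsToMel()
        params = torch.randn(1, 300, 7)           # 300 frames of speech parameters
        mel = model(params)
        # waveform = pretrained_vocoder(mel)      # placeholder: any pre-trained mel-to-wave vocoder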

    Definition and extraction of features in audiovisual programmes for automatic topic recognition

    In this project we study the main characteristics of different types of videos broadcast on television and design the initial stage of an application that analyses these characteristics in order to classify the different programmes according to their topic

    Preserving Speech Privacy in Interactions with Ad Hoc Sensor Networks

    Speech is our main method of communication: it allows us to intuitively convey complex ideas and to give our messages deeper meaning than their lexical content alone. For example, we can stress specific words to add or remove emphasis on different parts of a sentence. It is therefore natural that voice user interfaces, which let us interact with our electronic devices through speech, have become increasingly popular and are expected to keep growing in the coming years. Any device with which we can interact using our voice can be considered a voice user interface, and they span a great variety of services, from telecommunication applications like Zoom or Skype to virtual assistants like Alexa or Siri. However, in order to provide better services and more natural interactions, voice user interfaces need to gather large amounts of our speech data and transmit it, usually without us being aware of it. If that data is misused, or an unauthorised user manages to obtain it, the result is a serious violation of the user's privacy.

    In an environment where multiple electronic devices can provide a voice user interface, collaboration between them as a wireless acoustic sensor network can improve the services they provide individually. It is therefore important to study applications that require sending our voice to a remote party and, in a scenario where multiple devices can pick up the voices of multiple users, it is crucial to define which of these devices are actually allowed to record a given user's speech. For example, if a user's voice leaks into another user's interaction and is transmitted to a destination that the user has not explicitly authorised, their privacy is violated. As a solution, if our devices could perceive privacy the same way we do, they could adapt the information they share to protect the users' personal data. For that reason, we need to analyse how users perceive privacy in their spoken interactions, and from this derive rules that our devices can follow when they provide a voice user interface.

    In this thesis we study methods to recognise when two devices are located in the same acoustic space based on the audio signals that they record. We show how acoustic fingerprints can be used to securely share audio information between devices and to estimate their physical proximity. We also generate a speech corpus of conversational scenarios to analyse the effect that the acoustic properties of the environment have on the level of privacy that we perceive. Finally, we develop source separation methods to remove the voices of interfering speakers in a multi-device scenario, thus protecting the privacy of external users.
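
    As one concrete, purely illustrative reading of the fingerprint idea above (the fingerprint design, parameters and example data are assumptions, not the thesis method), the sketch below computes a binary spectral fingerprint per device and compares fingerprints by the fraction of matching bits; devices recording the same acoustic scene should agree on most bits without ever exchanging raw audio.

        # Hedged sketch: binary band-energy fingerprints compared by bit agreement.
        import numpy as np

        def fingerprint(audio, n_fft=1024, hop=512, n_bands=17):
            """Sign of second-order band-energy differences (one bit per band pair per frame)."""
            frames = np.lib.stride_tricks.sliding_window_view(audio, n_fft)[::hop]
            spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
            edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
            bands = np.stack([spec[:, a:b].sum(axis=1)
                              for a, b in zip(edges[:-1], edges[1:])], axis=1)
            d = np.diff(bands, axis=1)            # difference across adjacent bands
            return (d[1:] - d[:-1]) > 0           # difference across consecutive frames

        def similarity(fp_a, fp_b):
            """Fraction of matching bits; values near 1.0 suggest a shared acoustic scene."""
            n = min(len(fp_a), len(fp_b))
            return np.mean(fp_a[:n] == fp_b[:n])

        # Example: two devices picking up the same source (plus device noise) look similar.
        sr = 16000
        src = np.random.randn(2 * sr)
        dev_a = src + 0.05 * np.random.randn(2 * sr)
        dev_b = src + 0.05 * np.random.randn(2 * sr)
        print(similarity(fingerprint(dev_a), fingerprint(dev_b)))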

    A processing framework to access large quantities of whispered speech found in ASMR

    Whispering is a ubiquitous mode of communication that humans use daily. Despite this, whispered speech has been poorly served by existing speech technology due to a shortage of resources and processing methodology. To remedy this, this paper provides a processing framework that enables access to a large and unique collection of high-quality whispered speech. We obtain the data from recordings submitted to online platforms as part of the ASMR media-cultural phenomenon. We describe our processing pipeline and a method for improved whispered activity detection (WAD) in the ASMR data. To efficiently obtain labelled, clean whispered speech, we complement the automatic WAD by using Edyson, a bulk audio annotation tool with a human in the loop. We also tackle a problem particular to ASMR: separation of whisper from other acoustic triggers present in the genre. We show that the proposed WAD and the efficient labelling allow us to build extensively augmented data and train a classifier that extracts clean whisper segments from ASMR audio. Our large and growing dataset enables whisper-capable, data-driven speech technology and linguistic analysis. It also opens opportunities in, e.g., HCI as a resource that may elicit emotional, psychological and neuro-physiological responses in the listener.
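
    As a crude stand-in for the whispered activity detection step (not the paper's WAD model; the features and thresholds below are illustrative assumptions), one can flag frames that carry energy but look noise-like, using frame energy together with spectral flatness:

        # Hedged sketch: energy + spectral-flatness heuristic for whisper-like frames.
        import numpy as np

        def frame_features(audio, n_fft=1024, hop=256):
            frames = np.lib.stride_tricks.sliding_window_view(audio, n_fft)[::hop]
            spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2 + 1e-12
            energy = spec.sum(axis=1)
            # Spectral flatness: geometric mean over arithmetic mean of the power spectrum.
            flatness = np.exp(np.mean(np.log(spec), axis=1)) / np.mean(spec, axis=1)
            return energy, flatness

        def whisper_like_frames(audio, energy_thresh=1e-3, flatness_thresh=0.1):
            """True where a frame is above the energy floor yet spectrally flat (noise-like).
            Thresholds are placeholders and would need calibration on real data."""
            energy, flatness = frame_features(audio)
            return (energy > energy_thresh) & (flatness > flatness_thresh)

        # Example on synthetic noise, standing in for whispered excitation.
        audio = 0.1 * np.random.randn(16000)
        mask = whisper_like_frames(audio)
        print(f"{mask.mean():.2f} of frames flagged as whisper-like")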