    Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables

    Full text link
    This paper introduces GlOttal-flow LPC Filter (GOLF), a novel method for singing voice synthesis (SVS) that exploits the physical characteristics of the human voice using differentiable digital signal processing. GOLF employs a glottal model as the harmonic source and IIR filters to simulate the vocal tract, resulting in an interpretable and efficient approach. We show it is competitive with state-of-the-art singing voice vocoders, requiring fewer synthesis parameters and less memory to train, and runs an order of magnitude faster for inference. Additionally, we demonstrate that GOLF can model the phase components of the human voice, which has immense potential for rendering and analysing singing voices in a differentiable manner. Our results highlight the effectiveness of incorporating the physical properties of the human voice mechanism into SVS and underscore the advantages of signal-processing-based approaches, which offer greater interpretability and efficiency in synthesis. Audio samples are available at https://yoyololicon.github.io/golf-demo/. (Comment: 9 pages, 4 figures. Accepted at ISMIR 2023.)
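
    The source-filter idea at the heart of GOLF can be illustrated with a minimal, non-differentiable sketch: a glottal-flow-inspired harmonic source is shaped by an all-pole (LPC) filter standing in for the vocal tract. The sample rate, pitch, and filter coefficients below are illustrative assumptions, not values from the paper.

```python
# Minimal source-filter sketch (not the GOLF implementation).
import numpy as np
from scipy.signal import lfilter

sr = 16000                       # sample rate in Hz (assumed)
f0 = 220.0                       # fundamental frequency in Hz (assumed)
t = np.arange(sr) / sr           # one second of samples

# Crude glottal-flow-like source: decaying harmonics summed in phase,
# standing in for GOLF's wavetable of glottal-flow pulses.
source = sum((1.0 / k) * np.sin(2 * np.pi * k * f0 * t) for k in range(1, 20))

# Hypothetical all-pole (LPC) vocal-tract filter: two complex-conjugate
# poles near the unit circle create a single formant-like resonance.
a = [1.0, -1.8, 0.97]
voiced = lfilter([1.0], a, source)
voiced /= np.abs(voiced).max()   # normalise to [-1, 1] for playback
```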

    Voice Feature Extraction for Gender and Emotion Recognition

    Get PDF
    Voice recognition plays a key role in spoken communication, helping to identify the emotions reflected in a person's voice. Gender classification from speech is widely used in Human-Computer Interaction (HCI), as gender is not easy for a computer to identify. This motivated the development of a model for "Voice Feature Extraction for Emotion and Gender Recognition". The speech signal carries semantic information and speaker information (gender, age, emotional state), accompanied by noise. Female and male voices differ in their acoustic and perceptual characteristics, and the variety of emotions they express convey their own unique perceptions. Feature extraction therefore requires pre-processing of the data, which is necessary for increasing accuracy. The proposed model follows these steps: data extraction, pre-processing using a Voice Activity Detector (VAD), feature extraction using Mel-Frequency Cepstral Coefficients (MFCC), feature reduction by Principal Component Analysis (PCA), and classification with a Support Vector Machine (SVM). The proposed combination of techniques produced better results, which can be useful in the healthcare sector, virtual assistants, security, and other fields in the Human-Machine Interaction domain.
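
    As a rough illustration, the pipeline described above (VAD, then MFCC features, then PCA reduction, then an SVM) could be sketched as follows, assuming librosa and scikit-learn; the energy-based splitter stands in for the paper's VAD, and all dimensions are illustrative.

```python
# Hedged sketch of the VAD -> MFCC -> PCA -> SVM pipeline.
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def extract_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=16000)
    # Energy-based voice activity detection: keep only non-silent intervals.
    intervals = librosa.effects.split(y, top_db=30)
    voiced = np.concatenate([y[start:end] for start, end in intervals])
    mfcc = librosa.feature.mfcc(y=voiced, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)     # one fixed-length vector per clip

# X: rows of extract_features(...) outputs; y: gender or emotion labels
# (hypothetical data, not the paper's corpus).
# clf = make_pipeline(PCA(n_components=8), SVC(kernel="rbf"))
# clf.fit(X, y)
```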

    Tracking the Sound of Human Affection: EEG Signals Reveal Online Decoding of Socio-Emotional Expression in Human Speech and Voice

    Get PDF
    This chapter offers a perspective on the latest EEG evidence for how brain signals illuminate the neurophysiological and neurocognitive mechanisms underlying the recognition of socio-emotional expression conveyed in human speech and voice, drawing upon event-related potential (ERP) studies. Human sound can encode emotional meaning through different vocal parameters in words, real vs. pseudo-speech, and vocalizations. Based on the ERP findings, recent development of the three-stage model of vocal processing has highlighted initial- and late-stage processing of vocal emotional stimuli. These processes, depending on which ERP components they map onto, can be divided into acoustic analysis, relevance and motivational processing, fine-grained meaning analysis/integration/access, and higher-level social inference, as they unfold over time. ERP studies on vocal socio-emotions such as happiness, anger, fear, sadness, neutrality, sincerity, confidence, and sarcasm in the human voice and speech have employed different experimental paradigms, such as cross-splicing, cross-modality priming, oddball, and Stroop. Moreover, task demands and listener characteristics affect the neural responses underlying the decoding processes, revealing the role of attention deployment and interpersonal sensitivity in the neural decoding of vocal emotional stimuli. Cultural orientation also affects our ability to decode emotional meaning in the voice. Neurophysiological patterns have been compared between normal and abnormal emotional processing of vocal expressions, especially in schizophrenia and in congenital amusia. Future directions highlight studying human vocal expression in alignment with other nonverbal cues, such as facial and body language, and the need to synchronize listeners' brain potentials with other peripheral measures.

    Speaker identification based on hybrid feature extraction techniques

    Get PDF
    Speech processing is one of the most exciting areas of signal processing; speech contains many features that can discriminate a person's identity, and the human voice is considered an important biometric characteristic for person identification. This work studies the effect on speaker identification of features extracted from various levels of the discrete wavelet transform (DWT), of concatenating two techniques (the discrete wavelet and curvelet transforms), and of reducing the number of features using principal component analysis (PCA). A backpropagation (BP) neural network was introduced as the classifier.
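
    A minimal sketch of the hybrid feature idea, assuming PyWavelets for the DWT and scikit-learn's MLPClassifier as the backpropagation-trained network; the curvelet branch is omitted here, and the wavelet, decomposition level, and statistics are illustrative choices rather than the paper's settings.

```python
# Hedged sketch: DWT-level statistics -> PCA -> backpropagation network.
import numpy as np
import pywt
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

def dwt_features(signal, wavelet="db4", level=4):
    # Simple statistics of each decomposition level act as speaker features.
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([stat(c) for c in coeffs for stat in (np.mean, np.std)])

# X: stacked dwt_features(...) vectors; y: speaker labels (hypothetical data).
# clf = make_pipeline(PCA(n_components=6),
#                     MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000))
# clf.fit(X, y)
```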

    Blood Pressure Estimation from Speech Recordings: Exploring the Role of Voice-over Artists

    Get PDF
    Hypertension, a prevalent global health concern, is associated with cardiovascular disease and significant morbidity and mortality. Accurate and prompt blood pressure monitoring is crucial for early detection and successful management. Traditional cuff-based methods can be inconvenient, motivating the exploration of non-invasive, continuous estimation methods. This research aims to bridge the gap between speech processing and health monitoring by investigating the relationship between speech recordings and blood pressure estimation. Speech recordings offer promise for non-invasive blood pressure estimation because of the potential link between vocal characteristics and physiological responses. In this study, we focus on the role of voice-over artists, known for their ability to convey emotions through voice. By exploring voice-over artists' expertise in controlling speech and expressing emotion, we seek insights into the potential correlation between speech characteristics and blood pressure. The study presents an innovative and convenient approach to health assessment and, by unraveling the specific role of voice-over artists in this process, lays a foundation for future advances in healthcare and human-robot interaction. Leveraging their expertise in conveying emotion through voice enriches our understanding of the intricate relationship between speech recordings and physiological responses, opening new avenues for integrating voice-related factors into healthcare technologies.

    The acoustics of concentric sources and receivers – human voice and hearing applications

    Get PDF
    One of the most common ways we experience environments acoustically is by listening to the reflections of our own voice in a space; by listening to our own voice, we adjust its characteristics to suit the task and audience. This is particularly important in critical voice tasks, such as actors or singers performing on a stage without electroacoustic or other amplification (e.g., in-ear monitors or loudspeakers). Despite how common this situation is, there are very few acoustic measurements aimed at quantifying it, and even fewer that address the problem of a source and receiver located very close together. The aim of this thesis is to introduce new measurement transducers and methods that correctly quantify this situation. This is achieved by analysing the characteristics of the human as a source, as a receiver, and their interaction in close proximity when placed in acoustical environments. The characteristics of the human voice and human ear are analysed in the same manner as a loudspeaker or microphone would be, making them analogous to measurement transducers and providing the basis for further analysis. These results are then used to explore the consequences of a closely located source and receiver using acoustic room simulation. Different techniques for processing data from directional transducers in real rooms are introduced; the majority of the data used in this thesis was obtained in rooms used for performance. The final chapters detail the design and construction of a concentric directional transducer, in which an array of microphones and loudspeakers occupies the same structure. Finally, sample measurements with this transducer are presented.

    Breaking voice identity perception: Expressive voices are more confusable for listeners.

    Get PDF
    The human voice is a highly flexible instrument for self-expression, yet voice identity perception is largely studied using controlled speech recordings. Using two voice-sorting tasks with naturally varying stimuli, we compared the performance of listeners who were familiar and unfamiliar with the TV show Breaking Bad. Listeners organised audio clips of (1) low-expressiveness and (2) high-expressiveness speech into perceived identities. We predicted that increased expressiveness (e.g., shouting, a strained voice) would significantly impair performance. Overall, while unfamiliar listeners were less able to generalise identity across exemplars, the two groups were equally good at telling voices apart for low-expressiveness stimuli. However, high vocal expressiveness significantly impaired telling apart in both groups, leading to increased misidentifications in which sounds from one character were assigned to the other. These misidentifications were highly consistent for familiar listeners but less consistent for unfamiliar listeners. Our data suggest that vocal flexibility has powerful effects on identity perception: changes in the acoustic properties of vocal signals introduced by expressiveness produce effects apparent in familiar and unfamiliar listeners alike. At the same time, expressiveness appears to have affected other aspects of voice identity processing selectively in one listener group but not the other, revealing complex interactions of stimulus properties and listener characteristics (i.e., familiarity) in identity processing.

    Poetry in Pandemic: A Multimodal Neuroaesthetic Study on the Emotional Reaction to the Divina Commedia Poem

    Get PDF
    Poetry elicits emotions, and emotion is a fundamental component of human ontogeny. Although neuroaesthetics is a rapidly developing field of research, few studies focus on poetry, and none address the different modalities of fruition (MOF) of universal cultural-heritage works such as the Divina Commedia (DC). Moreover, alexithymia (AX) emerged as a psychological risk factor during the COVID-19 pandemic. The present study investigates the emotional response to poetry excerpts from the different cantiche (Inferno, Purgatorio, Paradiso) of the DC, with the dual objective of assessing the impact of the poem's structure and MOF, and that of the characteristics of the acting voice, in experts and non-experts, also considering AX. Online emotional facial-coding biosignal (BS) techniques and self-reported and psychometric measures were applied to 131 literary (LS) and scientific (SS) university students. The BS results show that LS globally manifest more JOY than SS in both the reading and listening MOF, and more FEAR towards Inferno. Furthermore, LS and SS differ in NEUTRAL emotion with respect to the acting voice. AX influences listening in NEUTRAL and SURPRISE expressions. The DC's structure affects DISGUST and SADNESS during listening, regardless of participant characteristics. PLEASANTNESS varies according to the DC's structure and the acting voice, as does AROUSAL, which also correlates with AX. Results are discussed in light of recent findings in affective neuroscience and neuroaesthetics, suggesting the critical role of poetry and listening in supporting human emotional processing.

    Hey Dona! Can you help me with student course registration?

    Full text link
    In this paper, we present a demo of an intelligent personal agent called Hey Dona (or just Dona) that provides virtual voice assistance for student course registration. It is a deployed project in the theme of AI for education. In this digital age, with its myriad smart devices, users often delegate tasks to agents; just as pointing and clicking superseded command-typing, modern devices let users speak commands for agents to execute, enhancing speed and convenience. In line with this progress, Dona is an intelligent agent that caters to student needs through automated, voice-operated course registration, spanning a multitude of accents, with task-planning optimization and some language translation as needed. Dona accepts voice input by microphone (Bluetooth or wired), converts human voice to computer-understandable language, performs query processing according to user commands, connects to the Web to search for answers, models task dependencies, incorporates quality control, and conveys output by speaking to users as well as displaying text, thus enabling human-AI interaction by speech and text. It is meant to work seamlessly on desktops, smartphones, etc., in indoor as well as outdoor settings. To the best of our knowledge, Dona is among the first intelligent personal agents for voice assistance in student course registration. Due to its ubiquitous access for educational needs, Dona directly impacts AI for education. It makes a broader impact on the smart-city characteristics of smart living and smart people through its contributions to new ways of living and to assisting 21st-century education, respectively.
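
    The listen-transcribe-respond loop that the abstract outlines can be approximated with off-the-shelf libraries; the sketch below is not the deployed Dona system, and both speech_recognition and pyttsx3 are assumed stand-ins for its actual components.

```python
# Illustrative voice-command loop: listen -> transcribe -> process -> reply
# by speech and text. Not the deployed Dona pipeline.
import speech_recognition as sr
import pyttsx3

recognizer = sr.Recognizer()
tts = pyttsx3.init()

def handle_command(text):
    # Placeholder for query processing, web search, and task planning.
    return f"You asked to: {text}"

with sr.Microphone() as mic:
    recognizer.adjust_for_ambient_noise(mic)
    audio = recognizer.listen(mic)

command = recognizer.recognize_google(audio)   # speech -> text (needs internet)
reply = handle_command(command)
print(reply)                                   # text output
tts.say(reply)                                 # spoken output
tts.runAndWait()
```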