28,340 research outputs found
Speech Processing in Computer Vision Applications
Deep learning has recently proven to be a viable tool for extracting features in speech analysis. Methods such as convolutional neural networks expand specific feature information in waveforms, allowing networks to create more feature-dense representations of data. Our work addresses two problems using deep learning methods: re-creating a face from a speaker's voice, and speaker identification. We first review the fundamental background in speech processing and its related applications. We then introduce novel deep learning-based methods for speech feature analysis. Finally, we present our deep learning approaches to speaker identification and speech-to-face synthesis. The presented method converts a speaker's audio sample to an image of their predicted face. The framework is composed of several chained networks, each performing an essential step in the conversion process: audio embedding, encoding, and face generation, respectively. Our experiments show that certain voice features map to the face, that DNNs can generate a speaker's face from their voice, and that a GUI can be used to display a speaker recognition network's data
Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders
An effective approach to non-parallel voice conversion (VC) is to utilize
deep neural networks (DNNs), specifically variational auto encoders (VAEs), to
model the latent structure of speech in an unsupervised manner. A previous
study has confirmed the effectiveness of VAE using the STRAIGHT spectra for
VC. However, VAE using other types of spectral features such as mel-cepstral
coefficients (MCCs), which are related to human perception and have been
widely used in VC, have not been properly investigated. Instead of using one
specific type of spectral feature, it is expected that VAE may benefit from
using multiple types of spectral features simultaneously, thereby improving
the capability of VAE for VC. To this end, we propose a novel VAE framework
(called cross-domain VAE, CDVAE) for VC. Specifically, the proposed framework
utilizes both STRAIGHT spectra and MCCs by explicitly regularizing multiple
objectives in order to constrain the behavior of the learned encoder and
decoder. Experimental results demonstrate that the proposed CDVAE framework
outperforms the conventional VAE framework in terms of subjective tests.
Comment: Accepted to ISCSLP 201
Community Foundations: Learning from a Collective Experience: Process of Systematization
The report of a community foundation strengthening program involving eight Mexican community foundations: Tecate CF, Frontera Norte CF, Matamoros CF, Oaxaca CF, Puebla CF, Fundación Comunidad, Fundación del Empresariado Chihuahuense (FECHAC), and Fundación Internacional de la Comunidad (FIC). The report is also available in Spanish
Speaker-normalized sound representations in the human auditory cortex
The acoustic dimensions that distinguish speech sounds (like the vowel differences in “boot” and “boat”) also differentiate speakers’ voices. Therefore, listeners must normalize across speakers without losing linguistic information. Past behavioral work suggests an important role for auditory contrast enhancement in normalization: preceding context affects listeners’ perception of subsequent speech sounds. Here, using intracranial electrocorticography in humans, we investigate whether and how such context effects arise in auditory cortex. Participants identified speech sounds that were preceded by phrases from two different speakers whose voices differed along the same acoustic dimension as target words (the lowest resonance of the vocal tract). In every participant, target vowels evoke a speaker-dependent neural response that is consistent with the listener’s perception, and which follows from a contrast enhancement model. Auditory cortex processing thus displays a critical feature of normalization, allowing listeners to extract meaningful content from the voices of diverse speakers
Exploring the Margins of Kotha Culture: Reconstructing a Courtesan’s life in Neelum Saran Gour’s Requiem in Raga Janki
In their article, “Exploring the Margins of Kotha Culture: Reconstructing a Courtesan’s life in Neelum Saran Gour’s Requiem in Raga Janki,” Chhandita Das and Priyanka Tripathi discuss the invisible challenges in the life of the famous courtesan Janki Bai Ilahabadi through a close analysis of Neelum Saran Gour’s 2018 novel, Requiem in Raga Janki. In this novel, Janki belongs to the infamous kotha, but she never fails to seek her subjectivity. This marginal place of Janki’s belonging is discussed by appropriating the theoretical framework of Indian feminist Lata Singh (2007), for whom courtesans have been represented as “‘other’ in history” (1677). In addition to Singh, bell hooks’ “margin as a space of radical openness” (Yearning 228) and Veena Oldenburg’s spectacular scholarship on courtesans in “Lifestyle as Resistance” (1990) are synthesized to deconstruct the social hierarchy. Although baijis or tawaifs in India possess a rich artistic heritage, they have surprisingly often occupied a questionable space in which their individual and social integrity has been compromised. Gour attempts to rewrite the life of a courtesan from Allahabad and in the process creates an alternative discourse or understanding of a courtesan’s life through Janki, matron, yes! not patron of Indian classical music and tradition
Tensions in Creating Possibilities for Youth Voice in School Choice: An Ethnographer’s Reflexive Story of Research
The following article relates a reflexive ethnographic research project that focuses on youth voice in relation to the process of choosing a high school and a language of instruction in Ontario, Canada. The purpose of this methodological article is to relate a story of research and explore the tensions between theory and practice experienced by a young researcher during and after fieldwork. To do so, I explore the theoretical and epistemological underpinnings of the relevance and importance of youth-centred research and uncover some of the complexities of conducting participant observation, interviews, and co-analysis activities with youth participants. The abstract is also provided in French
Reconstruction of Phonated Speech from Whispers Using Formant-Derived Plausible Pitch Modulation
Whispering is a natural, unphonated, secondary aspect of speech communications for most people. However, it is the primary mechanism of communications for some speakers who have impaired voice production mechanisms, such as partial laryngectomees, as well as for those prescribed voice rest, which often follows surgery or damage to the larynx. Unlike most people, who choose when to whisper and when not to, these speakers may have little choice but to rely on whispers for much of their daily vocal interaction.
Even though most speakers will whisper at times, and some speakers can only whisper, the majority of today’s computational speech technology systems assume or require phonated speech. This article considers conversion of whispers into natural-sounding phonated speech as a noninvasive prosthetic aid for people with voice impairments who can only whisper. As a by-product, the technique is also useful for unimpaired speakers who choose to whisper.
Speech reconstruction systems can be classified into those requiring training and those that do not. Among the latter, a recent parametric reconstruction framework is explored and then enhanced through a refined estimation of plausible pitch from weighted formant differences. The improved reconstruction framework, with proposed formant-derived artificial pitch modulation, is validated through subjective and objective comparison tests alongside state-of-the-art alternatives