6,254 research outputs found

Singing voice separation: a study on training data

Author: Hennequin Romain
Prétet Laure
Royo-Letelier Jimena
Vaglio Andrea
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 06/06/2019

In recent years, singing voice separation systems have shown increased performance due to the use of supervised training. The design of training datasets is known to be a crucial factor in the performance of such systems. We investigate how the characteristics of the training dataset impact the separation performance of state-of-the-art singing voice separation algorithms. We show that separation quality and diversity are two important and complementary assets of a good training dataset. We also provide insights on possible transforms to perform data augmentation for this task.

arXiv.org e-Print Archive

Crossref

BaNa: a noise resilient fundamental frequency detection algorithm for speech and music

Author: Ba He
Cai Weiyang
Heinzelman Wendi
Seyfettin Demirkol Ilker
Yang Na
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2014

Fundamental frequency (F0) is one of the essential features in many acoustic-related applications. Although numerous F0 detection algorithms have been developed, the detection accuracy in noisy environments still needs improvement. We present a hybrid noise-resilient F0 detection algorithm named BaNa that combines the approaches of harmonic ratios and Cepstrum analysis. A Viterbi algorithm with a cost function is used to identify the F0 value among several F0 candidates. Speech and music databases with eight different types of additive noise are used to evaluate the performance of the BaNa algorithm and several classic and state-of-the-art F0 detection algorithms. Results show that for almost all types of noise and signal-to-noise ratio (SNR) values investigated, BaNa achieves the lowest Gross Pitch Error (GPE) rate among all the algorithms. Moreover, for the 0 dB SNR scenarios, the BaNa algorithm is shown to achieve 20% to 35% GPE rate for speech and 12% to 39% GPE rate for music. We also describe implementation issues that must be addressed to run the BaNa algorithm as a real-time application on a smartphone platform. Peer reviewed. Postprint (author's final draft).
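The Gross Pitch Error rate quoted in this abstract can be illustrated with a minimal sketch. It assumes the common definition: the fraction of voiced frames whose estimated F0 deviates from the reference by more than a relative tolerance. The 10% tolerance, the function name, and the convention that a reference value of 0 marks an unvoiced frame are assumptions made for illustration, not details taken from the abstract.

```python
from typing import Sequence


def gross_pitch_error_rate(
    reference_f0: Sequence[float],
    estimated_f0: Sequence[float],
    tolerance: float = 0.10,  # assumed relative tolerance, not from the abstract
) -> float:
    """Return the fraction of voiced frames with a gross F0 error."""
    if len(reference_f0) != len(estimated_f0):
        raise ValueError("reference and estimate must have the same length")

    # Assumption: a reference F0 of 0 marks an unvoiced frame, which is skipped.
    voiced = [(ref, est) for ref, est in zip(reference_f0, estimated_f0) if ref > 0]
    if not voiced:
        return 0.0

    # A frame counts as a gross error when the deviation exceeds tolerance * reference.
    errors = sum(1 for ref, est in voiced if abs(est - ref) > tolerance * ref)
    return errors / len(voiced)


# Hypothetical frame-level F0 tracks in Hz.
reference = [100.0, 200.0, 0.0, 150.0]
estimate = [104.0, 260.0, 120.0, 150.0]
gpe = gross_pitch_error_rate(reference, estimate)
print(f"GPE rate: {gpe:.3f}")
```

In this made-up example, only the second frame (260 Hz against a 200 Hz reference) exceeds the tolerance, so one of three voiced frames is counted as an error.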

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios

Author: Chang Xuankai
Cornell Samuele
Garcia Paola
Khudanpur Sanjeev
Maciejewski Matthew
Masuyama Yoshiki
Raj Desh
Squartini Stefano
Wang Zhong-Qiu
Watanabe Shinji
Wiesner Matthew
Publication date: 14/07/2023

The CHiME challenges have played a significant role in the development and evaluation of robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices. Unlike previous challenges, we evaluate systems on 3 diverse scenarios: CHiME-6, DiPCo, and Mixer 6. The goal is for participants to devise a single system that can generalize across different array geometries and use cases with no a priori information. Another departure from earlier CHiME iterations is that participants are allowed to use open-source pre-trained models and datasets. In this paper, we describe the challenge design, motivation, and fundamental research questions in detail. We also present the baseline system, which is fully array-topology agnostic and features multi-channel diarization, channel selection, guided source separation, and a robust ASR model that leverages self-supervised speech representations (SSLR).

arXiv.org e-Print Archive

The third 'CHiME' speech separation and recognition challenge: Analysis and outcomes

Author: Anguera
Baby
Bagchi
Barker
Barker
Castro Martinez
DiBiase
Du
Emmanuel Vincent
Fletcher
Frigge
Fujita
Garofalo
Hermansky
Heymann
Hirsch
Hori
Jalalvand
Jon Barker
Kim
Loesch
Ma
Mestre
Mikolov
Moritz
Mostefa
Parihar
Pfeifenberger
Povey
Prudnikov
Renals
Ricard Marxer
Shinji Watanabe
Sivasankaran
Taal
Tachioka
Taghia
Veselý
Vincent
Vincent
Vu
Yoshioka
Zhao
Zhuang
Publication venue: 'Elsevier BV'
Publication date: 15/10/2016

This paper presents the design and outcomes of the CHiME-3 challenge, the first open speech recognition evaluation designed to target the increasingly relevant multichannel, mobile-device speech recognition scenario. The paper serves two purposes. First, it provides a definitive reference for the challenge, including full descriptions of the task design, data capture and baseline systems along with a description and evaluation of the 26 systems that were submitted. The best systems re-engineered every stage of the baseline, resulting in reductions in word error rate from 33.4% to as low as 5.8%. By comparing across systems, techniques that are essential for strong performance are identified. Second, the paper considers the problem of drawing conclusions from evaluations that use speech directly recorded in noisy environments. The degree of challenge presented by the resulting material is hard to control and hard to fully characterise. We attempt to dissect the various 'axes of difficulty' by correlating various estimated signal properties with typical system performance on a per-session and per-utterance basis. We find strong evidence of a dependence on signal-to-noise ratio and channel quality. Systems are less sensitive to variations in the degree of speaker motion. The paper concludes by discussing the outcomes of CHiME-3 in relation to the design of future mobile speech recognition evaluations.
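The two word error rates quoted in this abstract (33.4% for the baseline, 5.8% for the best systems) can be turned into a relative reduction with a short calculation. The sketch below is only illustrative arithmetic on those two figures; the function name is hypothetical and nothing else about the systems is assumed.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction of an error rate with respect to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Word error rates (in percent) as quoted in the abstract.
wer_reduction = relative_reduction(33.4, 5.8)
print(f"Relative WER reduction: {wer_reduction:.1%}")  # prints 82.6%
```

The absolute reduction is 27.6 percentage points, which corresponds to roughly 83% of the baseline errors being removed.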

Crossref

INRIA a CCSD electronic archive server

White Rose Research Online

HAL-Rennes 1

Multichannel Automatic Recognition of Voice Command in a Multi-Room Smart Home: an Experiment involving Seniors and Users with Visual Impairment

Author: Lecouteux Benjamin
Portet François
Vacher Michel
Publication venue: HAL CCSD
Publication date: 14/09/2014

Voice command systems in multi-room smart homes for assisting people with loss of autonomy in their daily activities must face several challenges, one of which is the distant condition, which impacts ASR system performance. This paper presents an approach to improve voice command recognition at the decoding level by using multiple sources and model adaptation. The method has been tested on data recorded with 11 elderly and visually impaired participants in a real smart home. The results show an error rate of 3.2% in the off-line condition and of 13.2% in the on-line condition.

Hal - Université Grenoble Alpes

Latent Class Model with Application to Speaker Diarization

Author: Chen Xianhong
He Liang
Johnson Michael T
Liu Jia
Liu Yi
Xu Can
Publication date: 24/04/2019

In this paper, we apply a latent class model (LCM) to the task of speaker diarization. LCM is similar to Patrick Kenny's variational Bayes (VB) method in that it uses soft information and avoids premature hard decisions in its iterations. In contrast to the VB method, which is based on a generative model, LCM provides a framework allowing both generative and discriminative models. The discriminative property is realized through the use of i-vector (Ivec), probabilistic linear discriminant analysis (PLDA), and a support vector machine (SVM) in this work. Systems denoted as LCM-Ivec-PLDA, LCM-Ivec-SVM, and LCM-Ivec-Hybrid are introduced. In addition, three further improvements are applied to enhance performance: 1) adding neighbor windows to extract more speaker information for each short segment; 2) using a hidden Markov model to avoid frequent speaker change points; 3) using agglomerative hierarchical clustering for initialization and to present hard and soft priors, in order to overcome the problem of sensitivity to initialization. Experiments on the National Institute of Standards and Technology Rich Transcription 2009 speaker diarization database, under the condition of a single distant microphone, show that the diarization error rate (DER) of the proposed methods has substantial relative improvements compared with mainstream systems. Compared to the VB method, the relative improvements of LCM-Ivec-PLDA, LCM-Ivec-SVM, and LCM-Ivec-Hybrid systems are 23.5%, 27.1%, and 43.0%, respectively. Experiments on our collected database, CALLHOME97, CALLHOME00 and SRE08 short2-summed trial conditions also show that the proposed LCM-Ivec-Hybrid system has the best overall performance.
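The diarization error rate (DER) used in this abstract can be sketched as follows. The sketch assumes the widely used definition, in which missed speech, false alarm speech, and speaker confusion time are summed and divided by the total reference speech time. The function name and the durations in the example are hypothetical and do not come from the paper.

```python
def diarization_error_rate(
    missed: float,
    false_alarm: float,
    confusion: float,
    total_speech: float,
) -> float:
    """Return DER as (missed + false alarm + confusion) / total reference speech.

    All arguments are durations in the same unit, for example seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations in seconds, chosen only to illustrate the formula.
der = diarization_error_rate(
    missed=30.0, false_alarm=20.0, confusion=50.0, total_speech=1000.0
)
print(f"DER: {der:.1%}")  # prints 10.0%
```

A relative improvement such as the 43.0% reported against the VB method is then a ratio between two DER values, not a difference in percentage points.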

arXiv.org e-Print Archive

University of Kentucky

FEARLESS STEPS Challenge (FS-2): Supervised Learning with Massive Naturalistic Apollo Data

Author: Hansen John H. L.
Joglekar Aditya
Sangwan Abhijeet
Shekar Meena Chandra
Publication date: 15/08/2020

The Fearless Steps Initiative by UTDallas-CRSS led to the digitization, recovery, and diarization of 19,000 hours of original analog audio data, as well as the development of algorithms to extract meaningful information from this multi-channel naturalistic data resource. The 2020 FEARLESS STEPS (FS-2) Challenge is the second annual challenge held for the Speech and Language Technology community to motivate supervised learning algorithm development for multi-party and multi-stream naturalistic audio. In this paper, we present an overview of the challenge sub-tasks, data, performance metrics, and lessons learned from Phase-2 of the Fearless Steps Challenge (FS-2). We present advancements made in FS-2 through extensive community outreach and feedback. We describe innovations in the challenge corpus development, and present revised baseline results. We finally discuss the challenge outcome and general trends in system development across both phases (Phase FS-1 Unsupervised, and Phase FS-2 Supervised) of the challenge, and its continuation into multi-channel challenge tasks for the upcoming Fearless Steps Challenge Phase-3. Comment: Paper accepted at the Interspeech 2020 Conference.

arXiv.org e-Print Archive

Crossref