PoLyScriber: Integrated Training of Extractor and Lyrics Transcriber for Polyphonic Music
Lyrics transcription of polyphonic music is challenging as the background
music affects lyrics intelligibility. Typically, lyrics transcription can be
performed by a two-step pipeline, i.e., a singing vocal extraction frontend
followed by a lyrics transcriber backend, where the frontend and backend are
trained separately. Such a two-step pipeline suffers from both imperfect vocal
extraction and mismatch between frontend and backend. In this work, we propose
a novel end-to-end integrated training framework, which we call PoLyScriber, to
globally optimize the vocal extractor frontend and lyrics transcriber backend
for lyrics transcription in polyphonic music. The experimental results show
that our proposed integrated training model achieves substantial improvements
over the existing approaches on publicly available test datasets.
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks, whose purpose
is to extract either one or more target speech signals, respectively, from a
mixture of sounds generated by several sources. Traditionally, these tasks have
been tackled using signal processing and machine learning techniques applied to
the available acoustic signals. Since the visual aspect of speech is
essentially unaffected by the acoustic environment, visual information from the
target speakers, such as lip movements and facial expressions, has also been
used for speech enhancement and speech separation systems. In order to
efficiently fuse acoustic and visual information, researchers have exploited
the flexibility of data-driven approaches, specifically deep learning,
achieving strong performance. The ceaseless proposal of a large number of
techniques to extract features and fuse multimodal information has highlighted
the need for an overview that comprehensively describes and discusses
audio-visual speech enhancement and separation based on deep learning. In this
paper, we provide a systematic survey of this research topic, focusing on the
main elements that characterise the systems in the literature: acoustic
features; visual features; deep learning methods; fusion techniques; training
targets and objective functions. In addition, we review deep-learning-based
methods for speech reconstruction from silent videos and audio-visual sound
source separation for non-speech signals, since these methods can be more or
less directly applied to audio-visual speech enhancement and separation.
Finally, we survey commonly employed audio-visual speech datasets, given their
central role in the development of data-driven approaches, and evaluation
methods, because they are generally used to compare different systems and
determine their performance.
Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review
Artificial Neural Networks (ANNs) were inspired by the neural networks in the human brain and have been widely applied in speech processing. Their application areas include speech recognition, speech emotion recognition, language identification, speech enhancement, and speech separation, amongst others. Likewise, given that speech processing performed by humans involves complex cognitive processes known as auditory attention, a growing number of papers propose ANNs supported by deep learning algorithms in conjunction with some mechanism to achieve symmetry with the human attention process. However, while these ANN approaches include attention, there is no categorization of the attention integrated into the deep learning algorithms or of its relation to human auditory attention. Therefore, we consider it necessary to review the different attention-inspired ANN approaches to show both academic and industry experts the available models for a wide variety of applications. Based on the PRISMA methodology, we present a systematic review of the literature published since 2000 in which deep learning algorithms are applied to diverse problems related to speech processing. In this paper, 133 research works are selected and the following aspects are described: (i) most relevant features, (ii) ways in which attention has been implemented, (iii) their hypothetical relationship with human attention, and (iv) the evaluation metrics used. Additionally, the four publications most closely related to human attention were analyzed and their strengths and weaknesses were determined.
IberSPEECH 2020: XI Jornadas en Tecnología del Habla and VII Iberian SLTech
IberSPEECH2020 is a two-day event, bringing together the best researchers and practitioners in speech and language technologies in Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentation of projects, laboratory activities, recent PhD theses, discussion panels, a round table, and awards to the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all the contributions, each submitted paper was reviewed by three members of the scientific review committee. All the papers in the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, Czech Republic, Ukraine, Slovenia). Furthermore, an extended version of selected papers will be published as a special issue of the Journal of Applied Sciences, “IberSPEECH 2020: Speech and Language Technologies for Iberian Languages”, published by MDPI with fully open access. In addition to regular paper sessions, the IberSPEECH2020 scientific program features the following activities: the ALBAYZIN evaluation challenge session. Red Española de Tecnologías del Habla. Universidad de Valladoli
Deep neural network techniques for monaural speech enhancement: state of the art analysis
Deep neural network (DNN) techniques have become pervasive in domains such
as natural language processing and computer vision. They have achieved great
success in these domains in tasks such as machine translation and image
generation. Due to their success, these data-driven techniques have been
applied in the audio domain. More specifically, DNN models have been applied in
the speech enhancement domain to achieve denoising, dereverberation and
multi-speaker separation in monaural speech enhancement. In this paper, we
review some dominant DNN techniques being employed to achieve speech
separation. The review looks at the whole pipeline of speech enhancement, from
feature extraction, to how DNN-based tools model both global and local
features of speech, to model training (supervised and unsupervised). We also
review the use of speech-enhancement pre-trained models to boost the speech
enhancement process. The review is geared towards covering the dominant trends
with regard to DNN application in speech enhancement in speech obtained via a
single speaker.
Real-time noise filtering with adaptive filters in heavy equipment soundscape
In this master's thesis, adaptive filters are used to abate the engine noise of heavy equipment and the changes in the soundscape are studied. The main objective of this work is to enhance the sound quality of human speech and shouting. The results are evaluated both with subjective tests and computationally. The first method used in achieving the goal is the Butterworth band-pass filter, which is designed to preserve the frequencies possibly containing human speech and filter out the rest. The output of the Butterworth filter is then filtered with an adaptive filter targeting the engine noise. In this study two different adaptive filters are used, the Wiener filter and the NLMS filter, and their performance is compared. In addition to the method of implementation, these filters also differ in that the Wiener filter does not use a reference signal for adaptation, but estimates the noise from the input signal itself, while the NLMS filter uses a microphone in the engine compartment as a reference signal. The filtering system developed in this study is implemented on a development platform, which is designed to be used by the end user, in other words, the operator of the heavy equipment. This is why the results of the subjective tests are in focus in this study. According to both objective and subjective evaluations in this study, the engine noise is attenuated considerably and the speech coming from outside the vehicle is cleaner. In the thesis the results were evaluated both with a Sandvik Pantera DPI series drilling machine and a large diesel engine car, a Nissan Pathfinder. The subjective evaluation is compared with the signal-to-noise ratio and signal-to-distortion ratio, which both indicated that the enhancement was successful. Even though the test conditions were not optimal, the results show that adaptive filters can be used efficiently to filter the engine noise in real time.
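As a rough illustration of the reference-based NLMS stage described in the abstract above, the following Python sketch adapts an FIR filter on a reference noise signal and subtracts the resulting noise estimate from the primary signal. It is not the thesis implementation: the function name `nlms_cancel`, the tap count, the step size `mu`, and the regularizer `eps` are all assumptions chosen for illustration, and the Butterworth pre-filtering step is omitted.

```python
import numpy as np


def nlms_cancel(primary, reference, num_taps=16, mu=0.5, eps=1e-8):
    """Hypothetical NLMS noise canceller (illustrative, not from the thesis).

    primary:   signal containing the wanted sound plus filtered noise
    reference: noise-only signal (e.g. a microphone near the noise source)
    Returns the error signal, which serves as the cleaned output.
    """
    primary = np.asarray(primary, dtype=float)
    reference = np.asarray(reference, dtype=float)
    weights = np.zeros(num_taps)   # adaptive FIR coefficients
    buf = np.zeros(num_taps)       # recent reference samples, newest first
    out = np.zeros_like(primary)

    for n in range(len(primary)):
        # Shift the newest reference sample into the delay line.
        buf = np.roll(buf, 1)
        buf[0] = reference[n]

        # Estimate the noise component and subtract it from the primary signal.
        noise_estimate = weights @ buf
        error = primary[n] - noise_estimate

        # Normalized LMS update: step size scaled by the reference energy.
        weights += mu * error * buf / (buf @ buf + eps)
        out[n] = error

    return out
```

Normalizing the update by the energy of the reference buffer keeps the adaptation rate roughly independent of the noise level, which is the usual motivation for choosing NLMS over plain LMS.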