Search CORE

316 research outputs found

Studies on noise robust automatic speech recognition

Author: Kurimo Mikko
Palomäki Kalle J.
Remes Ulpu
Publication venue: Teknillinen korkeakoulu
Publication date: 01/01/2009
Field of study

Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both the classic and novel approaches suggested for noise robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise robust automatic speech recognition (course code T-61.6060) held at TKK

Aaltodoc Publication Archive

Enhancing the front-end of speaker recognition systems

Author: Ahmed Ahmed Isam
Publication venue
Publication date: 01/07/2019
Field of study

Portsmouth University Research Portal (Pure)

Early adductive reasoning for blind signal separation

Author: Ozdemir Pinar
Publication venue
Publication date: 02/11/2017
Field of study

We demonstrate that explicit and systematic incorporation of abductive reasoning capabilities into algorithms for blind signal separation can yield significant performance improvements. Our formulated mechanisms apply to the output data of signal processing modules in order to conjecture the structure of time-frequency interactions between the signal components that are to be separated. The conjectured interactions are used to drive subsequent signal separation processes that are as a result less blind to the interacting signal components and, therefore, more effective. We refer to this type of process as early abductive reasoning (EAR); the “early” refers to the fact that in contrast to classical Artificial Intelligence paradigms, the reasoning process here is utilized before the signal processing transformations are completed. We have used our EAR approach to formulate a practical algorithm that is more effective in realistically noisy conditions than reference algorithms that are representative of the current state of the art in two-speaker pitch tracking. Our algorithm uses the Blackboard architecture from Artificial Intelligence to control EAR and advanced signal processing modules. The algorithm has been implemented in MATLAB and successfully tested on a database of 570 mixture signals representing simultaneous speakers in a variety of real-world, noisy environments. With 0 dB Target-to-Masking Ratio (TMR) and no noise, the Gross Error Rate (GER) for our algorithm is 5% in comparison to the best GER performance of 11% among the reference algorithms. In diffuse noisy environments (such as street or restaurant environments), we find that our algorithm on the average outperforms the best reference algorithm by 9.4%. With directional noise, our algorithm also outperforms the best reference algorithm by 29%. The extracted pitch tracks from our algorithm were also used to carry out comb filtering for separating the harmonics of the two speakers from each other and from the other sound sources in the environment. The separated signals were evaluated subjectively by a set of 20 listeners to be of reasonable quality

Boston University Institutional Repository (OpenBU)

Contextual Beamforming: Exploiting Location and AI for Enhanced Wireless Telecommunication Performance

Author: Abbas Hasan T
Abbasi Qammer H
Bhatti Satyam
Ghannam Rami
Imran Muhammad Ali
Kaur Jaspreet
Popoola Olaoluwa R
Publication venue
Publication date: 02/07/2023
Field of study

The pervasive nature of wireless telecommunication has made it the foundation for mainstream technologies like automation, smart vehicles, virtual reality, and unmanned aerial vehicles. As these technologies experience widespread adoption in our daily lives, ensuring the reliable performance of cellular networks in mobile scenarios has become a paramount challenge. Beamforming, an integral component of modern mobile networks, enables spatial selectivity and improves network quality. However, many beamforming techniques are iterative, introducing unwanted latency to the system. In recent times, there has been a growing interest in leveraging mobile users' location information to expedite beamforming processes. This paper explores the concept of contextual beamforming, discussing its advantages, disadvantages and implications. Notably, the study presents an impressive 53% improvement in signal-to-noise ratio (SNR) by implementing the adaptive beamforming (MRT) algorithm compared to scenarios without beamforming. It further elucidates how MRT contributes to contextual beamforming. The importance of localization in implementing contextual beamforming is also examined. Additionally, the paper delves into the use of artificial intelligence schemes, including machine learning and deep learning, in implementing contextual beamforming techniques that leverage user location information. Based on the comprehensive review, the results suggest that the combination of MRT and Zero forcing (ZF) techniques, alongside deep neural networks (DNN) employing Bayesian Optimization (BO), represents the most promising approach for contextual beamforming. Furthermore, the study discusses the future potential of programmable switches, such as Tofino, in enabling location-aware beamforming

arXiv.org e-Print Archive

Recommended from our members

Single Channel auditory source separation with neural network

Author: Chen Zhuo
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2017
Field of study

Although distinguishing diﬀerent sounds in noisy environment is a relative easy task for human, source separation has long been extremely diﬃcult in audio signal processing. The problem is challenging for three reasons: the large variety of sound type, the abundant mixing conditions and the unclear mechanism to distinguish sources, especially for similar sounds. In recent years, the neural network based methods achieved impressive successes in various problems, including the speech enhancement, where the task is to separate the clean speech out of the noise mixture. However, the current deep learning based source separator does not perform well on real recorded noisy speech, and more importantly, is not applicable in a more general source separation scenario such as overlapped speech. In this thesis, we ﬁrstly propose extensions for the current mask learning network, for the problem of speech enhancement, to ﬁx the scale mismatch problem which is usually occurred in real recording audio. We solve this problem by combining two additional restoration layers in the existing mask learning network. We also proposed a residual learning architecture for the speech enhancement, further improving the network generalization under diﬀerent recording conditions. We evaluate the proposed speech enhancement models on CHiME 3 data. Without retraining the acoustic model, the best bi-direction LSTM with residue connections yields 25.13% relative WER reduction on real data and 34.03% WER on simulated data. Then we propose a novel neural network based model called “deep clustering” for more general source separation tasks. We train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise aﬃnity matrix that approximates the ideal aﬃnity matrix, while enabling much faster performance. At test time, the clustering step “decodes” the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Experiments on single channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker and three speakers mixtures can improve signal quality for mixtures of held-out speakers by an average over 10dB. We then propose an extension for deep clustering named “deep attractor” network that allows the system to perform eﬃcient end-to-end training. In the proposed model, attractor points for each source are ﬁrstly created the acoustic signals which pull together the time-frequency bins corresponding to each source by ﬁnding the centroids of the sources in the embedding space, which are subsequently used to determine the similarity of each bin in the mixture to each source. The network is then trained to minimize the reconstruction error of each source by optimizing the embeddings. We showed that this frame work can achieve even better results. Lastly, we introduce two applications of the proposed models, in singing voice separation and the smart hearing aid device. For the former, a multi-task architecture is proposed, which combines the deep clustering and the classiﬁcation based network. And a new state of the art separation result was achieved, where the signal to noise ratio was improved by 11.1dB on music and 7.9dB on singing voice. In the application of smart hearing aid device, we combine the neural decoding with the separation network. The system ﬁrstly decodes the user’s attention, which is further used to guide the separator for the targeting source. Both objective study and subjective study show the proposed system can accurately decode the attention and significantly improve the user experience

Columbia University Academic Commons

Digital Microphone Array - Design, Implementation and Speech Recognition Experiments

Author: Zwyssig Erich Paul
Publication venue
Publication date: 26/11/2009
Field of study

The instrumented meeting room of the future will help meetings to be more efficient and productive. One of the basic components of the instrumented meeting room is the speech recording device, in most cases a microphone array. The two basic requirements for this microphone array are portability and cost-efficiency, neither of which are provided by current commercially available arrays. This will change in the near future thanks to the availability of new digital MEMS microphones. This dissertation reports on the first successful implementation of a digital MEMS microphone array. This digital MEMS microphone array was designed, implemented, tested and evaluated and successfully compared with an existing analogue microphone array using a state-of-the-art ASR system and adaptation algorithms. The newly built digital MEMS microphone array compares well with the analogue microphone array on the basis of the word error rate achieved in an automated speech recognition system and is highly portable and economical

Edinburgh Research Archive