
    Minimum Processing Near-end Listening Enhancement

    The intelligibility and quality of speech from a mobile phone or public announcement system are often degraded by background noise in the listening environment. By pre-processing the speech signal it is possible to improve speech intelligibility and quality; this is known as near-end listening enhancement (NLE). Although existing NLE techniques can greatly increase intelligibility in harsh noise environments, in favorable noise conditions intelligibility reaches a ceiling beyond which it cannot be further enhanced. In such conditions, the exclusive focus of existing methods on improving intelligibility causes unnecessary processing of the speech signal and leads to speech distortions and quality degradations. In this paper, we provide a new rationale for NLE, where the target speech is minimally processed in terms of a processing penalty, provided that a certain performance constraint, e.g., intelligibility, is satisfied. We present a closed-form solution for the case where the performance criterion is an intelligibility estimator based on the approximated speech intelligibility index and the processing penalty is the mean-square error between the processed and the clean speech. This yields an NLE method that adapts to changing noise conditions via a simple gain rule, limiting the processing to the minimum necessary to achieve a desired intelligibility while preserving quality in favorable noise conditions by minimizing speech distortions. Through simulation studies, we show the proposed method attains speech quality on par with or better than existing methods in both objective measurements and subjective listening tests, while sustaining objective speech intelligibility performance on par with existing methods.
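    The paper's closed-form gain rule is not reproduced here, but a minimal sketch can illustrate the minimum-processing idea: apply unity gain wherever the noise is already favorable, and otherwise the smallest per-band gain that satisfies an intelligibility-like constraint. The function name, the band-SNR proxy standing in for the approximated speech intelligibility index, and the 15 dB target are all assumptions for illustration.

```python
import numpy as np

# Minimal sketch of the minimum-processing idea (not the paper's exact
# closed form): leave bands with favorable SNR untouched (gain = 1) and
# amplify the rest only as much as a target band SNR requires. The target
# is a crude stand-in for an approximated-SII constraint.

def minimum_processing_gains(speech_power, noise_power, target_snr_db=15.0):
    """Per-band amplitude gains: unity where the SNR target already holds,
    otherwise the smallest gain that reaches it."""
    snr = speech_power / np.maximum(noise_power, 1e-12)
    target = 10.0 ** (target_snr_db / 10.0)
    # Smallest power gain g with g * snr >= target; never attenuate.
    g = np.maximum(1.0, target / np.maximum(snr, 1e-12))
    return np.sqrt(g)  # convert power gains to amplitude gains

# Example: four bands, the last two buried in noise.
speech = np.array([1.0, 0.5, 0.2, 0.1])
noise = np.array([0.01, 0.01, 0.2, 0.5])
print(minimum_processing_gains(speech, noise))  # [1.  1.  ~5.62  ~12.57]
```

    In quiet bands the gain stays at 1, so the processed signal equals the clean speech and the mean-square-error penalty is zero, which is the behavior the minimum-processing rationale aims for.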

    Data-Driven Speech Intelligibility Prediction


    Studies on noise robust automatic speech recognition

    Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both classic and novel approaches proposed for noise-robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise-robust automatic speech recognition (course code T-61.6060) held at TKK.

    From Algorithmic to Neural Beamforming

    Human interaction increasingly relies on telecommunication as an addition to or replacement for immediate contact. Direct interaction with smart devices, beyond the use of classical input devices such as the keyboard, has become common practice. Remote participation in conferences, sporting events, or concerts is more common than ever, and with current global restrictions on in-person contact, it has become an inevitable part of many people's reality. The work presented here aims at improving these encounters by enhancing the auditory experience. Augmenting fidelity and intelligibility can increase the perceived quality and enjoyability of such interactions and potentially raise acceptance of modern forms of remote experience. Two approaches to automatic source localization and multichannel signal enhancement are investigated for applications ranging from small conferences to large arenas.

    Three first-order microphones of fixed relative position and orientation are used to create a compact, reactive tracking and beamforming algorithm capable of producing pristine audio signals in small and mid-sized acoustic environments. With inaudible beam steering and a highly linear frequency response, this system aims to provide an alternative to manually operated shotgun microphones or sets of individual spot microphones, applicable in broadcast, live events, and teleconferencing, or for human-computer interaction. The array design and choice of capsules are discussed, as well as the challenges of preventing coloration for moving sources. The developed algorithm, based on Energy-Based Source Localization, is discussed and its performance analyzed. Objective results on synthesized audio as well as on real recordings are presented, along with the results of multiple listening tests and real-time considerations.

    Multiple microphones with unknown spatial distribution are combined into a large-aperture array using an end-to-end deep learning approach. This method combines state-of-the-art single-channel signal separation networks with adaptive, domain-specific channel alignment. The Neural Beamformer is capable of learning to extract detailed spatial relations between channels with respect to a learned signal type, such as speech, and of applying appropriate corrections to align the signals. This creates an adaptive beamformer for microphones spaced up to roughly 100 m apart. The developed modules are analyzed in detail and multiple configurations are considered for different use cases. Signal processing inside the neural network is interpreted, and objective results are presented on simulated and semi-simulated datasets.
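    Neither of the thesis's two systems is reproduced here, but a textbook delay-and-sum beamformer illustrates the basic operation that both refine: estimate each microphone's propagation delay from a hypothesized source position, time-align the channels, and average. The function name, the 3-D geometry, and the integer-sample alignment are simplifying assumptions.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, source_pos, fs, c=343.0):
    """Classic delay-and-sum beamformer.
    signals: (n_mics, n_samples); positions in metres; fs in Hz."""
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)
    delays = (dists - dists.min()) / c           # relative delays [s]
    shifts = np.round(delays * fs).astype(int)   # integer-sample alignment
    n = signals.shape[1]
    out = np.zeros(n)
    for sig, s in zip(signals, shifts):
        out[: n - s] += sig[s:]                  # advance later arrivals
    return out / len(signals)
```

    Signals arriving from the steered position add coherently while diffuse noise does not, raising the output SNR; a source localizer, such as the energy-based one described above, supplies the position estimate such a beamformer needs.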

    Contribution to quality of user experience provision over wireless networks

    The widespread expansion of wireless networks has brought attractive new possibilities to end users. In addition to the mobility that unwired devices provide, the configuration process a user must follow to gain connectivity through a wireless network is notably simple. Furthermore, the increasing bandwidth provided by the IEEE 802.11 family has made it possible to access highly demanding services such as multimedia communications. Multimedia traffic has unique characteristics that make it highly vulnerable to network impairments such as packet losses, delay, and jitter. Voice over IP (VoIP) communications, video conferencing, video streaming, etc., are examples of demanding services that must meet very strict requirements to be served with acceptable levels of quality. Meeting these tough requirements will become extremely important over the next years, given that consumer video traffic is expected to be the predominant traffic on the Internet. In wired systems, these requirements are achieved using Quality of Service (QoS) techniques such as Differentiated Services (DiffServ) and traffic engineering. Employing these methodologies in wireless networks is not as simple, however, because many other factors, e.g., fading and interference, affect the quality of the provided service.

    Focusing on the IEEE 802.11g standard, the most widespread technology for Wireless Local Area Networks (WLANs), two different architecture schemes are defined. In infrastructure mode, a central point manages the network, assuming network control tasks such as IP assignment, routing, and access security; the remaining nodes act as hosts, sending and receiving traffic through the central point. The IEEE 802.11 ad-hoc configuration mode, by contrast, is less widespread. Under this scheme there is no central point; all nodes assume both host and router roles, which permits the quick deployment of a network without pre-existing infrastructure. Such networks, called Mobile Ad-hoc NETworks (MANETs), are attractive when a communication system must be deployed rapidly, e.g., for tactical networks, disaster events, or temporary networks. The benefits of MANETs are varied, including high node mobility, extended network coverage, and improved reliability by avoiding single points of failure. The dynamic nature of these networks requires the nodes to react to topology changes as quickly as possible. Moreover, as mentioned above, the transmission of multimedia traffic entails real-time constraints that must be met to provide these services with acceptable levels of quality. Efficient routing protocols are therefore needed, capable of making the network sufficiently reliable with minimal impact on the quality of the services flowing through the nodes.

    Regarding quality measurement, the current trend is to estimate what the end user actually perceives when consuming the service. This paradigm, called Quality of user Experience (QoE), differs from the traditional Quality of Service (QoS) approach in the human perspective it gives to quality estimation. Different approaches can be taken to measure the subjective opinion a user has of a given service. The most accurate methodology is to perform subjective tests in which a panel of human testers rates the quality of the service under evaluation. This approach returns a quality score, the so-called Mean Opinion Score (MOS), on a scale of 1 to 5. It has several drawbacks, however, such as its high cost and the impossibility of performing tests in real time. For these reasons, several mathematical models have been proposed to estimate the QoE (MOS) achieved by different multimedia services; one standard example is sketched after this abstract.

    In this thesis, the focus is on evaluating and understanding the transmission of multimedia content over wireless networks from a QoE perspective. To this end, the QoE paradigm is first explored, with the aim of understanding how to evaluate the quality of a given multimedia service. The influence of the impairments introduced by the wireless transmission channel on multimedia communications is then analyzed, and the behavior of different WLAN schemes is evaluated to test their suitability for supporting highly demanding traffic such as multimedia transmission. Finally, as the main contribution of this thesis, new mechanisms and strategies are presented to improve the quality of multimedia services distributed over IEEE 802.11 networks. In particular, the distribution of multimedia services over ad-hoc networks is studied in depth, and a novel opportunistic routing protocol called JOKER (auto-adJustable Opportunistic acK/timEr-based Routing) is presented. This proposal provides better support for multimedia services while reducing energy consumption compared with the standard ad-hoc routing protocols. (Universidad Politécnica de Cartagena, Programa Oficial de Doctorado en Tecnologías de la Información y Comunicaciones)
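    One widely used family of such parametric QoE models is the E-model of ITU-T G.107, whose transmission rating factor R maps to an estimated MOS. The sketch below implements only that standard R-to-MOS mapping; the thesis itself is not tied to this particular model, and the example R value is illustrative.

```python
# ITU-T G.107 E-model: mapping the transmission rating factor R (0..100)
# to an estimated Mean Opinion Score on the usual 1-5 scale.

def r_to_mos(r: float) -> float:
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)

# Example: a VoIP path whose accumulated impairments (delay, loss, codec)
# leave R around 80 rates roughly "good".
print(round(r_to_mos(80.0), 2))  # ~4.02
```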

    Speech dereverberation and speaker separation using microphone arrays in realistic environments

    This thesis concentrates on comparing novel and existing dereverberation and speaker separation techniques using multiple corpora, including a new corpus collected with a microphone array. Many corpora currently used for these techniques are recorded with head-mounted microphones in anechoic chambers; this novel corpus instead contains recordings with noise and reverberation made in office and workshop environments. The novel algorithms approximate the reverberation in a different way, producing results that are competitive with existing algorithms. Dereverberation is evaluated using seven correlation-based algorithms applied to two different corpora, three of them novel (Hs NTF, Cauchy WPE, and Cauchy MIMO WPE). Both non-learning and learning algorithms are tested, with the learning algorithms performing better. For single- and multi-channel speaker separation, unsupervised non-negative matrix factorization (NMF) algorithms are compared using three cost functions combined with sparsity, convolution, and direction of arrival. The results show that the choice of cost function is important for improving the separation result. Furthermore, six different supervised deep learning algorithms are applied to single-channel speaker separation, where incorporating historic information improves the results. Comparing NMF with deep learning, NMF converges faster to a solution and provides better results for the corpora used in this thesis.
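    As a concrete reference point for the unsupervised NMF comparison, a minimal implementation with the generalized Kullback-Leibler cost and multiplicative updates (one common choice among the cost functions such work compares) could look as follows; V stands for a magnitude spectrogram, and all names are illustrative.

```python
import numpy as np

# Minimal unsupervised NMF sketch: factor V (freq x time) into W @ H using
# multiplicative updates for the generalized KL divergence.

def nmf_kl(V, rank, n_iter=200, eps=1e-10):
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, rank)) + eps   # spectral basis vectors
    H = rng.random((rank, T)) + eps   # time-varying activations
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H
```

    For separation, each speaker's spectrogram estimate is the product of that speaker's columns of W with the matching rows of H, typically refined with a Wiener-style mask before inversion back to the time domain.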

    Secure covert communications over streaming media using dynamic steganography

    Streaming technologies such as VoIP are widely embedded in commercial and industrial applications, so it is imperative to address data security issues before the problems become really serious. This thesis describes a theoretical and experimental investigation of secure covert communications over streaming media using dynamic steganography. A covert VoIP communications system was developed in C++ to enable the implementation of the work being carried out.

    A new information-theoretical model of secure covert communications over streaming media was constructed to depict the security scenarios in streaming-media-based steganographic systems under passive attacks. The model involves a stochastic process that models an information source for covert VoIP communications and uses the theory of hypothesis testing to analyse the adversary's detection performance.

    The potential of hardware-based true random key generation and chaotic interval selection for innovative applications in covert VoIP communications was explored. The CPU's time stamp counter was used as an entropy source to generate true random numbers serving as secret keys for streaming media steganography, and a novel interval selection algorithm was devised to choose data embedding locations in VoIP streams at random, using sequences generated by a chaotic process. A steganographic algorithm based on dynamic key updating and transmission, which integrates a one-way cryptographic accumulator into dynamic key exchange, was devised to provide secure key exchange for covert communications over streaming media. Analysis based on the discrete logarithm problem and steganalysis using the t-test showed the algorithm to be a solid method of key distribution over a public channel.

    The effectiveness of the new steganographic algorithm for covert communications over streaming media was examined by means of security analysis, steganalysis using non-parametric Mann-Whitney-Wilcoxon statistical testing, and performance and robustness measurements. The algorithm achieved an average data embedding rate of 800 bps, comparable to other related algorithms. The results indicated that the algorithm has little or no impact on real-time VoIP communications in terms of speech quality (< 5% change in PESQ with hidden data), signal distortion (6% change in SNR after steganography), and imperceptibility, and that it is more secure and effective in addressing the security problems than other related algorithms.
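    The thesis's exact embedding scheme is not reproduced here, but the interplay of a chaotic process and payload embedding can be sketched: a logistic map, seeded from a shared secret key, picks which PCM samples in a VoIP frame carry hidden bits in their least significant bit. The map parameter, the key-to-seed mapping, and the function names are assumptions for illustration.

```python
# Illustrative sketch (not the thesis's algorithm): chaotic interval
# selection for LSB embedding in a frame of 16-bit PCM samples.

def logistic_positions(key, n_samples, n_bits, r=3.99):
    """Derive n_bits distinct embedding positions from a logistic-map orbit
    seeded by the shared secret key."""
    x = (key % 9973) / 9973.0 or 0.5      # map key into (0, 1); avoid 0
    positions, seen = [], set()
    while len(positions) < n_bits:
        x = r * x * (1.0 - x)             # chaotic iteration
        p = int(x * n_samples)
        if p not in seen:
            seen.add(p)
            positions.append(p)
    return positions

def embed_bits(samples, bits, key):
    """Hide `bits` in the least significant bits of key-selected samples."""
    out = list(samples)
    for pos, bit in zip(logistic_positions(key, len(out), len(bits)), bits):
        out[pos] = (out[pos] & ~1) | bit  # overwrite the LSB only
    return out
```

    The receiver, knowing the key, regenerates the same positions and reads the LSBs back; only one bit per selected sample changes, which keeps the perceptual impact on the speech signal small.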