11 research outputs found

    Advanced signal processing techniques for pitch synchronous sinusoidal speech coders

    Recent trends in commercial and consumer demand have led to the increasing use of multimedia applications in mobile and Internet telephony. Although audio, video and data communications are becoming more prevalent, a major application is and will remain the transmission of speech. Speech coding techniques suited to these new trends must be developed, not only to provide high quality speech communication but also to minimise the bandwidth required for speech, so as to maximise that available for the new audio, video and data services. The majority of current speech coders used in mobile and Internet applications employ a Code Excited Linear Prediction (CELP) model. These coders attempt to reproduce the input speech signal and can produce high quality synthetic speech at bit rates above 8 kbps. Sinusoidal speech coders tend to dominate at rates below 6 kbps but, due to limitations in the sinusoidal speech coding model, their synthetic speech quality cannot be significantly improved even if their bit rate is increased. Recent developments have seen the emergence and application of Pitch Synchronous (PS) speech coding techniques to these coders in order to remove the limitations of the sinusoidal speech coding model. The aim of the research presented in this thesis is to investigate and eliminate the factors that limit the quality of the synthetic speech produced by PS sinusoidal coders. In order to achieve this, innovative signal processing techniques have been developed. New parameter analysis and quantisation techniques have been produced which overcome many of the problems associated with applying PS techniques to sinusoidal coders. In sinusoidal based coders, two of the most important elements are the correct formulation of pitch and voicing values from the input speech. The techniques introduced here have greatly improved these calculations, resulting in a higher quality PS sinusoidal speech coder than was previously available. 
A new quantisation method which is able to reduce the distortion from quantising speech spectral information has also been developed. When these new techniques are utilised they effectively raise the synthetic speech quality of sinusoidal coders to a level comparable to that produced by CELP based schemes, making PS sinusoidal coders a promising alternative at low to medium bit rates. Source: EThOS - Electronic Theses Online Service, United Kingdom.

    Characterisation of noisy speech channels in 2G and 3G mobile networks

    As the wireless cellular market reaches unprecedented levels of competition, network operators need to keep Quality of Service (QoS) a main priority if they wish to attract new subscribers while keeping existing customers satisfied. Speech quality as perceived by the end user is one major example of a characteristic in constant need of maintenance and improvement, and it is the topic of this Master's thesis project. The project uses an intrusive method of speech quality evaluation to study and characterize the performance of speech codecs in second-generation (2G) and third-generation (3G) technologies. It looks for further correlation between codecs with similar bit rates and explores certain transmission parameters that may aid in the assessment of speech quality. Because of limitations in the audio analyzer equipment that was to be employed, a different system for recording the test samples was sought. Although the newly designed system is not standard, after extensive testing and optimization of its parameters the final results were found to be reliable and satisfactory. Tests cover a set of high and low bit rate codecs for both 2G and 3G; comparing and analysing the values shows that 3G speech codecs perform better than 2G codecs under approximately the same conditions. This reinforces the idea that 3G is, without doubt, the best choice if the customer wants the best possible listening speech quality. The transmission parameters chosen for the experiment, Receiver Quality (RxQual) and Received Energy per Chip to Power Density Ratio (Ec/N0), were subjected to speech quality correlation tests. The final RxQual results were compared with those of prior studies by other researchers and are considered highly relevant, confirming RxQual as a reliable indicator of speech quality. 
As for Ec/N0, it cannot be stated to be a speech quality indicator; however, it shows clear thresholds at which the MOS values decrease significantly. The studied transmission parameters can be used not only for network management purposes but also to give the communications engineer (or technician) an idea of the expected end-to-end speech quality consequences. The conclusion of the work suggests ideas for future studies. Considering that fourth-generation (4G) cellular technologies are now beginning to take an important place in the global market, as the first all-IP network structure, it seems highly relevant that 4G speech quality should be evaluated. It should be compared with 3G, not only in narrowband (300 to 3400 Hz) but also in wideband (50 to 7000 Hz) scenarios, using the most recent standard objective method of speech quality assessment, POLQA. The new data found in the Ec/N0 tests also justifies further research to validate the assumptions made in this work.

    Quality of media traffic over Lossy internet protocol networks: Measurement and improvement.

    Voice over Internet Protocol (VoIP) is an active area of research in the world of communication. The high revenue made by the telecommunication companies is a motivation to develop solutions that transmit voice over media other than the traditional circuit-switched network. However, while IP networks can carry data traffic very well due to their best-effort nature, they are not designed to carry real-time applications such as voice. As such, several degradations can happen to the speech signal before it reaches its destination. Therefore, it is important for legal, commercial, and technical reasons to measure the quality of VoIP applications accurately and non-intrusively. Several methods have been proposed to measure speech quality: some are subjective, some are intrusive and others are non-intrusive. One of the non-intrusive methods is the E-model standardised by the International Telecommunication Union - Telecommunication Standardisation Sector (ITU-T). Although the E-model is non-intrusive, it depends on time-consuming, expensive and hard-to-conduct subjective tests to calibrate its parameters; consequently it is applicable to a limited number of conditions and speech coders. Also, it is less accurate than intrusive methods such as Perceptual Evaluation of Speech Quality (PESQ) because it does not consider the contents of the received signal. In this thesis an approach to extend the E-model based on PESQ is proposed. Using this method the E-model can be extended to new network conditions and applied to new speech coders without the need for subjective tests. The modified E-model calibrated using PESQ is compared with the E-model calibrated using subjective tests to prove its effectiveness. 
During this extension the relation between quality estimation using the E-model and PESQ is investigated, and a correction formula is proposed to correct the deviation in speech quality estimation. A second extension, intended to improve the accuracy of the E-model relative to PESQ, looks into the content of the degraded signal and classifies each lost packet as either voiced or unvoiced based on the surrounding received packets. The accuracy of the proposed method is evaluated by comparing the estimates of the new method, which takes packet class into consideration, with the measurements provided by PESQ as a more accurate, intrusive method for measuring speech quality. The two extensions are combined to offer a method for estimating the quality of VoIP applications accurately and non-intrusively, without the need for time-consuming, expensive, and hard-to-conduct subjective tests. Finally, the applicability of the E-model or the modified E-model in measuring the quality of services in Service Oriented Computing (SOC) is illustrated.
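
For readers unfamiliar with the E-model mentioned in this abstract: it produces a transmission rating factor R, which is then mapped to an estimated MOS. The sketch below shows only the standard R-to-MOS mapping from ITU-T G.107. It is not the thesis's PESQ-calibrated extension, and the function name `r_to_mos` is a hypothetical name chosen for illustration.

```python
def r_to_mos(r: float) -> float:
    """Map an E-model rating factor R to an estimated MOS (ITU-T G.107 mapping).

    Hypothetical helper for illustration; not taken from the thesis.
    """
    if r <= 0:
        return 1.0   # lower bound of the estimated MOS scale
    if r >= 100:
        return 4.5   # upper bound of the estimated MOS scale
    # Cubic mapping defined for 0 < R < 100
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


if __name__ == "__main__":
    for rating in (50.0, 70.0, 93.2):
        print(f"R = {rating:5.1f} -> estimated MOS = {r_to_mos(rating):.2f}")
```

Impairments such as delay or packet loss lower R, and the mapping then gives the corresponding drop in estimated MOS.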

    Speech quality prediction for voice over Internet protocol networks

    IP networks are on a steep slope of innovation that will make them the long-term carrier of all types of traffic, including voice. However, such networks are not designed to support real-time voice communication because their variable characteristics (e.g. due to delay, delay variation and packet loss) lead to a deterioration in voice quality. A major challenge in such networks is how to measure or predict voice quality accurately and efficiently for QoS monitoring and/or control purposes to ensure that technical and commercial requirements are met. Voice quality can be measured using either subjective or objective methods. Subjective measurement (e.g. MOS) is the benchmark for objective methods, but it is slow, time consuming and expensive. Objective measurement can be intrusive or non-intrusive. Intrusive methods (e.g. ITU PESQ) are more accurate, but normally are unsuitable for monitoring live traffic because they need reference data and must utilise the network. This makes non-intrusive methods (e.g. ITU E-model) more attractive for monitoring voice quality from IP network impairments. However, current non-intrusive methods rely on subjective tests to derive model parameters and as a result are limited and do not meet the needs of new and emerging applications. The main goal of the project is to develop novel and efficient models for non-intrusive speech quality prediction to overcome the disadvantages of current subjective-based methods and to demonstrate their usefulness in new and emerging VoIP applications. The main contributions of the thesis are fourfold: (1) a detailed understanding of the relationships between voice quality, IP network impairments (e.g. 
packet loss, jitter and delay) and relevant parameters associated with speech (e.g. codec type, gender and language) is provided. An understanding of the perceptual effects of these key parameters on voice quality is important as it provides a basis for the development of non-intrusive voice quality prediction models. A fundamental investigation of the impact of the parameters on perceived voice quality was carried out using the latest ITU algorithm for perceptual evaluation of speech quality, PESQ, and by exploiting the ITU E-model to obtain an objective measure of voice quality. (2) A new methodology to predict voice quality non-intrusively was developed. The method exploits the intrusive algorithm, PESQ, and a combined PESQ/E-model structure to provide a perceptually accurate prediction of both listening and conversational voice quality non-intrusively. This avoids time-consuming subjective tests and so removes one of the major obstacles in the development of models for voice quality prediction. The method is generic and as such has wide applicability in multimedia applications. Efficient regression-based models and robust artificial neural network-based learning models were developed for predicting voice quality non-intrusively for VoIP applications. (3) Three applications of the new models were investigated: voice quality monitoring/prediction for real Internet VoIP traces, perceived quality driven playout buffer optimization and perceived quality driven QoS control. The neural network and regression models were both used to predict voice quality for real Internet VoIP traces based on international links. A new adaptive playout buffer algorithm and a perceptual optimization playout buffer algorithm are presented. A QoS control scheme that combines the strengths of rate-adaptive and priority marking control schemes to provide superior QoS control in terms of measured perceived voice quality is also provided. 
(4) A new methodology for Internet-based subjective speech quality measurement, which allows rapid assessment of voice quality for VoIP applications, is proposed and assessed using both objective and traditional MOS test methods.

    MasterVoicing - A whispers to voiced speech assistant

    Aphonia, also known as loss of voice, is a condition that affects the human phonetic system and is characterized by the inability of a speaker to produce normal speech. It can range from partial loss, known as hoarseness, to an almost complete loss of voice, where the voice is nothing but a whisper. Its causes vary, from physical causes related to injuries, medical procedures or bad habits such as voice misuse, to psychological causes related to mental disorders or trauma. Whispering is a natural form of speech for people in some social situations where privacy is desired or silence is recommended. However, for patients with aphonia, whispering is generally their primary means of communication. This can become a problem because of the difficulty of communicating with other people, and can even cause problems in daily life or work-related activities. There are some solutions to this problem for laryngectomized patients, such as an electrolarynx, which recreates an artificial voice, esophageal speech, and tracheo-esophageal puncture with a prosthesis, but all of them have disadvantages and require some degree of practice to master. There are also silent speech interfaces, but these are not yet convenient solutions. 
There are also mobile applications that try to help with this problem, generally based on text-to-speech conversion. They require text input from the user, which is then reproduced as speech, resulting in slow and unnatural usage. Some of these applications work in real time through a simple click on buttons with predefined text, but these also have practical limitations. With that in mind, the goal of this dissertation is to develop MasterVoicing, a mobile application for the iOS platform, whose purpose is to give aphonics another way to communicate using their natural means of communication: whispering. It is validated through usability tests, and it aims to work in real time, integrating a whisper-to-speech algorithm that reconstructs natural, voiced speech from whispers, giving aphonics an easy tool to regain some of their communication freedom without the drawbacks of the other methods available to them.

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes