124 research outputs found

    Robust automatic transcription of lectures

    Get PDF
    Automatic transcription of lectures is becoming an important task. Possible applications can be found in the fields of automatic translation or summarization, information retrieval, digital libraries, education and communication research. Ideally those systems would operate on distant recordings, freeing the presenter from wearing body-mounted microphones. This task, however, is surpassingly difficult, given that the speech signal is severely degraded by background noise and reverberation

    Robust Automatic Transcription of Lectures

    Get PDF
    Die automatische Transkription von Vorträgen, Vorlesungen und Präsentationen wird immer wichtiger und ermöglicht erst die Anwendungen der automatischen Übersetzung von Sprache, der automatischen Zusammenfassung von Sprache, der gezielten Informationssuche in Audiodaten und somit die leichtere Zugänglichkeit in digitalen Bibliotheken. Im Idealfall arbeitet ein solches System mit einem Mikrofon das den Vortragenden vom Tragen eines Mikrofons befreit was der Fokus dieser Arbeit ist

    Reverberation: models, estimation and application

    No full text
    The use of reverberation models is required in many applications such as acoustic measurements, speech dereverberation and robust automatic speech recognition. The aim of this thesis is to investigate different models and propose a perceptually-relevant reverberation model with suitable parameter estimation techniques for different applications. Reverberation can be modelled in both the time and frequency domain. The model parameters give direct information of both physical and perceptual characteristics. These characteristics create a multidimensional parameter space of reverberation, which can be to a large extent captured by a time-frequency domain model. In this thesis, the relationship between physical and perceptual model parameters will be discussed. In the first application, an intrusive technique is proposed to measure the reverberation or reverberance, perception of reverberation and the colouration. The room decay rate parameter is of particular interest. In practical applications, a blind estimate of the decay rate of acoustic energy in a room is required. A statistical model for the distribution of the decay rate of the reverberant signal named the eagleMax distribution is proposed. The eagleMax distribution describes the reverberant speech decay rates as a random variable that is the maximum of the room decay rates and anechoic speech decay rates. Three methods were developed to estimate the mean room decay rate from the eagleMax distributions alone. The estimated room decay rates form a reverberation model that will be discussed in the context of room acoustic measurements, speech dereverberation and robust automatic speech recognition individually

    Channel selection and reverberation-robust automatic speech recognition

    Get PDF
    If speech is acquired by a close-talking microphone in a controlled and noise-free environment, current state-of-the-art recognition systems often show an acceptable error rate. The use of close-talking microphones, however, may be too restrictive in many applications. Alternatively, distant-talking microphones, often placed several meters far from the speaker, may be used. Such setup is less intrusive, since the speaker does not have to wear any microphone, but the Automatic Speech Recognition (ASR) performance is strongly affected by noise and reverberation. The thesis is focused on ASR applications in a room environment, where reverberation is the dominant source of distortion, and considers both single- and multi-microphone setups. If speech is recorded in parallel by several microphones arbitrarily located in the room, the degree of distortion may vary from one channel to another. The difference among the signal quality of each recording may be even more evident if those microphones have different characteristics: some are hanging on the walls, others standing on the table, or others build in the personal communication devices of the people present in the room. In a scenario like that, the ASR system may benefit strongly if the signal with the highest quality is used for recognition. To find such signal, what is commonly referred as Channel Selection (CS), several techniques have been proposed, which are discussed in detail in this thesis. In fact, CS aims to rank the signals according to their quality from the ASR perspective. To create such ranking, a measure that either estimates the intrinsic quality of a given signal, or how well it fits the acoustic models of the recognition system is needed. In this thesis we provide an overview of the CS measures presented in the literature so far, and compare them experimentally. Several new techniques are introduced, that surpass the former techniques in terms of recognition accuracy and/or computational efficiency. A combination of different CS measures is also proposed to further increase the recognition accuracy, or to reduce the computational load without any significant performance loss. Besides, we show that CS may be used together with other robust ASR techniques, and that the recognition improvements are cumulative up to some extent. An online real-time version of the channel selection method based on the variance of the speech sub-band envelopes, which was developed in this thesis, was designed and implemented in a smart room environment. When evaluated in experiments with real distant-talking microphone recordings and with moving speakers, a significant recognition performance improvement was observed. Another contribution of this thesis, that does not require multiple microphones, was developed in cooperation with the colleagues from the chair of Multimedia Communications and Signal Processing at the University of Erlangen-Nuremberg, Erlangen, Germany. It deals with the problem of feature extraction within REMOS (REverberation MOdeling for Speech recognition), which is a generic framework for robust distant-talking speech recognition. In this framework, the use of conventional methods to obtain decorrelated feature vector coefficients, like the discrete cosine transform, is constrained by the inner optimization problem of REMOS, which may become unsolvable in a reasonable time. A new feature extraction method based on frequency filtering was proposed to avoid this problem.Los actuales sistemas de reconocimiento del habla muestran a menudo una tasa de error aceptable si la voz es registrada por micr ofonos próximos a la boca del hablante, en un entorno controlado y libre de ruido. Sin embargo, el uso de estos micr ofonos puede ser demasiado restrictivo en muchas aplicaciones. Alternativamente, se pueden emplear micr ofonos distantes, los cuales a menudo se ubican a varios metros del hablante. Esta con guraci on es menos intrusiva ya que el hablante no tiene que llevar encima ning un micr ofono, pero el rendimiento del reconocimiento autom atico del habla (ASR, del ingl es Automatic Speech Recognition) en dicho caso se ve fuertemente afectado por el ruido y la reverberaci on. Esta tesis se enfoca a aplicaciones ASR en el entorno de una sala, donde la reverberaci on es la causa predominante de distorsi on y se considera tanto el caso de un solo micr ofono como el de m ultiples micr ofonos. Si el habla es grabada en paralelo por varios micr ofonos distribuidos arbitrariamente en la sala, el grado de distorsi on puede variar de un canal a otro. Las diferencias de calidad entre las señales grabadas pueden ser m as acentuadas si dichos micr ofonos muestran diferentes características y colocaciones: unos en las paredes, otros sobre la mesa, u otros integrados en los dispositivos de comunicaci on de las personas presentes en la sala. En dicho escenario el sistema ASR se puede bene ciar enormemente de la utilizaci on de la señal con mayor calidad para el reconocimiento. Para hallar dicha señal se han propuesto diversas t ecnicas, denominadas CS (del ingl es Channel Selection), las cuales se discuten detalladament en esta tesis. De hecho, la selecci on de canal busca ranquear las señales conforme a su calidad desde la perspectiva ASR. Para crear tal ranquin se necesita una medida que tanto estime la calidad intr nseca de una selal, como lo bien que esta se ajusta a los modelos ac usticos del sistema de reconocimiento. En esta tesis proporcionamos un resumen de las medidas CS hasta ahora presentadas en la literatura, compar andolas experimentalmente. Diversas nuevas t ecnicas son presentadas que superan las t ecnicas iniciales en cuanto a exactitud de reconocimiento y/o e ciencia computacional. Tambi en se propone una combinaci on de diferentes medidas CS para incrementar la exactitud de reconocimiento, o para reducir la carga computacional sin ninguna p erdida signi cativa de rendimiento. Adem as mostramos que la CS puede ser empleada junto con otras t ecnicas robustas de ASR, tales como matched condition training o la normalizaci on de la varianza y la media, y que las mejoras de reconocimiento de ambas aproximaciones son hasta cierto punto acumulativas. Una versi on online en tiempo real del m etodo de selecci on de canal basado en la varianza del speech sub-band envelopes, que fue desarrolladas en esta tesis, fue diseñada e implementada en una sala inteligente. Reportamos una mejora signi cativa en el rendimiento del reconocimiento al evaluar experimentalmente grabaciones reales de micr ofonos no pr oximos a la boca con hablantes en movimiento. La otra contribuci on de esta tesis, que no requiere m ultiples micr ofonos, fue desarrollada en colaboraci on con los colegas del departamento de Comunicaciones Multimedia y Procesamiento de Señales de la Universidad de Erlangen-Nuremberg, Erlangen, Alemania. Trata sobre el problema de extracci on de caracter sticas en REMOS (del ingl es REverberation MOdeling for Speech recognition). REMOS es un marco conceptual gen erico para el reconocimiento robusto del habla con micr ofonos lejanos. El uso de los m etodos convencionales para obtener los elementos decorrelados del vector de caracter sticas, como la transformada coseno discreta, est a limitado por el problema de optimizaci on inherente a REMOS, lo que har a que, utilizando las herramientas convencionales, se volviese un problema irresoluble en un tiempo razonable. Para resolver este problema hemos desarrollado un nuevo m etodo de extracci on de caracter sticas basado en fi ltrado frecuencialEls sistemes actuals de reconeixement de la parla mostren sovint una taxa d'error acceptable si la veu es registrada amb micr ofons pr oxims a la boca del parlant, en un entorn controlat i lliure de soroll. No obstant, l' us d'aquests micr ofons pot ser massa restrictiu en moltes aplicacions. Alternativament, es poden utilitzar micr ofons distants, els quals sovint s on ubicats a diversos metres del parlant. Aquesta con guraci o es menys intrusiva, ja que el parlant no ha de portar a sobre cap micr ofon, per o el rendiment del reconeixement autom atic de la parla (ASR, de l'angl es Automatic Speech Recognition) en aquest cas es veu fortament afectat pel soroll i la reverberaci o. Aquesta tesi s'enfoca a aplicacions ASR en un ambient de sala, on la reverberaci o es la causa predominant de distorsi o i es considera tant el cas d'un sol micr ofon com el de m ultiples micr ofons. Si la parla es gravada en paral lel per diversos micr ofons distribuï ts arbitràriament a la sala, el grau de distorsi o pot variar d'un canal a l'altre. Les difer encies en qualitat entre els senyals enregistrats poden ser m es accentuades si els micr ofons tenen diferents caracter stiques i col locacions: uns a les parets, altres sobre la taula, o b e altres integrats en els aparells de comunicaci o de les persones presents a la sala. En un escenari com aquest, el sistema ASR es pot bene ciar enormement de l'utilitzaci o del senyal de m es qualitat per al reconeixement. Per a trobar aquest senyal s'han proposat diverses t ecniques, anomenades CS (de l'angl es Channel Selection), les quals es discuteixen detalladament en aquesta tesi. De fet, la selecci o de canal busca ordenar els senyals conforme a la seva qualitat des de la perspectiva ASR. Per crear tal r anquing es necessita una mesura que estimi la qualitat intr nseca d'un senyal, o b e una que valori com de b e aquest s'ajusta als models ac ustics del sistema de reconeixement. En aquesta tesi proporcionem un resum de les mesures CS ns ara presentades en la literatura, comparant-les experimentalment. A m es, es presenten diverses noves t ecniques que superen les anteriors en termes d'exactitud de reconeixement i / o e ci encia computacional. Tamb e es proposa una combinaci o de diferents mesures CS amb l'objectiu d'incrementar l'exactitud del reconeixement, o per reduir la c arrega computacional sense cap p erdua signi cativa de rendiment. A m es mostrem que la CS pot ser utilitzada juntament amb altres t ecniques robustes d'ASR, com ara matched condition training o la normalitzaci o de la varian ca i la mitjana, i que les millores de reconeixement de les dues aproximacions s on ns a cert punt acumulatives. Una versi o online en temps real del m etode de selecci o de canal basat en la varian ca de les envolvents sub-banda de la parla, desenvolupada en aquesta tesi, va ser dissenyada i implementada en una sala intel ligent. A l'hora d'avaluar experimentalment gravacions reals de micr ofons no pr oxims a la boca amb parlants en moviment, es va observar una millora signi cativa en el rendiment del reconeixement. L'altra contribuci o d'aquesta tesi, que no requereix m ultiples micr ofons, va ser desenvolupada en col laboraci o amb els col legues del departament de Comunicacions Multimedia i Processament de Senyals de la Universitat de Erlangen-Nuremberg, Erlangen, Alemanya. Tracta sobre el problema d'extracci o de caracter stiques a REMOS (de l'angl es REverberation MOdeling for Speech recognition). REMOS es un marc conceptual gen eric per al reconeixement robust de la parla amb micr ofons llunyans. L' us dels m etodes convencionals per obtenir els elements decorrelats del vector de caracter stiques, com ara la transformada cosinus discreta, est a limitat pel problema d'optimitzaci o inherent a REMOS. Aquest faria que, utilitzant les eines convencionals, es torn es un problema irresoluble en un temps raonable. Per resoldre aquest problema hem desenvolupat un nou m etode d'extracci o de caracter ístiques basat en fi ltrat frecuencial

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    Binaural scene analysis : localization, detection and recognition of speakers in complex acoustic scenes

    Get PDF
    The human auditory system has the striking ability to robustly localize and recognize a specific target source in complex acoustic environments while ignoring interfering sources. Surprisingly, this remarkable capability, which is referred to as auditory scene analysis, is achieved by only analyzing the waveforms reaching the two ears. Computers, however, are presently not able to compete with the performance achieved by the human auditory system, even in the restricted paradigm of confronting a computer algorithm based on binaural signals with a highly constrained version of auditory scene analysis, such as localizing a sound source in a reverberant environment or recognizing a speaker in the presence of interfering noise. In particular, the problem of focusing on an individual speech source in the presence of competing speakers, termed the cocktail party problem, has been proven to be extremely challenging for computer algorithms. The primary objective of this thesis is the development of a binaural scene analyzer that is able to jointly localize, detect and recognize multiple speech sources in the presence of reverberation and interfering noise. The processing of the proposed system is divided into three main stages: localization stage, detection of speech sources, and recognition of speaker identities. The only information that is assumed to be known a priori is the number of target speech sources that are present in the acoustic mixture. Furthermore, the aim of this work is to reduce the performance gap between humans and machines by improving the performance of the individual building blocks of the binaural scene analyzer. First, a binaural front-end inspired by auditory processing is designed to robustly determine the azimuth of multiple, simultaneously active sound sources in the presence of reverberation. The localization model builds on the supervised learning of azimuthdependent binaural cues, namely interaural time and level differences. Multi-conditional training is performed to incorporate the uncertainty of these binaural cues resulting from reverberation and the presence of competing sound sources. Second, a speech detection module that exploits the distinct spectral characteristics of speech and noise signals is developed to automatically select azimuthal positions that are likely to correspond to speech sources. Due to the established link between the localization stage and the recognition stage, which is realized by the speech detection module, the proposed binaural scene analyzer is able to selectively focus on a predefined number of speech sources that are positioned at unknown spatial locations, while ignoring interfering noise sources emerging from other spatial directions. Third, the speaker identities of all detected speech sources are recognized in the final stage of the model. To reduce the impact of environmental noise on the speaker recognition performance, a missing data classifier is combined with the adaptation of speaker models using a universal background model. This combination is particularly beneficial in nonstationary background noise

    Coding Strategies for Cochlear Implants Under Adverse Environments

    Get PDF
    Cochlear implants are electronic prosthetic devices that restores partial hearing in patients with severe to profound hearing loss. Although most coding strategies have significantly improved the perception of speech in quite listening conditions, there remains limitations on speech perception under adverse environments such as in background noise, reverberation and band-limited channels, and we propose strategies that improve the intelligibility of speech transmitted over the telephone networks, reverberated speech and speech in the presence of background noise. For telephone processed speech, we propose to examine the effects of adding low-frequency and high- frequency information to the band-limited telephone speech. Four listening conditions were designed to simulate the receiving frequency characteristics of telephone handsets. Results indicated improvement in cochlear implant and bimodal listening when telephone speech was augmented with high frequency information and therefore this study provides support for design of algorithms to extend the bandwidth towards higher frequencies. The results also indicated added benefit from hearing aids for bimodal listeners in all four types of listening conditions. Speech understanding in acoustically reverberant environments is always a difficult task for hearing impaired listeners. Reverberated sounds consists of direct sound, early reflections and late reflections. Late reflections are known to be detrimental to speech intelligibility. In this study, we propose a reverberation suppression strategy based on spectral subtraction to suppress the reverberant energies from late reflections. Results from listening tests for two reverberant conditions (RT60 = 0.3s and 1.0s) indicated significant improvement when stimuli was processed with SS strategy. The proposed strategy operates with little to no prior information on the signal and the room characteristics and therefore, can potentially be implemented in real-time CI speech processors. For speech in background noise, we propose a mechanism underlying the contribution of harmonics to the benefit of electroacoustic stimulations in cochlear implants. The proposed strategy is based on harmonic modeling and uses synthesis driven approach to synthesize the harmonics in voiced segments of speech. Based on objective measures, results indicated improvement in speech quality. This study warrants further work into development of algorithms to regenerate harmonics of voiced segments in the presence of noise
    corecore