
    Modelo acústico de língua inglesa falada por portugueses

    Master's project report in Informatics Engineering, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2007. In the context of robust speech recognition based on Hidden Markov Models (HMMs), this work describes methodologies and experiments aimed at the recognition of foreign speakers. Speech recognition necessarily involves acoustic models. Acoustic models reflect the way a language is pronounced and articulated, modelling the sequence of sounds emitted during speech. This modelling rests on minimal speech segments, the phones, for which there are sets of symbols/alphabets representing their pronunciation. The representation of these symbols, and their articulation and pronunciation, is studied in articulatory and acoustic phonetics. Words can be described by analysing their constituent units, the phones. A speech recognizer interprets the input signal, the speech, as a sequence of coded symbols. To do so, the signal is split into observations of roughly 10 milliseconds each, reducing the analysis window to an interval in which the characteristics of a sound segment do not vary. Acoustic models give the probability that a given observation corresponds to a given entity; it is therefore through models of the entities in the vocabulary to be recognized that these sound fragments can be reassembled. The models developed in this work are based on HMMs, so named because they build on Markov chains (after Andrey Markov, 1856-1922): sequences of states in which each state is conditioned on the previous one. In our domain, this means building a set of models - one for each class of sounds to be recognized - which are then trained on training data.
The data are audio files and their word-level transcriptions, so that each transcription can be decomposed into phones and aligned with the corresponding sounds in the audio file. Using a state model, in which each state represents an observation or described speech segment, the data are iteratively regrouped to build increasingly reliable statistical models that represent the speech entities of a given language. Recognition of foreign speakers, whose accents differ from the language for which the recognizer was designed, can be a serious problem for recognizer accuracy. This variation can be even more problematic than dialectal variation within a language, because it depends on each speaker's proficiency in the foreign language. Using a small amount of audio from foreign speakers to train new acoustic models, several experiments were carried out with corpora of Portuguese speakers speaking English, of European Portuguese, and of English. First, the behaviour of the native-English and native-Portuguese models was explored separately, testing each against the native and non-native test corpora. Next, another model was trained using as training corpus the audio of Portuguese speakers speaking English together with that of native English speakers. A further experiment used adaptation techniques, namely Maximum Likelihood Linear Regression (MLLR). MLLR adapts an initial model to a particular speaker characteristic, in this case the foreign accent: from a small amount of data representing the characteristic to be modelled, the technique computes a set of transformations that are applied to the model being adapted.
The field of phonetic modelling was also explored, studying how foreign speakers, in this case Portuguese speakers of English, pronounce the foreign language. This study was carried out with the help of a linguist, who defined a phone set, the result of mapping the English phone inventory onto the Portuguese one, representing the English spoken by Portuguese speakers of a given prestige group. Given the great variability of pronunciations, this group had to be defined according to the speakers' level of literacy. The study was later used to train a new model on the corpora of Portuguese speakers speaking English and of native Portuguese, yielding a native-Portuguese recognizer in which English terms can also be recognized. Within the speech recognition theme, this project also covered the collection of corpora for European Portuguese and the compilation of a European Portuguese lexicon. In the corpus acquisition area, the author was involved in extracting and preparing telephone speech data for the subsequent training of new European Portuguese acoustic models. The lexicon was compiled with a semi-automatic incremental method: the pronunciations of batches of 10 thousand words were generated automatically, each batch was reviewed and corrected by a linguist, and each reviewed batch was then used to improve the automatic pronunciation-generation rules.

The tremendous growth of technology has increased the need to integrate spoken language technologies into our daily applications, providing easy and natural access to information. These applications are of different natures, with different user interfaces.
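The phone-mapping idea can be sketched as a lookup from English phones onto their closest Portuguese counterparts, used to derive pronunciations in a non-native phone set. The SAMPA-like symbols and the mapping below are illustrative assumptions, not the linguist-defined inventory from the thesis:

```python
# Illustrative sketch: map English phones to Portuguese ones to build a
# non-native pronunciation lexicon. Symbols and mappings are hypothetical.
EN_TO_PT = {
    "T": "t",    # English "th" (thin) has no Portuguese counterpart -> /t/
    "D": "d",    # voiced "th" (this) -> /d/
    "I": "i",    # lax "i" (bit) -> tense /i/
    "{": "E",    # "ae" (cat) -> open /E/
    "r\\": "R",  # English approximant "r" -> a Portuguese rhotic
}

def map_pronunciation(english_phones):
    """Replace each English phone with its mapped Portuguese phone,
    keeping phones shared by both inventories unchanged."""
    return [EN_TO_PT.get(p, p) for p in english_phones]

# "thin" /T I n/ as rendered with the mapped (non-native) phone set:
print(map_pronunciation(["T", "I", "n"]))  # ['t', 'i', 'n']
```

In the thesis this mapping was produced by a linguist for a defined prestige group; a table like the one above would only be a starting point for that manual review.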
Besides voice-enabled Internet portals or tourist information systems, automatic speech recognition systems can be used in the home, where the TV and other appliances could be voice controlled, discarding keyboard or mouse interfaces, or in mobile phones and palm-sized computers for hands-free and eyes-free manipulation. The development of these systems raises several known difficulties. One of them concerns the recognizer's accuracy in dealing with non-native speakers, whose phonetic pronunciations of a given language differ. The non-native accent can be more problematic than a dialect variation of the language, and the mismatch depends on the individual's speaking proficiency and mother tongue. Consequently, when the speaker's native language is not the one used to train the recognizer, there is a considerable loss in recognition performance. In this thesis, we examine the problem of non-native speech in a speaker-independent, large-vocabulary recognizer in which a small amount of non-native data was used for training. Several experiments were performed using Hidden Markov models trained with speech corpora containing European Portuguese native speakers, English native speakers, and English spoken by European Portuguese native speakers. Initially, the behaviour of the English native model and of the non-native English speakers' model was explored. Then a model was trained as a pool of accents, using different corpus weights for the English native speakers and for the English spoken by Portuguese speakers. Among adaptation techniques, the Maximum Likelihood Linear Regression (MLLR) method was used. We also explored how European Portuguese speakers pronounce the English language, studying the correspondences between the phone sets of the foreign and target languages. The result was a new phone set, the consequence of the mapping between the English and Portuguese phone sets.
Then a new model was trained with data of English spoken by Portuguese speakers and Portuguese native data. Concerning the speech recognition subject, this work had two further purposes: collecting Portuguese corpora and supporting the compilation of a Portuguese lexicon, adopting methods and algorithms to generate phonetic pronunciations automatically. The collected corpora were processed in order to train acoustic models to be used in the Exchange 2007 domain, namely in Outlook Voice Access.
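The MLLR technique used above applies an affine transform to the Gaussian means of the acoustic model, mu' = A mu + b, with A and b estimated by maximum likelihood from a small amount of adaptation data (here, accented speech). A minimal sketch of applying such a transform; the numeric values of A and b below are placeholders, since real MLLR estimates them from data:

```python
import numpy as np

# Sketch of the MLLR mean update mu' = A @ mu + b. In a real system, A and b
# are estimated by maximum likelihood from adaptation data (e.g. a few
# minutes of accented speech); here they are arbitrary placeholder values.
def mllr_adapt_means(means, A, b):
    """Apply one shared affine transform to every Gaussian mean vector."""
    return means @ A.T + b

means = np.array([[1.0, 0.0],
                  [0.0, 2.0]])   # two Gaussian means, 2-dim features
A = np.array([[1.1, 0.0],
              [0.0, 0.9]])       # placeholder regression matrix
b = np.array([0.05, -0.05])      # placeholder bias

adapted = mllr_adapt_means(means, A, b)
print(adapted)  # every mean shifted by the same affine transform
```

Because one transform (or a small set, via regression classes) is shared across many Gaussians, MLLR can shift an entire model toward an accent from very little adaptation data.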

    A comparison of features for large population speaker identification

    Bibliography: leaves 95-104. Speech recognition systems all have one criterion in common: they perform better in a controlled environment using clean speech. Though performance can be excellent, even exceeding human capabilities for clean speech, systems fail when presented with speech data from more realistic environments such as telephone channels. The difference between using a recognizer in clean and in noisy environments is extreme, and this is one of the major obstacles to producing commercial recognition systems for use in normal environments. It is this lack of performance of speaker recognition systems over telephone channels that this work addresses. The human auditory system is a speech recognizer with excellent performance, especially in noisy environments. Since humans are better than any machine at ignoring noise, auditory-based methods are promising approaches, as they attempt to model the workings of the human auditory system. These methods have been shown to outperform more conventional signal processing schemes for speech recognition, speech coding, word-recognition, and phone classification tasks. Since speaker identification has received a lot of attention in speech processing because of the real-world applications awaiting it, it is attractive to evaluate its performance using auditory models as features. Firstly, this study aims at improving the results for speaker identification. The improvements were made through the use of parameterized feature sets together with the application of cepstral mean removal for channel equalization. The study is further extended to compare an auditory-based model, the Ensemble Interval Histogram (EIH), with mel-scale features, which were shown to perform almost error-free on clean speech. The previous studies showing the EIH to be more robust to noise were conducted on speaker-dependent, small-population, isolated-word tasks, and are now extended to speaker-independent, larger-population, continuous speech.
This study investigates whether the EIH representation is more resistant to telephone noise than the mel-cepstrum, as was shown in the previous studies, now that, for the first time, it is applied to the speaker identification task using a state-of-the-art Gaussian mixture model system.
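Cepstral mean removal, used above for channel equalization, subtracts the per-utterance mean from each cepstral coefficient: a stationary convolutional channel (e.g. a telephone line) becomes an additive constant in the cepstral domain, so removing the mean largely cancels it. A minimal numpy sketch:

```python
import numpy as np

def cepstral_mean_removal(cepstra):
    """Subtract the per-utterance mean of each cepstral coefficient.
    cepstra: (num_frames, num_coeffs) array of e.g. mel-cepstral features.
    A stationary channel adds a constant vector to every frame in the
    cepstral domain, so removing the mean largely cancels it."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Simulate a channel: a constant offset added to every frame is removed.
clean = np.random.randn(200, 13)   # 200 frames, 13 coefficients
channel = np.full(13, 0.7)         # stationary channel offset
observed = clean + channel
equalized = cepstral_mean_removal(observed)
print(np.allclose(equalized, cepstral_mean_removal(clean)))  # True
```

The same normalization also discards the constant part of a speaker's own long-term spectrum, which is one reason it must be evaluated carefully for speaker identification rather than speech recognition.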

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications able to operate in real-world environments, such as mobile communication services and smart homes.

    Applied and Computational Linguistics

    The current state of applied and computational linguistics is reviewed, and linguistic theories of the 20th and early 21st centuries are analysed from the perspective of separating different aspects of language for formalized description in electronic linguistic resources. A critical overview is offered of such topical problems of applied (computational) linguistics as the compilation of computer lexicons and electronic text corpora, automatic natural language processing, automatic speech synthesis and recognition, machine translation, and the creation of intelligent robots capable of perceiving information in natural language. Intended for students and postgraduates in the humanities and for academic staff of higher education institutions of Ukraine.

    A non-linear polynomial approximation filter for robust speaker verification

    Bibliography: leaves 101-109

    Temporal Information in Data Science: An Integrated Framework and its Applications

    Data science is a well-known buzzword that is in fact composed of two distinct keywords, i.e., data and science. Data itself is of great importance: each analysis task begins from a set of examples. Based on this consideration, the present work starts with the analysis of a real-case scenario: the development of a data warehouse-based decision support system for an Italian contact center company. Then, relying on the information collected in the developed system, a set of machine learning-based analysis tasks were developed to answer specific business questions, such as employee work anomaly detection and automatic call classification. Although these initial applications rely on already available algorithms, as we shall see, some clever analysis workflows also had to be developed. Afterwards, continuously driven by real data and real-world applications, we turned to the question of how to handle temporal information within classical decision tree models. Our research led to the development of J48SS, a decision tree induction algorithm based on Quinlan's C4.5 learner, which is capable of dealing with temporal (e.g., sequential and time series) as well as atemporal (such as numerical and categorical) data within the same execution cycle. The decision tree has been applied to several real-world analysis tasks, proving its worth. A key characteristic of J48SS is its interpretability, an aspect that we specifically addressed through the study of an evolutionary decision tree pruning technique. Next, since a lot of work concerning the management of temporal information has already been done in the automated reasoning and formal verification fields, a natural direction was to investigate how such solutions may be combined with machine learning, following two main tracks.
First, we show, through the development of an enriched decision tree capable of encoding temporal information by means of interval temporal logic formulas, how a machine learning algorithm can successfully exploit temporal logic to perform data analysis. Then, we focus on the opposite direction, i.e., that of employing machine learning techniques to generate temporal logic formulas, considering a natural language processing scenario. Finally, as a conclusive development, the architecture of a system is proposed in which formal methods and machine learning techniques are seamlessly combined to perform anomaly detection and predictive maintenance tasks. Such an integration represents an original, thrilling research direction that may open up new ways of dealing with complex, real-world problems.
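J48SS builds on Quinlan's C4.5, which selects splits by gain ratio: information gain normalized by the split's own entropy, so that many-valued attributes are not unfairly favoured. A minimal sketch of that criterion for a categorical split (illustrative only, not the J48SS implementation):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """C4.5 split criterion: information gain divided by split info."""
    n = len(labels)
    gain, split_info = entropy(labels), 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        gain -= len(subset) / n * entropy(subset)
        split_info -= len(subset) / n * log2(len(subset) / n)
    return gain / split_info if split_info else 0.0

# A perfectly informative binary attribute gets gain ratio 1.0:
values = ["a", "a", "b", "b"]
labels = ["yes", "yes", "no", "no"]
print(gain_ratio(values, labels))  # 1.0
```

J48SS's contribution is to make sequential and time-series attributes competitive under the same greedy split selection, alongside numerical and categorical ones.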

    Automotive user interfaces for the support of non-driving-related activities

    Driving a car has changed a lot since the first car was invented. Today, drivers do not only maneuver the car to their destination but also perform a multitude of additional activities in the car. These include, for instance, activities related to assistive functions that are meant to increase driving safety and reduce the driver's workload. However, since drivers spend a considerable amount of time in the car, they often want to perform non-driving-related activities as well, in particular activities related to entertainment, communication, and productivity. The driver's need for such activities has vastly increased, particularly due to the success of smartphones and other mobile devices. As long as the driver is in charge of performing the actual driving task, such activities can distract the driver and may result in severe accidents. Due to these special requirements of the driving environment, the driver ideally performs such activities using appropriately designed in-vehicle systems. The challenge for such systems is to enable flexible and easily usable non-driving-related activities while maintaining and increasing driving safety at the same time. The main contribution of this thesis is a set of guidelines and exemplary concepts for automotive user interfaces that offer safe, diverse, and easy-to-use means to perform non-driving-related activities besides the regular driving tasks. Using empirical methods that are commonly used in human-computer interaction, we investigate various aspects of automotive user interfaces with the goal of supporting the design and development of future interfaces that facilitate non-driving-related activities. The first aspect relates to using physiological data to infer information about the driver's workload. As a second aspect, we propose a multimodal interaction style to facilitate the interaction with multiple activities in the car.
In addition, we introduce two concepts for the support of commonly used and demanded non-driving-related activities. For communication with the outside world, we investigate the driver's needs with regard to sharing ride details with remote persons in order to increase driving safety. Finally, we present a concept of time-adjusted activities (e.g., entertainment and productivity) which enables the driver to make use of periods in which only little attention is required. Starting with manual, non-automated driving, we also consider the rise of automated driving modes.

When cars were invented, they allowed the driver and potential passengers to get to a distant location. The only activities the driver was able and supposed to perform were related to maneuvering the vehicle, i.e., accelerating, decelerating, and steering. Today drivers perform many activities that go beyond these driving tasks, for example activities related to driving assistance, location-based information and navigation, entertainment, communication, and productivity. To perform these activities, drivers use functions provided by in-vehicle information systems. Many of these functions are meant to increase driving safety or to make the ride more enjoyable. The latter is important, since people spend a considerable amount of time in their cars and want to perform activities similar to those they are accustomed to from mobile devices. However, as long as the driver is responsible for driving, these activities can be distracting and put the driver, passengers, and the environment at risk. One goal for the development of automotive user interfaces is therefore to enable easy and appropriate operation of in-vehicle systems, such that driving tasks and non-driving-related activities can be performed easily and safely.
The main contribution of this thesis is a set of guidelines and exemplary concepts for automotive user interfaces that offer safe, diverse, and easy-to-use means to perform non-driving-related activities while driving. Using empirical methods that are commonly used in human-computer interaction, we approach various aspects of automotive user interfaces in order to support the design and development of future interfaces that also enable non-driving-related activities. Starting with manual, non-automated driving, we also consider the transition towards automated driving modes. As a first part, we look at the prerequisites that enable non-driving-related activities in the car. We propose guidelines for the design and development of automotive user interfaces that also support non-driving-related activities, including, for instance, rules on how to adapt or interrupt activities when the level of automation changes. To enable activities in the car, we propose a novel interaction concept that facilitates multimodal interaction by combining speech interaction and touch gestures. Moreover, we show how information about the driver's state (especially mental workload) can be inferred from physiological data. We conducted a real-world driving study to extract a data set with physiological and context data. This can help to better understand the driver's state, to adapt interfaces to the driver and driving situation, and to adapt the route selection process. Second, we propose two concepts for supporting non-driving-related activities that are frequently used and demanded in the car. For telecommunication, we propose a concept to increase driving safety when communicating with the outside world. This concept enables the driver to share different types of information with remote parties.
Thereby, the driver can choose between different levels of detail, ranging from abstract information such as ``Alice is driving right now'' up to sharing a video of the driving scene. We investigated drivers' needs on the go and derived guidelines for the design of communication-related functions in the car through an online survey and in-depth interviews. As a second aspect, we present an approach to offer time-adjusted entertainment and productivity tasks to the driver. The idea is to allow time-adjusted tasks during periods where the demand for the driver's attention is low, for instance at traffic lights or during a highly automated ride. Findings from a web survey and a case study demonstrate the feasibility of this approach. With the findings of this thesis we aim to provide a basis for future research and development in the domain of automotive user interfaces and non-driving-related activities in the transition from manual driving to highly and fully automated driving.

When the car was invented, it mainly allowed the driver and potential passengers to reach distant places. The only activities drivers could and were supposed to perform while driving concerned controlling the vehicle. Today, drivers carry out diverse activities that go beyond the original tasks and are not necessarily related to the actual driving task, covering, among other things, driver assistance, location-based information and navigation, entertainment, communication, and productivity. In-vehicle information systems provide drivers with functions to carry out these tasks while driving. Many of these functions improve driving safety or serve to make the ride pleasant. The latter is becoming ever more important, as people now spend a considerable amount of time in the car and do not want to forgo the activities and functions they have become accustomed to, for example, from using smartphones and tablets. As long as drivers must drive themselves, such activities can distract from the driving task and endanger the occupants or the surroundings. One goal in the development of automotive user interfaces is therefore a simple, adequate operation of such systems, so that the driving task and secondary activities can be carried out well and, above all, safely. The main contribution of this thesis comprises a set of guidelines and exemplary concepts for automotive user interfaces that enable safe, varied, and easy performance of activities beyond the actual driving task. Based on empirical methods of human-computer interaction, we present various solutions that support the development and design of such user interfaces. Starting from today's customary non-automated driving, we also consider aspects of automated driving. First, we examine the prerequisites for enabling activities beyond the driving task. We present a set of guidelines that supports the design and development of automotive user interfaces allowing secondary tasks, including, for example, advice on how activities can be adapted or interrupted when the level of automation changes during the ride. To support activities in the car, we present a novel interaction concept that enables multimodal in-vehicle interaction using voice commands and touch gestures. For automated vehicle systems, and for adapting the interaction possibilities to the driving situation, the driver's state (in particular the mental workload) is an important piece of information. Through a driving study in real road traffic, we generated a data set that comprises physiological data and context information and thus allows conclusions about the driver's state. With this information it becomes possible to understand the driver's state better, to adapt user interfaces to the current driving situation, and to adapt route selection. We also present two concrete concepts for supporting secondary activities that are already regularly carried out or demanded while driving. In the area of telecommunication, we present a concept that increases driving safety when communicating with people outside the car. The concept allows the driver to share different kinds of context information with communication partners, ranging from the abstract information that one is currently on the road by car up to sharing a live video of the current driving situation. In this regard, we collected and evaluated users' needs through a web survey and detailed interviews, and we present a prototypical concept as well as guidelines for how future in-vehicle communication tasks should be designed. As a second concept, we consider time-constrained tasks for entertainment and productivity in the vehicle. The idea here is to allow time-limited tasks during periods of low workload, for example while waiting at a traffic light or during a highly automated (partial) ride. Results from a web survey and a case study demonstrate the feasibility of this approach. With the results of this thesis, we aim to lay a basis for future research and development in the domain of automotive user interfaces, in particular to support non-driving-related tasks in the transition between manual driving and highly automated driving.
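The time-adjusted activities idea can be sketched as a simple gate: offer a task only when the estimated low-attention window (remaining red-light time, or an automated-driving stretch) exceeds the task's expected duration plus a safety margin. The rule and the numbers below are illustrative assumptions, not the concept implementation from the thesis:

```python
# Hypothetical sketch of gating time-adjusted tasks: a task is offered only
# if it fits into the estimated low-attention window, with a safety margin
# left for handing attention back to the driving task.
def offer_task(available_seconds, task_seconds, margin_seconds=5.0):
    """Return True if the task fits the low-attention window with margin."""
    return task_seconds + margin_seconds <= available_seconds

# At a traffic light with an estimated 40 s of red remaining:
print(offer_task(40.0, 20.0))  # True  -> a short task could be offered
print(offer_task(40.0, 38.0))  # False -> task would outlast the window
```

A real system would also have to estimate the window length from context data and interrupt or pause the task gracefully when the estimate proves wrong.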