251 research outputs found

    Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech

    Get PDF
    We describe a statistical approach for modeling dialogue acts in conversational speech, i.e., speech-act-like units such as Statement, Question, Backchannel, Agreement, Disagreement, and Apology. Our model detects and predicts dialogue acts based on lexical, collocational, and prosodic cues, as well as on the discourse coherence of the dialogue act sequence. The dialogue model is based on treating the discourse structure of a conversation as a hidden Markov model and the individual dialogue acts as observations emanating from the model states. Constraints on the likely sequence of dialogue acts are modeled via a dialogue act n-gram. The statistical dialogue grammar is combined with word n-grams, decision trees, and neural networks modeling the idiosyncratic lexical and prosodic manifestations of each dialogue act. We develop a probabilistic integration of speech recognition with dialogue modeling, to improve both speech recognition and dialogue act classification accuracy. Models are trained and evaluated using a large hand-labeled database of 1,155 conversations from the Switchboard corpus of spontaneous human-to-human telephone speech. We achieved good dialogue act labeling accuracy (65% based on errorful, automatically recognized words and prosody, and 71% based on word transcripts, compared to a chance baseline accuracy of 35% and human accuracy of 84%) and a small reduction in word recognition error.Comment: 35 pages, 5 figures. Changes in copy editing (note title spelling changed

    Framework for Human Computer Interaction for Learning Dialogue Strategies using Controlled Natural Language in Information Systems

    Get PDF
    Spoken Language systems are going to have a tremendous impact in all the real world applications, be it healthcare enquiry, public transportation system or airline booking system maintaining the language ethnicity for interaction among users across the globe. These system have the capability of interacting with the user in di erent languages that the system supports. Normally when a person interacts with another person there are many non-verbal clues which guide the dialogue and all the utterances have a contextual relationship, which manage the dialogue as its mixed by the two speakers. Human Computer Interaction has a wide impact on the design of the applications and has become one of the emerging interest area of the researchers. All of us are witness to an explosive electronic revolution where lots of gadgets and gizmo's have surrounded us, advanced not only in power, design, applications but the ease of access or what we call user friendly interfaces are designed that we can easily use and control all the functionality of the devices. Since speech is one of the most intuitive form of interaction that humans use. It provides potential bene ts such as handfree access to machines, ergonomics and greater e ciency of interaction. Yet, speech-based interfaces design has been an expert job for a long time. Lot of research has been done in building real spoken Dialogue Systems which can interact with humans using voice interactions and help in performing various tasks as are done by humans. Last two decades have seen utmost advanced research in the automatic speech recognition, dialogue management, text to speech synthesis and Natural Language Processing for various applications which have shown positive results. This dissertation proposes to apply machine learning (ML) techniques to the problem of optimizing the dialogue management strategy selection in the Spoken Dialogue system prototype design. Although automatic speech recognition and system initiated dialogues where the system expects an answer in the form of `yes' or `no' have already been applied to Spoken Dialogue Systems( SDS), no real attempt to use those techniques in order to design a new system from scratch has been made. In this dissertation, we propose some novel ideas in order to achieve the goal of easing the design of Spoken Dialogue Systems and allow novices to have access to voice technologies. A framework for simulating and evaluating dialogues and learning optimal dialogue strategies in a controlled Natural Language is proposed. The simulation process is based on a probabilistic description of a dialogue and on the stochastic modelling of both arti cial NLP modules composing a SDS and the user. This probabilistic model is based on a set of parameters that can be tuned from the prior knowledge from the discourse or learned from data. The evaluation is part of the simulation process and is based on objective measures provided by each module. Finally, the simulation environment is connected to a learning agent using the supplied evaluation metrics as an objective function in order to generate an optimal behaviour for the SDS

    Statistical language modelling of dialogue material in the British national corpus.

    Get PDF
    Statistical language modelling may not only be used to uncover the patterns which underlie the composition of utterances and texts, but also to build practical language processing technology. Contemporary language applications in automatic speech recognition, sentence interpretation and even machine translation exploit statistical models of language. Spoken dialogue systems, where a human user interacts with a machine via a speech interface in order to get information, make bookings, complaints, etc., are example of such systems which are now technologically feasible. The majority of statistical language modelling studies to date have concentrated on written text material (or read versions thereof). However, it is well-known that dialogue is significantly different from written text in its lexical content and sentence structure. Furthermore, there are expected to be significant logical, thematic and lexical connections between successive turns within a dialogue, but "turns" are not generally meaningful in written text. There is therefore a need for statistical language modeling studies to be performed on dialogue, particularly with a longer-term aim to using such models in human-machine dialogue interfaces. In this thesis, I describe the studies I have carried out on statistically modelling the dialogue material within the British National Corpus (BNC) - a very large corpus of modern British English compiled during the 1990s. This thesis presents a general introductory survey of the field of automatic speech recognition. This is followed by a general introduction to some standard techniques of statistical language modelling which will be employed later in the thesis. The structure of dialogue is discussed using some perspectives from linguistic theory, and reviews some previous approaches (not necessarily statistical) to modelling dialogue. Then a qualitative description is given of the BNC and the dialogue data within it, together with some descriptive statistics relating to it and results from constructing simple trigram language models for both dialogue and text data. The main part of the thesis describes experiments on the application of statistical language models based on word caches, word "trigger" pairs, and turn clustering to the dialogue data. Several different approaches are used for each type of model. An analysis of the strengths and weaknesses of these techniques is then presented. The results of the experiments lead to a better understanding of how statistical language modelling might be applied to dialogue for the benefit of future language technologies

    Intelligent Techniques to Accelerate Everyday Text Communication

    Get PDF
    People with some form of speech- or motor-impairments usually use a high-tech augmentative and alternative communication (AAC) device to communicate with other people in writing or in face-to-face conversations. Their text entry rate on these devices is slow due to their motor abilities. Making good letter or word predictions can help accelerate the communication of such users. In this dissertation, we investigated several approaches to accelerate input for AAC users. First, considering that an AAC user is participating in a face-to-face conversation, we investigated whether performing speech recognition on the speaking-side can improve next word predictions. We compared the accuracy of three plausible microphone deployment options and the accuracy of two commercial speech recognition engines. We found that despite recognition word error rates of 7-16%, our ensemble of n-gram and recurrent neural network language models made predictions nearly as good as when they used the reference transcripts. In a user study with 160 participants, we also found that increasing number of prediction slots in a keyboard interface does not necessarily correlate to improved performance. Second, typing every character in a text message may require an AAC user more time or effort than strictly necessary. Skipping spaces or other characters may be able to speed input and reduce an AAC user\u27s physical input effort. We designed a recognizer optimized for expanding noisy abbreviated input where users often omitted spaces and mid-word vowels. We showed using neural language models for selecting conversational-style training text and for rescoring the recognizer\u27s n-best sentences improved accuracy. We found accurate abbreviated input was possible even if a third of characters was omitted. In a study where users had to dwell for a second on each key, we found sentence abbreviated input was competitive with a conventional keyboard with word predictions. Finally, AAC keyboards rely on language modeling to auto-correct noisy typing and to offer word predictions. While today language models can be trained on huge amounts of text, pre-trained models may fail to capture the unique writing style and vocabulary of individual users. We demonstrated improved performance compared to a unigram cache by adapting to a user\u27s text via language models based on prediction by partial match (PPM) and recurrent neural networks. Our best model ensemble increased keystroke savings by 9.6%

    Accessing spoken interaction through dialogue processing [online]

    Get PDF
    Zusammenfassung Unser Leben, unsere Leistungen und unsere Umgebung, alles wird derzeit durch Schriftsprache dokumentiert. Die rasante Fortentwicklung der technischen Möglichkeiten Audio, Bilder und Video aufzunehmen, abzuspeichern und wiederzugeben kann genutzt werden um die schriftliche Dokumentation von menschlicher Kommunikation, zum Beispiel Meetings, zu unterstützen, zu ergänzen oder gar zu ersetzen. Diese neuen Technologien können uns in die Lage versetzen Information aufzunehmen, die anderweitig verloren gehen, die Kosten der Dokumentation zu senken und hochwertige Dokumente mit audiovisuellem Material anzureichern. Die Indizierung solcher Aufnahmen stellt die Kerntechnologie dar um dieses Potential auszuschöpfen. Diese Arbeit stellt effektive Alternativen zu schlüsselwortbasierten Indizes vor, die Suchraumeinschränkungen bewirken und teilweise mit einfachen Mitteln zu berechnen sind. Die Indizierung von Sprachdokumenten kann auf verschiedenen Ebenen erfolgen: Ein Dokument gehört stilistisch einer bestimmten Datenbasis an, welche durch sehr einfache Merkmale bei hoher Genauigkeit automatisch bestimmt werden kann. Durch diese Art von Klassifikation kann eine Reduktion des Suchraumes um einen Faktor der Größenordnung 4­10 erfolgen. Die Anwendung von thematischen Merkmalen zur Textklassifikation bei einer Nachrichtendatenbank resultiert in einer Reduktion um einen Faktor 18. Da Sprachdokumente sehr lang sein können müssen sie in thematische Segmente unterteilt werden. Ein neuer probabilistischer Ansatz sowie neue Merkmale (Sprecherinitia­ tive und Stil) liefern vergleichbare oder bessere Resultate als traditionelle schlüsselwortbasierte Ansätze. Diese thematische Segmente können durch die vorherrschende Aktivität charakterisiert werden (erzählen, diskutieren, planen, ...), die durch ein neuronales Netz detektiert werden kann. Die Detektionsraten sind allerdings begrenzt da auch Menschen diese Aktivitäten nur ungenau bestimmen. Eine maximale Reduktion des Suchraumes um den Faktor 6 ist bei den verwendeten Daten theoretisch möglich. Eine thematische Klassifikation dieser Segmente wurde ebenfalls auf einer Datenbasis durchgeführt, die Detektionsraten für diesen Index sind jedoch gering. Auf der Ebene der einzelnen Äußerungen können Dialogakte wie Aussagen, Fragen, Rückmeldungen (aha, ach ja, echt?, ...) usw. mit einem diskriminativ trainierten Hidden Markov Model erkannt werden. Dieses Verfahren kann um die Erkennung von kurzen Folgen wie Frage/Antwort­Spielen erweitert werden (Dialogspiele). Dialogakte und ­spiele können eingesetzt werden um Klassifikatoren für globale Sprechstile zu bauen. Ebenso könnte ein Benutzer sich an eine bestimmte Dialogaktsequenz erinnern und versuchen, diese in einer grafischen Repräsentation wiederzufinden. In einer Studie mit sehr pessimistischen Annahmen konnten Benutzer eines aus vier ähnlichen und gleichwahrscheinlichen Gesprächen mit einer Genauigkeit von ~ 43% durch eine graphische Repräsentation von Aktivität bestimmt. Dialogakte könnte in diesem Szenario ebenso nützlich sein, die Benutzerstudie konnte aufgrund der geringen Datenmenge darüber keinen endgültigen Aufschluß geben. Die Studie konnte allerdings für detailierte Basismerkmale wie Formalität und Sprecheridentität keinen Effekt zeigen. Abstract Written language is one of our primary means for documenting our lives, achievements, and environment. Our capabilities to record, store and retrieve audio, still pictures, and video are undergoing a revolution and may support, supplement or even replace written documentation. This technology enables us to record information that would otherwise be lost, lower the cost of documentation and enhance high­quality documents with original audiovisual material. The indexing of the audio material is the key technology to realize those benefits. This work presents effective alternatives to keyword based indices which restrict the search space and may in part be calculated with very limited resources. Indexing speech documents can be done at a various levels: Stylistically a document belongs to a certain database which can be determined automatically with high accuracy using very simple features. The resulting factor in search space reduction is in the order of 4­10 while topic classification yielded a factor of 18 in a news domain. Since documents can be very long they need to be segmented into topical regions. A new probabilistic segmentation framework as well as new features (speaker initiative and style) prove to be very effective compared to traditional keyword based methods. At the topical segment level activities (storytelling, discussing, planning, ...) can be detected using a machine learning approach with limited accuracy; however even human annotators do not annotate them very reliably. A maximum search space reduction factor of 6 is theoretically possible on the databases used. A topical classification of these regions has been attempted on one database, the detection accuracy for that index, however, was very low. At the utterance level dialogue acts such as statements, questions, backchannels (aha, yeah, ...), etc. are being recognized using a novel discriminatively trained HMM procedure. The procedure can be extended to recognize short sequences such as question/answer pairs, so called dialogue games. Dialog acts and games are useful for building classifiers for speaking style. Similarily a user may remember a certain dialog act sequence and may search for it in a graphical representation. In a study with very pessimistic assumptions users are able to pick one out of four similar and equiprobable meetings correctly with an accuracy ~ 43% using graphical activity information. Dialogue acts may be useful in this situation as well but the sample size did not allow to draw final conclusions. However the user study fails to show any effect for detailed basic features such as formality or speaker identity

    Advances in deep learning methods for speech recognition and understanding

    Full text link
    Ce travail expose plusieurs études dans les domaines de la reconnaissance de la parole et compréhension du langage parlé. La compréhension sémantique du langage parlé est un sous-domaine important de l'intelligence artificielle. Le traitement de la parole intéresse depuis longtemps les chercheurs, puisque la parole est une des charactéristiques qui definit l'être humain. Avec le développement du réseau neuronal artificiel, le domaine a connu une évolution rapide à la fois en terme de précision et de perception humaine. Une autre étape importante a été franchie avec le développement d'approches bout en bout. De telles approches permettent une coadaptation de toutes les parties du modèle, ce qui augmente ainsi les performances, et ce qui simplifie la procédure d'entrainement. Les modèles de bout en bout sont devenus réalisables avec la quantité croissante de données disponibles, de ressources informatiques et, surtout, avec de nombreux développements architecturaux innovateurs. Néanmoins, les approches traditionnelles (qui ne sont pas bout en bout) sont toujours pertinentes pour le traitement de la parole en raison des données difficiles dans les environnements bruyants, de la parole avec un accent et de la grande variété de dialectes. Dans le premier travail, nous explorons la reconnaissance de la parole hybride dans des environnements bruyants. Nous proposons de traiter la reconnaissance de la parole, qui fonctionne dans un nouvel environnement composé de différents bruits inconnus, comme une tâche d'adaptation de domaine. Pour cela, nous utilisons la nouvelle technique à l'époque de l'adaptation du domaine antagoniste. En résumé, ces travaux antérieurs proposaient de former des caractéristiques de manière à ce qu'elles soient distinctives pour la tâche principale, mais non-distinctive pour la tâche secondaire. Cette tâche secondaire est conçue pour être la tâche de reconnaissance de domaine. Ainsi, les fonctionnalités entraînées sont invariantes vis-à-vis du domaine considéré. Dans notre travail, nous adoptons cette technique et la modifions pour la tâche de reconnaissance de la parole dans un environnement bruyant. Dans le second travail, nous développons une méthode générale pour la régularisation des réseaux génératif récurrents. Il est connu que les réseaux récurrents ont souvent des difficultés à rester sur le même chemin, lors de la production de sorties longues. Bien qu'il soit possible d'utiliser des réseaux bidirectionnels pour une meilleure traitement de séquences pour l'apprentissage des charactéristiques, qui n'est pas applicable au cas génératif. Nous avons développé un moyen d'améliorer la cohérence de la production de longues séquences avec des réseaux récurrents. Nous proposons un moyen de construire un modèle similaire à un réseau bidirectionnel. L'idée centrale est d'utiliser une perte L2 entre les réseaux récurrents génératifs vers l'avant et vers l'arrière. Nous fournissons une évaluation expérimentale sur une multitude de tâches et d'ensembles de données, y compris la reconnaissance vocale, le sous-titrage d'images et la modélisation du langage. Dans le troisième article, nous étudions la possibilité de développer un identificateur d'intention de bout en bout pour la compréhension du langage parlé. La compréhension sémantique du langage parlé est une étape importante vers le développement d'une intelligence artificielle de type humain. Nous avons vu que les approches de bout en bout montrent des performances élevées sur les tâches, y compris la traduction automatique et la reconnaissance de la parole. Nous nous inspirons des travaux antérieurs pour développer un système de bout en bout pour la reconnaissance de l'intention.This work presents several studies in the areas of speech recognition and understanding. The semantic speech understanding is an important sub-domain of the broader field of artificial intelligence. Speech processing has had interest from the researchers for long time because language is one of the defining characteristics of a human being. With the development of neural networks, the domain has seen rapid progress both in terms of accuracy and human perception. Another important milestone was achieved with the development of end-to-end approaches. Such approaches allow co-adaptation of all the parts of the model thus increasing the performance, as well as simplifying the training procedure. End-to-end models became feasible with the increasing amount of available data, computational resources, and most importantly with many novel architectural developments. Nevertheless, traditional, non end-to-end, approaches are still relevant for speech processing due to challenging data in noisy environments, accented speech, and high variety of dialects. In the first work, we explore the hybrid speech recognition in noisy environments. We propose to treat the recognition in the unseen noise condition as the domain adaptation task. For this, we use the novel at the time technique of the adversarial domain adaptation. In the nutshell, this prior work proposed to train features in such a way that they are discriminative for the primary task, but non-discriminative for the secondary task. This secondary task is constructed to be the domain recognition task. Thus, the features trained are invariant towards the domain at hand. In our work, we adopt this technique and modify it for the task of noisy speech recognition. In the second work, we develop a general method for regularizing the generative recurrent networks. It is known that the recurrent networks frequently have difficulties staying on same track when generating long outputs. While it is possible to use bi-directional networks for better sequence aggregation for feature learning, it is not applicable for the generative case. We developed a way improve the consistency of generating long sequences with recurrent networks. We propose a way to construct a model similar to bi-directional network. The key insight is to use a soft L2 loss between the forward and the backward generative recurrent networks. We provide experimental evaluation on a multitude of tasks and datasets, including speech recognition, image captioning, and language modeling. In the third paper, we investigate the possibility of developing an end-to-end intent recognizer for spoken language understanding. The semantic spoken language understanding is an important step towards developing a human-like artificial intelligence. We have seen that the end-to-end approaches show high performance on the tasks including machine translation and speech recognition. We draw the inspiration from the prior works to develop an end-to-end system for intent recognition

    An Overview on Language Models: Recent Developments and Outlook

    Full text link
    Language modeling studies the probability distributions over strings of texts. It is one of the most fundamental tasks in natural language processing (NLP). It has been widely used in text generation, speech recognition, machine translation, etc. Conventional language models (CLMs) aim to predict the probability of linguistic sequences in a causal manner. In contrast, pre-trained language models (PLMs) cover broader concepts and can be used in both causal sequential modeling and fine-tuning for downstream applications. PLMs have their own training paradigms (usually self-supervised) and serve as foundation models in modern NLP systems. This overview paper provides an introduction to both CLMs and PLMs from five aspects, i.e., linguistic units, structures, training methods, evaluation methods, and applications. Furthermore, we discuss the relationship between CLMs and PLMs and shed light on the future directions of language modeling in the pre-trained era
    corecore