4 research outputs found

    Recurrent neural network language model adaptation for multi-genre broadcast speech recognition and alignment

    Recurrent neural network language models (RNNLMs) generally outperform n-gram language models when used in automatic speech recognition (ASR). Adapting RNNLMs to new domains is an open problem, and current approaches can be categorised as either feature-based or model-based. In feature-based adaptation, the input to the RNNLM is augmented with auxiliary features, whilst model-based adaptation includes model fine-tuning and the introduction of adaptation layer(s) in the network. In this paper, the properties of both types of adaptation are investigated on multi-genre broadcast speech recognition. Existing techniques for both types of adaptation are reviewed, and the proposed techniques for model-based adaptation, namely the linear hidden network (LHN) adaptation layer and the K-component adaptive RNNLM, are investigated. Moreover, new features derived from the acoustic domain are investigated for RNNLM adaptation. The contributions of this paper include two hybrid adaptation techniques: the fine-tuning of feature-based RNNLMs and a feature-based adaptation layer. Moreover, the semi-supervised adaptation of RNNLMs using genre information is also proposed. The ASR systems were trained using 700h of multi-genre broadcast speech. The gains obtained when using the RNNLM adaptation techniques proposed in this work are consistent when using RNNLMs trained on an in-domain set of 10M words and on a combination of in-domain and out-of-domain sets of 660M words, with approx. 10% perplexity and 2% relative word error rate improvements on a 28.3h test set. The best RNNLM adaptation techniques for ASR are also evaluated on a lightly supervised subtitle alignment task for the same data, where the use of RNNLM adaptation leads to an absolute increase in the F-measure of 0.5%.
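
    The feature-based adaptation described in this abstract, where the RNNLM input is augmented with auxiliary features, can be illustrated with a minimal sketch. It assumes a toy vocabulary and a one-hot genre feature; the names `VOCAB`, `GENRES`, `one_hot` and `augmented_input` are hypothetical and are not taken from the paper, which may encode its features differently.

```python
# Hypothetical sketch of feature-based adaptation as input augmentation.
# VOCAB and GENRES are toy assumptions, not the paper's actual inventories.
VOCAB = ["<s>", "the", "match", "begins"]
GENRES = ["news", "drama", "sport"]


def one_hot(index, size):
    """Return a list of `size` floats with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def augmented_input(word, genre):
    """Concatenate the word's one-hot vector with a genre one-hot vector.

    The resulting vector is what a feature-based RNNLM would receive at
    each time step in place of the plain word vector.
    """
    word_vec = one_hot(VOCAB.index(word), len(VOCAB))
    genre_vec = one_hot(GENRES.index(genre), len(GENRES))
    return word_vec + genre_vec


x = augmented_input("match", "sport")
print(len(x), x)
```

    The input layer grows by the size of the feature vector, so the same word can yield different predictions depending on the genre it appears in.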

    Identificação de Classes em Texto, Classificação não Supervisionada

    Document classification has been studied extensively, mostly through supervised classification; fewer studies use unsupervised classification. In supervised classification, each document carries a label corresponding to the class/topic to which it belongs. This simplifies the classification process and generally yields better results, in terms of Precision and Recall, than those obtained with the unsupervised option. However, there is a strong limitation: the classification of new elements is restricted to the classes indicated by the labels in the training phase, and the system cannot learn new classes without this explicit indication. In the unsupervised alternative, where there is no explicit indication of the class, the challenge consists mainly in detecting/mining which groups/classes of main topics are implicit in the data, that is, in the documents characterized by their attributes (features). In this way, new classes can be learned dynamically, as long as they are implicit in the data, that is, as long as the features are sufficiently characterizing. One of the goals of this dissertation was to develop a system capable of receiving a set of documents and grouping them by topic, based on their content. A second goal was to identify the main topics/subtopics of each group and to classify new documents according to what was learned in the training phase. The work involved the selection and reduction of features, the construction of the groups (clustering) and the classification itself.
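
    The workflow outlined in this abstract (group documents by content, then assign new documents to the learned groups) can be sketched as follows. This is a minimal illustration, not the dissertation's method: it assumes raw term counts as features, cosine similarity, and a simple k-means-style loop with hand-picked seed documents, and it omits the feature selection and reduction step entirely. All function names and example documents are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (assumed feature representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(vec, centroids):
    """Index of the centroid most similar to `vec`."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


def cluster(docs, seeds, iterations=10):
    """k-means-style clustering; `seeds` are indices of initial centroids."""
    vectors = [vectorize(d) for d in docs]
    centroids = [vectors[i] for i in seeds]
    for _ in range(iterations):
        labels = [nearest(v, centroids) for v in vectors]
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                # Summed counts suffice: cosine similarity ignores scale.
                centroids[k] = sum(members, Counter())
    labels = [nearest(v, centroids) for v in vectors]
    return labels, centroids


docs = [
    "the team won the football match",
    "the striker scored a goal in the match",
    "parliament passed the budget law",
    "the minister defended the budget in parliament",
]
labels, centroids = cluster(docs, seeds=[0, 2])
new_label = nearest(vectorize("a late goal decided the match"), centroids)
print(labels, new_label)
```

    Because no stop words are removed here, frequent words such as "the" inflate every similarity score, which is one reason a feature selection and reduction step like the one the abstract mentions matters in practice.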

    Terminology Integration in Statistical Machine Translation

    The electronic version does not contain the appendices. The doctoral thesis describes methods and tools researched and developed by the author for bilingual terminology integration into statistical machine translation (SMT) systems. The author presents novel methods for terminology integration in SMT systems during training (through static integration) and during translation (through dynamic integration). The work focuses not only on the SMT integration techniques, but also on methods for acquiring the linguistic resources needed for the different tasks involved in terminology integration workflows for SMT systems. The proposed methods have been evaluated in automatic and manual evaluation experiments. The results show that both static and dynamic integration methods significantly improve translation quality. The results described in the thesis have been validated in several research projects and deployed in practical solutions. Keywords: statistical machine translation, terminology, cross-lingual information extraction