6 research outputs found

    Combining semantic and syntactic structure for language modeling

    Structured language models for speech recognition have been shown to remedy the weaknesses of n-gram models. All current structured language models are, however, limited in that they do not take into account dependencies between non-headwords. We show that non-headword dependencies contribute to significantly improved word error rate, and that a data-oriented parsing model trained on semantically and syntactically annotated data can exploit these dependencies. This paper also contains the first DOP model trained by means of a maximum likelihood reestimation procedure, which solves some of the theoretical shortcomings of previous DOP models. Comment: 4 pages

    An improved parser for data-oriented lexical-functional analysis

    We present an LFG-DOP parser which uses fragments from LFG-annotated sentences to parse new sentences. Experiments with the Verbmobil and Homecentre corpora show that (1) Viterbi n-best search performs about 100 times faster than Monte Carlo search while both achieve the same accuracy; (2) the DOP hypothesis, which states that parse accuracy increases with increasing fragment size, is confirmed for LFG-DOP; (3) LFG-DOP's relative frequency estimator performs worse than a discounted frequency estimator; and (4) LFG-DOP significantly outperforms Tree-DOP if evaluated on tree structures only. Comment: 8 pages

    Inducing Tree-Substitution Grammars

    Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context-free grammars and first-order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree-substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represents complex linguistic structures. To limit the model's complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model's efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. © 2010 Evangelos Theodorou, Jonas Buchli and Stefan Schaal

    Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing

    Natural Language Processing (NLP) is an interdisciplinary research field of Computer Science, Linguistics, and Pattern Recognition that studies, among other things, the use of human natural languages in Human-Computer Interaction (HCI). Most NLP research tasks can be applied to solving real-world problems. This is the case for natural language recognition and natural language translation, which can be used for building automatic systems for document transcription and document translation. Regarding digitalised handwritten text documents, transcription is used to obtain easy digital access to the contents, since simple image digitalisation only provides, in most cases, search by image and not by linguistic contents (keywords, expressions, syntactic or semantic categories). Transcription is even more important for historical manuscripts, since most of these documents are unique and the preservation of their contents is crucial for cultural and historical reasons. The transcription of historical manuscripts is usually done by paleographers, who are experts in ancient script and vocabulary. Recently, Handwritten Text Recognition (HTR) has become a common tool for assisting paleographers in their task, by providing a draft transcription that they may amend with more or less sophisticated methods. This draft transcription is useful when its error rate is low enough to make the amending process more comfortable than a complete transcription from scratch. Thus, obtaining a draft transcription with an acceptably low error rate is crucial to having this NLP technology incorporated into the transcription process. The work described in this thesis is focused on the improvement of the draft transcription offered by an HTR system, with the aim of reducing the effort made by paleographers to obtain the actual transcription of digitalised historical manuscripts.
This problem is addressed in three different but complementary scenarios:
· Multimodality: The use of HTR systems allows paleographers to speed up the manual transcription process, since they can correct a draft transcription. Another alternative is to obtain the draft transcription by dictating the contents to an Automatic Speech Recognition (ASR) system. When both sources (image and speech) are available, a multimodal combination is possible and an iterative process can be used in order to refine the final hypothesis.
· Interactivity: The use of assistive technologies in the transcription process reduces the time and human effort required to obtain the actual transcription, given that the assistive system and the paleographer cooperate to generate a perfect transcription. Multimodal feedback can be used to provide the assistive system with additional sources of information by using signals that represent the same whole sequence of words to transcribe (e.g. a text image, and the speech of the dictation of the contents of this text image), or that represent just a word or character to correct (e.g. an on-line handwritten word).
· Crowdsourcing: Open distributed collaboration emerges as a powerful tool for massive transcription at a relatively low cost, since the paleographers' supervision effort may be dramatically reduced. Multimodal combination allows the speech dictation of handwritten text lines to be used in a multimodal crowdsourcing platform, where collaborators may provide their speech by using their own mobile device instead of desktop or laptop computers, which makes it possible to recruit more collaborators.
Granell Romero, E. (2017). Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86137

    Data-oriented models of parsing and translation

    The merits of combining the positive elements of the rule-based and data-driven approaches to MT are clear: a combined model has the potential to be highly accurate, robust, cost-effective to build and adaptable. While the merits are clear, however, how best to combine these techniques into a model which retains the positive characteristics of each approach, while inheriting as few of the disadvantages as possible, remains an unsolved problem. One possible solution to this challenge is the Data-Oriented Translation (DOT) model originally proposed by Poutsma (1998, 2000, 2003), which is based on Data-Oriented Parsing (DOP) (e.g. Bod, 1992; Bod et al., 2003) and combines examples, linguistic information and a statistical translation model. In this thesis, we seek to establish how the DOT model of translation relates to the other main MT methodologies currently in use. We find that this model differs from other hybrid models of MT in that it inextricably interweaves the philosophies of the rule-based, example-based and statistical approaches in an integrated framework. Although DOT embodies many positive characteristics on a theoretical level, it also inherits the computational complexity associated with DOP. Previous experiments assessing the performance of the DOT model of translation were small in scale and the training data used was not ideally suited to the task (Poutsma, 2000, 2003). However, the algorithmic limitations of the DOT implementation used to perform these experiments prevented a more informative assessment from being carried out. In this thesis, we look to the innovative solutions developed to meet the challenges of implementing the DOP model, and investigate their application to DOT. This investigation culminates in the development of a DOT system; this system allows us to perform translation experiments which are on a larger scale and incorporate greater translational complexity than heretofore.
Our evaluation indicates that the positive characteristics of the model identified on a theoretical level are also in evidence when it is subjected to empirical assessment. For example, in terms of exact match accuracy, the DOT model outperforms an SMT model trained and tested on the same data by up to 89.73%. The DOP and DOT models for which we provide empirical evaluations assume context-free phrase-structure tree representations. However, such models can also be developed for more sophisticated linguistic formalisms. In this thesis, we also focus on the efforts which have been made to integrate the representations of Lexical-Functional Grammar (LFG) with DOP and DOT. We investigate the usefulness of the algorithms developed for DOP (and adapted here to Tree-DOT) when implementing the (more complex) LFG-DOP and LFG-DOT models. We examine how constraints are employed in these models for more accurate disambiguation and seek an alternative methodology for improved constraint specification. We also hypothesise as to how the constraints used to predict both good parses and good translations might be pruned in a motivated fashion. Finally, we explore the relationship between translational equivalence and limited generalisation reusability for both the tree-based and LFG-based DOT models, focussing on how this relationship differs depending on which formalism is assumed.