24 research outputs found

    Robust handling of out-of-vocabulary words in deep language processing

    Doctoral thesis, Informática (Ciências da Computação), Universidade de Lisboa, Faculdade de Ciências, 2014. Deep grammars handle complex grammatical phenomena with precision and are able to provide a semantic representation of their input sentences in some logical form amenable to computational processing, which makes such grammars desirable for advanced Natural Language Processing tasks. The robustness of these grammars still has room for improvement. If any word in a sentence is not present in the lexicon of the grammar, i.e. if it is an out-of-vocabulary (OOV) word, a full parse of that sentence may not be produced. Given that the occurrence of such words is inevitable, e.g. due to the lexical novelty that is intrinsic to natural languages, deep grammars need some mechanism to handle OOV words if they are to be used in applications that analyze unrestricted text. The aim of this work is thus to investigate ways of improving the handling of OOV words in deep grammars. The lexicon of a deep grammar is highly thorough, with words being assigned extremely detailed linguistic information. Accurately assigning similarly detailed information to OOV words calls for the development of novel approaches, since current techniques mostly rely on shallow features and a limited window of context, while in many cases the relevant information is to be found in wider linguistic structure and in long-distance relations. The solution proposed here is a classifier, SVM-TK, placed between the input and the grammar itself. This classifier can take a variety of features and assign to words the deep lexical types that the grammar can then use when faced with OOV words. The classifier is based on support vector machines which, through the use of kernels, allow the seamless use of features that encode linguistic structure. This dissertation focuses on the HPSG framework, but the method can be used in any framework where the lexical information can be encoded as a word tag. As a case study, we take LX-Gram, a computational grammar for Portuguese, and improve its robustness with respect to OOV verbs. Given that the subcategorization frame of a word is a substantial part of what is encoded in an HPSG deep lexical type, the classifier takes graphs encoding grammatical dependencies as features; at runtime, these dependencies are produced by a probabilistic dependency parser. The SVM-TK classifier is compared against the state-of-the-art approach to OOV handling, which consists of using a standard POS-tagger to assign lexical types, in essence doing POS-tagging with a highly granular tagset. Results show that SVM-TK improves on the state of the art, though the usual data-sparseness bottleneck means that this improvement only materializes when the amount of training data is large enough. Fundação para a Ciência e a Tecnologia (FCT, SFRH/BD/41465/2007).
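
    As a minimal sketch of the general idea behind such a classifier (not the dissertation's actual SVM-TK implementation), the snippet below shows a support-vector machine with a precomputed kernel assigning deep lexical types to verbs from dependency-edge features. The edge-overlap kernel, the toy training data and the type names are all invented for illustration; the real system uses proper tree kernels over richer linguistic structure.

        # Illustrative sketch only (hypothetical data and type names): a precomputed-kernel
        # SVM that assigns deep lexical types to verbs from dependency-edge features.
        from collections import Counter
        import numpy as np
        from sklearn.svm import SVC

        # Each verb occurrence is a bag of (relation, head_tag, dependent_tag) edges taken
        # from an automatic dependency parse; labels are HPSG-style deep lexical types.
        X_train = [
            [("SUBJ", "V", "N"), ("DOBJ", "V", "N")],
            [("SUBJ", "V", "N")],
            [("SUBJ", "V", "N"), ("OBL", "V", "P")],
        ]
        y_train = ["v-np", "v-intrans", "v-pp"]

        def edge_kernel(a, b):
            """Overlap of dependency edges between two occurrences (stand-in for a tree kernel)."""
            ca, cb = Counter(a), Counter(b)
            return float(sum((ca & cb).values()))

        def gram(xs, ys):
            return np.array([[edge_kernel(x, y) for y in ys] for x in xs])

        clf = SVC(kernel="precomputed")
        clf.fit(gram(X_train, X_train), y_train)

        # An OOV verb observed with a subject and a direct object should get a transitive type.
        X_new = [[("SUBJ", "V", "N"), ("DOBJ", "V", "N")]]
        print(clf.predict(gram(X_new, X_train)))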

    From UBGs to CFGs: A practical corpus-driven approach

    We present a simple and intuitive unsound corpus-driven approximation method for turning unification-based grammars (UBGs), such as HPSG, CLE, or PATR-II, into context-free grammars (CFGs). The method is unsound in that it does not generate a CFG whose language is a true superset of the language accepted by the original unification-based grammar. It is a corpus-driven method in that it relies on a corpus of parsed sentences and generates broader CFGs when given more input samples. Our open approach can be fine-tuned in different directions, allowing us to monotonically come close to the original parse trees by shifting more information into the context-free symbols. The approach has been fully implemented in Java. This report updates and extends the paper presented at the International Colloquium on Grammatical Inference (ICGI 2004) and presents further measurements.
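
    The core extraction step can be pictured with a short sketch (invented tree format and symbols, not the authors' Java implementation): every local tree observed in the parsed corpus contributes one context-free production, so the extracted CFG grows, and approximates the UBG more closely, as more samples are processed.

        # Toy illustration: harvest CFG productions from local trees in a parsed corpus.
        from collections import Counter

        # A parse tree as (label, [children]); leaves are plain strings.
        tree1 = ("S", [("NP", [("N", ["dogs"])]),
                       ("VP", [("V", ["bark"])])])
        tree2 = ("S", [("NP", [("Det", ["the"]), ("N", ["dog"])]),
                       ("VP", [("V", ["sleeps"])])])

        def extract_rules(tree, rules):
            label, children = tree
            if all(isinstance(c, str) for c in children):
                rules[(label, tuple(children))] += 1                   # lexical rule, e.g. N -> dog
            else:
                rules[(label, tuple(c[0] for c in children))] += 1     # phrasal rule, e.g. S -> NP VP
                for c in children:
                    extract_rules(c, rules)

        rules = Counter()
        for t in (tree1, tree2):
            extract_rules(t, rules)

        for (lhs, rhs), n in sorted(rules.items()):
            print(f"{lhs} -> {' '.join(rhs)}   # observed {n} time(s)")

    The paper's actual method additionally shifts feature information from the unification grammar into the context-free symbols to tighten the approximation; that refinement is omitted in this sketch.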

    Learning Efficient Disambiguation

    This dissertation analyses the computational properties of current performance models of natural language parsing, in particular Data Oriented Parsing (DOP), points out some of their major shortcomings, and suggests suitable solutions. It provides proofs that various problems of probabilistic disambiguation are NP-complete under instances of these performance models, and it argues that none of these models accounts for attractive efficiency properties of human language processing in limited domains, e.g. that frequent inputs are usually processed faster than infrequent ones. The central hypothesis of this dissertation is that these shortcomings can be eliminated by specializing the performance models to the limited domains. The dissertation addresses "grammar and model specialization" and presents a new framework, the Ambiguity-Reduction Specialization (ARS) framework, that formulates the necessary and sufficient conditions for successful specialization. The framework is instantiated into specialization algorithms and applied to specializing DOP. The novelties of these learning algorithms are that 1) they limit the hypothesis space to include only "safe" models, 2) they are expressed as constrained optimization formulae that minimize the entropy of the training tree-bank given the specialized grammar, under the constraint that the size of the specialized model does not exceed a predefined maximum, and 3) they enable integrating the specialized model with the original one in a complementary manner. The dissertation provides experiments with initial implementations and compares the resulting Specialized DOP (SDOP) models to the original DOP models, with encouraging results. (Comment: 222 pages)
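
    As a rough, purely illustrative picture of the constrained optimisation described above (not the ARS algorithms themselves), the following sketch replaces the entropy-minimising objective with a crude greedy heuristic: prefer candidate fragments that cover many training events per unit of size, and stop once a predefined model-size budget is reached. All fragment names and counts are invented.

        # Toy stand-in for "minimise entropy of the tree-bank subject to a size bound":
        # greedily select high-coverage-per-size fragments under a fixed size budget.
        candidates = {"f1": (3, 40), "f2": (5, 25), "f3": (2, 20), "f4": (4, 10), "f5": (1, 5)}
        MAX_SIZE = 8  # predefined maximum size of the specialized model

        ranked = sorted(candidates.items(), key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
        selected, used, covered = [], 0, 0
        for name, (size, freq) in ranked:
            if used + size <= MAX_SIZE:
                selected.append(name)
                used += size
                covered += freq

        total = sum(freq for _, freq in candidates.values())
        print(f"selected={selected} size={used}/{MAX_SIZE} coverage={covered}/{total}")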

    Complexity of Lexical Descriptions and its Relevance to Partial Parsing

    In this dissertation, we have proposed novel methods for robust parsing that integrate the flexibility of linguistically motivated lexical descriptions with the robustness of statistical techniques. Our thesis is that the computation of linguistic structure can be localized if lexical items are associated with rich descriptions (supertags) that impose complex constraints in a local context. However, increasing the complexity of descriptions makes the number of different descriptions for each lexical item much larger and hence increases the local ambiguity for a parser. This local ambiguity can be resolved by using supertag co-occurrence statistics collected from parsed corpora. We have explored these ideas in the context of the Lexicalized Tree-Adjoining Grammar (LTAG) framework, wherein supertag disambiguation provides a representation that is an almost-parse. We have used the disambiguated supertag sequence in conjunction with a lightweight dependency analyzer to compute noun groups, verb groups, dependency linkages and even partial parses. We have shown that a trigram-based supertagger achieves an accuracy of 92.1% on Wall Street Journal (WSJ) texts. Furthermore, we have shown that lightweight dependency analysis on the output of the supertagger identifies 83% of the dependency links accurately. We have exploited the representation of supertags with Explanation-Based Learning to improve parsing efficiency. In this approach, parsing in limited domains can be modeled as a finite-state transduction. We have implemented such a system for the ATIS domain, which improves parsing efficiency by a factor of 15. We have used the supertagger in a variety of applications to provide lexical descriptions at an appropriate granularity. In an information retrieval application, we show that the supertag-based system performs at higher levels of precision compared to a system based on part-of-speech tags. In an information extraction task, supertags are used in specifying extraction patterns. For language modeling applications, we view supertags as syntactically motivated class labels in a class-based language model. The distinction between recursive and non-recursive supertags is exploited in a sentence simplification application.
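
    A skeletal sketch of the supertagging idea follows (toy tags, toy counts, and greedy decoding rather than the trigram Viterbi model actually used): each word's candidate supertags are scored by a lexical probability times an n-gram contextual probability, and the highest-scoring tag is kept.

        # Skeletal supertagger sketch with invented counts; real systems use smoothed
        # trigram statistics over LTAG supertags and Viterbi decoding.
        from collections import defaultdict

        # P(supertag | word) and P(tag | previous two tags), both toy and unsmoothed.
        lexical = {
            "the":   {"t_det": 1.0},
            "price": {"t_np_head": 0.7, "t_vp_trans": 0.3},
            "rose":  {"t_vp_intrans": 0.6, "t_np_head": 0.4},
        }
        trigram = defaultdict(lambda: 0.1, {
            ("<s>", "t_det", "t_np_head"): 0.9,
            ("t_det", "t_np_head", "t_vp_intrans"): 0.8,
        })

        def supertag(words):
            prev2, prev1, out = "<s>", "<s>", []
            for w in words:
                # Greedy choice: lexical score times trigram context score.
                best = max(lexical[w], key=lambda t: lexical[w][t] * trigram[(prev2, prev1, t)])
                out.append(best)
                prev2, prev1 = prev1, best
            return out

        print(supertag(["the", "price", "rose"]))
        # e.g. ['t_det', 't_np_head', 't_vp_intrans']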

    A syntactic component for Vietnamese language processing


    Large-scale induction and evaluation of lexical resources from the Penn-II treebank

    In this paper we present a methodology for extracting subcategorisation frames based on an automatic LFG f-structure annotation algorithm for the Penn-II Treebank. We extract abstract syntactic function-based subcategorisation frames (LFG semantic forms), traditional CFG category-based subcategorisation frames, as well as mixed function/category-based frames, with or without preposition information for obliques and particle information for particle verbs. Our approach does not predefine frames, associates probabilities with frames conditional on the lemma, distinguishes between active and passive frames, and fully reflects the effects of long-distance dependencies in the source data structures. We extract 3586 verb lemmas and 14348 semantic form types (an average of 4 per lemma) with 577 frame types. We present a large-scale evaluation of the complete set of forms extracted against the full COMLEX resource.
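
    The counting step behind the lemma-conditioned probabilities can be sketched as follows (invented frame tuples and observations, standing in for the output of the f-structure annotation algorithm):

        # Minimal sketch: count subcategorisation frames per verb lemma and turn the
        # counts into probabilities conditional on the lemma. Toy data only.
        from collections import Counter, defaultdict

        # (lemma, frame) observations, e.g. ('give', ('SUBJ', 'OBJ', 'OBL:to')) for a
        # ditransitive use with a "to"-oblique.
        observations = [
            ("give", ("SUBJ", "OBJ", "OBL:to")),
            ("give", ("SUBJ", "OBJ", "OBJ2")),
            ("give", ("SUBJ", "OBJ", "OBL:to")),
            ("sleep", ("SUBJ",)),
        ]

        counts = defaultdict(Counter)
        for lemma, frame in observations:
            counts[lemma][frame] += 1

        # P(frame | lemma) = count(lemma, frame) / count(lemma)
        for lemma, frames in counts.items():
            total = sum(frames.values())
            for frame, n in frames.most_common():
                print(f"{lemma}  <{', '.join(frame)}>  p={n/total:.2f}")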

    Approximate text generation from non-hierarchical representations in a declarative framework

    This thesis is on Natural Language Generation. It describes a linguistic realisation system that translates the semantic information encoded in a conceptual graph into an English-language sentence. The use of a non-hierarchically structured semantic representation (conceptual graphs) and an approximate matching between semantic structures allows us to investigate a more general version of the sentence generation problem, in which one is not pre-committed to a choice of the syntactically prominent elements in the initial semantics. We show clearly how the semantic structure is declaratively related to a linguistically motivated syntactic representation: we use D-Tree Grammars, which stem from work on Tree-Adjoining Grammars. The declarative specification of the mapping between semantics and syntax allows different processing strategies to be exploited. A number of generation strategies have been considered: a pure top-down strategy and a chart-based generation technique which allows partially successful computations to be reused in other branches of the search space. Having a generator with increased paraphrasing power, as a consequence of using non-hierarchical input and approximate matching, raises the issue of whether certain 'better' paraphrases can be generated before others. We investigate preference-based processing in the context of generation.
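
    The approximate-matching idea can be illustrated with a small sketch (invented graph encoding and candidate realisations, not the thesis's conceptual-graph or D-Tree machinery): candidate sentences are ranked by how much of the input semantics they cover, penalising spurious material, so a preferred paraphrase can be chosen even without an exact match.

        # Toy approximate matching: rank candidate realisations by semantic overlap.
        input_graph = {("eat", "AGENT", "cat"), ("eat", "PATIENT", "fish"), ("fish", "ATTR", "raw")}

        candidates = {
            "the cat eats the fish":     {("eat", "AGENT", "cat"), ("eat", "PATIENT", "fish")},
            "the cat eats the raw fish": {("eat", "AGENT", "cat"), ("eat", "PATIENT", "fish"),
                                          ("fish", "ATTR", "raw")},
            "the fish eats the cat":     {("eat", "AGENT", "fish"), ("eat", "PATIENT", "cat")},
        }

        def score(sem):
            # Overlap with the input, penalised by relations not present in the input.
            return len(sem & input_graph) - 0.5 * len(sem - input_graph)

        for sentence, sem in sorted(candidates.items(), key=lambda kv: -score(kv[1])):
            print(f"{score(sem):5.1f}  {sentence}")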

    Statistical deep parsing for Spanish

    This document presents the development of a statistical HPSG parser for Spanish. HPSG is a deep linguistic formalism that combines syntactic and semantic information in the same representation and is capable of elegantly modeling many linguistic phenomena. Our research consists of the following steps: design of the HPSG grammar, construction of the corpus, implementation of the parsing algorithms, and evaluation of the parsers' performance. We created a simple yet powerful HPSG grammar for Spanish that models morphosyntactic information of words, syntactic combinatorial valence, and semantic argument structures in its lexical entries. The grammar uses thirteen very broad rules for attaching specifiers, complements, modifiers, clitics, relative clauses and punctuation symbols, and for modeling coordination. In a simplification of standard HPSG, the only type of long-distance dependency we model is the relative clause that modifies a noun phrase, and we use semantic role labeling as our semantic representation. We transformed the Spanish AnCora corpus using a semi-automatic process and analyzed it with our grammar implementation, creating a Spanish HPSG corpus of 517,237 words in 17,328 sentences (all of AnCora). We implemented several statistical parsing algorithms and trained them over this corpus; in particular, we aimed to test approaches based on neural networks. The implemented strategies are: a bottom-up baseline using bi-lexical comparisons or a multilayer perceptron; a CKY approach that uses the output of a supertagger; and a top-down approach that encodes word sequences with an LSTM network. We evaluated the performance of the implemented parsers and compared them with each other and against other existing Spanish parsers. Our top-down LSTM approach appears to be the best-performing parser on our test data, obtaining the highest scores (compared both to our other strategies and to external parsers) according to constituency metrics (87.57 unlabeled F1, 82.06 labeled F1), dependency metrics (91.32 UAS, 88.96 LAS), and SRL (87.68 unlabeled, 80.66 labeled), although the comparison against external parsers may be noisy due to the post-processing needed to adapt them to our format. We also defined a set of metrics to evaluate the identification of some particular linguistic phenomena, and the top-down LSTM parser outperformed the baselines on almost all of these metrics as well.
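
    To make the "CKY over supertagger output" strategy concrete, here is a minimal sketch with a toy lexicon standing in for the supertagger and a handful of invented rules; the actual parser works over the HPSG grammar's rules with statistical scores.

        # Minimal CKY recogniser over toy categories; the lexicon stands in for supertagger output.
        lexicon = {"el": "Det", "gato": "N", "duerme": "V"}
        unary = {"V": "VP"}                                  # simple unary projection
        binary = {("Det", "N"): "NP", ("NP", "VP"): "S"}     # toy binary rules

        def cky(words):
            n = len(words)
            chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
            for i, w in enumerate(words):
                cat = lexicon[w]
                chart[i][i + 1].add(cat)
                if cat in unary:
                    chart[i][i + 1].add(unary[cat])
            for span in range(2, n + 1):
                for i in range(n - span + 1):
                    k = i + span
                    for j in range(i + 1, k):
                        for left in chart[i][j]:
                            for right in chart[j][k]:
                                if (left, right) in binary:
                                    chart[i][k].add(binary[(left, right)])
            return chart[0][n]

        print(cky(["el", "gato", "duerme"]))   # -> {'S'} with this toy grammar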

    Parallel Natural Language Parsing: From Analysis to Speedup

    Electrical Engineering, Mathematics and Computer Science