43 research outputs found

    Statistical Deep Parsing for Spanish

    This document presents the development of a statistical HPSG parser for Spanish. HPSG is a deep linguistic formalism that combines syntactic and semantic information in the same representation, and is capable of elegantly modeling many linguistic phenomena. Our research consists of the following steps: design of the HPSG grammar, construction of the corpus, implementation of the parsing algorithms, and evaluation of the parsers' performance. We created a simple yet powerful HPSG grammar for Spanish that models morphosyntactic information of words, syntactic combinatorial valence, and semantic argument structures in its lexical entries. The grammar uses thirteen very broad rules for attaching specifiers, complements, modifiers, clitics, relative clauses and punctuation symbols, and for modeling coordinations. In a simplification from standard HPSG, the only type of long-range dependency we model is the relative clause that modifies a noun phrase, and we use semantic role labeling as our semantic representation. We transformed the Spanish AnCora corpus using a semi-automatic process and analyzed it using our grammar implementation, creating a Spanish HPSG corpus of 517,237 words in 17,328 sentences (all of AnCora). We implemented several statistical parsing algorithms and trained them over this corpus; in particular, we aimed to test approaches based on neural networks. The implemented strategies are: a bottom-up baseline using bi-lexical comparisons or a multilayer perceptron; a CKY approach that uses the results of a supertagger; and a top-down approach that encodes word sequences using an LSTM network. We evaluated the performance of the implemented parsers and compared them with each other and against other existing Spanish parsers.
    Our LSTM top-down approach seems to be the best-performing parser over our test data, obtaining the highest scores (compared to our strategies and also to external parsers) according to constituency metrics (87.57 unlabeled F1, 82.06 labeled F1), dependency metrics (91.32 UAS, 88.96 LAS), and SRL (87.68 unlabeled, 80.66 labeled). The comparison against the external parsers might be noisy, however, because of the post-processing needed to adapt their output to our format. We also defined a set of metrics to evaluate the identification of some particular language phenomena, and the LSTM top-down parser outperformed the baselines in almost all of these metrics as well.
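
    The dependency metrics quoted in this abstract (UAS and LAS) can be illustrated with a minimal sketch. It assumes each token is represented by a (head index, label) pair; the function name `attachment_scores`, the toy sentence, and the labels are hypothetical and are not taken from the thesis or from AnCora.

```python
from typing import List, Tuple

# One arc per token: (index of the head token, dependency label).
# Head index 0 stands for the artificial root. This encoding is an assumption.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned gold/predicted arcs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted arcs must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    n = len(gold)
    # UAS: the predicted head matches the gold head, label ignored.
    uas_hits = sum(1 for (gh, _), (ph, _) in zip(gold, pred) if gh == ph)
    # LAS: both the head and the label match.
    las_hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


# Hypothetical four-token sentence with made-up labels.
gold = [(2, "spec"), (0, "root"), (2, "cd"), (3, "sn")]
pred = [(2, "spec"), (0, "root"), (2, "cc"), (2, "sn")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

    In this toy case three of four heads are correct and two of four arcs are correct with their labels, so LAS can never exceed UAS, which matches the ordering of the reported scores.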

    Porting a lexicalized-grammar parser to the biomedical domain

    This paper introduces a state-of-the-art, linguistically motivated statistical parser to the biomedical text mining community, and proposes a method of adapting it to the biomedical domain requiring only limited resources for data annotation. The parser was originally developed using the Penn Treebank and is therefore tuned to newspaper text. Our approach takes advantage of a lexicalized grammar formalism, Combinatory Categorial Grammar (CCG), to train the parser at a lower level of representation than full syntactic derivations. The CCG parser uses three levels of representation: a first level consisting of part-of-speech (POS) tags; a second level consisting of more fine-grained CCG lexical categories; and a third, hierarchical level consisting of CCG derivations. We find that simply retraining the POS tagger on biomedical data leads to a large improvement in parsing performance, and that using annotated data at the intermediate lexical category level of representation improves parsing accuracy further. We describe the procedure involved in evaluating the parser, and obtain accuracies for biomedical data in the same range as those reported for newspaper text, and higher than those previously reported for the biomedical resource on which we evaluate. Our conclusion is that porting newspaper parsers to the biomedical domain, at least for parsers which use lexicalized grammars, may not be as difficult as first thought.

    Construcción de recursos lingüísticos para una gramática HPSG para el español (Building linguistic resources for an HPSG grammar for Spanish)

    This work presents the construction of linguistic resources for working with an HPSG grammar for Spanish. HPSG is a rich grammatical formalism because the result of parsing with it is a representation of the sentence that includes both syntactic and semantic information. For English there are statistical HPSG parsers with high performance and broad coverage, but the existing tools for Spanish have not yet reached the same level. We describe an HPSG grammar for Spanish, presenting its main feature structures and its rules for combining expressions. A corpus of HPSG trees for Spanish was built using this grammar: starting from the AnCora corpus, the sentences were transformed by an automatic process, yielding a new corpus annotated according to the HPSG formalism. The transformation heuristics achieve 95.3% precision in head detection and 92.5% precision in argument classification. Lexical entries were defined from the corpus, and the entries of the lexical categories with the greatest combinatorial complexity (verbs, nouns and adjectives) were grouped according to their syntactic-semantic behavior. These groups of lexical entries are called lexical frames. On this basis, a supertagger was built to identify the most probable lexical frames given the words of a sentence. The supertagger has an accuracy of 83.58% for verbs, 85.78% for nouns and 81.40% for adjectives (considering the three most probable tags).
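
    The supertagger accuracy above counts a word as correct when its gold lexical frame is among the three most probable tags. A minimal sketch of that top-k measure follows, assuming the tagger returns a score per candidate frame; the function name `top_k_accuracy`, the frame names, and the scores are hypothetical and do not come from the paper.

```python
from typing import Dict, List


def top_k_accuracy(gold: List[str], scored: List[Dict[str, float]], k: int = 3) -> float:
    """Percentage of words whose gold frame is among the k best-scored frames."""
    if len(gold) != len(scored) or not gold:
        raise ValueError("need one non-empty score table per gold tag")
    hits = 0
    for tag, scores in zip(gold, scored):
        # Rank candidate frames from most to least probable and keep the first k.
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += tag in top
    return 100.0 * hits / len(gold)


# Hypothetical gold frames and tagger scores for four words.
gold = ["v_trans", "n_common", "adj_pred", "v_intrans"]
scored = [
    {"v_trans": 0.6, "v_intrans": 0.3, "v_ditrans": 0.1},
    {"n_common": 0.2, "n_proper": 0.5, "n_rel": 0.3},
    {"adj_attr": 0.5, "adj_pred": 0.1, "adj_a": 0.2, "adj_b": 0.2},
    {"v_intrans": 0.9},
]
print(top_k_accuracy(gold, scored, k=3), top_k_accuracy(gold, scored, k=1))
```

    Widening k from 1 to 3 can only keep or raise the score, which is why a top-3 figure is not directly comparable to a single-best tagging accuracy.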

    Syntactic phrase-based statistical machine translation

    Phrase-based statistical machine translation (PBSMT) systems represent the dominant approach in MT today. However, unlike systems in other paradigms, it has so far proven difficult to incorporate syntactic knowledge into them in order to improve translation quality. This paper improves on recent research which uses 'syntactified' target language phrases, by incorporating supertags as constraints to better resolve parse tree fragments. In addition, we do not impose any sentence-length limit, and using a log-linear decoder, we outperform a state-of-the-art PBSMT system by over 1.3 BLEU points (or 3.51% relative) on the NIST 2003 Arabic-English test corpus.

    Parsing with sparse annotated resources

    Thesis (S.M.) by Yuan Zhang, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013. Includes bibliographical references (p. 67-73).
    This thesis focuses on algorithms for parsing within the context of sparse annotated resources. Despite recent progress in parsing techniques, existing methods require significant resources for training. Therefore, current technology is limited when it comes to parsing sentences in new languages or new grammars. We propose methods for parsing when annotated resources are limited. In the first scenario, we explore an automatic method for mapping language-specific part-of-speech (POS) tags into a universal tagset. Universal tagsets play a crucial role in cross-lingual syntactic transfer of multilingual dependency parsers. Our central assumption is that a high-quality mapping yields POS annotations with coherent linguistic properties which are consistent across source and target languages. We encode this intuition in an objective function. Given the exponential size of the mapping space, we propose a novel method for optimizing the objective over mappings. Our results demonstrate that automatically induced mappings rival their manually designed counterparts when evaluated in the context of multilingual parsing. In the second scenario, we consider the problem of cross-formalism transfer in parsing. We are interested in parsing constituency-based grammars such as HPSG and CCG using a small amount of data annotated in the target formalisms and a large quantity of coarse CFG annotations from the Penn Treebank. While the trees annotated in all of the target formalisms share a similar basic syntactic structure with the Penn Treebank CFG, they also encode additional constraints and semantic features.
    To handle this apparent difference, we design a probabilistic model that jointly generates CFG and target formalism parses. The model includes features of both parses, enabling transfer between the formalisms, and preserves parsing efficiency. Experimental results show that across a range of formalisms, our model benefits from the coarse annotations.

    Parsing Combinatory Categorial Grammar with Answer Set Programming: Preliminary Report

    Combinatory categorial grammar (CCG) is a grammar formalism used for natural language parsing. CCG assigns structured lexical categories to words and uses a small set of combinatory rules to combine these categories to parse a sentence. In this work we propose and implement a new approach to CCG parsing that relies on a prominent knowledge representation formalism, answer set programming (ASP), a declarative programming paradigm. We formulate the task of CCG parsing as a planning problem and use an ASP computational tool to compute solutions that correspond to valid parses. Unlike other approaches, such a declarative method does not require implementing a specific parsing algorithm. Our approach aims at producing all semantically distinct parse trees for a given sentence. From this goal, normalization and efficiency issues arise, and we deal with them by combining and extending existing strategies. We have implemented a CCG parsing toolkit, AspCcgTk, that uses ASP as its main computational means. The C&C supertagger can be used as a preprocessor within AspCcgTk, which allows us to achieve wide-coverage natural language parsing. Comment: 12 pages, 2 figures, Proceedings of the 25th Workshop on Logic Programming (WLP 2011).