175 research outputs found

    Because Syntax does Matter: Improving Predicate-Argument Structures Parsing Using Syntactic Features

    Parsing full-fledged predicate-argument structures in a deep syntax framework requires graphs to be predicted. Using the DeepBank (Flickinger et al., 2012) and the Predicate-Argument Structure treebank (Miyao and Tsujii, 2005) as a test field, we show how transition-based parsers, extended to handle connected graphs, benefit from the use of topologically different syntactic features such as dependencies, tree fragments, spines or syntactic paths, bringing a much-needed context to the parsing models, improving notably over long-distance dependencies and elided coordinate structures. By confirming this positive impact on an accurate 2nd-order graph-based parser (Martins and Almeida, 2014), we establish a new state-of-the-art on these data sets.

    Evaluating Parsers with Dependency Constraints

    Many syntactic parsers now score over 90% on English in-domain evaluation, but the remaining errors have been challenging to address and difficult to quantify. Standard parsing metrics provide a consistent basis for comparison between parsers, but do not illuminate what errors remain to be addressed. This thesis develops a constraint-based evaluation for dependency and Combinatory Categorial Grammar (CCG) parsers to address this deficiency. We examine the constrained and cascading impact, representing the direct and indirect effects of errors on parsing accuracy. This identifies errors that are the underlying source of problems in parses, compared to those which are a consequence of those problems. Kummerfeld et al. (2012) propose a static post-parsing analysis to categorise groups of errors into abstract classes, but this cannot account for cascading changes resulting from repairing errors, or limitations which may prevent the parser from applying a repair. In contrast, our technique is based on enforcing the presence of certain dependencies during parsing, whilst allowing the parser to choose the remainder of the analysis according to its grammar and model. We draw constraints for this process from gold-standard annotated corpora, grouping them into abstract error classes such as NP attachment, PP attachment, and clause attachment. By applying constraints from each error class in turn, we can examine how parsers respond when forced to correctly analyse each class. We show how to apply dependency constraints in three parsers: the graph-based MSTParser (McDonald and Pereira, 2006) and the transition-based ZPar (Zhang and Clark, 2011b) dependency parsers, and the C&C CCG parser (Clark and Curran, 2007b). Each is widely used and influential in the field, and each generates some form of predicate-argument dependencies. We compare the parsers, identifying common sources of error, and differences in the distribution of errors between constrained and cascaded impact.
Our work allows us to contrast the implementations of each parser, and how they respond to constraint application. Using our analysis, we experiment with new features for dependency parsing, which encode the frequency of proposed arcs in large-scale corpora derived from scanned books. These features are inspired by and extend the work of Bansal and Klein (2011). We target these features at the most notable errors, and show how they address some, but not all, of the difficult attachments across newswire and web text. CCG parsing is particularly challenging, as different derivations do not always generate different dependencies. We develop dependency hashing to address semantically redundant parses in n-best CCG parsing, and demonstrate its necessity and effectiveness. Dependency hashing substantially improves the diversity of n-best CCG parses, and improves a CCG reranker when used for creating training and test data. We show the intricacies of applying constraints to C&C, and describe instances where applying constraints causes the parser to produce a worse analysis. These results illustrate how algorithms which are relatively straightforward for constituency and dependency parsers are non-trivial to implement in CCG. This work has explored dependencies as constraints in dependency and CCG parsing. We have shown how dependency hashing can efficiently eliminate semantically redundant CCG n-best parses, and presented a new evaluation framework based on enforcing the presence of dependencies in the output of the parser. By otherwise allowing the parser to proceed as it would have, we avoid the assumptions inherent in other work. We hope this work will provide insights into the remaining errors in parsing, and target efforts to address those errors, creating better syntactic analysis for downstream applications.
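
The idea of removing semantically redundant n-best parses can be illustrated with a minimal sketch. It assumes each parse is available as a derivation identifier plus its dependencies, encoded here as hypothetical (head, label, dependent) triples. The name `dedupe_nbest` and this encoding are illustrative assumptions, not the thesis's or the C&C parser's actual implementation.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Hypothetical encoding of one dependency: (head, label, dependent).
Dependency = Tuple[str, str, str]


def dedupe_nbest(
    parses: Iterable[Tuple[str, Iterable[Dependency]]]
) -> List[str]:
    """Keep the first derivation for each distinct set of dependencies.

    Two derivations count as redundant when they yield exactly the same
    dependencies, regardless of the order in which those are listed.
    """
    seen: Set[FrozenSet[Dependency]] = set()
    kept: List[str] = []
    for derivation_id, deps in parses:
        key = frozenset(deps)  # hashable and order-independent
        if key not in seen:
            seen.add(key)
            kept.append(derivation_id)
    return kept
```

Using the dependency set itself as the key lets Python's set handle hashing and collision checks; a real parser would more likely compute a compact hash incrementally during chart construction.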

    Undirected dependency parsing

    Dependency parsers, which are widely used in natural language processing tasks, employ a representation of syntax in which the structure of sentences is expressed in the form of directed links (dependencies) between their words. In this article, we introduce a new approach to transition-based dependency parsing in which the parsing algorithm does not directly construct dependencies, but rather undirected links, which are then assigned a direction in a postprocessing step. We show that this alleviates error propagation, because undirected parsers do not need to observe the single-head constraint, resulting in better accuracy. Undirected parsers can be obtained by transforming existing directed transition-based parsers as long as they satisfy certain conditions. We apply this approach to obtain undirected variants of three different parsers (the Planar, 2-Planar, and Covington algorithms) and perform experiments on several data sets from the CoNLL-X shared tasks and on the Wall Street Journal portion of the Penn Treebank, showing that our approach is successful in reducing error propagation, produces improvements in parsing accuracy in most of the cases, and achieves results competitive with state-of-the-art transition-based parsers. Funding: Xunta de Galicia (Ref. CN2012/008, CN2012/317, CN2012/319); Ministerio de Ciencia e Innovación (Ref. TIN2010-18552-C03-01, TIN2010-18552-C03-0).
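
The postprocessing step, in which undirected links receive a direction, can be sketched for the simplest case. The sketch assumes the undirected links already form a tree over word indices and that the root word is known; under those assumptions, each word's head is the neighbour on the path towards the root. The function name `orient_links` is hypothetical, and the article's own reconstruction procedure may differ from this breadth-first traversal.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Optional, Tuple


def orient_links(
    links: Iterable[Tuple[int, int]], root: int
) -> Dict[int, Optional[int]]:
    """Map each word index to its head by walking outwards from the root.

    Assumes `links` is a set of undirected edges forming a tree that
    contains `root`. The root's head is None.
    """
    adjacency = defaultdict(list)
    for a, b in links:
        adjacency[a].append(b)
        adjacency[b].append(a)

    heads: Dict[int, Optional[int]] = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in heads:  # not yet attached to a head
                heads[neighbour] = node
                queue.append(neighbour)
    return heads
```

For example, with links `[(0, 2), (2, 1), (2, 3)]` and root `0`, word 2 is headed by 0, and words 1 and 3 are headed by 2.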

    Proceedings

    Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 268 pages. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15891

    Performance-oriented dependency parsing

    Many dependency parsers have been developed in the last decade. This book describes the motivation for developing yet another parser, MDParser. The state of the art is presented and the deficits of current developments are discussed. The main problem with current parsers is that the task of dependency parsing is treated independently of what happens before and after it. In practice, however, parsing is rarely done for its own sake, but rather in order to use the results in a follow-up application. Additionally, current parsers are accuracy-oriented and focus only on the quality of the results, neglecting other important properties, especially efficiency. Evaluating some NLP technologies is sometimes as difficult as the task itself. For dependency parsing this was long thought not to be the case; however, some recent works show that the current evaluation possibilities are limited. This book proposes a methodology to account for the weaknesses and combine the strengths of the current approaches. Finally, MDParser is evaluated against other state-of-the-art parsers. The results show that it is the fastest parser currently available and that it is able to process plain text, which other parsers usually cannot. Its results are slightly behind the top accuracies in the field; however, this is demonstrated not to be decisive for applications.

    Parsing and Evaluation. Improving Dependency Grammars Accuracy. Anàlisi Sintàctica Automàtica i Avaluació. Millora de qualitat per a Gramàtiques de Dependències

    Because parsers are still limited in analysing specific ambiguous constructions, the research presented in this thesis mainly aims to contribute to the improvement of parsing performance when it has knowledge integrated in order to deal with ambiguous linguistic phenomena. More precisely, this thesis intends to provide empirical solutions to the disambiguation of prepositional phrase attachment and argument recognition in order to assist parsers in generating a more accurate syntactic analysis. The disambiguation of these two highly ambiguous linguistic phenomena by the integration of knowledge about the language necessarily relies on linguistic and statistical strategies for knowledge acquisition. The starting point of this research proposal is the development of a rule-based grammar for Spanish and for Catalan following the theoretical basis of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988) in order to carry out two experiments about the integration of automatically acquired knowledge. In order to build two robust grammars that understand a sentence, the FreeLing pipeline (Padró et al., 2010) has been used as a framework. On the other hand, an eclectic repertoire of criteria about the nature of syntactic heads is proposed by reviewing the postulates of Generative Grammar (Chomsky, 1981; Bonet and Solà, 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). Furthermore, a set of dependency relations is provided and mapped to Universal Dependencies (McDonald et al., 2013). In addition, an empirical evaluation method has been designed in order to carry out both a quantitative and a qualitative analysis. In particular, the dependency parsed trees generated by the grammars are compared to real linguistic data.
The quantitative evaluation is based on the Spanish Tibidabo Treebank (Marimon et al., 2014), which is large enough to carry out a real analysis of the grammars' performance and which has been annotated with the same formalism as the grammars, syntactic dependencies. Since the criteria between both resources are different, a process of harmonization has been applied, developing a set of rules that automatically adapt the criteria of the corpus to the grammar criteria. With regard to qualitative evaluation, there are no available resources to evaluate Spanish and Catalan dependency grammars qualitatively. For this reason, a test suite of syntactic phenomena about structure and word order has been built. In order to create a representative repertoire of the languages observed, descriptive grammars (Bosque and Demonte, 1999; Solà et al., 2002) and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) have been used for capturing relevant structures and word order patterns, respectively. Thanks to these two tools, two experiments have been carried out in order to prove that knowledge integration improves the parsing accuracy. On the one hand, the automatic learning of language models has been explored by means of statistical methods in order to disambiguate PP-attachment. More precisely, a model has been learned with a supervised classifier using Weka (Witten and Frank, 2005). Furthermore, an unsupervised model based on word embeddings has been applied (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method is limited in predicting solutions for unseen data, which is resolved by the unsupervised method since it provides a solution for any case. However, the unsupervised method is limited if it only learns from lexical data. For this reason, training data needs to be enriched with the lexical value of the preposition, as well as semantic and syntactic features.
In addition, the number of patterns used to learn language models has to be extended in order to have an impact on the grammars. On the other hand, another experiment is carried out in order to improve argument recognition in the grammars by the acquisition of linguistic knowledge. In this experiment, knowledge is acquired automatically from the extraction of verb subcategorization frames from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015), which contains the verb predicate and its arguments annotated syntactically. As a result of the information extracted, subcategorization frames have been classified into subcategorization classes regarding the patterns observed in the corpus. The results of the subcategorization classes integration in the grammars prove that this information increases the accuracy of the argument recognition in the grammars. The results of the research of this thesis show that grammars’ rules on their own are not expressive enough to resolve complex ambiguities. However, the integration of knowledge about these ambiguities in the grammars may be decisive in the disambiguation. On the one hand, statistical knowledge about PP-attachment can improve the grammars’ accuracy, but syntactic and semantic information, and new patterns of PP-attachment, need to be included in the language models in order to help disambiguate this phenomenon. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources shows a positive influence on grammars’ accuracy.
This thesis addresses the limitations that automatic syntactic parsers currently face. Despite the progress made in Natural Language Processing in recent years, language technologies and, in particular, automatic syntactic parsers have not been able to overcome certain structural ambiguities such as prepositional phrase attachment and argument recognition. For this reason, the research carried out in this thesis aims to bring significant quality improvements to automatic syntactic parsing by integrating linguistic and statistical knowledge to disambiguate ambiguous syntactic constructions. The starting point of the research has been the development of a rule-based grammar for Spanish and another for Catalan, following the postulates of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988), in order to carry out the experiments on automatic knowledge acquisition. To create two robust grammars that analyse and understand the sentence in depth, we relied on the architecture of FreeLing (Padró et al., 2010), a Natural Language Processing library that provides an automatic linguistic analysis of the sentence. In addition, an eclectic proposal of linguistic criteria has been drawn up to determine the formation of phrases and clauses in the grammar, by reviewing the theoretical proposals of Generative Grammar (Chomsky, 1981; Bonet and Solà, 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). This proposal is accompanied by a list of the dependency relation labels used by the grammar rules. Besides drawing up this list, correspondences have been established with the Universal Dependencies annotation standard (McDonald et al., 2013).
At the same time, an empirical evaluation system has been designed that takes quantitative and qualitative analysis into account in order to assess the experimental results fully. It is an empirical task precisely because the analyses generated by the grammars are compared with real language data. For the quantitative evaluation, the Tibidabo corpus (Marimon et al., 2014) has been used, which is available only in Spanish, is large enough to build a real analysis of the grammars, and has been annotated with the same formalism as the grammars. Specifically, since the criteria of the grammars and of the corpus do not coincide, a harmonization process has been carried out by means of manually created rules that automatically adapt the structure and the dependency relations of the corpus to the criteria of the grammars. As for the qualitative evaluation, since no resources are available for Spanish and Catalan, we designed a test suite of syntactic phenomena concerning structure and word order. To create a representative repertoire of the languages studied, descriptive grammars (Bosque and Demonte, 1999; Solà et al., 2002) were used to supply the repertoire of syntactic structures, and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) to capture sentence word order automatically. Thanks to these two tools, two experiments could be carried out to test whether integrating knowledge into automatic syntactic parsing improves its quality. On the one hand, the learning of language models by means of statistical models has been explored in order to propose solutions to prepositional phrase attachment.
More specifically, a language model was developed using a supervised learning classifier from Weka (Witten and Frank, 2005). In addition, a language model was learned with an unsupervised method based on the distributional approach known as word embeddings (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method has serious limitations in giving an answer for data it has not previously seen, which the unsupervised method overcomes because it is able to classify any case. Nevertheless, the unsupervised method studied is limited if it learns from lexical data alone. For this reason, the data used to train the model must contain the value of the preposition as well as syntactic and semantic features. Moreover, the number of learned patterns must be increased in order to extend the coverage of the models and have an impact on the results of the grammars. On the other hand, a way of improving argument recognition in the grammars through the acquisition of linguistic knowledge has been proposed. In this experiment, knowledge was extracted automatically, in the form of verb subcategorization classes, from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015), which contains the verbal predicate and its arguments annotated syntactically. From the extracted information, the various verbal diatheses were classified into verb subcategorization classes according to the patterns observed in the corpus. The results of integrating the subcategorization classes into the grammars show that this information has a positive effect on argument recognition. The results of the research carried out in this doctoral thesis show that the grammar rules are not expressive enough on their own to resolve complex ambiguities of language.
However, integrating knowledge about these ambiguities can be decisive when proposing a solution. On the one hand, statistical knowledge about prepositional phrase attachment can improve the quality of the grammars, but to confirm this it is necessary to include syntactic and semantic information in the machine learning models and to capture more patterns that contribute to the disambiguation of complex phenomena. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources has a decisive influence on the quality of the grammars for automatic syntactic parsing.

    A* CCG Parsing with a Supertag-factored Model

    We introduce a new CCG parsing model which is factored on lexical category assignments. Parsing is then simply a deterministic search for the most probable category sequence that supports a CCG derivation. The parser is extremely simple, with a tiny feature set, no POS tagger, and no statistical model of the derivation or dependencies. Formulating the model in this way allows a highly effective heuristic for A* parsing, which makes parsing extremely fast. Compared to the standard C&C CCG parser, our model is more accurate out-of-domain, is four times faster, has higher coverage, and is greatly simplified. We also show that using our parser improves the performance of a state-of-the-art question answering system.
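
The abstract does not spell out the A* heuristic. One natural admissible bound for a model factored on lexical categories is sketched below, under the assumption that the score of a parse is the sum of the log-probabilities of its per-word categories: the best possible score outside a span is then the sum, over the words outside it, of each word's highest category log-probability. The function name `make_outside_bound` and the input format (one category-to-probability dictionary per word) are hypothetical.

```python
import math
from typing import Callable, Dict, List


def make_outside_bound(
    tag_probs: List[Dict[str, float]]
) -> Callable[[int, int], float]:
    """Return h(i, j): an upper bound on the log-score of words outside [i, j).

    tag_probs[k] maps each candidate category of word k to its probability.
    """
    # Best achievable log-probability for each word on its own.
    best = [max(math.log(p) for p in dist.values()) for dist in tag_probs]

    # prefix[k] holds the sum of best[0..k-1], so any span sum is O(1).
    prefix = [0.0]
    for value in best:
        prefix.append(prefix[-1] + value)
    total = prefix[-1]

    def outside_bound(i: int, j: int) -> float:
        inside = prefix[j] - prefix[i]
        return total - inside

    return outside_bound
```

Because no category sequence can score higher than the per-word maxima, the bound never underestimates the best completion, which is the property A* search needs to return the highest-scoring parse first.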