Evaluating Parsers with Dependency Constraints
Many syntactic parsers now score over 90% on English in-domain evaluation, but the remaining errors have been challenging to address and difficult to quantify. Standard parsing metrics provide a consistent basis for comparison between parsers, but do not illuminate what errors remain to be addressed. This thesis develops a constraint-based evaluation for dependency and Combinatory Categorial Grammar (CCG) parsers to address this deficiency. We examine the constrained and cascading impact of errors, representing their direct and indirect effects on parsing accuracy. This distinguishes errors that are the underlying source of problems in a parse from those that are merely a consequence of other problems. Kummerfeld et al. (2012) propose a static post-parsing analysis that categorises groups of errors into abstract classes, but this cannot account for cascading changes resulting from repairing errors, or for limitations that may prevent the parser from applying a repair. In contrast, our technique is based on enforcing the presence of certain dependencies during parsing, whilst allowing the parser to choose the remainder of the analysis according to its grammar and model. We draw constraints for this process from gold-standard annotated corpora, grouping them into abstract error classes such as NP attachment, PP attachment, and clause attachment. By applying constraints from each error class in turn, we can examine how parsers respond when forced to analyse each class correctly.
We show how to apply dependency constraints in three parsers: the graph-based MSTParser (McDonald and Pereira, 2006) and the transition-based ZPar (Zhang and Clark, 2011b) dependency parsers, and the C&C CCG parser (Clark and Curran, 2007b). Each is widely used and influential in the field, and each generates some form of predicate-argument dependencies. We compare the parsers, identifying common sources of error and differences in the distribution of errors between constrained and cascading impact. Our work also allows us to contrast the implementations of the parsers and how they respond to constraint application. Using our analysis, we experiment with new features for dependency parsing that encode the frequency of proposed arcs in large-scale corpora derived from scanned books. These features are inspired by, and extend, the work of Bansal and Klein (2011). We target them at the most notable errors, and show how they address some, but not all, of the difficult attachments across newswire and web text.
CCG parsing is particularly challenging, as different derivations do not always generate different dependencies. We develop dependency hashing to address semantically redundant parses in n-best CCG parsing, and demonstrate its necessity and effectiveness. Dependency hashing substantially improves the diversity of n-best CCG parses, and improves a CCG reranker when used to create training and test data. We show the intricacies of applying constraints to C&C, and describe instances where applying constraints causes the parser to produce a worse analysis. These results illustrate how algorithms that are relatively straightforward for constituency and dependency parsers are non-trivial to implement for CCG.
This work has explored dependencies as constraints in dependency and CCG parsing. We have shown how dependency hashing can efficiently eliminate semantically redundant CCG n-best parses, and presented a new evaluation framework based on enforcing the presence of dependencies in the output of the parser. By otherwise allowing the parser to proceed as it would have, we avoid the assumptions inherent in other work. We hope this work will provide insights into the remaining errors in parsing, and target efforts to address those errors, creating better syntactic analysis for downstream applications.
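The dependency hashing idea above lends itself to a compact illustration. The Python sketch below deduplicates an n-best list by hashing each parse's set of predicate-argument dependencies, so that distinct derivations yielding identical dependencies collapse to a single entry. The Dependency structure and the helper functions are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch of dependency hashing for n-best deduplication: reduce each
# parse to an order-independent hash of its dependencies and discard any
# derivation whose hash has already been seen (i.e. is semantically redundant).
from typing import Iterable, List, NamedTuple, Set

class Dependency(NamedTuple):
    head: int       # index of the head word
    dependent: int  # index of the dependent word
    label: str      # dependency relation (e.g. an argument slot)

def dependency_hash(deps: Iterable[Dependency]) -> int:
    """Hash the set of dependencies, ignoring derivation order."""
    return hash(frozenset(deps))

def deduplicate_nbest(parses: List[List[Dependency]]) -> List[List[Dependency]]:
    """Keep only the first (highest-scoring) parse for each dependency set."""
    seen: Set[int] = set()
    unique: List[List[Dependency]] = []
    for deps in parses:  # parses assumed sorted best-first
        h = dependency_hash(deps)
        if h not in seen:
            seen.add(h)
            unique.append(deps)
    return unique
```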
Compiling and annotating a learner corpus for a morphologically rich language: CzeSL, a corpus of non-native Czech
Learner corpora, linguistic collections documenting a language as used by learners, provide an important empirical foundation for language acquisition research and teaching practice. This book presents CzeSL, a corpus of non-native Czech, against the background of theoretical and practical issues in current learner corpus research. Languages with rich morphology and relatively free word order, including Czech, are particularly challenging for the analysis of learner language. The authors address both the complexity of learner error annotation, describing three complementary annotation schemes, and the complexity of describing non-native Czech in terms of standard linguistic categories. The book discusses in detail the practical aspects of corpus creation: the process of collection and annotation itself, the supporting tools, the resulting data, their formats, and the search platforms. The chapter on use cases exemplifies the usefulness of learner corpora for teaching, language acquisition research, and computational linguistics. Any researcher developing learner corpora will surely appreciate the concluding chapter, which lists lessons learned and pitfalls to avoid.
Workshop Proceedings of the 12th edition of the KONVENS conference
The 2014 edition of KONVENS is even more of a forum for exchange: its main topic is the interaction between Computational Linguistics and Information Science, and the synergies that such interaction, cooperation, and integrated views can produce. This topic, at the crossroads of different research traditions that deal with natural language as a container of knowledge and with methods to extract and manage linguistically represented knowledge, is close to the heart of many researchers at the Institut für Informationswissenschaft und Sprachtechnologie of Universität Hildesheim: it has long been one of the institute’s research topics, and it has received even more attention over the last few years.
Automatic annotation of error types for grammatical error correction
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting grammatical errors in text. Although previous work has focused on developing systems that target specific error types, the current state of the art uses machine translation to correct all error types simultaneously. A significant disadvantage of this approach is that machine translation does not produce annotated output, and so error type information is lost. This means we can only evaluate a system in terms of overall performance and cannot carry out a more detailed analysis of different aspects of system performance.
In this thesis, I develop a system to automatically annotate parallel original and corrected sentence pairs with explicit edits and error types. In particular, I first extend the Damerau-Levenshtein alignment algorithm to make use of linguistic information when aligning parallel sentences, and supplement this alignment with a set of merging rules to handle multi-token edits. The output from this algorithm surpasses other edit extraction approaches in terms of approximating human edit annotations and is the current state of the art. Having extracted the edits, I next classify them according to a new rule-based error type framework that depends only on automatically obtained linguistic properties of the data, such as part-of-speech tags. This framework was inspired by existing frameworks, and human judges rated the appropriateness of the predicted error types as ‘Good’ (85%) or ‘Acceptable’ (10%) in a random sample of 200 edits. The whole system is called the ERRor ANnotation Toolkit (ERRANT) and is the first toolkit capable of automatically annotating parallel sentences with error types.
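As a rough illustration of a linguistically informed alignment of the kind described above, the Python sketch below runs a standard edit-distance dynamic programme over (token, POS) pairs, with a substitution cost that is discounted when tokens share a POS tag. The cost weights, the two-field token representation, and the omission of Damerau's transposition operation are all simplifying assumptions; the cost function in the thesis uses richer linguistic information, and the merging rules for multi-token edits are not shown.

```python
# Sketch: token alignment via edit distance with a POS-aware substitution cost.
from typing import List, Tuple

def sub_cost(orig: Tuple[str, str], corr: Tuple[str, str]) -> float:
    """Cost of substituting one (token, pos) pair for another."""
    tok_o, pos_o = orig
    tok_c, pos_c = corr
    if tok_o.lower() == tok_c.lower():
        return 0.0                              # same word: treat as a match
    return 1.0 if pos_o == pos_c else 2.0       # prefer same-POS substitutions

def align(orig: List[Tuple[str, str]], corr: List[Tuple[str, str]]) -> List[str]:
    """Return an edit script of M(atch)/S(ub)/I(nsert)/D(elete) operations."""
    n, m = len(orig), len(corr)
    # cost[i][j] = cheapest alignment of orig[:i] with corr[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = float(i)
    for j in range(1, m + 1):
        cost[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + 1.0,                                   # delete
                cost[i][j - 1] + 1.0,                                   # insert
                cost[i - 1][j - 1] + sub_cost(orig[i - 1], corr[j - 1]),
            )
    # Trace back through the table to recover the operations.
    ops: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + sub_cost(orig[i - 1], corr[j - 1])):
            ops.append("M" if orig[i - 1][0].lower() == corr[j - 1][0].lower() else "S")
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1.0:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return list(reversed(ops))

orig = [("I", "PRON"), ("has", "VERB"), ("it", "PRON")]
corr = [("I", "PRON"), ("have", "VERB"), ("it", "PRON")]
print(align(orig, corr))  # -> ['M', 'S', 'M']
```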
I demonstrate the value of ERRANT by applying it to the system output produced by the participants of the CoNLL-2014 shared task, and carry out a detailed error type analysis of system performance for the first time. I also develop a simple language-model-based approach to GEC that does not require annotated training data, and show how it can be improved using ERRANT error types.
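A minimal sketch of how such a language-model-based approach might work is given below: candidate corrections drawn from small confusion sets are scored in context, and a substitution is kept only when it improves the sentence's language model score by some margin. The toy bigram counts, confusion sets, and greedy search are hypothetical stand-ins, not the method evaluated in the thesis.

```python
# Sketch: LM-based correction without annotated training data.
import math
from typing import Dict, List, Tuple

# Hypothetical bigram counts standing in for a real language model.
BIGRAMS: Dict[Tuple[str, str], int] = {
    ("I", "have"): 50, ("I", "has"): 1,
    ("have", "seen"): 40, ("has", "seen"): 5,
}

# Hypothetical confusion sets mapping a token to its candidate corrections.
CONFUSION: Dict[str, List[str]] = {"has": ["have"], "have": ["has"]}

def score(tokens: List[str]) -> float:
    """Smoothed log-probability of the sentence under the toy bigram model."""
    return sum(
        math.log(BIGRAMS.get((a, b), 0) + 1)  # add-one smoothing
        for a, b in zip(tokens, tokens[1:])
    )

def correct(tokens: List[str], margin: float = 0.5) -> List[str]:
    """Greedily apply single-token substitutions that raise the LM score."""
    out = list(tokens)
    for i, tok in enumerate(out):
        for cand in CONFUSION.get(tok, []):
            trial = out[:i] + [cand] + out[i + 1:]
            if score(trial) > score(out) + margin:
                out = trial
    return out

print(correct(["I", "has", "seen", "it"]))  # -> ['I', 'have', 'seen', 'it']
```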