313 research outputs found

    Detecting syntactic errors in dependency treebanks for morphosyntactically rich languages

    The paper introduces a new method for detecting and correcting errors in large dependency treebanks with rich morphosyntactic annotation. The technique uses error correction rules automatically extracted from the treebank. The procedure of rule extraction is based on a comparison of similar (but not identical) subgraphs of dependency structures. The outcome of applying the method to a 3-million-sentence dependency treebank of Polish is presented and evaluated. The method achieves satisfactory precision in the task of automatic error correction and relatively high precision in the task of error detection.

    Improving the translation environment for professional translators

    When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, from both a human-computer interaction point of view and a purely technological one. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.

    Normalization and parsing algorithms for uncertain input

    VALICO-UD: annotating an Italian learner corpus

    Previous work on learner language has highlighted the importance of having annotated resources to describe the development of interlanguage. Despite this, few learner resources, mainly for English L2, feature error and syntactic annotation. This thesis describes the development of a novel parallel learner Italian treebank, VALICO-UD. Its name suggests two main points: where the data comes from (i.e. the corpus VALICO, a collection of non-native Italian texts elicited by comic strips) and what formalism is used for linguistic annotation (i.e. the Universal Dependencies (UD) formalism). It is a parallel treebank because the resource provides for each learner sentence (LS) a target hypothesis (TH) (i.e. a parallel corrected version written by an Italian native speaker), which is in turn annotated in UD. We developed this treebank to be exploitable for interlanguage research and comparable with the resources employed in Natural Language Processing tasks such as Native Language Identification or Grammatical Error Identification and Correction. VALICO-UD is composed of 237 texts written by English, French, German and Spanish native speakers, which correspond to 2,234 LSs, each associated with a single TH. While all LSs and THs were automatically annotated using UDPipe, only a portion of the treebank, made up of 398 LSs plus the corresponding THs, has been manually corrected; it was released in May 2021 in the UD repository. This core section also features an explicit XML-based annotation of the errors occurring in each sentence. Thus, the treebank is currently organized in two sections: the core gold standard, comprising 398 LSs and their corresponding THs, and the silver standard, consisting of 1,836 LSs and their corresponding THs.
    In order to contribute to the computational investigation of the peculiar type of texts included in VALICO-UD, this thesis describes the annotation schema of the resource, provides some preliminary tests of the performance of UDPipe models on this treebank, reports on inter-annotator agreement results for both error and linguistic annotation, and suggests some possible applications.

    VALICO-UD: Treebanking an Italian Learner Corpus in Universal Dependencies

    This article describes an ongoing project for the development of a novel Italian treebank in Universal Dependencies format: VALICO-UD. It consists of texts written by Italian L2 learners of different mother tongues (German, French, Spanish and English) drawn from VALICO, an Italian learner corpus elicited by comic strips. Aiming at building a parallel treebank currently missing for Italian L2, comparable with those exploited in Natural Language Processing tasks, we associated each learner sentence with a target hypothesis (i.e. a corrected version of the learner sentence written by an Italian native speaker), which is in turn annotated in Universal Dependencies. The treebank VALICO-UD is composed of 237 texts written by non-native speakers of Italian (2,234 sentences) and the related target hypotheses, all automatically annotated using UDPipe. A portion of this resource (36 texts corresponding to 398 learner sentences and related target hypotheses), first released in May 2021 in the Universal Dependencies repository, is associated with error annotation, and its automatic output has been fully manually checked. In this article, we focus especially on the challenges addressed in treebanking a resource composed of learner texts. In addition, we report on a preliminary data exploration that makes use of three quantitative measures for assessing the quality of the data and for better understanding the role that this resource can play in tasks lying at the intersection of Computational Linguistics and learner corpus studies.

    Semi-Automatic Deep Syntactic Annotations of the French Treebank

    We describe and evaluate the semi-automatic addition of a deep syntactic layer to the French Treebank (Abeillé and Barrier [1]), using an existing scheme (Candito et al. [6]). While some rare or highly ambiguous deep phenomena are handled manually, the remaining ones are derived using a graph-rewriting system (Ribeyre et al. [22]). Although they are not manually corrected, we think the resulting Deep Representations can pave the way for the emergence of deep syntactic parsers for French.

    Improving dependency label accuracy using statistical post-editing: A cross-framework study

    We present a statistical post-editing method for modifying the dependency labels in a dependency analysis. We test the method using two English datasets, three parsing systems and three labelled dependency schemes. We demonstrate how it can be used both to improve dependency label accuracy in parser output and to highlight problems with, and differences between, constituency-to-dependency conversions.

    Increased recall in annotation variance detection in treebanks

    Automatic inconsistency detection in parsed corpora is significantly helpful for building more and larger corpora of annotated texts. Inconsistencies are inevitable and originate from variance in annotation caused by different factors such as lack of attention or the absence of clear annotation guidelines. In this paper, some results involving the automatic detection of annotation variance in parsed corpora are presented. In particular, it is shown that a generalization procedure substantially increases the recall of the variant detection algorithm proposed in [1].
    9302578586. 18th International Conference on Text, Speech and Dialogue (TSD), 2015-09, Czech Republic. Int Speech Commun Assoc; Czech Soc Cybernet & Informat; Kerio Technol; Univ West Bohemia, Fac Appl Sci; Masaryk Univ, Fac Informat. Pilse