
    Native Language Identification on Text and Speech

    This paper presents an ensemble system that combines the output of multiple SVM classifiers for native language identification (NLI). The system was submitted to the fusion track of the NLI Shared Task 2017, which featured student essays and spoken responses, in the form of audio transcriptions and i-vectors, produced by non-native English speakers of eleven native languages. Our system competed under the team name ZCD and was based on an ensemble of SVM classifiers trained on character n-grams, achieving 83.58% accuracy and ranking 3rd in the shared task. Comment: Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications (BEA)
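
    The following is a minimal sketch of the core idea, a character n-gram SVM for NLI, assuming scikit-learn; the toy documents, the L1 labels, and the single LinearSVC (rather than the team's full ensemble) are illustrative assumptions, not the ZCD system itself.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Hypothetical toy data: learner essays and their native-language labels.
    essays = ["one essay written by a learner ...", "another essay ..."]
    native_languages = ["ARA", "TUR"]

    # Character n-grams (here 2-4) as features, a linear SVM as the classifier.
    nli_model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
        LinearSVC(),
    )
    nli_model.fit(essays, native_languages)
    print(nli_model.predict(["an unseen essay ..."]))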

    Linguistic features of genre and method variation in translation: A computational perspective

    From The Grammar of Genres and Styles: From Discrete to Non-Discrete Units, edited by Legallois, D., Charnois, T. and Larjavaara, M. In this contribution we describe the use of text classification methods to investigate genre and method variation in an English-German translation corpus. For this purpose we use linguistically motivated features, representing texts through combinations of part-of-speech tags arranged in bigrams, trigrams, and 4-grams. The classification method used in this study is a Bayesian classifier with Laplace smoothing. We use the output of the classifiers to carry out an extensive feature analysis of the main differences between genres and methods of translation.
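
    As a minimal sketch of the classification setup described above (a Naive Bayes classifier with Laplace smoothing over part-of-speech n-grams), assuming scikit-learn and texts already converted to POS-tag sequences; the toy data and genre labels are hypothetical, not the paper's corpus or feature set.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical documents already mapped to their POS-tag sequences.
    pos_docs = [
        "DT NN VBD VBN IN DT NN",   # e.g. a sentence from a fiction translation
        "VB DT NN RB",              # e.g. a sentence from an instruction manual
    ]
    genres = ["fiction", "instructions"]

    genre_model = make_pipeline(
        # POS bigrams, trigrams and 4-grams as features.
        CountVectorizer(ngram_range=(2, 4), token_pattern=r"\S+"),
        MultinomialNB(alpha=1.0),   # alpha=1.0 corresponds to Laplace (add-one) smoothing
    )
    genre_model.fit(pos_docs, genres)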

    Machine translation evaluation resources and methods: a survey

    We introduce a survey of Machine Translation (MT) evaluation covering both manual and automatic evaluation methods. Traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. More advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteria. We classify the automatic evaluation methods into two categories: lexical similarity and linguistic features. The lexical similarity methods cover edit distance, precision, recall, F-measure, and word order. The linguistic features can be divided into syntactic and semantic features. The syntactic features include part-of-speech tags, phrase types and sentence structures, while the semantic features include named entities, synonyms, textual entailment, paraphrase, semantic roles, and language models. Deep learning models for evaluation have only very recently been proposed. Subsequently, we also introduce methods for evaluating MT evaluation itself, including different correlation scores, as well as the recent quality estimation (QE) tasks for MT. This paper differs from existing works \cite{GALEprogram2009, EuroMatrixProject2007} in several respects: it covers recent developments in MT evaluation measures, offers a classification ranging from manual to automatic evaluation measures, introduces the recent QE tasks for MT, and organises the content concisely.
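
    As an illustration of the lexical-similarity category mentioned above, the toy metric below computes unigram precision, recall and F-measure between a system output and a reference; it is a sketch of the general idea, not any specific published MT metric.

    from collections import Counter

    def precision_recall_f(hypothesis: str, reference: str):
        """Unigram overlap between an MT output and a reference translation."""
        hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
        overlap = sum((hyp & ref).values())          # clipped matching words
        precision = overlap / max(sum(hyp.values()), 1)
        recall = overlap / max(sum(ref.values()), 1)
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f

    print(precision_recall_f("the cat sat on the mat", "the cat is on the mat"))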

    Learning Sentence-internal Temporal Relations

    In this paper we propose a data-intensive approach for inferring sentence-internal temporal relations. Temporal inference is relevant for practical NLP applications which either extract or synthesise temporal information (e.g., summarisation, question answering). Our method bypasses the need for manual coding by exploiting the presence of markers like "after", which overtly signal a temporal relation. We first show that models trained on main and subordinate clauses connected by a temporal marker achieve good performance on a pseudo-disambiguation task simulating temporal inference (during testing the temporal marker is treated as unseen and the models must select the right marker from a set of possible candidates). Secondly, we assess whether the proposed approach holds promise for the semi-automatic creation of temporal annotations. Specifically, we use a model trained on noisy and approximate data (i.e., main and subordinate clauses) to predict the intra-sentential relations present in TimeBank, a corpus annotated with rich temporal information. Our experiments compare and contrast several probabilistic models differing in their feature space, linguistic assumptions and data requirements. We evaluate performance against gold standard corpora and also against human subjects.
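
    A minimal sketch of the pseudo-disambiguation setup described above: a classifier trained on clause pairs that were originally joined by a temporal marker must recover the withheld marker from a candidate set. The toy clause pairs, the bag-of-words features and the logistic-regression model are illustrative assumptions, not the paper's probabilistic models or corpus.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Each training example is "main clause [SEP] subordinate clause"; the label
    # is the temporal marker that originally connected the two clauses.
    clause_pairs = [
        "she left the office [SEP] she finished the report",
        "he called his lawyer [SEP] he read the letter",
    ]
    markers = ["after", "before"]

    marker_model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
    marker_model.fit(clause_pairs, markers)

    # At test time the marker is withheld and the model must choose among the candidates.
    print(marker_model.predict(["they celebrated [SEP] the deal was signed"]))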

    Design, Construction and Use of an Italian Dependency Treebank: Methodological Issues and Empirical Results

    Treebanks allow for multiple uses: by linguists, who may search for examples (or counter-examples) of a given theory or hypothesis; by psycholinguists, interested in computing construction frequencies and comparing them with human preferences; and by computational linguists, for tasks such as lexicon and grammar induction or parser evaluation. Treebanks can also be explored to determine the typology of factors playing a role in specific natural language learning and processing tasks, as well as their relative salience. In all cases, the results of treebank mining are heavily influenced by the design principles underlying the adopted annotation scheme. For treebanks to be successfully exploited for these manifold uses, the design of the annotation scheme should meet a list of basic requirements, ranging from usability both in real applications and for research purposes, and compatibility with different approaches to syntax (whether adopted in theoretical or application-oriented frameworks), to applicability on a wide scale, in a coherent and replicable way, and to different language varieties (e.g. both written and spoken language). The presentation will focus on methodological issues connected with the design, construction and use of an Italian dependency treebank originating from the functional annotation layer of ISST, a multi-layered annotated corpus of Italian and one of the main outcomes of an Italian national project (SI-TAL, 1999-2001), which underwent several revisions aimed at making it compliant with the de facto CoNLL representation standard and at meeting the basic annotation requirements mentioned above. The Italian dependency treebank, in its different versions, has been exploited for different purposes, ranging from the induction of subcategorization frames and the discovery and assessment of typologically relevant and linguistically motivated grammatical constraints, to the training of dependency parsers and their evaluation in the framework of international parsing evaluation campaigns (CoNLL-2007 and Evalita-2009). The achieved results will be discussed in the light of different annotation choices, contributing to the debate on general issues such as whether, and to what extent, the richness of the annotation scheme or the distinction between head-complement and head-modifier (or head-adjunct) relations found in many contemporary syntactic theories represents a real advantage for the multiple uses to which treebanks are subjected.
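
    As a minimal sketch of one of the treebank-mining uses mentioned above, the snippet below reads a CoNLL-style dependency file and counts which dependency relations attach to each verbal head, a rough first step toward subcategorization-frame induction; the file name and the tab-separated CoNLL-X column layout are assumptions, and this is not the ISST-specific tooling.

    from collections import defaultdict

    def verb_relation_counts(conll_path):
        """Count dependency relations governed by each verbal lemma."""
        counts = defaultdict(lambda: defaultdict(int))
        sentence = []
        with open(conll_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    sentence.append(line.split("\t"))
                elif sentence:
                    # CoNLL-X columns: ID FORM LEMMA CPOS POS FEATS HEAD DEPREL ...
                    for tok in sentence:
                        head, deprel = int(tok[6]), tok[7]
                        if head > 0 and sentence[head - 1][3].startswith("V"):
                            counts[sentence[head - 1][2]][deprel] += 1
                    sentence = []
        return counts

    # counts = verb_relation_counts("italian_dependency_treebank.conll")  # hypothetical path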