
    D4.1. Technologies and tools for corpus creation, normalization and annotation

    The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of the monolingual and bilingual language resources (LRs) required in the PANACEA context. To that end, the CAA subsystem includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for cleanup and normalization (CNC) of these data, and iii) a text processing component (TPC) consisting of NLP tools, including modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition.

    Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora

    The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, provided the data can be exploited properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as part-of-speech tagging, lemmatisation, chunking, and dependency parsing facilitates precise querying of linguistic properties and can be used to extend word alignment to sub-sentential groups. Such highly interconnected data is stored in a relational database to allow for efficient retrieval and linguistic data mining, which may include the statistics-based selection of good example sentences. The varying information needs of contrastive linguists require a flexible linguistic query language for ad hoc searches. Such queries, written in the format of generalised treebank query languages, will be automatically translated into SQL queries.
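
    As a rough illustration of the relational storage and query translation described in this abstract, the following is a minimal sketch rather than the authors' system. The table names (`token`, `alignment`), the column names, the sample rows and the helper `build_query` are all hypothetical, and the "query language" is reduced to a dictionary of attribute constraints on the source token.

```python
import sqlite3

# Hypothetical schema: one row per token, one row per word-alignment link.
SCHEMA = """
CREATE TABLE token (
    id INTEGER PRIMARY KEY, sent_id INTEGER, lang TEXT,
    form TEXT, lemma TEXT, pos TEXT
);
CREATE TABLE alignment (
    src_id INTEGER REFERENCES token(id),
    tgt_id INTEGER REFERENCES token(id)
);
"""

def make_demo_db():
    """Build a tiny in-memory database with invented sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO token VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, 1, "en", "houses", "house", "NOUN"),
            (2, 1, "en", "burn", "burn", "VERB"),
            (3, 2, "de", "Häuser", "Haus", "NOUN"),
            (4, 2, "de", "brennen", "brennen", "VERB"),
        ],
    )
    conn.executemany("INSERT INTO alignment VALUES (?, ?)", [(1, 3), (2, 4)])
    return conn

def build_query(src_constraints, tgt_lang):
    """Translate attribute constraints on a source token into parameterized SQL.

    Returns (sql, params). Only whitelisted column names are interpolated;
    all values are passed as bound parameters.
    """
    allowed = {"lang", "form", "lemma", "pos"}
    clauses, params = [], [tgt_lang]
    for attr, value in sorted(src_constraints.items()):
        if attr not in allowed:
            raise ValueError(f"unknown attribute: {attr}")
        clauses.append(f" AND s.{attr} = ?")
        params.append(value)
    sql = (
        "SELECT s.form, t.form FROM token AS s "
        "JOIN alignment AS a ON a.src_id = s.id "
        "JOIN token AS t ON t.id = a.tgt_id "
        "WHERE t.lang = ?" + "".join(clauses) + " ORDER BY s.id, t.id"
    )
    return sql, params
```

    Calling `build_query({"lemma": "house", "pos": "NOUN"}, "de")` and running the result against the demo database returns the aligned form pairs for that lemma. A real treebank query language would also need structural constraints such as dependency relations, which this sketch leaves out.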

    Gap between theory and practice: noise sensitive word alignment in machine translation

    Word alignment estimates either a lexical translation probability p(e|f) or a correspondence g(e, f), where the function g outputs 0 or 1, between a source word f and a target word e in given bilingual sentences. In practice, this formulation does not account for ‘noise’ (or outliers), which may cause problems depending on the corpus. N-to-m mapping objects, such as paraphrases, non-literal translations, and multiword expressions, may appear both as noise and as valid training data. From this perspective, this paper tries to answer two questions: 1) how to detect stable patterns where noise seems legitimate, and 2) how to reduce such noise, where applicable, by supplying extra information as prior knowledge to a word aligner.
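
    To make the quantities p(e|f) and g(e, f) concrete, here is a minimal sketch of one standard way to estimate them: IBM Model 1 trained with expectation-maximization. The abstract does not say which aligner the paper uses, so this choice is an assumption. The sketch has no NULL word and no noise handling, and the function names `train_ibm1` and `hard_align` are illustrative only.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """Estimate t[(e, f)] ~ p(e|f) with IBM Model 1 EM (no NULL word).

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    """
    target_vocab = {e for _, tgt in pairs for e in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(e, f)
        total = defaultdict(float)  # expected counts c(f)
        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for e in tgt:
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta
        # M-step: renormalize so that sum over e of p(e|f) is 1 for each f.
        t = defaultdict(
            float, {(e, f): c / total[f] for (e, f), c in count.items()}
        )
    return t

def hard_align(t, src, tgt):
    """Binary correspondence: g(e, f) = 1 for the best f of each e."""
    return {(e, max(src, key=lambda f: t[(e, f)])) for e in tgt}
```

    On a toy corpus such as `[(["das", "haus"], ["the", "house"]), (["das", "buch"], ["the", "book"]), (["ein", "buch"], ["a", "book"])]`, the probability mass for "das" concentrates on "the" over the iterations. Every co-occurring pair is treated as potential evidence, which is exactly where noisy n-to-m mappings can mislead an estimator of this kind.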

    Exploring Properties of Intralingual and Interlingual Association Measures Visually

    We present an interactive interface to explore the properties of intralingual and interlingual association measures. Used in conjunction, these measures can be employed for phraseme identification in word-aligned parallel corpora. The customizable component we built to visualize individual results can show part-of-speech tags, syntactic dependency relations and word alignments next to the tokens of two corresponding sentences.

    Just because: In search of objective criteria of subjectivity expressed by causal connectives

    The connective because can express both highly objective and highly subjective causal relations. In this, it differs from its counterparts in other languages, e.g. Dutch, where two conjunctions, omdat and want, express more objective and more subjective causal relations, respectively. The present study investigates whether it is possible to anchor the different uses of because in context, examining a large number of syntactic, morphological and semantic cues with a minimal cost of manual annotation. We propose an innovative method of distinguishing between subjective and objective uses of because with the help of information available from an English/Dutch segment of a parallel corpus, which is accompanied by a distributional analysis of contextual features. On the basis of automatic syntactic and morphological annotation of approximately 1500 examples of because, every English sentence is coded semi-automatically for more than twenty contextual variables, such as the part of speech, number, person, semantic class of the subject, modality, etc. We employ logistic regression to determine whether these contextual variables help predict which of the two causal connectives is used in the corresponding Dutch sentences. Our results indicate that a set of semantic and syntactic features, including modality, semantics of referents (subjects), semantic class of the verbal predicate, tense (past vs. non-past) and the presence of evaluative adjectives, are reliable predictors of the more subjective and more objective uses of because, demonstrating that this distinction can indeed be anchored in the immediate linguistic context. The proposed method and relevant contextual cues can be used for identification of objective and subjective relationships in discourse.