
    Adquisición automática de recursos para traducción automática en el proyecto Abu-MaTran (Automatic acquisition of resources for machine translation in the Abu-MaTran project)

    This paper provides an overview of the research and development activities carried out within the Abu-MaTran project to alleviate the language-resources bottleneck in machine translation. We have developed a range of tools for the acquisition of the main resources required by the two most popular approaches to machine translation, i.e. statistical models (corpora) and rule-based models (dictionaries and rules). All these tools have been released under open-source licenses and have been developed with the aim of being useful for industrial exploitation. The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).

    Fine-grained human evaluation of neural versus phrase-based machine translation

    We compare three approaches to statistical machine translation (pure phrase-based, factored phrase-based and neural) by performing a fine-grained manual evaluation via error annotation of the systems' outputs. The error types in our annotation are compliant with the multidimensional quality metrics (MQM), and the annotation is performed by two annotators. Inter-annotator agreement is high for such a task, and the results show that the best performing system (neural) reduces the errors produced by the worst system (phrase-based) by 54%. (12 pages, 2 figures, The Prague Bulletin of Mathematical Linguistics)
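
    The two headline numbers in this abstract, inter-annotator agreement and the 54% error reduction, can be computed mechanically. Below is a minimal sketch, assuming per-segment MQM labels from two annotators and per-system error totals; all values and variable names are illustrative and not taken from the paper.

    ```python
    # Illustrative only: toy labels and counts, not the paper's data.
    from sklearn.metrics import cohen_kappa_score

    # MQM error labels assigned by two annotators to the same segments (toy data).
    annotator_a = ["Fluency", "Accuracy", "None", "Accuracy", "Fluency"]
    annotator_b = ["Fluency", "Accuracy", "None", "Fluency", "Fluency"]
    print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")

    # Relative error reduction of the best system with respect to the worst one.
    errors_worst = 200  # hypothetical error count of the phrase-based system
    errors_best = 92    # hypothetical error count of the neural system
    print(f"error reduction: {(errors_worst - errors_best) / errors_worst:.0%}")  # 54%
    ```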

    Mise au point d'une méthode d'annotation morphosyntaxique fine du serbe (Developing a method for fine-grained morphosyntactic tagging of Serbian)

    This paper presents an experiment in fine-grained morphosyntactic tagging of the Serbian subcorpus of the parallel Serbian-French-English ParCoLab corpus. We enriched an existing part-of-speech annotation with finer-grained morphosyntactic properties in order to prepare the corpus for subsequent parsing stages. We compared three approaches: 1) manual annotation; 2) pre-annotation with a tagger trained on Croatian, followed by manual correction; 3) retraining the model on a small validated sample of the corpus (20K tokens), followed by automatic annotation and manual correction. The Croatian model remains globally stable when applied to Serbian texts, but due to the differences between the two tagsets, substantial manual intervention was still required. A new model trained on the validated sample reaches the same accuracy as the existing model, and the observed speed-up of manual correction confirms that it is better suited to the task. Keywords: morphosyntactic annotation, training corpus, Serbian.
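
    The pre-annotate-then-correct loop described above can be sketched with a generic sequence tagger. The example below uses sklearn-crfsuite purely as a stand-in; the paper does not name its tagger, and the feature set, tags and toy training data are assumptions.

    ```python
    # Illustrative stand-in for the retrain-and-pre-annotate workflow; not the paper's tool.
    import sklearn_crfsuite

    def features(sentence, i):
        """A deliberately small feature set for token i of a sentence (list of word forms)."""
        word = sentence[i]
        return {
            "word.lower": word.lower(),
            "suffix3": word[-3:],  # crude morphology cue
            "is_title": word.istitle(),
            "prev": sentence[i - 1].lower() if i > 0 else "<BOS>",
            "next": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
        }

    def featurize(sentences):
        return [[features(s, i) for i in range(len(s))] for s in sentences]

    # A validated 20K-token sample would go here; two toy sentences stand in for it.
    train_sents = [["Ovo", "je", "primer", "."], ["Korpus", "je", "označen", "."]]
    train_tags = [["PRON", "AUX", "NOUN", "PUNCT"], ["NOUN", "AUX", "ADJ", "PUNCT"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(featurize(train_sents), train_tags)

    # Pre-annotate unseen sentences automatically, then hand the output to human correctors.
    print(crf.predict(featurize([["Rečenica", "je", "nova", "."]])))
    ```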

    Vektorski prikaz riječi utemeljen na velikim mrežnim korpusima kao moćan leksikografski alat (Word embeddings based on large web corpora as a powerful lexicographic tool)

    The Aranea Project offers a set of comparable corpora for two dozen (mostly European) languages, providing a convenient dataset for NLP applications that require training on large amounts of data. The article presents word embedding models trained on the Aranea corpora and an online interface to query the models and visualize the results. The implementation is aimed at lexicographic use but can also be useful in other fields of linguistic study, since the vector space is a plausible model of the semantic space of word meanings. Three different models are available: one for a combination of part of speech and lemma, one for raw word forms, and one based on the fastText algorithm, which uses subword vectors and is therefore not limited to whole or known words when finding semantic relations. The article describes the interface and the major modes of its functionality; it does not attempt a detailed linguistic analysis of the presented examples.
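
    Querying such embedding models for semantic neighbours can be illustrated with gensim; this is only a sketch of the general technique, and the toy corpus, parameters and query word below are placeholders rather than the Aranea setup.

    ```python
    # Placeholder corpus and parameters; the real models are trained on the Aranea web corpora.
    from gensim.models import FastText

    # Tokenized sentences; in practice these would come from a very large web corpus.
    sentences = [["rječnik", "je", "koristan", "alat"], ["leksikon", "je", "koristan", "alat"]]

    # Subword (character n-gram) vectors let fastText handle unseen or rare word forms.
    model = FastText(sentences=sentences, vector_size=100, window=5, min_count=1, min_n=3, max_n=6)

    # Nearest neighbours in the vector space approximate semantic relatedness.
    print(model.wv.most_similar("rječnik", topn=5))
    ```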

    Clitic Climbing, Finiteness and the Raising-Control Distinction: A Corpus-based Study

    In this paper, we discuss the phenomenon of clitic climbing out of finite da2-complements in contemporary Serbian. Scholars' opinions on the acceptability and occurrence of this construction, based on a handful of self-made examples, vary considerably. Building on the assumption that the construction has often been judged incorrect because of its rarity, we employ large corpora to examine the problem. We focus on possible constraints arising from the syntactic properties of clause-embedding predicates.

    Impersonal constructions in Slavic languages and the agentivity of the verb

    The paper deals with impersonal constructions like Polish mówiono, Serbo-Croatian govorilo se and Russian govorili ‘they said’. Grammatical descriptions agree that these constructions require an implicit human agent and are ungrammatical without one. However, it is shown here that these impersonals are not always acceptable even with a human agent, because their grammaticality also depends on the verb. Both acceptability judgement tests and corpus studies show a clear agentivity cline: the more features of agent prominence the verb entails, the more acceptable (and, relatively, the more frequent) the construction is.

    Annotating Croatian Semantic Type Coercions in CROATPAS

    This research paper presents the results of a corpus-based metonymy annotation exercise on a sample of 101 Croatian verb entries, corresponding to 457 patterns and over 20,000 corpus lines, taken from CROATPAS (Marini & Ježek, 2019), a digital repository of verb argument structures manually annotated with Semantic Type labels on their argument slots, following a methodology inspired by Corpus Pattern Analysis (Hanks, 2004 & 2013; Hanks & Pustejovsky, 2005). CROATPAS will be made available online in 2020. Semantic Type labelling is well suited not only to annotating verbal polysemy but also metonymic shifts in verb argument combinations, which in Generative Lexicon (Pustejovsky, 1995 & 1998; Pustejovsky & Ježek, 2008) are called Semantic Type coercions. From a sublexical point of view, Semantic Type coercions can be considered exploitations of one of the qualia roles of those Semantic Types which do not satisfy a verb's selectional requirements but do not trigger a different verb sense. Overall, we were able to identify 62 different Semantic Type coercions linked to 1,052 metonymic corpus lines. In the future, we plan to compare our results with those from an equivalent study on Italian verbs (Romani, 2020) for a crosslinguistic analysis of metonymic shifts.

    Otvoreni resursi i tehnologije za obradu srpskog jezika (Open resources and technologies for processing the Serbian language)

    Open language resources and tools are very important for improving the quality and speeding up the development of natural language processing technologies. This paper presents a set of open resources available for processing the Serbian language. We describe several manually annotated corpora, as well as a range of computational models, including a web service designed to facilitate their use.
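
    Consuming such a web service from client code typically amounts to a single HTTP request. The sketch below is entirely hypothetical: the endpoint, payload fields and response format are invented for illustration, since the abstract does not specify the service's API.

    ```python
    # Hypothetical endpoint and payload, for illustration only; not the service's real API.
    import requests

    HYPOTHETICAL_URL = "https://example.org/serbian-nlp/annotate"  # placeholder address

    response = requests.post(
        HYPOTHETICAL_URL,
        json={"text": "Ovo je primer rečenice.", "layers": ["lemma", "pos"]},
        timeout=30,
    )
    response.raise_for_status()
    print(response.json())  # annotated tokens, in whatever format the service returns
    ```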

    CroaTPAS: Comparing the Meanings of Related Aspectual Verb Pairs to Investigate the Link between Aspect, Aktionsart and Verb Polysemy in Croatian

    CroaTPAS is a digital dictionary consisting of a corpus-derived collection of Croatian verb valency structures, whose argument slots have been manually annotated with a set of hierarchically organised semantic labels called Semantic Types (Ježek 2019). Like its Italian sister project T-PAS (Ježek et al. 2014), CroaTPAS is based on Corpus Pattern Analysis (CPA) (Hanks 2013) and on Generative Lexicon theory (Pustejovsky 1995; Pustejovsky & Ježek 2008). Both resources are primarily tailored for investigating verbal polysemy, since each semantically typed verb argument structure, called a pattern, corresponds to a different verb sense. Currently, CroaTPAS contains a total of 180 Croatian verb entries (90 imperfective, 82 perfective and 8 biaspectual verbs) that are the translational equivalents of 100 Italian verb lemmas available in the T-PAS resource. Based on the annotated data, we compare patterns from an initial selection of aspectual pairs, aiming to clarify the relationship among aspect, Aktionsart and verb polysemy, keeping in mind CroaTPAS's possible teaching applications.

    hr500k – A Reference Training Corpus of Croatian.

    In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at the document, sentence and word levels, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in the CoNLL and TEI formats. We also describe the rather turbulent history of the resource and give insights into its topic and genre distribution. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.
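
    Reading a corpus distributed in a CoNLL-style encoding is straightforward with the conllu package; a minimal sketch follows, assuming a plain CoNLL-U file whose name is a placeholder (the actual hr500k distribution may use additional columns or files).

    ```python
    # Placeholder file name; hr500k's actual column layout may differ from plain CoNLL-U.
    from conllu import parse_incr

    with open("hr500k.conllu", encoding="utf-8") as f:
        for sentence in parse_incr(f):  # stream sentences one at a time
            for token in sentence:
                # Each token behaves like a dict with fields such as "form", "lemma", "upos".
                print(token["form"], token["lemma"], token["upos"])
            break  # only the first sentence, for illustration
    ```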