4,874 research outputs found

    Multilayer Network of Language: a Unified Framework for Structural Analysis of Linguistic Subsystems

    Recently, the focus of complex networks research has shifted from the analysis of isolated properties of a system toward more realistic modeling of multiple phenomena with multilayer networks. Motivated by the success of the multilayer approach in social, transport, and trade systems, we propose the introduction of multilayer networks for language. The multilayer network of language is a unified framework for modeling linguistic subsystems and their structural properties, enabling the exploration of their mutual interactions. Various aspects of natural language systems can be represented as complex networks, whose vertices depict linguistic units, while links model their relations. The multilayer network of language is defined by three aspects: the network construction principle, the linguistic subsystem, and the language of interest. More precisely, we construct word-level (syntax, co-occurrence and its shuffled counterpart) and subword-level (syllables and graphemes) network layers from five variations of the original text (in the modeled language). The obtained results suggest that there are substantial differences between the network structures of different language subsystems, which are hidden during the exploration of an isolated layer. The word-level layers share structural properties regardless of the language (e.g. Croatian or English), while the syllabic subword level expresses more language-dependent structural properties. The preserved weighted overlap quantifies the similarity of word-level layers in weighted and directed networks. Moreover, the analysis of motifs reveals a close topological structure of the syntactic and syllabic layers for both languages. The findings corroborate that the multilayer network framework is a powerful, consistent, and systematic approach to modeling several linguistic subsystems simultaneously and hence to providing a more unified view of language.

    Evaluation of Croatian Word Embeddings

    Croatian is a poorly resourced and highly inflected language from the Slavic language family. Nowadays, research focuses mostly on English. We created a new word analogy corpus based on the original English Word2vec word analogy corpus and added some linguistic aspects specific to the Croatian language. Next, we created Croatian WordSim353 and RG65 corpora for a basic evaluation of word similarities. We used the created corpora to compare two popular word representation models, based on the Word2Vec and fastText tools. The models were trained on a 1.37B-token training corpus and tested on the new, robust Croatian word analogy corpus. Results show that the models are able to create meaningful word representations. This research has shown that free word order and the higher morphological complexity of the Croatian language influence the quality of the resulting word embeddings. Comment: In the review process for the LREC 2018 conference.

    Conceptualization of Legal Terms in Different Fields of Law: The Need for a Transparent Terminological Approach

    Researchers often use subject-specific terminology in order to facilitate communication within a given field of law. Difficulties may arise when they must use scientific information that does not belong to their field. The transfer of information from one subject area to another is restricted by the technical vocabulary used in the particular field. If this is so, what happens when lawyers in one field of law use terms from another? Is the concept in question couched in the same term within another field of law as well? The process of conceptualizing one and the same legal term in different legal fields does not always proceed smoothly. As will be illustrated in this paper, the problem of conceptualizing legal terms in different fields of law calls for a transparent terminological approach. While it is true that legal concepts cannot be fully conveyed by terminology, a transparent terminological approach can contribute to the understanding of these concepts and facilitate their use in legal comparisons, thus making such an approach a conditio sine qua non of legal translation.

    Producing Monolingual and Parallel Web Corpora at the Same Time – SpiderLing and Bitextor’s Love Affair

    This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain .hr and the Slovene top-level domain .si, and extrinsically on the English–Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English–Croatian, English–Finnish, English–Serbian and English–Slovene language pairs. This research is supported by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (AbuMaTran).

    Sastavljanje Hrvatske ovisnosne banke stabala: početne etape (Building the Croatian Dependency Treebank: the initial stages)

    The paper presents work in progress on the building of the Croatian Dependency Treebank. Its design principles, procedures, and the pilot corpus used are described. Perspectives for further development of the Croatian Dependency Treebank are presented at the end.

    First verbs: On the way to mini-paradigms

    This 18th issue of ZAS-Papers in Linguistics consists of papers on the development of verb acquisition in 9 languages from the very early stages up to the onset of paradigm construction. Each of the 10 papers deals with first-language developmental processes in one or two children studied via longitudinal data. The languages involved are French, Spanish, Russian, Croatian, Lithuanian, Finnish, English and German. For German two different varieties are examined, one from Berlin and one from Vienna. All papers are based on presentations at the workshop 'Early verbs: On the way to mini-paradigms' held at the ZAS (Berlin) on the 30./31. of September 2000. This workshop brought to a close the first phase of cooperation between two projects on language acquisition, which started in October 1999: a) the project on "Syntaktische Konsequenzen des Morphologieerwerbs" at the ZAS (Berlin), headed by Juergen Weissenborn and Ewald Lang and financially supported by the Deutsche Forschungsgemeinschaft, and b) the international "Crosslinguistic Project on Pre- and Protomorphology in Language Acquisition", coordinated by Wolfgang U. Dressler on behalf of the Austrian Academy of Sciences.

    The Abu-MaTran project: tools for teaching machine translation

    The author was an honorary collaborator of the Departamento de Lenguajes y Sistemas Informáticos in November 2016. Slide presentation for the workshop "Workshop on Tools for Teaching Machine Translation", given by Víctor Manuel Sánchez Cartagena at Dublin City University in November 2016.

    The strategic impact of META-NET on the regional, national and international level

    This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative’s work throughout Europe in order to boost progress and innovation in our field. Postprint (published version)