104 research outputs found

    Linking to linguistic data categories in ISOcat

    No full text
    ISO Technical Committee 37, Terminology and other language and content resources, established an ISO 12620:2009 based Data Category Registry (DCR), called ISOcat (see http://www.isocat.org), to foster semantic interoperability of linguistic resources. However, this goal can only be met if the data categories are reused by a wide variety of linguistic resource types. A resource indicates its usage of data categories by linking to them. The small DC Reference XML vocabulary is used to embed links to data categories in XML documents. The link is established by an URI, which servers as the Persistent IDentifier (PID) of a data category. This paper discusses the efforts to mimic the same approach for RDF-based resources. It also introduces the RDF quad store based Relation Registry RELcat, which enables ontological relationships between data categories not supported by ISOcat and thus adds an extra level of linguistic knowledge

    Interchanging lexical resources on the Semantic Web

    Get PDF
    Lexica and terminology databases play a vital role in many NLP applications, but currently most such resources are published in application-specific formats, or with custom access interfaces, leading to the problem that much of this data is in ‘‘data silos’’ and hence difficult to access. The Semantic Web and in particular the Linked Data initiative provide effective solutions to this problem, as well as possibilities for data reuse by inter-lexicon linking, and incorporation of data categories by dereferencable URIs. The Semantic Web focuses on the use of ontologies to describe semantics on the Web, but currently there is no standard for providing complex lexical information for such ontologies and for describing the relationship between the lexicon and the ontology. We present our model, lemon, which aims to address these gap

    RELcat: a Relation Registry for ISOcat data categories

    No full text
    The ISOcat Data Category Registry contains basically a flat and easily extensible list of data category specifications. To foster reuse and standardization only very shallow relationships among data categories are stored in the registry. However, to assist crosswalks, possibly based on personal views, between various (application) domains and to overcome possible proliferation of data categories more types of ontological relationships need to be specified. RELcat is a first prototype of a Relation Registry, which allows storing arbitrary relationships. These relationships can reflect the personal view of one linguist or a larger community. The basis of the registry is a relation type taxonomy that can easily be extended. This allows on one hand to load existing sets of relations specified in, for example, an OWL (2) ontology or SKOS taxonomy. And on the other hand allows algorithms that query the registry to traverse the stored semantic network to remain ignorant of the original source vocabulary. This paper describes first experiences with RELcat and explains some initial design decisions

    Annotation interoperability for the post-ISOCat era

    Get PDF
    With this paper, we provide an overview over ISOCat successor solutions and annotation standardization efforts since 2010, and we describe the low-cost harmonization of post-ISOCat vocabularies by means of modular, linked ontologies: The CLARIN Concept Registry, LexInfo, Universal Parts of Speech, Universal Dependencies and UniMorph are linked with the Ontologies of Linguistic Annotation and through it with ISOCat, the GOLD ontology, the Typological Database Systems ontology and a large number of annotation schemes

    ISOcat: Remodeling metadata for language resources

    No full text
    The Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands, is creating a state-of-the-art web environment for the ISO TC 37 (terminology and other language and content resources) metadata registry. This Data Category Registry (DCR) is called ISOcat and encompasses data categories for a broad range of language resources. Under the governance of the DCR Board, ISOcat provides an open work space for creating data category specifications, defining Data Category Selections (DCSs) (domain-specific groups of data categories), and standardising selected data categories and DCSs. Designers visualise future interactivity among the DCR, reference registries and ontological knowledge space

    Lin|gu|is|tik: building the linguist's pathway to bibliographies, libraries, language resources and linked open data

    Get PDF
    This paper introduces a novel research tool for the field of linguistics: The Lin|gu|is|tik web portal provides a virtual library which offers scientific information on every linguistic subject. It comprises selected internet sources and databases as well as catalogues for linguistic literature, and addresses an interdisciplinary audience. The virtual library is the most recent outcome of the Special Subject Collection Linguistics of the German Research Foundation (DFG), and also integrates the knowledge accumulated in the Bibliography of Linguistic Literature. In addition to the portal, we describe long-term goals and prospects with a special focus on ongoing efforts regarding an extension towards integrating language resources and Linguistic Linked Open Data

    Models to represent linguistic linked data

    Get PDF
    As the interest of the Semantic Web and computational linguistics communities in linguistic linked data (LLD) keeps increasing and the number of contributions that dwell on LLD rapidly grows, scholars (and linguists in particular) interested in the development of LLD resources sometimes find it difficult to determine which mechanism is suitable for their needs and which challenges have already been addressed. This review seeks to present the state of the art on the models, ontologies and their extensions to represent language resources as LLD by focusing on the nature of the linguistic content they aim to encode. Four basic groups of models are distinguished in this work: models to represent the main elements of lexical resources (group 1), vocabularies developed as extensions to models in group 1 and ontologies that provide more granularity on specific levels of linguistic analysis (group 2), catalogues of linguistic data categories (group 3) and other models such as corpora models or service-oriented ones (group 4). Contributions encompassed in these four groups are described, highlighting their reuse by the community and the modelling challenges that are still to be faced

    Standardizing a component metadata infrastructure

    No full text
    This paper describes the status of the standardization efforts of a Component Metadata approach for describing Language Resources with metadata. Different linguistic and Language & Technology communities as CLARIN, META-SHARE and NaLiDa use this component approach and see its standardization of as a matter for cooperation that has the possibility to create a large interoperable domain of joint metadata. Starting with an overview of the component metadata approach together with the related semantic interoperability tools and services as the ISOcat data category registry and the relation registry we explain the standardization plan and efforts for component metadata within ISO TC37/SC4. Finally, we present information about uptake and plans of the use of component metadata within the three mentioned linguistic and L&T communities

    Integrating WordNet and Wiktionary with lemon

    Get PDF
    Nowadays, there is a significant quantity of linguistic data available on the Web. However, linguistic resources are often published using proprietary formats and, as such, it can be difficult to interface with one another and they end up confined in “data silos”. The creation of web standards for the publishing of data on the Web and projects to create Linked Data have lead to interest in the creation of resources that can be published using Web principles. One of the most important aspects of “Lexical Linked Data” is the sharing of lexica and machine readable dictionaries. It is for this reason, that the lemon format has been proposed, which we briefly describe. We then consider two resources that seem ideal candidates for the Linked Data cloud, namely WordNet 3.0 and Wiktionary, a large document based dictionary. We discuss the challenges of converting both resources to lemon , and in particular for Wiktionary, the challenge of processing the mark-up, and handling inconsistencies and underspecification in the source material. Finally, we turn to the task of creating links between the two resources and present a novel algorithm for linking lexica as lexical Linked Data
    • 

    corecore