25 research outputs found

    A Registry of Standard Data Categories for Linguistic Annotation

    In this paper we describe the most recent work within ISO TC37/SC 4, in particular the development of a Data Category Registry (DCR) component of the Linguistic Annotation Framework. The DCR will contain a formally defined set of linguistic categories in common use within the language engineering community, for reference and use in linguistically annotated resources. We outline the first proposals for the creation and management of the DCR, as a solicitation for input from the community.
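    A registry entry of this kind can be sketched minimally in code. The field names below are illustrative guesses at what a formally defined data category might carry (identifier, definition, thematic profile, permitted values); they are not taken from the ISO TC37/SC 4 specification.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Data Category Registry entry; the attribute
# names are illustrative, not the ISO TC37/SC 4 schema.
@dataclass
class DataCategory:
    identifier: str              # persistent identifier, e.g. "partOfSpeech"
    definition: str              # formal natural-language definition
    profile: str                 # thematic domain, e.g. "Morphosyntax"
    conceptual_domain: list[str] = field(default_factory=list)  # permitted values

registry: dict[str, DataCategory] = {}

def register(dc: DataCategory) -> None:
    """Add a category, rejecting duplicates so references stay stable."""
    if dc.identifier in registry:
        raise ValueError(f"duplicate data category: {dc.identifier}")
    registry[dc.identifier] = dc

register(DataCategory(
    identifier="partOfSpeech",
    definition="Category assigned to a word based on its grammatical behaviour",
    profile="Morphosyntax",
    conceptual_domain=["noun", "verb", "adjective"],
))
print(registry["partOfSpeech"].profile)  # Morphosyntax
```

    The point of the registry is exactly this kind of stable, shared lookup: annotated resources reference the identifier rather than redefining the category locally.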

    RELcat: a Relation Registry for ISOcat data categories

    The ISOcat Data Category Registry contains an essentially flat and easily extensible list of data category specifications. To foster reuse and standardization, only very shallow relationships among data categories are stored in the registry. However, to assist crosswalks between various (application) domains, possibly based on personal views, and to overcome a possible proliferation of data categories, more types of ontological relationships need to be specified. RELcat is a first prototype of a Relation Registry, which allows arbitrary relationships to be stored. These relationships can reflect the personal view of one linguist or of a larger community. The basis of the registry is a relation type taxonomy that can easily be extended. This makes it possible, on the one hand, to load existing sets of relations specified in, for example, an OWL (2) ontology or a SKOS taxonomy, and, on the other hand, allows algorithms that query the registry and traverse the stored semantic network to remain ignorant of the original source vocabulary. This paper describes first experiences with RELcat and explains some initial design decisions.
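    The key idea, a relation-type taxonomy that lets queries ignore the source vocabulary, can be sketched as follows. This is an illustrative toy, not the RELcat implementation: the relation names and the taxonomy are invented for the example.

```python
# Toy relation-type taxonomy (child -> parent): relations from different
# source vocabularies (SKOS, OWL) map onto one internal hierarchy.
SUPER = {
    "skos:broader": "isa",
    "owl:subClassOf": "isa",
    "isa": "related",
}

def generalizes(rel: str, target: str) -> bool:
    """True if rel equals target or is below it in the taxonomy."""
    while True:
        if rel == target:
            return True
        if rel not in SUPER:
            return False
        rel = SUPER[rel]

# Relationship triples loaded from heterogeneous sources.
triples = [
    ("commonNoun", "skos:broader", "noun"),
    ("properNoun", "owl:subClassOf", "noun"),
]

def ancestors(cat: str) -> set[str]:
    """Everything reachable via any relation that generalizes to 'isa',
    regardless of which source vocabulary the relation came from."""
    out, frontier = set(), [cat]
    while frontier:
        node = frontier.pop()
        for s, r, o in triples:
            if s == node and generalizes(r, "isa") and o not in out:
                out.add(o)
                frontier.append(o)
    return out

print(ancestors("commonNoun"))  # {'noun'}
```

    A query for the ancestors of `properNoun` succeeds in the same way even though its triple uses `owl:subClassOf` rather than `skos:broader`; the taxonomy absorbs the difference.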

    Using LMF to Shape a Lexicon for the Biomedical Domain

    This paper describes the design, implementation and population of the BioLexicon in the framework of BootStrep, an FP6 project. The BioLexicon (BL) is a lexical resource designed for text mining in the bio-domain. It has been conceived to meet both domain requirements and upcoming ISO standards for lexical representation. The data model and data categories are compliant with the ISO Lexical Markup Framework and the Data Category Registry. The BioLexicon integrates features of lexicons and terminologies: term entries (and variants) derived from existing resources are enriched with linguistic features, including subcategorization and predicate-argument information, extracted from texts. Thus, it is an extendable resource. Furthermore, the lexical entries will be aligned to concepts in the BioOntology, the ontological resource of the project. The BL implementation is an extensible relational database with automatic population procedures. Population relies on a dedicated input data structure that allows terms and their linguistic properties to be uploaded and "pulled and pushed" into the database. The BioLexicon shows that the state of the art is mature enough to aim at establishing a standard in this domain. Being conformant to lexical standards, the BioLexicon is interoperable and portable to other areas.

    Stabilizing knowledge through standards - A perspective for the humanities

    Standards are usually considered to generate mixed feelings among scientists. They are often seen as not really reflecting the state of the art in a given domain and as a hindrance to scientific creativity. Still, scientists should in principle be in the best position to bring their expertise into standards development, while being more neutral on issues that may typically relate to competing industrial interests. Even if developing standards in the humanities might be thought more complex still, we will show how this can be made feasible through the experience gained both within the Text Encoding Initiative consortium and the International Organisation for Standardisation. Taking the specific case of lexical resources, we will try to show how this brings about new ideas for designing future research infrastructures in the human and social sciences.

    TBX goes TEI -- Implementing a TBX basic extension for the Text Encoding Initiative guidelines

    This paper presents an attempt to customise the TEI (Text Encoding Initiative) guidelines in order to allow TBX (TermBase eXchange) based terminological entries to be incorporated within any kind of TEI document. After presenting the general historical, conceptual and technical contexts, we describe the various design choices we had to make while creating this customisation, which in turn led to various changes in the actual TBX serialisation. Keeping in mind the objective of providing the TEI guidelines with, again, an onomasiological model, we try to identify the best compromise in maintaining both the isomorphism with the existing TBX Basic standard and the characteristics of the TEI framework.
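    To make the embedding concrete, here is a rough sketch of building a TBX-Basic style terminological entry programmatically. The element names (`termEntry`, `langSet`, `tig`, `term`) follow TBX Basic, but the exact serialisation the customisation adopts within TEI may differ from this.

```python
import xml.etree.ElementTree as ET

# Sketch of a TBX-Basic style entry as it might be embedded in a TEI
# document; element names follow TBX Basic, the surrounding TEI wrapper
# and any renaming done by the customisation are omitted.
entry = ET.Element("termEntry", {"id": "te-1"})
lang_set = ET.SubElement(entry, "langSet", {"xml:lang": "en"})
tig = ET.SubElement(lang_set, "tig")
term = ET.SubElement(tig, "term")
term.text = "data category"

xml = ET.tostring(entry, encoding="unicode")
print(xml)
```

    The concept-first nesting (entry, then one `langSet` per language, then one `tig` per term) is what makes TBX onomasiological, and it is this orientation the paper seeks to preserve inside the TEI framework.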

    Unification of multi-lingual scientific terminological resources using the ISO 16642 standard. The TermSciences initiative

    This paper presents the TermSciences portal, which deals with the implementation of a conceptual model based on the recent ISO 16642 standard (Terminological Markup Framework). This standard turns out to be suitable for concept modeling, since it allows the original resources to be organized by concept and the various terms for a given concept to be associated with it. Additional structuring is produced by sharing conceptual relationships, that is, by cross-linking resource results through the introduction of semantic relations that may initially have been missing.
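    The concept-centred organisation can be illustrated with a minimal sketch: one concept groups the terms that denote it across languages, and relations link concepts to each other. The identifiers, terms and relation label below are invented examples, not the portal's actual data or schema.

```python
# Minimal sketch of an ISO 16642 (TMF) style concept store: terms are
# grouped under a concept, per language; relations connect concepts.
# All identifiers and labels here are illustrative.
concepts = {
    "C001": {
        "terms": {"en": ["myocardial infarction", "heart attack"],
                  "fr": ["infarctus du myocarde"]},
        "relations": [("C001", "isGenericOf", "C002")],
    },
}

def terms_for(concept_id: str, lang: str) -> list[str]:
    """All terms recorded for a concept in a given language."""
    return concepts[concept_id]["terms"].get(lang, [])

print(terms_for("C001", "en"))
```

    Merging a new resource then amounts to attaching its terms to an existing concept rather than creating a parallel entry, which is how the portal unifies multilingual terminologies.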

    A lexicon for biology and bioinformatics: the BOOTStrep experience

    This paper describes the design, implementation and population of a lexical resource for biology and bioinformatics (the BioLexicon) developed within an ongoing European project. The aim of this project is text-based knowledge harvesting in support of information extraction and text mining in the biomedical domain. The BioLexicon is a large-scale lexical-terminological resource encoding different information types in one single integrated resource. In the design of the resource we follow the ISO/DIS 24613 "Lexical Markup Framework" standard, which ensures reusability of the information encoded and easy exchange of both data and architecture. The design of the resource also takes into account the needs of our text mining partners, who automatically extract syntactic and semantic information from texts and feed it to the lexicon. The present contribution first describes in detail the model of the BioLexicon across its three main layers: morphology, syntax and semantics; it then briefly describes the database implementation of the model and the population strategy followed within the project, together with an example. The BioLexicon database in fact comes equipped with automatic uploading procedures based on a common exchange XML format, which guarantees that the lexicon can be properly populated with data coming from different sources.
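    The three-layer organisation named above can be sketched as a single entry carrying morphological, syntactic and semantic information side by side. The attribute names and example values are invented for illustration and do not reproduce the BioLexicon database schema.

```python
from dataclasses import dataclass, field

# Sketch of a lexical entry with the three layers the abstract names
# (morphology, syntax, semantics); all names here are illustrative.
@dataclass
class LexicalEntry:
    lemma: str
    pos: str                                                 # morphology
    subcat_frames: list[str] = field(default_factory=list)   # syntax
    semantic_types: list[str] = field(default_factory=list)  # semantics

entry = LexicalEntry(
    lemma="phosphorylate",
    pos="verb",
    subcat_frames=["NP-V-NP"],            # e.g. "X phosphorylates Y"
    semantic_types=["molecular_process"],
)
print(entry.lemma, entry.subcat_frames)
```

    Keeping the layers in one integrated record, rather than in separate terminological and lexical resources, is what lets extracted syntactic and semantic information be fed back into the same entry.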

    Proposals for a normalized representation of Standard Arabic full form lexica

    Standardized lexical resources are an important prerequisite for the development of robust and wide-coverage natural language processing applications. We therefore applied the Lexical Markup Framework (LMF), a recent ISO initiative towards standards for designing, implementing and representing lexical resources, to a test bed of data for an Arabic full-form lexicon. Apart from minor structural accommodation that would be needed to take into account the traditional root-based organization of Arabic dictionaries, the LMF proposal appeared suitable for our purpose, especially because of the separate management of the hierarchical data structure (LMF core model) and elementary linguistic descriptors (data categories).
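    One way the root-based accommodation might look: full-form entries keep their data categories as plain attribute-value pairs (the LMF separation of structure from descriptors), while an index groups them under their root. The transliterations and category names below are examples, not the paper's actual data.

```python
from collections import defaultdict

# Illustrative full-form entries for an Arabic lexicon; each entry keeps
# its data categories as attribute-value pairs, and an index recovers the
# traditional root-based grouping. Example data only.
entries = [
    {"writtenForm": "kataba", "root": "k-t-b", "partOfSpeech": "verb"},
    {"writtenForm": "kitAb",  "root": "k-t-b", "partOfSpeech": "noun"},
    {"writtenForm": "darasa", "root": "d-r-s", "partOfSpeech": "verb"},
]

by_root = defaultdict(list)
for e in entries:
    by_root[e["root"]].append(e["writtenForm"])

print(sorted(by_root["k-t-b"]))  # ['kataba', 'kitAb']
```

    The grouping is derived rather than structural, so the flat full-form list stays compatible with the LMF core model while still supporting root-based lookup.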

    Towards modeling Arabic lexicons compliant LMF in OWL-DL

    Elaborating reusable lexical databases and, especially, making interoperability operational are crucial tasks affecting both Natural Language Processing (NLP) and the Semantic Web. In this respect, we consider that modeling the Lexical Markup Framework (LMF) in the Web Ontology Language Description Logics dialect (OWL-DL) can be a beneficial attempt to reach these aims. This proposal should have broad relevance, since it concerns LMF, the reference standard for modeling lexical structures. In this paper, we study the requirements for this proposal. We first give a quick presentation of the LMF framework. Next, we describe the three ontology definition sublanguages aimed at different users: OWL Lite, OWL-DL and OWL Full. After comparing the three, we chose to work with OWL-DL. We then describe the steps needed to model LMF in OWL. Finally, we apply this model to develop an instance for an Arabic lexicon.
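    A few LMF core-model components declared as OWL classes and object properties might look like the Turtle fragment generated below. The class and property names mirror LMF core-model components, but the axioms are a hand-rolled illustration, not the paper's full model.

```python
# Emit a tiny OWL-DL fragment (as Turtle) declaring some LMF core-model
# components as classes and object properties. Illustrative only.
classes = ["LexicalResource", "Lexicon", "LexicalEntry", "Sense"]
object_props = [
    ("hasLexicon", "LexicalResource", "Lexicon"),
    ("hasEntry", "Lexicon", "LexicalEntry"),
    ("hasSense", "LexicalEntry", "Sense"),
]

lines = [
    "@prefix lmf: <http://example.org/lmf#> .",   # placeholder namespace
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
]
for c in classes:
    lines.append(f"lmf:{c} a owl:Class .")
for p, dom, rng in object_props:
    lines.append(f"lmf:{p} a owl:ObjectProperty ; "
                 f"rdfs:domain lmf:{dom} ; rdfs:range lmf:{rng} .")
turtle = "\n".join(lines)
print(turtle)
```

    Staying within OWL-DL (named classes, typed properties with domains and ranges, no metamodeling) keeps the resulting ontology decidable, which is presumably why the authors prefer it over OWL Full.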