
    Creating Lexical Resources in TEI P5 : a Schema for Multi-purpose Digital Dictionaries

    Although most of the relevant dictionary productions of the recent past have relied on digital data and methods, there is little consensus on formats and standards. The Institute for Corpus Linguistics and Text Technology (ICLTT) of the Austrian Academy of Sciences has been conducting a number of varied lexicographic projects, both digitising print dictionaries and creating genuinely digital lexicographic data. This data was designed to serve varying purposes: machine-readability was only one; a second goal was interoperability with digital NLP tools. To this end, a uniform encoding system applicable across all the projects was developed. The paper describes the constraints imposed on the content models of the various elements of the TEI dictionary module and argues for TEI P5 as an encoding system suited not only to representing digitised print dictionaries but also to NLP purposes.
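As a rough illustration of the kind of encoding the abstract refers to, the sketch below assembles a minimal TEI P5 dictionary entry. The element names (`entry`, `form`, `orth`, `gramGrp`, `pos`, `sense`, `def`) are standard elements of the TEI dictionary module, but the entry content and the helper function are invented for illustration, not taken from the ICLTT schema itself:

```python
# Build a minimal TEI P5 dictionary entry with the Python standard library.
# Element names come from the TEI dictionary module; the lemma, part of
# speech, and definition are invented sample data.
import xml.etree.ElementTree as ET

def make_entry(lemma: str, pos: str, definition: str) -> ET.Element:
    entry = ET.Element("entry", attrib={"xml:id": f"e-{lemma}"})
    form = ET.SubElement(entry, "form", attrib={"type": "lemma"})
    ET.SubElement(form, "orth").text = lemma          # the headword
    gram_grp = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram_grp, "pos").text = pos         # part of speech
    sense = ET.SubElement(entry, "sense")
    ET.SubElement(sense, "def").text = definition     # sense definition
    return entry

entry = make_entry("lexicon", "noun", "the vocabulary of a language")
print(ET.tostring(entry, encoding="unicode"))
```

Constraining each element's content model (e.g. requiring exactly one lemma `form` per `entry`) is the kind of restriction a project-wide schema customization would impose on top of this.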

    Towards Finer Granularity in Metadata

    In early 2010, the Austrian Academy of Sciences’ ICLTT instituted an experiment in selective metadata creation for a medium-sized collection (<100 million tokens) of digitised periodicals. The project has two main objectives: (a) assigning basic structures to previously digitised texts, so-called divisions in TEI nomenclature, thus creating a set of new digital objects, and (b) the subsequent categorisation of these texts with the aim of creating thematically organised sub-corpora. An additional objective was to have the metadata stored as TEI headers. Attempts at streamlining metadata creation are legion, particularly in the library community. Tools to do the job are often incorporated into workflow engines, which include commercial products (such as docWORKS[e] and C-3) as well as free products such as Goobi, which incorporates the metadata creation tool RusDML, and the Archivists’ Toolkit™. The experimental workflow being tested at the ICLTT is an attempt to capture detailed metadata for a comparatively large collection of digitised periodicals and other collective publications such as yearbooks, readers, commemorative publications, almanacs, and anthologies. While all higher-level digital objects in the corpus were furnished with metadata from the beginning of the digitisation process, the current experiment is designed to enrich this data so as to describe the contents of the material more fully. To this end, the department’s standard tools were adapted, which had the added benefit of keeping software production costs to a minimum. While in earlier experiments our group of researchers (metadata creators) created the TEI header for each text division manually, we have since approached the problem by exploiting the contents sections of the digitised issues and/or other secondary sources, which has resulted in a tangible acceleration of the process.
    Together with collecting basic data such as author, title, publication date, and creation date, the project classifies each division by text type and topic, the latter using the standard Dewey Decimal Classification (version 22, German) with supplementary keywords. This paper discusses a number of issues concerning the quality and type of the resulting data. It also touches upon the question of automation and at what points in the process human intervention is indispensable. Particular attention is directed at the software module for creating TEI headers.
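A minimal sketch of what such a per-division header might look like: the structure (`teiHeader`, `fileDesc`, `profileDesc`, `textClass`, `classCode`, `keywords`) is standard TEI, but the scheme identifier `#ddc22de` and all sample metadata values are assumptions for illustration, not the project's actual output:

```python
# Generate a skeletal TEI header for one text division, carrying basic
# bibliographic data plus a DDC class code and free keywords.
# Structure is standard TEI; the sample values and the "#ddc22de" scheme
# pointer are invented for this sketch.
import xml.etree.ElementTree as ET

def make_header(title, author, date, ddc, keywords):
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    ET.SubElement(title_stmt, "author").text = author
    pub_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub_stmt, "date").text = date
    profile = ET.SubElement(header, "profileDesc")
    text_class = ET.SubElement(profile, "textClass")
    # DDC 22 (German) class number, pointing at a taxonomy declared elsewhere
    ET.SubElement(text_class, "classCode", scheme="#ddc22de").text = ddc
    kw = ET.SubElement(text_class, "keywords")
    for k in keywords:
        ET.SubElement(kw, "term").text = k
    return header

hdr = make_header("Ein Feuilleton", "N. N.", "1925", "830",
                  ["literature", "periodical"])
print(ET.tostring(hdr, encoding="unicode"))
```

Generating headers like this from a parsed table of contents, rather than typing each one by hand, is the kind of step that would account for the acceleration the abstract reports.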

    Modelling frequency data -- Methodological considerations on the relationship between dictionaries and corpora

    The research questions addressed in our paper stem from a bundle of linguistically focused projects which, among other activities, also create glossaries and dictionaries intended to be usable both by human readers and by particular NLP applications. The paper comprises two parts: in the first section, the authors give a concise overview of the projects and their goals. The second part concentrates on encoding issues involved in the related dictionary production, with particular focus on the modelling of an encoding scheme for statistical information on lexicographic data gleaned from digital corpora.

    A Machine-readable Persian-English Dictionary - dc-pes-eng (ELEXIS)

    Dictionary of Contemporary Persian. This dictionary project has grown out of a university language course and is highly experimental in nature. The focus in compiling the dictionary has been on contemporary language. In the beginning, all lexical items needed in the classes were entered into the dictionary; later we started to integrate data not available in other dictionaries. Particular attention has been paid to neologisms, which cannot be found in most of the (usually older) print dictionaries. Usage examples are adapted from various sources, for the most part coming from the Internet. Of particular importance is a Wikipedia version which we have made available as a TEI corpus.

    A Machine-readable Dictionary of Damascus Arabic - dc-apc-eng (ELEXIS)

    This dictionary has been prepared to support the Syrian Arabic textbook developed at the University of Vienna. See also: https://hdl.handle.net/11022/0000-0007-C093-

    A Digital Dictionary of Tunis Arabic - TUNICO (ELEXIS)

    A corpus-based dictionary, enriched with historical data. The dictionary was built not only on data from the corpus of spoken language compiled in the same project, but also on a range of additional sources: data elicited from complementary interviews with young Tunisians, and lexical material taken from various published historical sources dating from the middle of the 20th century and earlier. See also: https://hdl.handle.net/11022/0000-0007-C265-

    Modeling Frequency Data: Methodological Considerations on the Relationship between Dictionaries and Corpora

    Academic dictionary writing is making greater and greater use of the TEI Guidelines’ dictionary module. As increasing numbers of TEI dictionaries become available, there is an ever more palpable need to work towards greater interoperability among dictionary writing systems and the other language resources needed by dictionaries and dictionary tools. In particular, this holds true for the crucial role that statistical data obtained from language resources play in the lexicographic workflow, a role that also has to be reflected in the model of the data produced in these workflows. Presenting a range of current projects, the authors address two main questions in this area: How can the relationship between a dictionary and other language resources be conceptualized, irrespective of whether they are used in the production of the dictionary or to enrich existing lexicographic data? And how can this be documented using the TEI Guidelines? Discussing a variety of options, this paper proposes a customization of the TEI dictionary module that responds to the emerging requirements of an environment of increasingly intertwined language resources.
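One conceivable way to attach corpus-derived frequency figures to a dictionary entry is sketched below using TEI feature structures. The elements (`fs`, `f`, `string`, `numeric`) are standard TEI, but this particular modelling, and the corpus name, are assumptions for illustration, not necessarily the customization the paper proposes:

```python
# Attach a corpus frequency figure to a dictionary entry as a TEI feature
# structure. fs/f/string/numeric are standard TEI elements; treating
# frequency this way is only one possible modelling choice, and
# "demo-corpus" is an invented corpus identifier.
import xml.etree.ElementTree as ET

def add_frequency(entry: ET.Element, corpus: str, count: int) -> ET.Element:
    fs = ET.SubElement(entry, "fs", type="frequency")
    f_corpus = ET.SubElement(fs, "f", name="corpus")
    ET.SubElement(f_corpus, "string").text = corpus   # which resource was counted
    f_abs = ET.SubElement(fs, "f", name="absolute")
    ET.SubElement(f_abs, "numeric", value=str(count)) # absolute occurrence count
    return fs

entry = ET.Element("entry")
add_frequency(entry, "demo-corpus", 1234)
print(ET.tostring(entry, encoding="unicode"))
```

Recording the source corpus alongside the raw count is what lets the dictionary document its relationship to the resource, which is the conceptual question the abstract raises.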