
    Creating Lexical Resources in TEI P5 : a Schema for Multi-purpose Digital Dictionaries

    Although most of the relevant dictionary productions of the recent past have relied on digital data and methods, there is little consensus on formats and standards. The Institute for Corpus Linguistics and Text Technology (ICLTT) of the Austrian Academy of Sciences has been conducting a number of varied lexicographic projects, both digitising print dictionaries and creating genuinely digital lexicographic data. This data was designed to serve varying purposes: machine-readability was only one; a second goal was interoperability with digital NLP tools. To this end, a uniform encoding system applicable across all the projects was developed. The paper describes the constraints imposed on the content models of the various elements of the TEI dictionary module and argues for TEI P5 as an encoding system suited not only to representing digitised print dictionaries but also to NLP purposes.
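As a rough illustration of the kind of encoding the abstract refers to, the sketch below assembles a minimal TEI P5 dictionary entry. The element names (`entry`, `form`, `orth`, `gramGrp`, `pos`, `sense`, `def`) are standard elements of the TEI dictionary module, but the entry content and the helper function are invented for illustration, not taken from the ICLTT schema itself:

```python
# Build a minimal TEI P5 dictionary entry with the Python standard library.
# Element names come from the TEI dictionary module; the lemma, part of
# speech, and definition are invented sample data.
import xml.etree.ElementTree as ET

def make_entry(lemma: str, pos: str, definition: str) -> ET.Element:
    entry = ET.Element("entry", attrib={"xml:id": f"e-{lemma}"})
    form = ET.SubElement(entry, "form", attrib={"type": "lemma"})
    ET.SubElement(form, "orth").text = lemma          # the headword
    gram_grp = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram_grp, "pos").text = pos         # part of speech
    sense = ET.SubElement(entry, "sense")
    ET.SubElement(sense, "def").text = definition     # sense definition
    return entry

entry = make_entry("lexicon", "noun", "the vocabulary of a language")
print(ET.tostring(entry, encoding="unicode"))
```

Constraining each element's content model (e.g. requiring exactly one lemma `form` per `entry`) is the kind of restriction a project-wide schema customization would impose on top of this.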

    Towards Finer Granularity in Metadata

    In early 2010, the Austrian Academy of Sciences’ ICLTT instituted an experiment in selective metadata creation for a medium-sized collection (<100 million tokens) of digitised periodicals. The project has two main objectives: (a) assigning basic structures to previously digitised texts, so-called divisions in TEI nomenclature, thus creating a set of new digital objects, and (b) the subsequent categorisation of these texts with the aim of creating thematically organised sub-corpora. An additional objective was to have the metadata stored as TEI headers. Attempts at streamlining metadata creation are legion, particularly in the library community. Tools to do the job are often incorporated into workflow engines, which include commercial products (such as docWORKS[e] and C-3) as well as free products such as Goobi, which incorporates the metadata creation tool RusDML, and the Archivists’ Toolkit™. The experimental workflow being tested at the ICLTT is an attempt to capture detailed metadata for a comparatively large collection of digitised periodicals and other collective publications such as yearbooks, readers, commemorative publications, almanacs, and anthologies. While all higher-level digital objects in the corpus were furnished with metadata from the beginning of the digitisation process, the current experiment is designed to enrich this data so as to describe the contents of the material more fully. To this end, the department’s standard tools were adapted, which had the added benefit of keeping software production costs to a minimum. While in earlier experiments our group of researchers (metadata creators) created the TEI header for each text division manually, we have since approached the problem by exploiting the contents sections of the digitised issues and/or other secondary sources, which has resulted in a tangible acceleration of the process.
    Together with collecting basic data such as author, title, publication date, and creation date, the project classifies each division by text type and topic, the latter using the standard Dewey Decimal Classification (version 22, German) with supplementary keywords. This paper discusses a number of issues concerning the quality and type of the resulting data. It also touches upon the question of automation and at what points in the process human intervention is indispensable. Particular attention is directed at the software module for creating TEI headers.
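A minimal sketch of what such a per-division header might look like: the structure (`teiHeader`, `fileDesc`, `profileDesc`, `textClass`, `classCode`, `keywords`) is standard TEI, but the scheme identifier `#ddc22de` and all sample metadata values are assumptions for illustration, not the project's actual output:

```python
# Generate a skeletal TEI header for one text division, carrying basic
# bibliographic data plus a DDC class code and free keywords.
# Structure is standard TEI; the sample values and the "#ddc22de" scheme
# pointer are invented for this sketch.
import xml.etree.ElementTree as ET

def make_header(title, author, date, ddc, keywords):
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    ET.SubElement(title_stmt, "author").text = author
    pub_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub_stmt, "date").text = date
    profile = ET.SubElement(header, "profileDesc")
    text_class = ET.SubElement(profile, "textClass")
    # DDC 22 (German) class number, pointing at a taxonomy declared elsewhere
    ET.SubElement(text_class, "classCode", scheme="#ddc22de").text = ddc
    kw = ET.SubElement(text_class, "keywords")
    for k in keywords:
        ET.SubElement(kw, "term").text = k
    return header

hdr = make_header("Ein Feuilleton", "N. N.", "1925", "830",
                  ["literature", "periodical"])
print(ET.tostring(hdr, encoding="unicode"))
```

Generating headers like this from a parsed table of contents, rather than typing each one by hand, is the kind of step that would account for the acceleration the abstract reports.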

    Modelling frequency data -- Methodological considerations on the relationship between dictionaries and corpora

    The research questions addressed in our paper stem from a bundle of linguistically focused projects which, among other activities, also create glossaries and dictionaries intended to be usable both by human readers and by particular NLP applications. The paper comprises two parts: in the first section, the authors give a concise overview of the projects and their goals. The second part concentrates on encoding issues involved in the related dictionary production, with particular focus on the modelling of an encoding scheme for statistical information on lexicographic data gleaned from digital corpora.

    A Machine-readable Persian-English Dictionary - dc-pes-eng (ELEXIS)

    Dictionary of Contemporary Persian. This dictionary project has grown out of a university language course and is highly experimental in nature. The focus in compiling the dictionary has been on contemporary language. In the beginning, all lexical items needed in the classes were entered into the dictionary; later we started to integrate data not available in other dictionaries. Particular attention has been paid to neologisms, which cannot be found in most of the (usually older) print dictionaries. Usage examples are adapted from various sources, for the most part coming from the Internet. Of particular importance is a Wikipedia version which we have made available as a TEI corpus.

    A Machine-readable Dictionary of Damascus Arabic - dc-apc-eng (ELEXIS)

    This dictionary has been prepared to support the Syrian Arabic textbook developed at the University of Vienna. See also: https://hdl.handle.net/11022/0000-0007-C093-

    A Digital Dictionary of Tunis Arabic - TUNICO (ELEXIS)

    A corpus-based dictionary, enriched with historical data. The dictionary was built not only on data from the corpus of spoken language compiled in the same project, but also on a range of additional sources: data elicited from complementary interviews with young Tunisians, and lexical material taken from various published historical sources dating from the middle of the 20th century and earlier. See also: https://hdl.handle.net/11022/0000-0007-C265-

    Modeling Frequency Data: Methodological Considerations on the Relationship between Dictionaries and Corpora

    Academic dictionary writing is making greater and greater use of the TEI Guidelines’ dictionary module. As increasing numbers of TEI dictionaries become available, there is an ever more palpable need to work towards greater interoperability among dictionary writing systems and the other language resources needed by dictionaries and dictionary tools. In particular, this holds true for the crucial role that statistical data obtained from language resources play in the lexicographic workflow, a role that also has to be reflected in the model of the data produced in these workflows. Presenting a range of current projects, the authors address two main questions in this area: How can the relationship between a dictionary and other language resources be conceptualized, irrespective of whether they are used in the production of the dictionary or to enrich existing lexicographic data? And how can this be documented using the TEI Guidelines? Discussing a variety of options, this paper proposes a customization of the TEI dictionary module that responds to the emerging requirements of an environment of increasingly intertwined language resources.
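One conceivable way to attach corpus-derived frequency figures to a dictionary entry is sketched below using TEI feature structures. The elements (`fs`, `f`, `string`, `numeric`) are standard TEI, but this particular modelling, and the corpus name, are assumptions for illustration, not necessarily the customization the paper proposes:

```python
# Attach a corpus frequency figure to a dictionary entry as a TEI feature
# structure. fs/f/string/numeric are standard TEI elements; treating
# frequency this way is only one possible modelling choice, and
# "demo-corpus" is an invented corpus identifier.
import xml.etree.ElementTree as ET

def add_frequency(entry: ET.Element, corpus: str, count: int) -> ET.Element:
    fs = ET.SubElement(entry, "fs", type="frequency")
    f_corpus = ET.SubElement(fs, "f", name="corpus")
    ET.SubElement(f_corpus, "string").text = corpus   # which resource was counted
    f_abs = ET.SubElement(fs, "f", name="absolute")
    ET.SubElement(f_abs, "numeric", value=str(count)) # absolute occurrence count
    return fs

entry = ET.Element("entry")
add_frequency(entry, "demo-corpus", 1234)
print(ET.tostring(entry, encoding="unicode"))
```

Recording the source corpus alongside the raw count is what lets the dictionary document its relationship to the resource, which is the conceptual question the abstract raises.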