37 research outputs found

    Automatische Wortschatzerschließung großer Textkorpora am Beispiel des DWDS

    Get PDF
    In the past years, a large number of electronic text corpora for German have been created due to the increased availability of electronic resources. Appropriate filtering of lexical material in these corpora is a particular challenge for computational lexicography, since machine-readable lexicons alone are insufficient for systematic classification. In this paper we show – on the basis of the corpora of the DWDS – how lexical knowledge can be classified in a more fine-grained way with morphological and shallow syntactic parsing methods. One result of this analysis is that the number of different lemmas contained in the corpora exceeds the number of different headwords in current large monolingual German dictionaries several times over.

    Measuring the Correctness of Double-Keying: Error Classification and Quality Control in a Large Corpus of TEI-Annotated Historical Text

    Get PDF
    Among mass digitization methods, double-keying is considered to be the one with the lowest error rate. This method requires two independent transcriptions of a text by two different operators. It is particularly well suited to historical texts, which often exhibit deficiencies like poor master copies or other difficulties such as spelling variation or complex text structures. Providers of data entry services using the double-keying method generally advertise very high accuracy rates (around 99.95% to 99.98%). These advertised percentages are generally estimated on the basis of small samples, and little if anything is said about the actual amount of text or the text genres which have been proofread, about error types, proofreaders, etc. In order to obtain significant data on this problem, it is necessary to analyze a large amount of text representing a balanced sample of different text types, to distinguish the structural XML/TEI level from the typographical level, and to differentiate between various types of errors which may originate from different sources and may not be equally severe. This paper presents an extensive and complex approach to the analysis and correction of double-keying errors which has been applied by the DFG-funded project "Deutsches Textarchiv" (German Text Archive, hereafter DTA) in order to evaluate and preferably to increase the transcription and annotation accuracy of double-keyed DTA texts. Statistical analyses of the results gained from proofreading a large quantity of text are presented, which confirm the commonly advertised accuracy rates for the double-keying method.
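    The basic idea behind measuring double-keying quality — aligning two independent transcriptions and counting where they disagree — can be sketched as follows. This is a minimal illustration, not the DTA pipeline; the sample strings are invented, and note that positions where both operators made the *same* error are invisible to such a comparison, so the true error rate can only be higher than the measured disagreement.

    ```python
    from difflib import SequenceMatcher

    def keying_disagreement(a: str, b: str) -> float:
        """Fraction of characters on which two independent transcriptions differ."""
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        # Sum the sizes of all maximal matching blocks between the two keyings.
        matched = sum(block.size for block in matcher.get_matching_blocks())
        return 1.0 - matched / max(len(a), len(b))

    # Two hypothetical keyings of the same line; operator 2 dropped an umlaut.
    key1 = "Die Vernunft ist das Vermögen der Principien."
    key2 = "Die Vernunft ist das Vermogen der Principien."
    rate = keying_disagreement(key1, key2)  # one differing character out of 45
    ```

    Scaled up over a large, genre-balanced sample, and computed separately for the typographical and the XML/TEI annotation level, this kind of disagreement count is what underlies the accuracy figures discussed above.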

    Rediscovering Hashed Random Projections for Efficient Quantization of Contextualized Sentence Embeddings

    Full text link
    Training and inference on edge devices often require an efficient setup due to computational limitations. While pre-computing data representations and caching them on a server can mitigate extensive edge-device computation, this leads to two challenges: first, the storage required on the server scales linearly with the number of instances; second, considerable bandwidth is required to send large amounts of data to an edge device. To reduce the memory footprint of pre-computed data representations, we propose a simple yet effective approach that uses randomly initialized hyperplane projections. To further reduce their size by up to 98.96%, we quantize the resulting floating-point representations into binary vectors. Despite the greatly reduced size, we show that the embeddings remain effective for training models across various English and German sentence classification tasks, retaining 94%--99% of the performance of their floating-point counterparts.
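    The technique the abstract describes — projecting an embedding onto randomly oriented hyperplanes and keeping only the sign of each projection — is the classic hashed random projection. A minimal sketch under assumed dimensions (768-dim inputs, 1024 output bits; the paper's exact configuration may differ):

    ```python
    import numpy as np

    def hash_projection(embeddings: np.ndarray, n_bits: int = 1024, seed: int = 0) -> np.ndarray:
        """Quantize float embeddings into binary codes via random hyperplanes.

        Each output bit records on which side of a randomly oriented hyperplane
        the input vector falls, i.e. the sign of a random projection.
        """
        rng = np.random.default_rng(seed)
        dim = embeddings.shape[1]
        hyperplanes = rng.normal(size=(dim, n_bits))  # fixed random projection matrix
        projected = embeddings @ hyperplanes          # (n, n_bits) float scores
        bits = (projected > 0).astype(np.uint8)       # one bit per hyperplane
        return np.packbits(bits, axis=1)              # pack 8 bits into each byte

    # A 768-dim float32 vector (3072 bytes) becomes a 1024-bit code (128 bytes),
    # i.e. roughly a 96% size reduction before any further compression.
    sentences = np.random.randn(4, 768).astype(np.float32)
    codes = hash_projection(sentences)
    ```

    Because the hyperplanes are fixed by the seed, the server and the edge device can reproduce the same projection independently; only the packed codes need to be transmitted.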

    The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources

    Get PDF
    In this article we describe the DTA “Base Format” (DTABf), a strict subset of the TEI P5 tag set. The purpose of the DTABf is to provide a balance between expressiveness and precision, as well as an interoperable annotation scheme, for a large variety of text types in historical corpora of printed text from multiple sources. The DTABf has been developed on the basis of a large amount of historical text data in the core corpus of the project Deutsches Textarchiv (DTA) and text collections from 15 cooperating projects, with a current total of 210 million tokens. The DTABf is a “living” TEI format which is continuously adjusted when new text candidates for the DTA containing new structural phenomena are encountered. We also focus on other aspects of the DTABf including consistency, interoperability with other TEI dialects, HTML and other presentations of the TEI texts, and conversion into other formats, as well as linguistic analysis. We include some examples of best practices to illustrate how external corpora can be losslessly converted into the DTABf, thus enabling third parties to use the DTABf in their specific projects. The DTABf is comprehensively documented, and several software tools are available for working with it, making it a widely used format for the encoding of historical printed German text.

    Die dynamische Verknüpfung von Kollokationen mit Korpusbelegen und deren Repräsentationen im DWDS-Wörterbuch

    Get PDF
    This article first presents the background of the DWDS dictionary. The second section briefly characterizes the notion of collocation used in the DWDS dictionary. Its embedding in the structure of the DWDS dictionary is described in the third section. The actual digital centerpiece of the collocation description in the DWDS dictionary is the DWDS-Wortprofil, an automatic collocation extraction based on syntactic analysis and statistical evaluation, whose foundations and quality are presented in Section 4. Section 5 uses several examples to illustrate how the division of labor between automatically extracted collocations and lexicographic intuition plays out in day-to-day lexicographic work. Finally, the last section gives an outlook on future work.
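    Collocation extraction of the kind the DWDS-Wortprofil performs combines syntactically analyzed co-occurrence counts with an association statistic. The actual measure used by the Wortprofil may differ; logDice is a common choice in corpus lexicography and serves as a minimal sketch (the counts below are invented):

    ```python
    import math

    def log_dice(cooc: int, freq_head: int, freq_collocate: int) -> float:
        """logDice association score: 14 + log2(2*f_xy / (f_x + f_y)).

        A score of 14 means the two words only occur together; each point
        less corresponds to roughly half the association strength.
        """
        return 14 + math.log2(2 * cooc / (freq_head + freq_collocate))

    # Hypothetical corpus counts for a verb-object pair such as
    # ("Entscheidung", "treffen"):
    score = log_dice(cooc=8_000, freq_head=120_000, freq_collocate=90_000)
    ```

    Ranking the syntactic partners of a headword by such a score, rather than by raw frequency, is what lets the automatic extraction surface lexicographically interesting collocations for the lexicographer to review.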

    Support of self-regulated Learning in Tele-CBT environments

    Full text link
    The aim of this study is to examine the extent to which self-regulated learning can be supported in Tele-CBT environments. In Tele-CBT environments, the learner can contact a tele-tutor in the event that difficulties are encountered while working with a CBT. Within the DELTA project Malibu, a pilot study was conducted with service technicians. The study examined acceptance, the learning situation, the learning process, and learning success among learners working in the Tele-CBT environment, compared with a group that worked through the CBT alone. The results show that the learners accepted the Tele-CBT environment, coped better with the learning situation and the learning process than the CBT-only group, and were more successful in transfer tasks. In conclusion, these results are discussed with respect to the options for employing Tele-CBT environments in corporate training. (DIPF/Orig.)

    „entsorgen“ – eine kurze Beschreibung der Bedeutungsentwicklung mit dem DWDS

    No full text
    Authors: Alexander Geyken and Norbert Schrader. With his latest verbal derailment — AfD politician Gauland saying he wanted to “dispose of” (“entsorgen”) SPD politician Aydan Özoguz, the Federal Government's Integration Commissioner, to Anatolia — Gauland once again drew fierce reactions from all parties. This blog post will not take up the substantive debate (see e.g. faz, welt), but rather offer a linguistic assessment of the discussion sparked by the word choice “entsorgen” …

    Refining and Exploiting the Structural Markup of the eWDG

    Get PDF
    In this paper, the authors describe a semi-automated approach to refine the dictionary-entry structure of the digital version of the Wörterbuch der deutschen Gegenwartssprache (WDG, en.: Dictionary of Present-day German), a dictionary compiled and published between 1952 and 1977 by the Deutsche Akademie der Wissenschaften that comprises six volumes with over 4,500 pages containing more than 120,000 headwords. We discuss the benefits of such a refinement in the context of the dictionary project Digitales Wörterbuch der deutschen Sprache (DWDS, en.: Digital Dictionary of the German Language). In the current phase of the DWDS project, we aim to integrate multiple dictionary and corpus resources in the German language into a digital lexical system (DLS). In this context, we plan to expand the current DWDS interface with several special-purpose components, which are adaptive in the sense that they offer specialized data views and search mechanisms for different dictionary functions (e.g. text comprehension, text production) and different user groups (e.g. journalists, translators, linguistic researchers, computational linguists). One prerequisite for generating such data views is selective access to the lexical items in the article structure of the dictionaries under study. For this purpose, the representation of the eWDG has to be refined. The focus of this paper is on the semi-automated approach used to transform the eWDG into a refined version in which the main structural units can be explicitly accessed. We show how this refinement opens new and flexible ways of visualizing and querying the lexicographic content of the refined version in the context of the DLS project.

    Die Webkorpora im DWDS – Strategien des Korpusaufbaus und Nutzungsmöglichkeiten

    No full text
    The core task of the DWDS project group is to describe the vocabulary contained in the corpora lexicographically and on a corpus basis. In modern lexicography, statements about the linguistic aspects and properties of the words described, and about particularities of their usage, are grounded in corpus evidence. Empirically, very large text collections make it possible to substantiate hypotheses more precisely or more thoroughly. In doing so, it becomes apparent how diversely language is actually realized in use. To this end, the DWDS platform offers, alongside the core corpora (balanced over time and across text types) and the newspaper corpora, a number of special corpora that diverge from the former in their subject matter or their linguistic characteristics. The web corpora form an essential part of these special corpora.

    Journal for language technology and computational linguistics. Corpus linguistic software tools

    No full text
    With the growing availability and importance of (large) corpora in all fields of linguistics, the role of software tools is gradually shifting from useful, possibly intelligent information-technological “helpers” towards scientific instruments that are as integral to the research process as data, methodology, and interpretation. Both aspects are present in this special issue of JLCL on corpus linguistic software tools.