70 research outputs found

    Automatische Wortschatzerschließung großer Textkorpora am Beispiel des DWDS

    In recent years, a large number of electronic text corpora for German have been created due to the increased availability of electronic resources. Appropriate filtering of the lexical material in these corpora is a particular challenge for computational lexicography, since machine-readable lexicons alone are insufficient for systematic classification. In this paper we show – on the basis of the corpora of the DWDS – how lexical knowledge can be classified in a more fine-grained way with morphological and shallow syntactic parsing methods. One result of this analysis is that the number of distinct lemmas contained in the corpora is several times larger than the number of distinct headwords in current large monolingual German dictionaries.

    Measuring the Correctness of Double-Keying: Error Classification and Quality Control in a Large Corpus of TEI-Annotated Historical Text

    Among mass digitization methods, double-keying is considered to be the one with the lowest error rate. This method requires two independent transcriptions of a text by two different operators. It is particularly well suited to historical texts, which often exhibit deficiencies like poor master copies or other difficulties such as spelling variation or complex text structures. Providers of data entry services using the double-keying method generally advertise very high accuracy rates (around 99.95% to 99.98%). These advertised percentages are usually estimated on the basis of small samples, and little if anything is said about the actual amount of text or the text genres that have been proofread, about error types, proofreaders, etc. In order to obtain significant data on this problem, it is necessary to analyze a large amount of text representing a balanced sample of different text types, to distinguish the structural XML/TEI level from the typographical level, and to differentiate between various types of errors which may originate from different sources and may not be equally severe. This paper presents an extensive and complex approach to the analysis and correction of double-keying errors which has been applied by the DFG-funded project "Deutsches Textarchiv" (German Text Archive, hereafter DTA) in order to evaluate and, where possible, to increase the transcription and annotation accuracy of double-keyed DTA texts. Statistical analyses of the results gained from proofreading a large quantity of text are presented, which confirm the commonly advertised accuracy rates for the double-keying method.
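
    To make the kind of character-level accuracy measurement discussed above concrete, the following is a minimal sketch in Python; it is not the DTA project's actual tooling, and the sample strings as well as the edit-distance-style accuracy definition (approximated with difflib) are assumptions for illustration only.

        from difflib import SequenceMatcher

        def character_accuracy(reference: str, transcription: str) -> float:
            """Approximate character-level accuracy between two transcriptions.

            Characters outside the matching blocks count as errors
            (insertions, deletions, or substitutions)."""
            matcher = SequenceMatcher(None, reference, transcription)
            matched = sum(block.size for block in matcher.get_matching_blocks())
            errors = max(len(reference), len(transcription)) - matched
            return 1.0 - errors / max(len(reference), 1)

        # Hypothetical double-keyed transcriptions of one line of a historical text.
        keying_a = "Die Vernunft ist ein Licht, das uns leuchtet."
        keying_b = "Die Vernunft ist ein Licht, das vns leuchtet."
        print(f"agreement: {character_accuracy(keying_a, keying_b):.4%}")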

    Analysis to achieve a high penetration of renewable energies in MW-scale electricity Microgrids with the case study of an island in the Pacific

    As the penetration of intermittent renewable energies in MW-scale electrical grids becomes high, in many countries exceeding 25 % of yearly consumption, the need for control, stabilization and storage methods to guarantee a stable and constant supply at any given moment becomes crucial. Many technological solutions exist on the market, some more mature than others. A benchmarking of the grid stabilization and energy storage solutions offered by companies is followed by an overview of islands with an existing or planned high penetration of renewable energies. In a next step, a case study of the transition from a diesel-powered to a renewable-energy electricity grid on an island in the Pacific is presented. A final discussion of the techno-economic merit of each solution compares factors such as CAPEX and NPC.
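
    As a rough illustration of the techno-economic comparison mentioned above, the sketch below computes a net present cost (reading NPC as net present cost, which is an assumption here) from CAPEX and annual operating costs under a fixed discount rate; the cost model and all figures are placeholders, not values from the study.

        def net_present_cost(capex: float, annual_opex: float,
                             years: int, discount_rate: float) -> float:
            """CAPEX plus discounted annual operating costs over the project lifetime."""
            discounted_opex = sum(annual_opex / (1.0 + discount_rate) ** t
                                  for t in range(1, years + 1))
            return capex + discounted_opex

        # Hypothetical comparison: diesel generation vs. PV with battery storage.
        diesel = net_present_cost(capex=2.0e6, annual_opex=1.5e6, years=20, discount_rate=0.08)
        pv_storage = net_present_cost(capex=12.0e6, annual_opex=0.3e6, years=20, discount_rate=0.08)
        print(f"NPC diesel:     {diesel / 1e6:.1f} M$")
        print(f"NPC PV+storage: {pv_storage / 1e6:.1f} M$")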

    Rediscovering Hashed Random Projections for Efficient Quantization of Contextualized Sentence Embeddings

    Training and inference on edge devices often require an efficient setup due to computational limitations. While pre-computing data representations and caching them on a server can mitigate extensive edge-device computation, this leads to two challenges: first, the amount of storage required on the server scales linearly with the number of instances; second, sending such large amounts of data to an edge device requires considerable bandwidth. To reduce the memory footprint of pre-computed data representations, we propose a simple yet effective approach that uses randomly initialized hyperplane projections. To further reduce their size by up to 98.96%, we quantize the resulting floating-point representations into binary vectors. Despite the greatly reduced size, we show that the embeddings remain effective for training models across various English and German sentence classification tasks, retaining 94%–99% of their floating-point performance.
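
    The core idea lends itself to a short sketch: random hyperplane projections followed by sign binarization. The snippet below is a minimal illustration using NumPy with placeholder dimensions and synthetic embeddings; it is not the authors' released implementation.

        import numpy as np

        rng = np.random.default_rng(42)

        # Randomly initialized hyperplanes: map 768-dim float embeddings to 1024 bits.
        input_dim, n_bits = 768, 1024
        hyperplanes = rng.normal(size=(input_dim, n_bits))

        def binarize(embeddings: np.ndarray) -> np.ndarray:
            """Project onto the random hyperplanes and keep only the sign bit."""
            projections = embeddings @ hyperplanes      # (n, n_bits) floats
            return (projections > 0).astype(np.uint8)   # (n, n_bits) bits in {0, 1}

        # Placeholder "sentence embeddings"; a real setup would use a pre-trained encoder.
        float_embeddings = rng.normal(size=(5, input_dim)).astype(np.float32)
        bits = binarize(float_embeddings)

        # Storage: 32-bit floats vs. one bit per projection (packed into bytes).
        packed = np.packbits(bits, axis=1)
        print(float_embeddings.nbytes, "bytes as float32 ->", packed.nbytes, "bytes as packed bits")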

    The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources

    In this article we describe the DTA “Base Format” (DTABf), a strict subset of the TEI P5 tag set. The purpose of the DTABf is to provide a balance between expressiveness and precision as well as an interoperable annotation scheme for a large variety of text types in historical corpora of printed text from multiple sources. The DTABf has been developed on the basis of a large amount of historical text data in the core corpus of the project Deutsches Textarchiv (DTA) and text collections from 15 cooperating projects, with a current total of 210 million tokens. The DTABf is a “living” TEI format which is continuously adjusted whenever candidate texts for the DTA exhibit structural phenomena not yet covered. We also address other aspects of the DTABf, including consistency, interoperability with other TEI dialects, HTML and other presentations of the TEI texts, conversion into other formats, and linguistic analysis. We include some examples of best practices to illustrate how external corpora can be losslessly converted into the DTABf, thus enabling third parties to use the DTABf in their specific projects. The DTABf is comprehensively documented, and several software tools are available for working with it, making it a widely used format for the encoding of historical printed German text.

    Epistemic and social scripts in computer-supported collaborative learning

    Collaborative learning in computer-supported learning environments typically means that learners work on tasks together, discussing their individual perspectives via text-based media or videoconferencing, and consequently acquire knowledge. Collaborative learning, however, is often sub-optimal with respect to how learners work on the concepts that are supposed to be learned and how learners interact with each other. One possibility to improve collaborative learning environments is to conceptualize epistemic scripts, which specify how learners work on a given task, and social scripts, which structure how learners interact with each other. In this contribution, two studies are reported that investigated the effects of epistemic and social scripts in a text-based computer-supported learning environment and in a videoconferencing learning environment in order to foster the individual acquisition of knowledge. In each study, the factors ‘epistemic script’ and ‘social script’ were independently varied in a 2×2 factorial design. A total of 182 university students of Educational Science participated in the two studies. Results of both studies show that social scripts can be substantially beneficial with respect to the individual acquisition of knowledge, whereas epistemic scripts apparently do not lead to the expected effects.

    Event-Related Potentials Reveal Rapid Verification of Predicted Visual Input

    Human information processing depends critically on continuous predictions about upcoming events, but the temporal convergence of expectancy-based top-down and input-driven bottom-up streams is poorly understood. We show that, during reading, event-related potentials differ between exposure to highly predictable and unpredictable words no later than 90 ms after visual input. This result suggests an extremely rapid comparison of expected and incoming visual information and gives an upper temporal bound for theories of top-down and bottom-up interactions in object recognition.

    Eye movements during reading of randomly shuffled text

    In research on eye-movement control during reading, the importance of cognitive processes related to language comprehension relative to visuomotor aspects of saccade generation is the topic of an ongoing debate. Here we investigate various eye-movement measures during reading of randomly shuffled, meaningless text as compared to normal meaningful text. To ensure processing of the material, readers were occasionally probed for words occurring in the normal or shuffled text. For reading of shuffled text we observed longer fixation times, fewer word skippings, and more refixations than in normal reading. Shuffled-text reading further differed from normal reading in that low-frequency words were not overall fixated longer than high-frequency words. However, the frequency effect was present for long words but reversed for short words. Also, consistent with our prior research, we found distinct experimental effects of spatially distributed processing over several words at a time, indicating how lexical word processing affected eye movements. Based on analyses with statistical linear mixed-effects models, we argue that the results are compatible with the hypothesis that the perceptual span is more strongly modulated by foveal load in the shuffled reading task than in normal reading. Results are discussed in the context of computational models of reading.
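
    As an illustration of the kind of linear mixed-effects analysis referred to above, the sketch below fits fixation duration against word frequency and reading condition with by-subject random intercepts using statsmodels; the synthetic data, column names and model formula are assumptions, not the study's actual specification.

        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        # Synthetic per-fixation data: duration (ms), log word frequency,
        # reading condition, and subject identifier (for illustration only).
        rng = np.random.default_rng(1)
        n = 600
        data = pd.DataFrame({
            "subject": rng.integers(1, 26, size=n),
            "log_freq": rng.normal(3.0, 1.0, size=n),
            "condition": rng.choice(["normal", "shuffled"], size=n),
        })
        data["duration"] = (220 - 8 * data["log_freq"]
                            + 30 * (data["condition"] == "shuffled")
                            + rng.normal(0, 25, size=n))

        # By-subject random intercepts; fixed effects for frequency, condition,
        # and their interaction (where a reversed frequency effect would surface).
        model = smf.mixedlm("duration ~ log_freq * condition", data, groups=data["subject"])
        print(model.fit().summary())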

    Die dynamische Verknüpfung von Kollokationen mit Korpusbelegen und deren Repräsentationen im DWDS-Wörterbuch

    This article first presents the background of the DWDS dictionary. The second section gives a brief characterization of the notion of collocation used in the DWDS dictionary. Its embedding in the dictionary structure of the DWDS dictionary is described in the third section. The actual digital centerpiece of the collocation description in the DWDS dictionary is the DWDS-Wortprofil, an automatic collocation extraction based on syntactic analysis and statistical evaluation, whose foundations and quality are presented in Section 4. In Section 5, several examples illustrate how the division of labor between automatically extracted collocations and lexicographic intuition plays out in everyday lexicographic work. Finally, in the last section we give an outlook on future work.
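
    Automatic collocation extraction of the kind performed by the DWDS-Wortprofil is typically built on association scores computed over syntactically related word pairs. The sketch below shows one common measure, logDice, applied to hypothetical co-occurrence counts; the counts and the choice of measure are illustrative assumptions and do not describe the Wortprofil's actual implementation.

        import math

        def log_dice(cooc: int, freq_a: int, freq_b: int) -> float:
            """logDice association score: 14 + log2(2 * f(a,b) / (f(a) + f(b)))."""
            return 14 + math.log2(2 * cooc / (freq_a + freq_b))

        # Hypothetical counts from dependency-parsed corpus data:
        # (head, relation, dependent) -> co-occurrence frequency.
        pairs = {
            ("treffen", "obj", "Entscheidung"): 1200,
            ("essen", "obj", "Entscheidung"): 3,
        }
        word_freq = {"Entscheidung": 50_000, "treffen": 80_000, "essen": 60_000}

        for (head, rel, dep), f_ab in pairs.items():
            score = log_dice(f_ab, word_freq[head], word_freq[dep])
            print(f"{head} --{rel}--> {dep}: logDice = {score:.2f}")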