
    Exploring lexical patterns in text: lexical cohesion analysis with WordNet

    We present a system for the linguistic exploration and analysis of lexical cohesion in English texts. Using an electronic thesaurus-like resource, Princeton WordNet, and the Brown Corpus of English, we have implemented a process of annotating text with lexical chains and a graphical user interface for inspection of the annotated text. We describe the system and report on some sample linguistic analyses carried out using the combined thesaurus-corpus resource.
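    The core idea, chaining words that are related through WordNet, can be sketched in a few lines. The following is a minimal, illustrative Python sketch using NLTK's WordNet interface rather than the authors' own implementation; the greedy chaining strategy and the example nouns are assumptions made for the illustration.

```python
# Minimal sketch of lexical chaining over WordNet (not the authors' system).
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def related(w1, w2):
    """True if two nouns share a synset or stand in a direct hypernym relation."""
    s1 = set(wn.synsets(w1, pos=wn.NOUN))
    s2 = set(wn.synsets(w2, pos=wn.NOUN))
    if s1 & s2:                                   # repetition / synonymy
        return True
    h1 = {h for s in s1 for h in s.hypernyms()}   # one step up the hierarchy
    h2 = {h for s in s2 for h in s.hypernyms()}
    return bool(h1 & s2) or bool(h2 & s1)

def build_chains(nouns):
    """Greedily attach each noun to the first chain containing a related word."""
    chains = []
    for noun in nouns:
        for chain in chains:
            if any(related(noun, member) for member in chain):
                chain.append(noun)
                break
        else:
            chains.append([noun])
    return chains

# Typically groups 'dog', 'canine' and 'puppy' into one chain, 'banana' into another.
print(build_chains(["dog", "canine", "puppy", "banana"]))
```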

    Towards an integrated representation of multiple layers of linguistic annotation in multilingual corpora

    In the proposed talk we discuss the application of a set of computational text analysis techniques for the analysis of the linguistic features of translations. The goal of this analysis is to test two hypotheses about the specific properties of translations: Baker's hypothesis of normalization (Baker, 1995) and Toury's law of interference (Toury, 1995). The corpus we analyze consists of English and German original texts and translations of those texts into German and English, respectively. The analysis task is complex in a number of respects. First, a multi-level analysis (clauses, phrases, words) has to be carried out; second, among the linguistic features selected for analysis are some rather abstract ones, ranging from functional-grammatical features, e.g., Subject, Adverbial of Time, etc., to semantic features, e.g., semantic roles such as Agent, Goal, Locative, etc.; third, monolingual and contrastive analyses are involved. This places certain requirements on the computational techniques to be employed regarding corpus encoding, linguistic annotation, and information extraction. We show how a combination of commonly available techniques can fulfill these requirements to a large degree and point out their limitations for application to the research questions raised. These techniques range from document encoding (TEI, XML) through automatic corpus annotation (notably part-of-speech tagging; Brants, 2000) and semi-automatic annotation (O'Donnell, 1995) to query systems such as the IMS Corpus Workbench (Christ, 1994), the MATE system (Mengel & Lezius, 2000) and the Gsearch system (Keller et al., 1999). Hosted by the Scholarly Text and Imaging Service (SETIS), the University of Sydney Library, and the Research Institute for Humanities and Social Sciences (RIHSS), the University of Sydney.
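    As a rough illustration of what multiple layers of linguistic annotation over one text can look like in practice, the following Python sketch represents layers as stand-off spans over a shared token sequence and runs a simple query across them. The layer names, labels and the query function are hypothetical examples, not the encoding or query systems discussed in the talk.

```python
# Illustrative stand-off representation of multiple annotation layers
# over one tokenised sentence; all labels below are invented examples.
tokens = ["The", "committee", "approved", "the", "proposal", "yesterday"]

# Each layer maps to a list of (label, start, end) spans over the tokens.
layers = {
    "pos":      [("DT", 0, 1), ("NN", 1, 2), ("VBD", 2, 3),
                 ("DT", 3, 4), ("NN", 4, 5), ("RB", 5, 6)],
    "function": [("Subject", 0, 2), ("Adverbial of Time", 5, 6)],
    "role":     [("Agent", 0, 2), ("Goal", 3, 5)],
}

def query(layer, label):
    """Return the token strings annotated with a given label on a given layer."""
    return [" ".join(tokens[start:end])
            for lab, start, end in layers[layer] if lab == label]

print(query("function", "Subject"))  # ['The committee']
print(query("role", "Goal"))         # ['the proposal']
```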

    5. Generische Infrastruktur und spezifische Forschung: Angebote und Lösungen

    Empirical research on natural-language data is accompanied by fundamental methodological changes. More and more texts are available in digital form, and a purely manual approach is either impossible or extremely time-consuming. We show the advantages that the use of generic infrastructure components can offer for specific research: (i) efficient studies on larger amounts of data, and (ii) reproducible and transferable results. Using a concrete study, we show how generic infrastructure can be adapted for specific purposes and complemented by specific solutions. The work described in this article was supported by the Bundesministerium für Bildung und Forschung within the CLARIN-D project.

    Using relative entropy for detection and analysis of periods of diachronic linguistic change

    We present a data-driven approach to detect periods of linguistic change and the lexical and grammatical features contributing to change. We focus on the development of scientific English in the late modern period. Our approach is based on relative entropy (Kullback-Leibler divergence), comparing temporally adjacent periods and sliding over the timeline from past to present. Using a diachronic corpus of scientific publications of the Royal Society of London, we show how periods of change reflect the interplay between lexis and grammar, where periods of lexical expansion are typically followed by periods of grammatical consolidation, resulting in a balance between expressivity and communicative efficiency. Our method is generic and can be applied to other data sets, languages and time ranges. This research is funded by the German Research Foundation (Deutsche Forschungsgemeinschaft) under grants SFB 1102: Information Density and Linguistic Encoding (www.sfb1102.uni-saarland.de) and EXC 284: Multimodal Computing and Interaction (www.mmci.uni-saarland.de).
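    The core computation is easy to sketch: relative entropy (Kullback-Leibler divergence) between the feature distributions of temporally adjacent slices, computed while sliding along the timeline. The Python sketch below assumes hypothetical unigram counts per slice and additive smoothing; the study itself uses richer lexical and grammatical features of the Royal Society Corpus.

```python
# Hedged sketch: KL divergence between adjacent time slices over a shared,
# smoothed vocabulary. Counts and years below are invented for illustration.
import math
from collections import Counter

def kld(p_counts, q_counts, alpha=0.01):
    """D(P || Q) in bits, with additive smoothing over the joint vocabulary."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    divergence = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / p_total
        q = (q_counts[w] + alpha) / q_total
        divergence += p * math.log2(p / q)
    return divergence

slices = {
    1700: Counter("the air is elastick and the air expands".split()),
    1750: Counter("the air is elastic and expands when heated".split()),
    1800: Counter("the experiment was conducted and the results recorded".split()),
}

# Slide over temporally adjacent periods; a peak suggests a period of change.
years = sorted(slices)
for past, present in zip(years, years[1:]):
    print(past, "->", present, round(kld(slices[present], slices[past]), 3))
```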

    Modeling intra-textual variation with entropy and surprisal: topical vs. stylistic patterns

    We present a data-driven approach to investigate intra-textual variation by combining entropy and surprisal. With this approach we detect linguistic variation based on phrasal lexico-grammatical patterns across sections of research articles. Entropy is used to detect patterns typical of specific sections. Surprisal is used to differentiate between more and less informationally loaded patterns as well as between types of information (topical vs. stylistic). While we focus here on research articles in biology/genetics, the methodology is especially interesting for digital humanities scholars, as it can be applied to any text type or domain and combined with additional variables (e.g., time, author or social group). This work is funded by the Deutsche Forschungsgemeinschaft (DFG) under grants SFB 1102: Information Density and Linguistic Encoding (www.sfb1102.uni-saarland.de) and EXC 284: Multimodal Computing and Interaction (www.mmci.uni-saarland.de).
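    A minimal sketch of the two measures on toy data: entropy over the pattern distribution of one section, and surprisal of an individual pattern within that section. The section names and counts below are invented; the study itself operates on phrasal lexico-grammatical patterns extracted from research articles.

```python
# Hedged sketch of section-level entropy and pattern-level surprisal,
# computed over hypothetical pattern counts per article section.
import math
from collections import Counter

sections = {
    "introduction": Counter({"in this paper": 8, "the role of": 5, "we show that": 3}),
    "methods":      Counter({"was carried out": 9, "in this paper": 1, "the role of": 2}),
}

def entropy(counts):
    """Shannon entropy (bits) of the pattern distribution within one section."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def surprisal(pattern, counts):
    """Surprisal (bits) of one pattern, given its relative frequency in a section."""
    total = sum(counts.values())
    return -math.log2(counts[pattern] / total)

for name, counts in sections.items():
    print(name,
          "entropy:", round(entropy(counts), 2),
          "surprisal('in this paper'):", round(surprisal("in this paper", counts), 2))
```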

    Topical Diversification Over Time In The Royal Society Corpus


    Generating linguistically relevant metadata for the Royal Society Corpus

    This paper provides an overview of metadata generation and management for the Royal Society Corpus (RSC), aiming to encourage discussion about the specific challenges in building substantial diachronic corpora intended to be used for linguistic and humanistic analysis. We discuss the motivations and goals of building the corpus, describe its composition and present the types of metadata it contains. Specifically, we tackle two challenges: first, the integration of original metadata from the data providers (JSTOR and the Royal Society); second, the derivation of additional linguistically relevant metadata regarding text structure and situational context (register).

    The Making of the Royal Society Corpus

    The Royal Society Corpus is a corpus of Early and Late Modern English, built in an agile process, covering publications of the Royal Society of London from 1665 to 1869 (Kermes et al., 2016) and comprising approximately 30 million words. In this paper we provide details on two aspects of the building process, namely the mining of patterns for OCR correction and the improvement and evaluation of part-of-speech tagging.
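    As an illustration of what pattern-based OCR post-correction can look like, the Python sketch below applies ordered substitution rules such as long-s normalisation and de-hyphenation. The rules and the example sentence are hypothetical; they are not the patterns mined for the corpus.

```python
# Illustrative OCR post-correction via ordered substitution patterns;
# the rules below are invented examples, not the mined patterns from the paper.
import re

RULES = [
    (re.compile(r"ſ"), "s"),                     # long s -> s
    (re.compile(r"\bfome\b"), "some"),           # common f/s confusions
    (re.compile(r"\bfuch\b"), "such"),
    (re.compile(r"(\w+)-\n(\w+)"), r"\1\2"),     # rejoin end-of-line hyphenation
]

def correct(text):
    """Apply each correction pattern in order and return the cleaned text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(correct("It is found that fome bodies, in fuch experi-\nments, expand."))
# -> It is found that some bodies, in such experiments, expand.
```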