
    Exploring lexical patterns in text: lexical cohesion analysis with WordNet

    We present a system for the linguistic exploration and analysis of lexical cohesion in English texts. Using an electronic thesaurus-like resource, Princeton WordNet, and the Brown Corpus of English, we have implemented a process of annotating text with lexical chains and a graphical user interface for inspection of the annotated text. We describe the system and report on some sample linguistic analyses carried out using the combined thesaurus-corpus resource.
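    The core idea, chaining words that are related through WordNet, can be sketched in a few lines. The following is a minimal, illustrative Python sketch using NLTK's WordNet interface rather than the authors' own implementation; the greedy chaining strategy and the example nouns are assumptions made for the illustration.

```python
# Minimal sketch of lexical chaining over WordNet (not the authors' system).
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def related(w1, w2):
    """True if two nouns share a synset or stand in a direct hypernym relation."""
    s1 = set(wn.synsets(w1, pos=wn.NOUN))
    s2 = set(wn.synsets(w2, pos=wn.NOUN))
    if s1 & s2:                                   # repetition / synonymy
        return True
    h1 = {h for s in s1 for h in s.hypernyms()}   # one step up the hierarchy
    h2 = {h for s in s2 for h in s.hypernyms()}
    return bool(h1 & s2) or bool(h2 & s1)

def build_chains(nouns):
    """Greedily attach each noun to the first chain containing a related word."""
    chains = []
    for noun in nouns:
        for chain in chains:
            if any(related(noun, member) for member in chain):
                chain.append(noun)
                break
        else:
            chains.append([noun])
    return chains

# Typically groups 'dog', 'canine' and 'puppy' into one chain, 'banana' into another.
print(build_chains(["dog", "canine", "puppy", "banana"]))
```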

    Towards an integrated representation of multiple layers of linguistic annotation in multilingual corpora

    In the proposed talk we discuss the application of a set of computational text analysis techniques for the analysis of the linguistic features of translations. The goal of this analysis is to test two hypotheses about the specific properties of translations: Baker's hypothesis of normalization (Baker, 1995) and Toury's law of interference (Toury, 1995). The corpus we analyze consists of English and German original texts and translations of those texts into German and English, respectively. The analysis task is complex in a number of respects. First, a multi-level analysis (clauses, phrases, words) has to be carried out; second, among the linguistic features selected for analysis are some rather abstract ones, ranging from functional-grammatical features, e.g., Subject, Adverbial of Time, etc., to semantic features, e.g., semantic roles such as Agent, Goal, Locative, etc.; third, monolingual and contrastive analyses are involved. This places certain requirements on the computational techniques to be employed regarding corpus encoding, linguistic annotation, and information extraction. We show how a combination of commonly available techniques can fulfill these requirements to a large degree and point out their limitations for application to the research questions raised. These techniques range from document encoding (TEI, XML) through automatic corpus annotation (notably part-of-speech tagging; Brants, 2000) and semi-automatic annotation (O'Donnell, 1995) to query systems such as the IMS Corpus Workbench (Christ, 1994), the MATE system (Mengel & Lezius, 2000) and the Gsearch system (Keller et al., 1999). Hosted by the Scholarly Text and Imaging Service (SETIS), the University of Sydney Library, and the Research Institute for Humanities and Social Sciences (RIHSS), the University of Sydney.
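    As a rough illustration of what multiple layers of linguistic annotation over one text can look like in practice, the following Python sketch represents layers as stand-off spans over a shared token sequence and runs a simple query across them. The layer names, labels and the query function are hypothetical examples, not the encoding or query systems discussed in the talk.

```python
# Illustrative stand-off representation of multiple annotation layers
# over one tokenised sentence; all labels below are invented examples.
tokens = ["The", "committee", "approved", "the", "proposal", "yesterday"]

# Each layer maps to a list of (label, start, end) spans over the tokens.
layers = {
    "pos":      [("DT", 0, 1), ("NN", 1, 2), ("VBD", 2, 3),
                 ("DT", 3, 4), ("NN", 4, 5), ("RB", 5, 6)],
    "function": [("Subject", 0, 2), ("Adverbial of Time", 5, 6)],
    "role":     [("Agent", 0, 2), ("Goal", 3, 5)],
}

def query(layer, label):
    """Return the token strings annotated with a given label on a given layer."""
    return [" ".join(tokens[start:end])
            for lab, start, end in layers[layer] if lab == label]

print(query("function", "Subject"))  # ['The committee']
print(query("role", "Goal"))         # ['the proposal']
```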

    5. Generische Infrastruktur und spezifische Forschung: Angebote und Lösungen

    Empirical research on natural-language data is accompanied by fundamental methodological changes. More and more texts are available in digital form, and a purely manual approach is either impossible or extremely time-consuming. We show the advantages that the use of generic infrastructure components can offer for specific research: (i) efficient studies on larger amounts of data, and (ii) reproducible and transferable results. Using a concrete study, we show how generic infrastructure can be adapted for specific purposes and complemented by specific solutions. The work described in this article was supported by the Bundesministerium für Bildung und Forschung within the CLARIN-D project.

    Using relative entropy for detection and analysis of periods of diachronic linguistic change

    We present a data-driven approach to detect periods of linguistic change and the lexical and grammatical features contributing to change. We focus on the development of scientific English in the late modern period. Our approach is based on relative entropy (Kullback-Leibler divergence), comparing temporally adjacent periods and sliding over the timeline from past to present. Using a diachronic corpus of scientific publications of the Royal Society of London, we show how periods of change reflect the interplay between lexis and grammar, where periods of lexical expansion are typically followed by periods of grammatical consolidation, resulting in a balance between expressivity and communicative efficiency. Our method is generic and can be applied to other data sets, languages and time ranges. This research is funded by the German Research Foundation (Deutsche Forschungsgemeinschaft) under grants SFB 1102: Information Density and Linguistic Encoding (www.sfb1102.uni-saarland.de) and EXC 284: Multimodal Computing and Interaction (www.mmci.uni-saarland.de).
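    The core computation is easy to sketch: relative entropy (Kullback-Leibler divergence) between the feature distributions of temporally adjacent slices, computed while sliding along the timeline. The Python sketch below assumes hypothetical unigram counts per slice and additive smoothing; the study itself uses richer lexical and grammatical features of the Royal Society Corpus.

```python
# Hedged sketch: KL divergence between adjacent time slices over a shared,
# smoothed vocabulary. Counts and years below are invented for illustration.
import math
from collections import Counter

def kld(p_counts, q_counts, alpha=0.01):
    """D(P || Q) in bits, with additive smoothing over the joint vocabulary."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    divergence = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / p_total
        q = (q_counts[w] + alpha) / q_total
        divergence += p * math.log2(p / q)
    return divergence

slices = {
    1700: Counter("the air is elastick and the air expands".split()),
    1750: Counter("the air is elastic and expands when heated".split()),
    1800: Counter("the experiment was conducted and the results recorded".split()),
}

# Slide over temporally adjacent periods; a peak suggests a period of change.
years = sorted(slices)
for past, present in zip(years, years[1:]):
    print(past, "->", present, round(kld(slices[present], slices[past]), 3))
```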

    Modeling intra-textual variation with entropy and surprisal: topical vs. stylistic patterns

    We present a data-driven approach to investigate intra-textual variation by combining entropy and surprisal. With this approach we detect linguistic variation based on phrasal lexico-grammatical patterns across sections of research articles. Entropy is used to detect patterns typical of specific sections. Surprisal is used to differentiate between more and less informationally loaded patterns as well as between types of information (topical vs. stylistic). While we focus here on research articles in biology/genetics, the methodology is especially interesting for digital humanities scholars, as it can be applied to any text type or domain and combined with additional variables (e.g., time, author or social group). This work is funded by the Deutsche Forschungsgemeinschaft (DFG) under grants SFB 1102: Information Density and Linguistic Encoding (www.sfb1102.uni-saarland.de) and EXC 284: Multimodal Computing and Interaction (www.mmci.uni-saarland.de).
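    A minimal sketch of the two measures on toy data: entropy over the pattern distribution of one section, and surprisal of an individual pattern within that section. The section names and counts below are invented; the study itself operates on phrasal lexico-grammatical patterns extracted from research articles.

```python
# Hedged sketch of section-level entropy and pattern-level surprisal,
# computed over hypothetical pattern counts per article section.
import math
from collections import Counter

sections = {
    "introduction": Counter({"in this paper": 8, "the role of": 5, "we show that": 3}),
    "methods":      Counter({"was carried out": 9, "in this paper": 1, "the role of": 2}),
}

def entropy(counts):
    """Shannon entropy (bits) of the pattern distribution within one section."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def surprisal(pattern, counts):
    """Surprisal (bits) of one pattern, given its relative frequency in a section."""
    total = sum(counts.values())
    return -math.log2(counts[pattern] / total)

for name, counts in sections.items():
    print(name,
          "entropy:", round(entropy(counts), 2),
          "surprisal('in this paper'):", round(surprisal("in this paper", counts), 2))
```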

    Topical Diversification Over Time In The Royal Society Corpus


    Generating linguistically relevant metadata for the Royal Society Corpus

    This paper provides an overview of metadata generation and management for the Royal Society Corpus (RSC), aiming to encourage discussion about the specific challenges in building substantial diachronic corpora intended to be used for linguistic and humanistic analysis. We discuss the motivations and goals of building the corpus, describe its composition and present the types of metadata it contains. Specifically, we tackle two challenges: first, the integration of original metadata from the data providers (JSTOR and the Royal Society); second, the derivation of additional linguistically relevant metadata regarding text structure and situational context (register).

    The Making of the Royal Society Corpus

    The Royal Society Corpus is a corpus of Early and Late Modern English, built in an agile process, covering publications of the Royal Society of London from 1665 to 1869 (Kermes et al., 2016) and comprising approximately 30 million words. In this paper we provide details on two aspects of the building process, namely the mining of patterns for OCR correction and the improvement and evaluation of part-of-speech tagging.
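    As an illustration of what pattern-based OCR post-correction can look like, the Python sketch below applies ordered substitution rules such as long-s normalisation and de-hyphenation. The rules and the example sentence are hypothetical; they are not the patterns mined for the corpus.

```python
# Illustrative OCR post-correction via ordered substitution patterns;
# the rules below are invented examples, not the mined patterns from the paper.
import re

RULES = [
    (re.compile(r"ſ"), "s"),                     # long s -> s
    (re.compile(r"\bfome\b"), "some"),           # common f/s confusions
    (re.compile(r"\bfuch\b"), "such"),
    (re.compile(r"(\w+)-\n(\w+)"), r"\1\2"),     # rejoin end-of-line hyphenation
]

def correct(text):
    """Apply each correction pattern in order and return the cleaned text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(correct("It is found that fome bodies, in fuch experi-\nments, expand."))
# -> It is found that some bodies, in such experiments, expand.
```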