GerManC - Towards a Methodology for Constructing and Annotating Historical Corpora

Author: Bennett Paul
Durrell Martin
Ensslin Astrid
Publication venue: Dagstuhl Seminar Proceedings. 06491 - Digital Historical Corpora - Architecture, Annotation, and Retrieval
Publication date: 01/01/2007

Our paper focuses, on the one hand, on the challenges posed by the structural variability, flexibility and ambiguity found in historical corpora and, on the other, evaluates methods of dealing with them. We are currently engaged in a project which aims to compile a representative corpus of German for the period 1650-1800. Looking at exemplary data from the first stage of this project (1650-1700), which consists of newspaper texts from this period, we first aim, from the perspective of corpus linguistics, to identify the problems associated with the morphological, syntactical and graphemic peculiarities that are characteristic of that particular stage. Specific phenomena which significantly complicate automatic tagging, lemmatisation and parsing include, for instance, "abperlende" (Admoni 1980; Demske-Neumann 1990), i.e. complex and often asyndetic syntax; non-syntactic, prosodic, virgulated punctuation (Demske et al. 2004; cf. Stolt 1990); inflectional variability (e.g. Admoni 1990; Besch & Wegera 1987); as well as partly unsystematic and almost experimental allomorphic and allographic diversity (Kettmann 1992). Secondly, we outline a methodology which is intended to facilitate the construction and annotation of such corpora, which antedate linguistic standardisation. This is informed by "conventional" and innovative tagging techniques and tools, which are evaluated in terms of utility and accuracy. Finally, we attempt to evaluate the degree to which annotation tools for specialist corpora of this kind can be developed which will substitute for manual or semi-automated annotation

Dagstuhl Research Online Publication Server

06491 Abstracts Collection -- Digital Historical Corpora - Architecture, Annotation, and Retrieval

Author: Burnard Lou
Dobreva Milena
Fuhr Norbert
Publication venue: Dagstuhl Seminar Proceedings. 06491 - Digital Historical Corpora - Architecture, Annotation, and Retrieval
Publication date: 01/01/2007

From 03.12.06 to 08.12.06, the Dagstuhl Seminar 06491 "Digital Historical Corpora - Architecture, Annotation, and Retrieval" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available

Dagstuhl Research Online Publication Server

Sosiolingvistisen vaihtelun tarkastelu englannin sananmuodostuksessa historiallisen korpustutkimuksen keinoin

Author: Säily Tanja
Publication venue: Société Néophilologique
Publication date: 31/10/2014

This dissertation studies how the productivity of word-formation varies across social groups in the history of the English language. Previous research into variation and change within the morphological productivity of derivational affixes has been hampered by the lack of suitable methods for comparing productivity measures across subcorpora. A further problem has been how to assess the statistical significance of the differences observed. The latter issue is also present in comparisons of word frequencies in diachronic corpus linguistics: previous work has tended to use tests which make the invalid assumption that words occur randomly in texts. Moreover, the question often arises whether the change observed is linguistic, stylistic or an artefact of the corpus. The present work explores sociolinguistic variation and change in the morphological productivity of the nominal suffixes -ness and -ity from Early Modern English to Present-day English, using materials such as the Corpora of Early English Correspondence and the British National Corpus. To do this, it employs robust methods to compare item frequencies over time and across social categories. Developed in collaboration with computer scientists, the methods include non-parametric measures of statistical significance as well as visualisations revealing variability within (sub)corpora and facilitating exploration. In addition to research into individual linguistic features, the methods can be used to compare corpora and study genre continuity at the levels of vocabulary and parts of speech. Besides corpus-linguistic methodology, the work contributes to the theory and description of derivational productivity. Firstly, it shows that each of the social categories studied - gender, social rank, and register in terms of participant relations - may have an influence on productivity, gender being the most consistent factor in the case of -ity. 
Furthermore, it shows that while productivity measures based on the frequency of hapax legomena, or words occurring only once in the corpus, are unusable in small corpora, they do function as expected in large corpora and remain theoretically valid. These findings should be taken into account in future research, and it is to be hoped that future studies will be significantly facilitated by the methodological contributions presented in this dissertation.
Using large electronic text corpora, this dissertation studies how the productivity of word-formation, i.e. the probability of producing new words, varies across social groups in the history of the English language. It examines how much variation and what kinds of change occur in the productivity of the English nominal suffixes -ness and -ity from the 17th century to the present day. The results show that the productivity of the suffixes may be influenced by all of the social categories studied: the gender and social class of language users, and the relations between participants in the communicative situation. For the suffix -ity, borrowed into English from French and Latin, the most significant category is gender, as use of the suffix is male-dominated in every period covered by the materials. This may be explained by gendered writing styles. The productivity of the native suffix -ness varies less. Variation and change in productivity have previously been difficult to study with corpus-linguistic means, because there have been no suitable methods for comparing measurements. Determining the statistical significance of the variation has also been problematic. The same problem has troubled the study of language change more generally. Moreover, it is often unclear whether an observed change relates to change in the language or in writing style, or whether it results from unevenness in the material.
This study uses new methods, developed in collaboration with computer scientists, that enable reliable comparison of the frequency of linguistic features across time periods and social groups. In addition to measures of statistical significance, it presents visualisation methods that make it possible to explore variation within the materials and to find new objects of study. The methods can also be used to compare materials and to study change in genres
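
The hapax-based productivity measure mentioned in this abstract can be illustrated with a minimal sketch. It assumes one common formulation, the number of hapax legomena divided by the total number of tokens carrying the suffix; the abstract does not state which formulation the dissertation uses. The function name `hapax_productivity` and the sample word list are hypothetical and are not taken from the dissertation or its corpora.

```python
from collections import Counter


def hapax_productivity(tokens):
    """Return (type count, hapax count, hapax-to-token ratio).

    `tokens` is a list of word tokens that carry the suffix of interest.
    The ratio is an assumed, simplified productivity measure for
    illustration only.
    """
    counts = Counter(tokens)
    # A hapax legomenon is a word type that occurs exactly once.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    n_tokens = len(tokens)
    ratio = hapaxes / n_tokens if n_tokens else 0.0
    return len(counts), hapaxes, ratio


# Hypothetical toy sample of -ness tokens (not real corpus data).
sample = ["kindness", "darkness", "kindness", "happiness", "goodness", "kindness"]
types, hapaxes, ratio = hapax_productivity(sample)
print(types, hapaxes, ratio)
```

With so few tokens, adding or removing a single word changes the ratio considerably, which is in line with the abstract's remark that hapax-based measures are unusable in small corpora.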

Helsingin yliopiston digitaalinen arkisto

Proceedings of the Conference on Natural Language Processing 2010

Publication venue: Walter de Gruyter GmbH
Publication date: 01/01/2010

This book contains state-of-the-art contributions to the 10th conference on Natural Language Processing, KONVENS 2010 (Konferenz zur Verarbeitung natürlicher Sprache), with a focus on semantic processing. KONVENS in general aims at offering a broad perspective on current research and developments within the interdisciplinary field of natural language processing. The central theme draws specific attention to linguistic aspects of meaning, covering deep as well as shallow approaches to semantic processing. The contributions address both knowledge-based and data-driven methods for modelling and acquiring semantic information, and discuss the role of semantic information in applications of language technology. The articles demonstrate the importance of semantic processing, and present novel and creative approaches to natural language processing in general. Some contributions focus on developing and improving NLP systems for tasks like Named Entity Recognition or Word Sense Disambiguation, on semantic knowledge acquisition and exploitation with respect to collaboratively built resources, or on harvesting semantic information in virtual games. Others are set within the context of real-world applications, such as Authoring Aids, Text Summarisation and Information Retrieval. The collection highlights the importance of semantic processing for different areas and applications in Natural Language Processing, and provides the reader with an overview of current research in this field


Digitised Newspapers – A New Eldorado for Historians?

Publication venue: Walter de Gruyter GmbH

Digitization technologies applied to historical newspapers have changed the research landscape historians were used to. An Eldorado? Despite unquestionable merits, the new digital affordance of historical newspapers also brings drawbacks and possible pitfalls which need to be carefully assessed

OAPEN Library