GerManC - Towards a Methodology for Constructing and Annotating Historical Corpora
Our paper focuses on the challenges posed by the structural variability, flexibility and ambiguity found in historical corpora, and evaluates methods of dealing with them.
We are currently engaged in a project which aims to compile a representative corpus of German for the period 1650-1800. Looking at exemplary data from the first stage of this project (1650-1700), which consists of newspaper texts from this period, we first aim, from the perspective of corpus linguistics, to identify the problems associated with the morphological, syntactic and graphemic peculiarities that are characteristic of that particular stage. Specific phenomena which significantly complicate automatic tagging, lemmatisation and parsing include, for instance, "abperlende" (Admoni 1980; Demske-Neumann 1990), i.e. complex and often asyndetic syntax; non-syntactic, prosodic, virgulated punctuation (Demske et al. 2004; cf. Stolt 1990); inflectional variability (e.g. Admoni 1990; Besch & Wegera 1987); and partly unsystematic and almost experimental allomorphic and allographic diversity (Kettmann 1992).
Secondly, we outline a methodology intended to facilitate the construction and annotation of corpora that antedate linguistic standardisation. This is informed by "conventional" and innovative tagging techniques and tools, which are evaluated in terms of utility and accuracy. Finally, we attempt to evaluate how far annotation tools can be developed for specialist corpora of this kind that will substitute for manual or semi-automated annotation.
06491 Abstracts Collection -- Digital Historical Corpora - Architecture, Annotation, and Retrieval
From 03.12.06 to 08.12.06, the Dagstuhl Seminar 06491 "Digital Historical Corpora - Architecture, Annotation, and Retrieval" was held
in the International Conference and Research Center (IBFI),
Schloss Dagstuhl.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available.
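Sosiolingvistisen vaihtelun tarkastelu englannin sananmuodostuksessa historiallisen korpustutkimuksen keinoin [Examining sociolinguistic variation in English word-formation by means of historical corpus research]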
This dissertation studies how the productivity of word-formation varies across social groups in the history of the English language. Previous research into variation and change within the morphological productivity of derivational affixes has been hampered by the lack of suitable methods for comparing productivity measures across subcorpora. A further problem has been how to assess the statistical significance of the differences observed. The latter issue is also present in comparisons of word frequencies in diachronic corpus linguistics: previous work has tended to use tests which make the invalid assumption that words occur randomly in texts. Moreover, the question often arises whether the change observed is linguistic, stylistic or an artefact of the corpus.
The present work explores sociolinguistic variation and change in the morphological productivity of the nominal suffixes -ness and -ity from Early Modern English to Present-day English, using materials such as the Corpora of Early English Correspondence and the British National Corpus. To do this, it employs robust methods to compare item frequencies over time and across social categories. Developed in collaboration with computer scientists, the methods include non-parametric measures of statistical significance as well as visualisations revealing variability within (sub)corpora and facilitating exploration. In addition to research into individual linguistic features, the methods can be used to compare corpora and study genre continuity at the levels of vocabulary and parts of speech.
Besides corpus-linguistic methodology, the work contributes to the theory and description of derivational productivity. Firstly, it shows that each of the social categories studied - gender, social rank, and register in terms of participant relations - may have an influence on productivity, gender being the most consistent factor in the case of -ity. Furthermore, it shows that while productivity measures based on the frequency of hapax legomena, or words occurring only once in the corpus, are unusable in small corpora, they do function as expected in large corpora and remain theoretically valid. These findings should be taken into account in future research, and it is to be hoped that future studies will be significantly facilitated by the methodological contributions presented in this dissertation.
Using large electronic text corpora, this dissertation studies how the productivity of word-formation, that is, the likelihood of producing new words, varies between social groups in the history of the English language. It examines how much variation and what kinds of change occur in the productivity of the English nominal suffixes -ness and -ity from the 17th century to the present day.
The results show that the productivity of the suffixes can be influenced by all of the social categories studied: the gender and social class of the language users, and the relationships between the participants in the communicative situation. For the suffix -ity, borrowed into English from French and Latin, the most significant category is gender, as use of the suffix is male-dominated in every period covered by the materials. This may be explained by gendered writing styles. Variation in the productivity of the native suffix -ness is smaller.
Variation and change in productivity have previously been difficult to study with corpus-linguistic methods, because no suitable methods existed for comparing measurements. Determining the statistical significance of the variation has also been problematic. The same problem has affected the study of language change more generally. In addition, it is often unclear whether an observed change reflects a change in the language or in writing style, or whether it results from unevenness in the material.
This study uses new methods, developed in collaboration with computer scientists, that enable reliable comparison of the frequency of linguistic features across periods and social groups. In addition to measures of statistical significance, it presents visualisation methods that can be used to examine variation within the materials and to find new objects of study. The methods can also be used to compare materials and to study change in text types.
Proceedings of the Conference on Natural Language Processing 2010
This book contains state-of-the-art contributions to the 10th
conference on Natural Language Processing, KONVENS 2010
(Konferenz zur Verarbeitung natürlicher Sprache), with a focus
on semantic processing.
KONVENS in general aims at offering a broad perspective
on current research and developments within the interdisciplinary
field of natural language processing. The central theme
draws specific attention towards addressing linguistic aspects
of meaning, covering deep as well as shallow approaches to semantic
processing. The contributions address both knowledge-based
and data-driven methods for modelling and acquiring
semantic information, and discuss the role of semantic information
in applications of language technology.
The articles demonstrate the importance of semantic processing,
and present novel and creative approaches to natural
language processing in general. Some contributions put their
focus on developing and improving NLP systems for tasks like
Named Entity Recognition or Word Sense Disambiguation, or
focus on semantic knowledge acquisition and exploitation with
respect to collaboratively built resources, or harvesting semantic
information in virtual games. Others are set within the
context of real-world applications, such as Authoring Aids, Text
Summarisation and Information Retrieval. The collection highlights
the importance of semantic processing for different areas
and applications in Natural Language Processing, and provides
the reader with an overview of current research in this field.
Digitised Newspapers – A New Eldorado for Historians?
Digitization technologies applied to historical newspapers have changed the research landscape historians were used to. An Eldorado? Despite unquestionable merits, the new digital affordance of historical newspapers also brings drawbacks and possible pitfalls which need to be carefully assessed.