3,252 research outputs found

    DARIAH and the Benelux


    Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2). 29 November 2012, Lisbon, Portugal

    Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2), held in Lisbon, Portugal on 29 November 2012

    The TXM Portal Software giving access to Old French Manuscripts Online

    Full text online: http://www.lrec-conf.org/proceedings/lrec2012/workshops/13.ProceedingsCultHeritage.pdf
    This paper presents the new TXM software platform, which gives online access to images of Old French text manuscripts and to tagged transcriptions for concordancing and text mining. The platform is able to import medieval sources encoded in XML according to the TEI Guidelines for linking manuscript images to transcriptions and for encoding several diplomatic levels of transcription, including abbreviations and word-level corrections. It includes a sophisticated tokenizer able to deal with TEI tags at different levels of the linguistic hierarchy. Words are tagged on the fly during the import process using the IMS TreeTagger tool with a specific language model. Synoptic editions displaying manuscript images and text transcriptions side by side are produced automatically during the import process. Texts are organized in a corpus with their own metadata (title, author, date, genre, etc.), and several word-property indexes are produced for the CQP search engine to allow efficient word-pattern searches for building different types of frequency lists or concordances. For syntactically annotated texts, special indexes are produced for the Tiger Search engine to allow efficient building of syntactic concordances. The platform has also been tested on classical Latin, ancient Greek, Old Slavonic and Old Hieroglyphic Egyptian corpora (including various types of encoding and annotations)

    Spoken Document Retrieval in a Highly Inflectional Language

    Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007. Editors: Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit. University of Tartu, Tartu, 2007. ISBN 978-9985-4-0513-0 (online) ISBN 978-9985-4-0514-7 (CD-ROM) pp. 44-50

    Dealing with spelling variation in Early Modern English texts

    Early English Books Online contains facsimiles of virtually every English work printed between 1473 and 1700; some 125,000 publications. In September 2009, the Text Creation Partnership released the second instalment of transcriptions of the EEBO collection, bringing the total number of transcribed works to 25,000. It has been estimated that this transcribed portion contains 1 billion words of running text. With such large datasets and the increasing variety of historical corpora available from the Early Modern English period, the opportunities for historical corpus linguistic research have never been greater. However, it has been observed in prior research, and quantified on a large scale for the first time in this thesis, that texts from this period contain significant amounts of spelling variation until the eventual standardisation of orthography in the 18th century. The problems caused by this historical spelling variation are the focus of this thesis. It will be shown that the high levels of spelling variation found have a significant impact on the accuracy of two widely used automatic corpus linguistic methods: Part-of-Speech annotation and key word analysis. The development of historical spelling normalisation methods which can alleviate these issues will then be presented. The methods are based on techniques used in modern spellchecking, with various analyses of Early Modern English spelling variation dictating how the techniques are applied. With the methods combined into a single procedure, automatic normalisation can be performed on an entire corpus of any size. Evaluation of the normalisation performance shows that after training, 62% of required normalisations are made, with a precision rate of 95%

    New directions in the study of family names

    This paper explores and explains recent radical developments in resources and methodology for studying the origins, cultural associations, and histories of family names (also called ‘surnames’). It summarizes the current state of the art and outlines new resources and procedures that are now becoming available. It shows how such innovations can enable the correction of errors in previous work and improve the accuracy of dictionaries of family names, with a focus on the English-speaking world. Developments such as the digitization of archives are having a profound effect, not only on the interpretation and understanding of traditional, ‘established’ family names and their histories, but also of names in other languages and other cultures. There are literally millions of different family names in the world today, many of which have never been studied at all. What are good criteria for selection of entries in a dictionary of family names, and what can be said about them? What is the nature of the evidence? How stable (or how variable) are family names over time? What are the effects of factors such as migration? What is the relationship between family names and geographical locations, given that people can and do move around? What is the relationship between traditional philological and historical approaches to the subject and statistical analysis of newly available digitized data? The paper aims to contribute to productive discussion of such questions

    Building a Multilingual Taxonomy of Olfactory Terms with Timestamps

    Olfactory references play a crucial role in our memory and, more generally, in our experiences, since researchers have shown that smell is the sense most directly connected with emotions. Nevertheless, only a few works in NLP have tried to capture this sensory dimension from a computational perspective. One of the main challenges is the lack of a systematic and consistent taxonomy of olfactory information in which concepts are also organised from a multilingual perspective. WordNet represents a valuable starting point in this direction, and it can be semi-automatically extended by taking advantage of Google n-grams and of existing language models. In this work we describe the process that has led to the semi-automatic development of a taxonomy for olfactory information in four languages (English, French, German and Italian), detailing the different steps and the intermediate evaluations. Besides being multilingual, the taxonomy also includes temporal marks for olfactory terms, making it a valuable resource for historical content analysis. The resource has been released and is freely available

    Alchemy in the vernacular: an edition and study of early English witnesses of The Mirror of Alchemy

    This study concerns an English-language alchemical work called The Mirror of Alchemy (MoA). I examine manuscript copies of MoA from the 15th to 17th centuries as well as a printed edition from 1597. The main aim of my study is to edit a previously unstudied manuscript version of MoA, making this work accessible for future research and contributing to developing editorial methods for early scientific texts. A central aim is to place MoA in its textual and historical contexts to clarify the edited text to readers. I employ theory and methods from the fields of scholarly editing and textual scholarship, and integrate the discussion of manuscript and printed witnesses. MoA is an English translation of the Latin work Speculum alchemiae. This is a well-known alchemical work, formerly attributed to Roger Bacon (c. 1214–1292?). The material for my study consists of the seven extant manuscript copies of MoA, as well as the 1597 printed edition. There is a previous edition of the 1597 witness, but all the manuscript copies were previously unstudied and unedited. My analysis uncovers the textual relationships between the witnesses as well as examining the witnesses as translations, focusing on the translation of specialised alchemical terminology into English. Based on detailed qualitative textual comparisons, my study shows that the witnesses of MoA can be divided into four textual groups. MoA is an example of the gradual shift from Latin to English, as the four Groups represent different translations of Speculum alchemiae. I examine these Groups both from the point of view of their textual relationships and that of the influence of the (potential) source texts on the translations. My analysis shows that a combination of linguistic strategies was used to translate Speculum alchemiae into English multiple times. The differences in the translations are explained by the translation strategies used and diachronic changes in the language of science. 
A major result of this study is also the best-text edition with its commentary and glossary, as well as transcriptions of the four Groups. The edition’s text also provides some previously unrecorded words and antedatings: these show that editing and studying early alchemical material is also a valuable undertaking from a lexicological perspective

    Visualización del lenguaje a través de corpus

    Digital version of the print publication, published in A Coruña: Universidade da Coruña, Servizo de Publicacións, 2010 (ISBN 978-84-9749-401-4). This book contains the papers presented at the Second International Conference on Corpus Linguistics, held at the University of A Coruña in 2010 and organised by the MuStE group. The essays deal with different aspects of corpus linguistics both as a methodology and as a branch of Linguistics. The collection of essays presented here is a mere sample of the interest that topics relating to Corpus Linguistics have aroused everywhere. Topics as different as those related to Computational Linguistics, found in “Obtaining computational resources for languages with scarce resources from closely related computationally-developed languages. The Galician and Portuguese case”, or “Corpus-Based Modelling of Lexical Changes in Manic Depression Disorders: The Case of Edgar Allan Poe”, belonging to the field of Corpus and Literary Studies, can be found in the ensuing pages. Almost all research areas can nowadays be investigated using Corpus Linguistics as a valid methodology. This is the reason why Language Windowing through Corpora gathers papers dealing with discourse, variation and change, grammatical studies, lexicology and lexicography, corpus design, contrastive analyses, language acquisition and learning, and translation. The title of this work aims to reflect not only the great variety of topics gathered in it but also the worldwide interest awakened by the computer processing of language. In fact, researchers from many different institutions all over the world have contributed to this book.
Apart from authors at twenty-two Spanish universities, people from higher education institutions in other countries have authored and co-authored the essays contained here, namely Russia, Venezuela, Brazil, the UK, Finland, Portugal, Poland, Austria, Mexico, Thailand, Iran, the Netherlands, Belgium, Japan, Turkey, China, Italy, Malaysia, Romania and Sweden. All the essays have been arranged alphabetically by the names of their authors, in two parts. Part 1 contains the papers by authors from A to K, and Part 2 those by authors from L to Z