214 research outputs found

    Comparison of distance measures for historical spelling variants

    This paper compares selected distance measures for their applicability in supporting the retrieval of historical spelling variants (hsv). The interdisciplinary project "Rule-based search in text databases with nonstandard orthography" develops a fuzzy full-text search engine for historical text documents. This engine should provide easier text access for experts as well as interested amateurs. The FlexMetric framework enhances the distance measure algorithm found to be most efficient according to the results of the evaluation. This measure can be used for multiple applications, including searching, post-ranking, transformation and even reflection about one’s own language. IFIP International Conference on Artificial Intelligence in Theory and Practice - Speech and Natural Language. Red de Universidades con Carreras en Informática (RedUNCI)
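As a rough illustration of how a distance measure can support retrieval of spelling variants, the sketch below ranks candidate words by plain Levenshtein (unit-cost edit) distance to a query. This is not the FlexMetric measure, which the abstract does not specify; the function names, the threshold `max_distance`, and the example word forms are all assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) between a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_variants(query, candidates, max_distance=2):
    """Return (word, distance) pairs within max_distance, closest first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    scored = [(word, levenshtein(query, word)) for word in candidates]
    hits = [(word, d) for word, d in scored if d <= max_distance]
    return sorted(hits, key=lambda pair: (pair[1], pair[0]))


if __name__ == "__main__":
    # Hypothetical word list; the variant forms are illustrative only.
    print(rank_variants("teil", ["theil", "teyl", "thail", "haus"]))
```

A measure tuned for historical text would typically weight edits unevenly (for example, making a "th"/"t" alternation cheaper than an arbitrary substitution); the unit costs here are the simplest baseline.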

    Rule-based search in historical text databases - Visualization techniques


    BEA – A multifunctional Hungarian spoken language database

    In diverse areas of linguistics, the demand for studying actual language use is on the increase. The aim of developing a phonetically based multi-purpose database of Hungarian spontaneous speech, dubbed BEA, is to accumulate a large amount of spontaneous speech of various types together with sentence repetition and reading. Presently, the recorded material of BEA amounts to 260 hours produced by 280 present-day Budapest speakers (ages between 20 and 90, 168 females and 112 males), and also provides annotated materials for various types of research and practical applications.

    06491 Abstracts Collection -- Digital Historical Corpora - Architecture, Annotation, and Retrieval

    From 03.12.06 to 08.12.06, the Dagstuhl Seminar 06491 "Digital Historical Corpora - Architecture, Annotation, and Retrieval" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available.

    Computational analysis of medieval manuscripts: a new tool for analysis and mapping of medieval documents to modern orthography

    Medieval manuscripts or other written documents from that period contain valuable information about people, religion, and politics of the medieval period, making the study of medieval documents a necessary prerequisite to gaining in-depth knowledge of medieval history. Although tool-less study of such documents is possible and has been ongoing for centuries, much subtle information remains locked in such manuscripts unless it is revealed by effective means of computational analysis. Automatic analysis of medieval manuscripts is a non-trivial task, mainly due to non-conforming styles, spelling peculiarities, or lack of relational structures (hyperlinks) that could be used to answer meaningful queries. Natural Language Processing (NLP) tools and algorithms are used to carry out computational analysis of text data. However, due to the high percentage of spelling variations in medieval manuscripts, NLP tools and algorithms cannot be applied directly for computational analysis. If the spelling variations are mapped to standard dictionary words, then application of standard NLP tools and algorithms becomes possible. In this paper we describe a web-based software tool, CAMM (Computational Analysis of Medieval Manuscripts), that maps medieval spelling variations to a modern German dictionary. We describe the steps taken to acquire, reformat, and analyze data and to produce putative mappings, as well as the steps taken to evaluate the findings. At the time of writing, CAMM provides access to 11275 manuscripts organized into 54 collections containing a total of 242446 distinctly spelled words. CAMM accurately corrects the spelling of 55% of the verifiable words. Thanks to Georg Vogeler for his valuable suggestions about the algorithms. Thanks also to Jochen Graf and the Monasterium consortium for having given us access to the medieval dataset and for sharing valuable information about the existing EditMOM tools.
Thanks to Athabasca University for providing a server to launch this tool, and thanks to the Web Unit of the Computing Services Department at Athabasca for keeping the link alive. http://www.jucs.org/;internal&action=noaction&Parameter=1208164030958am201
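The idea of mapping a spelling variant to a modern dictionary entry, as the abstract above outlines, can be sketched with Python's standard `difflib` module. This is only an illustration under stated assumptions: the abstract does not say which matching algorithm CAMM uses, and the function name `map_to_modern`, the similarity cutoff, and the tiny word list are hypothetical.

```python
from difflib import get_close_matches


def map_to_modern(variant, dictionary, cutoff=0.6):
    """Return the dictionary word most similar to `variant`, or None.

    Similarity is difflib's SequenceMatcher ratio; candidates scoring
    below `cutoff` are rejected rather than guessed at.
    """
    matches = get_close_matches(variant.lower(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # Hypothetical modern word list and illustrative variant spellings.
    modern = ["und", "haus", "jahr"]
    for word in ["vnd", "jar", "xyz"]:
        print(word, "->", map_to_modern(word, modern))
```

Returning `None` for unmatched forms keeps uncertain mappings out of downstream NLP steps, at the cost of lower coverage; where to set the cutoff is a trade-off that would need evaluation against verified mappings.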

    Grassroots prescriptivism

    Until the beginning of this century, with few notable exceptions, prescriptivism has received little serious attention among the academic linguistic community as a factor in language variation and change. The five studies included in this book are embedded in the growing research initiative that is attempting to paint a fine-grained picture of linguistic prescriptivism in the English language. In contrast to institutional prescriptivism, or the so-called prescriptivism from above, which is enforced by bodies such as language planning boards, governmental committees, and agencies, this book focuses on grassroots prescriptivism – the attempts of lay people to promote the standard language ideology. Grassroots prescriptivism investigates the metalinguistic comments of language users expressed on traditional (letters to newspaper editors and radio phone-ins) and new media platforms (forum and blog discussions). This book demonstrates that, contrary to popular belief, language users are not passive recipients of language rules, but active participants in matters of linguistic prescriptivism. The diachronic exploration of grassroots prescriptivism reveals a complex picture. While in many respects, twenty-first-century prescriptivism represents a continuation of the 250-year-old prescriptive tradition, the author argues that prescriptivism, like language itself, undergoes change over time. Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO). Language Use in Past and Present

    Application of a POS Tagger to a Novel Chronological Division of Early Modern German Text

    This paper describes the application of a part-of-speech tagger to a particular configuration of historical German documents. Most natural language processing (NLP) is done on contemporary documents, and historical documents can present difficulties for these tools. I compared the performance of a single high-quality tagger on two stages of historical German (Early Modern German) materials. I used the TnT (Trigrams 'n' Tags) tagger, a probabilistic tagger developed by Thorsten Brants in a 2000 paper. I applied this tagger to two subcorpora which I derived from the University of Manchester's GerManC corpus, divided by date of creation of the original document, with each one used for both training and testing. I found that the earlier half, from a period with greater variability in the language, was significantly more difficult to tag correctly. The broader tag categories of punctuation and "other" were overrepresented in the errors. Master of Science in Information Science

    A History of the Welsh English Dialect in Fiction

    The systematic study of language varieties in fictional texts has primarily focused upon written material. Recently, linguists have also added audio-visual genres to the analytic framework of literary dialect studies. Studies have traditionally examined writers’ lexical, phonological, and grammatical output; more recently, research has begun adding metalinguistic commentaries and linguistic indexing of character stereotypes to this repertoire (Hodson, 2014). Except for minor analysis of early texts (German, 2009), there has been no large-scale investigation of any Welsh English dialect in fiction. This thesis addresses this gap, asking the fundamental question: throughout history, how has Welsh English been represented in fiction? The thesis surveys a large chronological scope covering material from the 12th century until the present day across four narrative genres: early writings and theatrical writing, novels, films, and, new to literary dialect studies, videogames. In doing so, a historical discussion forms that covers Welsh English’s fictolinguistic output, cross-referencing its linguistic forms with recorded data, identifying forms hitherto unknown to dialectological surveys, and addressing metalinguistic and attitudinal stereotypes in fiction. Key findings include that phonology was an early representational linguistic domain in the literary dialect, whilst lexical and grammatical domains became common from 19th-century literature onwards. The commonest phonological features were glottal fricative drops and tapped /r/; the commonest lexical features were the endearment terms ‘bach/fach’ and ‘mam’. Grammatically, ‘Focus Fronting’ and ‘Demonstrative There’ regularly occurred. Regarding linguistic evidence, several authors and filmmakers were prolific lay surveyors of the variety, adding to the historical dialectological record.
    Concerning dialectal attitudes, Elizabethan playwrights used linguistic stereotyping to create character stereotypes of Welsh people as ‘comical’. By the 19th century, fictive Welsh English representation was the dominion of native users in literature, film, and videogames; however, today the Comic stereotype, and an emerging stereotype of Welsh English users being Fantastical, appear embedded within the dialect’s representation.

    Effects of Orthographic, Phonologic, and Semantic Information Sources on Visual and Auditory Lexical Decision

    The present study was designed to compare lexical decision latencies in visual and auditory modalities to three word types: (a) words that are inconsistent with two information sources, orthography and semantics (i.e., heterographic homophones such as bite/byte), (b) words that are inconsistent with one information source, semantics (i.e., homographic homophones such as bat), and (c) control words that are not inconsistent with any information source. Participants (N = 76) were randomly assigned to either the visual or auditory condition, in which they judged the lexical status (word or nonword) of 180 words (60 heterographic homophones, 60 homographic homophones, and 60 control words) and 180 pronounceable nonsense word foils. Results differed significantly between the visual and auditory modalities. In visual lexical decision, homographic homophones were responded to faster than heterographic homophones or control words, which did not differ significantly from each other. In auditory lexical decision, both homographic homophones and heterographic homophones were responded to faster than control words. Results are used to propose potential modifications to the Cooperative Division of Labor Model of Word Recognition (Harm & Seidenberg, 2004) to enable it to encompass both the visual and auditory modalities and account for the present results.