Comparison of distance measures for historical spelling variants
This paper describes the comparison of selected distance measures with respect to their applicability for supporting retrieval of historical spelling variants (HSV). The interdisciplinary project Rule-based search in text databases with nonstandard orthography develops a fuzzy full-text search engine for historical text documents. This engine should provide easier text access for experts as well as interested amateurs.
The FlexMetric framework enhances the distance measure algorithm found to be most efficient according to the results of the evaluation.
This measure can be used for multiple applications, including searching, post-ranking, transformation, and even reflection on one's own language.
IFIP International Conference on Artificial Intelligence in Theory and Practice - Speech and Natural Language
Red de Universidades con Carreras en Informática (RedUNCI)
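The paper itself evaluates several measures; the usual baseline for such comparisons is plain Levenshtein edit distance. As a minimal sketch (the variant spellings below are illustrative, not taken from the paper's test set), ranking historical variants by their distance to a modern query form looks like:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = cur
    return prev[-1]

# Illustrative historical variants of modern German "und" ("and"),
# ranked by closeness to the modern query form.
variants = ["vnnd", "unde", "vnde", "und"]
ranked = sorted(variants, key=lambda v: levenshtein("und", v))
```

Fuzzy retrieval of spelling variants then amounts to returning all dictionary or index entries whose distance to the query falls below a threshold; weighted variants of this measure (as in the FlexMetric framework mentioned above) assign non-unit costs to linguistically likely substitutions such as v/u.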
BEA - A multifunctional Hungarian spoken language database
In diverse areas of linguistics, the demand for studying actual language use is on
the increase. The aim of developing a phonetically based multi-purpose database of
Hungarian spontaneous speech, dubbed BEA, is to accumulate a large amount of
spontaneous speech of various types together with sentence repetition and reading.
Presently, the recorded material of BEA amounts to 260 hours produced by 280
present-day Budapest speakers (aged between 20 and 90; 168 females and 112
males), also providing annotated material for various types of research and practical
applications.
06491 Abstracts Collection -- Digital Historical Corpora - Architecture, Annotation, and Retrieval
From 03.12.06 to 08.12.06, the Dagstuhl Seminar 06491 "Digital Historical Corpora - Architecture, Annotation, and Retrieval" was held
in the International Conference and Research Center (IBFI),
Schloss Dagstuhl.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available.
Computational analysis of medieval manuscripts: a new tool for analysis and mapping of medieval documents to modern orthography
Medieval manuscripts or other written documents from that period contain
valuable information about people, religion, and politics of the medieval period, making
the study of medieval documents a necessary pre-requisite to gaining in-depth knowledge
of medieval history. Although tool-less study of such documents is possible and has
been ongoing for centuries, much subtle information remains locked in such manuscripts
unless it gets revealed by effective means of computational analysis. Automatic analysis
of medieval manuscripts is a non-trivial task mainly due to non-conforming styles,
spelling peculiarities, or lack of relational structures (hyper-links), which could be used
to answer meaningful queries. Natural Language Processing (NLP) tools and algorithms
are used to carry out computational analysis of text data. However, due to the high
percentage of spelling variations in medieval manuscripts, NLP tools and algorithms
cannot be applied directly for computational analysis. If the spelling variations are
mapped to standard dictionary words, then application of standard NLP tools and algorithms
becomes possible. In this paper we describe a web-based software tool CAMM
(Computational Analysis of Medieval Manuscripts) that maps medieval spelling variations
to a modern German dictionary. Here we describe the steps taken to acquire,
reformat, and analyze data, produce putative mappings as well as the steps taken to
evaluate the findings. At the time of the writing of this paper, CAMM provides access
to 11275 manuscripts organized into 54 collections containing a total of 242446
distinctly spelled words. CAMM accurately corrects the spelling of 55% of the verifiable
words. Thanks to Georg Vogeler for his valuable suggestions about the algorithms.
Thanks also to Jochen Graf and the Monasterium consortium for having given
us access to the medieval dataset and for sharing valuable information about the
existing EditMOM tools. Thanks to the Athabasca University, for providing a
server to launch this tool, and thanks to the Web Unit of the Computing Services
Department at Athabasca for keeping the link alive.
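CAMM's own mapping pipeline is not reproduced here, but the core idea of mapping a medieval spelling variant to its closest modern dictionary form can be sketched with similarity matching from Python's standard library. The five-word dictionary below is a hypothetical stand-in for the full modern German dictionary the real tool uses:

```python
import difflib

# Hypothetical five-word stand-in for the modern German dictionary
# that the real CAMM tool maps against.
modern_dict = ["und", "nicht", "Jahr", "geben", "Stadt"]

def map_variant(variant: str, dictionary, cutoff: float = 0.6):
    """Return the closest modern dictionary word by difflib's
    similarity ratio, or None if nothing clears the cutoff."""
    matches = difflib.get_close_matches(variant, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(map_variant("vnd", modern_dict))  # medieval spelling of "und" -> prints "und"
```

The cutoff controls the precision/recall trade-off of the putative mappings: raising it leaves more variants unmapped (returned as None) but makes the accepted mappings more reliable, which is the kind of balance the 55% verified-accuracy figure above reflects.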
Grassroots prescriptivism
Until the beginning of this century, with few notable
exceptions, prescriptivism has received little serious attention among the
academic linguistic community as a factor in language variation and change.
The five studies included in this book are embedded in the growing research
initiative that is attempting to paint a fine-grained picture of linguistic
prescriptivism in the English language. In contrast to institutional
prescriptivism, or the so-called prescriptivism from above, which is enforced
by bodies such as language planning boards, governmental committees, and
agencies, this book focuses on grassroots prescriptivism - the attempts of
lay people to promote the standard language ideology.
Grassroots prescriptivism investigates the metalinguistic comments of
language users expressed on traditional (letters to newspaper editors and
radio phone-ins) and new media platforms (forum and blog discussions). This
book demonstrates that, contrary to popular belief, language users are not
passive recipients of language rules, but active participants in matters of
linguistic prescriptivism. The diachronic exploration of grassroots
prescriptivism reveals a complex picture. While in many respects, twenty-first-century
prescriptivism represents a continuation of the 250-year-old prescriptive
tradition, the author argues that prescriptivism, like language itself,
undergoes change over time.
Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO). Language Use in Past and Present
Application of a POS Tagger to a Novel Chronological Division of Early Modern German Text
This paper describes the application of a part-of-speech tagger to a particular configuration of historical German documents. Most natural language processing (NLP) is done on contemporary documents, and historical documents can present difficulties for these tools. I compared the performance of a single high-quality tagger on two stages of historical German (Early Modern German) materials. I used the TnT (Trigrams 'n' Tags) tagger, a probabilistic tagger developed by Thorsten Brants in a 2000 paper. I applied this tagger to two subcorpora which I derived from the University of Manchester's GerManC corpus, divided by date of creation of the original document, with each one used for both training and testing. I found that the earlier half, from a period with greater variability in the language, was significantly more difficult to tag correctly. The broader tag categories of punctuation and "other" were overrepresented in the errors.
Master of Science in Information Science
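The TnT tagger itself (a trigram hidden-Markov-model tagger) is not shown here; as a hedged sketch of the train-on-one-subcorpus, evaluate-on-held-out-text workflow the paper describes, a most-frequent-tag baseline makes the mechanics concrete. The toy tagged sentences and STTS-style tags below are illustrative stand-ins for the GerManC subcorpora:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sents):
    """Most-frequent-tag baseline: remember, for each word form,
    its most common tag in the training sentences."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def accuracy(model, tagged_sents, default="NN"):
    """Token-level tagging accuracy, backing off to a default tag
    for words unseen in training."""
    correct = total = 0
    for sent in tagged_sents:
        for word, tag in sent:
            correct += model.get(word, default) == tag
            total += 1
    return correct / total

# Toy tagged sentences standing in for the earlier and later subcorpora.
train = [[("der", "ART"), ("Mann", "NN"), ("liest", "VVFIN")]]
test = [[("der", "ART"), ("Mann", "NN"), ("schreibt", "VVFIN")]]
model = train_unigram_tagger(train)
print(f"accuracy: {accuracy(model, test):.2f}")
```

Greater spelling variability in the earlier subcorpus inflates the share of unseen word forms at test time, which is exactly where any lexicon-based tagger (this baseline or TnT's suffix-based unknown-word model) loses accuracy.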
A History of the Welsh English Dialect in Fiction
The systematic study of language varieties in fictional texts has primarily focused upon written material. Recently, linguists have also added audio-visual genres to the analytic framework of literary dialect studies. Studies have traditionally examined writers' lexical, phonological, and grammatical output; contemporarily, research has begun adding metalinguistic commentaries and linguistic indexing of character stereotypes to this repertoire (Hodson, 2014). Except for minor analysis of early texts (German, 2009), there has been no large-scale investigation of any Welsh English dialect in fiction. This thesis addresses this gap, asking the fundamental question: throughout history, how has Welsh English been represented in fiction? The thesis surveys a large chronological scope covering material from the 12th century until the present day across four narrative genres: early and theatrical writings, novels, films, and, new to literary dialect studies, videogames. In doing so, a historical discussion forms that covers Welsh English's fictolinguistic output, cross-referencing its linguistic forms with recorded data, identifying forms hitherto unknown to dialectological surveys, and addressing metalinguistic and attitudinal stereotypes in fiction. Key findings include that phonology was an early representational linguistic domain in the literary dialect, whilst lexical and grammatical domains became common from 19th-century literature onwards. The commonest phonological and lexical features were glottal fricative drops and tapped /r/, and the endearment terms "bach/fach" and "mam" respectively. Grammatically, "Focus Fronting" and "Demonstrative There" regularly occurred. Regarding linguistic evidence, several authors and filmmakers were prolific lay surveyors of the variety, adding to the historical dialectological record.
Concerning dialectal attitudes, Elizabethan playwrights used linguistic stereotyping to create character stereotypes of Welsh people as "comical". By the 19th century, fictive Welsh English representation was the dominion of native users in literature, film, and videogames; however, today the Comic stereotype, and an emerging stereotype of Welsh English users being Fantastical, appears embedded within the dialect's representation.
Effects of Orthographic, Phonologic, and Semantic Information Sources on Visual and Auditory Lexical Decision
The present study was designed to compare lexical decision latencies in visual and auditory modalities to three word types: (a) words that are inconsistent with two information sources, orthography and semantics (i.e., heterographic homophones such as bite/byte), (b) words that are inconsistent with one information source, semantics (i.e., homographic homophones such as bat), and (c) control words that are not inconsistent with any information source. Participants (N = 76) were randomly assigned to either the visual or auditory condition in which they judged the lexical status (word or nonword) of 180 words (60 heterographic homophones, 60 homographic homophones, and 60 control words) and 180 pronounceable nonsense word foils. Results differed significantly in the visual and auditory modalities. In visual lexical decision, homographic homophones were responded to faster than heterographic homophones or control words, which did not differ significantly. In auditory lexical decision, both homographic homophones and heterographic homophones were responded to faster than control words. Results are used to propose potential modifications to the Cooperative Division of Labor Model of Word Recognition (Harm & Seidenberg, 2004) to enable it to encompass both the visual and auditory modalities and account for the present results.