9,762 research outputs found

    An automatic part-of-speech tagger for Middle Low German

    Syntactically annotated corpora are highly important for enabling large-scale diachronic and diatopic language research. Such corpora have recently been developed for a variety of historical languages, or are still under development. One of those under development is the fully tagged and parsed Corpus of Historical Low German (CHLG), which is aimed at facilitating research into the highly under-researched diachronic syntax of Low German. The present paper reports on a crucial step in creating the corpus, viz. the creation of a part-of-speech tagger for Middle Low German (MLG). Having been transmitted in several non-standardised written varieties, MLG poses a challenge to standard POS taggers, which usually rely on normalized spelling. We outline the major issues faced in the creation of the tagger and present our solutions to them.

    Usage-based and emergentist approaches to language acquisition

    It was long considered impossible to learn grammar from linguistic experience alone. In the past decade, however, advances in usage-based linguistic theory, computational linguistics, and developmental psychology have changed the view on this matter. So-called usage-based and emergentist approaches to language acquisition state that language can be learned from language use itself, by means of social skills like joint attention, and by means of powerful generalization mechanisms. This paper first summarizes the assumptions regarding the nature of linguistic representations and processing. Usage-based theories are nonmodular and nonreductionist, i.e., they emphasize form-function relationships and deal with all of language, not just selected levels of representation. Furthermore, storage and processing are considered to be analytic as well as holistic, such that there is a continuum between children's unanalyzed chunks and the abstract units found in adult language. In the second part, the empirical evidence is reviewed. Children's linguistic competence is shown to be limited initially, and it is demonstrated how children can generalize knowledge based on direct and indirect positive evidence. It is argued that with these general learning mechanisms, the usage-based paradigm can be extended to multilingual language situations and to language acquisition under special circumstances.

    DARIAH and the Benelux

    Improving the translation environment for professional translators

    When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological side. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.

    Dating Texts without Explicit Temporal Cues

    This paper tackles temporal resolution of documents, such as determining the period a document is about or when it was written, based only on its text. We apply techniques from information retrieval that predict dates via language models over a discretized timeline. Unlike most previous work, we rely solely on temporal cues implicit in the text. We consider both document-likelihood and divergence-based techniques, and several smoothing methods for both. Our best model predicts the mid-point of individuals' lives with a median error of 22 years and a mean error of 36 years for Wikipedia biographies from 3800 B.C. to the present day. We also show that this approach works well when training on such biographies and predicting dates both for non-biographical Wikipedia pages about specific years (500 B.C. to 2010 A.D.) and for publication dates of short stories (1798 to 2008). Together, our work shows that, even in the absence of temporal extraction resources, it is possible to achieve remarkable temporal locality across a diverse set of texts.

    Unsupervised Discovery of Phonological Categories through Supervised Learning of Morphological Rules

    We describe a case study in the application of symbolic machine learning techniques for the discovery of linguistic rules and categories. A supervised rule induction algorithm is used to learn to predict the correct diminutive suffix given the phonological representation of Dutch nouns. The system produces rules which are comparable to rules proposed by linguists. Furthermore, in the process of learning this morphological task, the phonemes used are grouped into phonologically relevant categories. We discuss the relevance of our method for linguistics and language technology.

    Presenting GECO: an eyetracking corpus of monolingual and bilingual sentence reading

    This paper introduces GECO, the Ghent Eye-tracking Corpus, a monolingual and bilingual corpus of eye-tracking data of participants reading a complete novel. English monolinguals and Dutch-English bilinguals read an entire novel, which was presented in paragraphs on the screen. The bilinguals read half of the novel in their first language, and the other half in their second language. In this paper we describe the distributions and descriptive statistics of the most important reading time measures for the two groups of participants. This large eye-tracking corpus is well suited both for exploratory purposes and for more directed hypothesis testing, and it can guide the formulation of ideas and theories about naturalistic reading processes in a meaningful context. Most importantly, this corpus has the potential to evaluate the generalizability of monolingual and bilingual language theories and models to reading of long texts and narratives.