3,766 research outputs found

    Introducing a corpus of conversational stories. Construction and annotation of the Narrative Corpus

    Get PDF
    Although widely seen as critical both in terms of its frequency and its social significance as a prime means of encoding and perpetuating moral stance and configuring self and identity, conversational narrative has received little attention in corpus linguistics. In this paper we describe the construction and annotation of a corpus that is intended to advance the linguistic theory of this fundamental mode of everyday social interaction: the Narrative Corpus (NC). The NC contains narratives extracted from the demographically-sampled sub-corpus of the British National Corpus (BNC) (XML version). It includes more than 500 narratives, socially balanced in terms of participant sex, age, and social class. We describe the extraction techniques, selection criteria, and sampling methods used in constructing the NC. Further, we describe four levels of annotation implemented in the corpus: speaker (social information on speakers), text (text Ids, title, type of story, type of embedding etc.), textual components (pre-/post-narrative talk, narrative, and narrative-initial/final utterances), and utterance (participation roles, quotatives and reporting modes). A brief rationale is given for each level of annotation, and possible avenues of research facilitated by the annotation are sketched out

    Modelling the flow of discourse in a corpus of written academic English

    Get PDF
    Discourse studies attempt to describe how context affects text, and how text progresses from one sentence to the next. Systemic Functional Linguistics (SFL) offers a model of language to describe how information flow varies according to context and co-text through the Textual metafunction, especially using the functions of Participant Identification and Tracking, Theme and Information Structure. These systems were evaluated by assembling a corpus of academic texts and assessing their information flow. Results of the analysis of the three grammatical systems in the Textual Metafunction demonstrate significant patterns, or unmarked choices, where the participant, thematic and information systems combine to powerful effect. Where the systems are not aligned, there is a recognisable effect on the flow of information

    The BURCHAK corpus: a Challenge Data Set for Interactive Learning of Visually Grounded Word Meanings

    Full text link
    We motivate and describe a new freely available human-human dialogue dataset for interactive learning of visually grounded word meanings through ostensive definition by a tutor to a learner. The data has been collected using a novel, character-by-character variant of the DiET chat tool (Healey et al., 2003; Mills and Healey, submitted) with a novel task, where a Learner needs to learn invented visual attribute words (such as " burchak " for square) from a tutor. As such, the text-based interactions closely resemble face-to-face conversation and thus contain many of the linguistic phenomena encountered in natural, spontaneous dialogue. These include self-and other-correction, mid-sentence continuations, interruptions, overlaps, fillers, and hedges. We also present a generic n-gram framework for building user (i.e. tutor) simulations from this type of incremental data, which is freely available to researchers. We show that the simulations produce outputs that are similar to the original data (e.g. 78% turn match similarity). Finally, we train and evaluate a Reinforcement Learning dialogue control agent for learning visually grounded word meanings, trained from the BURCHAK corpus. The learned policy shows comparable performance to a rule-based system built previously.Comment: 10 pages, THE 6TH WORKSHOP ON VISION AND LANGUAGE (VL'17

    Documenting Language Use: Some Theoretical and Technical Challenges in Documentary Linguistics

    No full text

    Turn-timing in Norwegian Sign Language: A study of transition durations

    Get PDF
    Turn-taking requires collaboration between interlocutors, and previous research has found that there is a desire for minimal overlap and gap in informal conversations. Because there is limited research on this topic in signed languages, this thesis investigated turn transition durations in question-answer sequences in informal, Norwegian Sign Language (NTS) conversations. By analyzing a selection of files from two data sets within the Norwegian Sign Language Corpus, the aim was to find out whether mean transition durations in NTS are in the range observed for other spoken and signed languages, if transition durations are variable between individuals, and if age or question type affect mean transition durations. As this study relied on previously collected data the participants were not recruited for this research specifically. The transition durations of 159 question-answer sequences were measured in terms of stroke-to-stroke turn boundaries and yielded 100 gaps and 59 overlaps. The results were in line with previous research on turn-timing, measuring a mean turn transition duration within 250 ms of the cross-linguistic average observed for spoken languages, supporting the theory that turn-timing varies very little across languages, no matter the modality. Some individual differences could be observed, but no significant difference was found between question types, nor between age-ranges, due to few examples in each category.Lingvistikk mastergradsoppgaveLING350MAHF-LIN

    Are We There Yet?: The Development of a Corpus Annotated for Social Acts in Multilingual Online Discourse

    Get PDF
    We present the AAWD and AACD corpora, a collection of discussions drawn from Wikipedia talk pages and small group IRC discussions in English, Russian and Mandarin. Our datasets are annotated with labels capturing two kinds of social acts: alignment moves and authority claims. We describe these social acts, describe our annotation process, highlight challenges we encountered and strategies we employed during annotation, and present some analyses of resulting data set which illustrate the utility of our corpus and identify interactions among social acts and between participant status and social acts and in online discourse

    The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres

    Get PDF
    Final version to Special Issue of JLCL (Journal of Language Technology and Computational Linguistics (JLCL, http://jlcl.org/): BUILDING AND ANNOTATING CORPORA OF COMPUTER-MEDIATED DISCOURSE: Issues and Challenges at the Interface of Corpus and Computational Linguistics (ed. by Michael Beißwenger, Nelleke Oostdijk, Angelika Storrer & Henk van den Heuvel)International audienceThe CoMeRe project aims to build a kernel corpus of different Computer-Mediated Com-munication (CMC) genres with interactions in French as the main language, by assembling interactions stemming from networks such as the Internet or telecommunication, as well as mono and multimodal, synchronous and asynchronous communications. Corpora are assem-bled using a standard, thanks to the TEI (Text Encoding Initiative) format. This implies extending, through a European endeavor, the TEI model of text, in order to encompass the richest and the more complex CMC genres. This paper presents the Interaction Space model. We explain how this model has been encoded within the TEI corpus header and body. The model is then instantiated through the first four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat, or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motiva-tions for applying an automatic annotation process to the CoMeRe corpora. Our wish to guarantee generic annotations meant we did not consider any processing beyond morphosyn-tactic labelling, but prioritized the automatic annotation of any freely variant elements within the corpora. We then turn to decisions made concerning which annotations to make for which units and describe the processing pipeline for adding these. All CoMeRe corpora are verified, thanks to a staged quality control process, designed to allow corpora to move from one project phase to the next. Public release of the CoMeRe corpora is a short-term goal: corpora will be integrated into the forthcoming French National Reference Corpus, and disseminated through the national linguistic infrastructure ORTOLANG. We, therefore, highlight issues and decisions made concerning the OpenData perspective
    corecore