
Creating A Disability Corpus for Literary Analysis: Pilot Classification Experiments

Abstract

As literary text opens to researchers for distant reading (the computational analysis of large corpora of text for literary scholarship), problems beyond typical data science roadblocks, such as data scale and the statistical significance of findings, have emerged. For scholars studying character and social representation in literature, identifying characters within the given classes of study is crucial, painstaking, and often manual. For characters with disabilities, however, manual identification is prohibitively difficult to undertake at scale, and it is especially challenging given the coded textual markers that can be used to refer to disability. No corpus of fictional characters with disabilities currently exists; assembling one is the first step toward at-scale computational study of this topic. This project seeks to pilot a classification process using manually assigned ground truth on a subset of volumes from the HathiTrust. Having successfully built and evaluated a Naïve Bayes classifier, we suggest full-scale deployment of a statistical classifier on a large corpus of literature in order to assemble a disability corpus. This project also covers preliminary exploratory textual analysis of characters with disabilities to yield potential research questions for further exploration.
