7 research outputs found
Piloting a workflow for extracting author citations from Samuel Johnson's Dictionary of the English Language
Since the 18th century, English-language dictionaries have used quotations from written works to illustrate a word's use in context. These quotations form a link between language authority and literary authority. In this paper we pilot a workflow for identifying, extracting, and counting author citations in Samuel Johnson's Dictionary of the English Language to investigate how authors in a defined corpus are represented. We consider how these authors are distributed across the text and compare our results to past studies that used different methodologies. We find a consistency that encourages the broader application of our workflow on other dictionary texts, enabling further study of author citations in dictionaries across time
Creating A Disability Corpus for Literary Analysis: Pilot Classification Experiments
As literary text opens to researchers for distant reading, the computational analysis of large corpora of text for literary scholarship, problems beyond typical data science roadblocks, such as data scale and statistical significance of findings, have emerged. For scholars studying character and social representation in literature, identifying the characters that belong to the classes under study is crucial, painstaking, and often manual. For characters with disabilities, however, manual identification is prohibitively difficult to undertake at scale, and it is especially challenging given the coded textual markers that can be used to refer to disability. No corpus of fictional characters with disabilities currently exists, and building one is the first step toward at-scale computational study of this topic. This project seeks to pilot a classification process using manually assigned ground truth on a subset of volumes from the HathiTrust. Having successfully built and evaluated a Naïve Bayes classifier, we suggest full-scale deployment of a statistical classifier on a large corpus of literature in order to assemble a disability corpus. This project also covers preliminary exploratory textual analysis of characters with disabilities to yield potential research questions for further exploration
Evaluating a Machine Learning Approach to Identifying Expressive Content at Page Level in HathiTrust
HathiTrust currently provides metadata, scanned images, and full text for all public domain volumes. However, it’s likely there is content that is of interest to scholars and free from restriction within the front matter of most volumes, regardless of rights status. For example, the title page or table of contents may contain information that is likely non-expressive and useful to understanding the content’s structure and subject matter. It’s also likely that some volumes include materials that have expressive/creative content in the first 20 pages, so front matter cannot be made open for all volumes without understanding the most frequent type of content within the first 20 pages. This task is time-prohibitive for entirely manual exploration, so we seek to evaluate a machine learning approach for this task
Extending the Utility of the HTRC Extracted Features Dataset Through Linked Data
Poster accompanying previously submitted poster abstract
Exploring the Benefits for Users of Linked Open Data for Digitized Special Collections, White paper #2: Analysis of Early User Feedback
This paper reports on a research study conducted to evaluate experimental, LOD-based features of digital special collections, which investigated the question: how do these features affect the use of digital collections for research? Because humanities researchers are the primary user group for cultural collections, this study focused on what humanities researchers might gain from LOD-based enhancements to digital collections. Andrew W. Mellon Foundation Award No. 31500650
Bridging the information gap between structural and note-level musical datasets
While there is an increasing number of datasets containing various features of musical information, the lack of connections between them remains a barrier to their use in research. For example, one dataset might encode the identification of structural segments by musicologists in audio recordings, while another dataset could contain a symbolic encoding of the music notation being played in that audio recording. Without explicit connections, significant extra work is needed to realize their potential for musicological study. In this paper we investigate how Linked Data can be used to implement such connections, specifically between the McGill Billboard corpus of structural annotations and the MIDI Linked Data Cloud (MIDI-LD). First, we republish structural information from Billboard as RDF. We then align this structural data with matching symbolic encodings in MIDI-LD, before finally linking individual structural annotations from Billboard to note-level sections in MIDI-LD. Our alignments enable cross-referencing and combined queries for musicological analysis across the enriched union dataset, and serve as a model for the creation of information resources comprising musical structures at varying granularity