Evaluating a Machine Learning Approach to Identifying Expressive Content at Page Level in HathiTrust

Kristina Hall; Nikolaus Parulian; Ryan Dubnicek; Stephen Downie; Yuerong Hu

Publication date: 1 January 2020
Publisher: Modern Language Association

Abstract

HathiTrust currently provides metadata, scanned images, and full text for all public domain volumes. However, the front matter of most volumes, regardless of rights status, likely contains content that is of interest to scholars and free from restriction. For example, the title page or table of contents may contain information that is likely non-expressive and useful for understanding the content’s structure and subject matter. It is also likely that some volumes include expressive or creative content in the first 20 pages, so front matter cannot be made open for all volumes without first understanding which types of content occur most frequently within those pages. Exploring this entirely by hand would be prohibitively time-consuming, so we evaluate a machine learning approach to the task.


Available Versions

Humanities Commons

oai:hcommons.org/hc:31605

Last time updated on 11/10/2022