Evaluating a Machine Learning Approach to Identifying Expressive Content at Page Level in HathiTrust

Abstract

HathiTrust currently provides metadata, scanned images, and full text for all public domain volumes. However, it’s likely there is content that is of interest to scholars and free from restriction within the front matter of most volumes, regardless of rights status. For example, the title page or table of contents may contain information that is likely non-expressive and useful to understanding the content’s structure and subject matter. It’s also likely that some volumes include materials that have expressive/creative content in the first 20 pages, so front matter cannot be made open for all volumes without understanding the most frequent type of content within the first 20 pages. This task is time-prohibitive for entirely manual exploration, so we seek to evaluate a machine learning approach for this task

    Similar works