17 research outputs found

    Providing pin-point page-level precision to 1 trillion tokens of text for workset creation

    Get PDF
    We report on work undertaken to develop a web environment that allows users to search over 1 trillion tokens of text -- down to the page level -- of the HathiTrust Part-of-Speech Extracted Features Dataset to help produce worksets for scholarly analysis. We present an extended example of the web environment in use, along with details about its implementation.
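    The core data structure such a page-level search environment requires can be sketched as a toy inverted index mapping tokens to (volume, page) pairs. This is an illustrative sketch only -- the class and identifiers below are invented for the example and are not the actual implementation described in the abstract.

```python
from collections import defaultdict

# Toy page-level inverted index: token -> set of (volume_id, page_number).
# Illustrative only; not the web environment's actual implementation.
class PageIndex:
    def __init__(self):
        self.index = defaultdict(set)

    def add_page(self, volume_id, page_number, token_counts):
        """Index every token that appears on a page."""
        for token in token_counts:
            self.index[token].add((volume_id, page_number))

    def search(self, token):
        """Return pages containing the token, sorted for stable output."""
        return sorted(self.index.get(token, set()))

idx = PageIndex()
idx.add_page("mdp.39015012345678", 12, {"whale": 3, "sea": 7})
idx.add_page("mdp.39015012345678", 13, {"sea": 2})
idx.add_page("uc1.b000987654", 4, {"whale": 1})
print(idx.search("whale"))  # pages from two different volumes
```

    At trillion-token scale the real system would need sharding and compression, but the lookup pattern -- token in, page hits out -- is the same.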

    Access to billions of pages for large-scale text analysis

    Get PDF
    Consortial collections have led to unprecedented scales of digitized corpora, but the insights they enable are hampered by the complexities of access, particularly to in-copyright or orphan works. Pursuing a principle of non-consumptive access, we developed the Extracted Features (EF) dataset, a dataset of quantitative counts for every page of nearly 5 million scanned books. The EF dataset includes unigram counts, part-of-speech tagging, header and footer extraction, counts of characters at both sides of the page, and more. Distributing book data with features already extracted saves the resource costs associated with large-scale text use, improves the reproducibility of research done on the dataset, and opens the door to datasets on copyrighted books. We describe the coverage of the dataset and demonstrate its useful application through aligning duplicate books and identifying their cleanest scans, topic modeling, word-list expansion, and multifaceted visualization.
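    The per-page feature records described above can be previewed with a simplified sketch. The JSON below only loosely imitates the published EF schema (field names such as tokenPosCount are based on the dataset's documentation but should be treated as illustrative here), and the helper function collapses POS-tagged counts back into plain unigram counts.

```python
import json

# A simplified record imitating the EF dataset's per-page structure
# (field names follow the published schema loosely; treat as illustrative).
record = json.loads("""
{
  "pages": [
    {"seq": 1,
     "body": {"tokenPosCount": {"ship": {"NN": 4}, "sailed": {"VBD": 1}}},
     "beginLineChars": {"S": 2}, "endLineChars": {".": 3}},
    {"seq": 2,
     "body": {"tokenPosCount": {"ship": {"NN": 1}, "harbor": {"NN": 2}}}}
  ]
}
""")

def page_unigrams(page):
    """Collapse POS-tagged counts into plain unigram counts for one page."""
    counts = {}
    for token, pos_counts in page["body"]["tokenPosCount"].items():
        counts[token] = sum(pos_counts.values())
    return counts

for page in record["pages"]:
    print(page["seq"], page_unigrams(page))
```

    Because only counts like these are distributed -- never the page text itself -- the dataset remains non-consumptive even for in-copyright volumes.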

    Mapping Genre at the Page Level in English-Language Volumes from HathiTrust, 1700-1899

    Get PDF
    Using regularized logistic regression and hidden Markov models, we predict genre at the page level in a collection of 469,000 volumes from the HathiTrust Digital Library. Accuracy is comparable to human crowdsourcing.
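    The role of the hidden Markov model in this pipeline can be illustrated with a minimal Viterbi-smoothing sketch: a per-page classifier (such as logistic regression) emits genre probabilities, and an HMM whose transitions favor staying in the same genre smooths away single-page blips. The numbers, genre labels, and transition probability below are made up for the example, not the authors' actual model.

```python
import math

def viterbi_smooth(page_probs, stay=0.9):
    """Smooth per-page genre probabilities with an HMM whose transition
    matrix favors staying in the same genre (probability `stay`).
    page_probs: list of dicts mapping genre -> P(genre | page features),
    as a page-level classifier might emit."""
    genres = list(page_probs[0])
    switch = (1 - stay) / (len(genres) - 1)
    # log-probability of the best path ending in each genre
    best = {g: math.log(page_probs[0][g]) for g in genres}
    back = []
    for probs in page_probs[1:]:
        new_best, pointers = {}, {}
        for g in genres:
            prev, score = max(
                ((p, best[p] + math.log(stay if p == g else switch)) for p in genres),
                key=lambda t: t[1])
            new_best[g] = score + math.log(probs[g])
            pointers[g] = prev
        best, back = new_best, back + [pointers]
    # trace the most likely genre sequence back through the pointers
    path = [max(best, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# A noisy single-page "poetry" blip inside a run of fiction is smoothed away.
probs = [{"fiction": 0.9, "poetry": 0.1},
         {"fiction": 0.4, "poetry": 0.6},
         {"fiction": 0.9, "poetry": 0.1}]
print(viterbi_smooth(probs))  # ['fiction', 'fiction', 'fiction']
```

    The intuition is that genre rarely changes from one page to the next, so sequence-level smoothing corrects isolated classifier errors.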

    Extending the Utility of the HTRC Extracted Features Dataset Through Linked Data

    Get PDF
    Poster accompanying previously submitted poster abstract

    Draft genome sequence of the Tibetan antelope

    Get PDF
    The Tibetan antelope (Pantholops hodgsonii) is endemic to the extremely inhospitable high-altitude environment of the Qinghai-Tibetan Plateau, a region that has a low partial pressure of oxygen and high ultraviolet radiation. Here we generate a draft genome of this artiodactyl and use it to detect the potential genetic bases of highland adaptation. Compared with other plain-dwelling mammals, the genome of the Tibetan antelope shows signals of adaptive evolution and gene-family expansion in genes associated with energy metabolism and oxygen transmission. Both the highland American pika and the Tibetan antelope have signals of positive selection for genes involved in DNA repair and the production of ATPase. Genes associated with hypoxia seem to have experienced convergent evolution. Thus, our study suggests that common genetic mechanisms might have been utilized to enable high-altitude adaptation.

    The Beach System: Building a PC from Many Tiny Computers - A First Step at Virtualization -

    Get PDF
    The emergence of tiny computers, such as smart dust, Berkeley motes and Intel motes, makes it feasible to envision the conversion of a network of tiny computers into a regular computing device (i.e., a "PC" or personal computer). While the falling cost and increasing (yet tiny) computation power of these miniature computers portend well for this vision, there are significant technical hurdles. In this paper, we take a first step at building "PCs" out of such tiny computer networks, in order to run regular PC applications. Our system, called Beach, virtualizes the memory accessed by an application at a single sensor mote (a type of tiny computer), thus enabling this memory to be distributed over multiple such motes. By using distributed page tables and caching, we transform the puny memory at each mote (a few KB) into several KBs of memory. We present trace-driven experimental results from running regular PC applications (e.g., sorting) on top of the Beach system. Due to the exploratory nature of this research, we ignore scalability and fault-tolerance issues for now. Our work provides initial insight into the pros and cons of this vision.
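    The distributed page tables and caching described above can be sketched as a toy simulation: virtual memory is split into fixed-size pages placed on different motes, and a small local cache at the accessing node absorbs repeated reads. All names, sizes, and the placement rule are invented for illustration -- the paper's actual design is not reproduced here.

```python
# Toy simulation of Beach-style distributed paging: an application's virtual
# memory is split into fixed-size pages stored on different motes, with a
# small local cache at the accessing node. Names and sizes are illustrative.
PAGE_SIZE = 256  # bytes per page; real motes hold only a few KB in total

class Mote:
    def __init__(self):
        self.pages = {}  # page_number -> bytes

class BeachMemory:
    def __init__(self, motes, cache_slots=2):
        self.motes = motes
        self.cache = {}            # page_number -> bytes (bounded)
        self.cache_slots = cache_slots
        self.remote_reads = 0      # radio fetches, the expensive operation

    def _mote_for(self, page_number):
        # Stand-in for a distributed page table: simple modulo placement.
        return self.motes[page_number % len(self.motes)]

    def read(self, address):
        page_number, offset = divmod(address, PAGE_SIZE)
        if page_number not in self.cache:
            self.remote_reads += 1  # cache miss: fetch the page over the radio
            if len(self.cache) >= self.cache_slots:
                self.cache.pop(next(iter(self.cache)))  # evict oldest entry
            mote = self._mote_for(page_number)
            self.cache[page_number] = mote.pages.get(page_number, bytes(PAGE_SIZE))
        return self.cache[page_number][offset]

motes = [Mote() for _ in range(4)]
motes[1].pages[1] = bytes([7]) * PAGE_SIZE  # page 1 lives on mote 1
mem = BeachMemory(motes)
print(mem.read(PAGE_SIZE + 5), mem.remote_reads)  # 7 1
print(mem.read(PAGE_SIZE + 6), mem.remote_reads)  # 7 1  (served from cache)
```

    The second read hits the cache, so no additional radio fetch is needed -- the same trade-off (radio latency vs. local cache space) that motivates caching in the Beach design.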

    Text Mining in Python through the HTRC Feature Reader

    No full text
    We introduce a toolkit for working with the 13.6-million-volume Extracted Features Dataset from the HathiTrust Research Center. You will learn how to peer at the words and trends of any book in the collection, while developing broadly useful Python data analysis skills. The HathiTrust holds nearly 15 million digitized volumes from libraries around the world. In addition to their individual value, these works in aggregate are extremely valuable for historians. Spanning many centuries and genres, they offer a way to learn about large-scale trends in history and culture, as well as evidence for changes in language or even the structure of the book. To simplify access to this collection, the HathiTrust Research Center (HTRC) has released the Extracted Features dataset (Capitanu et al. 2015): a dataset that provides quantitative information describing every page of every volume in the collection. In this lesson, we introduce the HTRC Feature Reader, a library for working with the HTRC Extracted Features dataset using the Python programming language. The HTRC Feature Reader is structured to support work using popular data science libraries, particularly Pandas. Pandas provides simple structures for holding data and powerful ways to interact with it. The HTRC Feature Reader uses these data structures, so learning how to use it will also cover general data analysis skills in Python.
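    The lesson's central pattern -- per-page token counts held in a Pandas DataFrame -- can be previewed with synthetic data. The table below stands in for the kind of page/token/count table the Feature Reader produces; its column names are illustrative, not the library's exact schema.

```python
import pandas as pd

# Synthetic stand-in for the per-page token counts the HTRC Feature Reader
# exposes; column names here are illustrative, not the library's exact schema.
tokens = pd.DataFrame({
    "page":  [1, 1, 2, 2, 2],
    "token": ["whale", "sea", "whale", "ship", "sea"],
    "count": [3, 1, 2, 4, 2],
})

# Volume-wide term frequencies -- the kind of aggregation Pandas makes trivial.
totals = tokens.groupby("token")["count"].sum().sort_values(ascending=False)
print(totals)

# Per-page trend for a single word.
whale_by_page = tokens[tokens["token"] == "whale"].set_index("page")["count"]
print(whale_by_page)
```

    Once counts live in a DataFrame, everything from ranking the most frequent words to plotting a word's trajectory across a book is a one-liner, which is why the Feature Reader builds on Pandas.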