Providing pin-point page-level precision to 1 trillion tokens of text for workset creation
We report on the development of a web environment that lets users search across 1 trillion tokens of text, down to the page level, in the HathiTrust Part-of-Speech Extracted Features Dataset to help produce worksets for scholarly analysis. We present an extended example of the web environment in use, along with details of its implementation.
Access to billions of pages for large-scale text analysis
Consortial collections have led to unprecedented scales of digitized corpora, but the insights that they enable are hampered by the complexities of access, particularly to in-copyright or orphan works. Pursuing a principle of non-consumptive access, we developed the Extracted Features (EF) dataset, a dataset of quantitative counts for every page of nearly 5 million scanned books. The EF includes unigram counts, part of speech tagging, header and footer extraction, counts of characters at both sides of the page, and more. Distributing book data with features already extracted saves resource costs associated with large-scale text use, improves the reproducibility of research done on the dataset, and opens the door to datasets on copyrighted books. We describe the coverage of the dataset and demonstrate its useful application through duplicate book alignment and identification of their cleanest scans, topic modeling, word list expansion, and multifaceted visualization.
Mapping Genre at the Page Level in English-Language Volumes from HathiTrust, 1700-1899
Using regularized logistic regression and hidden Markov models, we predict genre at the page level in a collection of 469,000 volumes from HathiTrust Digital Library. Accuracy is comparable to human crowdsourcing.
Extending the Utility of the HTRC Extracted Features Dataset Through Linked Data
Poster accompanying previously submitted poster abstract
Draft genome sequence of the Tibetan antelope
The Tibetan antelope (Pantholops hodgsonii) is endemic to the extremely inhospitable high-altitude environment of the Qinghai-Tibetan Plateau, a region that has a low partial pressure of oxygen and high ultraviolet radiation. Here we generate a draft genome of this artiodactyl and use it to detect the potential genetic bases of highland adaptation. Compared with other plain-dwelling mammals, the genome of the Tibetan antelope shows signals of adaptive evolution and gene-family expansion in genes associated with energy metabolism and oxygen transmission. Both the highland American pika and the Tibetan antelope have signals of positive selection for genes involved in DNA repair and the production of ATPase. Genes associated with hypoxia seem to have experienced convergent evolution. Thus, our study suggests that common genetic mechanisms might have been utilized to enable high-altitude adaptation.
Analyses of pig genomes provide insight into porcine demography and evolution
For 10,000 years pigs and humans have shared a close and complex relationship. From domestication to modern breeding practices, humans have shaped the genomes of domestic pigs. Here we present the assembly and analysis of the genome sequence of a female domestic Duroc pig (Sus scrofa) and a comparison with the genomes of wild and domestic pigs from Europe and Asia. Wild pigs emerged in South East Asia and subsequently spread across Eurasia. Our results reveal a deep phylogenetic split between European and Asian wild boars ∼1 million years ago, and a selective sweep analysis indicates selection on genes involved in RNA processing and regulation. Genes associated with immune response and olfaction exhibit fast evolution. Pigs have the largest repertoire of functional olfactory receptor genes, reflecting the importance of smell in this scavenging animal. The pig genome sequence provides an important resource for further improvements of this important livestock species, and our identification of many putative disease-causing variants extends the potential of the pig as a biomedical model.
The Beach System: Building a PC from Many Tiny Computers - A First Step at Virtualization -
The emergence of tiny computers, such as smart dust, Berkeley motes and Intel motes, makes it feasible to envision the conversion of a network of tiny computers into a regular computing device (i.e., a "PC" or personal computer). While the falling cost and increasing (yet tiny) computation power of these miniature computers bode well for this vision, there are significant technical hurdles. In this paper, we take a first step at building "PCs" out of such tiny computer networks, in order to run regular PC applications. Our system, called Beach, virtualizes the memory accessed by an application at a single sensor mote (a type of tiny computer), thus enabling this memory to be distributed across multiple such motes. By using distributed page tables and caching, we transform the puny memory at each mote (a few KBs) into several KBs of memory. We present trace-driven experimental results from running regular PC applications (e.g., sorting) on top of the Beach system. Due to the exploratory nature of this research, we ignore scalability and fault-tolerance issues for now. Our work provides initial insight into the pros and cons of the vision.
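The idea of spreading one application's memory over several motes through a page table and a local cache can be illustrated with a toy model. The following is a minimal sketch, not the Beach implementation: the class names, the page size, the modulo page-to-mote mapping, and the LRU eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 64  # bytes per page; a hypothetical value chosen for the sketch


class Mote:
    """A tiny computer that stores whole pages of the shared address space."""

    def __init__(self, mote_id):
        self.mote_id = mote_id
        self.pages = {}  # page number -> bytearray holding that page


class VirtualMemory:
    """Single flat address space backed by pages held on several motes."""

    def __init__(self, motes, cache_pages=2):
        self.motes = motes
        self.cache_pages = cache_pages
        self.cache = OrderedDict()  # locally cached pages, least recent first
        self.remote_fetches = 0     # how often a page had to come from its home mote

    def _home(self, page_no):
        # Assumed page table: a static modulo mapping from page to home mote.
        return self.motes[page_no % len(self.motes)]

    def _load(self, page_no):
        if page_no in self.cache:
            self.cache.move_to_end(page_no)  # mark as most recently used
            return self.cache[page_no]
        self.remote_fetches += 1
        page = self._home(page_no).pages.setdefault(page_no, bytearray(PAGE_SIZE))
        # The cache shares the page object with its home mote, so writes are
        # visible there immediately; a real system would need explicit
        # write-back messages between motes.
        self.cache[page_no] = page
        if len(self.cache) > self.cache_pages:
            self.cache.popitem(last=False)  # evict the least recently used page
        return page

    def read(self, addr):
        page_no, offset = divmod(addr, PAGE_SIZE)
        return self._load(page_no)[offset]

    def write(self, addr, value):
        page_no, offset = divmod(addr, PAGE_SIZE)
        self._load(page_no)[offset] = value
```

With four motes, an application can address more memory than any single mote holds, for example `vm = VirtualMemory([Mote(i) for i in range(4)])` followed by `vm.write(200, 9)` and `vm.read(200)`. The `remote_fetches` counter shows how the cache size trades local memory against traffic between motes.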
Text Mining in Python through the HTRC Feature Reader
We introduce a toolkit for working with the 13.6 million volume Extracted Features Dataset from the HathiTrust Research Center. You will learn how to peer at the words and trends of any book in the collection, while developing broadly useful Python data analysis skills.
The HathiTrust holds nearly 15 million digitized volumes from libraries around the world. In addition to their individual value, these works in aggregate are extremely valuable for historians. Spanning many centuries and genres, they offer a way to learn about large-scale trends in history and culture, as well as evidence for changes in language or even the structure of the book. To simplify access to this collection the HathiTrust Research Center (HTRC) has released the Extracted Features dataset (Capitanu et al. 2015): a dataset that provides quantitative information describing every page of every volume in the collection.
In this lesson, we introduce the HTRC Feature Reader, a library for working with the HTRC Extracted Features dataset using the Python programming language. The HTRC Feature Reader is structured to support work using popular data science libraries, particularly Pandas. Pandas provides simple structures for holding data and powerful ways to interact with it. The HTRC Feature Reader uses these data structures, so learning how to use it will also cover general data analysis skills in Python.
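To make the data model concrete, the sketch below rolls page-level token counts up into volume-level counts. It is plain Python, not the HTRC Feature Reader API or Pandas: the record layout (page, token, part-of-speech tag, count), the function names, and the sample values are all invented for illustration.

```python
from collections import Counter

# Hypothetical page-level records: (page number, token, part-of-speech tag, count).
# The values are made up; real Extracted Features files carry many more fields.
page_records = [
    (1, "whale", "NN", 3),
    (1, "sea", "NN", 1),
    (2, "whale", "NN", 2),
    (2, "swim", "VB", 1),
]


def volume_counts(records):
    """Sum each token's count over all pages of the volume."""
    totals = Counter()
    for _page, token, _pos, count in records:
        totals[token] += count
    return totals


def counts_per_page(records):
    """Total number of tokens recorded on each page."""
    per_page = Counter()
    for page, _token, _pos, count in records:
        per_page[page] += count
    return per_page
```

Calling `volume_counts(page_records).most_common(1)` returns the most frequent token in the sample. In Pandas, the same aggregation would typically be expressed as a group-by over a table of these records.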