17 research outputs found

    Providing pin-point page-level precision to 1 trillion tokens of text for workset creation

    Get PDF
    We report on work undertaken to develop a web environment that allows users to search over 1 trillion tokens of text -- down to the page level -- of the HathiTrust Part-of-Speech Extracted Features Dataset to help produce worksets for scholarly analysis. We present an extended example of the web environment in use, along with details about its implementation.
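    The core data structure such a page-level search environment requires can be sketched as a toy inverted index mapping tokens to (volume, page) pairs. This is an illustrative sketch only -- the class and identifiers below are invented for the example and are not the actual implementation described in the abstract.

```python
from collections import defaultdict

# Toy page-level inverted index: token -> set of (volume_id, page_number).
# Illustrative only; not the web environment's actual implementation.
class PageIndex:
    def __init__(self):
        self.index = defaultdict(set)

    def add_page(self, volume_id, page_number, token_counts):
        """Index every token that appears on a page."""
        for token in token_counts:
            self.index[token].add((volume_id, page_number))

    def search(self, token):
        """Return pages containing the token, sorted for stable output."""
        return sorted(self.index.get(token, set()))

idx = PageIndex()
idx.add_page("mdp.39015012345678", 12, {"whale": 3, "sea": 7})
idx.add_page("mdp.39015012345678", 13, {"sea": 2})
idx.add_page("uc1.b000987654", 4, {"whale": 1})
print(idx.search("whale"))  # pages from two different volumes
```

    At trillion-token scale the real system would need sharding and compression, but the lookup pattern -- token in, page hits out -- is the same.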

    Access to billions of pages for large-scale text analysis

    Get PDF
    Consortial collections have led to unprecedented scales of digitized corpora, but the insights they enable are hampered by the complexities of access, particularly to in-copyright or orphan works. Pursuing a principle of non-consumptive access, we developed the Extracted Features (EF) dataset, a dataset of quantitative counts for every page of nearly 5 million scanned books. The EF dataset includes unigram counts, part-of-speech tagging, header and footer extraction, counts of characters at both sides of the page, and more. Distributing book data with features already extracted saves the resource costs associated with large-scale text use, improves the reproducibility of research done on the dataset, and opens the door to datasets on copyrighted books. We describe the coverage of the dataset and demonstrate its useful application through aligning duplicate books and identifying their cleanest scans, topic modeling, word-list expansion, and multifaceted visualization.
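    The per-page feature records described above can be previewed with a simplified sketch. The JSON below only loosely imitates the published EF schema (field names such as tokenPosCount are based on the dataset's documentation but should be treated as illustrative here), and the helper function collapses POS-tagged counts back into plain unigram counts.

```python
import json

# A simplified record imitating the EF dataset's per-page structure
# (field names follow the published schema loosely; treat as illustrative).
record = json.loads("""
{
  "pages": [
    {"seq": 1,
     "body": {"tokenPosCount": {"ship": {"NN": 4}, "sailed": {"VBD": 1}}},
     "beginLineChars": {"S": 2}, "endLineChars": {".": 3}},
    {"seq": 2,
     "body": {"tokenPosCount": {"ship": {"NN": 1}, "harbor": {"NN": 2}}}}
  ]
}
""")

def page_unigrams(page):
    """Collapse POS-tagged counts into plain unigram counts for one page."""
    counts = {}
    for token, pos_counts in page["body"]["tokenPosCount"].items():
        counts[token] = sum(pos_counts.values())
    return counts

for page in record["pages"]:
    print(page["seq"], page_unigrams(page))
```

    Because only counts like these are distributed -- never the page text itself -- the dataset remains non-consumptive even for in-copyright volumes.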

    Mapping Genre at the Page Level in English-Language Volumes from HathiTrust, 1700-1899

    Get PDF
    Using regularized logistic regression and hidden Markov models, we predict genre at the page level in a collection of 469,000 volumes from the HathiTrust Digital Library. Accuracy is comparable to human crowdsourcing.
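    The role of the hidden Markov model in this pipeline can be illustrated with a minimal Viterbi-smoothing sketch: a per-page classifier (such as logistic regression) emits genre probabilities, and an HMM whose transitions favor staying in the same genre smooths away single-page blips. The numbers, genre labels, and transition probability below are made up for the example, not the authors' actual model.

```python
import math

def viterbi_smooth(page_probs, stay=0.9):
    """Smooth per-page genre probabilities with an HMM whose transition
    matrix favors staying in the same genre (probability `stay`).
    page_probs: list of dicts mapping genre -> P(genre | page features),
    as a page-level classifier might emit."""
    genres = list(page_probs[0])
    switch = (1 - stay) / (len(genres) - 1)
    # log-probability of the best path ending in each genre
    best = {g: math.log(page_probs[0][g]) for g in genres}
    back = []
    for probs in page_probs[1:]:
        new_best, pointers = {}, {}
        for g in genres:
            prev, score = max(
                ((p, best[p] + math.log(stay if p == g else switch)) for p in genres),
                key=lambda t: t[1])
            new_best[g] = score + math.log(probs[g])
            pointers[g] = prev
        best, back = new_best, back + [pointers]
    # trace the most likely genre sequence back through the pointers
    path = [max(best, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# A noisy single-page "poetry" blip inside a run of fiction is smoothed away.
probs = [{"fiction": 0.9, "poetry": 0.1},
         {"fiction": 0.4, "poetry": 0.6},
         {"fiction": 0.9, "poetry": 0.1}]
print(viterbi_smooth(probs))  # ['fiction', 'fiction', 'fiction']
```

    The intuition is that genre rarely changes from one page to the next, so sequence-level smoothing corrects isolated classifier errors.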

    Extending the Utility of the HTRC Extracted Features Dataset Through Linked Data

    Get PDF
    Poster accompanying previously submitted poster abstract

    Draft genome sequence of the Tibetan antelope

    Get PDF
    The Tibetan antelope (Pantholops hodgsonii) is endemic to the extremely inhospitable high-altitude environment of the Qinghai-Tibetan Plateau, a region that has a low partial pressure of oxygen and high ultraviolet radiation. Here we generate a draft genome of this artiodactyl and use it to detect the potential genetic bases of highland adaptation. Compared with other plain-dwelling mammals, the genome of the Tibetan antelope shows signals of adaptive evolution and gene-family expansion in genes associated with energy metabolism and oxygen transmission. Both the highland American pika and the Tibetan antelope have signals of positive selection for genes involved in DNA repair and the production of ATPase. Genes associated with hypoxia seem to have experienced convergent evolution. Thus, our study suggests that common genetic mechanisms might have been utilized to enable high-altitude adaptation.

    The Beach System: Building a PC from Many Tiny Computers - A First Step at Virtualization -

    Get PDF
    The emergence of tiny computers, such as smart dust, Berkeley motes and Intel motes, makes it feasible to envision the conversion of a network of tiny computers into a regular computing device (i.e., a "PC" or personal computer). While the falling cost and increasing (yet tiny) computation power of these miniature computers portend well for this vision, there are significant technical hurdles. In this paper, we take a first step at building "PCs" out of such tiny computer networks, in order to run regular PC applications. Our system, called Beach, virtualizes the memory accessed by an application at a single sensor mote (a type of tiny computer), thus enabling this memory to be distributed over multiple such motes. By using distributed page tables and caching, we transform the puny memory at each mote (a few KB) into several KBs of memory. We present trace-driven experimental results from running regular PC applications (e.g., sorting) on top of the Beach system. Due to the exploratory nature of this research, we ignore scalability and fault-tolerance issues for now. Our work provides initial insight into the pros and cons of this vision.
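    The distributed page tables and caching described above can be sketched as a toy simulation: virtual memory is split into fixed-size pages placed on different motes, and a small local cache at the accessing node absorbs repeated reads. All names, sizes, and the placement rule are invented for illustration -- the paper's actual design is not reproduced here.

```python
# Toy simulation of Beach-style distributed paging: an application's virtual
# memory is split into fixed-size pages stored on different motes, with a
# small local cache at the accessing node. Names and sizes are illustrative.
PAGE_SIZE = 256  # bytes per page; real motes hold only a few KB in total

class Mote:
    def __init__(self):
        self.pages = {}  # page_number -> bytes

class BeachMemory:
    def __init__(self, motes, cache_slots=2):
        self.motes = motes
        self.cache = {}            # page_number -> bytes (bounded)
        self.cache_slots = cache_slots
        self.remote_reads = 0      # radio fetches, the expensive operation

    def _mote_for(self, page_number):
        # Stand-in for a distributed page table: simple modulo placement.
        return self.motes[page_number % len(self.motes)]

    def read(self, address):
        page_number, offset = divmod(address, PAGE_SIZE)
        if page_number not in self.cache:
            self.remote_reads += 1  # cache miss: fetch the page over the radio
            if len(self.cache) >= self.cache_slots:
                self.cache.pop(next(iter(self.cache)))  # evict oldest entry
            mote = self._mote_for(page_number)
            self.cache[page_number] = mote.pages.get(page_number, bytes(PAGE_SIZE))
        return self.cache[page_number][offset]

motes = [Mote() for _ in range(4)]
motes[1].pages[1] = bytes([7]) * PAGE_SIZE  # page 1 lives on mote 1
mem = BeachMemory(motes)
print(mem.read(PAGE_SIZE + 5), mem.remote_reads)  # 7 1
print(mem.read(PAGE_SIZE + 6), mem.remote_reads)  # 7 1  (served from cache)
```

    The second read hits the cache, so no additional radio fetch is needed -- the same trade-off (radio latency vs. local cache space) that motivates caching in the Beach design.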

    Text Mining in Python through the HTRC Feature Reader

    No full text
    We introduce a toolkit for working with the 13.6-million-volume Extracted Features Dataset from the HathiTrust Research Center. You will learn how to peer at the words and trends of any book in the collection, while developing broadly useful Python data analysis skills. The HathiTrust holds nearly 15 million digitized volumes from libraries around the world. In addition to their individual value, these works in aggregate are extremely valuable for historians. Spanning many centuries and genres, they offer a way to learn about large-scale trends in history and culture, as well as evidence for changes in language or even the structure of the book. To simplify access to this collection, the HathiTrust Research Center (HTRC) has released the Extracted Features dataset (Capitanu et al. 2015): a dataset that provides quantitative information describing every page of every volume in the collection. In this lesson, we introduce the HTRC Feature Reader, a library for working with the HTRC Extracted Features dataset using the Python programming language. The HTRC Feature Reader is structured to support work using popular data science libraries, particularly Pandas. Pandas provides simple structures for holding data and powerful ways to interact with it. The HTRC Feature Reader uses these data structures, so learning how to use it will also cover general data analysis skills in Python.
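    The lesson's central pattern -- per-page token counts held in a Pandas DataFrame -- can be previewed with synthetic data. The table below stands in for the kind of page/token/count table the Feature Reader produces; its column names are illustrative, not the library's exact schema.

```python
import pandas as pd

# Synthetic stand-in for the per-page token counts the HTRC Feature Reader
# exposes; column names here are illustrative, not the library's exact schema.
tokens = pd.DataFrame({
    "page":  [1, 1, 2, 2, 2],
    "token": ["whale", "sea", "whale", "ship", "sea"],
    "count": [3, 1, 2, 4, 2],
})

# Volume-wide term frequencies -- the kind of aggregation Pandas makes trivial.
totals = tokens.groupby("token")["count"].sum().sort_values(ascending=False)
print(totals)

# Per-page trend for a single word.
whale_by_page = tokens[tokens["token"] == "whale"].set_index("page")["count"]
print(whale_by_page)
```

    Once counts live in a DataFrame, everything from ranking the most frequent words to plotting a word's trajectory across a book is a one-liner, which is why the Feature Reader builds on Pandas.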