
    Metadata and Preservation in Geosciences: Issues at Scale

    As the environment and climate have an increasing impact on the economic sustainability of our country, scientists are being compelled, through their own interest or through directives from funding agencies, to share the results of their research, which often take the form of collections of data. Sharing collections, particularly at scale where the volumes are large, introduces numerous challenges that we discuss in the context of our research, along with additional challenges that we point out as unaddressed problems. We discuss in particular provenance collection with Karma, a system-independent collection tool; XMC Cat, an application-schema-friendly metadata catalog; and the integration of data streams into a workflow composer, XBaya. We conclude with a discussion of the goals of the Data to Insight Center within the Pervasive Technology Institute, in which the Center for Data and Search Informatics and the Digital Library Program play a role.

    Trust threads: minimal provenance and data publication and reuse

    Presented at the National Data Integrity Conference: Enabling Research: New Challenges & Opportunities, held May 7-8, 2015 at Colorado State University, Fort Collins, Colorado. Researchers, administrators, and integrity officers are encountering new challenges regarding research data and integrity. This conference aims to give attendees both a high-level understanding of these challenges and practical tools and skills to deal with them. Topics include data reproducibility, validity, privacy, security, visualization, reuse, access, preservation, rights, and management.

    Beth A. Plale is the Director of the Data to Insight Center, Managing Director of the Pervasive Technology Institute, and a Professor in the School of Informatics and Computing, Indiana University. Dr. Plale has broad research and governance interests in information, in long-term preservation of and access to scientific data, and in enabling computational access to large and complex data for broader use. Her specific research interests are in metadata and data provenance, trusted data repositories and enclaves, data analysis and text mining of big data, and workflow systems. Plale teaches in the Data Science Program at Indiana University Bloomington. She is deeply engaged in interdisciplinary research and education and has substantive experience in developing stable and usable scientific cyberinfrastructure.

    PowerPoint presentation given on May 8, 2015.

    Big Data and HPC: Exploring Role of Research Data Alliance (RDA), a Report On Supercomputing 2013 Birds of a Feather

    The ubiquity of today's data is not just transforming what is; it is transforming what will be, laying the groundwork to drive new innovation. Today, research questions are addressed by complex models, by large data analysis tasks, and by sophisticated data visualization techniques, all requiring data. To address the growing global need for data infrastructure, the Research Data Alliance (RDA) was launched in FY13 as an international, community-driven organization. We propose to bring together members of RDA with the HPC community to create a shared conversation around the utility of RDA for data-driven challenges in HPC.

    Grand Challenge of Indiana Water: Estimate of Compute and Data Storage Needs

    This study was undertaken to assess the computational and storage needs of a large-scale research activity to study water in the State of Indiana. It draws its data and compute numbers from the Vortex II Forecast Data study of 2010 carried out by the Data To Insight Center at Indiana University. Details of the study can be found in each of the archived data products, each of which contains the results of a single weather forecast plus the 42 visualizations created for that forecast. See https://scholarworks.iu.edu/dspace/handle/2022/15153 for an example archived data product.

    HathiTrust Research Center: Challenges and Opportunities in Big Text Data

    The HathiTrust Research Center (HTRC) is the public research arm of the HathiTrust digital library, where millions of volumes, such as books, journals, and government documents, are digitized and preserved. As of November 2013, the HathiTrust collection held 10.8M total volumes, of which 3.5M are in the public domain [1]; the rest are in-copyright content. The public domain volumes of the HathiTrust collection by themselves occupy more than 2 TB of storage. Each volume comes with a MARC metadata record for the original physical copy and a METS metadata file for the provenance of the digital object. Text at this scale therefore raises challenges for computational access to the collection, to subsets of the collection, and to the metadata. The large volume also poses a text-mining challenge: how HTRC can provide algorithms that exploit knowledge in the collections and accommodate varied mining needs. In this workshop, we will introduce the HTRC infrastructure, the portal and workset builder interface, and the programmatic data retrieval API (Data API), discuss the challenges and opportunities in HTRC big text data, and finish with a short demo of the HTRC tools.

    More about HTRC: The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library, to help meet the technical challenges researchers face in dealing with massive amounts of digital text, by developing cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge. See http://www.hathitrust.org/htrc for details.

    [1] http://www.hathitrust.org/statistics_visualization

    Evaluation of Data Storage in HathiTrust Research Center Using Cassandra

    As digital data sources grow in number and size, they present an opportunity for computational investigation by means of text mining, NLP, and other text analysis techniques. The HathiTrust Research Center (HTRC) was recently established to provision automated analytical techniques over the more than 11 million digitized volumes (books) of the HathiTrust digital repository. The HTRC data store that hosts and provisions access to HathiTrust volumes needs to be efficient, fault-tolerant, and large-scale. In this paper, we propose three schema designs for a Cassandra NoSQL store to represent the HathiTrust corpus and perform an extensive performance evaluation using simulated workloads. The experimental results demonstrate that encapsulating the whole volume within a single row with regular columns delivers the best overall performance.
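    The best-performing design described above, a whole volume in a single row with regular columns, might be sketched in CQL roughly as follows. This is an illustrative sketch only: the table and column names are hypothetical and are not taken from the paper's actual schemas.

    ```cql
    -- Hypothetical sketch: one partition (row) per volume, one clustered
    -- column per page. Names (volumes, volume_id, page_num, page_text)
    -- are illustrative, not the HTRC schema.
    CREATE TABLE volumes (
        volume_id text,   -- HathiTrust volume identifier (partition key)
        page_num  int,    -- page number within the volume (clustering key)
        page_text text,   -- OCR text of the page
        PRIMARY KEY (volume_id, page_num)
    );

    -- All pages of a volume come back in a single partition read:
    SELECT page_num, page_text
    FROM volumes
    WHERE volume_id = 'mdp.39015012345678';
    ```

    In CQL, a clustering column like `page_num` is how the wide-row layout, many regular columns within one storage row, is modeled, so retrieving a full volume touches only one partition.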

    Indiana University Digitization Master Plan

    In his State of the University address on October 1, 2013, Indiana University President Michael McRobbie emphasized that universities have a critical role to play in the preservation of knowledge. In keeping with this goal, President McRobbie announced a charter for an Indiana University Digitization Master Plan (DMP). The DMP is to look beyond time-based media and formulate a university-wide roadmap to digitize and store, in some form, all of our existing collections judged by experts and scholars to be of lasting importance to research and scholarship, and to ensure the preservation of all new research and scholarship at IU that is born digital.