
    Metadata and Preservation in Geosciences: Issues at Scale

    As the environment and climate have an increasing impact on the economic sustainability of our country, scientists are being compelled, through their own interest or through directives from funding agencies, to share the results of their research, which often take the form of collections of data. Sharing collections, particularly at scale where the volumes are large, introduces numerous challenges, which we discuss in the context of our research, along with additional challenges that we point out as unaddressed problems. We discuss in particular provenance collection with a system-independent collection tool, Karma; the XMC Cat application-schema-friendly metadata catalog; and the integration of data streams into a workflow composer, XBaya. We conclude with a discussion of the goals of the Data to Insight Center within the Pervasive Technology Institute, in which the Center for Data and Search Informatics and the Digital Library Program play a role.
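
    To make the idea of provenance collection concrete, the sketch below shows one minimal, hypothetical shape for a captured provenance record: the data product (entity), the activity that produced it, and the responsible agent. This is an illustration only, written against no particular tool; the class and field names are invented for this example and are not Karma's data model or API.

        from dataclasses import dataclass, field
        from datetime import datetime
        from typing import Dict

        # Hypothetical, minimal provenance record: the entity/activity/agent
        # triple that provenance tools commonly capture. Not Karma's schema.
        @dataclass
        class ProvenanceRecord:
            entity_id: str                  # the data product, e.g. an output file
            activity: str                   # the process that generated it
            agent: str                      # the person or service responsible
            generated_at: datetime          # when the product was generated
            attributes: Dict[str, str] = field(default_factory=dict)

        record = ProvenanceRecord(
            entity_id="outputs/run-042/result.nc",
            activity="forecast-workflow-run",
            agent="workflow-service",
            generated_at=datetime(2010, 5, 10, 6, 0),
            attributes={"grid": "4km", "workflow": "wrf-ensemble"},
        )
        print(record)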

    Trust threads: minimal provenance and data publication and reuse

    Presented at the National Data Integrity Conference: Enabling Research: New Challenges & Opportunities, held May 7-8, 2015, at Colorado State University, Fort Collins, Colorado. Researchers, administrators and integrity officers are encountering new challenges regarding research data and integrity. This conference aims both to provide attendees with a high-level understanding of these challenges and to impart practical tools and skills to deal with them. Topics will include data reproducibility, validity, privacy, security, visualization, reuse, access, preservation, rights and management. Beth A. Plale is the Director of the Data to Insight Center, Managing Director of the Pervasive Technology Institute, and a Professor in the School of Informatics and Computing, Indiana University. Dr. Plale has broad research and governance interests in information, in long-term preservation of and access to scientific data, and in enabling computational access to large and complex data for broader use. Her specific research interests are in metadata and data provenance, trusted data repositories and enclaves, data analysis and text mining of big data, and workflow systems. Plale teaches in the Data Science Program at Indiana University Bloomington. She is deeply engaged in interdisciplinary research and education and has substantive experience in developing stable and usable scientific cyberinfrastructure. PowerPoint presentation given on May 8, 2015.

    Big Data and HPC: Exploring Role of Research Data Alliance (RDA), a Report On Supercomputing 2013 Birds of a Feather

    The ubiquity of today's data is not just transforming what is; it is transforming what will be, laying the groundwork to drive new innovation. Today, research questions are addressed by complex models, by large data analysis tasks, and by sophisticated data visualization techniques, all requiring data. To address the growing global need for data infrastructure, the Research Data Alliance (RDA) was launched in FY13 as an international community-driven organization. We propose to bring together members of RDA with the HPC community to create a shared conversation around the utility of RDA for data-driven challenges in HPC.

    Grand Challenge of Indiana Water: Estimate of Compute and Data Storage Needs

    This study was undertaken to assess the computational and storage needs of a large-scale research activity to study water in the State of Indiana. It draws its data and compute numbers from the Vortex II Forecast Data study of 2010 carried out by the Data to Insight Center at Indiana University. Details of the study can be found in each of the archived data products, each of which contains the results of a single weather forecast plus the 42 visualizations created for that forecast. See https://scholarworks.iu.edu/dspace/handle/2022/15153 for an example archived data product.
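
    As a rough illustration of how storage needs for such an activity can be estimated, the sketch below multiplies forecasts per day, output size per forecast, and visualization count into yearly and multi-year totals. Only the figure of 42 visualizations per forecast comes from the study described above; every other number (forecasts per day, output and visualization sizes, retention period) is a placeholder assumption, not a value from the Vortex II study.

        # Back-of-envelope storage estimate for a daily-forecast archive.
        # Only 42 visualizations/forecast is taken from the archived data
        # products above; all other parameters are placeholder assumptions.
        FORECASTS_PER_DAY = 4               # assumed
        FORECAST_OUTPUT_GB = 50.0           # assumed model output per forecast
        VISUALIZATIONS_PER_FORECAST = 42    # from the archived data products
        VISUALIZATION_MB = 5.0              # assumed size per visualization
        RETENTION_YEARS = 5                 # assumed

        per_forecast_gb = FORECAST_OUTPUT_GB + VISUALIZATIONS_PER_FORECAST * VISUALIZATION_MB / 1024
        per_year_tb = per_forecast_gb * FORECASTS_PER_DAY * 365 / 1024
        total_tb = per_year_tb * RETENTION_YEARS

        print(f"per forecast: {per_forecast_gb:.1f} GB")
        print(f"per year:     {per_year_tb:.1f} TB")
        print(f"{RETENTION_YEARS}-year total: {total_tb:.0f} TB")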

    Repository of NSF Funded Publications and Data Sets: "Back of Envelope" 15 year Cost Estimate

    In this back-of-envelope study we calculate the 15-year fixed and variable costs of setting up and running a data repository (or database) to store and serve the publications and datasets derived from research funded by the National Science Foundation (NSF). Costs are computed on a yearly basis using a fixed estimate of the number of papers that are published each year that list NSF as their funding agency. We assume each paper has one dataset and estimate the size of that dataset based on experience. By our estimates, the number of papers generated each year is 64,340. The average dataset size over all seven directorates of NSF is 32 gigabytes (GB). The total amount of data added to the repository is two petabytes (PB) per year, or 30 PB over 15 years. The architecture of the data/paper repository is based on a hierarchical storage model that uses a combination of fast disk for rapid access and tape for high reliability and cost-efficient long-term storage. Data are ingested through workflows that are used in university institutional repositories, which add metadata and ensure data integrity. Average fixed costs are approximately $0.90/GB over the 15-year span. Variable costs are estimated at a sliding scale of $150 to $100 per new dataset for up-front curation, or $4.87 to $3.22 per GB. Variable costs reflect a 3% annual decrease in curation costs, as efficiency and automated metadata and provenance capture are anticipated to help reduce what are now largely manual curation efforts. The total projected cost of the data and paper repository is estimated at $167,000,000 over 15 years of operation, curating close to one million datasets and one million papers. After 15 years and 30 PB of data accumulated and curated, we estimate the cost per gigabyte at $5.56. This $167 million cost is a direct cost in that it does not include federally allowable indirect cost return (ICR). After 15 years, it is reasonable to assume that some datasets will be compressed and rarely accessed. Others may be deemed no longer valuable, e.g., because they are replaced by more accurate results. Therefore, at some point the data growth in the repository will need to be adjusted through strategic preservation.
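
    The headline figures above follow from a few multiplications, reproduced below as a sanity check. The inputs (64,340 papers per year, 32 GB average dataset, 15 years, $167M total cost) are taken directly from the abstract; the use of decimal units (1 PB = 1e6 GB) and the rounding conventions are our assumptions.

        # Reproduce the back-of-envelope arithmetic from the abstract.
        # Inputs are from the abstract; decimal units (1 PB = 1e6 GB) assumed.
        PAPERS_PER_YEAR = 64_340           # papers listing NSF funding, per year
        AVG_DATASET_GB = 32                # average dataset size across directorates
        YEARS = 15
        TOTAL_COST_USD = 167_000_000       # projected direct cost over 15 years

        gb_per_year = PAPERS_PER_YEAR * AVG_DATASET_GB   # ~2.06 million GB
        pb_per_year = gb_per_year / 1e6                  # ~2 PB/year, as quoted
        total_pb = pb_per_year * YEARS                   # ~31 PB, quoted as 30 PB
        cost_per_gb = TOTAL_COST_USD / (total_pb * 1e6)  # ~$5.41/GB; the quoted
                                                         # $5.56 uses the rounded 30 PB

        print(f"data per year: {pb_per_year:.2f} PB")
        print(f"15-year total: {total_pb:.1f} PB")
        print(f"cost per GB:   ${cost_per_gb:.2f}")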

    HathiTrust Research Center: Challenges and Opportunities in Big Text Data

    HathiTrust Research Center (HTRC) is the public research arm of the HathiTrust digital library, where millions of volumes, such as books, journals, and government documents, are digitized and preserved. As of November 2013, the HathiTrust collection holds 10.8M total volumes, of which 3.5M are in the public domain [1] and the rest are in-copyright content. The public domain volumes of the HathiTrust collection by themselves amount to more than 2 TB of storage. Each volume comes with a MARC metadata record for the original physical copy and a METS metadata file for the provenance of the digital object. Text at this scale therefore raises challenges for computational access to the collection, to subsets of the collection, and to the metadata. The large volume also poses a challenge for text mining: how HTRC can provide algorithms to exploit knowledge in the collections and accommodate varied mining needs. In this workshop, we will introduce the HTRC infrastructure, the portal and workset builder interface, and the programmatic data retrieval API (Data API); discuss the challenges and opportunities in HTRC big text data; and finish with a short demo of the HTRC tools. More about HTRC: The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library, to help meet the technical challenges researchers face in dealing with massive amounts of digital text by developing cutting-edge software tools and cyberinfrastructure that enable advanced computational access to the growing digital record of human knowledge. See http://www.hathitrust.org/htrc for details. [1] http://www.hathitrust.org/statistics_visualization
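
    Because each volume carries a MARC bibliographic record, the short sketch below shows one generic way such metadata can be read programmatically: it parses a MARCXML file with Python's standard library and extracts the title (MARC datafield 245). The file name is a placeholder and the sketch illustrates MARC handling in general; it is not the HTRC Data API.

        # Minimal MARCXML reader: pull the title (field 245, subfields a/b)
        # from a bibliographic record. "record.xml" is a placeholder path;
        # this illustrates MARC metadata in general, not the HTRC Data API.
        import xml.etree.ElementTree as ET

        MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

        def marc_title(path: str) -> str:
            root = ET.parse(path).getroot()
            parts = []
            for datafield in root.iter(f"{{{MARC_NS['marc']}}}datafield"):
                if datafield.get("tag") == "245":
                    for sub in datafield.findall("marc:subfield", MARC_NS):
                        if sub.get("code") in ("a", "b") and sub.text:
                            parts.append(sub.text.strip())
            return " ".join(parts)

        if __name__ == "__main__":
            print(marc_title("record.xml"))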

    Evaluation of Data Storage in HathiTrust Research Center Using Cassandra

    As digital data sources grow in number and size, they present an opportunity for computational investigation by means of text mining, NLP, and other text analysis techniques. The HathiTrust Research Center (HTRC) was recently established to provision automated analytical techniques over the more than 11 million digitized volumes (books) of the HathiTrust digital repository. The HTRC data store that hosts and provisions access to HathiTrust volumes needs to be efficient, fault-tolerant and large-scale. In this paper, we propose three schema designs for a Cassandra NoSQL store to represent the HathiTrust corpus and perform an extensive performance evaluation using simulated workloads. The experimental results demonstrate that encapsulating the whole volume within a single row with regular columns delivers the best overall performance.
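
    To make the winning design concrete, the sketch below gives one plausible CQL rendering of the "whole volume in a single row" layout, written against the Python cassandra-driver. In CQL terms a Thrift-era wide row corresponds to a partition, so the volume ID serves as the partition key and each page becomes a clustering row within that partition. The keyspace, table, and column names are our own illustration and may differ from the schemas evaluated in the paper.

        # One plausible CQL rendering of the "whole volume in one row" design:
        # the volume ID is the partition key, so all pages of a volume live in
        # a single partition (the CQL analogue of a Thrift-era wide row).
        # Keyspace/table/column names are illustrative, not the paper's schema.
        from cassandra.cluster import Cluster

        cluster = Cluster(["127.0.0.1"])
        session = cluster.connect()

        session.execute("""
            CREATE KEYSPACE IF NOT EXISTS htrc_demo
            WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
        """)
        session.execute("""
            CREATE TABLE IF NOT EXISTS htrc_demo.volumes (
                volume_id   text,
                page_number int,
                page_text   text,
                PRIMARY KEY (volume_id, page_number)
            )
        """)

        # Insert one page, then read the whole volume back from its partition.
        session.execute(
            "INSERT INTO htrc_demo.volumes (volume_id, page_number, page_text) "
            "VALUES (%s, %s, %s)",
            ("mdp.39015012345678", 1, "First page of OCR text ..."),
        )
        rows = session.execute(
            "SELECT page_number, page_text FROM htrc_demo.volumes WHERE volume_id = %s",
            ("mdp.39015012345678",),
        )
        for row in rows:
            print(row.page_number, row.page_text[:40])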