
    Revisiting the Data Lifecycle with Big Data Curation

    As science becomes more data-intensive and collaborative, researchers increasingly use larger and more complex data to answer research questions. The capacity of storage infrastructure, the increased sophistication and deployment of sensors, the ubiquitous availability of computer clusters, the development of new analysis techniques, and larger collaborations allow researchers to address grand societal challenges in a way that is unprecedented. In parallel, research data repositories have been built to host research data in response to sponsor requirements that research data be made publicly available. Libraries are re-inventing themselves to respond to a growing demand to manage, store, curate, and preserve the data produced in the course of publicly funded research. As librarians and data managers develop the tools and knowledge they need to meet these new expectations, they inevitably encounter conversations around Big Data. This paper explores definitions of Big Data that have coalesced in the last decade around four commonly mentioned characteristics: volume, variety, velocity, and veracity. We highlight the issues associated with each characteristic, particularly their impact on data management and curation. Using the data life cycle model as a methodological framework, we assess two models developed in the context of Big Data projects and find them lacking. We propose a Big Data life cycle model that includes activities focused on Big Data and more closely integrates curation with the research life cycle. These activities include planning, acquiring, preparing, analyzing, preserving, and discovering, with describing the data and assuring quality being an integral part of each activity. We discuss the relationship between institutional data curation repositories and new long-term data resources associated with high performance computing centers, and reproducibility in computational science. We apply this model by mapping the four characteristics of Big Data outlined above to each of the activities in the model. This mapping produces a set of questions that practitioners should be asking in a Big Data project.
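    The mapping described above can be pictured as a simple activity-by-characteristic matrix. The following Python sketch is purely illustrative: the activity and characteristic names come from the abstract, but the question wording is a hypothetical placeholder, not the authors' actual checklist.

```python
# Hypothetical sketch: crossing the lifecycle activities with the four Big Data
# characteristics to generate practitioner prompts. Question wording is a
# placeholder, not taken from the paper.

ACTIVITIES = ["planning", "acquiring", "preparing", "analyzing", "preserving", "discovering"]
CHARACTERISTICS = ["volume", "variety", "velocity", "veracity"]

def build_question_matrix():
    """Return one prompt per (activity, characteristic) pair."""
    matrix = {}
    for activity in ACTIVITIES:
        for characteristic in CHARACTERISTICS:
            matrix[(activity, characteristic)] = (
                f"How does {characteristic} affect the {activity} activity, and how "
                f"will describing the data and assuring quality be handled there?"
            )
    return matrix

if __name__ == "__main__":
    questions = build_question_matrix()
    print(questions[("preserving", "volume")])
```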

    Practices, Challenges, and Prospects of Big Data Curation: a Case Study in Geoscience

    Open and persistent access to past, present, and future scientific data is fundamental for transparent and reproducible data-driven research. The scientific community is now facing both challenges and opportunities posed by increasingly complex disciplinary data systems. Concerted efforts from domain experts, information professionals, and Internet technology experts are essential to ensure the accessibility and interoperability of big data. Here we review current practices in building and managing big data within the context of large data infrastructure, using geoscience cyberinfrastructure such as the Interdisciplinary Earth Data Alliance (IEDA) and EarthCube as a case study. Geoscience is a data-rich discipline with a rapid expansion of sophisticated and diverse digital data sets. Having begun to embrace the digital age, the community has applied big data and data mining tools to new types of research. We also identify current challenges, key elements, and prospects for constructing a more robust and future-proof big data infrastructure for research and publication, as well as the roles, qualifications, and opportunities for librarians and information professionals in the data era.

    Curricula analysis for big data stewardship – embedding data curation roles in the big data value chain

    The growing importance of big data necessitates appropriate data curation to safeguard the data asset. Data curation is the management of data to ensure that high-quality, reusable, ethically collected, and valuable data can be found and accessed. The focus here is on the phrase ‘valuable data’, as we believe it to be all-inclusive. Effective big data curation demands that data curators develop appropriate competencies to fill roles and fulfill responsibilities that add value, for which they require tertiary education and continuous training. An exploratory study of the data management and data stewardship curricula of 35 selected national and international institutions revealed subjects and topics taught that could be mapped to curation roles and responsibilities. When the responsibilities were embedded in a big data value chain, it was possible to identify potential training gaps. Using a big data value chain as a framework provides a handy instrument to guide curriculum development and to illustrate the potential value of data stewardship.
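    The gap analysis described above can be illustrated with a small sketch. All stage names, responsibilities, and curriculum topics below are invented examples, not data from the study.

```python
# Illustrative gap analysis: map curriculum topics onto responsibilities embedded
# in a big data value chain and report stages whose responsibilities are not
# covered. All names below are hypothetical.

VALUE_CHAIN = ["acquisition", "curation", "storage", "analysis", "usage"]

# Hypothetical curation responsibilities attached to each value-chain stage.
responsibilities = {
    "acquisition": {"metadata capture", "consent and ethics checks"},
    "curation": {"quality assurance", "metadata enrichment"},
    "storage": {"preservation planning"},
    "analysis": {"provenance tracking"},
    "usage": {"licensing and reuse support"},
}

# Hypothetical topics covered by the surveyed curricula.
curricula_topics = {"metadata capture", "quality assurance", "preservation planning"}

# A potential training gap is any responsibility with no matching curriculum topic.
for stage in VALUE_CHAIN:
    gaps = responsibilities[stage] - curricula_topics
    if gaps:
        print(f"{stage}: not covered -> {sorted(gaps)}")
```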

    A European research roadmap for optimizing societal impact of big data on environment and energy efficiency

    We present a roadmap to guide European research efforts towards a socially responsible big data economy that maximizes the positive impact of big data on environment and energy efficiency. The goal of the roadmap is to allow stakeholders and the big data community to identify and meet big data challenges, and to proceed with a shared understanding of the societal impact, positive and negative externalities, and concrete problems worth investigating. It builds upon a case study focused on the impact of big data practices in the context of Earth Observation that reveals both positive and negative effects in the areas of economy, society and ethics, legal frameworks, and political issues. The roadmap identifies European technical and non-technical priorities in research and innovation to be addressed in the upcoming five years in order to deliver societal impact, develop skills, and contribute to standardization. (Comment: 6 pages, 2 figures, 1 table)

    Towards Fluent Decision Making Experience by Adopting Information Curation Functions

    An information curation function (ICF) is a feature that online review platforms implement to facilitate users’ review-reading process given the sheer volume of reviews pertaining to a brand. Based on both fluency and information overload theories, this research attempts to unravel the role of ICF in alleviating information overload and facilitating consumers’ decision-making process. Specifically, this study strives to quantify the impact of ICF use on users’ brand selection and satisfaction with that selection through an integrated model that explores the effect of both external environmental cues and internal metacognitive cues. The study is expected not only to inform practitioners of the usefulness of ICF, but also to inspire them to seek solutions that optimize consumers’ decision making.

    Granularity analysis of classification and estimation for complex datasets with MOA

    Dispersed and unstructured datasets are key parameters in estimating the exact amount of space required. Depending on the size and distribution of the data, and especially when the classes are strongly associated, the level of granularity needed for a precise classification of the datasets increases. Data complexity is one of the major attributes governing the proper level of granularity, as it has a direct impact on performance. Dataset classification is therefore a vital step in complex data analytics, designed to ensure that a dataset is ready to be efficiently scrutinized. Data collections frequently contain missing, noisy, and out-of-range values, and analytics on data that has not been carefully classified for such problems can produce unreliable outcomes. Hence, classification of complex data sources helps improve the accuracy with which machine learning algorithms handle gathered datasets. Dataset complexity and pre-processing time reflect the effectiveness of each individual algorithm. Once the complexity of the datasets has been characterized, comparatively simpler datasets can be further investigated with a parallel approach. Speed-up is measured by executing MOA simulations. Our proposed classification approach outperforms existing ones and improves the granularity level at which complex datasets are handled.
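    The speed-up measurement mentioned above can be illustrated in outline. MOA itself is a Java stream-mining framework; the Python sketch below merely stands in for it, using a toy per-partition workload to show how a serial-versus-parallel speed-up figure could be computed.

```python
# Conceptual speed-up measurement only; the toy workload below stands in for
# classifying partitions of a complex dataset and is not MOA code.
import time
from concurrent.futures import ProcessPoolExecutor

def classify_partition(rows):
    # Stand-in for running a classifier over one partition of the dataset.
    return sum(1 for value in rows if value % 3 == 0)

def main():
    partitions = [list(range(i, i + 250_000)) for i in range(0, 2_000_000, 250_000)]

    start = time.perf_counter()
    serial = [classify_partition(p) for p in partitions]
    t_serial = time.perf_counter() - start

    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:
        parallel = list(pool.map(classify_partition, partitions))
    t_parallel = time.perf_counter() - start

    assert serial == parallel  # same results, different execution strategy
    print(f"speed-up: {t_serial / t_parallel:.2f}x")

if __name__ == "__main__":
    main()
```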

    Synchronic Curation for Assessing Reuse and Integration Fitness of Multiple Data Collections

    Data-driven applications often require using data integrated from different, large, and continuously updated collections. These collections may present gaps, overlap, contain conflicting information, or complement one another. Thus, a curation need is to continuously assess whether data from multiple collections are fit for integration and reuse. To assess different large data collections at the same time, we present the Synchronic Curation (SC) framework. SC involves processing steps to map the different collections to a unifying data model that represents research problems in a scientific area. The data model, which includes the collections' provenance and a data dictionary, is implemented in a graph database where collections are continuously ingested and can be queried. SC has a collection analysis and comparison module to track updates and to identify gaps, changes, and irregularities within and across collections. Assessment results can be explored through a web-based interactive graph. In this paper we introduce SC as an interdisciplinary enterprise and illustrate its capabilities through its implementation in ASTRIAGraph, a space sustainability knowledge system.
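    The cross-collection comparison idea can be sketched in a few lines. The collections, identifiers, and fields below are invented for illustration; the framework described above ingests collections into a graph database rather than in-memory dictionaries.

```python
# Illustrative cross-collection comparison: identifiers, fields, and values are
# hypothetical; only the gap/conflict logic is shown.

collection_a = {
    "obj-001": {"epoch": "2023-01-01", "status": "active"},
    "obj-002": {"epoch": "2023-01-05", "status": "decayed"},
}
collection_b = {
    "obj-002": {"epoch": "2023-01-05", "status": "active"},
    "obj-003": {"epoch": "2023-02-10", "status": "active"},
}

# Gaps: identifiers present in one collection but missing from the other.
missing_from_a = collection_b.keys() - collection_a.keys()
missing_from_b = collection_a.keys() - collection_b.keys()

# Conflicts: shared identifiers whose records disagree on at least one field.
conflicts = {
    key: (collection_a[key], collection_b[key])
    for key in collection_a.keys() & collection_b.keys()
    if collection_a[key] != collection_b[key]
}

print("missing from A:", sorted(missing_from_a))
print("missing from B:", sorted(missing_from_b))
print("conflicting records:", conflicts)
```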

    Updating the DCC Curation Lifecycle Model

    The DCC Curation Lifecycle Model has played a vital role in the field of data curation for over a decade. During that time, the scale and complexity of data have changed dramatically, along with the contexts of data production and use. This paper reports on a study examining factors impacting data curation practices and presents recommendations for updating the DCC Curation Lifecycle Model. The study was grounded in a review of other lifecycle models and informed by a site visit to the Digital Curation Centre and consultation with expert practitioners and researchers. Framed by contemporary conditions impacting the conduct of research and the provision of data services, the analysis and proposed recommendations account for the prominence of machine-actionable data, the importance of machine learning for data processing and analytics, the growth of integrated research workflows, and escalating concerns with the fairness, accountability, and transparency of data and algorithms.