
    Extracting, Transforming and Archiving Scientific Data

    It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, and (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its long-term storage. We review some existing solutions and propose novel avenues of research.
    Comment: 8 pages, Fourth Workshop on Very Large Digital Libraries, 201
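    The abstract names the ETA stages but gives no implementation detail here. As a minimal sketch only, using invented identifiers (Record, extract, transform, archive, LEGACY_SOURCE) rather than anything from the paper, a curation pipeline staged along these lines might look like:

    ```python
    # Hypothetical sketch of an Extract-Transform-Archive pipeline;
    # all identifiers are illustrative, not taken from the paper.
    from dataclasses import dataclass, field

    # Toy stand-in for a heterogeneous legacy source.
    LEGACY_SOURCE = {"survey_2009.csv": b"id,value\n1,42\n"}

    @dataclass
    class Record:
        source: str                       # where the legacy data came from
        payload: bytes                    # raw extracted content
        metadata: dict = field(default_factory=dict)

    def extract(name: str) -> Record:
        """Extract: pull raw legacy data and record its provenance."""
        return Record(source=name, payload=LEGACY_SOURCE[name],
                      metadata={"origin": name})

    def transform(rec: Record) -> Record:
        """Transform: normalize and enrich metadata so the record can be
        indexed later despite heterogeneous inputs."""
        rec.metadata["size_bytes"] = len(rec.payload)
        rec.metadata["media_type"] = "text/csv"   # assumed for this toy input
        return rec

    def archive(rec: Record, repository: dict) -> str:
        """Archive: assign a persistent identifier and store data and
        metadata together for long-term preservation."""
        pid = f"pid:{len(repository):06d}"
        repository[pid] = rec
        return pid

    repository = {}
    print("archived as", archive(transform(extract("survey_2009.csv")), repository))
    ```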

    Integrating Industry and National Economic Accounts: First Steps and Future Improvements

    The integration of the annual I-O accounts with the GDP-by-industry accounts is the most recent in a series of improvements to the industry accounts provided by the BEA in recent years. BEA prepares two sets of national industry accounts: the I-O accounts, which consist of the benchmark I-O accounts and the annual I-O accounts, and the GDP-by-industry accounts. Both the I-O accounts and the GDP-by-industry accounts present measures of gross output, intermediate inputs, and value added by industry. In the past, however, they were inconsistent because of the use of different methodologies, classification frameworks, and source data. The integration of these accounts eliminated these inconsistencies and improved the accuracy of both sets of accounts. The integration of the annual industry accounts represents a major advance in the timeliness, accuracy, and consistency of these accounts, and is a result of significant improvements in BEA's estimating methods. The paper describes the new methodology and the future steps required to integrate the industry accounts with the NIPAs. The new methodology combines source data between the two industry accounts to improve accuracy; it prepares the newly integrated accounts within an I-O framework that balances and reconciles industry production with commodity usage. Moreover, the new methodology accelerates the release of the annual I-O accounts by 2 years and, for the first time, provides a consistent time series of annual I-O accounts. Three appendices are provided: a description of the probability-based method to rank source data by quality; a description of the new balancing procedure for producing the annual I-O accounts; and a description of the computation method used to estimate chain-type price and quantity indexes in the GDP-by-industry accounts.
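    BEA's chain-type price and quantity indexes are based on the Fisher ideal formula, the geometric mean of Laspeyres and Paasche indexes chained across adjacent periods. The sketch below shows that computation for a quantity index; the sample data and function names are invented and this is not BEA's implementation:

    ```python
    import math

    def fisher_quantity_link(p0, q0, p1, q1):
        """One Fisher quantity link between adjacent periods: the geometric
        mean of the Laspeyres and Paasche quantity relatives."""
        laspeyres = (sum(p * q for p, q in zip(p0, q1)) /
                     sum(p * q for p, q in zip(p0, q0)))
        paasche   = (sum(p * q for p, q in zip(p1, q1)) /
                     sum(p * q for p, q in zip(p1, q0)))
        return math.sqrt(laspeyres * paasche)

    def chain_quantity_index(prices, quantities, base=100.0):
        """Chain the per-period Fisher links into a cumulative index."""
        index = [base]
        for t in range(1, len(prices)):
            link = fisher_quantity_link(prices[t - 1], quantities[t - 1],
                                        prices[t], quantities[t])
            index.append(index[-1] * link)
        return index

    # Invented example: two commodities observed over three periods.
    prices     = [[1.0, 2.0], [1.1, 1.9], [1.2, 2.1]]
    quantities = [[10.0, 5.0], [11.0, 5.5], [12.0, 5.2]]
    print(chain_quantity_index(prices, quantities))
    ```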

    Using Fuzzy Linguistic Representations to Provide Explanatory Semantics for Data Warehouses

    A data warehouse integrates large amounts of extracted and summarized data from multiple sources for direct querying and analysis. While it provides decision makers with easy access to such historical and aggregate data, the real meaning of the data is often ignored. For example, whether a total sales amount of 1,000 items indicates good or bad sales performance is still unclear. From the decision makers' point of view, it is the semantics that convey the meaning of the data, rather than the raw numbers, that matter. In this paper, we explore the use of fuzzy technology to provide such semantics for the summarizations and aggregates developed in data warehousing systems. A three-layered data warehouse semantic model, consisting of quantitative (numerical) summarization, qualitative (categorical) summarization, and quantifier summarization, is proposed for capturing and explicating the semantics of warehoused data. Based on the model, several algebraic operators are defined. We also extend the SQL language to allow for flexible queries against such enhanced data warehouses.
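    As an illustration of the qualitative-summarization idea (a sketch under assumed membership functions, not the paper's actual model or operators), fuzzy membership functions can translate a numeric aggregate such as a sales total into linguistic labels:

    ```python
    def trapezoid(x, a, b, c, d):
        """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear shoulders."""
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)

    def rising(x, a, b):
        """Rising shoulder: 0 below a, 1 above b, linear in between."""
        if x <= a:
            return 0.0
        if x >= b:
            return 1.0
        return (x - a) / (b - a)

    # Invented linguistic terms for a monthly sales total (units sold).
    TERMS = {
        "poor":    lambda x: trapezoid(x, -1, 0, 400, 800),
        "average": lambda x: trapezoid(x, 500, 800, 1200, 1500),
        "good":    lambda x: rising(x, 1200, 1600),
    }

    def describe(total):
        """Map a numeric aggregate to linguistic labels with membership degrees."""
        return {term: round(mu(total), 2) for term, mu in TERMS.items()}

    print(describe(1000))   # {'poor': 0.0, 'average': 1.0, 'good': 0.0}
    ```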

    Data Cleaning: Problems and Current Approaches

    We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL (extract, transform, load) process. We also discuss current tool support for data cleaning.
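    As a small, generic illustration of the instance-level cleaning this paper surveys (the record layout and rules are invented; the paper's taxonomy and tools are not reproduced here), consider normalizing field formats and deduplicating while merging two heterogeneous sources:

    ```python
    # Illustrative instance-level cleaning during source integration.
    source_a = [{"name": "  Alice Smith ", "email": "ALICE@EXAMPLE.COM"},
                {"name": "Bob Jones",      "email": "bob@example.com"}]
    source_b = [{"name": "alice smith",    "email": "alice@example.com"},
                {"name": "Carol Wu",       "email": "carol@example.com "}]

    def normalize(rec):
        """Standardize formats so equivalent records compare equal."""
        return {"name": " ".join(rec["name"].split()).title(),
                "email": rec["email"].strip().lower()}

    def integrate(*sources):
        """Normalize records, then deduplicate on the cleaned email key."""
        seen, merged = set(), []
        for src in sources:
            for rec in map(normalize, src):
                if rec["email"] not in seen:
                    seen.add(rec["email"])
                    merged.append(rec)
        return merged

    for rec in integrate(source_a, source_b):
        print(rec)
    # Alice appears once despite differing spellings across the two sources.
    ```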