
    Extracting, Transforming and Archiving Scientific Data

    It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, and (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its long-term storage. We review some existing solutions and propose novel avenues of research.
    Comment: 8 pages, Fourth Workshop on Very Large Digital Libraries, 201
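    The abstract names the ETA stages but gives no implementation detail here. As a minimal sketch only, using invented identifiers (Record, extract, transform, archive, LEGACY_SOURCE) rather than anything from the paper, a curation pipeline staged along these lines might look like:

    ```python
    # Hypothetical sketch of an Extract-Transform-Archive pipeline;
    # all identifiers are illustrative, not taken from the paper.
    from dataclasses import dataclass, field

    # Toy stand-in for a heterogeneous legacy source.
    LEGACY_SOURCE = {"survey_2009.csv": b"id,value\n1,42\n"}

    @dataclass
    class Record:
        source: str                       # where the legacy data came from
        payload: bytes                    # raw extracted content
        metadata: dict = field(default_factory=dict)

    def extract(name: str) -> Record:
        """Extract: pull raw legacy data and record its provenance."""
        return Record(source=name, payload=LEGACY_SOURCE[name],
                      metadata={"origin": name})

    def transform(rec: Record) -> Record:
        """Transform: normalize and enrich metadata so the record can be
        indexed later despite heterogeneous inputs."""
        rec.metadata["size_bytes"] = len(rec.payload)
        rec.metadata["media_type"] = "text/csv"   # assumed for this toy input
        return rec

    def archive(rec: Record, repository: dict) -> str:
        """Archive: assign a persistent identifier and store data and
        metadata together for long-term preservation."""
        pid = f"pid:{len(repository):06d}"
        repository[pid] = rec
        return pid

    repository = {}
    print("archived as", archive(transform(extract("survey_2009.csv")), repository))
    ```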

    Integrating Industry and National Economic Accounts: First Steps and Future Improvements

    The integration of the annual I-O accounts with the GDP-by-industry accounts is the most recent in a series of improvements to the industry accounts provided by the BEA in recent years. BEA prepares two sets of national industry accounts: the I-O accounts, which consist of the benchmark I-O accounts and the annual I-O accounts, and the GDP-by-industry accounts. Both the I-O accounts and the GDP-by-industry accounts present measures of gross output, intermediate inputs, and value added by industry. In the past, however, they were inconsistent because of the use of different methodologies, classification frameworks, and source data. The integration of these accounts eliminated these inconsistencies and improved the accuracy of both sets of accounts. The integration of the annual industry accounts represents a major advance in the timeliness, accuracy, and consistency of these accounts, and is a result of significant improvements in BEA's estimating methods. The paper describes the new methodology and the future steps required to integrate the industry accounts with the NIPAs. The new methodology combines source data between the two industry accounts to improve accuracy; it prepares the newly integrated accounts within an I-O framework that balances and reconciles industry production with commodity usage. Moreover, the new methodology accelerates the release of the annual I-O accounts by 2 years and, for the first time, provides a consistent time series of annual I-O accounts. Three appendices are provided: a description of the probability-based method to rank source data by quality; a description of the new balancing procedure for producing the annual I-O accounts; and a description of the computation method used to estimate chain-type price and quantity indexes in the GDP-by-industry accounts.
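    BEA's chain-type price and quantity indexes are based on the Fisher ideal formula, the geometric mean of Laspeyres and Paasche indexes chained across adjacent periods. The sketch below shows that computation for a quantity index; the sample data and function names are invented and this is not BEA's implementation:

    ```python
    import math

    def fisher_quantity_link(p0, q0, p1, q1):
        """One Fisher quantity link between adjacent periods: the geometric
        mean of the Laspeyres and Paasche quantity relatives."""
        laspeyres = (sum(p * q for p, q in zip(p0, q1)) /
                     sum(p * q for p, q in zip(p0, q0)))
        paasche   = (sum(p * q for p, q in zip(p1, q1)) /
                     sum(p * q for p, q in zip(p1, q0)))
        return math.sqrt(laspeyres * paasche)

    def chain_quantity_index(prices, quantities, base=100.0):
        """Chain the per-period Fisher links into a cumulative index."""
        index = [base]
        for t in range(1, len(prices)):
            link = fisher_quantity_link(prices[t - 1], quantities[t - 1],
                                        prices[t], quantities[t])
            index.append(index[-1] * link)
        return index

    # Invented example: two commodities observed over three periods.
    prices     = [[1.0, 2.0], [1.1, 1.9], [1.2, 2.1]]
    quantities = [[10.0, 5.0], [11.0, 5.5], [12.0, 5.2]]
    print(chain_quantity_index(prices, quantities))
    ```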

    Using Fuzzy Linguistic Representations to Provide Explanatory Semantics for Data Warehouses

    A data warehouse integrates large amounts of extracted and summarized data from multiple sources for direct querying and analysis. While it provides decision makers with easy access to such historical and aggregate data, the real meaning of the data is often ignored. For example, whether a total sales amount of 1,000 items indicates good or bad sales performance is still unclear. From the decision makers' point of view, it is the semantics that convey the meaning of the data, rather than the raw numbers, that matter. In this paper, we explore the use of fuzzy technology to provide such semantics for the summarizations and aggregates developed in data warehousing systems. A three-layered data warehouse semantic model, consisting of quantitative (numerical) summarization, qualitative (categorical) summarization, and quantifier summarization, is proposed for capturing and explicating the semantics of warehoused data. Based on the model, several algebraic operators are defined. We also extend the SQL language to allow for flexible queries against such enhanced data warehouses.
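    As an illustration of the qualitative-summarization idea (a sketch under assumed membership functions, not the paper's actual model or operators), fuzzy membership functions can translate a numeric aggregate such as a sales total into linguistic labels:

    ```python
    def trapezoid(x, a, b, c, d):
        """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear shoulders."""
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)

    def rising(x, a, b):
        """Rising shoulder: 0 below a, 1 above b, linear in between."""
        if x <= a:
            return 0.0
        if x >= b:
            return 1.0
        return (x - a) / (b - a)

    # Invented linguistic terms for a monthly sales total (units sold).
    TERMS = {
        "poor":    lambda x: trapezoid(x, -1, 0, 400, 800),
        "average": lambda x: trapezoid(x, 500, 800, 1200, 1500),
        "good":    lambda x: rising(x, 1200, 1600),
    }

    def describe(total):
        """Map a numeric aggregate to linguistic labels with membership degrees."""
        return {term: round(mu(total), 2) for term, mu in TERMS.items()}

    print(describe(1000))   # {'poor': 0.0, 'average': 1.0, 'good': 0.0}
    ```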

    Data Cleaning: Problems and Current Approaches

    We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL (extract, transform, load) process. We also discuss current tool support for data cleaning.
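    As a small, generic illustration of the instance-level cleaning this paper surveys (the record layout and rules are invented; the paper's taxonomy and tools are not reproduced here), consider normalizing field formats and deduplicating while merging two heterogeneous sources:

    ```python
    # Illustrative instance-level cleaning during source integration.
    source_a = [{"name": "  Alice Smith ", "email": "ALICE@EXAMPLE.COM"},
                {"name": "Bob Jones",      "email": "bob@example.com"}]
    source_b = [{"name": "alice smith",    "email": "alice@example.com"},
                {"name": "Carol Wu",       "email": "carol@example.com "}]

    def normalize(rec):
        """Standardize formats so equivalent records compare equal."""
        return {"name": " ".join(rec["name"].split()).title(),
                "email": rec["email"].strip().lower()}

    def integrate(*sources):
        """Normalize records, then deduplicate on the cleaned email key."""
        seen, merged = set(), []
        for src in sources:
            for rec in map(normalize, src):
                if rec["email"] not in seen:
                    seen.add(rec["email"])
                    merged.append(rec)
        return merged

    for rec in integrate(source_a, source_b):
        print(rec)
    # Alice appears once despite differing spellings across the two sources.
    ```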