25 research outputs found

    Challenges in Integrating Biological Data Sources

    In this report, we examine the technical challenges to integration, critique the available tools and resources, and compare the costs and advantages of various methodologies. We begin by analyzing the basic steps in strict and complete integration: 1) transformation of the various schemas to a common data model; 2) matching of semantically related schema objects; 3) schema integration; 4) transformation of data to the federated database on demand; and 5) matching of semantically equivalent data. Some progress has been made on generic problems such as (1) and (3) within the wider database community, but issues of semantics (steps (2) and (5)) have been dealt with successfully, to any degree, only by domain experts within the biological community. We then look at the solution space of integration strategies as defined by two axes, the "tightness" of federation and the "degree" of instantiation, discuss where various solutions fall on this plane, and examine their costs, advantages, and disadvantages. Finally, we examine technical challenges that are not ...

    Fine-Grained Provenance And Applications To Data Analytics Computation

    Data provenance tools seek to facilitate reproducible data science and auditable data analyses by capturing the analytics steps used in generating data analysis results. However, analysts must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but support only a limited subset of data science tasks. None of these solutions is well suited for tracing errors introduced during common ETL, record alignment, and matching tasks for data types such as strings and images. Additionally, we need a provenance archival layer to store and manage the tracked fine-grained provenance, enabling future sophisticated reasoning about why individual output results appear or fail to appear. For reproducibility and auditing, the provenance archival system should be tamper-resistant. At the same time, the provenance collected over time or within the same query computation tends to be partially repeated (i.e., the same operation is applied to the same input records in an intermediate computation step). Hence, we want efficient provenance storage that compresses repeated results. We address these challenges with novel formalisms and algorithms, implemented in the PROVision system, for reconstructing fine-grained provenance for a broad class of ETL-style workflows. We extend database-style provenance techniques to capture equivalences, support optimizations, and enable lazy evaluation. We develop solutions for storing fine-grained provenance in relational storage systems while both compressing it and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using both scientific and OLAP workloads.

    Advances in Data Mining Knowledge Discovery and Applications

    Advances in Data Mining Knowledge Discovery and Applications aims to help data miners, researchers, scholars, and PhD students who wish to apply data mining techniques. The primary contribution of this book is to highlight frontier fields and implementations of knowledge discovery and data mining. It may seem that the same things are repeated, but in general the same approaches and techniques can help in different fields and areas of expertise. This book presents knowledge discovery and data mining applications in two sections. Data mining is known to cover areas of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other fields. In this book, most of these areas are covered with different data mining applications. The eighteen chapters have been classified into two parts: Knowledge Discovery and Data Mining Applications.

    Faculty Publications & Presentations, 2006-2007
