464 research outputs found

    Hypothetical Reasoning via Provenance Abstraction

    Data analytics often involves hypothetical reasoning: repeatedly modifying the data and observing the induced effect on the computation result of a data-centric application. Previous work has shown that fine-grained data provenance can help make such an analysis more efficient: instead of a costly re-execution of the underlying application, hypothetical scenarios are applied to a pre-computed provenance expression. However, storing provenance for complex queries and large-scale data leads to a significant overhead, which is often a barrier to the incorporation of provenance-based solutions. To this end, we present a framework that reduces provenance size. Our approach is based on reducing the provenance granularity using user-defined abstraction trees over the provenance variables; the granularity is based on the anticipated hypothetical scenarios. We formalize the tradeoff between provenance size and supported granularity of the hypothetical reasoning, study the complexity of the resulting optimization problem, and provide efficient algorithms for tractable cases and heuristics for others. We experimentally study the performance of our solution for various queries and abstraction trees. Our study shows that the algorithms generally lead to substantial speedup of hypothetical reasoning, with a reasonable loss of accuracy.

    Dataset search: a survey

    Generating value from data requires the ability to find, access and make sense of datasets. There are many efforts underway to encourage data sharing and reuse, from scientific publishers asking authors to submit data alongside manuscripts to data marketplaces, open data portals and data communities. Google recently beta released a search service for datasets, which allows users to discover data stored in various online repositories via keyword queries. These developments foreshadow an emerging research field around dataset search or retrieval that broadly encompasses frameworks, methods and tools that help match a user data need against a collection of datasets. Here, we survey the state of the art of research and commercial systems in dataset retrieval. We identify what makes dataset search a research field in its own right, with unique challenges and methods, and highlight open problems. We look at approaches and implementations from related areas dataset search is drawing upon, including information retrieval, databases, entity-centric and tabular search, in order to identify possible paths to resolve these open problems as well as immediate next steps that will take the field forward. Comment: 20 pages, 153 references.

    The Semantic Grid: A future e-Science infrastructure

    e-Science offers a promising vision of how computer and communication technology can support and enhance the scientific process. It does this by enabling scientists to generate, analyse, share and discuss their insights, experiments and results in an effective manner. The underlying computer infrastructure that provides these facilities is commonly referred to as the Grid. At this time, there are a number of grid applications being developed and there is a whole raft of computer technologies that provide fragments of the necessary functionality. However, there is currently a major gap between these endeavours and the vision of e-Science in which there is a high degree of easy-to-use and seamless automation and in which there are flexible collaborations and computations on a global scale. To bridge this practice–aspiration divide, this paper presents a research agenda whose aim is to move from the current state of the art in e-Science infrastructure to the future infrastructure that is needed to support the full richness of the e-Science vision. Here the future e-Science research infrastructure is termed the Semantic Grid (Semantic Grid to Grid is meant to connote a similar relationship to the one that exists between the Semantic Web and the Web). In particular, we present a conceptual architecture for the Semantic Grid. This architecture adopts a service-oriented perspective in which distinct stakeholders in the scientific process, represented as software agents, provide services to one another, under various service level agreements, in various forms of marketplace. We then focus predominantly on the issues concerned with the way that knowledge is acquired and used in such environments, since we believe this is the key differentiator between current grid endeavours and those envisioned for the Semantic Grid.

    Doctor of Philosophy

    Serving as a record of what happened during a scientific process, often computational, provenance has become an important piece of computing. The importance of archiving not only data and results but also the lineage of these entities has led to a variety of systems that capture provenance as well as models and schemas for this information. Despite significant work focused on obtaining and modeling provenance, there has been little work on managing and using this information. Using the provenance from past work, it is possible to mine common computational structure or determine differences between executions. Such information can be used to suggest possible completions for partial workflows, summarize a set of approaches, or extend past work in new directions. These applications require infrastructure to support efficient queries and accessible reuse. In order to support knowledge discovery and reuse from provenance information, the management of those data is important. One component of provenance is the specification of the computations; workflows provide structured abstractions of code and are commonly used for complex tasks. Using change-based provenance, it is possible to store large numbers of similar workflows compactly. This storage also allows efficient computation of differences between specifications. However, querying for specific structure across a large collection of workflows is difficult because comparing graphs depends on computing subgraph isomorphism, which is NP-complete. Graph indexing methods identify features that help distinguish graphs of a collection to filter results for a subgraph containment query and reduce the number of subgraph isomorphism computations. For provenance, this work extends these methods to work for more exploratory queries and collections with significant overlap.
    However, comparing workflow or provenance graphs may not require exact equality; a match between two graphs may allow paired nodes to be similar yet not equivalent. This work presents techniques to better correlate graphs to help summarize collections. Using this infrastructure, provenance can be reused so that users can learn from their own and others' history. Just as textual search has been augmented with suggested completions based on past or common queries, provenance can be used to suggest how computations can be completed or which steps might connect to a given subworkflow. In addition, provenance can help further science by accelerating publication and reuse. By incorporating provenance into publications, authors can more easily integrate their results, and readers can more easily verify and repeat results. However, reusing past computations requires maintaining stronger associations with any input data and underlying code as well as providing paths for migrating old work to new hardware or algorithms. This work presents a framework for maintaining data and code as well as supporting upgrades for workflow computations.

    Data Processing Methodologies to Investigate the Association between Depositional Environments and Trace Fossil Occurrence

    The transition from late Ediacaran to early Cambrian records important paleobiological and paleoecological changes. These are observed in the Fortunian diversification event and the Agronomic Revolution, which describe significant body plan diversification, increased behavioral complexity in trace fossils, and a shift from matground to mixground ecosystems. To provide a more thorough understanding of this dramatic transition, data mining techniques (i.e., visual and statistical analysis) are used to investigate the relationship between depositional environments and trace fossil occurrence. To facilitate analysis, an ichnological database has been designed and implemented using Microsoft Access. The creation of this database is important in that it provides a platform for data digitization and subsequent data mining, while also accounting for fundamental differences between trace fossils and body fossils. Current paleontology databases do not recognize this distinction, which stems from the fact that trace fossils represent organism behavior, while body fossils record the phylogenetic affinities of an organism. Analysis of the ichnologic data compiled is supported with additional datasets, with a large focus on utilizing detrital zircon to infer geodynamic settings and to provide validation of paleogeographic reconstruction models via visual provenance analysis. A more quantified version of detrital zircon provenance analysis by way of Multidimensional Scaling (MDS) was conducted; however, this study has shown that MDS is best utilized at a regional scale. In combining all supplementary datasets, paleogeographic reconstructions for the Ediacaran, Terreneuvian, and Cambrian Epoch 2 have been constructed. With an appropriate spatial and temporal context, visual analysis of ichnologic data displays a global distribution of trace fossils through this transition, implying the utilization of available ecospace and a lack of paleoclimatic restrictions.
    Statistical analysis in the form of Correspondence Analysis (CA) displays a clear lack of relationships between ichnogenera and depositional environments during the Ediacaran, suggesting trace fossils were facies-crossing prior to Phanerozoic-style ecosystems. CA produces markedly different results during the early Cambrian, displaying ichnogenera differentiation between depositional environments (i.e., increasing beta ichnodiversity) in the relationship between Oldhamia and deep marine depositional environments. These results lend support to the Agronomic Revolution, as microbial matgrounds were forced into increasingly stressful paleoenvironments (i.e., deep marine settings) during this paleoecological revolution.