    Algorithms for Provisioning Queries and Analytics

    Provisioning is a technique for avoiding repeated expensive computations in what-if analysis. Given a query, an analyst formulates k hypotheticals, each retaining some of the tuples of a database instance, possibly overlapping, and she wishes to answer the query under scenarios, where a scenario is defined by a subset of the hypotheticals that are "turned on". We say that a query admits compact provisioning if, given any database instance and any k hypotheticals, one can create a poly-size (in k) sketch that can then be used to answer the query under any of the 2^k possible scenarios without accessing the original instance. In this paper, we focus on provisioning complex queries that combine relational algebra (the logical component), grouping, and statistics/analytics (the numerical component). We first show that queries that compute quantiles or linear regression (as well as simpler queries that compute count and sum/average of positive values) can be compactly provisioned to provide (multiplicative) approximate answers to an arbitrary precision. In contrast, exact provisioning for each of these statistics requires the sketch size to be exponential in k. We then establish that for any complex query whose logical component is a positive relational algebra query, as long as the numerical component can be compactly provisioned, the complex query itself can be compactly provisioned. On the other hand, introducing negation or recursion in the logical component again requires the sketch size to be exponential in k. While our positive results use algorithms that do not access the original instance after a scenario is known, we prove our lower bounds even for the case when, knowing the scenario, limited access to the instance is allowed.
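
    To make the setup concrete, the following toy Python sketch spells out the scenario semantics and the naive exact "sketch" that precomputes an answer for every one of the 2^k scenarios; the SUM query, data, and function names are illustrative assumptions, not the paper's compact provisioning algorithms, and the exponential enumeration is precisely the blow-up its lower bounds refer to.

```python
from itertools import combinations

# Toy illustration (not the paper's algorithm): a database instance is a list of
# numeric tuples, and each of the k hypotheticals is the set of tuple indices it
# retains.  A scenario "turns on" a subset of hypotheticals; the query is then
# evaluated over the union of the tuples they retain.
def answer_under_scenario(instance, hypotheticals, scenario):
    retained = set().union(*(hypotheticals[i] for i in scenario))
    return sum(instance[i] for i in retained)  # example numerical query: SUM

def exact_sketch(instance, hypotheticals):
    """Naive exact provisioning: precompute the answer for all 2^k scenarios.
    The resulting sketch is exponential in k, which is the blow-up the paper
    shows is unavoidable for exact answers."""
    k = len(hypotheticals)
    sketch = {}
    for r in range(k + 1):
        for scenario in combinations(range(k), r):
            sketch[frozenset(scenario)] = answer_under_scenario(instance, hypotheticals, scenario)
    return sketch

instance = [3.0, 1.0, 4.0, 1.5]               # one numeric attribute per tuple
hypotheticals = [{0, 1}, {1, 2}, {2, 3}]      # overlapping subsets of tuple indices
sketch = exact_sketch(instance, hypotheticals)
print(sketch[frozenset({0, 2})])              # 9.5, answered without revisiting the instance
```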

    Using Provenance to support Good Laboratory Practice in Grid Environments

    Conducting experiments and documenting results is the daily business of scientists. Good and traceable documentation enables other scientists to confirm procedures and results for increased credibility. Documentation and scientific conduct are regulated and termed "good laboratory practice." Laboratory notebooks are used to record each step in conducting an experiment and processing data. Originally, these notebooks were paper based. With computerised research systems, acquired data became more elaborate, increasing the need for electronic notebooks with data storage, computational features and reliable electronic documentation. As a new approach to this, a scientific data management system (DataFinder) is enhanced with features for traceable documentation. Provenance recording is used to meet requirements of traceability, and this information can later be queried for further analysis. DataFinder has further important features for scientific documentation: it employs a heterogeneous and distributed data storage concept, which enables access to different types of data storage systems (e.g. Grid data infrastructure, file servers). In this chapter we describe a number of building blocks that are available or close to completion. These components are intended for assembling an electronic laboratory notebook for use in Grid environments, while retaining maximal flexibility in usage scenarios and maximal compatibility between the components. Through the use of such a system, provenance can successfully be used to trace the scientific workflow of preparation, execution, evaluation, interpretation and archiving of research data. The reliability of research results increases and the research process remains transparent to remote research partners. Comment: Book Chapter for "Data Provenance and Data Management for eScience," of Studies in Computational Intelligence series, Springer. 25 pages, 8 figures
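
    As a rough illustration of how recorded provenance can later be queried for traceability (the record fields and function names below are illustrative assumptions, not DataFinder's actual interface):

```python
import datetime

# Minimal provenance log: each record captures who did what, when, on which
# inputs, producing which outputs.
provenance_log = []

def record(activity, agent, inputs, outputs):
    provenance_log.append({
        "activity": activity,
        "agent": agent,
        "inputs": list(inputs),
        "outputs": list(outputs),
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

def trace(artifact):
    """Walk the log backwards to reconstruct how an artifact was produced."""
    steps, frontier = [], {artifact}
    for rec in reversed(provenance_log):
        if frontier & set(rec["outputs"]):
            steps.append(rec)
            frontier |= set(rec["inputs"])
    return list(reversed(steps))

record("acquire", "alice", [], ["raw.dat"])
record("calibrate", "alice", ["raw.dat"], ["calibrated.dat"])
record("plot", "bob", ["calibrated.dat"], ["figure1.png"])
print([s["activity"] for s in trace("figure1.png")])  # ['acquire', 'calibrate', 'plot']
```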

    Extracting, Transforming and Archiving Scientific Data

    It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its long-term storage. We review some existing solutions and propose novel avenues of research. Comment: 8 pages, Fourth Workshop on Very Large Digital Libraries, 201
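
    A minimal sketch of what an Extract, Transform and Archive pipeline skeleton could look like; the stage names follow the ETA model, but the interfaces, metadata fields and file conventions are illustrative assumptions rather than the paper's implementation:

```python
import json
import pathlib
import shutil

def extract(source_dir):
    """Extract: walk a legacy data directory and read basic metadata off each file."""
    for path in pathlib.Path(source_dir).glob("*"):
        if path.is_file():
            yield path, {"name": path.name, "size_bytes": path.stat().st_size}

def transform(path, metadata):
    """Transform: normalize the harvested metadata into an archival description."""
    metadata["format"] = path.suffix.lstrip(".").lower() or "unknown"
    return metadata

def archive(path, metadata, repository_dir):
    """Archive: store the file plus a JSON metadata sidecar in the repository."""
    repo = pathlib.Path(repository_dir)
    repo.mkdir(parents=True, exist_ok=True)
    shutil.copy2(path, repo / path.name)
    (repo / (path.name + ".meta.json")).write_text(json.dumps(metadata, indent=2))

def run_eta(source_dir, repository_dir):
    for path, metadata in extract(source_dir):
        archive(path, transform(path, metadata), repository_dir)

# Example usage (paths are placeholders):
# run_eta("legacy_data/", "long_term_archive/")
```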

    Open Data, Grey Data, and Stewardship: Universities at the Privacy Frontier

    As universities recognize the inherent value in the data they collect and hold, they encounter unforeseen challenges in stewarding those data in ways that balance accountability, transparency, and protection of privacy, academic freedom, and intellectual property. Two parallel developments in academic data collection are converging: (1) open access requirements, whereby researchers must provide access to their data as a condition of obtaining grant funding or publishing results in journals; and (2) the vast accumulation of 'grey data' about individuals in their daily activities of research, teaching, learning, services, and administration. The boundaries between research and grey data are blurring, making it more difficult to assess the risks and responsibilities associated with any data collection. Many sets of data, both research and grey, fall outside privacy regulations such as HIPAA, FERPA, and PII. Universities are exploiting these data for research, learning analytics, faculty evaluation, strategic decisions, and other sensitive matters. Commercial entities are besieging universities with requests for access to data or for partnerships to mine them. The privacy frontier facing research universities spans open access practices, uses and misuses of data, public records requests, cyber risk, and curating data for privacy protection. This paper explores the competing values inherent in data stewardship and makes recommendations for practice, drawing on the pioneering work of the University of California in privacy and information security, data governance, and cyber risk. Comment: Final published version, Sept 30, 201

    From Sensor to Observation Web with Environmental Enablers in the Future Internet

    This paper outlines the grand challenges in global sustainability research and the objectives of the FP7 Future Internet PPP program within the Digital Agenda for Europe. Large user communities are generating significant amounts of valuable environmental observations at local and regional scales using the devices and services of the Future Internet. These communities’ environmental observations represent a wealth of information which is currently hardly used, or used only in isolation, and is therefore in need of integration with other information sources. Indeed, this very integration will lead to a paradigm shift from a mere Sensor Web to an Observation Web with semantically enriched content emanating from sensors, environmental simulations and citizens. The paper also describes the research challenges to realize the Observation Web and the associated environmental enablers for the Future Internet. Such an environmental enabler could, for instance, be an electronic sensing device, a web-service application, or even a social networking group affording or facilitating the capability of Future Internet applications to consume, produce, and use environmental observations in cross-domain applications. The term "envirofied" Future Internet is coined to describe this overall target, which forms a cornerstone of work in the Environmental Usage Area within the Future Internet PPP program. Relevant trends described in the paper are the usage of ubiquitous sensors (anywhere), the provision and generation of information by citizens, and the convergence of real and virtual realities to convey understanding of environmental observations. The paper addresses the technical challenges in the Environmental Usage Area and the need for designing a multi-style service-oriented architecture. Key topics are the mapping of requirements to capabilities, and providing scalability and robustness while implementing context-aware information retrieval. Another essential research topic is handling data fusion and model-based computation, and the related propagation of information uncertainty. Approaches to security, standardization and harmonization, all essential for sustainable solutions, are summarized from the perspective of the Environmental Usage Area. The paper concludes with an overview of emerging, high-impact applications in the environmental areas concerning land ecosystems (biodiversity), air quality (atmospheric conditions) and water ecosystems (marine asset management).

    Using Machine Learning to Infer Reasoning Provenance from User Interaction Log Data

    The reconstruction of analysts’ reasoning processes (reasoning provenance) during complex sensemaking tasks can support reflection and decision making. One potential approach to such reconstruction is to automatically infer reasoning from low-level user interaction logs. We explore a novel method for doing this using machine learning. Two user studies were conducted in which participants performed similar intelligence analysis tasks. In one study, participants used a standard web browser and word processor; in the other, they used a system called INVISQUE (Interactive Visual Search and Query Environment). Interaction logs were manually coded for cognitive actions on the basis of captured think-aloud protocols and post-task interviews, using Klein, Phillips, Rall, and Peluso's data/frame model of sensemaking as a conceptual framework. This analysis was then used to train an interaction frame mapper, which employed multiple machine learning models to learn relationships between the interaction logs and the codings. Our results show that, for one study at least, classification accuracy was significantly better than chance and compared reasonably with a previously reported manual provenance reconstruction method. We discuss our results in terms of variations in feature sets from the two studies and what this means for the development of the method for provenance capture and the evaluation of sensemaking systems.
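
    The core learning step can be pictured with a small, hedged sketch: featurized interaction-log events paired with analyst-assigned codes are used to train a classifier that predicts the code for unseen events. The features, code labels, and the scikit-learn model below are illustrative assumptions (and assume scikit-learn is installed), not the study's actual interaction frame mapper.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each training example: a featurized window of the interaction log, plus the
# analyst-assigned code from the data/frame coding scheme (labels invented here).
events = [
    {"action": "search", "dwell_s": 12.0, "n_results_opened": 3},
    {"action": "copy",   "dwell_s": 4.0,  "n_results_opened": 0},
    {"action": "search", "dwell_s": 45.0, "n_results_opened": 7},
    {"action": "note",   "dwell_s": 30.0, "n_results_opened": 0},
]
codes = ["seek_data", "preserve_frame", "seek_data", "elaborate_frame"]

# One-hot encode categorical features, then fit a simple multiclass classifier.
mapper = make_pipeline(DictVectorizer(sparse=False), LogisticRegression(max_iter=1000))
mapper.fit(events, codes)
print(mapper.predict([{"action": "search", "dwell_s": 20.0, "n_results_opened": 5}]))
```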

    Chemical applications of eScience to interfacial spectroscopy

    This report is a summary of the work carried out by the author between October 2003 and September 2004, in the first year of his PhD studies.