Algorithms for Provisioning Queries and Analytics
Provisioning is a technique for avoiding repeated expensive computations in
what-if analysis. Given a query, an analyst formulates hypotheticals, each
retaining some of the tuples of a database instance, possibly overlapping, and
she wishes to answer the query under scenarios, where a scenario is defined by
a subset of the hypotheticals that are "turned on". We say that a query admits
compact provisioning if, given any database instance and any set of
hypotheticals, one can create a sketch of size polynomial in the number of
hypotheticals that can then be used to answer the query under any of the
(exponentially many) possible scenarios without accessing the
original instance.
In this paper, we focus on provisioning complex queries that combine
relational algebra (the logical component), grouping, and statistics/analytics
(the numerical component). We first show that queries that compute quantiles or
linear regression (as well as simpler queries that compute count and
sum/average of positive values) can be compactly provisioned to provide
(multiplicative) approximate answers to an arbitrary precision. In contrast,
exact provisioning for each of these statistics requires the sketch size to be
exponential in the number of hypotheticals. We then establish that for any complex query whose logical
component is a positive relational algebra query, as long as the numerical
component can be compactly provisioned, the complex query itself can be
compactly provisioned. On the other hand, introducing negation or recursion in
the logical component again requires the sketch size to be exponential in the number of hypotheticals.
While our positive results use algorithms that do not access the original
instance after a scenario is known, we prove our lower bounds even for the case
when, knowing the scenario, limited access to the instance is allowed.
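To make the scenario model concrete, here is a toy sketch (not the paper's algorithm, and with illustrative names) of naive exact provisioning for a COUNT query: it simply precomputes the answer for every scenario, which makes explicit the exponential dependence on the number of hypotheticals that compact provisioning is meant to avoid.

```python
from itertools import chain, combinations

def scenario_instance(hypotheticals, scenario):
    """Tuples retained when the hypotheticals in `scenario` are turned on."""
    retained = set()
    for i in scenario:
        retained |= hypotheticals[i]
    return retained

def naive_exact_sketch(hypotheticals):
    """Store COUNT(*) for every one of the 2^k scenarios -- exponential in k.
    Compact provisioning asks for a sketch of size polynomial in k instead."""
    k = len(hypotheticals)
    all_scenarios = chain.from_iterable(
        combinations(range(k), r) for r in range(k + 1))
    return {s: len(scenario_instance(hypotheticals, s)) for s in all_scenarios}

# Three hypotheticals, each retaining a subset of a 5-tuple instance {0..4}:
hyps = [{0, 1}, {1, 2, 3}, {3, 4}]
sketch = naive_exact_sketch(hyps)
print(len(sketch))     # 8 scenarios for k = 3
print(sketch[(0, 2)])  # COUNT with hypotheticals 0 and 2 turned on -> 4
```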
Using Provenance to support Good Laboratory Practice in Grid Environments
Conducting experiments and documenting results is the daily business of
scientists. Good and traceable documentation enables other scientists to
confirm procedures and results for increased credibility. Documentation and
scientific conduct are regulated under the term "good laboratory practice."
Laboratory notebooks are used to record each step in conducting an experiment
and processing data. Originally, these notebooks were paper based. With
computerised research systems, acquired data has become more complex,
increasing the need for electronic notebooks with data storage, computational
features, and reliable electronic documentation. As a new approach to this, a
scientific data management system (DataFinder) is enhanced with features for
traceable documentation. Provenance recording is used to meet requirements of
traceability, and this information can later be queried for further analysis.
DataFinder has further important features for scientific documentation: It
employs a heterogeneous and distributed data storage concept. This enables
access to different types of data storage systems (e.g. Grid data
infrastructure, file servers). In this chapter we describe a number of building
blocks that are available or nearing completion. These components
are intended for assembling an electronic laboratory notebook for use in Grid
environments, while retaining maximal flexibility across usage scenarios and
maximal compatibility among the components. Through the use of such a
system, provenance can successfully be used to trace the scientific workflow of
preparation, execution, evaluation, interpretation and archiving of research
data. The reliability of research results increases and the research process
remains transparent to remote research partners.
Comment: Book chapter for "Data Provenance and Data Management for eScience," Studies in Computational Intelligence series, Springer. 25 pages, 8 figures.
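As a rough illustration of the provenance-recording idea, each processing step can store what it used, what it generated, and who ran it, so the chain behind a result can later be queried. The record structure below is a hypothetical sketch, not DataFinder's actual API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    activity: str    # e.g. "acquire", "process", "evaluate"
    used: list       # input data items
    generated: list  # output data items
    agent: str       # researcher or service responsible
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

class ProvenanceLog:
    def __init__(self):
        self.records = []

    def record(self, **kw):
        self.records.append(ProvenanceRecord(**kw))

    def lineage(self, item):
        """All activities that directly or indirectly led to `item`."""
        chain, frontier = [], {item}
        for rec in reversed(self.records):
            if frontier & set(rec.generated):
                chain.append(rec)
                frontier |= set(rec.used)
        return chain

log = ProvenanceLog()
log.record(activity="acquire", used=["sensor"], generated=["raw.dat"], agent="alice")
log.record(activity="process", used=["raw.dat"], generated=["result.csv"], agent="alice")
print([r.activity for r in log.lineage("result.csv")])  # ['process', 'acquire']
```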
Extracting, Transforming and Archiving Scientific Data
It is becoming common to archive research datasets that are not only large
but also numerous. In addition, their corresponding metadata and the software
required to analyse or display them need to be archived. Yet the manual
curation of research data can be difficult and expensive, particularly in very
large digital repositories, hence the importance of models and tools for
automating digital curation tasks. The automation of these tasks faces three
major challenges: (1) research data and data sources are highly heterogeneous,
(2) future research needs are difficult to anticipate, (3) data is hard to
index. To address these problems, we propose the Extract, Transform and Archive
(ETA) model for managing and mechanizing the curation of research data.
Specifically, we propose a scalable strategy for addressing the research-data
problem, ranging from the extraction of legacy data to its long-term storage.
We review some existing solutions and propose novel avenues of research.
Comment: 8 pages, Fourth Workshop on Very Large Digital Libraries, 201
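A minimal sketch of an Extract-Transform-Archive pipeline in the spirit of the ETA model: the stage responsibilities follow the abstract, while the function bodies and file layout are illustrative assumptions, not the paper's implementation.

```python
import json
import pathlib
import shutil

def extract(source: pathlib.Path) -> dict:
    """Pull a legacy dataset along with whatever metadata travels with it."""
    return {"data": source,
            "meta": {"origin": str(source), "size": source.stat().st_size}}

def transform(item: dict) -> dict:
    """Normalize metadata so heterogeneous sources become indexable."""
    item["meta"]["format"] = item["data"].suffix.lstrip(".") or "unknown"
    return item

def archive(item: dict, repo: pathlib.Path) -> None:
    """Copy the data into long-term storage next to its metadata record."""
    repo.mkdir(parents=True, exist_ok=True)
    shutil.copy2(item["data"], repo / item["data"].name)
    meta_file = repo / (item["data"].stem + ".meta.json")
    meta_file.write_text(json.dumps(item["meta"]))

# Mechanized curation over many datasets:
# for dataset in legacy_datasets:
#     archive(transform(extract(dataset)), repo)
```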
Open Data, Grey Data, and Stewardship: Universities at the Privacy Frontier
As universities recognize the inherent value in the data they collect and
hold, they encounter unforeseen challenges in stewarding those data in ways
that balance accountability, transparency, and protection of privacy, academic
freedom, and intellectual property. Two parallel developments in academic data
collection are converging: (1) open access requirements, whereby researchers
must provide access to their data as a condition of obtaining grant funding or
publishing results in journals; and (2) the vast accumulation of 'grey data'
about individuals in their daily activities of research, teaching, learning,
services, and administration. The boundaries between research and grey data are
blurring, making it more difficult to assess the risks and responsibilities
associated with any data collection. Many sets of data, both research and grey,
fall outside privacy regimes such as HIPAA and FERPA, or outside the scope of personally identifiable information (PII). Universities
are exploiting these data for research, learning analytics, faculty evaluation,
strategic decisions, and other sensitive matters. Commercial entities are
besieging universities with requests for access to data or for partnerships to
mine them. The privacy frontier facing research universities spans open access
practices, uses and misuses of data, public records requests, cyber risk, and
curating data for privacy protection. This paper explores the competing values
inherent in data stewardship and makes recommendations for practice, drawing on
the pioneering work of the University of California in privacy and information
security, data governance, and cyber risk.
Comment: Final published version, Sept 30, 201
From Sensor to Observation Web with Environmental Enablers in the Future Internet
This paper outlines the grand challenges in global sustainability research and the objectives of the FP7 Future Internet PPP program within the Digital Agenda for Europe. Large user communities are generating significant amounts of valuable environmental observations at local and regional scales using the devices and services of the Future Internet. These communities' environmental observations represent a wealth of information which is currently hardly used, or used only in isolation, and is therefore in need of integration with other information sources. Indeed, this very integration will lead to a paradigm shift from a mere Sensor Web to an Observation Web with semantically enriched content emanating from sensors, environmental simulations and citizens. The paper also describes the research challenges to realize the Observation Web and the associated environmental enablers for the Future Internet. Such an environmental enabler could, for instance, be an electronic sensing device, a web-service application, or even a social networking group affording or facilitating the capability of Future Internet applications to consume, produce, and use environmental observations in cross-domain applications. The term "envirofied" Future Internet is coined to describe this overall target, which forms a cornerstone of work in the Environmental Usage Area within the Future Internet PPP program. Relevant trends described in the paper are the usage of ubiquitous sensors (anywhere), the provision and generation of information by citizens, and the convergence of real and virtual realities to convey understanding of environmental observations. The paper addresses the technical challenges in the Environmental Usage Area and the need to design a multi-style service-oriented architecture. Key topics are the mapping of requirements to capabilities, providing scalability and robustness, and implementing context-aware information retrieval. Another essential research topic is handling data fusion and model-based computation, and the related propagation of information uncertainty. Approaches to security, standardization and harmonization, all essential for sustainable solutions, are summarized from the perspective of the Environmental Usage Area. The paper concludes with an overview of emerging, high-impact applications in the environmental areas concerning land ecosystems (biodiversity), air quality (atmospheric conditions) and water ecosystems (marine asset management).
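As a generic illustration of the data-fusion-with-uncertainty challenge named above, the sketch below uses inverse-variance weighting, a standard textbook approach rather than a method proposed in the paper, to combine a noisy citizen report with a calibrated sensor reading.

```python
def fuse(obs):
    """obs: list of (value, variance) pairs; returns fused (value, variance).
    Inverse-variance weighting: precise observations dominate the estimate."""
    weights = [1.0 / var for _, var in obs]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, obs)) / total
    return value, 1.0 / total

# A noisy citizen report (21.0 +/- var 4.0) and a calibrated sensor
# reading (19.5 +/- var 0.25): the sensor dominates the fused estimate.
print(fuse([(21.0, 4.0), (19.5, 0.25)]))  # ~(19.59, 0.24)
```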
Using Machine Learning to Infer Reasoning Provenance from User Interaction Log Data
The reconstruction of analysts' reasoning processes (reasoning provenance) during complex sensemaking tasks can support reflection and decision making. One potential approach to such reconstruction is to automatically infer reasoning from low-level user interaction logs. We explore a novel method for doing this using machine learning. Two user studies were conducted in which participants performed similar intelligence analysis tasks. In one study, participants used a standard web browser and word processor; in the other, they used a system called INVISQUE (Interactive Visual Search and Query Environment). Interaction logs were manually coded for cognitive actions based on captured think-aloud protocols and post-task interviews, using Klein, Phillips, Rall, and Peluso's data/frame model of sensemaking as a conceptual framework. This analysis was then used to train an interaction frame mapper, which employed multiple machine learning models to learn relationships between the interaction logs and the codings. Our results show that, for one study at least, classification accuracy was significantly better than chance and compared reasonably to a reported manual provenance reconstruction method. We discuss our results in terms of variations in feature sets from the two studies and what this means for the development of the method for provenance capture and the evaluation of sensemaking systems.
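One way such an interaction frame mapper could be set up, shown as a hedged sketch: featurized interaction-log windows are mapped to manually coded cognitive actions with an off-the-shelf classifier. The feature and label names are illustrative assumptions, not the study's actual coding scheme.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One row per log window: low-level interaction counts (hypothetical features).
windows = [
    {"queries": 3, "clicks": 1, "copies": 0, "dwell_s": 40},
    {"queries": 0, "clicks": 5, "copies": 2, "dwell_s": 120},
    {"queries": 1, "clicks": 0, "copies": 0, "dwell_s": 15},
    {"queries": 0, "clicks": 4, "copies": 3, "dwell_s": 90},
]
# Cognitive-action codes from think-aloud analysis (hypothetical labels).
codes = ["seek_frame", "elaborate_frame", "seek_frame", "elaborate_frame"]

mapper = make_pipeline(DictVectorizer(), LogisticRegression())
mapper.fit(windows, codes)
print(mapper.predict([{"queries": 2, "clicks": 0, "copies": 0, "dwell_s": 30}]))
```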
Chemical applications of eScience to interfacial spectroscopy
This report is a summary of work carried out by the author between October 2003 and September 2004, during the first year of his PhD studies.