30,925 research outputs found
Big Data Analytics in Static and Streaming Provenance
Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2016
With recent technological and computational advances, scientists increasingly integrate
sensors and model simulations to understand spatial, temporal, social, and ecological
relationships at unprecedented scale. Data provenance traces relationships of entities over time, thus providing a unique view on over-time behavior under study. However,
provenance can be overwhelming in both volume and complexity; the forecasting
potential of provenance creates additional demands. This dissertation focuses on Big Data analytics of static and streaming provenance. It develops filters and a non-preprocessing slicing technique for in-situ querying of static provenance, and presents a stream-processing framework for online processing of provenance data at high receiving rates. While the former suffices for queries given before the application starts (forward queries), the latter handles queries whose targets are unknown beforehand (backward queries). Finally, it explores data mining on large collections of provenance and proposes a temporal representation of provenance that reduces the high dimensionality while effectively supporting mining tasks such as clustering, classification, and association rule mining; the temporal representation can also be applied to streaming provenance. The proposed techniques are verified through software prototypes applied to Big Data provenance captured from computer network data, weather models, ocean models, remote (satellite) imagery data, and agent-based simulations of agricultural decision making.
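The forward/backward query distinction above can be contrasted with a toy stream filter — a minimal sketch, not the dissertation's actual framework; the record format, predicate, and window policy are invented for illustration:

```python
from collections import deque

def forward_query(stream, predicate):
    """Forward query: the target is known before the run starts,
    so matching records can be filtered in-situ, with no buffering."""
    for record in stream:
        if predicate(record):
            yield record

def backward_window(stream, size):
    """Backward queries name their target only after the fact, so a
    streaming processor must retain some history; a bounded window of
    the most recent records is one simple retention policy."""
    window = deque(maxlen=size)
    for record in stream:
        window.append(record)
    return list(window)
```

The sketch makes the trade-off concrete: a forward query costs no memory, while supporting backward queries forces the system to decide how much of the stream to keep.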
A Linked Data Approach to Sharing Workflows and Workflow Results
A bioinformatics analysis pipeline is often highly elaborate, due to the inherent complexity of biological systems and the variety and size of datasets. A digital equivalent of the ‘Materials and Methods’ section in wet laboratory publications would be highly beneficial to bioinformatics, for evaluating evidence and examining data across related experiments, while introducing the potential to find associated resources and integrate them as data and services. We present initial steps towards preserving bioinformatics ‘materials and methods’ by exploiting the workflow paradigm for capturing the design of a data analysis pipeline, and RDF to link the workflow, its component services, run-time provenance, and a personalized biological interpretation of the results. An example shows the reproduction of the unique graph of an analysis procedure, its results, provenance, and personal interpretation of a text mining experiment. It links data from Taverna, myExperiment.org, BioCatalogue.org, and ConceptWiki.org. The approach is relatively ‘light-weight’ and unobtrusive to bioinformatics users
Spatial mapping of the provenance of storm dust: Application of data mining and ensemble modelling
Spatial modelling of storm dust provenance is essential to mitigate its on-site and off-site effects in the arid and
semi-arid environments of the world. Therefore, the main aim of this study was to apply eight data mining algorithms, namely random forest (RF), support vector machine (SVM), Bayesian additive regression trees (BART), radial basis function (RBF), extreme gradient boosting (XGBoost), regression tree analysis (RTA), the Cubist model and boosted regression trees (BRT), together with an ensemble modelling (EM) approach, for generating spatial maps of dust provenance in Khuzestan province, a main region with active dust-producing sources in southwestern Iran. This study is the first attempt at predicting storm dust provenance by applying individual data mining models and ensemble modelling. We identified and mapped, in a geographic information system (GIS), 12 potentially effective factors for dust emissions: two climate factors (wind speed, precipitation), five soil characteristics (texture, bulk density, EC, organic matter (OM), available water capacity (AWC)), the normalized difference vegetation index (NDVI), land use, geology, a digital elevation model (DEM) and land type. We used a mean decrease accuracy measure (MDAM) to determine the corresponding importance scores (IS). A multicollinearity test (including the variance inflation factor (VIF) and tolerance coefficient (TC)) was applied to assess relationships between the effective factors, and an existing map of dust provenance was randomly split into training (70%) and validation (30%) data. The individual data mining models were validated using the area under the curve (AUC). Based on the TC and VIF results, no collinearity was detected among the 12 effective factors for dust emissions. The prediction accuracies of the eight data mining models and the EM, as assessed by the AUC, ranked as follows: EM (AUC = 99.8%) > XGBoost > RBF > Cubist > RF > BART > SVM > BRT > RTA (AUC = 79.1%). Among all models, the EM provided the highest accuracy for predicting storm dust provenance. Using the EM, areas classified as having low, moderate, high and very high susceptibility as storm dust provenance comprised 36, 13, 23 and 28% of the total mapped area, respectively. Based on the MDAM results, the highest and lowest IS were obtained for the wind speed (IS = 23) and geology (IS = 6.5) factors, respectively. Overall, the modelling techniques used in this research are helpful for predicting storm dust provenance and thereby targeting mitigation. Therefore, we recommend applying data mining EM approaches to the spatial mapping of storm dust provenance worldwide.
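The validation workflow the abstract describes (random 70/30 split, AUC scoring, score averaging for an ensemble) can be sketched in plain Python. This is an illustrative re-implementation under assumed data shapes, not the study's code:

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U)
    formulation: the fraction of positive/negative pairs in which the
    positive sample outscores the negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def split_70_30(samples, seed=0):
    """Random 70% training / 30% validation split, as in the study."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def ensemble(score_lists):
    """Naive ensemble: average the susceptibility scores that several
    models assign to each location (one of many possible EM schemes)."""
    return [sum(s) / len(s) for s in zip(*score_lists)]
```

The averaging ensemble above is only one way to combine model outputs; the paper does not specify its EM scheme, so this stands in as the simplest choice.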
Data Provenance and Management in Radio Astronomy: A Stream Computing Approach
New approaches for data provenance and data management (DPDM) are required
for mega science projects like the Square Kilometre Array, characterized by
extremely large data volumes and intense data rates, therefore demanding
innovative and highly efficient computational paradigms. In this context, we
explore a stream-computing approach with the emphasis on the use of
accelerators. In particular, we make use of a new generation of high
performance stream-based parallelization middleware known as InfoSphere
Streams. Its viability for managing and ensuring interoperability and integrity
of signal processing data pipelines is demonstrated in radio astronomy. IBM
InfoSphere Streams embraces the stream-computing paradigm. It is a shift from
conventional data mining techniques (involving analysis of existing data from
databases) towards real-time analytic processing. We discuss using InfoSphere
Streams for effective DPDM in radio astronomy and propose a way in which
InfoSphere Streams can be utilized for large antenna arrays. We present a
case study: the InfoSphere Streams implementation of an autocorrelating
spectrometer, and using this example we discuss the advantages of the
stream-computing approach and the utilization of hardware accelerators.
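The autocorrelation step such a spectrometer performs can be illustrated language-agnostically; the sketch below is not the IBM InfoSphere Streams implementation, and the block size and lag count are invented for the example:

```python
def autocorrelate(block, max_lag):
    """Mean lag products of one sample block for lags 0..max_lag-1; in a
    spectrometer these are later Fourier-transformed into a power
    spectrum (Wiener-Khinchin theorem)."""
    n = len(block)
    return [sum(block[i] * block[i + lag] for i in range(n - lag)) / (n - lag)
            for lag in range(max_lag)]

def streaming_average(blocks, max_lag):
    """Stream-computing flavour: consume blocks one at a time and keep a
    running average of the lag products instead of storing the raw
    stream, so memory stays bounded at max_lag values."""
    acc = [0.0] * max_lag
    count = 0
    for block in blocks:
        lags = autocorrelate(block, max_lag)
        count += 1
        acc = [a + (l - a) / count for a, l in zip(acc, lags)]
    return acc
```

In a real deployment the inner lag products are exactly the kind of data-parallel kernel that is offloaded to hardware accelerators, while the running average is the cheap streaming reduction kept on the host.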
Automatic vs Manual Provenance Abstractions: Mind the Gap
In recent years the need to simplify or to hide sensitive information in
provenance has given rise to research on provenance abstraction. In the context
of scientific workflows, existing research provides techniques to
semi-automatically create abstractions of a given workflow description, which
are in turn used as filters over the workflow's provenance traces. An alternative
approach that is commonly adopted by scientists is to build workflows with
abstractions embedded into the workflow's design, such as using sub-workflows.
This paper reports on the comparison of manual versus semi-automated approaches
in a context where result abstractions are used to filter report-worthy results
of computational scientific analyses. Specifically, we take a real-world
workflow containing user-created design abstractions and compare these with
abstractions created by ZOOM UserViews and Workflow Summaries systems. Our
comparison shows that semi-automatic and manual approaches largely overlap from
a process perspective; meanwhile, there is a dramatic mismatch in terms of data
artefacts retained in an abstracted account of derivation. We discuss reasons
and suggest future research directions.
Comment: Preprint accepted to the 2016 Workshop on the Theory and Applications
of Provenance (TAPP 2016)
From scientific workflow patterns to 5-star linked open data
Scientific workflow management systems have been widely adopted by data-intensive science communities. Many efforts have been dedicated to the representation and exploitation of provenance to improve reproducibility in data-intensive sciences. However, few works address the mining of provenance graphs to annotate the produced data with domain-specific context for better interpretation and sharing of results. In this paper, we propose PoeM, a lightweight framework for mining provenance in scientific workflows. PoeM produces linked in silico experiment reports based on workflow runs. PoeM leverages semantic web technologies and reference vocabularies (PROV-O, P-Plan) to generate provenance mining rules and finally assemble linked scientific experiment reports (Micropublications, Experimental Factor Ontology). Preliminary experiments demonstrate that PoeM enables the querying and sharing of Galaxy-processed genomic data as 5-star linked datasets.
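The kind of linking such a report contains can be illustrated with a couple of hand-written PROV-O triples in N-Triples syntax; the run and artefact URIs below are invented for the example, and this is not PoeM's output format:

```python
PROV = "http://www.w3.org/ns/prov#"

def ntriple(s, p, o):
    """Serialize one RDF triple in N-Triples form (URI nodes only, for brevity)."""
    return f"<{s}> <{p}> <{o}> ."

# Hypothetical identifiers for a workflow run and the dataset it produced.
run = "http://example.org/run/42"                   # a prov:Activity
workflow = "http://example.org/wf/variant-calling"  # the plan the run followed
result = "http://example.org/data/report.ttl"       # a prov:Entity

report = [
    ntriple(run, PROV + "used", workflow),
    ntriple(result, PROV + "wasGeneratedBy", run),
]
print("\n".join(report))
```

Because each statement is a plain triple over dereferenceable URIs, a downstream consumer can join these provenance links with domain vocabularies, which is what lifts the report toward 5-star linked data.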
Chemical information matters: an e-Research perspective on information and data sharing in the chemical sciences
Recently, a number of organisations have called for open access to scientific information and especially to the data obtained from publicly funded research, among which the Royal Society report and the European Commission press release are particularly notable. It has long been accepted that building research on the foundations laid by other scientists is both effective and efficient. Regrettably, some disciplines, chemistry being one, have been slow to recognise the value of sharing and have thus been reluctant to curate their data and information in preparation for exchanging it. The very significant increases in both the volume and the complexity of the datasets produced has encouraged the expansion of e-Research, and stimulated the development of methodologies for managing, organising, and analysing "big data". We review the evolution of cheminformatics, the amalgam of chemistry, computer science, and information technology, and assess the wider e-Science and e-Research perspective. Chemical information does matter, as do matters of communicating data and collaborating with data. For chemistry, unique identifiers, structure representations, and property descriptors are essential to the activities of sharing and exchange. Open science entails the sharing of more than mere facts: for example, the publication of negative outcomes can facilitate better understanding of which synthetic routes to choose, an aspiration of the Dial-a-Molecule Grand Challenge. The protagonists of open notebook science go even further and exchange their thoughts and plans. We consider the concepts of preservation, curation, provenance, discovery, and access in the context of the research lifecycle, and then focus on the role of metadata, particularly the ontologies on which the emerging chemical Semantic Web will depend. Among our conclusions, we present our choice of the "grand challenges" for the preservation and sharing of chemical information