    Data access and visualization benefits from implementation of a hydrologic information system

    In 2010, the National Science Foundation (NSF) implemented new guidelines for all scientists applying for grants. A Data Management Plan (DMP) is now required for all proposals in which data are created or gathered while working under the grant. Several organizations have produced templates and applications to assist with the construction of DMPs. The data plans provide a good overview of data processing and storage but do not provide any guidance for managing data during the research process. Large temporal hydrologic data sets can provide rich insight into complex hydrologic and ecological systems. Complications arise when attempting to query and present the data in ways that are useful for exploring and validating research hypotheses. Common tools, such as Excel or Matlab, may be helpful if you know the exact sequence of data you want to analyze. Frequently, this is not the case. Looking at long-term trends, adding and removing variables, or comparing local results to external national datasets is difficult or impossible with these tools. To overcome the limitations of current data management methods, a Consortium of Universities for the Advancement of Hydrologic Science Inc. - Hydrologic Information System (CUAHSI-HIS) server was deployed in collaboration with the Earth Data Analysis Center (EDAC) and the New Mexico Experimental Program to Stimulate Competitive Research (NMEPSCoR). Data products on the server are stored in a relational database using WaterML, an XML-based language introducing standardization to the hydrologic community and facilitating distribution and aggregation of hydrologic data. Four project types from different agencies were selected to explore the process of obtaining and ingesting data into an HIS. Three of the projects are university based with different stakeholders, and the fourth is a state-funded project carried out by a contractor. Tools developed by CUAHSI for ingesting measurements into the database made processing the raw data straightforward. After the data were formatted properly, automated processes allowed millions of measurements to migrate from Excel files into the HIS. Aggregating the data and metadata without support from the principal investigator proved difficult, and deciphering the provenance of derived data was exceptionally difficult from the perspective of a data manager with little experience in the specialized disciplines involved. Datasets that previously required hours to download, aggregate, and visualize can now be processed in minutes. Repetitive analysis tasks can be automated within the HIS, integrating local, regional, and national datasets by spatial and temporal extent and delivering results to the research team in a variety of formats. The CUAHSI-HIS components streamline data discovery and analysis in addition to satisfying the NSF DMP requirements.
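
    The kind of automated bulk ingestion described above can be sketched in a few lines of Python. The file name, column names, and table layout below are illustrative assumptions for this sketch and do not reproduce the actual CUAHSI tooling or ODM schema.

        # Illustrative sketch: bulk-load sensor measurements from an Excel
        # export into a relational time-series table. Names and schema are
        # assumptions for this example only.
        import sqlite3
        import pandas as pd

        conn = sqlite3.connect("his_example.db")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS data_values (
                site_code TEXT,
                variable  TEXT,
                datetime  TEXT,
                value     REAL
            )
        """)

        # Hypothetical export with one row per observation.
        df = pd.read_excel("raw_measurements.xlsx")
        df = df.rename(columns={"SiteCode": "site_code", "Variable": "variable",
                                "DateTime": "datetime", "Value": "value"})
        df = df[["site_code", "variable", "datetime", "value"]]
        df["datetime"] = pd.to_datetime(df["datetime"]).astype(str)

        # Bulk insert: millions of rows can migrate in one automated pass.
        df.to_sql("data_values", conn, if_exists="append", index=False)
        conn.commit()
        conn.close()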

    Chemistry-Inspired Adaptive Stream Processing

    Stream processing engines have emerged as the next generation of data processing systems, meeting the need for low-delay processing. While these systems have been widely studied recently, their ability to adapt their processing logic at run time upon detection of events calling for adaptation is still an open issue. Chemistry-inspired models of computation have been shown to ease the specification of adaptive systems. In this paper, we argue that a higher-order chemical model can be used to specify such an adaptive SPE in a natural way. We also show how such programming abstractions can be enacted in a decentralised environment.
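
    As a rough illustration of the chemical programming model this abstract refers to, the sketch below implements a Gamma-style multiset rewriting loop in Python; the rule and data are invented for the example and do not reproduce the paper's higher-order chemical formalism.

        # Minimal sketch of a chemistry-inspired (Gamma-style) computation:
        # a multiset of values reacts under a rule until no reaction can fire.
        # The rule (combine pairs by keeping their maximum) is an invented example.
        import random

        def react(multiset, condition, action):
            """Repeatedly apply `action` to pairs satisfying `condition` until stable."""
            multiset = list(multiset)
            while True:
                pairs = [(i, j) for i in range(len(multiset))
                                for j in range(len(multiset))
                                if i != j and condition(multiset[i], multiset[j])]
                if not pairs:
                    return multiset
                i, j = random.choice(pairs)   # reactions fire nondeterministically
                a, b = multiset[i], multiset[j]
                for k in sorted((i, j), reverse=True):
                    del multiset[k]
                multiset.extend(action(a, b))

        # Example: reduce a stream window to its maximum value.
        window = [3, 9, 1, 7, 5]
        print(react(window, condition=lambda a, b: True,
                    action=lambda a, b: [max(a, b)]))   # -> [9]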

    Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language

    A widely used standard for portable multilingual data analysis pipelines would bring considerable benefits to scholarly publication reuse, research/industry collaboration, regulatory cost control, and the environment. Published research that used multiple computer languages for its analysis pipelines would include a complete and reusable description of that analysis that is runnable on a diverse set of computing environments. Researchers would be able to collaborate and reuse these pipelines more easily, adding or exchanging components regardless of the programming language used; collaborations with and within industry would be easier; and approval of new medical interventions that rely on such pipelines would be faster. Time would be saved and environmental impact reduced, as these descriptions contain enough information for advanced optimization without user intervention. Workflows are widely used in data analysis pipelines, enabling innovation and decision-making for modern society. In many domains the analysis components are numerous and written in multiple different computer languages by third parties. However, without a standard for reusable and portable multilingual workflows, reusing published multilingual workflows, collaborating on open problems, and optimizing their execution are severely hampered. Moreover, only a widely used standard for multilingual data analysis pipelines would enable considerable benefits to research-industry collaboration, regulatory cost control, and preservation of the environment. Prior to the start of the CWL project, there was no standard for describing multilingual analysis pipelines in a portable and reusable manner. Even today, although there exist hundreds of single-vendor and other single-source systems that run workflows, none is a general, community-driven, consensus-built standard.
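
    To make the idea of a portable, language-agnostic tool description concrete, the sketch below uses Python to emit a minimal CWL CommandLineTool; the echo example is invented for illustration, and a CWL runner such as cwltool is assumed to be installed separately.

        # Write a minimal CWL v1.2 CommandLineTool description.
        # The wrapped command (`echo`) is a trivial, invented example; real
        # pipelines wrap analysis steps written in any language.
        import yaml  # PyYAML

        tool = {
            "cwlVersion": "v1.2",
            "class": "CommandLineTool",
            "baseCommand": "echo",
            "inputs": {
                "message": {"type": "string", "inputBinding": {"position": 1}},
            },
            "outputs": [],
        }

        with open("echo-tool.cwl", "w") as fh:
            yaml.safe_dump(tool, fh, sort_keys=False)

        # Any CWL-aware engine can then run the same description, e.g.:
        #   cwltool echo-tool.cwl --message "Hello, CWL"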

    Introducing distributed dynamic data-intensive (D3) science: Understanding applications and infrastructure

    A common feature across many science and engineering applications is the amount and diversity of data and computation that must be integrated to yield insights. Data sets are growing larger and becoming distributed, and their location, availability, and properties are often time-dependent. Collectively, these characteristics give rise to dynamic distributed data-intensive applications. While "static" data applications have received significant attention, the characteristics, requirements, and software systems for the analysis of large volumes of dynamic, distributed data, and data-intensive applications have received relatively less attention. This paper surveys several representative dynamic distributed data-intensive application scenarios, provides a common conceptual framework to understand them, and examines the infrastructure used in support of such applications.

    Feedback-control & queueing theory-based resource management for streaming applications

    Recent advances in sensor technologies and instrumentation have led to an extraordinary growth of data sources and streaming applications. A wide variety of devices, from smart phones to dedicated sensors, can collect and stream large amounts of data at unprecedented rates. A number of distinct streaming data models have been proposed. Typical applications include smart cities and built environments, for instance, where sensor-based infrastructures continue to increase in scale and variety. Understanding how such streaming content can be processed within some time threshold remains a non-trivial and important research topic. We investigate how a cloud-based computational infrastructure can autonomically respond to such streaming content, offering Quality of Service guarantees. We propose an autonomic controller (based on feedback control and queueing theory) to elastically provision virtual machines to meet performance targets associated with a particular data stream. Evaluation is carried out using a federated cloud-based infrastructure (implemented using CometCloud), where the allocation of new resources can be based on: (i) differences between sites, i.e. types of resources supported (e.g. GPU vs. CPU only); (ii) cost of execution; and (iii) failure rate and likely resilience. In particular, we demonstrate how Little's Law, a widely used result in queueing theory, can be adapted to support dynamic control in the context of such resource provisioning.
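
    As a rough illustration of how Little's Law (L = lambda * W) can drive provisioning decisions, the sketch below estimates how many virtual machines are needed to keep time-in-system under a target; the arrival rate, per-VM throughput, and scaling rule are invented numbers, not the controller evaluated in the paper.

        # Little's Law: L = lambda * W, where L is the average number of tuples
        # in the system, lambda the arrival rate, and W the average time a tuple
        # spends there. The controller below observes backlog and arrival rate,
        # estimates W via Little's Law, and scales the VM pool toward a target.
        # All numbers and the scaling rule are illustrative assumptions.
        import math

        def plan_vm_count(arrival_rate, backlog, current_vms,
                          per_vm_throughput, target_response_time):
            """Return the VM count suggested for the next control interval."""
            # Estimated time a tuple currently spends in the system (W = L / lambda).
            observed_w = backlog / arrival_rate if arrival_rate > 0 else 0.0
            if observed_w <= target_response_time:
                return current_vms  # QoS target met; keep the pool as is.
            # Provision enough aggregate throughput to drain the backlog within
            # the target window while absorbing new arrivals.
            required_throughput = arrival_rate + backlog / target_response_time
            return max(current_vms, math.ceil(required_throughput / per_vm_throughput))

        # Example control step with invented measurements:
        print(plan_vm_count(arrival_rate=500.0,        # tuples/s entering (lambda)
                            backlog=2000.0,            # tuples queued (L)
                            current_vms=2,
                            per_vm_throughput=300.0,   # tuples/s one VM processes
                            target_response_time=2.0)) # seconds (W target) -> 5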

    Technologies and Applications for Big Data Value

    This open access book explores cutting-edge solutions and best practices for big data and data-driven AI applications for the data-driven economy. It provides the reader with a basis for understanding how technical issues can be overcome to offer real-world solutions to major industrial areas. The book starts with an introductory chapter that provides an overview of the book by positioning the following chapters in terms of their contributions to technology frameworks which are key elements of the Big Data Value Public-Private Partnership and the upcoming Partnership on AI, Data and Robotics. The remainder of the book is arranged in two parts. The first part, “Technologies and Methods”, contains horizontal contributions of technologies and methods that enable data value chains to be applied in any sector. The second part, “Processes and Applications”, details experience reports and lessons from using big data and data-driven approaches in processes and applications. Its chapters are co-authored with industry experts and cover domains including health, law, finance, retail, manufacturing, mobility, and smart cities. Contributions emanate from the Big Data Value Public-Private Partnership and the Big Data Value Association, which have acted as the European data community's nucleus to bring together businesses with leading researchers to harness the value of data to benefit society, business, science, and industry. The book is of interest to two primary audiences: first, undergraduate and postgraduate students and researchers in various fields, including big data, data science, data engineering, and machine learning and AI; and second, practitioners and industry experts engaged in data-driven systems, software design, and deployment projects who are interested in employing these advanced methods to address real-world problems.

    A Data Model to Manage Data for Water Resources Systems Modeling

    Current practices to identify, organize, analyze, and serve data to water resources systems models are typically model- and dataset-specific. Data are stored in different formats, described with different vocabularies, and require manual, model-specific, and time-intensive manipulations to find, organize, compare, and then serve to models. This paper presents the Water Management Data Model (WaMDaM), implemented in a relational database. WaMDaM uses contextual metadata, controlled vocabularies, and supporting software tools to organize and store water management data from multiple sources and models and to allow users to more easily interact with its database. Five use cases draw on thirteen datasets and models focused on the Bear River Watershed, United States, to show how a user can identify, compare, and choose from multiple types of data, networks, and scenario elements, then serve data to models. The database design is flexible and scalable to accommodate new datasets, models, and associated components, attributes, scenarios, and metadata.
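
    A relational layout of this kind can be sketched with a handful of tables; the table and column names below are simplified illustrations for this sketch and do not reproduce the actual WaMDaM schema.

        # Simplified sketch of a relational layout for water-management data
        # with controlled vocabularies and scenario metadata. Table and column
        # names are illustrative only.
        import sqlite3

        conn = sqlite3.connect("wamdam_sketch.db")
        conn.executescript("""
            -- Controlled vocabulary: one agreed-upon term per attribute concept.
            CREATE TABLE cv_attribute (
                term        TEXT PRIMARY KEY,   -- e.g. 'ReservoirStorageCapacity'
                definition  TEXT,
                unit        TEXT
            );

            -- Network components (nodes/links) described by a source dataset or model.
            CREATE TABLE component (
                component_id INTEGER PRIMARY KEY,
                name         TEXT,
                source       TEXT                -- originating dataset or model
            );

            -- Scenario-specific attribute values tied to the controlled vocabulary.
            CREATE TABLE attribute_value (
                component_id INTEGER REFERENCES component(component_id),
                term         TEXT    REFERENCES cv_attribute(term),
                scenario     TEXT,
                value        REAL
            );
        """)
        conn.commit()
        conn.close()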

    Water Data Science: Data Driven Techniques, Training, and Tools for Improved Management of High Frequency Water Resources Data

    Electronic sensors can measure water and climate conditions at high frequency and generate large quantities of observed data. This work addresses data management challenges associated with the volume and complexity of high frequency water data. We developed techniques for automatically reviewing data, created materials for training water data managers, and explored existing and emerging technologies for sensor data management. Data collected by sensors often include errors, due to sensor failure or environmental conditions, that need to be removed, labeled, or corrected before the data can be used for analysis. Manual review and correction of these data can be tedious and time consuming. To help automate these tasks, we developed a computer program that automatically checks the data for mistakes and attempts to fix them. This tool has the potential to save time and effort and is available to scientists and practitioners who use sensors to monitor water. Scientists may lack skillsets for working with sensor data because traditional engineering or science courses do not address how to work with complex data using modern technology. We surveyed and interviewed instructors who teach courses related to “hydroinformatics” or “water data science” to understand challenges in incorporating data science techniques and tools into water resources teaching. Based on their feedback, we created educational materials that demonstrate how the articulated challenges can be effectively addressed to provide high-quality instruction. These materials are available online for students and teachers. In addition to skills for working with sensor data, scientists and engineers need tools for storing, managing, and sharing these data. Hydrologic information systems (HIS) help manage the data collected using sensors. HIS ensure that data can be effectively used by providing the computer infrastructure to get data from sensors in the field into secure data storage and then into the hands of scientists and others who use them. This work describes the evolution of software and standards that comprise HIS. We present the main components of HIS, describe currently available systems and gaps in technology or functionality, and then discuss opportunities for improved infrastructure that would make sensor data easier to collect, manage, and use. In short, we are trying to make sure that sensor data are good and useful; we are helping instructors teach prospective data collectors and users about water and data; and we are making sure that the systems that enable collection, storage, management, and use of the data work smoothly.
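
    A minimal illustration of the kind of automated checking described above: the sketch flags values that deviate sharply from a rolling median and interpolates over them. The window size, threshold, and synthetic data are invented for the example; this is not the tool developed in this work.

        # Minimal sketch of automated quality control for high-frequency sensor
        # data: flag values that deviate sharply from a rolling median, then
        # interpolate over the flagged points. Window and threshold are invented.
        import numpy as np
        import pandas as pd

        def flag_and_correct(series, window=9, threshold=3.0):
            """Return (corrected_series, flags) for a pandas Series of observations."""
            rolling_median = series.rolling(window, center=True, min_periods=1).median()
            deviation = (series - rolling_median).abs()
            scale = deviation.rolling(window, center=True, min_periods=1).median()
            scale = scale.replace(0, np.nan).fillna(deviation.median())
            flags = deviation > threshold * scale
            corrected = series.mask(flags).interpolate(limit_direction="both")
            return corrected, flags

        # Example: a synthetic 15-minute temperature record containing one spike.
        idx = pd.date_range("2023-07-01", periods=12, freq="15min")
        temps = pd.Series([18.1, 18.2, 18.3, 18.2, 45.0, 18.4,
                           18.5, 18.4, 18.6, 18.5, 18.7, 18.6], index=idx)
        clean, flags = flag_and_correct(temps)
        print(temps[flags])   # the spurious 45.0 reading is flagged
        print(clean[flags])   # and replaced by an interpolated value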