4,404 research outputs found

    Towards Exascale Scientific Metadata Management

    Full text link
    Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and the analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines have been capturing a limited form of metadata to provide provenance information about the identity and lineage of the data. However, much of the data produced by simulations, experiments, and analyses still need to be annotated manually in an ad hoc manner by domain scientists. Systematic and transparent acquisition of rich metadata becomes a crucial prerequisite to sustain and accelerate the pace of scientific innovation. Yet, ubiquitous and domain-agnostic metadata management infrastructure that can meet the demands of extreme-scale science is notable by its absence. To address this gap in scientific data management research and practice, we present our vision for an integrated approach that (1) automatically captures and manipulates information-rich metadata while the data is being produced or analyzed and (2) stores metadata within each dataset to permeate metadata-oblivious processes and to query metadata through established and standardized data access interfaces. We motivate the need for the proposed integrated approach using applications from plasma physics, climate modeling and neuroscience, and then discuss research challenges and possible solutions

    Functional Data Analysis in Electronic Commerce Research

    Full text link
    This paper describes opportunities and challenges of using functional data analysis (FDA) for the exploration and analysis of data originating from electronic commerce (eCommerce). We discuss the special data structures that arise in the online environment and why FDA is a natural approach for representing and analyzing such data. The paper reviews several FDA methods and motivates their usefulness in eCommerce research by providing a glimpse into new domain insights that they allow. We argue that the wedding of eCommerce with FDA leads to innovations both in statistical methodology, due to the challenges and complications that arise in eCommerce data, and in online research, by being able to ask (and subsequently answer) new research questions that classical statistical methods are not able to address, and also by expanding on research questions beyond the ones traditionally asked in the offline environment. We describe several applications originating from online transactions which are new to the statistics literature, and point out statistical challenges accompanied by some solutions. We also discuss some promising future directions for joint research efforts between researchers in eCommerce and statistics.Comment: Published at http://dx.doi.org/10.1214/088342306000000132 in the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

    PaPaS: A Portable, Lightweight, and Generic Framework for Parallel Parameter Studies

    Full text link
    The current landscape of scientific research is widely based on modeling and simulation, typically with complexity in the simulation's flow of execution and parameterization properties. Execution flows are not necessarily straightforward since they may need multiple processing tasks and iterations. Furthermore, parameter and performance studies are common approaches used to characterize a simulation, often requiring traversal of a large parameter space. High-performance computers offer practical resources at the expense of users handling the setup, submission, and management of jobs. This work presents the design of PaPaS, a portable, lightweight, and generic workflow framework for conducting parallel parameter and performance studies. Workflows are defined using parameter files based on keyword-value pairs syntax, thus removing from the user the overhead of creating complex scripts to manage the workflow. A parameter set consists of any combination of environment variables, files, partial file contents, and command line arguments. PaPaS is being developed in Python 3 with support for distributed parallelization using SSH, batch systems, and C++ MPI. The PaPaS framework will run as user processes, and can be used in single/multi-node and multi-tenant computing systems. An example simulation using the BehaviorSpace tool from NetLogo and a matrix multiply using OpenMP are presented as parameter and performance studies, respectively. The results demonstrate that the PaPaS framework offers a simple method for defining and managing parameter studies, while increasing resource utilization.Comment: 8 pages, 6 figures, PEARC '18: Practice and Experience in Advanced Research Computing, July 22--26, 2018, Pittsburgh, PA, US

    Prototype Packages for Managing and Animating Longitudinal Network Data: dynamicnetwork and rSoNIA

    Get PDF
    Work with longitudinal network survey data and the dynamic network outputs of the statnet ERGMs has demonstrated the need for consistent frameworks and data structures for expressing, storing, and manipulating information about networks that change in time. Motivated by our requirements for exchanging data among researchers and various analysis and visualization processes, we have created an R package dynamicnetwork that builds upon previous work in the network, statnet and sna packages and provides a limited functional implementation. This paper discusses design issues and considerations, describes classes and forms of dynamic data, and works through several examples to demonstrate the utility of the package. The functionality of the rSoNIA package that uses dynamicnetwork to exchange data with the Social Network Image Animator (SoNIA) software to create animated movies of changing networks from within R is also demonstrated.

    The Universe at Extreme Scale: Multi-Petaflop Sky Simulation on the BG/Q

    Full text link
    Remarkable observational advances have established a compelling cross-validated model of the Universe. Yet, two key pillars of this model -- dark matter and dark energy -- remain mysterious. Sky surveys that map billions of galaxies to explore the `Dark Universe', demand a corresponding extreme-scale simulation capability; the HACC (Hybrid/Hardware Accelerated Cosmology Code) framework has been designed to deliver this level of performance now, and into the future. With its novel algorithmic structure, HACC allows flexible tuning across diverse architectures, including accelerated and multi-core systems. On the IBM BG/Q, HACC attains unprecedented scalable performance -- currently 13.94 PFlops at 69.2% of peak and 90% parallel efficiency on 1,572,864 cores with an equal number of MPI ranks, and a concurrency of 6.3 million. This level of performance was achieved at extreme problem sizes, including a benchmark run with more than 3.6 trillion particles, significantly larger than any cosmological simulation yet performed.Comment: 11 pages, 11 figures, final version of paper for talk presented at SC1

    Asynchronous Execution of Python Code on Task Based Runtime Systems

    Get PDF
    Despite advancements in the areas of parallel and distributed computing, the complexity of programming on High Performance Computing (HPC) resources has deterred many domain experts, especially in the areas of machine learning and artificial intelligence (AI), from utilizing performance benefits of such systems. Researchers and scientists favor high-productivity languages to avoid the inconvenience of programming in low-level languages and costs of acquiring the necessary skills required for programming at this level. In recent years, Python, with the support of linear algebra libraries like NumPy, has gained popularity despite facing limitations which prevent this code from distributed runs. Here we present a solution which maintains both high level programming abstractions as well as parallel and distributed efficiency. Phylanx, is an asynchronous array processing toolkit which transforms Python and NumPy operations into code which can be executed in parallel on HPC resources by mapping Python and NumPy functions and variables into a dependency tree executed by HPX, a general purpose, parallel, task-based runtime system written in C++. Phylanx additionally provides introspection and visualization capabilities for debugging and performance analysis. We have tested the foundations of our approach by comparing our implementation of widely used machine learning algorithms to accepted NumPy standards
    corecore