22 research outputs found
Grid collector: an event catalog with automated file management
High Energy Nuclear Physics (HENP) experiments such as STAR at BNL and ATLAS at CERN produce large amounts of data that are stored as files on mass storage systems in computer centers. In these files, the basic unit of data is an event. Analysis is typically performed on a selected set of events. The files containing these events have to be located, copied from mass storage systems to disks before analysis, and removed when no longer needed. These file management tasks are tedious and time consuming. Typically, all events contained in the files are read into memory before a selection is made. Since the time to read the events dominates the overall execution time, reading unwanted events needlessly increases the analysis time. The Grid Collector is a set of software modules that work together to address these two issues. It automates the file management tasks and provides "direct" access to the selected events for analyses. It is currently integrated with the STAR analysis framework. Users can select events based on tags, such as "production date between March 10 and 20, and the number of charged tracks > 100." The Grid Collector locates the files containing relevant events, transfers the files across the Grid if necessary, and delivers the events to the analysis code through the familiar iterators. There have been some research efforts to address the file management issues, but the Grid Collector is unique in that it addresses the event access issue together with the file management issues. This makes it more useful to a large variety of users.
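The workflow described in this abstract (select events by tag, stage only the files that hold them, then iterate over the selected events) can be illustrated with a minimal Python sketch. This is not the Grid Collector's actual interface: the EventTag schema and the select_events, iterate_events, stage_file, and read_event names are all hypothetical, and the file staging and event reading are in-memory stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class EventTag:
    """Per-event summary attributes kept in the catalog (hypothetical schema)."""
    file_name: str       # file that holds the full event
    offset: int          # position of the event inside that file
    production_day: int  # day of March on which the event was produced
    charged_tracks: int  # number of charged tracks in the event


def select_events(catalog: List[EventTag],
                  predicate: Callable[[EventTag], bool]) -> Dict[str, List[int]]:
    """Map each file name to the offsets of the events whose tags match."""
    hits: Dict[str, List[int]] = {}
    for tag in catalog:
        if predicate(tag):
            hits.setdefault(tag.file_name, []).append(tag.offset)
    return hits


def iterate_events(hits: Dict[str, List[int]],
                   stage_file: Callable[[str], str],
                   read_event: Callable[[str, int], Tuple[str, int]]
                   ) -> Iterator[Tuple[str, int]]:
    """Stage each needed file once, then yield only the selected events."""
    for file_name, offsets in hits.items():
        local_path = stage_file(file_name)  # stand-in for a mass storage copy
        for offset in sorted(offsets):
            yield read_event(local_path, offset)


# Toy catalog: five events spread over three files.
catalog = [
    EventTag("run1.dat", 0, 9, 150),
    EventTag("run1.dat", 1, 12, 80),
    EventTag("run1.dat", 2, 15, 120),
    EventTag("run2.dat", 0, 18, 300),
    EventTag("run3.dat", 0, 25, 500),
]

# "production date between March 10 and 20, and charged tracks > 100"
hits = select_events(
    catalog,
    lambda t: 10 <= t.production_day <= 20 and t.charged_tracks > 100,
)

staged: List[str] = []


def stage_file(name: str) -> str:
    """Record the staging request and return a made-up local path."""
    staged.append(name)
    return "/scratch/" + name


def read_event(path: str, offset: int) -> Tuple[str, int]:
    """Pretend to read one event; return its location as a placeholder."""
    return (path, offset)


events = list(iterate_events(hits, stage_file, read_event))
```

In this toy run only run1.dat and run2.dat are staged, and one event is read from each; run3.dat is never touched because none of its events satisfies the selection.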
Unraveling Diffusion in Fusion Plasma: A Case Study of In Situ Processing and Particle Sorting
This work starts an in situ processing capability to study a certain
diffusion process in magnetic confinement fusion. This diffusion process
involves plasma particles that are likely to escape confinement. Such particles
carry a significant amount of energy from the burning plasma inside the tokamak
to the divertor and damage the divertor plate. This study requires in situ
processing because of the fast changing nature of the particle diffusion
process. However, the in situ processing approach is challenging because the
amount of data to be retained for the diffusion calculations increases over
time, unlike in other in situ processing cases where the amount of data to be
processed is constant over time. Here we report our preliminary efforts to
control the memory usage while ensuring the necessary analysis tasks are
completed in a timely manner. Compared with an earlier naive attempt to
directly compute the same diffusion displacements in the simulation code,
this in situ version reduces the memory usage from particle information by
nearly 60% and computation time by about 20%.
Fast Change Point Detection for Electricity Market Analysis
Electricity is a vital part of our daily life; therefore it is important to avoid irregularities such as the California Electricity Crisis of 2000 and 2001. In this work, we seek to predict anomalies using advanced machine learning algorithms, more specifically a Change Point Detection (CPD) algorithm on the electricity prices during the California Electricity Crisis. Such algorithms are effective, but computationally expensive when applied to a large amount of data. To address this challenge, we accelerate the Gaussian Process (GP) for 1-dimensional time series data. Since GP is at the core of many statistical learning techniques, this improvement could benefit many algorithms. In the specific Change Point Detection algorithm used in this study, we reduce the overall computational complexity from O(n^5) to O(n^2), where the amortized cost of solving a GP problem is O(1). Our efficient algorithm makes it possible to compute the Change Points using the hourly price data during the California Electricity Crisis. By comparing the detected Change Points with known events, we show that the Change Point Detection algorithm is indeed effective in detecting signals preceding major events.
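The role of a constant per-segment cost can be illustrated with a much simpler stand-in than the GP-based algorithm of the paper. The sketch below is not the authors' method: it swaps in a least-squares, single change point model and uses prefix sums so that, after O(n) precomputation, the fit cost of any segment is available in O(1). The function name and the price values are made up for illustration.

```python
from itertools import accumulate
from typing import List, Tuple


def best_single_change_point(x: List[float]) -> Tuple[int, float]:
    """Return (k, cost) for the split x[:k] | x[k:] with the least squared error.

    Each segment is modeled by its own mean (a simple stand-in, not a GP).
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")

    # Prefix sums: s1[i] = sum(x[:i]), s2[i] = sum(v * v for v in x[:i]).
    s1 = [0.0] + list(accumulate(x))
    s2 = [0.0] + list(accumulate(v * v for v in x))

    def segment_cost(i: int, j: int) -> float:
        """Sum of squared deviations from the mean over x[i:j], in O(1)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best_k, best_cost = 1, float("inf")
    for k in range(1, n):  # every candidate split is scored in O(1)
        cost = segment_cost(0, k) + segment_cost(k, n)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


# Made-up hourly prices with an abrupt jump after the fourth sample.
prices = [30.0, 31.0, 29.0, 30.0, 90.0, 95.0, 92.0, 91.0]
k, cost = best_single_change_point(prices)
```

For these values the best split is at k = 4, the index where the price level jumps. The point of the sketch is only the cost structure: once evaluating a segment is O(1), the total work is governed by the number of candidate segments examined.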
HDF5 As a Vehicle for in Transit Data Movement
For in transit processing, one of the fundamental challenges is the efficient movement of data from producers to consumers. Exploiting the flexibility offered by the SENSEI generic in situ framework, we have developed a number of different in transit data transport mechanisms. In this work, we focus on the transport mechanism that leverages the HDF5 parallel I/O library, and investigate the performance characteristics of this transport mechanism. For in transit use cases at scale on HPC platforms, one might expect that an in transit data transport mechanism that uses faster layers of the storage hierarchy, such as DRAM memory, would always outperform a transport that uses slower layers of the storage hierarchy, such as an NVRAM-based persistent storage presented as a distributed file system. However, our test results show that the performance of the transport using NVRAM is competitive with the transport that uses socket-based data movement across varying levels of producer and consumer concurrency.
In: Proceedings of the Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization
Photolysis of ((3-(Trimethylsilyl)propoxy)phenyl)phenyliodonium Salts in the Presence of 1-Naphthol and 1-Methoxynaphthalene
Testing VPIN on Big Data – Response to 'Reflecting on the VPIN Dispute'
Grid Collector: Facilitating Efficient Selective Access from Data Grids
The Grid Collector is a system that facilitates the effective analysis and spontaneous exploration of scientific data. It combines an efficient indexing technology with a Grid file management technology to speed up common analysis jobs on high-energy physics data and to enable some previously impractical analysis jobs. To analyze a set of high-energy collision events, one typically specifies the files containing the events of interest, reads all the events in the files, and filters out unwanted ones. Since most analysis jobs filter out a significant number of events, a considerable amount of time is wasted reading the unwanted events. The Grid Collector removes this inefficiency by allowing users to specify more precisely which events are of interest and to read only the selected events. This speeds up most analysis jobs. In existing analysis frameworks, the responsibility of bringing files from tertiary storage or remote sites to local disks falls on the users. This forces most analysis jobs to be performed at centralized computer facilities where commonly used files are kept on large shared file systems. The Grid Collector automates file management tasks and eliminates the labor-intensive manual file transfers. This makes it much easier to perform analyses that require data files on tertiary storage and remote sites. It also makes more computer resources available for analysis jobs since they are no longer bound to the centralized facilities.
Grid Collector: An Event Catalog
High Energy Nuclear Physics (HENP) experiments such as STAR at BNL and ATLAS at CERN produce large amounts of data that are stored as files on mass storage systems in computer centers. In these files, the basic unit of data is an event. Analysis is typically performed on a selected set of events. The files containing these events have to be located, copied from mass storage systems to disks before analysis, and removed when no longer needed. These file management tasks are tedious and time consuming. Typically, all events contained in the files are read into memory before a selection is made. Since the time to read the events dominates the overall execution time, reading unwanted events needlessly increases the analysis time. The Grid Collector is a set of software modules that work together to address these two issues. It automates the file management tasks and provides "direct" access to the selected events for analyses. It is currently integrated with the STAR analysis framework. Users can select events based on tags, such as "production date between March 10 and 20, and the number of charged tracks > 100." The Grid Collector locates the files containing relevant events, transfers the files across the Grid if necessary, and delivers the events to the analysis code through the familiar iterators. There have been some research efforts to address the file management issues, but the Grid Collector is unique in that it addresses the event access issue together with the file management issues. This makes it more useful to a large variety of users.