121 research outputs found
Recommended from our members
Storage resource managers: Middleware components for grid storage
The amount of scientific data generated by simulations or collected from large-scale experiments has reached levels that cannot be stored on a researcher's workstation or even in his/her local computer center. Such data are vital to large scientific collaborations dispersed over wide-area networks. In the past, the concept of a Grid infrastructure [1] mainly emphasized the computational aspect of supporting large distributed computational tasks, and optimizing the use of the network by using bandwidth reservation techniques. In this paper we discuss the concept of Storage Resource Managers (SRMs) as components that complement this with support for the storage management of large distributed datasets. Access to data is becoming the main bottleneck in such "data intensive" applications because the data cannot be replicated at all sites. SRMs can be used to dynamically optimize the use of storage resources to help unclog this bottleneck.
Performances of Multi-Level and Multi-Component Compressed Bitmap Indices
This paper presents a systematic study of two large subsets of bitmap indexing methods that use multi-component and multi-level encodings. Earlier studies on bitmap indexes are either empirical or for uncompressed versions only. Since most bitmap indexes in use are compressed, we set out to study the performance characteristics of these compressed indexes. To make the analyses manageable, we choose to use a particularly simple, but efficient, compression method called the Word-Aligned Hybrid (WAH) code. Using this compression method, a number of bitmap indexes are shown to be optimal because their worst-case time complexity for answering a query is a linear function of the number of hits. Since compressed bitmap indexes behave drastically differently from uncompressed ones, our analyses also lead to a number of new methods that are much more efficient than commonly used ones. As a validation of the analyses, we implement a number of the best methods and measure their performance against well-known indexes. The fastest new methods are predicted and observed to be 5 to 10 times faster than well-known indexing methods.
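As a rough illustration of what a multi-component encoding means, the sketch below builds an equality-encoded index in which each value is split into mixed-radix digits and every digit of every component gets its own bitmap. It is only a minimal sketch under stated assumptions: the bitmaps are uncompressed Python integers (no WAH compression is attempted), values are non-negative integers, and the names `build_multi_component_index`, `query_equal`, and `bases` are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def build_multi_component_index(values, bases):
    """Build an equality-encoded, multi-component bitmap index.

    Assumption: each value is a non-negative integer smaller than the
    product of `bases`. A value is decomposed into mixed-radix digits
    (least significant first). Component i keeps one bitmap per digit;
    bit r of a bitmap is set when row r has that digit.
    Bitmaps are uncompressed Python ints (no WAH compression here).
    """
    index = [defaultdict(int) for _ in bases]
    for row, value in enumerate(values):
        rest = value
        for comp, base in enumerate(bases):
            rest, digit = divmod(rest, base)
            index[comp][digit] |= 1 << row
        if rest != 0:
            raise ValueError(f"value {value} does not fit in bases {bases}")
    return index


def query_equal(index, bases, target, n_rows):
    """Return the rows whose value equals `target`.

    One bitmap per component is AND-ed together, so the number of
    bitmaps touched equals the number of components.
    """
    result = (1 << n_rows) - 1  # start with every row selected
    rest = target
    for comp, base in enumerate(bases):
        rest, digit = divmod(rest, base)
        result &= index[comp].get(digit, 0)
    if rest != 0:  # target lies outside the encodable range
        return []
    return [r for r in range(n_rows) if (result >> r) & 1]


values = [3, 7, 3, 11, 0, 7]
bases = (4, 3)  # two components covering the values 0..11
index = build_multi_component_index(values, bases)
matching_rows = query_equal(index, bases, 7, len(values))
```

With two components of sizes 4 and 3, the index stores at most 7 bitmaps instead of the 12 a one-component index would need, at the price of one extra AND per equality query.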
Replica Registration Service - Functional Interface Specification 1.0
The goal of the Replica Registration Service (RRS) is to provide a uniform interface to various file catalogs, replica catalogs, and metadata catalogs. It can be thought of as an abstraction of the concepts used in such systems to register files and their replicas. Some experiments may prefer to support their own file catalogs (which may have their own specialized structures, semantics, and implementations) rather than use a standard replica catalog. Providing an RRS that can interact with such a catalog (for example by invoking a script) can permit that catalog to be invoked as a service in the same way as other replica catalogs. If at a later time the experiment wishes to change to another file catalog or an RLS, it is only a matter of developing an RRS for that new catalog and replacing the existing catalog. In addition, some systems use metadata catalogs or other catalogs to manage the file name spaces. Our goal is to provide a single interface that supports the registration of files into such name spaces as well as the retrieval of this information.
A New Approach in Advance Network Reservation and Provisioning for High-Performance Scientific Data Transfers
Scientific applications already generate many terabytes and even petabytes of data from supercomputer runs and large-scale experiments. The need for transferring data chunks of ever-increasing sizes through the network shows no sign of abating. Hence, we need high-bandwidth, high-speed networks such as ESnet (Energy Sciences Network). Network reservation systems, such as ESnet's OSCARS (On-demand Secure Circuits and Advance Reservation System), establish secure virtual circuits with guaranteed bandwidth for a specified start time and duration. OSCARS checks network availability and capacity for the specified period of time, and allocates the requested bandwidth for that user if it is available. If the requested reservation cannot be granted, no further suggestion is returned to the user. Further, there is no possibility from the user's viewpoint to make an optimal choice. We report a new algorithm, where the user specifies the total volume that needs to be transferred, a maximum bandwidth that he/she can use, and a desired time period within which the transfer should be done. The algorithm can find alternate allocation possibilities, including earliest time for completion, or shortest transfer duration, leaving the choice to the user. We present a novel approach for path finding in time-dependent networks, and a new polynomial algorithm to find possible reservation options according to given constraints. We have implemented our algorithm for testing and incorporation into a future version of ESnet's OSCARS. Our approach provides a basis for provisioning end-to-end high performance data transfers over storage and network resources.
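The "earliest time for completion" option can be illustrated on a heavily simplified model. The sketch below is not the paper's algorithm: it assumes a single fixed path (no path finding in a time-dependent network), a piecewise-constant available bandwidth given as time slots, and a transfer rate that may change from slot to slot. The function name `earliest_completion` and the slot format are hypothetical; units are arbitrary but must be consistent (for example, seconds and megabytes per second).

```python
def earliest_completion(slots, volume, max_bw, t_start, t_end):
    """Earliest finish time for moving `volume` within [t_start, t_end].

    Assumptions (not from the paper): a single fixed path whose available
    bandwidth is piecewise constant. `slots` is a list of
    (start, end, available_bandwidth) tuples, sorted by time and
    non-overlapping. The transfer never uses more than `max_bw`.
    Returns None when the volume does not fit in the window.
    """
    remaining = volume
    for start, end, available in slots:
        lo, hi = max(start, t_start), min(end, t_end)
        if hi <= lo:
            continue  # slot lies outside the requested window
        rate = min(available, max_bw)
        if rate <= 0:
            continue  # nothing can be sent during this slot
        capacity = rate * (hi - lo)
        if capacity >= remaining:
            return lo + remaining / rate  # transfer finishes inside this slot
        remaining -= capacity
    return None


# Hypothetical availability: 2 units/s, then a fully booked period, then 5 units/s.
slots = [(0, 10, 2.0), (10, 20, 0.0), (20, 30, 5.0)]
finish = earliest_completion(slots, volume=50, max_bw=4, t_start=0, t_end=30)
too_large = earliest_completion(slots, volume=100, max_bw=4, t_start=0, t_end=30)
```

Sending greedily from the start of the window gives the earliest completion in this simplified setting, because delaying any data can only push the finish time later. A search for the shortest transfer duration would instead compare candidate start times.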
Breaking the Curse of Cardinality on Bitmap Indexes
Bitmap indexes are known to be efficient for ad-hoc range queries that are common in data warehousing and scientific applications. However, they suffer from the curse of cardinality, that is, their efficiency deteriorates as attribute cardinalities increase. A number of strategies have been proposed, but none of them addresses the problem adequately. In this paper, we propose a novel binned bitmap index that greatly reduces the cost to answer queries, and therefore breaks the curse of cardinality. The key idea is to augment the binned index with an Order-preserving Bin-based Clustering (OrBiC) structure. This data structure significantly reduces the I/O operations needed to resolve records that cannot be resolved with the bitmaps. To further improve the proposed index structure, we also present a strategy to create single-valued bins for frequent values. This strategy reduces index sizes and improves query processing speed. Overall, the binned indexes with OrBiC greatly improve the query processing speed, and are 3 to 25 times faster than the best available indexes for high-cardinality data.
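The interplay between bins and the records "that cannot be resolved with the bitmaps" can be sketched as follows. This is only an in-memory illustration loosely in the spirit of the description above, not the OrBiC structure itself: it assumes half-open bins, uncompressed Python integers as bitmaps, and a per-bin copy of the values so that the candidate check for a partially covered bin touches only that bin's values. The names `build_binned_index`, `range_query`, and `edges` are hypothetical.

```python
from bisect import bisect_right


def build_binned_index(values, edges):
    """Build one bitmap per bin plus a per-bin copy of the values.

    Assumption: bin b covers the half-open interval [edges[b], edges[b + 1]).
    `clustered[b]` holds the (row, value) pairs of bin b together, so a
    candidate check only needs to read the values of that one bin.
    """
    n_bins = len(edges) - 1
    bitmaps = [0] * n_bins
    clustered = [[] for _ in range(n_bins)]
    for row, value in enumerate(values):
        b = bisect_right(edges, value) - 1
        if not 0 <= b < n_bins:
            raise ValueError(f"value {value} falls outside the bins")
        bitmaps[b] |= 1 << row
        clustered[b].append((row, value))
    return bitmaps, clustered


def range_query(bitmaps, clustered, edges, lo, hi):
    """Return the sorted rows with lo <= value < hi."""
    hits = 0
    for b, bitmap in enumerate(bitmaps):
        b_lo, b_hi = edges[b], edges[b + 1]
        if b_hi <= lo or b_lo >= hi:
            continue  # bin does not overlap the query range
        if lo <= b_lo and b_hi <= hi:
            hits |= bitmap  # bin fully covered: the bitmap alone answers it
        else:
            # Edge bin: candidate check against the clustered values.
            for row, value in clustered[b]:
                if lo <= value < hi:
                    hits |= 1 << row
    return [r for r in range(hits.bit_length()) if (hits >> r) & 1]


values = [0.5, 2.5, 1.2, 3.9, 2.0, 1.8]
edges = [0, 1, 2, 3, 4]
bitmaps, clustered = build_binned_index(values, edges)
rows = range_query(bitmaps, clustered, edges, 1.5, 3.0)
```

Only the edge bin [1, 2) needs its values inspected here; the bin [2, 3) is answered from its bitmap alone.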
Grid collector: an event catalog with automated file management
High Energy Nuclear Physics (HENP) experiments such as STAR at BNL and ATLAS at CERN produce large amounts of data that are stored as files on mass storage systems in computer centers. In these files, the basic unit of data is an event. Analysis is typically performed on a selected set of events. The files containing these events have to be located, copied from mass storage systems to disks before analysis, and removed when no longer needed. These file management tasks are tedious and time consuming. Typically, all events contained in the files are read into memory before a selection is made. Since the time to read the events dominates the overall execution time, reading unwanted events needlessly increases the analysis time. The Grid Collector is a set of software modules that work together to address these two issues. It automates the file management tasks and provides "direct" access to the selected events for analyses. It is currently integrated with the STAR analysis framework. Users can select events based on tags, such as "production date between March 10 and 20, and the number of charged tracks > 100." The Grid Collector locates the files containing relevant events, transfers the files across the Grid if necessary, and delivers the events to the analysis code through the familiar iterators. There have been some research efforts to address the file management issues, but the Grid Collector is unique in that it addresses the event access issue together with the file management issues. This makes it more useful to a large variety of users.
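A tag-based selection like the one quoted above can be pictured with a small event catalog. The sketch below is purely illustrative and does not reflect the Grid Collector's actual interfaces: the `EventTag` record, the `select_events` function, the file names, and the year 2004 are all hypothetical, and the catalog is a plain in-memory list rather than an index.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EventTag:
    """Hypothetical per-event tag record kept in the catalog."""
    event_id: int
    file_name: str
    production_date: date
    n_charged_tracks: int


def select_events(catalog, predicate):
    """Group the IDs of matching events by the file that contains them.

    Only the files that appear in the result need to be staged, and only
    the listed events need to be read from each file.
    """
    by_file = {}
    for tag in catalog:
        if predicate(tag):
            by_file.setdefault(tag.file_name, []).append(tag.event_id)
    return by_file


# Made-up catalog entries for illustration only.
catalog = [
    EventTag(1, "run_a.root", date(2004, 3, 9), 150),
    EventTag(2, "run_a.root", date(2004, 3, 12), 240),
    EventTag(3, "run_b.root", date(2004, 3, 15), 80),
    EventTag(4, "run_c.root", date(2004, 3, 20), 101),
]

# "production date between March 10 and 20, and charged tracks > 100"
selected = select_events(
    catalog,
    lambda t: date(2004, 3, 10) <= t.production_date <= date(2004, 3, 20)
    and t.n_charged_tracks > 100,
)
```

In this toy catalog the file `run_b.root` is never touched, which is the point of selecting on tags before any file is transferred or read.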
Bulk Data Movement for Climate Dataset: Efficient Data Transfer Management with Dynamic Transfer Adjustment
Many scientific applications and experiments, such as high energy and nuclear physics, astrophysics, climate observation and modeling, combustion, nano-scale material sciences, and computational biology, generate extreme volumes of data with a large number of files. These data sources are distributed among national and international data repositories, and are shared by large numbers of geographically distributed scientists. A large portion of the data is frequently accessed, and a large volume of data is moved from one place to another for analysis and storage. One challenging issue in such efforts is the limited network capacity for moving large datasets to explore and manage. The Bulk Data Mover (BDM), a data transfer management tool in the Earth System Grid (ESG) community, has been managing massive dataset transfers efficiently with pre-configured transfer properties in environments where network bandwidth is limited. Dynamic transfer adjustment was studied to enhance the BDM so that it can handle significant end-to-end performance changes in a dynamic network environment and control data transfers to reach the desired transfer performance. We describe the results from the BDM transfer management for the climate datasets. We also describe the transfer estimation model and results from the dynamic transfer adjustment.
Parallel in situ indexing for data-intensive computing
As computing power increases exponentially, vast amounts of data are created by many scientific research activities. However, the bandwidth for storing the data to disks and reading the data from disks has been improving at a much slower pace. These two trends produce an ever-widening data access gap. Our work brings together two distinct technologies to address this data access issue: indexing and in situ processing. From decades of database research literature, we know that indexing is an effective way to address the data access issue, particularly for accessing a relatively small fraction of data records. As data sets increase in size, more and more analysts need to use selective data access, which makes indexing even more important for improving data access. The challenge is that most implementations of indexing technology are embedded in large database management systems (DBMS), but most scientific datasets are not managed by any DBMS. In this work, we choose to include indexes with the scientific data instead of requiring the data to be loaded into a DBMS. We use compressed bitmap indexes from the FastBit software, which are known to be highly effective for query-intensive workloads common to scientific data analysis. To use the indexes, we need to build them first. The index building procedure needs to access the whole data set and may also require a significant amount of compute time. In this work, we adapt the in situ processing technology to generate the indexes, thus removing the need to read data from disks and allowing the indexes to be built in parallel. The in situ data processing system used is ADIOS, a middleware for high-performance I/O. Our experimental results show that the indexes can improve the data access time by up to 200 times depending on the fraction of data selected, and that using the in situ data processing system can effectively reduce the time needed to create the indexes, by up to 10 times with our in situ technique when using identical parallel settings.
Finding regions of interest on toroidal meshes
Fusion promises to provide clean and safe energy, and a considerable amount of research effort is underway to turn this aspiration into reality. This work focuses on a building block for analyzing data produced from the simulation of microturbulence in magnetic confinement fusion devices: the task of efficiently extracting regions of interest. As in many other simulations where large amounts of data are produced, the careful study of "interesting" parts of the data is critical to gain understanding. In this paper, we present an efficient approach for finding these regions of interest. Our approach takes full advantage of the underlying mesh structure in magnetic coordinates to produce a compact representation of the mesh points inside the regions and an efficient connected component labeling algorithm for constructing regions from points. This approach scales linearly with the surface area of the regions of interest instead of the volume, as shown with both computational complexity analysis and experimental measurements. Furthermore, this new approach is hundreds of times faster than a recently published method based on Cartesian coordinates.
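Constructing regions from selected points is a connected component labeling problem. The sketch below shows the general idea with a union-find structure on a regular two-dimensional grid that wraps around in both directions. It is a simplification under stated assumptions, not the paper's method: the real mesh lives in magnetic coordinates and the paper uses a compact representation of the points, whereas here the points are plain `(i, j)` index pairs and the names `label_regions`, `n_theta`, and `n_phi` are hypothetical.

```python
def label_regions(points, n_theta, n_phi):
    """Label connected regions among selected points on a periodic grid.

    Assumption: the mesh is a regular n_theta x n_phi grid that wraps
    around in both directions, and two points are connected when they are
    direct neighbors along either index. Work is proportional to the
    number of selected points, not to the size of the whole grid.
    Returns a dict mapping each point to a representative point of its region.
    """
    parent = {p: p for p in points}

    def find(p):
        # Follow parent links to the root, halving the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (i, j) in list(parent):
        # Checking the "+1" neighbor in each direction is enough: the
        # "-1" neighbor is covered when that neighbor is visited itself.
        for q in (((i + 1) % n_theta, j), (i, (j + 1) % n_phi)):
            if q in parent:
                root_p, root_q = find((i, j)), find(q)
                if root_p != root_q:
                    parent[root_q] = root_p
    return {p: find(p) for p in parent}


selected = [(0, 0), (5, 0), (0, 5), (2, 2), (2, 3), (4, 4)]
labels = label_regions(selected, n_theta=6, n_phi=6)
n_regions = len(set(labels.values()))
```

The points `(5, 0)` and `(0, 5)` join the region of `(0, 0)` only because of the wrap-around, which is the feature that distinguishes a toroidal mesh from a flat one.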
- …