POOL File Catalog, Collection and Metadata Components
The POOL project is the common persistency framework for the LHC experiments
to store petabytes of experiment data and metadata in a distributed and grid-enabled way. POOL is a hybrid event store consisting of a data streaming layer
and a relational layer. This paper describes the design of file catalog,
collection and metadata components which are not part of the data streaming
layer of POOL and outlines how POOL aims to provide transparent and efficient
data access for a wide range of environments and use cases, ranging from a large production site down to a single disconnected laptop. The file catalog
is the central POOL component translating logical data references to physical
data files in a grid environment. POOL collections with their associated
metadata provide an abstract way of accessing experiment data via their logical
grouping into sets of related data objects.

Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, CA, USA, March 2003; 4 pages, 1 EPS figure, PSN MOKT00
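To make the catalog's role concrete, here is a minimal Python sketch of the logical-to-physical translation a file catalog performs. The names (FileCatalog, register, add_replica, lookup) and the GUID-keyed replica table are illustrative assumptions, not POOL's actual API.

```python
import uuid

class FileCatalog:
    """Toy file catalog: maps logical file names (LFNs) to one or more
    physical file names (PFNs) via a unique file identifier.
    Hypothetical sketch, not the real POOL component."""

    def __init__(self):
        self._lfn_to_id = {}   # logical name -> file identifier
        self._id_to_pfns = {}  # file identifier -> physical replicas

    def register(self, lfn, pfn):
        """Register a new logical file with its first physical replica."""
        fid = str(uuid.uuid4())
        self._lfn_to_id[lfn] = fid
        self._id_to_pfns[fid] = [pfn]
        return fid

    def add_replica(self, lfn, pfn):
        """Attach a further physical replica to an existing logical file."""
        self._id_to_pfns[self._lfn_to_id[lfn]].append(pfn)

    def lookup(self, lfn):
        """Translate a logical reference into its physical replicas."""
        return self._id_to_pfns[self._lfn_to_id[lfn]]

catalog = FileCatalog()
catalog.register("lfn:/lhc/run1/events.root", "file:/data/events.root")
catalog.add_replica("lfn:/lhc/run1/events.root", "gsiftp://site-b/events.root")
print(catalog.lookup("lfn:/lhc/run1/events.root"))
```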
Data processing model for the CDF experiment
The data processing model for the CDF experiment is described. Data
processing reconstructs events from parallel data streams taken with different
combinations of physics event triggers and further splits the events into specialized physics datasets. The design of the processing control
system faces strict requirements on bookkeeping records, which trace the status
of data files and event contents during processing and storage. The computing
architecture was updated to meet the mass data flow of the Run II data
collection, recently upgraded to a maximum rate of 40 MByte/sec. The data
processing facility consists of a large cluster of Linux computers with data
movement managed by the CDF data handling system to a multi-petabyte Enstore
tape library. The latest processing cycle has achieved a stable speed of 35
MByte/sec (3 TByte/day). It can be readily scaled by increasing CPU and
data-handling capacity as required.

Comment: 12 pages, 10 figures, submitted to IEEE-TN
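The splitting step described above can be pictured as routing each reconstructed event into every dataset whose trigger selection it satisfies. A minimal sketch, assuming a hypothetical trigger-to-dataset table; the trigger names are illustrative and the real CDF bookkeeping is far more elaborate.

```python
# Hypothetical mapping from physics dataset to the trigger names that
# select events for it (illustrative names, not CDF's actual triggers).
DATASET_TRIGGERS = {
    "high_pt_electron": {"ELECTRON_CENTRAL_18"},
    "high_pt_muon": {"MUON_CMUP_18"},
    "jet_samples": {"JET_20", "JET_50"},
}

def split_into_datasets(events):
    """Route each event into every dataset whose trigger fired.
    An event may land in several datasets, or in none."""
    datasets = {name: [] for name in DATASET_TRIGGERS}
    for event in events:
        fired = set(event["triggers"])
        for name, required in DATASET_TRIGGERS.items():
            if fired & required:  # any selecting trigger fired
                datasets[name].append(event)
    return datasets

events = [
    {"id": 1, "triggers": ["ELECTRON_CENTRAL_18", "JET_20"]},
    {"id": 2, "triggers": ["MUON_CMUP_18"]},
]
for name, evs in split_into_datasets(events).items():
    print(name, [e["id"] for e in evs])
```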
Topic Maps as a Virtual Observatory tool
One major component of the VO will be catalogs measuring gigabytes and terabytes, if not more. A mechanism like XML will be used for structuring
the information. However, such mechanisms are not good for information
retrieval on their own. For retrieval we use queries. Topic Maps, which have recently become popular, are excellent for segregating the information that results from a query. A Topic Map is a structured network of hyperlinks
above an information pool. Different Topic Maps can form different layers above
the same information pool and provide us with different views of it. This makes it possible to ask exact questions, helping us find the gold needles in the proverbial haystack. Here we discuss the specifics of what Topic
Maps are and how they can be implemented within the VO framework.
URL: http://www.astro.caltech.edu/~aam/science/topicmaps/

Comment: 11 pages, 5 EPS figures, to appear in SPIE Annual Meeting 2001 proceedings (Astronomical Data Analysis), uses spie.st
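To sketch the idea of layered Topic Maps above one information pool, here is a toy Python model. The class and method names are illustrative assumptions, not the Topic Maps standard's API, and the astronomical topics are made up for the example.

```python
# Minimal topic-map sketch: topics point into an underlying information
# pool via "occurrences", and associations link topics to each other.
# Different TopicMap instances form different layers over the same pool.

class TopicMap:
    def __init__(self, name):
        self.name = name
        self.occurrences = {}    # topic -> list of resource URLs
        self.associations = []   # (topic_a, relation, topic_b)

    def add_occurrence(self, topic, resource):
        self.occurrences.setdefault(topic, []).append(resource)

    def add_association(self, a, relation, b):
        self.associations.append((a, relation, b))

    def related(self, topic):
        """Topics directly associated with the given topic."""
        return [(r, b) for a, r, b in self.associations if a == topic] + \
               [(r, a) for a, r, b in self.associations if b == topic]

# Two independent layers (views) over the same information pool:
by_type = TopicMap("by-object-type")
by_type.add_occurrence("quasar", "http://archive.example/catalog/q123")
by_type.add_association("quasar", "subclass-of", "active-galactic-nucleus")

by_survey = TopicMap("by-survey")
by_survey.add_occurrence("SDSS", "http://archive.example/catalog/q123")

print(by_type.related("quasar"))
```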
Distributed Computing Grid Experiences in CMS
The CMS experiment is currently developing a computing system capable of serving, processing and archiving the large number of events that will be generated when the CMS detector starts taking data. During 2004 CMS undertook a large-scale data challenge to demonstrate the ability of the CMS computing system to cope with a sustained data-taking rate equivalent to 25% of the startup rate. Its goals were: to run CMS event reconstruction at CERN for a sustained period at a 25 Hz input rate; to distribute the data to several regional centers; and to enable data access at those centers for analysis. Grid middleware was utilized to help complete all aspects of the challenge. To continue to provide scalable access from anywhere in the world to the data, CMS is developing a layer of software that uses Grid tools to gain access to data and resources, and that aims to provide physicists with a user-friendly interface for submitting their analysis jobs. This paper describes the data challenge experience with Grid infrastructure and the current development of the CMS analysis system.
Data production models for the CDF experiment
The data production for the CDF experiment is conducted on a large Linux PC
farm designed to meet the needs of data collection at a maximum rate of 40
MByte/sec. We present two data production models that exploit advances in
computing and communication technology. The first production farm is a
centralized system that has achieved a stable data processing rate of
approximately 2 TByte per day. The recently upgraded farm has been migrated to the SAM (Sequential Access to data via Metadata) data handling system. The software and hardware of the CDF production farms have been successful in providing large computing and data throughput capacity to the experiment.

Comment: 8 pages, 9 figures; presented at HPC Asia 2005, Beijing, China, Nov 30 - Dec 3, 2005
Heterogeneous Relational Databases for a Grid-enabled Analysis Environment
Grid-based systems require a database access mechanism that can provide seamless, homogeneous access to the requested data through a virtual data access system, i.e. a system that tracks data stored in geographically distributed, heterogeneous databases. Such a system should provide an integrated view of the data held in the different repositories by means of a virtual data access mechanism that hides the heterogeneity of the backend databases from client applications. This paper focuses on accessing data stored in disparate relational databases through a web service interface, and exploits the features of a Data Warehouse and Data Marts. We present a middleware that enables applications to access data stored in geographically distributed relational databases without being aware of their physical locations and underlying schemas. A web service interface allows applications to access this middleware in a language- and platform-independent way. A prototype implementation was created based on Clarens [4], Unity [7] and POOL [8]. The ability to access data stored in distributed relational databases transparently is likely to be a very powerful one for Grid users, especially the scientific community wishing to collate and analyze data distributed over the Grid.
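A minimal sketch of the virtual data access idea follows, with two in-memory SQLite databases standing in for geographically distributed backends. The mediator class and the logical dataset names are hypothetical, not the actual interfaces of the Clarens/Unity/POOL prototype.

```python
import sqlite3

# A mediator maps logical dataset names to whichever backend holds them,
# hiding location and native schema from the caller. Real deployments
# would sit behind a web service and talk to remote RDBMSs; two
# in-memory SQLite databases stand in for them here.

class VirtualDataAccess:
    def __init__(self):
        self._routes = {}  # logical name -> (connection, native SQL)

    def register(self, logical_name, conn, native_sql):
        self._routes[logical_name] = (conn, native_sql)

    def query(self, logical_name):
        conn, sql = self._routes[logical_name]
        return conn.execute(sql).fetchall()

site_a = sqlite3.connect(":memory:")
site_a.execute("CREATE TABLE run_summary (run INTEGER, events INTEGER)")
site_a.execute("INSERT INTO run_summary VALUES (1001, 52000)")

site_b = sqlite3.connect(":memory:")
site_b.execute("CREATE TABLE runs (run_no INTEGER, n_evt INTEGER)")
site_b.execute("INSERT INTO runs VALUES (2002, 48000)")

vda = VirtualDataAccess()
# Different backend schemas are normalized to one logical view.
vda.register("runs@site_a", site_a, "SELECT run, events FROM run_summary")
vda.register("runs@site_b", site_b, "SELECT run_no, n_evt FROM runs")

for name in ("runs@site_a", "runs@site_b"):
    print(name, vda.query(name))
```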
Technical Report: CSVM Ecosystem
The CSVM format is derived from the CSV format and allows the storage of tabular data with a limited but extensible amount of metadata. This approach can help computer scientists because all the information needed to subsequently use the data is included in the CSVM file; it is particularly well suited for handling RAW data in many scientific fields and for use as a canonical format. The use of CSVM has shown that it greatly facilitates: data management independent of databases; data exchange; the integration of RAW data into dataflows or calculation pipelines; and the search for best practices in RAW data management. The efficiency of this format is closely tied to its plasticity: a generic frame is given for all kinds of data, and CSVM parsers do not interpret data types. That task is done by the application layer, so the same format and the same parser code can be used for many purposes. In this document, implementations of the CSVM format over ten years and in different laboratories are presented. Some programming examples are also shown: a Python toolkit for using, manipulating and querying the format is available. A first specification of this format (CSVM-1) is now defined, as well as some derivatives such as CSVM dictionaries used for data interchange. CSVM is an Open Format and can serve as a support for Open Data and the long-term preservation of RAW or unpublished data.

Comment: 31 pages including 2p of Anne
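The abstract does not give the concrete CSVM syntax, so the toy reader below assumes metadata lines of the form '# key: value' followed by a semicolon-separated CSV body; that layout and the name parse_csvm are assumptions. It does preserve one stated property: the parser leaves all type interpretation to the application layer.

```python
import csv
import io

def parse_csvm(text):
    """Toy CSVM-style reader. ASSUMED layout: '# key: value' metadata
    lines followed by a plain CSV body. Field values stay as strings;
    as in CSVM, type interpretation is left to the application layer,
    not the parser."""
    metadata, body = {}, []
    for line in text.splitlines():
        if line.startswith("#"):
            key, _, value = line[1:].partition(":")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            body.append(line)
    rows = list(csv.reader(io.StringIO("\n".join(body)), delimiter=";"))
    return metadata, rows

sample = """\
# title: calibration run 42
# units: mV;s
channel;time
12.5;0.001
13.1;0.002
"""
meta, rows = parse_csvm(sample)
print(meta)  # {'title': 'calibration run 42', 'units': 'mV;s'}
print(rows)  # [['channel', 'time'], ['12.5', '0.001'], ['13.1', '0.002']]
```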
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program, such as data distribution, scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework, built for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.

Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
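The programming model underlying all the surveyed systems reduces to two user-supplied functions. Below is a minimal single-process word-count sketch in Python; the distribution, shuffling and fault-tolerance machinery that real frameworks provide is deliberately elided.

```python
from collections import defaultdict
from itertools import chain

# Minimal single-process illustration of the MapReduce programming model:
# the user supplies only map and reduce; the framework handles data
# distribution, scheduling and fault tolerance (all elided here).

def map_fn(document):
    """map: input record -> list of (key, value) pairs."""
    return [(word, 1) for word in document.split()]

def reduce_fn(key, values):
    """reduce: (key, all values for that key) -> output pair."""
    return key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # "Shuffle" phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map(map_fn, inputs)):
        groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```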