
    POOL File Catalog, Collection and Metadata Components

    The POOL project is the common persistency framework for the LHC experiments, storing petabytes of experiment data and metadata in a distributed and grid-enabled way. POOL is a hybrid event store consisting of a data streaming layer and a relational layer. This paper describes the design of the file catalog, collection and metadata components, which are not part of the data streaming layer of POOL, and outlines how POOL aims to provide transparent and efficient data access for a wide range of environments and use cases, ranging from a large production site down to a single disconnected laptop. The file catalog is the central POOL component translating logical data references to physical data files in a grid environment. POOL collections, with their associated metadata, provide an abstract way of accessing experiment data via their logical grouping into sets of related data objects.
    Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, Ca, USA, March 2003, 4 pages, 1 eps figure, PSN MOKT00
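
    A minimal sketch of the logical-to-physical lookup such a file catalog performs, assuming a toy in-memory catalog; the class, method names and LFN/PFN strings below are illustrative, not POOL's actual API:

        # Toy file catalog: one logical file name (LFN) may map to several
        # physical replicas (PFNs). Names are illustrative, not POOL's API.
        class FileCatalog:
            def __init__(self):
                self._replicas = {}  # LFN -> list of PFNs

            def register(self, lfn, pfn):
                self._replicas.setdefault(lfn, []).append(pfn)

            def lookup(self, lfn):
                """Return every physical replica registered for a logical name."""
                try:
                    return self._replicas[lfn]
                except KeyError:
                    raise LookupError("no replica registered for " + lfn)

        catalog = FileCatalog()
        catalog.register("lfn:/lhc/run42/events.root",
                         "rfio://tape.site-a.example/run42/events.root")
        catalog.register("lfn:/lhc/run42/events.root",
                         "file:/data/local/run42/events.root")
        print(catalog.lookup("lfn:/lhc/run42/events.root"))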

    Data processing model for the CDF experiment

    The data processing model for the CDF experiment is described. Data processing reconstructs events from parallel data streams taken with different combinations of physics event triggers and further splits the events into specialized physics datasets. The design of the processing control system faces strict requirements on bookkeeping records, which trace the status of data files and event contents during processing and storage. The computing architecture was updated to meet the mass data flow of the Run II data collection, recently upgraded to a maximum rate of 40 MByte/sec. The data processing facility consists of a large cluster of Linux computers, with data movement managed by the CDF data handling system to a multi-petaByte Enstore tape library. The latest processing cycle has achieved a stable speed of 35 MByte/sec (3 TByte/day). It can be readily scaled by increasing CPU and data-handling capacity as required.
    Comment: 12 pages, 10 figures, submitted to IEEE-TN
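
    The quoted figures are self-consistent: 35 MByte/sec sustained over the 86,400 seconds in a day is roughly 3 TByte/day. The splitting step amounts to routing each event to every dataset whose trigger it fired; the trigger names and mapping below are invented for illustration, not CDF's actual trigger tables:

        # Route events to physics datasets by fired trigger. An event that
        # fired several triggers lands in several datasets. Names invented.
        TRIGGER_TO_DATASET = {
            "HIGH_PT_ELECTRON": "electron_dataset",
            "HIGH_PT_MUON": "muon_dataset",
            "JET_100": "jet_dataset",
        }

        def split_events(events):
            datasets = {name: [] for name in set(TRIGGER_TO_DATASET.values())}
            for event in events:
                for trigger in event["triggers"]:
                    dataset = TRIGGER_TO_DATASET.get(trigger)
                    if dataset is not None:
                        datasets[dataset].append(event)
            return datasets

        events = [
            {"id": 1, "triggers": ["HIGH_PT_MUON"]},
            {"id": 2, "triggers": ["JET_100", "HIGH_PT_ELECTRON"]},
        ]
        print({name: [e["id"] for e in evs]
               for name, evs in split_events(events).items()})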

    Topic Maps as a Virtual Observatory tool

    One major component of the VO will be catalogs measuring gigabytes and terabytes, if not more. Some mechanism like XML will be used for structuring the information. However, such mechanisms are not good for information retrieval on their own. For retrieval we use queries. Topic Maps, which have recently started becoming popular, are excellent for segregating the information that results from a query. A Topic Map is a structured network of hyperlinks above an information pool. Different Topic Maps can form different layers above the same information pool and provide us with different views of it. This makes it possible to ask precise questions, helping us look for the gold needles in the proverbial haystack. Here we discuss the specifics of what Topic Maps are and how they can be implemented within the VO framework.
    URL: http://www.astro.caltech.edu/~aam/science/topicmaps/
    Comment: 11 pages, 5 eps figures, to appear in SPIE Annual Meeting 2001 proceedings (Astronomical Data Analysis), uses spie.st
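
    A minimal sketch of the core idea, topics layered over an information pool and resolving to occurrences in it, assuming an ad-hoc dictionary encoding; a real topic map would use the XTM/ISO 13250 syntax, and the documents and topic names here are invented:

        # Two independent layers: the pool holds the documents; the topic map
        # holds topics, their occurrences in the pool, and associations.
        information_pool = {
            "doc1": "Spectroscopic survey of quasars ...",
            "doc2": "Photometric catalog of galaxy clusters ...",
        }

        topic_map = {
            "topics": {"quasar": ["doc1"], "galaxy cluster": ["doc2"]},
            "associations": [("quasar", "related-to", "galaxy cluster")],
        }

        def occurrences(topic):
            """Resolve a topic to the documents it points at in the pool."""
            return [information_pool[doc_id]
                    for doc_id in topic_map["topics"].get(topic, [])]

        print(occurrences("quasar"))

    A second topic map over the same pool, say one organized by instrument rather than by object type, would give a different view without touching the underlying documents.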

    Distributed Computing Grid Experiences in CMS

    The CMS experiment is currently developing a computing system capable of serving, processing and archiving the large number of events that will be generated when the CMS detector starts taking data. During 2004 CMS undertook a large-scale data challenge to demonstrate the ability of the CMS computing system to cope with a sustained data-taking rate equivalent to 25% of the startup rate. Its goals were: to run CMS event reconstruction at CERN for a sustained period at a 25 Hz input rate; to distribute the data to several regional centers; and to enable data access at those centers for analysis. Grid middleware was utilized to help complete all aspects of the challenge. To continue to provide scalable access to the data from anywhere in the world, CMS is developing a layer of software that uses Grid tools to gain access to data and resources, and that aims to provide physicists with a user-friendly interface for submitting their analysis jobs. This paper describes the data challenge experience with Grid infrastructure and the current development of the CMS analysis system.

    Data production models for the CDF experiment

    The data production for the CDF experiment is conducted on a large Linux PC farm designed to meet the needs of data collection at a maximum rate of 40 MByte/sec. We present two data production models that exploit advances in computing and communication technology. The first production farm is a centralized system that has achieved a stable data processing rate of approximately 2 TByte per day. The recently upgraded farm has been migrated to the SAM (Sequential Access to data via Metadata) data handling system. The software and hardware of the CDF production farms have been successful in providing large computing and data throughput capacity to the experiment.
    Comment: 8 pages, 9 figures; presented at HPC Asia2005, Beijing, China, Nov 30 - Dec 3, 200
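
    The idea behind SAM, selecting files by declared metadata rather than by physical file name, can be sketched as follows; the attribute names and file names are invented for the example, not SAM's actual schema:

        # Select files by metadata predicates instead of physical names.
        FILE_METADATA = [
            {"file": "b_physics_0001.dat", "stream": "B", "run": 150001},
            {"file": "b_physics_0002.dat", "stream": "B", "run": 150002},
            {"file": "jet_0001.dat", "stream": "JET", "run": 150001},
        ]

        def select_files(**criteria):
            """Return files whose metadata matches every given attribute."""
            return [m["file"] for m in FILE_METADATA
                    if all(m.get(k) == v for k, v in criteria.items())]

        print(select_files(stream="B", run=150001))  # -> ['b_physics_0001.dat']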

    Heterogeneous Relational Databases for a Grid-enabled Analysis Environment

    Grid-based systems require a database access mechanism that can provide seamless, homogeneous access to the requested data through a virtual data access system, i.e. a system which keeps track of data stored in geographically distributed heterogeneous databases. Such a system should provide an integrated view of the data held in the different repositories while hiding the heterogeneity of the backend databases from client applications. This paper focuses on accessing data stored in disparate relational databases through a web service interface, and exploits the features of a Data Warehouse and Data Marts. We present a middleware that enables applications to access data stored in geographically distributed relational databases without being aware of their physical locations and underlying schema. A web service interface is provided to enable applications to access this middleware in a language- and platform-independent way. A prototype implementation was created based on Clarens [4], Unity [7] and POOL [8]. This ability to transparently access data stored in distributed relational databases is likely to prove very powerful for Grid users, especially for the scientific community wishing to collate and analyze data distributed over the Grid.
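
    A sketch of the virtual-data-access idea under stated assumptions: two in-memory sqlite databases stand in for the distributed backends, and a routing table holding the only location knowledge stands in for the middleware; the actual prototype was built on Clarens, Unity and POOL, not sqlite, and the table names are invented:

        import sqlite3

        # Two "geographically distributed" backends, simulated in memory.
        site_a = sqlite3.connect(":memory:")
        site_a.execute("CREATE TABLE runs (run INTEGER, events INTEGER)")
        site_a.execute("INSERT INTO runs VALUES (150001, 120000)")

        site_b = sqlite3.connect(":memory:")
        site_b.execute("CREATE TABLE calib (run INTEGER, constant REAL)")
        site_b.execute("INSERT INTO calib VALUES (150001, 1.002)")

        # The middleware's secret: which backend owns which logical table.
        LOCATION = {"runs": site_a, "calib": site_b}

        def query(logical_table, where=""):
            """Route a query to the owning backend; the caller never sees
            the physical location of the table."""
            backend = LOCATION[logical_table]
            return backend.execute(
                "SELECT * FROM " + logical_table + " " + where).fetchall()

        print(query("runs", "WHERE run = 150001"))
        print(query("calib"))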

    Technical Report: CSVM Ecosystem

    The CSVM format is derived from the CSV format and allows the storage of tabular-like data with a limited but extensible amount of metadata. This approach can help computer scientists because all the information needed to use the data later is included in the CSVM file; the format is particularly well suited to handling RAW data in many scientific fields and to serving as a canonical format. Experience with CSVM has shown that it greatly facilitates: data management without depending on databases; data exchange; the integration of RAW data into dataflows or calculation pipelines; and the search for best practices in RAW data management. The efficiency of this format is closely related to its plasticity: a generic frame is given for all kinds of data, and CSVM parsers do not interpret data types. That task is done by the application layer, so the same format and the same parser code can be used for many purposes. This document presents implementations of the CSVM format from ten years of use in different laboratories. Some programming examples are also shown: a Python toolkit for using, manipulating and querying the format is available. A first specification of the format (CSVM-1) is now defined, as well as some derivatives such as the CSVM dictionaries used for data interchange. CSVM is an open format and could serve as a support for Open Data and the long-term preservation of RAW or unpublished data.
    Comment: 31 pages including 2p of Anne
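
    Since the abstract does not spell out the CSVM-1 syntax, the sketch below assumes a '#'-prefixed "key: value" metadata frame and semicolon-separated data purely for illustration; the actual specification defines its own frame. Note that the parser deliberately leaves data types uninterpreted, as the format prescribes:

        import csv

        def parse_csvm(text):
            """Split a CSVM-like document into a metadata dict and raw data
            rows; type interpretation is left to the application layer."""
            metadata, data_lines = {}, []
            for line in text.splitlines():
                line = line.strip()          # tolerate indented input
                if line.startswith("#"):
                    key, _, value = line[1:].partition(":")
                    metadata[key.strip()] = value.strip()
                elif line:
                    data_lines.append(line)
            rows = list(csv.reader(data_lines, delimiter=";"))
            return metadata, rows

        doc = """# instrument: NMR-400
        # date: 2012-06-01
        sample;frequency
        A1;400.13
        A2;400.10
        """
        meta, rows = parse_csvm(doc)
        print(meta)  # {'instrument': 'NMR-400', 'date': '2012-06-01'}
        print(rows)  # [['sample', 'frequency'], ['A1', '400.13'], ['A2', '400.10']]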

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up works since its introduction. This article provides a comprehensive survey of the family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and that are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
    Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
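
    The model reduces a distributed computation to two user-supplied functions. The canonical word-count example below simulates the framework's shuffle phase in-process; a real framework would shard the input across machines, shuffle by key over the network and restart failed workers transparently:

        from collections import defaultdict

        def map_fn(document):
            # Emit an intermediate (key, value) pair per word in one record.
            for word in document.split():
                yield word, 1

        def reduce_fn(word, counts):
            # Combine every value emitted for one key.
            return word, sum(counts)

        def run_mapreduce(documents):
            shuffled = defaultdict(list)   # the shuffle phase: group by key
            for doc in documents:
                for key, value in map_fn(doc):
                    shuffled[key].append(value)
            return dict(reduce_fn(k, v) for k, v in shuffled.items())

        print(run_mapreduce(["big data big clusters", "data on clusters"]))
        # -> {'big': 2, 'data': 2, 'clusters': 2, 'on': 1}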