
    Incremental elasticity for array databases

    Relational databases benefit significantly from elasticity, whereby they execute on a changing set of hardware resources provisioned to match their storage and processing requirements. Such flexibility is especially attractive for scientific databases, whose users often follow a no-overwrite storage model in which data are deleted only when the available space is exhausted. This results in a database that grows steadily and must expand its hardware proportionally. Scientific databases also frequently store their data as multidimensional arrays optimized for spatial querying, which raises several novel challenges in clustered, skew-aware data placement on an elastic shared-nothing database. In this work, we design and implement elasticity for an array database. We address this challenge on two fronts: determining when to expand a database cluster and how to partition the data within it. In both steps we propose incremental approaches that affect a minimal set of data and nodes while maintaining high performance. We introduce an algorithm for gradually augmenting an array database's hardware using a closed-loop control system. After the cluster adds nodes, we optimize data placement for n-dimensional arrays; many of our elastic partitioners incrementally reorganize an array, redistributing data only to the new nodes. By combining these two tools, the scientific database efficiently and seamlessly manages its monotonically increasing hardware resources. (Funding: Intel Corporation, Science and Technology Center for Big Data.)
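
    The abstract does not give the paper's actual algorithms, so the following is only a minimal sketch of the two ideas it describes: a closed-loop trigger that adds nodes when storage utilization crosses a threshold, and an incremental repartitioner that moves array chunks only onto the newly added nodes. All names, thresholds, and data structures are hypothetical.

```python
# Minimal sketch of the two ideas in the abstract: a closed-loop trigger for
# adding nodes, and an incremental repartitioner that redistributes array
# chunks only onto the newly added nodes. All names, thresholds, and data
# structures are hypothetical; the paper's actual algorithms are not given
# in the abstract.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity_gb: float
    chunks: list = field(default_factory=list)  # (chunk_id, size_gb) pairs

    def used_gb(self) -> float:
        return sum(size for _, size in self.chunks)


def utilization(nodes: list[Node]) -> float:
    """Cluster-wide storage utilization in [0, 1]."""
    return sum(n.used_gb() for n in nodes) / sum(n.capacity_gb for n in nodes)


def control_loop_step(nodes: list[Node], high_water: float = 0.8,
                      grow_by: int = 1) -> list[Node]:
    """One closed-loop step: if utilization exceeds the high-water mark,
    provision a small increment of new nodes (gradual augmentation)."""
    new_nodes = []
    if utilization(nodes) > high_water:
        new_nodes = [Node(f"node-{len(nodes) + i}", capacity_gb=1000.0)
                     for i in range(grow_by)]
        nodes.extend(new_nodes)
    return new_nodes


def incremental_repartition(nodes: list[Node], new_nodes: list[Node]) -> None:
    """Move chunks only from overloaded old nodes onto the new nodes,
    leaving the rest of the existing placement untouched."""
    if not new_nodes:
        return
    target = utilization(nodes)  # rebalance toward the mean utilization
    for node in (n for n in nodes if n not in new_nodes):
        while node.chunks and node.used_gb() / node.capacity_gb > target:
            chunk = node.chunks.pop()
            min(new_nodes, key=lambda n: n.used_gb()).chunks.append(chunk)


# Example: one node at 90% capacity triggers growth, then sheds load.
cluster = [Node("node-0", 1000.0,
                [("c0", 300.0), ("c1", 300.0), ("c2", 300.0)])]
added = control_loop_step(cluster)       # 0.9 > 0.8, so node-1 is provisioned
incremental_repartition(cluster, added)  # c2 and c1 move; c0 stays on node-0
```

    Redistributing only onto the new nodes keeps most of the existing placement intact, which is the incremental property the abstract emphasizes.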

    Architecture and Knowledge Modelling for Smart City


    Design and Implementation of a Research Data Management System: The CRC/TR32 Project Database (TR32DB)

    Research data management (RDM) includes all processes and measures which ensure that research data are well-organised, documented, preserved, stored, backed up, accessible, available, and re-usable. Corresponding RDM systems or repositories form the technical framework that supports the collection, accurate documentation, storage, back-up, sharing, and provision of research data created in a specific environment, such as a research group or institution. The measures required to implement an RDM system vary with the discipline and the purpose of data (re-)use. Within RDM, the documentation of research data is an essential duty: it has to be carried out with accurate, standardized, and interoperable metadata to ensure the interpretability, understandability, shareability, and long-lasting usability of the data.

    RDM is gaining importance as the amount of digital information grows. New technologies make it possible to create ever more digital data, often automatically, and consequently the volume of digital data, including big data and small data, is expected to roughly double in size every two years. With regard to e-science, this increase was described and predicted as the data deluge, and the accompanying paradigm change in science has led to data-intensive science. Scientific data financed by public funding in particular are increasingly required by policy makers, funding agencies, journals, and other institutions to be archived, documented, provided, or even made openly accessible. RDM can prevent the loss of data; without it, around 80-90 % of the generated research data disappear and are not available for re-use or further studies, leading to empty archives and RDM systems. The reasons for this are well known and are of a technical, socio-cultural, and ethical nature, such as missing user participation and data-sharing knowledge, as well as a lack of time or resources. In addition, the fear of exploitation and the missing or limited reward for publishing and sharing data play an important role.

    This thesis presents an approach to handling the research data of the collaborative, multidisciplinary, long-term DFG-funded research project Collaborative Research Centre/Transregio 32 (CRC/TR32), “Patterns in Soil-Vegetation-Atmosphere Systems: Monitoring, Modelling, and Data Assimilation”. In this context an RDM system, the so-called CRC/TR32 project database (TR32DB), was designed and implemented. The TR32DB considers the demands of the project participants (e.g. heterogeneous data from different disciplines with various file sizes) and the requirements of the DFG, as well as general challenges in RDM. For this purpose, an RDM system was established that comprises a well-described, self-designed metadata schema, a file-based data storage, a well-elaborated database of metadata, and a corresponding user-friendly web interface. The whole system was developed in close cooperation with the Regional Computing Centre of the University of Cologne (RRZK), where it is also hosted.

    The documentation of the research data with accurate metadata is of key importance. For this purpose, a dedicated TR32DB Metadata Schema was designed, consisting of multi-level metadata properties. It distinguishes between general and data-type-specific (e.g. data, publication, report) properties and was developed according to the project background, the demands of the various data types, and recent associated metadata standards and principles. Consequently, it is interoperable with recent metadata standards such as Dublin Core, the DataCite Metadata Schema, and core elements of the ISO 19115:2003 Metadata Standard and the INSPIRE Directive. Furthermore, the schema supports optional, mandatory, and automatically generated metadata properties, and provides predefined, obligatory, and self-established controlled-vocabulary lists. The integrated mapping to the DataCite Metadata Schema facilitates the simple assignment of a Digital Object Identifier (DOI) to a dataset.

    The file-based data storage is organized in a folder system that corresponds to the structure of the CRC/TR32 and additionally distinguishes between several data types (e.g. data, publication, report). It is embedded in the Andrew File System hosted by the RRZK. The file system is capable of storing and backing up all data, is highly scalable, supports location independence, and enables easy administration via Access Control Lists. In addition, the relational database management system MySQL stores the metadata according to the previously mentioned TR32DB Metadata Schema, as well as further necessary administrative data. A user-friendly web-based graphical user interface provides access to the TR32DB system. The web interface supports metadata input, search, and download of data, while the visualization of important geodata is handled by an internal WebGIS. The web interface, like the entire RDM system, is self-developed and adjusted to the project's specific demands.

    Overall, the TR32DB system is developed according to the needs and requirements of the CRC/TR32 scientists, fits the demands of the DFG, and also considers general problems and challenges of RDM. With regard to the changing demands of the CRC/TR32 and to technological advances, the system is being, and will continue to be, developed further. The established TR32DB approach has already been applied successfully to another interdisciplinary research project. The approach is thus transferable and generally capable of archiving all data generated by the CRC/TR32 with accurate, interoperable metadata, ensuring the re-use of the data beyond the end of the project.
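
    The abstract notes that an integrated mapping to the DataCite Metadata Schema facilitates DOI assignment. As a rough illustration of what such a mapping involves, the sketch below converts a hypothetical internal TR32DB-style record into the mandatory DataCite kernel properties; the internal field names are invented, and the DataCite element names only approximate the kernel (exact names vary between schema versions).

```python
# Rough illustration of a metadata mapping for DOI assignment. The internal
# record fields below are invented, and the DataCite element names only
# approximate the kernel properties (exact names vary between schema
# versions); the real TR32DB mapping is not given in the abstract.

def to_datacite(record: dict) -> dict:
    """Map an internal TR32DB-style metadata record onto the mandatory
    DataCite kernel properties."""
    return {
        "identifier": {"identifierType": "DOI", "identifier": record["doi"]},
        "creators": [{"creatorName": name} for name in record["authors"]],
        "titles": [{"title": record["title"]}],
        "publisher": record.get("publisher",
                                "CRC/TR32 Project Database (TR32DB)"),
        "publicationYear": str(record["year"]),
        "resourceType": {
            "resourceTypeGeneral": "Dataset",
            # data-type-specific records (e.g. data, publication, report)
            "resourceType": record.get("data_type", "data"),
        },
    }


# Hypothetical record of the "data" type:
example = {
    "doi": "10.xxxx/tr32db.example",  # placeholder, not a real DOI
    "authors": ["Example, A.", "Example, B."],
    "title": "Soil moisture observations, test site",
    "year": 2014,
    "data_type": "data",
}
print(to_datacite(example))
```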

    Workshop on Cloud Services for File Synchronisation and Sharing

    Managing and sharing data stored in files is a challenge due to the data volumes produced by various scientific experiments [1]. While solutions such as Globus Online [2] focus on file transfer and synchronization, in this work we propose an additional layer of metadata over file resources which helps to categorize and structure the data, and makes it easier to integrate with web-based research gateways. The basic concept of the proposed solution [3] is a data model consisting of entities built from primitive types, such as numbers and texts, as well as from files and relationships among entities. This allows complex data-structure definitions to be built, mixing metadata and file data into a single model tailored to a given scientific field.

    A data model becomes actionable once it is deployed as a data repository, which the proposed framework does automatically using one of the available PaaS (platform-as-a-service) platforms; the repository is exposed to the world as a REST service that can be accessed from any computing site or personal computer over HTTP. Data stored in such a repository can be shared under various access policies (e.g. user-based or group-based) and managed from a wide range of applications. The repository is a self-contained application which can be scaled to improve transfer throughput and can integrate many underlying file storage technologies (currently the GridFTP protocol is supported). The generated REST interface allows data querying and file transfers directly from user web browsers, without going through additional servers; this is possible thanks to the CORS mechanism, which is now supported by all major web browsers, including mobile ones.

    Using a PaaS platform as a deployment base for the repository has the advantage that it can be extended with different metadata storage backends, which may be better suited to the metadata schemas of certain data models, while keeping the source model unchanged. The framework supports this through a plugin system for storage backends. Such a flexible approach allows the platform to be adapted to specific requirements without rewriting everything from scratch. A single web endpoint per repository gives end users and other services the impression of a cloud-based service (user credential delegation is also supported), while reusing the existing storage facilities maintained by computing centres.

    Acknowledgements. This research has been supported by the European Union within the European Regional Development Fund program POIG.02.03.00-00-096/10 as part of the PL-Grid Plus project.

    References
    [1] Witt, S.D., Sinclair, R., Sansum, A., Wilson, M.: Managing large data volumes from scientific facilities. ERCIM News 2012(89) (2012)
    [2] Foster, I.: Globus Online: Accelerating and Democratizing Science through Cloud-Based Services. IEEE Internet Computing, vol. 15, no. 3, pp. 70-73, May-June 2011
    [3] Harężlak, D., Kasztelnik, M., Pawlik, M., Wilk, B., Bubak, M.: A Lightweight Method of Metadata and Data Management with DataNet. In: Bubak, M., Kitowski, J., Wiatr, K. (eds.) eScience on Distributed Computing Infrastructure, Lecture Notes in Computer Science, vol. 8500, Springer International Publishing, 2014, pp. 164-17
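
    The abstract describes each deployed data model as being exposed as a REST service reachable over HTTP, with CORS making the same calls possible directly from a browser. The snippet below is a hypothetical usage sketch: the endpoint, routes, entity model, and authentication scheme are invented for illustration, and the real DataNet API may differ.

```python
# Hypothetical usage sketch of a generated repository's REST interface. The
# endpoint, routes, entity model, and authentication scheme below are all
# invented for illustration; the real DataNet API may differ.
import requests

BASE = "https://example-repository.example.org"  # placeholder endpoint
headers = {"Authorization": "Bearer <token>"}    # user-based access assumed

# Query entities of the deployed model by a metadata field.
resp = requests.get(f"{BASE}/experiments",
                    params={"instrument": "lidar-1"}, headers=headers)
resp.raise_for_status()
for entity in resp.json():
    print(entity["id"], entity.get("description"))

# Attach a file to an entity. Thanks to CORS, a browser-based research
# gateway could issue the same calls directly, without an extra server.
with open("measurement.dat", "rb") as f:
    requests.post(f"{BASE}/experiments/42/files",
                  files={"file": f}, headers=headers).raise_for_status()
```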