
    Semantically-aware data discovery and placement in collaborative computing environments

    As the size of scientific datasets and the demand for interdisciplinary collaboration grow in modern science, it becomes imperative that better ways of discovering and placing datasets generated across multiple disciplines be developed to facilitate interdisciplinary scientific research. For discovering relevant data in large-scale interdisciplinary datasets, the development and integration of cross-domain metadata is critical, as metadata serves as the key guideline for organizing data. To develop and integrate cross-domain metadata management systems in an interdisciplinary collaborative computing environment, three key issues need to be addressed: the development of a cross-domain metadata schema; the implementation of a metadata management system based on this schema; and the integration of the metadata system into existing distributed computing infrastructure. Current research in metadata management for distributed computing environments largely focuses on relatively simple schemas that lack the descriptive power to adequately address the semantic heterogeneity often found in interdisciplinary science, and it does not take adequate account of scalability in large-scale data management. Another key issue in data management is data placement: with the increasing size of scientific datasets, the overhead incurred by transferring data among different nodes grows into a significant factor affecting overall performance. Currently, few data placement strategies take into consideration semantic information about data content. In this dissertation, we propose a cross-domain metadata system for a collaborative distributed computing environment and identify and evaluate the key factors and processes involved in a successful cross-domain metadata system, with the goal of facilitating data discovery in collaborative environments. This will allow researchers to conduct interdisciplinary science on large-scale datasets, making interdisciplinary datasets easier to access, lowering barriers to collaboration, and reducing the cost of developing similar systems in the future. We also investigate data placement strategies that incorporate semantic information about the hardware and network environment, as well as domain information in the form of semantic metadata, so that semantic locality can be exploited in data placement, potentially reducing the overhead of accessing large-scale interdisciplinary datasets.
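
    As a concrete illustration of semantic locality in placement (a minimal sketch, not the dissertation's actual strategy), the following Python fragment co-locates datasets whose cross-domain metadata tags overlap, subject to node capacity. The Dataset and StorageNode classes and the Jaccard-overlap affinity measure are assumptions made for the example.

```python
# Minimal sketch of semantics-aware placement: datasets whose metadata tags
# overlap are co-located on the same storage node when capacity allows.
# All names (Dataset, StorageNode, tag sets) are illustrative, not from the thesis.

from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    size_gb: float
    tags: set                                   # cross-domain metadata terms

@dataclass
class StorageNode:
    name: str
    free_gb: float
    tags: set = field(default_factory=set)      # union of tags already hosted

def semantic_affinity(ds: Dataset, node: StorageNode) -> float:
    """Jaccard overlap between the dataset's tags and the node's hosted tags."""
    if not ds.tags or not node.tags:
        return 0.0
    return len(ds.tags & node.tags) / len(ds.tags | node.tags)

def place(ds: Dataset, nodes: list) -> StorageNode:
    """Pick a node with enough space and the highest semantic affinity."""
    candidates = [n for n in nodes if n.free_gb >= ds.size_gb]
    if not candidates:
        raise RuntimeError("no node has enough free space")
    best = max(candidates, key=lambda n: semantic_affinity(ds, n))
    best.free_gb -= ds.size_gb
    best.tags |= ds.tags
    return best

if __name__ == "__main__":
    nodes = [StorageNode("lsu", 100.0), StorageNode("latech", 500.0)]
    for ds in [Dataset("rain_2010", 40, {"hydrology", "rainfall"}),
               Dataset("rain_2011", 45, {"hydrology", "rainfall"}),
               Dataset("seismic_a", 60, {"geophysics", "seismic"})]:
        print(ds.name, "->", place(ds, nodes).name)
```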

    Asynchronous replication of metadata across multi-master servers in distributed data storage systems

    In recent years, scientific applications have become increasingly data intensive. The increase in the size of data generated by scientific applications necessitates collaboration and data sharing among the nation's education and research institutions. To address this, distributed storage systems spanning multiple institutions over wide area networks have been developed. One of the important features of distributed storage systems is providing a global unified name space across all participating institutions, which enables easy data sharing without knowledge of the actual physical location of the data. This feature depends on the "location metadata" of all data sets in the system being available to all participating institutions, which introduces new challenges. In this thesis, we study different metadata server layouts in terms of high availability, scalability and performance. A central metadata server is a single point of failure, leading to low availability. Ensuring high availability requires replication of metadata servers. A synchronously replicated metadata server layout introduces synchronization overhead, which degrades the performance of data operations. We propose an asynchronously replicated multi-master metadata server layout that ensures high availability and scalability and provides better performance. We discuss the implications of asynchronously replicated multi-master metadata servers for metadata consistency and conflict resolution. Further, we design and implement our own asynchronous multi-master replication tool, deploy it in the state-wide distributed data storage system PetaShare, and compare the performance of all three metadata server layouts: a central metadata server, synchronously replicated multi-master metadata servers, and asynchronously replicated multi-master metadata servers.
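
    To make the asynchronous multi-master idea concrete, the sketch below shows one possible shape of such a scheme in Python, with a last-writer-wins rule for conflicting metadata updates. The class names, fields, and tie-breaking rule are illustrative assumptions, not the thesis' actual conflict-resolution policy or the PetaShare implementation.

```python
# Hedged sketch of asynchronous multi-master metadata replication with a
# last-writer-wins conflict rule; the real tool may resolve conflicts differently.

import time
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Update:
    key: str          # logical path in the global name space
    value: str        # e.g. physical location of the replica
    ts: float         # wall-clock timestamp at the originating master
    site: str         # originating metadata server

@dataclass
class MetadataMaster:
    site: str
    store: dict = field(default_factory=dict)   # key -> winning Update
    log: list = field(default_factory=list)     # updates not yet shipped

    def write(self, key: str, value: str) -> None:
        upd = Update(key, value, time.time(), self.site)
        self._apply(upd)
        self.log.append(upd)                    # shipped asynchronously later

    def _apply(self, upd: Update) -> None:
        cur = self.store.get(upd.key)
        # Last-writer-wins; break timestamp ties deterministically by site id.
        if cur is None or (upd.ts, upd.site) > (cur.ts, cur.site):
            self.store[upd.key] = upd

    def replicate_to(self, other: "MetadataMaster") -> None:
        """Asynchronous push of the pending update log to another master."""
        for upd in self.log:
            other._apply(upd)
        self.log.clear()

if __name__ == "__main__":
    a, b = MetadataMaster("lsu"), MetadataMaster("uno")
    a.write("/petashare/data/run1", "lsu:/disk3/run1")
    b.write("/petashare/data/run1", "uno:/disk1/run1")   # concurrent conflict
    a.replicate_to(b); b.replicate_to(a)
    # Both masters converge on the same winning entry.
    assert a.store["/petashare/data/run1"] == b.store["/petashare/data/run1"]
```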

    PetaShare: A reliable, efficient and transparent distributed storage management system

    Modern collaborative science has placed an increasing burden on data management infrastructure to handle the increasingly large data archives being generated. Besides functionality, reliability and availability are also key factors in delivering a data management system that can efficiently and effectively meet the challenges posed and compounded by the unbounded increase in the size of data generated by scientific applications. We have developed a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified name space and efficient data movement across geographically distributed storage sites. At the front-end, it provides lightweight clients that enable easy, transparent and scalable access. In PetaShare, we have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability, and an advanced buffering system for improved data transfer performance. In this paper, we present the details of our design and implementation, show performance results, and describe our experience in developing a reliable and efficient distributed data management system for data-intensive science. © 2011 IOS Press and the authors. All rights reserved.

    Project Final Report: Ubiquitous Computing and Monitoring System (UCoMS) for Discovery and Management of Energy Resources


    Data transfer scheduling with advance reservation and provisioning

    Over the years, scientific applications have become more complex and more data intensive. Although distributed resources give institutions and organizations access to the resources needed for their large-scale applications, complex middleware is required to orchestrate the use of these storage and network resources between collaborating parties and to manage the end-to-end processing of data. We present a new data scheduling paradigm with advance reservation and provisioning. Our methodology provides a basis for provisioning end-to-end high-performance data transfers, which requires integration between system, storage and network resources, and coordination between reservation managers and data transfer nodes. This allows researchers, users and higher-level meta-schedulers to use data placement as a service, where they can plan ahead and reserve time and resources for their data movement operations. We present a novel approach for evaluating time-dependent structures with bandwidth-guaranteed paths, and a practical online scheduling model using advance reservation in dynamic networks with time constraints. In addition, we present a new polynomial-time algorithm that generates possible reservation options and alternatives for earliest completion and shortest transfer duration. We enhance the advance network reservation system by extending the underlying mechanism to provide a new service in which users submit their constraints and the system suggests possible reservation requests satisfying the users' requirements. We have studied scheduling of data transfer operations with resource and time conflicts, and have developed a new scheduling methodology that considers resource allocation at client sites and bandwidth allocation on the network links connecting resources. Other major contributions of our study include enhanced reliability, adaptability, and performance optimization of distributed data placement tasks. While designing this new data scheduling architecture, we also developed other important methodologies such as early error detection, failure awareness, job aggregation, and dynamic adaptation of distributed data placement tasks. The adaptive tuning includes dynamically setting data transfer parameters and controlling utilization of available network capacity. Our research aims to provide middleware that alleviates the data bottleneck in high-performance computing systems.
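
    As a rough illustration of reserving bandwidth over time (not the dissertation's polynomial algorithm), the sketch below walks a time-slotted residual-bandwidth profile on a reserved path and reports the earliest slot by which a requested volume completes, together with the per-slot bandwidth it would reserve. The slot length, units and profile values are invented for the example.

```python
# Illustrative sketch: earliest-completion check against a time-slotted
# residual-bandwidth profile for an advance reservation request.

def earliest_completion(volume_gb, slots, slot_len_s=300):
    """
    slots: residual bandwidth per slot in Gb/s, starting at the earliest
    allowed start time. Returns (finishing_slot_index, reserved_profile)
    or None if the transfer does not fit in the horizon.
    """
    remaining = volume_gb * 8                    # convert GB to Gb
    reserved = []
    for i, bw in enumerate(slots):
        take = min(bw, remaining / slot_len_s)   # bandwidth reserved in this slot
        reserved.append(take)
        remaining -= take * slot_len_s
        if remaining <= 1e-9:
            return i, reserved                   # transfer finishes within slot i
    return None

if __name__ == "__main__":
    # Ten five-minute slots; residual bandwidth drops while another reservation runs.
    profile = [10, 10, 2, 2, 10, 10, 10, 10, 10, 10]
    print(earliest_completion(500, profile))     # 500 GB over this profile
```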

    Choosing between remote I/O versus staging in distributed environments

    Today, scientific applications and experiments have become increasingly complex and more demanding in terms of their computational and data requirements. The amount of data generated and used has grown at a very rapid rate. Tens or hundreds of terabytes of data for a single application are very common today; petabytes and even exabytes of data will be very common in a few years. One of the major challenges in distributed computing environments is how to access these large datasets remotely over the network. Data staging and remote I/O are the most widely used data access methods for distributed applications. Application developers generally choose one over the other intuitively, without making any scientific comparison specific to their applications, since there is no generic model available that they can use. In this thesis, we develop generic models and set guidelines for application developers to help them choose the most appropriate data access method for their application. We define the parameters that potentially affect the end-to-end performance of distributed applications which need to access remote data. To achieve our goal, we implement a series of synthetic benchmark applications to simulate different data access patterns. We run these benchmark applications on different distributed computing settings with different parameters, such as network bandwidth, server and client capabilities, and data access ratio. We also use different remote I/O protocols to show the importance of the protocol in making a decision. We use regression analysis to develop applicable generic models for comparing different data access methods, and test our models in a real-life application. The main contribution of our thesis is a set of generic models that can be applied to most data-intensive distributed applications to decide the best data access technique for those applications. Our models give scientists and application developers an opportunity to choose the best data access method before actually running the application.
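
    The sketch below illustrates how such a model might be used in practice: plug application and environment parameters into per-method time estimates and pick the cheaper method. The cost formulas and example numbers are stand-ins; in the thesis the estimates come from regression models fitted to benchmark runs.

```python
# Illustrative decision helper in the spirit of the thesis' approach:
# estimate end-to-end time for staging vs. remote I/O and pick the smaller.
# The formulas and the example figures are invented for this sketch.

def t_staging(size_gb, wan_mbps, local_io_mbps, access_ratio):
    """Stage the whole dataset, then read the needed fraction from local disk."""
    transfer = size_gb * 8192 / wan_mbps                 # GB -> Mb over the WAN
    local_read = access_ratio * size_gb * 8192 / local_io_mbps
    return transfer + local_read

def t_remote_io(size_gb, wan_mbps, access_ratio, per_op_latency_s, n_ops):
    """Read only the accessed fraction over the network, paying protocol latency."""
    transfer = access_ratio * size_gb * 8192 / wan_mbps
    return transfer + per_op_latency_s * n_ops

def choose(size_gb, wan_mbps, local_io_mbps, access_ratio, per_op_latency_s, n_ops):
    ts = t_staging(size_gb, wan_mbps, local_io_mbps, access_ratio)
    tr = t_remote_io(size_gb, wan_mbps, access_ratio, per_op_latency_s, n_ops)
    return ("staging" if ts <= tr else "remote I/O"), round(ts), round(tr)

if __name__ == "__main__":
    # 100 GB dataset, 1 Gb/s WAN, 4 Gb/s local disk, only 10% of the data read.
    print(choose(100, 1000, 4000, 0.10, 0.02, 5000))
```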

    Generation of game contents by social media analysis and MAS planning

    In the age of pervasive computing and social networks, it has become commonplace to retrieve opinions about digital content in games. In the case of multi-player, open-world gaming, and in fact even in "old-school" single-player games, there is an evident need to add new features to a game based on users' comments and needs. However, this is a challenging task that usually requires considerable design and programming effort and ever more game patches, with the inevitable consequence of players losing interest in the game over the years. This is a particularly hard problem for all games that are not intended to be designed as interactive novels. Procedural Content Generation (PCG) could be a solution to this problem, but such techniques are usually used to design new maps or graphical content. Here we propose a novel PCG technique able to introduce new content into games by means of new story lines and quests. We introduce new intelligent agents and events into the world: their attitudes and behaviors promote new actions in the game, leading to the involvement of players in new gaming content. The whole methodology is driven by social media analysis of content about the game and by the use of formal planning techniques based on multi-agent models.
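
    A toy sketch of the pipeline's overall shape (not the paper's implementation): mine player comments for the most requested theme, then let a tiny forward planner chain agent actions into a quest serving that theme. The keyword lexicon, the STRIPS-style action list and the goal are all invented for the example.

```python
# Toy pipeline: social-media mining picks a theme, a naive planner builds a quest.

THEMES = {"dragon": ["dragon", "wyvern"], "heist": ["steal", "thief", "heist"]}

def dominant_theme(comments):
    """Score each theme by keyword hits across comments and return the winner."""
    scores = {t: 0 for t in THEMES}
    for c in comments:
        for theme, words in THEMES.items():
            scores[theme] += sum(w in c.lower() for w in words)
    return max(scores, key=scores.get)

# STRIPS-style actions: (name, preconditions, effects)
ACTIONS = [
    ("meet_informant", set(),           {"knows_lair"}),
    ("travel_to_lair", {"knows_lair"},  {"at_lair"}),
    ("defeat_dragon",  {"at_lair"},     {"dragon_defeated"}),
]

def plan(goal, state=frozenset()):
    """Naive forward chaining over the action list; fine for a handful of actions."""
    steps, state, progress = [], set(state), True
    while goal not in state and progress:
        progress = False
        for name, pre, eff in ACTIONS:
            if pre <= state and not eff <= state:
                steps.append(name); state |= eff; progress = True
    return steps if goal in state else None

if __name__ == "__main__":
    comments = ["please add a dragon boss!", "more dragons", "a heist would be fun"]
    if dominant_theme(comments) == "dragon":
        print(plan("dragon_defeated"))
```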

    Semantic technologies for the domain specific and formal description of time series in databases

    Measurement data, in the form of time series from scientific experiments, is stored in relational databases for efficient processing. In parallel, Semantic Web technologies have been developed in recent years for describing domain knowledge in the form of ontologies. Due to their open architecture, foreign ontologies and resources can easily be referenced and integrated. Since Semantic Web technologies are based on predicate logic, they are suitable for formal modeling, and implicit knowledge can therefore be derived from ontologies by reasoning. This work introduces semantic (database) annotations that link databases and ontologies, so that database contents can be described with Semantic Web technologies and both technologies can be used together. An architecture for the combined handling and usage of the two technologies is developed, designed with the scalability of large amounts of measurement data in mind. Based on this architecture, concepts for visualizing and interacting with annotations are introduced. Furthermore, semantic annotations are used to formally model events in time series using finite state machines, which are evaluated by reasoning. An implementation demonstrates the feasibility and advantages of the discussed concepts: the application Semantic Database Browser allows semantic database annotations to be used and worked with interactively, for integrated handling of formally described measurement data. Formal models can easily be exchanged and reused, which supports the reuse of knowledge and cooperation, and describing measurement data with models makes the data much easier to understand. Using the example of events during car drives, it is demonstrated how the formal description can be used for automatic reasoning to generate additional knowledge about driving maneuvers without any extra effort. Because the presented approaches are domain independent, they can easily be adapted to other use cases.
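
    As an illustration of what such a semantic database annotation might look like in practice (a minimal sketch using rdflib; the vocabulary IRIs and property names are invented for the example and are not the thesis' actual ontology), the fragment below ties a relational column holding a driving time series to a domain concept, so that the column can be found and interpreted by meaning rather than by name.

```python
# Minimal sketch of a "semantic database annotation": RDF triples that tie a
# relational column to a domain ontology concept. All IRIs are made up.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX  = Namespace("http://example.org/annotation#")   # hypothetical annotation vocabulary
CAR = Namespace("http://example.org/driving#")      # hypothetical driving-domain ontology

g = Graph()
g.bind("ex", EX)
g.bind("car", CAR)

# The annotated database column: table `trips`, column `speed`.
col = URIRef("http://example.org/db/trips#speed")
g.add((col, RDF.type, EX.AnnotatedColumn))
g.add((col, EX.table, Literal("trips")))
g.add((col, EX.column, Literal("speed")))
g.add((col, EX.unit, Literal("km/h")))
g.add((col, EX.denotes, CAR.VehicleSpeed))           # link to the domain concept
g.add((CAR.VehicleSpeed, RDFS.label, Literal("vehicle speed", lang="en")))

print(g.serialize(format="turtle"))
```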

    The Study of pT Dependence of Dijet Azimuthal Decorrelations in Proton-Proton Collision at Center of Mass Energy = 7 TeV

    The transverse momentum (pT) dependence of azimuthal decorrelations in dijet events is studied with data corresponding to an integrated luminosity of ∫ L dt = (36 ± 4) pb⁻¹, collected from proton-proton collisions at a center-of-mass energy of √s = 7 TeV using the ATLAS detector at the Large Hadron Collider. The results of the analysis of jets in the central rapidity range |y| < 0.8 with pT in the range 60 GeV < pT < 1200 GeV are presented. A new observable RΔφ, defined as the fraction of the total dijet cross section corresponding to a particular range of opening angles between the two jets with the highest pT, is measured as a differential quantity in the pT of the highest-pT jet. The results of the analysis are compared with next-to-leading-order perturbative QCD calculations as well as predictions from different Monte Carlo generators, including PYTHIA and ALPGEN, and good agreement is found.
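
    Based on the verbal definition in the abstract, the observable can be written as a ratio of differential dijet cross sections, with Δφ_max the upper edge of the chosen opening-angle range and pT the transverse momentum of the leading jet:

```latex
R_{\Delta\phi}\left(p_T, \Delta\phi_{\max}\right)
  = \frac{\mathrm{d}\sigma_{\mathrm{dijet}}\left(\Delta\phi_{\mathrm{dijet}} < \Delta\phi_{\max}\right) / \mathrm{d}p_T}
         {\mathrm{d}\sigma_{\mathrm{dijet}} / \mathrm{d}p_T}
```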