    A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing

    Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases. We then provide comprehensive taxonomies that cover various aspects of architecture, data transportation, data replication and resource allocation and scheduling. Finally, we map the proposed taxonomy to various Data Grid systems not only to validate the taxonomy but also to identify areas for future exploration. Through this taxonomy, we aim to categorise existing systems to better understand their goals and their methodology. This would help evaluate their applicability for solving similar problems. This taxonomy also provides a "gap analysis" of this area through which researchers can potentially identify new issues for investigation. Finally, we hope that the proposed taxonomy and mapping also help to provide an easy way for new practitioners to understand this complex area of research. Comment: 46 pages, 16 figures, Technical Repor

    High-performance techniques applied to the design of complex railway infrastructures (Técnicas de altas prestaciones aplicadas al diseño de infraestructuras ferroviarias complejas)

    This work focuses on the design of overhead air switches. The design of railway infrastructure is an important problem in the railway world: non-optimal designs limit train speed and, more importantly, cause malfunctions and breakages. Most railway companies have regulations for the design of these elements. Those regulations have been shaped by decades of experience but, as far as we know, there are no software tools that assist with designing and testing optimal solutions for overhead switches. The aim of this thesis is the design, implementation, and evaluation of a simulator that explores the entire space of possible solutions, looking for the set of optimal solutions in the shortest time and at the lowest possible cost. Simulators are frequently used in the world of rail infrastructure. Many of them focus only on simulating scenarios predefined by the user, analyzing whether the proposed design is feasible. Throughout this thesis, we propose a framework for a complete simulator that is able to propose, simulate and evaluate multiple solutions. This framework is based on four pillars: a compromise between simulation accuracy and simulator complexity; automatic generation of possible solutions (automatic exploration of the solution space); consideration of all the actors involved in the design process (standards, additional restrictions, etc.); and, finally, the expert’s knowledge and the integration of optimization metrics. Once the framework is defined, several deployment options are presented. The first is a purely threaded version that runs on a single node: one thread is launched per CPU available in the system, and the whole simulator is designed around this paradigm of parallelism. The second is designed for deployment on a cluster with several nodes rather than on a single node; it uses MPI, so the simulator can be adapted to the cluster on which it runs. The last uses a paradigm based on cloud computing, employing more or fewer virtual machines according to the needs of the scenario being simulated.
Finally, after the implementation of each approach, we evaluate the performance of each of them through a comparison of time and cost, using two examples of real scenarios.
Official Doctoral Programme in Computer Science and Technology. President: José Daniel García Sánchez. Secretary: Antonio García Dopico. Member: Juan Carlos Díaz Martí
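The single-node paradigm described in this abstract (enumerate the solution space, evaluate each candidate in parallel with one thread per CPU, keep the best) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the parameter names, their values and the `simulate` cost function are hypothetical placeholders, not the thesis's simulator. Note also that in CPython a thread pool does not speed up CPU-bound work; a process pool or MPI would be needed for that.

```python
import itertools
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical design parameters; names and values are illustrative only.
PARAMETER_GRID = {
    "mast_spacing_m": [40, 50, 60],
    "wire_height_m": [5.0, 5.3, 5.5],
}


def candidate_designs(grid):
    """Enumerate every combination of parameter values (the solution space)."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def simulate(design):
    """Placeholder cost model standing in for a real simulation run."""
    cost = abs(design["mast_spacing_m"] - 50) + abs(design["wire_height_m"] - 5.3)
    return cost, design


def explore(grid, workers=None):
    """Evaluate all candidates using one worker thread per available CPU."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(simulate, candidate_designs(grid)))
    # Keep the candidate with the lowest cost.
    return min(results, key=lambda result: result[0])


best_cost, best_design = explore(PARAMETER_GRID)
```

A cluster deployment would keep `candidate_designs` and `simulate` unchanged and distribute the candidates across nodes instead of threads.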

    Workflow models for heterogeneous distributed systems

    The role of data in modern scientific workflows is becoming more and more crucial. The unprecedented amount of data available in the digital era, combined with recent advancements in Machine Learning and High-Performance Computing (HPC), lets computers surpass human performance in a wide range of fields, such as Computer Vision, Natural Language Processing and Bioinformatics. However, a solid data management strategy becomes crucial for key aspects like performance optimisation, privacy preservation and security. Most modern programming paradigms for Big Data analysis adhere to the principle of data locality: moving computation closer to the data to remove transfer-related overheads and risks. Still, there are scenarios in which it is worthwhile, or even unavoidable, to transfer data between different steps of a complex workflow. The contribution of this dissertation is twofold. First, it defines a novel methodology for distributed modular applications, allowing topology-aware scheduling and data management while separating business logic, data dependencies, parallel patterns and execution environments. In addition, it introduces computational notebooks as a high-level and user-friendly interface to this new kind of workflow, aiming to flatten the learning curve and improve the adoption of such methodology. Each of these contributions is accompanied by a full-fledged, Open Source implementation, which has been used for evaluation purposes and allows the interested reader to experience the related methodology first-hand. The validity of the proposed approaches has been demonstrated on a total of five real scientific applications in the domains of Deep Learning, Bioinformatics and Molecular Dynamics Simulation, executing them on large-scale hybrid cloud-HPC infrastructures.

    EXPRESS: Resource-oriented and RESTful Semantic Web services

    This thesis investigates an approach that simplifies the development of Semantic Web services (SWS) by removing the need for additional semantic descriptions. The most actively researched approaches to Semantic Web services introduce explicit semantic descriptions of services that are in addition to the existing semantic descriptions of the service domains. This increases their complexity and design overhead. The need for semantically describing the services in such approaches stems from their foundations in service-oriented computing, i.e. the extension of already existing service descriptions. This thesis demonstrates that adopting a resource-oriented approach based on REST will, in contrast to service-oriented approaches, eliminate the need for explicit semantic service descriptions and service vocabularies. This reduces the development efforts while retaining the significant functional capabilities. The approach proposed in this thesis, called EXPRESS (Expressing RESTful Semantic Services), utilises the similarities between REST and the Semantic Web, such as resource realisation, self-describing representations, and uniform interfaces. The semantics of a service is elicited from a resource’s semantic description in the domain ontology and the semantics of the uniform interface, hence eliminating the need for additional semantic descriptions. Moreover, stub-generation is a by-product of the mapping between entities in the domain ontology and resources. EXPRESS was developed to test the feasibility of eliminating explicit service descriptions and service vocabularies or ontologies, to explore the restrictions placed on domain ontologies as a result, to investigate the impact on the semantic quality of the description, and to explore the benefits and costs to developers. To achieve this, an online demonstrator that allows users to generate stubs has been developed.
In addition, a matchmaking experiment was conducted to show that the descriptions of the services are comparable to OWL-S in terms of their ability to be discovered, while improving the efficiency of discovery. Finally, an expert review was undertaken which provided evidence of EXPRESS’s simplicity and practicality when developing SWS from scratch.

    Factors that Impact the Cloud Portability of Legacy Web Applications

    Technological dependency on the products or services provided by a particular cloud platform or provider (i.e. cloud vendor lock-in) leaves cloud users unprotected against service failures and providers going out of business, and unable to modernise their software applications by exploiting new technologies and cheaper services from alternative clouds. High portability is key to ensuring a smooth migration of software applications between clouds, reducing the risk of vendor lock-in. This research identifies and models key factors that impact the portability of legacy web applications in cloud computing. Unlike existing cloud portability studies, we use a combination of techniques from empirical software engineering, software quality and areas related to cloud, including service-oriented computing and distributed systems, to carry out a rigorous experimental study of four factors impacting cloud application portability. In addition, we exploit established methods for software effort prediction to build regression models for predicting the effort required to increase cloud application portability. Our results show that software coupling, authentication technology, cloud platform and service are statistically significant and scientifically relevant factors for cloud application portability in the experiments undertaken. Furthermore, the experimental data enabled the development of fair (mean magnitude of relative error, MMRE, between 0.493 and 0.875), good (MMRE between 0.386 and 0.493) and excellent (MMRE not exceeding 0.368) regression models for predicting the effort of increasing the portability of legacy cloud applications. By providing empirical evidence of factors that impact cloud application portability and building effort prediction models, our research contributes to improving decision making when migrating legacy applications between clouds, and to mitigating the risks associated with cloud vendor lock-in.
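For readers unfamiliar with the metric, the mean magnitude of relative error used to grade the regression models is the average of |actual − predicted| / actual over all observations. The sketch below uses the standard definition; the function name and the sample effort values are illustrative assumptions, not data from the study.

```python
def mmre(actual, predicted):
    """Mean magnitude of relative error: mean of |a - p| / |a| over all pairs."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be non-zero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return total / len(actual)


# Hypothetical effort values (e.g. person-hours), for illustration only.
example = mmre([100, 200], [90, 250])  # (0.10 + 0.25) / 2 = 0.175
```

Lower values indicate better predictions, which is why the abstract reserves the lowest MMRE range for its best models.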

    An Overlay Architecture for Personalized Object Access and Sharing in a Peer-to-Peer Environment

    Due to its exponential growth and decentralized nature, the Internet has evolved into a chaotic repository, making it difficult for users to discover and access resources of interest to them. As a result, users have to deal with the problem of information overload. The Semantic Web's emergence provides Internet users with the ability to associate explicit, self-described semantics with resources. This ability will in turn facilitate the development of ontology-based resource discovery tools to help users retrieve information in an efficient manner. However, it is widely believed that the Semantic Web of the future will be a complex web of smaller ontologies, mostly created by various groups of web users who share a similar interest, referred to as a Community of Interest. This thesis proposes a solution to the information overload problem using a user-driven framework, referred to as a Personalized Web, that allows individual users to organize themselves into Communities of Interest based on ontologies agreed upon by all community members. Within this framework, users can define and augment their personalized views of the Internet by associating specific properties and attributes with resources and defining constraint-functions and rules that govern the interpretation of the semantics associated with the resources. Such views can then be used to capture the user's interests and can be integrated into a user-defined Personalized Web. As a proof of concept, a Personalized Web architecture that employs ontology-based semantics and a structured Peer-to-Peer overlay network to provide a foundation of semantically-based resource indexing and advertising is developed.
In order to investigate mechanisms that support the resource advertising and retrieval of the Personalized Web architecture, three agent-driven advertising and retrieval schemes, the Aggressive scheme, the Crawler-based scheme, and the Minimum-Cover-Rule scheme, were implemented and evaluated in both stable and churn environments. In addition to the development of a Personalized Web architecture that deals with typical web resources, this thesis used a case study to explore the potential of the Personalized Web architecture to support future web service workflow applications. The results of this investigation demonstrated that the architecture can support the automation of service discovery, negotiation, and invocation, allowing service consumers to actualize a personalized web service workflow. Further investigation will be required to improve the performance of the automation and allow it to be performed in a secure and robust manner. In order to support the next-generation Internet, further exploration will be needed for the development of a Personalized Web that includes ubiquitous and pervasive resources.