
    Analysis Of Aircraft Arrival Delay And Airport On-time Performance

    While existing grid environments cater to the specific needs of particular user communities, we need to go beyond them and consider general-purpose large-scale distributed systems: large collections of heterogeneous computers and communication systems shared by a large user population with very diverse requirements. Coordination, matchmaking, and resource allocation are among the essential functions of such systems. Although deterministic approaches to coordination, matchmaking, and resource allocation have been well studied, they are not suitable for large-scale distributed systems because of the scale, autonomy, and dynamics of these systems, so nondeterministic solutions must be sought. In this dissertation we describe our work on a coordination service, a matchmaking service, and a macro-economic resource allocation model for large-scale distributed systems. The coordination service coordinates the execution of complex tasks in a dynamic environment, the matchmaking service helps users find appropriate resources, and the macro-economic resource allocation model allows a broker to mediate between resource providers, who want to maximize their revenues, and resource consumers, who want to obtain the best resources at the lowest possible price, subject to global objectives such as maximizing the resource utilization of the system.
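    As a rough illustration of the broker-mediated allocation described above, the sketch below (Python, with hypothetical Offer and Request types, prices, and a greedy matching rule; not the dissertation's actual model) matches each consumer request to the cheapest provider offer that satisfies it, a crude proxy for maximizing utilization while respecting both sides' price constraints.

from dataclasses import dataclass

@dataclass
class Offer:
    provider: str
    capacity: int      # available CPU cores
    price: float       # price per core-hour

@dataclass
class Request:
    consumer: str
    cores: int
    max_price: float   # highest acceptable price per core-hour

def broker_allocate(offers, requests):
    """Greedy mediation: serve requests in order of willingness to pay,
    each from the cheapest provider that still has capacity and meets
    the consumer's price limit."""
    allocations = []
    for req in sorted(requests, key=lambda r: r.max_price, reverse=True):
        candidates = [o for o in offers
                      if o.capacity >= req.cores and o.price <= req.max_price]
        if not candidates:
            continue                      # request cannot be served
        best = min(candidates, key=lambda o: o.price)
        best.capacity -= req.cores
        allocations.append((req.consumer, best.provider, req.cores, best.price))
    return allocations

offers = [Offer("siteA", 64, 0.05), Offer("siteB", 32, 0.03)]
requests = [Request("u1", 16, 0.04), Request("u2", 48, 0.06)]
print(broker_allocate(offers, requests))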

    Operating policies for energy efficient large scale computing

    Energy costs now dominate IT infrastructure total cost of ownership, with datacentre operators predicted to spend more on energy than on hardware infrastructure over the next five years. With Western European datacentre power consumption estimated at 56 TWh/year in 2007 and projected to double by 2020, improving the energy efficiency of IT operations is imperative. The issue is further compounded by social and political factors and by strict environmental legislation governing organisations. One example of such large IT systems is high-throughput cycle-stealing distributed systems such as HTCondor and BOINC, which allow organisations to leverage spare capacity on existing infrastructure to undertake valuable computation. As a consequence of increased scrutiny of the energy impact of these systems, aggressive power management policies are often employed to reduce the energy impact of institutional clusters, but in doing so these policies severely restrict the computational resources available to high-throughput systems. These policies are often configured to quickly transition servers and end-user cluster machines into low-power states after only short idle periods, further compounding the issue of reliability. In this thesis, we evaluate operating policies for energy efficiency in large-scale computing environments by means of trace-driven discrete-event simulation, leveraging real-world workload traces collected within Newcastle University. The major contributions of this thesis are as follows: i) evaluation of novel energy-efficient management policies for a decentralised peer-to-peer (P2P) BitTorrent environment; ii) a novel simulation environment for evaluating the energy efficiency of large-scale high-throughput computing systems, together with a generalisable model of energy consumption in high-throughput computing systems; iii) proposal and evaluation of resource allocation strategies for energy consumption in high-throughput computing systems on a real workload; iv) proposal and evaluation, on a real workload, of mechanisms to reduce wasted task execution within high-throughput computing systems and thereby reduce energy consumption; and v) evaluation of the impact of fault-tolerance mechanisms on energy consumption.
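    The sketch below illustrates, under assumed power figures and a made-up trace format (not the thesis's simulator or the Newcastle traces), how a trace-driven model can estimate the energy a single machine draws under an idle-timeout power-management policy of the kind discussed above.

BUSY_W, IDLE_W, SLEEP_W = 120.0, 60.0, 4.0   # assumed power draws in watts
IDLE_TIMEOUT = 300.0                          # seconds idle before sleeping

def energy_joules(busy_intervals, horizon):
    """Replay a trace of (start, end) busy intervals, in seconds, and
    integrate power over [0, horizon]."""
    energy, cursor = 0.0, 0.0
    for start, end in sorted(busy_intervals):
        gap = start - cursor
        if gap > 0:
            idle = min(gap, IDLE_TIMEOUT)      # machine idles, then sleeps
            energy += idle * IDLE_W + (gap - idle) * SLEEP_W
        energy += (end - start) * BUSY_W       # task execution
        cursor = end
    tail = horizon - cursor
    if tail > 0:
        idle = min(tail, IDLE_TIMEOUT)
        energy += idle * IDLE_W + (tail - idle) * SLEEP_W
    return energy

trace = [(0, 600), (1200, 1800)]               # two 10-minute tasks
print(energy_joules(trace, horizon=3600) / 3.6e6, "kWh")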

    Supporting SLA Provisioning in Grids by Risk Management Processes

    Grid technologies have reached a high level of development; however, core shortcomings relating to security, trust, and dependability of the Grid have been identified, which reduce its appeal to potential commercial adopters. Users require job execution with a desired priority and quality. In order to stipulate such requirements, Service Level Agreements (SLAs) can be negotiated. These are a powerful instrument for specifying in detail the business relationships between service providers and service users. However, providers are aware of various threats of SLA violations and are reluctant to adopt a mechanism which requires them to meet strict requirements and to guarantee associated quality constraints. If strict guarantees cannot be agreed by contract, many users prefer to operate their own resources instead of using the Grid. This is more expensive, but they retain control over their applications, which removes the issue of trust and ensures dependability concerning their successful completion. To establish a commercial Grid environment, it is therefore essential that Grid providers are prepared to accept SLAs with associated guarantees. In order to accept such SLAs, providers need estimates of the likelihood that they will be unable to fulfil an SLA, i.e. Risk Assessment. Furthermore, the resource management should take such estimates into account when allocating resources or initiating fault-tolerance mechanisms, i.e. Risk Management. This work integrates risk awareness into the provider's processes involved in SLA provisioning. During SLA negotiation, providers evaluate which resources can be used for service provisioning and estimate the Probability of Failure (PoF) of those resources and of fulfilling the SLA as a whole. If the estimated PoF is too high, the provider may be able to reduce it sufficiently, by applying risk reduction mechanisms, to accept the SLA. The estimated PoF is also considered by the service provider and service consumer when determining the price and the contractual penalty: compared to a request for a relatively low quality of service, a more reliable service commands a higher price since more guarantees have to be ensured, and the consumer might also define a higher contractual penalty. The PoF is thus an additional decision-making element in SLA negotiation, since it enables end users to compare different SLA offers by an objective measure. Once providers have accepted an SLA, they have to be able to compensate for resource failures in order to prevent SLA violations. The use of fault-tolerance mechanisms combined with risk awareness supports Grid providers in this task. The Risk Management processes are interlaced with the resource management and are thereby transparent to Grid service consumers. An important aspect of the Risk Management developed for the Grid is its self-organising mechanisms, which initiate a fault-tolerance action, or a chain of them, in order to manage resource instabilities or outages. Decisions are made on the basis of financial considerations, such as the profit margin, the cost of performing fault tolerance, and the expected profit of executing a job; taking such financial factors into account is of high importance for commercial Grid providers. In conclusion, the integration of Risk Management into the processes of Grid providers is the initial step towards a risk-aware Grid. It will increase transparency, reliability, and trust, and provides an objective basis for decision processes in resource management. Risk Management is integrated into both the SLA negotiation and the post-negotiation phases, and thereby improves the SLA provisioning process in general.
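    A minimal sketch of the kind of risk-aware admission decision described above (hypothetical numbers and helper names, not the thesis's algorithms): the provider aggregates resource PoFs, optionally applies replication as a risk-reduction mechanism, and compares the expected profit, with the contractual penalty weighted by the PoF, against the cost of provisioning.

def sla_pof(resource_pofs, replicas=1):
    """PoF of the SLA if the job is replicated on `replicas` independent
    resources drawn from the least failure-prone candidates (a simple
    risk-reduction mechanism)."""
    pof = 1.0
    for p in sorted(resource_pofs)[:replicas]:
        pof *= p          # all replicas must fail for the SLA to be violated
    return pof

def expected_profit(price, penalty, cost, pof):
    return (1 - pof) * price - pof * penalty - cost

resource_pofs = [0.08, 0.05, 0.12]
for replicas, cost in [(1, 10.0), (2, 18.0)]:        # replication raises cost
    pof = sla_pof(resource_pofs, replicas)
    profit = expected_profit(price=30.0, penalty=50.0, cost=cost, pof=pof)
    print(replicas, round(pof, 4), round(profit, 2))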

    Recovery-oriented software architecture for grid applications (ROSA-Grids)

    Grids are distributed systems that dynamically coordinate a large number of heterogeneous resources to execute large-scale projects. Examples of grid resources include high-performance computers, massive data stores, high-bandwidth networks, telescopes, and synchrotrons. Failure in grids is arguably inevitable due to the massive scale and heterogeneity of grid resources, the distribution of these resources over unreliable networks, the complexity of the mechanisms needed to integrate such resources into a seamless utility, and the dynamic nature of the grid infrastructure, which allows continuous change. To make matters worse, grid applications are generally long-running, and these runs repeatedly require the coordinated use of many resources at the same time. In this thesis, we propose the Recovery-Aware Components (RAC) approach. The RAC approach enables a grid application to handle failure reactively and proactively at the level of the smallest independent execution unit of the application, and it combines runtime prediction with a proactive fault-tolerance strategy. It aims to improve the reliability of the grid application with the least overhead possible. Moreover, to give a grid fault-tolerance manager fine-tuned control over the trade-off between reliability gained and overhead paid, this thesis offers architecture-aware modelling and simulation of reliability and overhead. The thesis demonstrates, for a few of the dozen or so classes of application architecture already identified in prior research, that the typical architectural structure of a class can be captured in a few parameters, and it shows that these parameters suffice to achieve significant insight into, and control of, such trade-offs. The contributions of our research project are as follows. We defined the RAC approach. We showed how the RAC approach improves the reliability of MapReduce and Combinational Logic grid applications. We provided Markov models that represent the execution behaviour of these applications for reliability and overhead analyses. We analysed the sensitivity of the reliability-overhead trade-off of the RAC approach to the type of fault-tolerance strategy, the parameters of that strategy, the prediction interval, and the predictor's accuracy. The final contribution of our research is an experimental testbed that enables a grid fault-tolerance expert to evaluate diverse fault-tolerance support configurations and then choose the one that satisfies the reliability and cost requirements.
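    As an illustration of the reliability-overhead trade-off analysed with Markov models above, the sketch below (hypothetical parameters, not the thesis's RAC models) treats a single execution unit with a bounded retry strategy as an absorbing Markov chain and reports the reliability gained and the expected runtime paid as the retry budget grows.

def retry_chain(p_fail, runtime, max_retries):
    """Return (probability of eventual success, expected total runtime),
    assuming a failed attempt still consumes the full runtime."""
    reliability, expected_time, reach = 0.0, 0.0, 1.0
    for attempt in range(max_retries + 1):
        # `reach` is the probability of ever entering this attempt
        expected_time += reach * runtime
        reliability += reach * (1 - p_fail)
        reach *= p_fail                     # transition to the next retry state
    return reliability, expected_time

for retries in range(4):
    r, t = retry_chain(p_fail=0.2, runtime=60.0, max_retries=retries)
    print(retries, round(r, 4), round(t, 1))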

    Dependability-driven Strategies to Improve the Design and Verification of Safety-Critical HDL-based Embedded Systems

    Embedded systems are steadily extending their application areas and must deal with increasing requirements in performance, power consumption, and area (PPA). Whenever embedded systems are used in safety-critical applications, they must also meet rigorous dependability requirements to guarantee their correct operation during an extended period of time. Meeting these requirements is especially challenging for systems based on Field Programmable Gate Arrays (FPGAs), since these devices are very susceptible to Single Event Upsets; this leads to increased dependability threats, especially in harsh environments. Dependability should therefore be considered one of the primary criteria for decision making throughout the whole design flow, complemented by several dependability-driven processes. First, dependability assessment quantifies the robustness of hardware designs against faults and identifies their weak points. Second, dependability-driven verification ensures the correctness and efficiency of fault-mitigation mechanisms. Third, dependability benchmarking allows designers to select, from a dependability perspective, the most suitable IP cores, implementation technologies, and electronic design automation (EDA) tools. Finally, dependability-aware design space exploration (DSE) allows the selected IP cores and EDA tools to be configured optimally, improving as much as possible the dependability and PPA features of the resulting implementations. The aforementioned processes rely on fault-injection testing to quantify the robustness of the designed systems. Although a wide variety of fault-injection solutions exists today, several important problems still have to be addressed to better cover the needs of a dependability-driven design flow. In particular, simulation-based fault injection (SBFI) should be adapted to implementation-level HDL models to take into account the architecture of diverse logic primitives, while keeping the injection procedures generic and minimally intrusive. Likewise, the granularity of FPGA-based fault injection (FFI) should be refined to enable the accurate identification of weak points in FPGA-based designs. Another important challenge that dependability-driven processes face in practice is the reduction of SBFI and FFI experimental effort: the high complexity of modern designs pushes the experimental effort beyond the available time budget even in simple dependability assessment scenarios, and it becomes prohibitive in the presence of alternative design configurations. Finally, dependability-driven processes lack instrumental support covering the semicustom design flow in all its variety of description languages, implementation technologies, and EDA tools; existing fault-injection tools only partially cover individual stages of the design flow, being usually specific to a particular design representation level and implementation technology. This work addresses the aforementioned challenges by efficiently integrating dependability-driven processes into the design flow. First, it proposes new SBFI and FFI approaches that enable an accurate and detailed dependability assessment at different levels of the design flow. Second, it improves the performance of dependability-driven processes by defining new techniques for accelerating SBFI and FFI experiments. Third, it defines two DSE strategies that enable the optimal dependability-aware tuning of IP cores and EDA tools while reducing the robustness evaluation effort as much as possible. Fourth, it proposes a new toolkit (DAVOS) that automates and seamlessly integrates the aforementioned dependability-driven processes into the semicustom design flow. Finally, it illustrates the usefulness and efficiency of these proposals through a case study in which three soft-core embedded processors are implemented on a Xilinx 7-series SoC FPGA.
    Tuzov, I. (2020). Dependability-driven Strategies to Improve the Design and Verification of Safety-Critical HDL-based Embedded Systems [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/159883
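    To make the fault-injection idea concrete, the sketch below (a toy register model with hypothetical names, not DAVOS or the proposed SBFI/FFI methods) runs a small simulation-based fault-injection campaign: each run flips one randomly chosen state bit, and the outcome is classified against a golden run as either masked or an observable failure.

import random

def golden_run(inputs):
    """Reference model: an 8-bit accumulator whose low nibble is the output."""
    acc = 0
    for x in inputs:
        acc = (acc + x) & 0xFF
    return acc & 0x0F

def faulty_run(inputs, inject_at, bit):
    acc = 0
    for i, x in enumerate(inputs):
        acc = (acc + x) & 0xFF
        if i == inject_at:
            acc ^= 1 << bit            # single-event upset in the register
    return acc & 0x0F

def campaign(inputs, runs=1000, seed=0):
    rng = random.Random(seed)
    reference = golden_run(inputs)
    masked = failures = 0
    for _ in range(runs):
        out = faulty_run(inputs, rng.randrange(len(inputs)), rng.randrange(8))
        if out == reference:
            masked += 1                # fault had no observable effect
        else:
            failures += 1              # observable failure (candidate weak point)
    return masked, failures

print(campaign([3, 7, 11, 200, 42]))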