5 research outputs found

    Correlated Set Coordination in Fault Tolerant Message Logging Protocols

    Based on our current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint-restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes, but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
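    The idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' protocol: the `Process` class, the `send` function, and the node names are hypothetical, and it assumes that a correlated set is simply "all processes on the same node" and that every delivery is recorded as an event. The payload is copied into the sender's log only when the message crosses a node boundary.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    rank: int
    node: str  # hypothetical node identifier; same node = same correlated set
    payload_log: list = field(default_factory=list)  # sender-side payload copies
    event_log: list = field(default_factory=list)    # receiver-side delivery order


def send(sender: Process, receiver: Process, payload: bytes) -> None:
    """Deliver a message, logging the payload only across correlated sets."""
    if sender.node != receiver.node:
        # Independent failure domains: keep a copy so the message can be replayed.
        sender.payload_log.append((receiver.rank, payload))
    # Assumption of this sketch: the delivery event is always recorded.
    receiver.event_log.append((sender.rank, len(receiver.event_log)))


p0 = Process(0, "nodeA")
p1 = Process(1, "nodeA")  # same node as p0: coordinated with it
p2 = Process(2, "nodeB")  # different node: independent of p0

send(p0, p1, b"intra-node")
send(p0, p2, b"inter-node")
print(p0.payload_log)  # only the inter-node message was copied
```

    In this sketch the intra-node message costs one event record and no payload copy, which is the saving the abstract attributes to coordinating correlated processes.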

    Asynchronous Snapshots of Actor Systems for Latency-Sensitive Applications

    The actor model is popular for many types of server applications. Efficient snapshotting of applications is crucial in the deployment of pre-initialized applications or moving running applications to different machines, e.g., for debugging purposes. A key issue is that snapshotting blocks all other operations. In modern latency-sensitive applications, stopping the application to persist its state needs to be avoided, because users may not tolerate the increased request latency. In order to minimize the impact of snapshotting on request latency, our approach persists the application's state asynchronously by capturing partial heaps, completing snapshots step by step. Additionally, our solution is transparent and supports arbitrary object graphs. We prototyped our snapshotting approach on top of the Truffle/Graal platform and evaluated it with the Savina benchmarks and the Acme Air microservice application. When performing a snapshot every thousand Acme Air requests, the number of slow requests (0.007% of all requests) with latency above 100ms increases by 5.43%. Our Savina microbenchmark results detail how different utilization patterns impact snapshotting cost. To the best of our knowledge, this is the first system that enables asynchronous snapshotting of actor applications, i.e., without stop-the-world synchronization, and thereby minimizes the impact on latency. We thus believe it enables new deployment and debugging options for actor systems.

    RADIC : un middleware de tolerancia a fallos que preserva el rendimiento

    Fault tolerance is a line of research that has gained significant importance as the computing power of today's supercomputers has grown. More processing power means more components, and more components mean more failures. Most current fault tolerance strategies are centralized and do not scale to a large number of processes, because all processes must synchronize to perform fault tolerance tasks. Maintaining performance in parallel applications is also crucial, both in the presence and in the absence of failures. This work therefore focuses on a decentralized fault-tolerant architecture (RADIC - Redundant Array of Distributed and Independent Controllers) that seeks to preserve the initial performance and to keep the overhead of reconfiguring the system after a failure as low as possible. The architecture has been implemented in Open MPI, one of the most widely used message passing libraries in the scientific world for running parallel programs. Initial tests show that the system introduces minimal overhead to perform fault tolerance tasks, and that performance is restored to its pre-failure level. MPI is by default a fail-stop standard, and the implementations that add some level of fault tolerance mostly use coordinated strategies. In RADIC, when a failure occurs, the failed process recovers on another node by rolling back to a previously saved state, created with uncoordinated checkpoints, and by replaying messages from the event log. During recovery, communications with the failed process must be delayed and redirected to the process's new location. Restoring processes on a node that already hosts processes overloads it and degrades performance, so this work proposes using spare nodes to host the recovered processes, which avoids overloading nodes that already have work. This work presents a design for managing recovery on spare nodes automatically and in a decentralized way in an Open MPI environment, together with an analysis of the design's impact on performance. Initial results show significant degradation when several failures occur during an execution and no spares are used, whereas with spares the initial configuration is restored and performance is maintained.
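    The placement decision described in this abstract can be sketched as follows. This is an illustration only, not RADIC's implementation: the function name, the `load` dictionary, and the fallback of restarting on the least-loaded surviving node are assumptions made for the example.

```python
def place_recovered_processes(load: dict, spares: list, failed_node: str) -> str:
    """Pick the node on which to restart the processes of failed_node.

    load maps node name -> number of processes; spares lists idle node names.
    Both are updated in place. Hypothetical helper, not part of Open MPI.
    """
    procs = load.pop(failed_node)
    if spares:
        # A spare is free: the original process-per-node layout is preserved.
        target = spares.pop(0)
        load[target] = procs
    else:
        # Assumed fallback: share a surviving node, which overloads it.
        target = min(load, key=load.get)
        load[target] += procs
    return target


load = {"n0": 4, "n1": 4, "n2": 4}
spares = ["s0"]

first = place_recovered_processes(load, spares, "n1")   # uses the spare
second = place_recovered_processes(load, spares, "n0")  # no spare left
print(first, second, load)
```

    After the first failure every node still runs four processes; after the second, one node runs eight, which is the kind of overload the abstract reports when no spares are available.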