
    Fast Lean Erasure-Coded Atomic Memory Object

    In this work, we propose FLECKS, an algorithm that implements atomic memory objects in a multi-writer multi-reader (MWMR) setting over asynchronous networks with server failures. FLECKS substantially reduces storage and communication costs over its replication-based counterparts by employing erasure codes. FLECKS outperforms previously proposed algorithms on the metrics that determine good performance, such as storage cost per object, communication cost, fault tolerance of clients and servers, guaranteed liveness of operations, and the number of communication rounds per operation. We provide proofs of the liveness and atomicity properties of FLECKS and derive worst-case latency bounds for its operations. We implemented and deployed FLECKS in cloud-based clusters and demonstrate that it has substantially lower storage and bandwidth costs, and significantly lower operation latency, than replication-based mechanisms.
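    To make the storage argument concrete, the following minimal Python sketch (not the FLECKS protocol itself; the (n, k) parameters and object size are illustrative assumptions) compares the per-object storage cost of n-way replication with that of an (n, k) maximum-distance-separable erasure code, where each of the n servers stores one coded fragment of size value_size / k and any k fragments suffice to reconstruct the object.

    # Minimal sketch (not the FLECKS protocol itself): compare per-object storage
    # cost of n-way replication against an (n, k) erasure code. Parameters below
    # are illustrative assumptions, not values from the paper.

    def replication_storage_cost(value_size: int, n: int) -> int:
        """Each of the n servers stores a full copy of the object."""
        return n * value_size

    def erasure_coded_storage_cost(value_size: int, n: int, k: int) -> float:
        """Each of the n servers stores a fragment of size value_size / k;
        any k fragments reconstruct the object (MDS assumption)."""
        return n * (value_size / k)

    if __name__ == "__main__":
        value_size = 1_000_000   # 1 MB object (illustrative)
        n, k = 5, 3              # 5 servers; erasure code tolerates n - k = 2 crashes
        print("replication :", replication_storage_cost(value_size, n), "bytes")
        print("erasure code:", erasure_coded_storage_cost(value_size, n, k), "bytes")

    With these illustrative numbers, replication stores 5 MB per object while the erasure code stores about 1.67 MB, which is the kind of reduction the abstract refers to.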

    Rejuvenation and the Age of Information

    International audience

    Reliable massively parallel symbolic computing : fault tolerance for a distributed Haskell

    As the number of cores in manycore systems grows exponentially, the number of failures is also predicted to grow exponentially. Hence massively parallel computations must be able to tolerate faults. Moreover, new approaches to language design and system architecture are needed to address the resilience of massively parallel heterogeneous architectures. Symbolic computation has underpinned key advances in Mathematics and Computer Science, for example in number theory, cryptography, and coding theory. Computer algebra software systems facilitate symbolic mathematics. Developing these at scale has its own distinctive set of challenges, as symbolic algorithms tend to employ complex irregular data and control structures. SymGridParII is a middleware for parallel symbolic computing on massively parallel High Performance Computing platforms. A key element of SymGridParII is a domain-specific language (DSL) called Haskell Distributed Parallel Haskell (HdpH). It is explicitly designed for scalable distributed-memory parallelism, and employs work stealing to load-balance dynamically generated irregular task sizes. To investigate providing scalable fault-tolerant symbolic computation, we design, implement and evaluate a reliable version of HdpH, HdpH-RS. Its reliable scheduler detects and handles faults, using task replication as a key recovery strategy. The scheduler supports load balancing with a fault-tolerant work stealing protocol. The reliable scheduler is invoked with two fault tolerance primitives for implicit and explicit work placement, and 10 fault-tolerant parallel skeletons that encapsulate common parallel programming patterns. The user is oblivious to many failures; they are instead handled by the scheduler. An operational semantics describes small-step reductions on states. A simple abstract machine for scheduling transitions and task evaluation is presented. It defines the semantics of supervised futures, and the transition rules for recovering tasks in the presence of failure. The transition rules are demonstrated with a fault-free execution, and three executions that recover from faults. The fault-tolerant work stealing protocol has been abstracted into a Promela model. The SPIN model checker is used to exhaustively search the states of this automaton to validate a key resiliency property of the protocol. It asserts that an initially empty supervised future on the supervisor node will eventually be full in the presence of all possible combinations of failures. The performance of HdpH-RS is measured using five benchmarks. Supervised scheduling achieves a speedup of 757 with explicit task placement and 340 with lazy work stealing when executing Summatory Liouville on up to 1400 cores of an HPC architecture. Moreover, supervision overheads are consistently low, scaling up to 1400 cores. Low recovery overheads are observed in the presence of frequent failure when lazy on-demand work stealing is used. A Chaos Monkey mechanism has been developed for stress-testing resiliency with random failure combinations. All unit tests pass in the presence of random failure, terminating with the expected results.
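    The following minimal Python sketch illustrates the supervised-future idea described above: a supervisor re-submits (replicates) a task when an execution fails, so the future is eventually filled despite failures. It is an illustration of the recovery strategy only, not the HdpH-RS API; the names supervised_submit, WorkerFailure, and flaky_sum are hypothetical, and a local thread pool stands in for distributed nodes.

    # Sketch of a supervised future with task replication on failure.
    # Not the HdpH-RS API; names here are hypothetical.

    from concurrent.futures import ThreadPoolExecutor, Future

    class WorkerFailure(Exception):
        """Stands in for a lost node / failed task execution."""

    def supervised_submit(pool: ThreadPoolExecutor, task, *args, max_replicas: int = 3) -> Future:
        """Return a future filled by re-running `task` until it succeeds
        or the replication budget is exhausted."""
        result: Future = Future()

        def run(attempt: int) -> None:
            try:
                result.set_result(task(*args))
            except WorkerFailure as err:
                if attempt + 1 < max_replicas:
                    # re-submit the task (stands in for replicating it on another node)
                    pool.submit(run, attempt + 1)
                else:
                    result.set_exception(err)

        pool.submit(run, 0)
        return result

    if __name__ == "__main__":
        import random

        def flaky_sum(xs):
            if random.random() < 0.5:        # simulate a node crash
                raise WorkerFailure("node lost")
            return sum(xs)

        with ThreadPoolExecutor(max_workers=4) as pool:
            fut = supervised_submit(pool, flaky_sum, range(10))
            print(fut.result())              # eventually 45, despite simulated failures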

    Cautiously Optimistic Program Analyses for Secure and Reliable Software

    Modern computer systems still have various security and reliability vulnerabilities. Well-known dynamic analysis solutions can mitigate them using runtime monitors that serve as lifeguards. But the additional work of enforcing these security and safety properties incurs exorbitant performance costs, and such tools are rarely used in practice. Our work addresses this problem by constructing a novel technique: Cautiously Optimistic Program Analysis (COPA). COPA is optimistic: it infers likely program invariants from dynamic observations, and assumes them in its static reasoning to precisely identify and elide wasteful runtime monitors. The resulting system is fast, but also ensures soundness by recovering to a conservatively optimized analysis on the rare occasions when a likely invariant fails at runtime. COPA is also cautious: by carefully restricting optimizations to only safe elisions, the recovery is greatly simplified. It avoids unbounded rollbacks upon recovery, thereby enabling analysis for live production software. We demonstrate the effectiveness of Cautiously Optimistic Program Analyses in three areas. Information-Flow Tracking (IFT) can help prevent security breaches and information leaks, but it is rarely used in practice due to its high performance overhead (>500% for web/email servers). COPA dramatically reduces this cost by eliding wasteful IFT monitors to make it practical (9% overhead, 4x speedup). Automatic Garbage Collection (GC) in managed languages (e.g. Java) simplifies programming tasks while ensuring memory safety. However, there is no correct GC for weakly-typed languages (e.g. C/C++), and manual memory management is prone to errors that have been exploited in high-profile attacks. We develop the first sound GC for C/C++, and use COPA to optimize its performance (16% overhead). Sequential Consistency (SC) provides intuitive semantics to concurrent programs, which simplifies reasoning about their correctness. However, ensuring SC behavior on commodity hardware remains expensive. We use COPA to ensure SC for Java at the language level efficiently, and significantly reduce its cost (from 24% down to 5% on x86). COPA provides a way to realize strong software security, reliability and semantic guarantees at practical costs.
    PhD thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/170027/1/subarno_1.pd
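    The cautiously optimistic pattern can be illustrated with a small Python sketch (not the COPA implementation; the class and function names are hypothetical): an expensive runtime monitor is elided once dynamic observations suggest a likely invariant holds, while a cheap guard remains on the fast path so that a rare violation falls back safely to the fully monitored analysis rather than requiring a rollback.

    # Sketch of optimistic monitor elision with safe fallback (not the COPA code).

    class OptimisticMonitor:
        """Elides an expensive runtime check while a profiled invariant holds,
        and falls back to conservative checking once it fails."""

        def __init__(self, expensive_check, invariant, warmup: int = 1000):
            self.expensive_check = expensive_check  # full runtime monitor
            self.invariant = invariant              # cheap "likely invariant" guard
            self.warmup = warmup                    # observations needed before eliding
            self.observations = 0
            self.optimistic = False

        def check(self, value):
            if self.optimistic:
                if self.invariant(value):           # cheap guard on the fast path
                    return                          # monitor elided: no expensive work
                self.optimistic = False             # invariant violated: recover safely
            self.expensive_check(value)             # conservative, fully monitored path
            if self.invariant(value):
                self.observations += 1
                if self.observations >= self.warmup:
                    self.optimistic = True          # enough evidence: elide the monitor
            else:
                self.observations = 0

    if __name__ == "__main__":
        def full_taint_check(v):
            # stand-in for an expensive information-flow monitor
            assert not getattr(v, "tainted", False), "tainted value reached a sink"

        monitor = OptimisticMonitor(full_taint_check,
                                    invariant=lambda v: not getattr(v, "tainted", False),
                                    warmup=3)
        for v in [1, 2, 3, 4, 5]:
            monitor.check(v)
        print("fast path enabled:", monitor.optimistic)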

    Performance testing in critical systems: a new methodology and applications

    The new world is digital and is growing at an unprecedented pace. It is estimated that, up to 2003, humanity had created and digitally stored an amount of information equivalent to what is now generated every day. Today most large-scale processes, as well as public and personal data and content, are supported by software systems. Because of their growing importance and their reach across all sectors, these systems have become one of the most critical assets of organizations. To raise the reliability of these systems, organizations combine multiple practices, including high availability and performance of system components, standardized development processes refined over time, and software testing, among others. Testing in particular seeks an independent validation of the requirements that a component or system must satisfy, and it comes in several variants. Regarding the type of requirement, we highlight functional testing (whether actions produce the expected results), performance testing (whether the system supports the required load or data volume), and security testing (the ability to withstand failures or intentional attacks). The goal of this thesis is to introduce a methodology that serves as a framework for performance testing, and to present three complementary real-world applications in which its effectiveness is demonstrated. Performance testing is a cutting-edge, highly complex area that requires, among other things, the costly development of a platform to interact with the system under test. It is therefore common that, when prioritizing tests, organizations lean towards functional or even security aspects, many of which can be addressed without facing major technological difficulties. This is reasonable for a small company or an application with few users or little data to handle, but it is unacceptable for large organizations, which are precisely the ones that depend most on information technology. In this work we not only show how to apply the methodology to applications from different technological contexts, but also how the results of those tests help optimize system performance with minimal adjustments to the components. The case studies thus provide evidence that even systems built on world-class hardware and software components may not meet the minimum conditions to go into production, even after passing functional validation, and they also show that the solution does not necessarily require investments in infrastructure. The methodology presented here was co-developed by the author as a member of the Centro de Ensayos de Software (CES), from existing best practices combined and adjusted in light of the experience accumulated over more than ten years in real-world applications. It is organized into activities grouped in stages, whose purpose can be summarized as: identifying the transactions that represent the expected use of the system and the monitors needed to quantify its performance; implementing those transactions in a framework that automates the simultaneous execution of combinations of multiple instances; and running several test cycles in which problems are identified from the analysis of the available data, a diagnosis is sought, and the tests are repeated to explore solutions together with the system's experts. Over the last ten years, different versions of this methodology have been used in more than 20 organizations, some of which serve more than 3000 users, and configuration adjustments guided by it have enabled improvements in system response times of up to 1000%. We consider these results encouraging and are confident that they will be amplified by the growing use of complex distributed systems, particularly in the form of so-called cloud computing.
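    As an illustration of the step in which representative transactions are implemented in a framework that automates their concurrent execution, the following minimal Python sketch runs a stand-in transaction from many concurrent virtual users and records response times for later analysis. The transaction, user count, and percentile reporting are illustrative assumptions and are not part of the CES methodology itself.

    # Sketch of a minimal load-test harness: concurrent virtual users execute a
    # representative transaction and latencies are collected for analysis.

    import statistics
    import time
    from concurrent.futures import ThreadPoolExecutor

    def sample_transaction() -> float:
        """Stand-in for one representative system transaction; returns latency in seconds."""
        start = time.perf_counter()
        time.sleep(0.01)                      # replace with a real request to the system under test
        return time.perf_counter() - start

    def run_load_test(virtual_users: int, iterations: int) -> list[float]:
        """Run `iterations` transactions per virtual user concurrently and collect latencies."""
        with ThreadPoolExecutor(max_workers=virtual_users) as pool:
            futures = [pool.submit(sample_transaction)
                       for _ in range(virtual_users * iterations)]
            return [f.result() for f in futures]

    if __name__ == "__main__":
        latencies = run_load_test(virtual_users=20, iterations=10)
        latencies.sort()
        print(f"mean: {statistics.mean(latencies) * 1000:.1f} ms")
        print(f"p95 : {latencies[int(0.95 * len(latencies))] * 1000:.1f} ms")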