73 research outputs found

    Algorithms in fault-tolerant Clos networks


    Parallel and Distributed Simulation from Many Cores to the Public Cloud (Extended Version)

    In this tutorial paper, we first review some basic simulation concepts and then introduce parallel and distributed simulation techniques in view of the new challenges of today and tomorrow. In particular, recent years have seen a wide diffusion of many-core architectures, and we can expect this trend to continue. At the same time, the success of cloud computing is strongly promoting the everything-as-a-service paradigm. Is parallel and distributed simulation ready for these new challenges? Current approaches present many limitations in terms of usability and adaptivity: there is a strong need for new evaluation metrics and for revising the currently implemented mechanisms. In the last part of the paper, we propose a new approach based on multi-agent systems for the simulation of complex systems. Advanced techniques such as the migration of simulated entities make it possible to build mechanisms that are both adaptive and very easy to use. Adaptive mechanisms can significantly reduce the communication cost in parallel/distributed architectures, implement load-balancing techniques, and cope with execution environments that are both variable and dynamic. Finally, such mechanisms can be used to build simulations on top of unreliable cloud services.
    Comment: Tutorial paper published in the Proceedings of the International Conference on High Performance Computing and Simulation (HPCS 2011), Istanbul (Turkey), IEEE, July 2011. ISBN 978-1-61284-382-
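
    The migration mechanism sketched in this abstract can be illustrated in a few lines. The following is a minimal sketch, not the paper's implementation: the entity model, LP labels, and threshold are hypothetical, and a real parallel/distributed simulation runtime would serialize and transfer the entity's state rather than relabel a field.

```python
from collections import Counter

class SimulatedEntity:
    """A simulated entity that tracks where its outgoing messages go (hypothetical model)."""
    def __init__(self, entity_id, home_lp):
        self.entity_id = entity_id
        self.home_lp = home_lp        # logical process (LP) currently hosting the entity
        self.msg_targets = Counter()  # destination LP -> messages sent in this window

    def record_message(self, dest_lp):
        self.msg_targets[dest_lp] += 1

def maybe_migrate(entity, threshold=0.6):
    """Move the entity to the LP that receives most of its traffic.

    If more than `threshold` of the messages in the current observation
    window went to one remote LP, hosting the entity there converts costly
    inter-LP messages into cheap local ones -- the adaptive reduction of
    communication cost that the abstract describes.
    """
    total = sum(entity.msg_targets.values())
    if total:
        busiest_lp, count = entity.msg_targets.most_common(1)[0]
        if busiest_lp != entity.home_lp and count / total > threshold:
            entity.home_lp = busiest_lp  # a real runtime would checkpoint and transfer state here
    entity.msg_targets.clear()           # start a fresh observation window
    return entity.home_lp
```

    Run periodically over all entities, a rule of this kind yields traffic-adaptive placement without any tuning by the simulation user, which is the kind of usability the authors advocate.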

    Heterogeneity aware fault tolerance for extreme scale computing

    Upcoming extreme scale, or exascale, computing systems are expected to deliver a peak performance of at least 10^18 floating point operations per second (FLOPS), primarily through a significant expansion in scale. A major concern for such large-scale systems, however, is how to deal with failures, because the impact of failures on system efficiency under existing fault tolerance techniques generally also grows with scale. Hence, current research in this area has been directed at optimizing various aspects of fault tolerance techniques to reduce their overhead at scale. One characteristic that has been overlooked so far, however, is heterogeneity, specifically in the rate at which individual components of the underlying system fail and in the execution profile of a parallel application running on such a system. In this thesis, we investigate the implications of these types of heterogeneity for fault tolerance in large-scale high performance computing (HPC) systems. To that end, we 1) study how knowledge of heterogeneity in system failure likelihoods can be used to make current fault tolerance schemes more efficient, 2) assess the feasibility of exploiting application imbalance for improved fault tolerance at scale, and 3) propose and evaluate changes to system-level resource managers to achieve reliable job placement over resources with unequal failure likelihoods. Taken together, the results in this thesis demonstrate that heterogeneity in failure likelihoods significantly changes the landscape of fault tolerance for large-scale HPC systems.
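
    One way to see why per-component failure rates matter is through the standard Young/Daly checkpoint-interval formula, sqrt(2 * C * MTBF): nodes with different mean times between failures call for different checkpoint frequencies. The sketch below is illustrative only; the node names, MTBFs, and checkpoint cost are assumptions, not values or schemes from the thesis.

```python
import math

def daly_interval(checkpoint_cost_s, mtbf_s):
    """First-order optimal checkpoint interval (Young/Daly): sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Hypothetical per-node MTBFs: with heterogeneous failure likelihoods, a single
# system-wide interval over-checkpoints reliable nodes and under-protects flaky ones.
node_mtbf_s = {"node-a": 30 * 86400, "node-b": 5 * 86400}  # 30 days vs. 5 days
C = 60.0  # assumed cost of writing one checkpoint, in seconds

for node, mtbf in node_mtbf_s.items():
    print(f"{node}: checkpoint every {daly_interval(C, mtbf) / 3600:.1f} h")
```

    Here the reliable node would checkpoint roughly every 4.9 hours and the failure-prone one every 2.0 hours, which is the efficiency gap a heterogeneity-aware scheme can exploit.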

    Optical Wireless Data Center Networks

    Bandwidth- and computation-intensive Big Data applications in disciplines like social media, bio- and nano-informatics, the Internet-of-Things (IoT), and real-time analytics are pushing existing access and core (backbone) networks, as well as Data Center Networks (DCNs), to their limits. Next-generation DCNs must support continuously increasing network traffic while satisfying minimum performance requirements for latency, reliability, flexibility, and scalability. Meeting these requirements with conventional wired DCNs may demand a larger number of cables (i.e., copper cables and fiber optics). In addition to limiting the possible topologies, a large number of cables may result in design and development problems related to wire ducting and maintenance, heat dissipation, and power consumption. To address the cabling complexity of wired DCNs, we propose OWCells, a class of optical wireless cellular data center network architectures in which fixed line-of-sight (LOS) optical wireless communication (OWC) links connect racks arranged in regular polygonal topologies. We present the OWCell DCN architecture, develop its theoretical underpinnings, and investigate routing protocols and OWC transceiver design. To realize a fully wireless DCN, servers within racks must also be connected using OWC links. It is difficult, however, to connect multiple adjacent network components, such as servers in a rack, using point-to-point LOS links. To overcome this problem, we propose and validate the feasibility of an FSO-Bus that connects multiple adjacent network components using NLOS point-to-point OWC links. Finally, to complete the design of the OWC transceiver, we develop a new class of strictly and rearrangeably non-blocking multicast optical switches in which multicast is performed efficiently at the physical optical (lower) layer rather than at upper layers (e.g., the application layer). Advisors: Jitender S. Deogun and Dennis R. Alexander
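
    To make the polygonal arrangement concrete, the following sketch computes the pointing angle and span of a fixed LOS link between racks placed at the vertices of a regular polygon. This is illustrative geometry under assumed dimensions (cell radius, rack count), not the OWCell specification.

```python
import math

def rack_positions(n, radius=10.0):
    """Coordinates of n racks placed at the vertices of a regular n-gon (assumed layout)."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def los_link(a, b):
    """Pointing angle (degrees) and span (metres) for a fixed LOS transceiver pair a -> b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    return math.degrees(math.atan2(dy, dx)), math.hypot(dx, dy)

racks = rack_positions(6)          # a hypothetical hexagonal cell of six racks
for j in range(1, len(racks)):     # fixed links from rack 0 to every other rack
    angle, dist = los_link(racks[0], racks[j])
    print(f"rack0 -> rack{j}: aim {angle:6.1f} deg, span {dist:5.2f} m")
```

    Because both endpoints are fixed, each transceiver can be aimed once at installation time, which is what makes purely mechanical, steering-free LOS links attractive in this setting.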

    Software-implemented attack tolerance for critical information retrieval

    The fast-growing reliance of our daily lives upon online information services often demands an appropriate level of privacy protection as well as highly available service provision. However, most existing solutions have attempted to address these problems separately. This thesis investigates and presents a solution that provides both privacy protection and fault tolerance for online information retrieval. A new approach to Attack-Tolerant Information Retrieval (ATIR) is developed based on an extension of existing theoretical results for Private Information Retrieval (PIR). ATIR uses replicated services to protect a user's privacy and to ensure service availability. In particular, ATIR can tolerate any collusion of up to t servers for privacy violation and up to f faulty (either crashed or malicious) servers in a system with k replicated servers, provided that k ≥ t + f + 1, where t ≥ 1 and f ≤ t. In contrast to other related approaches, ATIR relies on neither enforced trust assumptions, such as the use of tamper-resistant hardware and trusted third parties, nor an increased number of replicated servers. While the best solution known so far requires k (≥ 3t + 1) replicated servers to cope with t malicious servers and any collusion of up to t servers with O(n^*) communication complexity, ATIR uses fewer servers with a much improved communication cost of O(n^(1/2)) (where n is the size of the database managed by a server). The majority of current PIR research remains at a theoretical level. This thesis provides both theoretical schemes and their practical implementations with good performance results. In a LAN environment, it takes well under half a second to use an ATIR service for calculations over data sets of up to 1MB. The performance of ATIR systems remains at the same level even in the presence of server crashes and malicious attacks. Both analytical results and experimental evaluation show that ATIR offers an attractive and practical solution for ever-increasing online information applications.
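
    For intuition about how replication alone can hide a query, consider the classic two-server PIR scheme of Chor et al., in which two non-colluding servers each hold a full copy of the database; this corresponds to k = 2, t = 1, f = 0 in the bound above. It is a textbook baseline sketched here for illustration, not the ATIR construction itself.

```python
import secrets
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def client_queries(n, index):
    """Two random-looking index subsets that differ only at the wanted index."""
    s1 = [secrets.randbits(1) for _ in range(n)]  # uniformly random subset for server 1
    s2 = s1.copy()
    s2[index] ^= 1                                # flip the target bit for server 2
    return s1, s2

def server_answer(db, subset):
    """XOR of the records the subset selects; reveals nothing about the target index."""
    picked = [rec for rec, bit in zip(db, subset) if bit]
    return reduce(xor_bytes, picked, bytes(len(db[0])))

db = [b"rec0", b"rec1", b"rec2", b"rec3"]         # toy database, replicated on both servers
q1, q2 = client_queries(len(db), index=2)
record = xor_bytes(server_answer(db, q1), server_answer(db, q2))
assert record == db[2]                            # XOR cancels everything but the wanted record
```

    Each server sees only a uniformly random subset of indices, so neither learns which record was requested; ATIR extends this replication idea so that the system also tolerates crashed and malicious servers.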