Failure Resilience and Traffic Engineering for Multi-Controller Software Defined Networking

Abstract

This thesis explores and proposes solutions to address the challenges faced by Multi-Controller SDN (MCSDN) systems when deploying TE optimisation on WANs. Despite the interest from the research community, existing MCSDN systems present limitations. For example, TE optimisation systems are computationally complex, have high consistency requirements, and need network-wide state to operate. Because of such requirements, MCSDN systems can encounter performance overheads and state consistency problems when implementing TE. Moreover, performance and consistency problems are more prominent when deploying the system on WANs as these network types have higher inter-device latency, delaying state propagation. Unlike existing literature, this thesis presents several design choices that address all four challenges affecting MCSDN systems (scalability, consistency, resilience, and coordination). We use the presented design choices to build Helix, a hierarchical MCSDN system. Helix provides better scalability, performance and failure resilience compared to existing MCSDN systems by sharing minimal state between controllers, offloading operations closer to the data plane and deploying lightweight tasks. A challenge that we faced when building Helix was that existing TE algorithms did not meet Helix's design choices. This thesis presents a new CSPF-based TE algorithm that needs minimal state to operate and supports offloading inter-area TE to local controllers, fulfilling Helix's requirements. Helix's TE algorithm provides better performance and forwarding stability, addressing 1.6x more congestion while performing up to 29x fewer path modifications than the other algorithms evaluated in our experiments. While MCSDN literature has explored evaluating different aspects of system performance, there is a lack of readily available tools and concrete testing methodologies. To this end, this thesis provides concrete testing methodologies and tools readily available to the MCSDN community to evaluate the data plane failure resilience, control plane failure resilience, and TE optimisation performance of MCSDN systems

    Similar works