Software-defined Networking (SDN) enables flexible network management, but as networks
evolve to a large number of end-points with diverse network policies, higher
speed, and higher utilization, abstraction of networks by SDN makes monitoring and
debugging network problems increasingly challenging. While some problems
impact packet processing in the data plane (e.g., congestion), others cause policy
deployment failures (e.g., hardware bugs); both create inconsistency between operator
intent and actual network behavior. Existing debugging tools are not sufficient to
accurately detect, localize, and understand the root cause of problems observed in
large-scale networks; either they lack in-network resources (compute, memory, and/or
network bandwidth) or take a long time to debug network problems.
This thesis presents three debugging tools: PathDump, SwitchPointer, and Scout,
and a technique for tracing packet trajectories called CherryPick. We call for a different
approach to network monitoring and debugging: in contrast to implementing
debugging functionality entirely in-network, we should carefully partition the debugging
tasks between end-hosts and network elements. Towards this direction, we present
CherryPick, PathDump, and SwitchPointer. The core idea of CherryPick is to cherry-pick the
links that are key to representing the end-to-end path of a packet, and to embed the picked
link IDs into the packet's header on its way to the destination.
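As a rough illustration of this idea, the following minimal sketch walks a packet along a path and embeds only the picked link IDs in its header. The `Packet` class, the `forward` function, and the `key_links` set are hypothetical names introduced here; how the key links are actually chosen depends on the topology and is simply assumed to be given.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    src: str
    dst: str
    # Link IDs embedded in the header (hypothetical header field).
    link_ids: list = field(default_factory=list)


def forward(packet, path_links, key_links):
    """Walk the packet along path_links, embedding only the picked link IDs.

    key_links is assumed to be given; the rule for choosing it is
    topology-specific and is not modeled here.
    """
    for link_id in path_links:
        if link_id in key_links:
            packet.link_ids.append(link_id)
    return packet


# Illustrative values only: five links on the path, two of them picked.
path = [11, 27, 42, 35, 18]
key_links = {27, 42}
pkt = forward(Packet("h1", "h2"), path, key_links)
print(pkt.link_ids)
```

In this toy model the header carries two link IDs instead of five; the destination would then reconstruct the full path from the picked links and its knowledge of the topology.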
PathDump is an end-host based network debugger built on packet trajectory tracing;
it exploits resources at the end-hosts to implement various monitoring and
debugging functionalities. PathDump currently runs over a real network comprising
only commodity hardware, and yet can support a surprisingly large class of network
debugging problems with minimal in-network functionality.
The key contribution of SwitchPointer is to efficiently provide network visibility
to end-host based network debuggers like PathDump by using switch memory as a
"directory service": each switch, rather than storing the telemetry data necessary for
debugging functionalities, stores pointers to the end-hosts where relevant telemetry data is
stored. The key design choice of thinking about memory as a directory service allows
us to solve performance problems that were hard or infeasible to solve with existing designs.
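The directory-service idea can be sketched as follows, as a simplified model rather than the actual SwitchPointer implementation. The class names, the record fields, and the `debug_switch` helper are assumptions made for illustration: the switch keeps only the names of the end-hosts it forwarded traffic to, while the telemetry records themselves live at those hosts.

```python
class EndHost:
    def __init__(self, name):
        self.name = name
        self.records = []  # telemetry data is stored at the end-host

    def record(self, flow, switch_id, nbytes):
        self.records.append({"flow": flow, "switch": switch_id, "bytes": nbytes})

    def query(self, switch_id):
        """Return the records of flows that traversed the given switch."""
        return [r for r in self.records if r["switch"] == switch_id]


class Switch:
    def __init__(self, switch_id):
        self.switch_id = switch_id
        self.pointers = set()  # only pointers (host names), no telemetry

    def forward(self, flow, dst_host, nbytes):
        # Remember where the telemetry for this traffic will be stored.
        self.pointers.add(dst_host.name)
        # Simulate delivery: the destination host records the telemetry.
        dst_host.record(flow, self.switch_id, nbytes)


def debug_switch(switch, hosts):
    """Follow the switch's pointers and collect telemetry from those hosts only."""
    result = []
    for name in sorted(switch.pointers):
        result.extend(hosts[name].query(switch.switch_id))
    return result


hosts = {n: EndHost(n) for n in ("h1", "h2", "h3")}
s1, s2 = Switch("s1"), Switch("s2")
s1.forward("f1", hosts["h1"], 1500)
s1.forward("f2", hosts["h2"], 900)
s2.forward("f3", hosts["h3"], 400)
records = debug_switch(s1, hosts)
print(sorted(s1.pointers), [r["flow"] for r in records])
```

Debugging switch `s1` contacts only `h1` and `h2`; host `h3` is never queried, which is the benefit of keeping a directory at the switch instead of the telemetry itself.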
Finally, we present and solve a network policy fault localization problem that arises
in operating policy management frameworks for a production network. We develop
Scout, a fully automated system that localizes faults in a large-scale policy deployment
and further pinpoints the physical-level failures that are the most likely cause of the
observed faults.