Abstract: Data center networks encode locality and topology information into their server and switch addresses for performance and routing purposes. For this reason, the traditional address configuration protocols such as DHCP require a huge amount of manual input, leaving them error-prone. In this paper, we present DAC, a generic and automatic Data center Address Configuration system. With an automatically generated blueprint that defines the connections of servers and switches labeled by logical IDs, e.g., IP addresses, DAC first learns the physical topology labeled by device IDs, e.g., MAC addresses. Then, at the core of DAC is its device-to-logical ID mapping and malfunction detection. DAC makes an innovation in abstracting the device-to-logical ID mapping to the graph isomorphism problem and solves it with low time complexity by leveraging the attributes of data center network topologies. Its malfunction detection scheme detects errors such as device and link failures and miswirings, including the most difficult case where miswirings do not cause any node degree change. We have evaluated DAC via simulation, implementation, and experiments. Our simulation results show that DAC can accurately find all the hardest-to-detect malfunctions and can autoconfigure a large data center with 3.8 million devices in 46 s. In our implementation, we successfully autoconfigure a small 64-server BCube network within 300 ms and show that DAC is a viable solution for data center autoconfiguration.
infrastructure outsourcing for both individual users and organizations. To take advantage of economies of scale, it is common for a data center to contain tens or even hundreds of thousands of servers. The current choice for building data centers is using commodity servers and Ethernet switches for hardware and the standard TCP/IP protocol suite for interserver communication. This choice provides the best performance to price tradeoff [2] . All the servers are connected via network switches to form a large distributed system.
Before the servers and switches can provide any useful services, however, they must be correctly configured. For existing data centers using the TCP/IP protocol, the configuration includes assigning an IP address to every server. For layer-2 Ethernet, we can use DHCP [3] for dynamic IP address configuration. However, servers in a data center need more than one IP address in certain address ranges. This is because, for performance and fault tolerance reasons, servers need to know the locality of other servers. For example, in a distributed file system [4] , a chunk of data is replicated several times, typically three, to increase reliability. It is better to put the second replica on a server in the same rack as the original, and the third replica on a server at another rack. The current practice is to embed locality information into IP addresses. The address locality can also be used to increase performance. For example, instead of fetching a piece of data from a distant server, we can retrieve the same piece of data from a closer one. This kind of locality-based optimization is widely used in data center applications [4] , [5] .
The newly proposed data center network (DCN) structures [6] [7] [8] [9] go one step further by encoding their topology information into their logical IDs. These logical IDs can take the form of IP address (e.g., in VL2 [9] ), MAC address (e.g., in Portland [8] ), or even newly invented IDs (e.g., in DCell [6] and BCube [7] ). These structures then leverage the topological information embedded in the logical IDs for scalable and efficient routing. For example, Portland switches choose a routing path by exploiting the location information of destination Pseudo-MAC (PMAC). BCube servers build a source routing path by modifying one digit at one step based on source and destination BCube IDs.
For all the cases above, we need to configure the logical IDs, which may be IP or MAC addresses or BCube or DCell IDs, for all the servers and switches. Meanwhile, in the physical topology, all the devices are identified by their unique device IDs, such as MAC addresses. A naïve way is to build a static device-to-logical ID mapping table at the DHCP server. Building such a table is mainly a manual effort, which does not work for the following two reasons. First, the scale of a data center is huge. It is not uncommon for a mega data center to have hundreds of thousands of servers [1]. Second, manual configuration is error-prone. A recent survey of 100 data center professionals [10] suggested that 57% of data center outages are caused by human errors. Two more surveys [11], [12] showed that 50%-80% of network downtime is due to human configuration errors. In short, "the vast majority of failures in data centers are caused, triggered or exacerbated by human errors" [13].
B. Challenges and Contributions
Automatic address configuration is therefore highly desirable for data center networks. We envision that a good autoconfiguration system will have the following features, which also pose challenges for building such a system.
• Generality: The system needs to be applicable to various network topologies and addressing schemes.
• Efficiency and scalability: The system should assign a logical ID to a device quickly and be scalable to a large number of devices.
• Malfunction and error handling: The system must be able to handle various malfunctions such as broken NICs and wires and human errors such as miswirings.
• Minimal human intervention: The system should require minimal manual effort to reduce human errors.
To the best of our knowledge, there are very few existing solutions, and none of them can meet all the requirements above. In this paper, we address these problems by proposing DAC, a generic and automatic Data center Address Configuration system for the existing and future data center networks. To make our solution generic, we assume that we only have a blueprint of the to-be-configured data center network, which defines how the servers and switches are connected and labels each device with a logical ID. The blueprint can be automatically generated because all the existing data center network structures are quite regular and can be described either recursively or iteratively (see [6] [7] [8] [9] for examples).
Through a physical network topology learning procedure that we will describe in Section V, DAC first automatically learns and stores the physical topology of the data center network into an autoconfiguration manager. Then, we make the following two key contributions when designing DAC.
First of all, we solve the core problem of autoconfiguration: how to map the device IDs in the physical topology to the logical IDs in the blueprint while preserving the topological relationship of these devices. DAC makes an innovation in abstracting the device-to-logical ID mapping to the graph isomorphism (GI) problem [14] in graph theory. Existing GI solutions are too slow for some large-scale data center networks. Based on the attributes of data center network topologies, such as sparsity and symmetry (or asymmetry), we apply graph theory knowledge to design an improved algorithm that significantly speeds up the mapping. Specifically, we use three speedup techniques: candidate selection via SPLD, candidate pruning via orbit, and selective splitting. The first technique is our own. The last two we selected from previous works [15] and [16] , respectively, after finding that they are quite effective for data center graphs.
Second, although the malfunction detection problem is NP-complete and APX-hard,¹ we design a practical scheme that exploits the degree regularity in all data center structures to detect the malfunctions that cause device degree changes. For the hardest case, in which no degree changes, we propose a scheme that compares the blueprint graph and the physical topology graph from multiple anchor points and correlates malfunctions via majority voting. Evaluation shows that our solution is fast and is able to detect all the hardest-to-detect malfunctions.
We have studied our DAC design via extensive experiments and simulations. The experimental results show that the time of our device-to-logical ID mapping scales in proportion to the total number of devices in the networks. Furthermore, our simulation results show that DAC can autoconfigure a large data center with 3.8 million devices in 46 s. We have also developed and implemented DAC as an application on a 64-server test bed, where the 64 servers and 16 mini-switches form a two-level BCube [7] network. Our autoconfiguration protocols automatically and accurately assign BCube logical IDs to these 64 servers within 300 ms.
Roadmap: The rest of the paper is organized as follows. Section II presents the system overview. Section III introduces the device-to-logical ID mapping. Section IV discusses how DAC deals with malfunctions. Sections V and VI evaluate DAC via experiments, simulations, and implementations. Section VII discusses the related work. Section VIII concludes the paper.
II. SYSTEM OVERVIEW
One important characteristic shared by all data centers is that a given data center is owned and operated by a single organization. DAC takes advantage of this property to employ a centralized autoconfiguration manager, which we call the DAC manager throughout this paper. The DAC manager handles all the address configuration intelligence, such as physical topology collection, device-to-logical ID mapping, logical ID dissemination, and malfunction detection. In our design, the DAC manager can simply be a server in the physical topology or can run on a separate control network.
Our centralized design is also inspired by the success of several recent large-scale infrastructure deployments. For instance, the data processing system MapReduce [5] and the modern storage GFS [4] employ a central master at the scale of tens of thousands of devices. More recently, Portland [8] leverages a fabric manager to realize a scalable and efficient layer-2 data center network fabric.
As stated in our first design goal, DAC should be a generic solution for various topologies and addressing schemes. To achieve this, DAC cannot assume any specific form of structure or addressing scheme in its design. Considering this, DAC only uses the following two graphs as its input.
1) Blueprint: Data centers have well-defined structures. Prior to deploying a real data center, a blueprint [ Fig. 1(a) ] should be designed to guide the construction of the data center. To make our solution generic, we only require the blueprint to provide the following minimal information.
• Interconnections between devices: It should define the interconnections between devices. Note that though it is possible for a blueprint to label port numbers and define how the ports of neighboring devices are connected, DAC does not depend on such information. DAC only requires the neighbor information of the devices, contained in any connected graph.
• Logical ID for each device: It should specify a logical ID for each device.² The encoding of these logical IDs conveys the topological information of the network structure. These logical IDs are vital for server communication and routing protocols.
Since data center networks are quite regular and can be described iteratively or recursively, we can automatically generate the blueprint using software.
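As an illustration of how a blueprint can be generated by software, the following minimal sketch builds the blueprint graph of a BCube(n, k) network [7] as an adjacency map. The function name `bcube_blueprint` and the tuple encoding of the logical IDs are assumptions made for this example; they are not the format used by DAC.

```python
from itertools import product

def bcube_blueprint(n, k):
    """Return {logical_id: set(neighbor logical_ids)} for BCube(n, k).

    Servers are labeled ("srv", a_k, ..., a_0) and switches
    ("sw", level, remaining digits). This labeling is illustrative only.
    """
    graph = {}
    for digits in product(range(n), repeat=k + 1):
        server = ("srv",) + digits
        graph.setdefault(server, set())
        for level in range(k + 1):
            # digits is (a_k, ..., a_0), so digit a_level sits at index k - level.
            idx = k - level
            # A level-i switch is identified by the server address minus digit a_i.
            switch = ("sw", level) + digits[:idx] + digits[idx + 1:]
            graph.setdefault(switch, set()).add(server)
            graph[server].add(switch)
    return graph

blueprint = bcube_blueprint(8, 1)
servers = [v for v in blueprint if v[0] == "srv"]
switches = [v for v in blueprint if v[0] == "sw"]
print(len(servers), len(switches))  # 64 16
```

With n = 8 and k = 1, the sketch yields the 64 servers and 16 switches of the two-level BCube test bed mentioned in Section I.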
2) Physical Network Topology: The physical topology [ Fig. 1(b) ] is constructed by following the interconnections defined in the blueprint. In this physical topology, we use the MAC address as a device ID to uniquely identify a device. For a device with multiple MAC addresses, we use the lowest one.
In the rest of the paper, we use $G_b = (V_b, E_b)$ to denote the blueprint graph and $G_p = (V_p, E_p)$ to denote the physical topology graph. $V_b$/$V_p$ are the sets of nodes (i.e., devices) with logical/device IDs, respectively, and $E_b$/$E_p$ are the sets of edges (i.e., links). Note that while the blueprint graph $G_b$ is known for any data center, the physical topology graph $G_p$ is not known until the data center is built and the information is collected.
The whole DAC system structure is illustrated in Fig. 2 . The two core components of DAC are device-to-logical ID mapping and malfunction detection and handling. We also have a module to collect the physical topology and a module to disseminate the logical IDs to individual devices after DAC manager finishes the device-to-logical ID mapping. In what follows, we overview the design of these modules.
3) Physical Topology Collection: In order to perform logical ID resolution, we need to know both the blueprint graph $G_b$ and the physical topology graph $G_p$. Since $G_p$ is not known readily, DAC requires a communication channel over the physical network to collect the physical topology information. To this end, we propose a Communication channel Building Protocol (CBP). The channel built from CBP is a layered spanning tree: the root is the DAC manager at level 0, its children are at level 1, and so on.
² While most data center structures, like BCube [7], DCell [6], Ficonn [17], and Portland [8], use device-based logical IDs, there also exist structures, like VL2 [9], that use port-based logical IDs. For brevity, in this paper, DAC is introduced and evaluated for the device-based case. It can handle the port-based scenario by simply considering each port as a single device and treating a device with multiple ports as multiple logical devices.
When the channel is built, the next step is to collect the physical topology graph $G_p$. For this, we introduce a Physical topology Collection Protocol (PCP). In PCP, the physical topology information, i.e., the connection information of each node, is propagated bottom-up from the leaf devices to the root (i.e., the DAC manager) layer by layer. After $G_p$ is collected by the DAC manager, we go to the device-to-logical ID mapping module.
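The two steps above can be sketched in a single process as follows. This is only a simulation under simplifying assumptions: the real CBP and PCP are message-based protocols described in Section V, whereas here a breadth-first search stands in for channel building and a loop over levels stands in for the bottom-up reports. The function names `build_channel` and `collect_topology` are hypothetical.

```python
from collections import deque

def build_channel(adj, manager):
    """BFS from the manager; returns (parent, level) maps of a layered spanning tree."""
    parent, level = {manager: None}, {manager: 0}
    queue = deque([manager])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in level:
                parent[v], level[v] = u, level[u] + 1
                queue.append(v)
    return parent, level

def collect_topology(adj, parent, level):
    """Propagate every node's neighbor list bottom-up, layer by layer, to the root."""
    reports = {u: {u: set(adj[u])} for u in parent}     # what each node knows locally
    for u in sorted(parent, key=lambda x: -level[x]):   # deepest level first
        if parent[u] is not None:
            reports[parent[u]].update(reports[u])       # a child reports to its parent
    root = next(u for u in parent if parent[u] is None)
    return reports[root]

# Toy physical topology keyed by device IDs; "m" plays the role of the manager.
adj = {"m": {"a", "b"}, "a": {"m", "c"}, "b": {"m", "c"}, "c": {"a", "b"}}
parent, level = build_channel(adj, "m")
collected = collect_topology(adj, parent, level)
print(level["c"], collected == adj)
```

Processing the deepest level first guarantees that a node has received the reports of its whole subtree before it reports to its own parent.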
4) Device-to-Logical ID Mapping: After the physical topology graph $G_p$ has been collected, we come to device-to-logical ID mapping, which is a key component of DAC. As introduced in Section I, the challenge is how to have the mapping reflect the topological relationship of these devices. To this end, we devise a fast one-to-one mapping engine to realize this functionality. We elaborate on it fully in Section III.
5) Logical ID Dissemination: When logical IDs for all the devices have been resolved, i.e., the device-to-logical ID mapping table has been obtained, we need to disseminate this information to the whole network. To this end, we introduce a Logical ID Dissemination Protocol (LDP). In contrast to PCP, in LDP the mapping table is delivered top-down from the DAC manager to the leaf devices, layer by layer. Upon receipt of such information, a device can easily index its logical ID according to its device ID. A more detailed explanation of LDP, together with CBP and PCP, is given in Section V.
6) Malfunction Detection and Handling: DAC needs to automatically detect malfunctions and pinpoint their locations. For this, we introduce a malfunction detection and handling module. In DAC, this module interacts tightly with the device-to-logical ID mapping module because the former is only triggered by the latter. If there exist malfunctions in the physical topology graph $G_p$, our mapping engine quickly perceives this by noticing that $G_p$ does not match the blueprint graph $G_b$. Then, the malfunction detection module is immediately invoked to detect those malfunctioning devices and report them to network administrators. We describe this module in Section IV.
III. DEVICE-TO-LOGICAL ID MAPPING
In this section, we formally introduce how DAC performs the device-to-logical ID mapping. We first formulate the mapping using graph theory. Then, we solve the problem via optimizations designed for data center structures. Lastly, we discuss how to do the mapping for data center expansion.
A. Problem Formulation and Solution Overview
As introduced, the challenge here is to do the device-to-logical ID mapping such that this mapping reflects the topological relationship of these devices. Considering that we have the blueprint graph $G_b$ and the physical topology graph $G_p$, to meet the above requirement, we formulate the mapping problem as finding a one-to-one mapping between the nodes of $G_p$ and $G_b$ while preserving the adjacencies of the two graphs. Interestingly, this is actually a variant of the classical graph isomorphism (GI) problem [14].
Definition 1: Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic, denoted by $G_1 \cong G_2$, if there is a bijection $f: V_1 \to V_2$ such that $\{u, v\} \in E_1$ if and only if $\{f(u), f(v)\} \in E_2$, for all $u, v \in V_1$. Such a bijection $f$ is called a graph isomorphism between $G_1$ and $G_2$.
To the best of our knowledge, we are the first to introduce the GI model to data center networks, thus solving the address autoconfiguration problem. After the problem formulation, the next step is to solve the GI problem. In the past 20 years, many research efforts have been made to determine whether the general GI problem is in P or is NP-complete [14]. When the maximum node degree is bounded, a polynomial-time algorithm is known [18], with a time complexity that depends on the number of nodes and the maximum node degree.
Fig. 3. Mapping engine.
However, this is too slow for our problem since data centers can have millions of devices [6] and the maximal node degree can be more than 100 [9]. To this end, we devise a fast one-to-one mapping engine. As shown in Fig. 3, the engine starts with a base algorithm for general graphs, and on top of it we propose an improved algorithm using three speedup techniques: candidate selection via SPLD, candidate filtering via orbit, and selective splitting, which are specially tailored to the attributes of data center structures and our real address autoconfiguration application. In the following, we first introduce some preliminaries together with the base algorithm, and then introduce the improved algorithm.
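To make Definition 1 concrete, the following minimal sketch checks whether a given node mapping is a graph isomorphism. Graphs are assumed to be adjacency maps from a node to the set of its neighbors; the function name `is_isomorphism` is hypothetical and the check is not part of the mapping engine itself.

```python
def is_isomorphism(adj1, adj2, f):
    """Check that f (dict: node of G1 -> node of G2) is a graph isomorphism."""
    images = set(f.values())
    # f must be a bijection from the nodes of G1 onto the nodes of G2.
    if set(f) != set(adj1) or images != set(adj2) or len(images) != len(f):
        return False
    # Adjacency is preserved in both directions when the mapped neighbor set
    # of every node equals the neighbor set of its image.
    return all({f[w] for w in adj1[u]} == adj2[f[u]] for u in adj1)

path_b = {"L0": {"L1"}, "L1": {"L0", "L2"}, "L2": {"L1"}}
path_p = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
print(is_isomorphism(path_p, path_b, {"x": "L0", "y": "L1", "z": "L2"}))  # True
print(is_isomorphism(path_p, path_b, {"x": "L1", "y": "L0", "z": "L2"}))  # False
```

Verifying a candidate mapping is cheap; the difficulty addressed in this section is finding such a mapping.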
B. Base Algorithm

1) Preliminaries: Given a graph $G = (V, E)$, a partition of the vertex set $V$, e.g., $\Pi = (\pi^1, \pi^2, \ldots, \pi^m)$, is a set of disjoint nonempty subsets of $V$ whose union is $V$. We call each subset $\pi^i$ a cell. In the mapping engine, the basic operations on partitions or cells are "decompose" and "split."
• Decompose: Given a node $v$, a cell $\pi^i$, and a partition $\Pi$ where $v \in \pi^i$ and $\pi^i \in \Pi$, using $v$ to decompose $\pi^i$ means to replace $\pi^i$ with $\{v\}$ and $\pi^i \setminus v$ in partition $\Pi$, where $\setminus$ is set minus, meaning to remove node $v$ from $\pi^i$.
• Split: Given two cells $\pi^i, \pi^t \in \Pi$, using $\pi^i$ to split $\pi^t$ means doing the following. First, for each node $v \in \pi^t$, we calculate a value $k = \eta(v, \pi^i)$ as the number of connections between node $v$ and nodes in $\pi^i$, where $\eta$ is called the connection function. Then, we divide $\pi^t$ into smaller cells by grouping the nodes with the same $k$ value together to be a new cell. Moreover, we call $\pi^i$ the inducing cell and $\pi^t$ the target cell. The target cell should be a non-singleton.
A partition is equitable if no cell can be split by any other cell in the partition. A partition is discrete if each cell of this partition is a singleton (i.e., single element). Suppose we use an inducing cell pair $\pi_1^i$/$\pi_2^i$ to split a target cell pair $\pi_1^t$/$\pi_2^t$. Then $\pi_1^t$/$\pi_2^t$ are divided isomorphically by $\pi_1^i$/$\pi_2^i$ if, for each value $k$, $\pi_1^t$ has the same number of nodes with $k$-connection to $\pi_1^i$ as $\pi_2^t$ has to $\pi_2^i$.
Note that the cells in a partition have their orders. We use parentheses to represent a partition, and each cell is indexed by its order. For example, $\Pi = (\pi^1, \pi^2, \ldots, \pi^m)$ means a partition $\Pi$ with $m$ cells whose $i$th cell is $\pi^i$. In our mapping algorithm, the decomposition/split operation always works on the corresponding pair of cells (i.e., two cells with the same order) in two partitions. Furthermore, during these operations, we place the split cells back to the partitions in corresponding orders. For example, decomposing $\pi_1^i$/$\pi_2^i$ with $v_1$/$v_2$, we replace $\pi_1^i$ with $\{v_1\}$, $\pi_1^i \setminus v_1$, and $\pi_2^i$ with $\{v_2\}$, $\pi_2^i \setminus v_2$, and then place the split cells back to the partitions such that $\{v_1\}$ and $\{v_2\}$ are in the same order and $\pi_1^i \setminus v_1$ and $\pi_2^i \setminus v_2$ are in the same order.
In addition to the above terms, we further have two important terms used in the improved algorithm, which are SPLD and orbit.
• SPLD: SPLD is short for shortest path length distribution. The SPLD of a node is the distribution of distances between this node and all other nodes in the graph.
• Orbit: An orbit is a subset of nodes in a graph $G$ such that two nodes $u$ and $v$ are in the same orbit if there exists an automorphism³ of $G$ that maps $u$ to $v$ [19]. For example, in the blueprint graph of Fig. 6, several nodes are in the same orbit since there is an automorphism of that graph that maps one of them to another.
2) Base Algorithm: Fig. 4 is a base mapping algorithm for general graphs that we summarize from the previous literature. It contains a decomposition procedure and a refinement procedure, and it repeatedly decomposes and refines (or splits) the two partitions, one for the physical topology graph and one for the blueprint graph, until either they both are discrete or it terminates in the middle, finding that the two graphs are not isomorphic. In each level of recursion, we first check if the current partitions are discrete. If so, we return true (line 2) and get a one-to-one mapping by mapping each singleton cell of the physical topology partition to the corresponding singleton cell of the blueprint partition. Otherwise, we perform decomposition and refinement. If all the candidates in the selected cell fail to be mapped, we must backtrack (line 10). Such recursion continues until either both partitions become discrete, i.e., a one-to-one mapping is found (line 2), or we backtrack to the root of the search tree, thus concluding that no one-to-one mapping exists (line 12).
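The decompose, split, and backtracking steps described above can be sketched as follows. This is a minimal, unoptimized illustration written for this text, not the algorithm of Fig. 4: graphs are assumed to be adjacency maps of neighbor sets, partitions are ordered lists of cells, and all function names are hypothetical. Unlike the description above, the sketch also groups singleton target cells, which lets the final refinement pass verify every adjacency of the returned mapping.

```python
def group_by_connections(adj, cell, inducing):
    """Group the nodes of `cell` by their number of neighbors inside `inducing`."""
    groups = {}
    for v in cell:
        groups.setdefault(len(adj[v] & inducing), []).append(v)
    return groups

def refine(adj_p, adj_b, part_p, part_b):
    """Split corresponding cells until equitable; None on a nonisomorphic division."""
    changed = True
    while changed:
        changed = False
        for i in range(len(part_p)):
            ind_p, ind_b = set(part_p[i]), set(part_b[i])
            new_p, new_b = [], []
            for cell_p, cell_b in zip(part_p, part_b):
                gp = group_by_connections(adj_p, cell_p, ind_p)
                gb = group_by_connections(adj_b, cell_b, ind_b)
                if {k: len(c) for k, c in gp.items()} != {k: len(c) for k, c in gb.items()}:
                    return None
                for k in sorted(gp):          # keep split cells in corresponding order
                    new_p.append(gp[k])
                    new_b.append(gb[k])
            if len(new_p) != len(part_p):     # some cell was split: restart the scan
                part_p, part_b, changed = new_p, new_b, True
                break
    return part_p, part_b

def base_mapping(adj_p, adj_b, part_p, part_b):
    """Return a {device_id: logical_id} isomorphism, or None if none exists."""
    refined = refine(adj_p, adj_b, part_p, part_b)
    if refined is None:
        return None
    part_p, part_b = refined
    idx = next((i for i, c in enumerate(part_p) if len(c) > 1), None)
    if idx is None:                           # both partitions are discrete
        return {cp[0]: cb[0] for cp, cb in zip(part_p, part_b)}
    v = part_p[idx][0]
    rest_p = [x for x in part_p[idx] if x != v]
    for cand in part_b[idx]:                  # try each candidate, backtrack on failure
        rest_b = [x for x in part_b[idx] if x != cand]
        result = base_mapping(
            adj_p, adj_b,
            part_p[:idx] + [[v], rest_p] + part_p[idx + 1:],
            part_b[:idx] + [[cand], rest_b] + part_b[idx + 1:],
        )
        if result is not None:
            return result
    return None

# Toy example: a 4-node path labeled by logical IDs and by device IDs.
adj_b = {"L0": {"L1"}, "L1": {"L0", "L2"}, "L2": {"L1", "L3"}, "L3": {"L2"}}
adj_p = {"m0": {"m2", "m3"}, "m1": {"m3"}, "m2": {"m0"}, "m3": {"m0", "m1"}}
mapping = base_mapping(adj_p, adj_b, [sorted(adj_p)], [sorted(adj_b)])
print(mapping)
```

The sketch restarts the scan after every split, which is simple but slow; it is meant to show the control flow only.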
C. Improved Algorithm
Compared to general graphs, network topologies of data centers have the following attributes: 1) they are sparse; 2) they are typically either highly symmetric like BCube [7] or highly asymmetric like DCell [6] . In any case, for our address autoconfiguration problem, the blueprint graph is available in advance, which means we can do some precomputation.
Based on these features, we apply graph theory to design an improved algorithm with three speedup techniques: candidate selection via SPLD, candidate filtering via orbit, and selective splitting to speed up the device-to-logical ID mapping. Specifically, we introduce the first technique and borrow the last two from [15] and [16] , respectively, based on their effectiveness for graphs derived for data centers. We prove that adding these speedup techniques to the base algorithm maintains its correctness [20] . Our experiments in Section VI-B indicate that we need all these three speedup techniques to solve our problem, and any partial combination of them is slow for some structures. Fig. 5 is the improved algorithm built on the base algorithm. In the following, we explain the three speedup techniques emphasizing the reasons why they are suitable for data center graphs.
1) Candidate Selection via SPLD: We observe that nodes in data centers have different roles such as switches and servers, and switches in some data centers like FatTree can be further divided into ToR, aggregation, and core. Hence, from this point of view, SPLD can be helpful by itself to distinguish nodes of different roles. Furthermore, SPLD can provide an even more significant improvement for structures like DCell, which are very asymmetric. This is because the SPLDs of different nodes in DCell are very different. To take advantage of this property, we propose using SPLD as a more sophisticated signature to select mapping candidates. That is, when we select candidates in the blueprint graph for a node $v$ in the physical topology graph, we only select the nodes that have the same SPLD as $v$. This is effective because two nodes with different SPLDs cannot be mapped to each other. However, computing SPLDs for all nodes in a large graph is time-consuming. Fortunately, this can be computed in advance on the blueprint.
In our improved algorithm, we precompute the SPLDs for all nodes of the blueprint graph beforehand. In lines 6 and 7, we improve the base algorithm in this way: if the number of candidates for a node $v$ exceeds a threshold and the number of different SPLDs among these candidates exceeds a second threshold, we compute the SPLD for $v$ and only select the candidates having the same SPLD. Both thresholds are tunable. Note that using this technique is a tradeoff: although we can precompute the SPLDs of the blueprint graph offline, applying this optimization means that we must compute the SPLD of $v$ in the physical topology graph online, which also consumes time. In all our later experiments, we apply this technique on all the structures only once, at the first round of mapping.
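A minimal sketch of SPLD-based candidate selection is given below. It assumes unweighted graphs stored as adjacency maps and computes an SPLD with one breadth-first search per node; the two thresholds are omitted for brevity, and the names `spld` and `select_candidates` are hypothetical.

```python
from collections import Counter, deque

def spld(adj, source):
    """SPLD of `source` as a sorted tuple of (distance, number of nodes) pairs."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    counts = Counter(d for d in dist.values() if d > 0)
    return tuple(sorted(counts.items()))

def select_candidates(adj_p, v, candidates, blueprint_spld):
    """Keep only the blueprint candidates whose precomputed SPLD equals that of v."""
    target = spld(adj_p, v)            # computed online on the physical topology
    return [c for c in candidates if blueprint_spld[c] == target]

adj_b = {"L0": {"L1"}, "L1": {"L0", "L2"}, "L2": {"L1", "L3"}, "L3": {"L2"}}
adj_p = {"m0": {"m2", "m3"}, "m1": {"m3"}, "m2": {"m0"}, "m3": {"m0", "m1"}}
blueprint_spld = {u: spld(adj_b, u) for u in adj_b}   # precomputed offline
print(select_candidates(adj_p, "m1", sorted(adj_b), blueprint_spld))  # ['L0', 'L3']
```

In this toy path, the end device `m1` can only be mapped to one of the two end nodes of the blueprint, so half of the candidates are discarded before any decomposition is attempted.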
2) Candidate Filtering via Orbit: It is indicated in [15] that for a node $u$ in one graph and a node $v$ in the other graph, if $u$ cannot be mapped to $v$, then all nodes in the same orbit as $u$ cannot be mapped to $v$ either. We find that this theory is naturally suited for solving the GI problem on data centers. First, some structures such as BCube are highly symmetric, and there should be many symmetric nodes within these structures that are in the same orbit. Second, the blueprint graph is available much earlier than the real address autoconfiguration stage, and we can easily precompute the orbits in the blueprint beforehand using preexisting tools such as [16], [21].
In Fig. 4 , the base algorithm tries to map to every node in iteratively if the current mapping fails, which is not effective especially for highly symmetric data center structures. Observing this, in the improved algorithm, we precompute all the orbits of beforehand. Then, as shown in lines 16-18, we improve the base algorithm: If we find a certain node cannot be mapped to , we skip all the attempts that try to map to any other node in the same orbit as because, according to the above theory, these nodes cannot be mapped to either.
3) Selective Splitting: In the base algorithm, the refinement procedure tries to use the inducing cell to split all the other cells. As data center structures are sparse, it is likely that while there are many cells in the partition, the majority of them are not connected to the inducing cell. Observing this, in line 11, we use a selective refinement procedure, in which we only try to split the cells that really connect to the inducing cell rather than all cells.⁴ Furthermore, when splitting a connected cell $\pi^t$, the base algorithm calculates the number of connections between each node in $\pi^t$ and the inducing cell, and then divides $\pi^t$ based on these values. Again, due to sparsity, it is likely that the number of nodes in $\pi^t$ that really connect to the inducing cell is very small. Observing this, in a similar way, we speed up splitting by only calculating the number of connections for the nodes actually connected. The unconnected nodes can be grouped together directly. Specifically, when splitting $\pi^t$ using inducing cell $\pi^i$, we first move the elements in $\pi^t$ with connections to $\pi^i$ to the left end of $\pi^t$ and leave all unconnected elements on the right. Then, we only calculate the connection values for the elements on the left and group them according to these values.
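The idea can be sketched for a single partition as follows. The sketch walks only the adjacency lists of the nodes in the inducing cell, so cells without any link to it are never examined. It is a simplification: `selective_split` and the `cell_of` index are hypothetical names, the unconnected nodes are placed first rather than on the right, and they are separated with a scan of the cell where an optimized implementation would use in-place swaps.

```python
def selective_split(adj, partition, cell_of, inducing_idx):
    """Split only the cells that have at least one link into the inducing cell."""
    counts = {}                                  # node -> links into the inducing cell
    for u in partition[inducing_idx]:
        for w in adj[u]:
            counts[w] = counts.get(w, 0) + 1
    by_cell = {}                                 # cell index -> {count: [nodes]}
    for w, k in counts.items():
        by_cell.setdefault(cell_of[w], {}).setdefault(k, []).append(w)
    result = []
    for idx, cell in enumerate(partition):
        groups = by_cell.get(idx)
        if groups is None or len(cell) == 1:
            result.append(cell)                  # unconnected or singleton: untouched
            continue
        connected = {w for members in groups.values() for w in members}
        rest = [v for v in cell if v not in connected]
        if rest:
            result.append(rest)                  # 0-connection nodes grouped directly
        result.extend(groups[k] for k in sorted(groups))
    return result

adj = {"c": {"x", "y"}, "x": {"c"}, "y": {"c"}, "p": {"q"}, "q": {"p"}}
partition = [["c"], ["x", "y", "p", "q"]]
cell_of = {"c": 0, "x": 1, "y": 1, "p": 1, "q": 1}
print(selective_split(adj, partition, cell_of, 0))
```

Connection counts are computed only for `x` and `y`, the two nodes that actually touch the inducing cell; `p` and `q` are grouped without any counting.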
4) Walkthrough Example: We provide a step-by-step example of our algorithm in Fig. 6. The blueprint graph $G_b$ is labeled by its logical IDs, and the physical topology graph $G_p$ is labeled by its device IDs. White arrows mean decomposition, and dark arrows mean refinement. Suppose all orbits in $G_b$ have been calculated beforehand. Initially, all nodes of each graph are in one cell of the corresponding partition.
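Step (1) decomposes the original partitions using a first pair of nodes, one from each graph.
Step (2) refines the current partitions using the resulting inducing cells, but fails due to a nonisomorphic division. This is because during splitting, the target cell of one partition has four elements with 1-connection to its inducing cell and three elements with 0-connection, while the target cell of the other partition has one element with 1-connection to its inducing cell and seven elements with 0-connection. Therefore, they are not divided isomorphically.
From step (2), we know that the candidate chosen in step (1) cannot be mapped. By speedup technique 2, we skip the three other candidates that are in the same orbit as this candidate. Thus, in step (3), we decompose the original partitions using a candidate from a different orbit.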
Step (4) refines the current partitions using the inducing cells created in step (3). Specifically, in each of the two target cells, we find four nodes that have 1-connection to the inducing cell while the rest do not. Therefore, the target cells are isomorphically divided by the inducing cells. After step (4), since the current partitions are not yet equitable, in steps (5) and (6), we continue to use the newly born cells to further split other cells until the partitions are equitable.
⁴ We achieve this by maintaining an adjacency list that is built once when the graph is read. In the adjacency list, for each vertex, we keep the neighboring vertices, so at any point we know the vertices each vertex is connected to. We also have another data structure that keeps track of the place where each vertex is located within the partition. In this way, we know which cell is connected to the inducing cell.
Steps (7)-(9) decompose the current partitions using three further pairs of nodes, respectively. Since in each of these three steps there is no cell that can be split by other cells, no division is performed. After step (9), the two partitions are discrete, and we find a one-to-one mapping between the two graphs by mapping each node in one partition to its corresponding node in the other. Two things should be noted in the above example. First and most importantly, we do not use speedup technique 1 since we want to show the case of nonisomorphic division in steps (1) and (2). In the real mapping, after applying speedup technique 1, we will directly go from step (3) instead of trying the candidate of step (1), because the two nodes involved have different SPLDs. This shows that SPLD is effective in selecting mapping candidates. Second, although we have not explicitly mentioned speedup technique 3, in each refinement we only try to split the connected cells rather than all cells. For example, after step (7), new cells are born, but when it comes to refinement, we do not try to split the cells that are not connected to the newly born inducing cell.
D. Using the Mapping Engine for Data Center Expansion
To meet the growth of applications and storage, the scale of a data center does not remain the same for long [22] . Therefore, address autoconfiguration for data center expansion is required. Two direct approaches are either to configure the new part directly or to configure the entire data center as a whole. However, both approaches have problems. The first one fails to take into account the connections between the new part and the old part of the expanded data center. The second one considers the connections between the new part and the old part, but it may cause another lethal problem, i.e., the newly allocated logical IDs are different from the original ones for the same devices of the old part, messing up existing communications.
To avoid these problems, DAC configures the entire data center while keeping the logical IDs for the old part unmodified. To achieve this goal, we still use the mapping engine but modify its input. Instead of putting all the nodes from a graph in one cell as before, we first differentiate the nodes of the new part from those of the old part in the physical topology graph $G_p$ and the blueprint graph $G_b$. Since we already have the device-to-logical ID mapping for the old part, say $d_i \to l_i$ for $0 \le i < k$, we explicitly express this one-to-one mapping in the partitions. In other words, we have $\Pi_p = (\{d_0\}, \ldots, \{d_{k-1}\}, T_p)$ and $\Pi_b = (\{l_0\}, \ldots, \{l_{k-1}\}, T_b)$, and all the nodes for the new part of $G_p$/$G_b$ are in $T_p$/$T_b$, respectively. Then, we refine $\Pi_p$/$\Pi_b$ until they both are equitable. At last, we enter the mapping with the equitable partitions. In this way, we can produce a device-to-logical ID mapping table for the new part of the data center while keeping the logical IDs for devices of the old part unmodified.
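The modified input can be sketched as follows: every already-configured device becomes a singleton cell paired, in the same order, with its existing logical ID, and all new nodes are placed in one last cell. The function name `expansion_partitions` and the list-of-lists partition format are assumptions made for this example.

```python
def expansion_partitions(old_mapping, nodes_p, nodes_b):
    """Build initial ordered partitions that pin the old device-to-logical mapping."""
    part_p, part_b = [], []
    for device, logical in sorted(old_mapping.items()):
        part_p.append([device])      # old part: one singleton cell per device ...
        part_b.append([logical])     # ... paired with its existing logical ID
    new_p = sorted(set(nodes_p) - set(old_mapping))
    new_b = sorted(set(nodes_b) - set(old_mapping.values()))
    if new_p or new_b:
        part_p.append(new_p)         # all new devices share the last cell
        part_b.append(new_b)         # all new logical IDs share the last cell
    return part_p, part_b

old = {"m0": "L0", "m1": "L1"}
part_p, part_b = expansion_partitions(old, ["m0", "m1", "m2", "m3"],
                                      ["L0", "L1", "L2", "L3"])
print(part_p)  # [['m0'], ['m1'], ['m2', 'm3']]
print(part_b)  # [['L0'], ['L1'], ['L2', 'L3']]
```

Because corresponding singleton cells are never split again, any mapping found from these partitions keeps the old assignments unchanged.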
IV. MALFUNCTION DETECTION AND HANDLING
As introduced before, the malfunction detection module is triggered when the mapping engine returns false. This "false" indicates that the physical topology is not the same as the blueprint. In this section, we describe how DAC handles malfunctions.
A. Malfunction Overview
Malfunctions can be caused by hardware and software failures or simply human configuration errors. For example, bad or mismatched network cards and cables are common, and miswired or improperly connected cables are nearly inevitable.
We consider and categorize three malfunction types in data centers: node, link, and miswiring. The first type occurs when a given server or switch breaks down from hardware or software reasons, causing it to be completely unreachable and disconnected from the network. The second one occurs when the cable or network card is broken or not properly plugged in so that the connectivity between devices on that link is lost. The third one occurs when wired cables are different from those in the blueprint. These malfunctions may introduce severe problems and downgrade the performance.
Note that from the physical topology, it is unlikely to clearly distinguish some failure types, e.g., a crashed server versus completely malfunctioning interface cards on that server. Our goal is to detect and further locate all malfunction-related devices and report the device information to network administrators, rather than identifying the malfunction type. We believe our malfunction handling not only solves this issue for autoconfiguration, but also reduces the deployment/maintenance costs for real-world large data center deployment.
B. Problem Complexity and Challenge
The problem of malfunction detection can be formally described as follows. Given the blueprint graph and the physical topology graph, the problem of locating all the malfunctioning parts in the physical topology graph is equivalent to obtaining the maximum common subgraph (MCS) of the two graphs. Thus, we compare the MCS with the physical topology graph to find the differences, which are the malfunctioning parts. All the devices (i.e., servers or switches) related to these parts, which we call malfunctioning devices, can be detected. However, the MCS problem is proven to be NP-complete [23] and APX-hard [24]. That is, no efficient algorithm is known, especially for large graphs such as those of data center network topologies. Therefore, we resort to designing our own algorithms based on the particular properties of data center structures and our real-world application scenario. There are two problems we need to address in Sections IV-C-IV-E: 1) detecting the malfunctioning devices by identifying their device IDs; and 2) locating the physical position of a malfunctioning device with its device ID automatically.
C. Practical Malfunction Detection Methods
To achieve better performance and easier management, large-scale data centers are usually designed and constructed according to some patterns or rules. Such patterns or rules imply two properties of the data center structures. 1) The nodes in the topologies typically have regular degrees. For example, we show the degree patterns for several well-known data center networks in Table I. 2) The graphs are sparse, so that our mapping algorithm can quickly determine whether two graphs are isomorphic. These properties are important for us to detect malfunctions in data centers. In DAC, the first property is used to detect malfunctioning devices where there are node degree changes, and the second one serves as a tool in our malfunction detection scheme for the case where no degree change occurs.
1) Malfunction With Node Degree Change:
For the aforementioned three types of malfunctions, we discuss them one by one as follows. Our observation is that most of the cases may cause the change of degree on devices.
• Node: If there is a malfunctioning node, the degrees of its neighboring nodes are decreased by one, and thus it is possible to identify the malfunction by checking its neighbor nodes.
• Link: If there is a malfunctioning link, the degrees of associated nodes are decreased by one, making it possible to detect.
• Miswiring: Miswirings are somewhat more complex than the other two errors. As shown in the left of Fig. 7 , the miswiring causes its related nodes to increase or decrease their degrees and can be detected readily. On the contrary, in the right of Fig. 7 , the miswirings of a pair of cables occur coincidentally so that the degree change caused by one miswired cable is glossed over by another, and thus no node degree change happens. We discuss this hardest case separately in the following. Note that for any malfunction caused by the links, i.e., link failure or miswirings, we report the associated nodes (i.e., malfunctioning devices) in our malfunction detection.
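The degree-based check discussed in the three cases above can be sketched as follows. This is only an illustration under simplifying assumptions: graphs are adjacency dictionaries, the function name `degree_suspects` is hypothetical, and a node is flagged when its observed degree does not occur anywhere in the blueprint's degree pattern. A degree that changes from one legal value to another would escape this simple test.

```python
def degree_suspects(physical_adj, blueprint_adj):
    """Return physical nodes whose degree never occurs in the blueprint."""
    # Regular topologies have only a handful of legal node degrees.
    allowed = {len(neighbors) for neighbors in blueprint_adj.values()}
    return sorted(
        node
        for node, neighbors in physical_adj.items()
        if len(neighbors) not in allowed
    )


# Blueprint: one switch with three servers (degrees 3 and 1).
blueprint = {"sw": ["s1", "s2", "s3"], "s1": ["sw"], "s2": ["sw"], "s3": ["sw"]}
# Physical topology: one server is down, so the switch only sees two links.
physical = {"SW": ["A", "B"], "A": ["SW"], "B": ["SW"]}
suspects = degree_suspects(physical, blueprint)
```

Here the switch is reported because its degree dropped from 3 to 2, which matches the observation that a failed node or link is visible at its neighbors.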
2) Malfunction Without Node Degree Change: Though in most cases the malfunctions cause detectable node degree change [25], it is still possible to have miswirings with no node degree change. This case occurs after an administrator has checked the network and the degree-changing malfunctions have been fixed. The practical assumptions here are: 1) the number of nodes involved in such malfunctions is small relative to the total number of nodes; 2) the blueprint graph and the physical topology graph have the same number of nodes and the same node degree patterns.
Despite the miswirings, the vast majority of the blueprint graph and the physical topology graph are still the same. We leverage this fact to detect such miswirings. Our basic idea is to first find some nodes that are supposed to be symmetric between the two graphs, then use those nodes as anchor points to check whether the subgraphs deduced from them are isomorphic. Through this we derive the difference between the two graphs and correlate the malfunctioning candidates derived from different anchor points to make a decision. Basically, our scheme has two parts: anchor point selection and malfunction detection.
To minimize human intervention, the first challenge is selecting anchor pairs between the blueprint graph and the physical topology graph without human input. Our idea is again to leverage the SPLD. Considering that the number of nodes involved in miswirings is small, it is likely that two "symmetric" nodes in the two graphs will still have similar SPLDs. Based on this, we design our heuristics to select anchor pair points, as shown in Fig. 8. In the algorithm, the distance between two SPLDs is simply the Euclidean distance. Given that two nodes with similar SPLDs are not necessarily a truly symmetric pair, our malfunction detection scheme takes the potential false positives into account and handles this issue via majority voting.
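To make the SPLD comparison concrete, the following Python sketch pairs each chosen physical node with the blueprint node whose SPLD is closest in Euclidean distance. It is an illustration rather than the algorithm of Fig. 8: we assume the SPLD of a node is the count of nodes at each hop distance from it, and the function names and the zero-padding of unequal-length vectors are our own choices.

```python
from collections import deque
from math import sqrt


def spld(adj, source):
    """Number of nodes at each hop distance from source (index = hops)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    counts = [0] * (max(dist.values()) + 1)
    for hops in dist.values():
        counts[hops] += 1
    return counts


def spld_distance(a, b):
    """Euclidean distance between two SPLDs, zero-padded to equal length."""
    size = max(len(a), len(b))
    a = a + [0] * (size - len(a))
    b = b + [0] * (size - len(b))
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_anchor_pairs(phys_adj, blue_adj, phys_candidates):
    """Pair each candidate with the blueprint node of most similar SPLD."""
    blue_splds = {node: spld(blue_adj, node) for node in blue_adj}
    pairs = []
    for candidate in phys_candidates:
        target = spld(phys_adj, candidate)
        best = min(
            sorted(blue_splds),
            key=lambda node: spld_distance(target, blue_splds[node]),
        )
        pairs.append((candidate, best))
    return pairs
```

A pair chosen this way may still be a false positive, which is why the detection step relies on voting across many anchors.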
Once the anchor node pairs have been selected, we compare the two graphs starting from these anchor node pairs and correlate malfunctions via majority voting. The algorithm for this is in Fig. 8. Specifically, given the two graphs, the anchor pairs, and the definition of maximal subgraph in line 5, for each anchor pair we search for the maximal isomorphic subgraphs of the two graphs within a given hop length from the two anchor nodes, respectively. The process to obtain such a subgraph is in line 7. We can use a binary search to accelerate the searching procedure. If we find that the k-hop subgraphs around an anchor pair are isomorphic while the (k+1)-hop subgraphs are not, we assume some miswirings happened between k hops and (k+1) hops away from the anchor, and the nodes in these two hops are suspicious. In line 9, we increase a counter for each of these nodes to represent this conclusion.
After finishing the detection from all the anchor points, we report a list to the administrator. The list contains node device IDs and the counter value of each node, ranked in descending order of the counter values. Essentially, the larger its counter value, the more likely the device is miswired. Then, the administrator goes through the list and rectifies the miswirings. The process stops when the administrator finds a node that is not actually miswired; the remaining nodes on the list are ignored.
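The voting procedure can be sketched in Python as follows. This is a simplified illustration, not the algorithm of Fig. 8: the isomorphism test is passed in as a parameter, and the default `degree_sequence_match` is only a stand-in (equal degree sequences are necessary but not sufficient for isomorphism). The binary search assumes that once the k-hop subgraphs stop matching they do not match again at larger k. All function names are hypothetical.

```python
from collections import Counter, deque


def hop_distances(adj, source):
    """Breadth-first hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def k_hop_subgraph(adj, source, k):
    """Subgraph induced by the nodes within k hops of source."""
    dist = hop_distances(adj, source)
    keep = {node for node, hops in dist.items() if hops <= k}
    return {node: [n for n in adj[node] if n in keep] for node in keep}


def degree_sequence_match(g1, g2):
    """Stand-in check only: necessary, not sufficient, for isomorphism."""
    return sorted(map(len, g1.values())) == sorted(map(len, g2.values()))


def vote_suspects(phys_adj, blue_adj, anchor_pairs, max_hops,
                  is_isomorphic=degree_sequence_match):
    """Rank physical nodes by how many anchors found them suspicious."""
    counters = Counter()
    for phys_anchor, blue_anchor in anchor_pairs:
        lo, hi = 0, max_hops  # 0-hop subgraphs (single nodes) always match
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if is_isomorphic(k_hop_subgraph(phys_adj, phys_anchor, mid),
                             k_hop_subgraph(blue_adj, blue_anchor, mid)):
                lo = mid
            else:
                hi = mid - 1
        if lo < max_hops:
            # Subgraphs match at lo hops but not at lo + 1 hops.
            for node, hops in hop_distances(phys_adj, phys_anchor).items():
                if hops in (lo, lo + 1):
                    counters[node] += 1
    return counters.most_common()  # descending counter value
```

Since each anchor pair is processed independently inside the loop, the per-anchor work could be distributed across workers and the counters summed afterward.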
The accuracy of our scheme depends on the number of anchor points selected for detection versus the number of miswirings in the network. Our experiments suggest that, with a sufficient number of anchor points, our algorithm can always find all the malfunctions (i.e., put the miswired devices on top of the output list). According to the experimental results in Section VI-D, with at most 1.5% of nodes selected as anchor points, we can detect all miswirings on the evaluated structures. To be more reliable, we can always conservatively select a larger percentage of anchor points to start our detection, and most likely we will detect all miswirings (i.e., have all of them on top of the list). This is practical because, in our malfunction detection, the calculations from different anchor points are independent of each other and thus can be performed in parallel.
After fixing the miswirings, we run the mapping algorithm again to get the device-to-logical ID mapping. Even if not all the miswirings are on the top of the list and we miss some, the mapping algorithm will detect that quickly by returning false. Then, we rerun our detection algorithm until all miswirings are detected and rectified, and the mapping algorithm finally produces the correct device-to-logical ID mapping.
D. Device Locating
Given a detected malfunctioning device, the next practical question is how to identify the location of the device given only its device ID (i.e., MAC). In fact, device locating does not have to be done by an autoconfiguration algorithm; it can also be done with some human effort. In this paper, we argue that it is a practical deployment and maintenance problem in data centers, and thus we seek a scheme to collect such location information automatically.
Our idea is to sequentially turn on the power of each rack in order to generate a record for the location information. This procedure is performed only once, and the generated record is used by the administrator to find a mapping between MAC and rack. It works as follows. 1) To power on the data center for the first time, the administrator turns on the power of server racks one by one sequentially. We require a time interval between powering each rack so we can differentiate devices in different racks. The time interval is a tradeoff: Larger values allow easier rack differentiation, while smaller values reduce boot time cost on all racks. We think by default it should be 10 s. 2) In the physical topology collection stage, when reporting the topology information to DAC manager, each device also piggybacks the boot-up time, from when it had been powered on to its first reporting. 3) When receiving such boot-up time information, DAC manager groups the devices with similar boot-up times (compared to the power on time interval between racks). 4) When DAC manager outputs a malfunctioning device, it also outputs the boot-up time for that group. Therefore, the administrator can check the rack physical position accordingly.
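The grouping in step 3) can be illustrated with a short Python sketch. It is a sketch under assumptions, not DAC manager's actual logic: devices are sorted by reported boot-up time, and a new rack group starts whenever the gap to the previous device exceeds half the power-on interval. The half-interval threshold and the function name are our own choices.

```python
def group_by_boot_time(boot_times, rack_interval=10.0):
    """Group device IDs into racks by similar boot-up time (seconds).

    Racks powered on earlier report longer boot-up times, so the first
    group returned corresponds to the first rack that was switched on.
    """
    ordered = sorted(boot_times.items(), key=lambda item: item[1], reverse=True)
    groups = []
    previous = None
    for device, elapsed in ordered:
        # A gap larger than half the interval is assumed to mean a new rack.
        if previous is None or previous - elapsed > rack_interval / 2:
            groups.append([])
        groups[-1].append(device)
        previous = elapsed
    return groups
```

With the default 10 s interval, devices whose boot-up times are within a few seconds of each other land in the same group, and the group index gives the position of the rack in the power-on sequence.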
E. Run-Time Malfunction Handling
We have discussed malfunction detection and handling, focusing on the bootstrap stage. After that, a node should cache its logical ID and neighbor information in case run-time malfunctions occur. During the run-time stage, a rebooted device may use its cached logical ID only if the ID has not timed out and its newly collected neighbor information is consistent with its cached neighbor information. However, it is possible that the device may crash and require replacement. In this case, there is no cached logical ID on the device, and it must obtain a logical ID at run-time. A newly replaced device with no cache, or a rebooted device with a timed-out logical ID cache or inconsistent cached neighbor information, will collect its neighbor information, propagate that information to DAC manager, and request a logical ID. Knowing the neighbor nodes, DAC manager can easily figure out the requested logical ID.
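The reuse rule for a rebooted device can be summarized in a few lines of Python. The cache layout (a dictionary with `expires_at` and `neighbors` fields) and the function name are hypothetical; only the decision logic follows the text.

```python
def can_reuse_cached_id(cache, current_neighbors, now):
    """Return True if a rebooted device may keep its cached logical ID."""
    if cache is None:
        return False  # replaced device: nothing was cached
    if now >= cache["expires_at"]:
        return False  # cached logical ID has timed out
    # Neighbor information must be consistent with what was cached.
    return set(cache["neighbors"]) == set(current_neighbors)


cache = {"logical_id": "10.0.1.2", "expires_at": 1000.0, "neighbors": ["sw1", "sw2"]}
reuse = can_reuse_cached_id(cache, ["sw2", "sw1"], now=500.0)
```

Whenever the function returns False, the device would fall back to reporting its neighbors to DAC manager and requesting a logical ID.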
To summarize, our malfunction detection and locating designs focus on how to quickly detect and locate various malfunctions including the most difficult miswiring cases. We note that our schemes help to identify malfunctions, but not repair them. It is our hope that the detection procedure can help administrators to fix any malfunction more rapidly during the autoconfiguration stage.
V. IMPLEMENTATION AND EXPERIMENT
In this section, we first introduce the protocols that are used to do physical topology collection and logical ID dissemination. Then, we describe our implementation of DAC.
A. Communication Protocols
To achieve reliable physical topology collection and logical ID dissemination between all devices and DAC manager, we need a communication channel over the network. We note that the classical spanning tree protocol (STP) does not fit our scenario: 1) we have a fixed root (DAC manager), so network-wide broadcast for root selection is not necessary; 2) the scale of data center networks can be hundreds of thousands of devices, making it difficult to guarantee reliability and information correctness in a network-wide broadcast. Therefore, we provide a communication channel building protocol (CBP) to set up a communication channel over a mega data center network. Moreover, we introduce two protocols, namely the physical topology collection protocol (PCP) and the logical ID dissemination protocol (LDP), to perform the topology information collection and ID dissemination over the spanning tree built by CBP.
Building Communication Channel:
In CBP, each network device sends Channel Building Messages (CBMs) periodically (with a given timeout interval) to all of its interfaces. Neighbor nodes are discovered by receiving CBMs. Each node sends its own CBMs and does not relay CBMs received from other nodes. To speed up the information propagation procedure, a node also sends out a CBM if it observes changes in neighbor information. A checking interval is introduced to reduce the number of CBM messages by limiting the minimal interval between two successive CBMs.
DAC manager sends out its CBM with its level marked as 0, and its neighbor nodes correspondingly set their levels to 1. This procedure continues until all nodes get their respective levels, representing the number of hops from that node to DAC manager. A node selects as its parent a neighbor node that has the lowest level among its neighbors, choosing randomly if there are several, and claims itself as that node's child in its next CBM. The communication channel building procedure is finished once every node has its level and has selected its parent node. Therefore, the built communication channel is essentially a layered spanning tree, rooted at DAC manager. We define a leaf node as one that has the largest level among its neighbors and no child nodes. If a leaf node observes no neighbor updates for a given timeout value, it enters the next stage, physical topology information collection.
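The outcome of CBP (levels, parents, and leaf nodes) can be reproduced centrally with a breadth-first traversal, as in the sketch below. This is only an emulation for illustration: the real protocol is distributed and message-driven, and the function name and return structure are assumptions.

```python
import random
from collections import deque


def build_channel(adj, manager, rng=None):
    """Compute levels, parents, and leaf nodes of the layered spanning tree."""
    rng = rng or random.Random()
    level = {manager: 0}
    queue = deque([manager])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in level:
                level[nbr] = level[node] + 1
                queue.append(nbr)

    parent = {}
    for node in level:
        if node == manager:
            continue
        lowest = min(level[nbr] for nbr in adj[node])
        candidates = [nbr for nbr in adj[node] if level[nbr] == lowest]
        parent[node] = rng.choice(candidates)  # random pick among the lowest

    children = {node: [] for node in level}
    for child, chosen in parent.items():
        children[chosen].append(child)

    # Leaf: no children and no neighbor with a larger level.
    leaves = sorted(
        node for node in level
        if not children[node]
        and all(level[node] >= level[nbr] for nbr in adj[node])
    )
    return level, parent, leaves
```

In a topology where a node has two equally close neighbors, either may become its parent, but the set of leaf nodes that trigger topology collection stays the same.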
Physical Topology Collection and Logical ID Dissemination: Once the communication channel has been built by CBP, the physical topology collection and logical ID dissemination over the communication channel can be performed by using PCP and LDP. Essentially, the topology collection is a bottom-up process that starts from leaf devices and blooms up to DAC manager, while the logical ID dissemination is a top-down style that initiates from DAC manager and flows down to the leaf devices.
In PCP, each node reports its node device ID and all its neighbors to its parent node. After receiving all information from its children, an intermediate node merges them (including its own neighbor information) and sends them to its parent node. This procedure continues until DAC manager receives the node and link information of the whole network, and then it constructs the physical network topology. In LDP, the procedure is the reverse of PCP. DAC manager sends the achieved device-to-logical ID mapping information to all its neighbor nodes, and each intermediate node delivers the information to its children. Since a node knows the descendants of each child via PCP, it can divide the mapping information on a per-child basis and deliver the more specific mapping information to each child. Note that the messages exchanged in both PCP and LDP are unicast messages that require acknowledgements for reliability.
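The bottom-up merge of PCP and the per-child split of LDP can be sketched with two recursive functions over the spanning tree. The data structures (a `children` dictionary for the tree, a `neighbors` dictionary for physical links) and the function names are illustrative assumptions; message exchange and acknowledgements are omitted.

```python
def collect(children, neighbors, node, descendants):
    """PCP-style merge: return all links known in the subtree of node."""
    links = {frozenset((node, nbr)) for nbr in neighbors[node]}
    descendants[node] = {node}
    for child in children.get(node, ()):
        links |= collect(children, neighbors, child, descendants)
        descendants[node] |= descendants[child]  # remembered for LDP
    return links


def disseminate(children, descendants, node, mapping, received):
    """LDP-style split: pass each child only its subtree's mapping."""
    received[node] = mapping[node]
    for child in children.get(node, ()):
        subset = {dev: mapping[dev] for dev in descendants[child]}
        disseminate(children, descendants, child, subset, received)


children = {"M": ["A", "B"], "A": ["C"]}
neighbors = {"M": ["A", "B"], "A": ["M", "C"], "B": ["M", "C"], "C": ["A", "B"]}
descendants = {}
links = collect(children, neighbors, "M", descendants)

mapping = {"M": "id0", "A": "id1", "B": "id2", "C": "id3"}
received = {}
disseminate(children, descendants, "M", mapping, received)
```

Note that the link between B and C is reported even though it is not part of the tree, since every node reports all of its neighbors.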
B. BCube Test Bed and Experiment
We designed and implemented DAC as an application over the Windows network stack. This application implements the modules described in Section II, i.e., device-to-logical ID mapping, communication channel building, physical topology collection, and logical ID dissemination. We built a test bed using 64 Dell servers and 16 8-port DLink DGS-1008D Gigabit Ethernet switches. Each server has an Intel 2-GHz dual-core CPU, 2-GB DRAM, 160-GB disk, and an Intel Pro/1000PT dual-port Ethernet NIC. Each link works at Gigabit.
The topology of our test bed is a BCube(8,1). It has two dimensions, and eight servers on each dimension connected by an 8-port Ethernet switch. Each server uses two ports of its dual-port NIC to form a BCube network. Fig. 9 illustrates the physical test-bed topology and its corresponding blueprint graph. Note that we only programmed our DAC design on servers, and we did not touch switches in this setup because these switches cannot be programmed. Thus, the blueprint graph of our test bed observed at any server should have a degree of 14 instead of 2 as there are seven neighbors for each dimension. This server-only setup is designed to demonstrate that DAC works in real-world systems, not its scalability.
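The degree of 14 can be checked with a small Python sketch that builds the server-only view of a two-dimensional BCube. We assume the usual BCube addressing, in which servers attached to the same switch differ in exactly one address digit; the function name is hypothetical.

```python
from itertools import product


def bcube1_server_graph(n):
    """Server-only adjacency of BCube(n,1): switches are made transparent."""
    servers = list(product(range(n), repeat=2))
    adj = {server: set() for server in servers}
    for a, b in product(servers, repeat=2):
        # Two servers share a switch iff exactly one address digit differs.
        if sum(x != y for x, y in zip(a, b)) == 1:
            adj[a].add(b)
    return adj


graph = bcube1_server_graph(8)
degrees = {len(neighbors) for neighbors in graph.values()}
```

For n = 8 this yields 64 servers, each with seven neighbors per dimension.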
In this setup, our DAC application is developed to automatically assign the BCube ID for all the 64 servers in the test bed. A server is selected as DAC manager by setting its level to 0. To inspect the working process of DAC, we divide DAC into five steps and check each of them: 1) CCB (communication channel building): from when DAC manager broadcasts the message with level 0 until the last node in the network gets its level; 2) timeout: there is no change in neighboring nodes for the leaf timeout value at leaf nodes; 3) TC (physical topology collection): from when the first leaf node sends out its TCM until DAC manager receives the entire network topology; 4) mapping: device-to-logical ID mapping time including the I/O time; 5) LD (logical ID dissemination): from when DAC manager sends out the mapping information until all the devices get their logical IDs. Table II shows the results with different checking-interval and timeout parameters. Note that the checking interval controls the number of CBM messages, the CBP timeout is the timeout value for CBP broadcast, and the leaf timeout is for TCM triggering. The experiments show that the total configuration time is mainly dominated by the mapping time and the leaf timeout, and that the checking interval can control and reduce the burstiness of CBM messages. In all the cases, our autoconfiguration process can be done within 300 ms.
C. Implementation Experience on Click
We have implemented our DAC protocols (CBP, PCP, and LDP) using Click software routers [26] . A Click router is a directed graph of packet processing modules called elements that implement tasks such as building a spanning tree among switches or interacting with network devices.
We extended three Click standard Ethernet elements, EtherSpanTree, Bridgemessage, and EtherSwitch, for our purpose and obtained three new elements: ExtenEtherSpanTree, ExtenBridgemessage, and ExtenEtherSwitch. The ExtenBridgemessage element defines the format of our CBM based on the Bridge Protocol Data Unit (BPDU) packet format that is already defined in the standard element Bridgemessage. For every CBM packet, the node uses the ExtenEtherSpanTree element to implement CBP. To implement PCP and LDP functionalities, the node uses the ExtenEtherSwitch element to maintain its parent and children information and to perform topology collection and logical ID dissemination as described in Section V-A. We omit further details due to space limitation. We have made our implementation code publicly available at [27]. Our experience with Click shows that DAC protocols are easy to implement based on existing Ethernet protocols and packet formats.
VI. PERFORMANCE EVALUATION
In this section, we evaluate DAC via extensive simulations. We first introduce the evaluation methodology, and then present the results.
A. Evaluation Methodology
Structures for Evaluation: We evaluate DAC via experiments on four well-known data center structures: BCube [7], FatTree [8], VL2 [9], and DCell [6]. Among these structures, BCube is the most symmetric, followed by FatTree, VL2, and DCell. DCell is the most asymmetric. All the structures can be considered as sparse graphs with different sparsity. VL2 is the sparsest, followed by FatTree, DCell, and BCube. For each of them, we vary the size as shown in Table III. Please refer to these papers for details. Since BCube is specifically designed for a modular data center (MDC) sealed in shipping containers, the number of devices in BCube should not be very large. We expect them to be in the thousands, or at most tens of thousands. For FatTree and VL2, we intentionally make their sizes to be as large as hundreds of thousands of nodes. DCell is designed for large data centers. One merit of DCell is that the number of servers in a DCell scales doubly exponentially as the level increases. For this reason, we check the performance of DAC on very large DCell graphs. For example, DCell(6,3) has more than 3.8 million nodes.

Fig. 10. Speed of mapping on BCube, FatTree, VL2, and DCell structures, compared with two baseline algorithms. The curves of the GI tool of [15] on the DCell, FatTree, and VL2 structures are omitted because its run-time on all the graphs bigger than DCell(3,3), FatTree(40), and VL2(20,100), respectively, is more than one day. A log-log scale is used to clearly show the performance on DCell.
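The doubly exponential growth of DCell, and the 3.8 million figure, can be reproduced with the standard DCell recurrence, in which a level-k DCell built from level-(k-1) DCells of t servers contains t(t+1) servers. The sketch below assumes one switch per level-0 DCell and counts nodes as servers plus switches; the function name is hypothetical.

```python
def dcell_counts(n, k):
    """Return (servers, switches) of DCell(n,k)."""
    servers = n  # a level-0 DCell has n servers on one switch
    for _ in range(k):
        servers = servers * (servers + 1)
    switches = servers // n  # one switch per level-0 DCell
    return servers, switches


servers, switches = dcell_counts(6, 3)
total = servers + switches
```

For DCell(6,3) this gives 3 263 442 servers and 543 907 switches, i.e., 3 807 349 nodes in total, consistent with the "more than 3.8 million" stated above.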
Metrics: There are three metrics in our evaluation. First, we measure the speed of our mapping algorithm on the aforementioned structures, which includes both mapping from scratch (i.e., for brand-new data centers) and mapping for incremental expansion (i.e., for data center expansion), as well as the memory overhead in the mappings. This metric is used to show how efficient our algorithm is as a device-to-logical ID mapping engine. Then, we estimate the total time DAC takes for a complete autoconfiguration process. Lacking a large test bed, we employ simulations. Lastly, we evaluate the accuracy of DAC in detecting malfunctions via simulations. All the experiments and simulations are performed on a Linux server with an Intel 2.5-GHz dual-core CPU and 8 GB of DRAM. The server runs Red Hat 4.1.2 with Linux kernel 2.6.18.
B. Efficiency of Mapping Engine
Mapping From Scratch: We study the performance of our mapping algorithm together with the seminal GI tool proposed in [15] and another algorithm proposed in the digital design automation field [16]. For the tool of [15], we use the latest version, v2.4. The algorithm of [16], by default, neither calculates a one-to-one mapping nor checks isomorphism between two graphs. Instead, it is a tool to calculate the automorphisms of a graph. We observe that when two graphs are input to it as one bigger graph, among all the output automorphisms there exists at least one that maps each node in one graph to a node in the other, provided that the two graphs are isomorphic to each other. To compare it with our algorithm, we improve the algorithm of [16] to check and calculate a one-to-one mapping between two graphs, and refer to the result as the extended version of [16]. Essentially, the tool of [15] includes candidate pruning via orbit, the extended version of [16] is built on top of it and introduces selective splitting, and our algorithm is further built on top of that and includes candidate selection via SPLD, as shown in Table IV. Fig. 10 plots the results for device-to-logical ID mapping. Note that we do not include the I/O time for reading graphs into memory. From the figure, we can see that the mapping time of our algorithm scales in proportion to the total number of devices in the network.
In Fig. 10, the run-time of the GI tool of [15] on graphs bigger than DCell(3,3), FatTree(40), and VL2(20,100) is too long (i.e., days) to fit into the figures nicely.
To better understand why our algorithm performs best, we assess the relative effectiveness of the three speedup techniques used in the algorithms on popular data center structures. We make the following three observations. First, we find that candidate pruning via orbit is very efficient for symmetric structures. For example, the tool of [15] needs only 0.07 s for BCube(4,4) with 2304 devices, whereas it requires 312 s for FatTree(20) with 2500 devices. Another example is that while it takes less than 8 s to perform the mapping for BCube(8,4) with 53 248 devices, it fails to obtain the result for either FatTree(40) with 58 500 devices or VL2(20,100) with 52 650 devices within 24 h. One factor contributing to this effect is that BCube is more symmetric than either the FatTree or the VL2 structure.
Second, our experiments suggest that selective splitting, introduced in the extended version of [16], is more efficient for sparse graphs. For example, VL2(100,100) and FatTree(100) have similar numbers of devices (roughly 250 000), but VL2 needs only 6.33 s, whereas FatTree needs 18.50 s. This is because VL2(100,100) is sparser than FatTree(100). We have checked the average node degree of these two structures. The average degree for VL2(100,100) is approximately 1.03. Compared to VL2(100,100), FatTree(100) has an average node degree of 2.86, which is more than twice as dense.
Finally, when candidate selection via SPLD is further introduced in our algorithm to work together with the above two techniques, it exhibits different performance gains on different structures. SPLD works best for asymmetric graphs. For example, compared to the extended version of [16], our algorithm, which adds the SPLD technique, improves the time from 2.97 to 1.31 s (2.27 times) for BCube(8,4), from 18.5 to 4.16 s (4.34 times) for FatTree(100), and from 6.33 to 1.07 s (5.92 times) for VL2(100,100), whereas it reduces the time from 44603 to 8.88 s (5011 times) for DCell(6,3). This is because the more asymmetric a graph is, the more likely that the SPLDs of two nodes will be different. In our case, BCube is the most symmetric structure since all the switches are interchangeable, whereas DCell is the most asymmetric one since there are only two automorphisms for a DCell.
We have also checked other combinations of the techniques, such as selective splitting, candidate pruning via orbit plus candidate selection via SPLD, and selective splitting plus candidate selection via SPLD, etc. We leave the numerical results and analysis in the Appendix. The results of all these combinations confirm the above observations: Candidate pruning via orbit is efficient for symmetric graphs, selective splitting works well for sparse graphs, and candidate selection via SPLD improves both techniques and has remarkable performance gain for asymmetric graphs such as DCell.
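As a sanity check on the sparsity comparison in the second observation, the device count of FatTree(20) and the 2.86 figure for FatTree(100) can be reproduced from the standard k-ary fat-tree formulas. The sketch assumes that FatTree(k) denotes the standard three-layer k-ary fat-tree and that the reported average degree is the number of links divided by the number of devices; both are our assumptions, not statements from the evaluation.

```python
def fat_tree_counts(k):
    """Return (devices, links) of a standard k-ary fat-tree."""
    servers = k ** 3 // 4
    switches = 5 * k ** 2 // 4  # k^2/2 edge + k^2/2 aggregation + k^2/4 core
    links = 3 * k ** 3 // 4     # server-edge, edge-aggregation, aggregation-core
    return servers + switches, links


devices, links = fat_tree_counts(100)
links_per_device = links / devices
```

For k = 100 this yields 262 500 devices and 750 000 links, about 2.86 links per device.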
Mapping for Incremental Expansion: For the evaluation of our mapping algorithm on incremental expansion, we choose one expansion scenario for each structure. Since BCube and DCell are recursively defined, we expand them by increasing the level. For FatTree and VL2, we expand them by increasing the number of servers in each rack. The results are listed in Table V. We find that all the mappings can be done efficiently. For BCube, we extend BCube(8,3) to BCube(8,4) and finish the mapping in 0.19 s. For FatTree, we expand partial FatTree(100), where each edge switch connects to 25 servers, to complete FatTree(100), where each edge switch connects to 50 servers, and take 0.47 s for mapping. For VL2, we expand VL2(50,100) to VL2(100,100) and spend 0.24 s. For DCell, we extend DCell(6,2) to DCell(6,3) and use 7.514 s. Finally, we check and verify that the mapping keeps logical IDs for old devices unmodified.
Memory Overhead of Mapping: We observe the peak memory usage during each mapping process. Fig. 11 shows the results. It contains mapping from scratch on the biggest graph of each structure in Table III and mapping for incremental expansion in Table V. Except for DCell(6,3), the mapping processes (both mapping from scratch and mapping for incremental expansion) for all other structures have very low memory usage ( 0.1 Gb). DCell(6,3) requires more memory than others because of its size, with 3.8 million vertices. However, a 1-Gb peak memory usage is still a decent outcome. Overall, the results show that the mapping process is memory-efficient.
C. Estimated Time Cost on Autoconfiguration
Recall that in Section V, we have evaluated the time cost of DAC on our BCube(8,1) test bed. In this section, we estimate this time on large data centers via simulations. We use the same parameters as in the implementation, namely the checking interval and the timeout for CBP broadcast, and set them to 10 ms and 50 ms, respectively. We estimate the time cost for each of the five phases, i.e., CCB, timeout, TC, mapping, and LD, as described in Section V. In the simulations, the device ID is a 48-b MAC address and the logical ID is set to 32 b, like an IP address. We assume all the links are 1 Gb/s and all communications use the full link speed. For each structure, we choose the smallest and largest graphs in Table III for evaluation. The results are shown in Table VI. From the table, we find that, except for DCell(6,3), the autoconfiguration can be finished in less than 10 s. We also find that for big topologies like BCube(8,4), DCell(6,3), FatTree(100), and VL2(100,100), the mapping time dominates the entire autoconfiguration time. DCell(6,3) takes the longest time, nearly 45 s, to do the mapping. While the CPU time for the mapping is only 8.88 s, the memory I/O time is 36.09 s. Here, we use more powerful Linux servers than those used in the implementation, so the mapping here is relatively faster than that in Section V.
D. Results for Malfunction Detection
Since malfunctions with degree change can be detected readily, in this section we focus on simulations of miswirings where there is no degree change. We evaluate the accuracy of our algorithm proposed in Fig. 8 in detecting such malfunctions. Our simulations are performed on all four structures. For each one, we select a moderate size with tens of thousands of devices for evaluation; specifically, they are BCube(6,4), FatTree(40), VL2(20,100), and DCell(3,3). As we know, miswirings without degree change are exceedingly rare, and every such case requires at least four miswired devices. Thus, in our simulations, we randomly create five groups of such miswirings with a total of 20 miswired nodes. In the output of our algorithm, we check how many miswired nodes we have detected versus the number (or percent) of anchor points we have selected. We say a miswired node is detected only if there is no normal node above it in the counter list. This is because the administrators will rectify the miswirings according to our list sequentially and stop once they come to a node that is not really miswired. Fig. 12 demonstrates the results. It clearly shows that the number of detected malfunctions increases with the number of selected anchor points. In our experiments on all structures, we can detect all the malfunctions with at most 1.5% of nodes selected as anchor points. Interestingly, we find the counter values of good nodes and those of bad nodes are well separated, i.e., there is a clear drop in the sorted counter value list. We also find that for different structures, we need different numbers of anchor points in order to detect all 20 miswired devices. For example, in DCell we require as many as 500 pairs of nodes as anchor points to detect all the malfunctions; in VL2, we need 350 pairs of nodes to detect them all. However, in BCube and FatTree, we only need 150 and 100 anchor points, respectively, to detect all malfunctions.
One reason for the difference is that our selected DCell and VL2 networks are larger than BCube and FatTree. Another reason is that different structures can result in different numbers of false positives in anchor pair selection. Finally, it is worth mentioning that the above malfunction detection has been done efficiently. In the worst case, we used 809.36 s to detect all the 20 malfunctioning devices in DCell from 500 anchor points. Furthermore, as mentioned before, the calculations starting from different anchor points are independent of each other and can be performed in parallel for further acceleration.
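The detection criterion used in this evaluation (a miswired node counts as detected only if no normal node is ranked above it) amounts to counting the miswired prefix of the ranked list. A minimal Python sketch, with a hypothetical function name:

```python
def count_detected(ranked_nodes, miswired):
    """Count miswired nodes ranked above the first normal node."""
    detected = 0
    for node in ranked_nodes:
        if node not in miswired:
            break  # the administrator stops at the first normal node
        detected += 1
    return detected


ranked = ["m1", "m2", "ok1", "m3"]
found = count_detected(ranked, {"m1", "m2", "m3"})
```

In this example only two of the three miswired nodes count as detected, because a normal node is ranked above the third.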
VII. RELATED WORK
In this section, we review the work related to DAC. The differences between DAC and other schemes in related areas such as Ethernet and IP networks are caused by different design goals for different scenarios.

Data Center Networking: Portland [8] is perhaps the most related work to DAC. It uses a distributed location discovery protocol (LDP) for PMAC (physical MAC) address assignment. LDP leverages the multirooted tree topology property for switches to decide their levels since only edge switches directly connect to servers. DAC differs from Portland in several aspects: 1) DAC can be applied to arbitrary topologies, whereas LDP only works for multirooted trees; 2) DAC follows a centralized design because it significantly simplifies the protocol design in distributed systems, and furthermore, data centers are operated by a single entity.
Plug-and-Play in Ethernet: Standing as one of the most widely used networking technologies, Ethernet has the beautiful property of "plug-and-play." It is essentially another notion of autoconfiguration in that each host in an Ethernet possesses a persistent MAC address and Ethernet bridges automatically learn host addresses during communication. Flat addressing simplifies the handling of topology dynamics and host mobility with no human input to reassign addresses. However, it suffers from scalability problems. Many efforts, such as [28]-[30], have been made toward a scalable bridge architecture. More recently, SEATTLE [31] proposes to distribute ARP state among switches using a one-hop DHT and makes dramatic advances toward a plug-and-play Ethernet. However, it still cannot support large data centers well since: 1) switch state grows with the number of end-hosts; 2) routing needs all-to-all broadcast; 3) forwarding loops still exist [8].
Autoconfiguration in IP Networks: Autoconfiguration protocols for traditional IP networks can be divided into stateless and stateful approaches. In stateful protocols, a central server is employed to record state information about IP addresses that have already been assigned. When a new host joins, the servers allocate a new, unused IP to the host to avoid conflict. DHCP [3] is a representative protocol for this category. Autoconfiguration in stateless approaches does not rely on a central server. A new node proposes an IP address for itself and verifies its uniqueness using a duplicate address detection procedure. For example, a node broadcasts its proposed address to the network, and if it does not receive any message showing the address has been occupied, it successfully obtains that address. Examples include IPv6 stateless address autoconfiguration protocol [32] and IETF Zeroconf protocol [33] . However, neither of them can solve the autoconfiguration problem in new data centers where addresses contain locality and topology information.
VIII. CONCLUSION
In this paper, we have designed, evaluated, and implemented DAC, a generic and automatic Data center Address Configuration system. To the best of our knowledge, this is the first work on address autoconfiguration for generic data center networks. At the core of DAC is its device-to-logical ID mapping and malfunction detection. DAC has made an innovation in abstracting the device-to-logical ID mapping to the graph isomorphism problem and has solved it with low time complexity by leveraging the sparsity and symmetry (or asymmetry) of data center structures. The DAC malfunction detection scheme is able to detect various errors, including the most difficult case where miswirings do not cause any node degree change.
Our simulation results show that DAC can accurately find all the hardest-to-detect malfunctions and can autoconfigure a large data center with 3.8 million devices in 46 s. In our implementation on a 64-server BCube test bed, DAC has used less than 300 ms to successfully autoconfigure all the servers. Our implementation experience and experiments show that DAC is a viable solution for data center network autoconfiguration.
APPENDIX
We show the numerical results for combinations of the three speedup techniques in Table VII. The results in the table are consistent with Section VI-B.
• Candidate pruning via orbit is very efficient for symmetric graphs. For example, BCube is the most symmetric of all the evaluated structures, and the technique combinations without candidate pruning via orbit all take longer for BCube.
• Selective splitting is very efficient for sparse graphs. For example, with only selective splitting, over the structures with a similar number of nodes, we have relatively better results for DCell and VL2 (because they are sparser), but relatively worse performance for BCube and Fattree (because they are denser).
• Candidate selection via SPLD generally works together with the other two techniques and can further improve the performance when added. The performance gain is more obvious for asymmetric graphs such as DCell.

In the table, we do not consider the base algorithm without any speedup technique because it is extremely slow. We also note that we introduce the technique of candidate selection via SPLD to work with the other two techniques. When using only this technique, the improvement over the baseline algorithm is quite limited for symmetric graphs such as BCube. In fact, the improvement from candidate pruning via orbit to candidate pruning via orbit plus candidate selection via SPLD for BCube is mostly around a factor of two in our experiments. This is because in very symmetric graphs, many devices have the same SPLDs. We therefore do not further explore SPLD as a standalone speedup technique.
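The observation that many devices share the same SPLD in symmetric graphs can be checked on small examples. The sketch below assumes that SPLD denotes the shortest path length distribution of a node, i.e., the number of other nodes at each hop distance, as introduced earlier in the paper. The function names and the two toy graphs are hypothetical and only illustrative; this is not the DAC implementation.

```python
from collections import Counter, deque


def spld(adj, source):
    """Return the shortest path length distribution of `source`.

    The result is a sorted tuple of (distance, node_count) pairs,
    computed with a breadth-first search over an unweighted graph.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    counts = Counter(d for node, d in dist.items() if node != source)
    return tuple(sorted(counts.items()))


def group_by_spld(adj):
    """Group nodes that cannot be told apart by their SPLD alone."""
    groups = {}
    for node in adj:
        groups.setdefault(spld(adj, node), []).append(node)
    return groups


# A 6-node ring is fully symmetric; a 4-node path is less so.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

print(len(group_by_spld(ring)))  # all ring nodes fall into one SPLD group
print(len(group_by_spld(path)))  # end nodes and inner nodes form separate groups
```

In the ring, every node has the same SPLD, so the SPLD alone cannot narrow the set of mapping candidates. In the path, the end nodes and inner nodes are separated, which mirrors why the technique helps more on asymmetric structures.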
