Hierarchical clustering and summarization of network traffic data

Author: Mahmood Abdun Naser
Publication date: 01/01/2008

© 2008 Dr. Abdun Naser Mahmood

An important task in managing IP networks is understanding the different types of traffic that are utilizing a network, based on a given trace of the packets or flows in the network. One of the key challenges in this task is the volume and complexity of the data that is available in traffic traces. What is needed by network managers in this context is a concise report of the significant traffic patterns that are present in the network. In this thesis, we address the problem of how to generate a succinct traffic report that contains a set of aggregated traffic flows, such that each aggregate flow corresponds to a significant traffic pattern in the network. We view the problem of generating a report of the significant traffic patterns in a network as a form of clustering problem. In particular, some distance-based hierarchical clustering techniques have advantages in terms of scalability when analyzing the types of large traffic traces that arise in this context. However, there are several important problems that need to be addressed before we can effectively use these types of clustering techniques on network traffic traces.

The first research problem we address is how to handle non-numeric attributes that appear in network traffic data, such as attributes with a categorical or hierarchical structure. We have proposed a hierarchical similarity measure that is suitable for comparing hierarchical attributes in network traffic data. We have then developed a one-pass, hierarchical clustering scheme that can exploit the structure of hierarchical attributes in combination with categorical and numerical attributes. We demonstrate that our clustering scheme achieves significant improvements in both accuracy and execution time on a standard benchmark dataset, compared to an existing approach based on frequent itemset clustering.
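As an illustration of what a similarity measure over a hierarchical attribute can look like, the sketch below treats an IPv4 address as a path in a binary prefix tree and measures distance by the length of the shared prefix. This is an assumption made for illustration only: the abstract does not specify the thesis's actual measure, and the names `common_prefix_len` and `hierarchical_distance` are hypothetical.

```python
import ipaddress


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits shared by two IPv4 addresses (0 to 32)."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    # bit_length() gives the position of the highest differing bit.
    return 32 - diff.bit_length()


def hierarchical_distance(a: str, b: str) -> float:
    """Illustrative distance in [0, 1]: 0 for identical addresses,
    1 when the addresses differ in the very first bit."""
    return 1.0 - common_prefix_len(a, b) / 32


if __name__ == "__main__":
    print(hierarchical_distance("10.0.0.0", "10.0.1.0"))
    print(hierarchical_distance("10.0.0.0", "192.168.0.1"))
```

Under this assumed measure, two hosts in the same subnet are closer than two hosts in unrelated networks, which is the kind of structure a flat categorical comparison (equal or not equal) would discard.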
The second research problem we address is how to improve the scalability of our hierarchical clustering scheme when computing resources are limited. We propose an adaptive, two-stage sampling technique, which controls the rate at which records from frequently seen patterns are received by our clustering scheme. This enables more computational resources to be allocated to clustering new or unusual traffic patterns. We demonstrate that our two-stage sampling technique can identify less frequent traffic patterns with greater accuracy than when traditional systematic sampling is used.

The third research problem we address is how to generate a concise yet accurate summary report from the results of our hierarchical clustering. We present two approaches to summarization, based on the size and the homogeneity of the clusters in the hierarchical cluster tree. We demonstrate that these approaches to summarization can substantially reduce the final report size with little impact on the accuracy of the report.
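To make the idea of size-based summarization of a cluster tree concrete, here is a minimal sketch. It assumes each tree node carries a record count and reports only the deepest nodes that still meet a minimum size. The `Node` class, the `summarize` function, and the `min_count` threshold are hypothetical names, and the thesis's actual criteria (including the homogeneity-based approach) are not given in this abstract.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    label: str                       # description of the aggregate flow
    count: int                       # records covered by this cluster
    children: List["Node"] = field(default_factory=list)


def summarize(node: Node, min_count: int) -> List[Tuple[str, int]]:
    """Report the deepest clusters whose size is at least min_count."""
    big_children = [c for c in node.children if c.count >= min_count]
    report: List[Tuple[str, int]] = []
    for child in big_children:
        report.extend(summarize(child, min_count))
    # Records not covered by a large child remain attributed to this node.
    remainder = node.count - sum(c.count for c in big_children)
    if not big_children or remainder >= min_count:
        report.append((node.label, remainder))
    return report


if __name__ == "__main__":
    tree = Node("all traffic", 100, [
        Node("web", 60, [Node("web to server X", 55), Node("web other", 5)]),
        Node("mail", 30),
        Node("misc", 10),
    ])
    for label, size in summarize(tree, min_count=20):
        print(label, size)
```

In this toy tree the report shrinks from six clusters to two entries while still covering 85 of the 100 records, which illustrates the trade-off between report size and accuracy that the abstract describes.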

University of Melbourne Institutional Repository