
    Self-Learning Classifier for Internet traffic

    Network visibility is a critical part of traffic engineering, network management, and security. Recently, unsupervised algorithms have been envisioned as a viable alternative to automatically identify classes of traffic. However, the accuracy achieved so far does not allow their use for traffic classification in practical scenarios. In this paper, we propose SeLeCT, a Self-Learning Classifier for Internet traffic. It uses unsupervised algorithms along with an adaptive learning approach to automatically let classes of traffic emerge, be identified, and be (easily) labeled. SeLeCT automatically groups flows into pure (or homogeneous) clusters by alternating simple clustering and filtering phases to remove outliers. SeLeCT uses an adaptive learning approach to boost its ability to spot new protocols and applications. Finally, SeLeCT also simplifies label assignment (which is still based on some manual intervention) so that proper class labels can be easily discovered. We evaluate the performance of SeLeCT using traffic traces collected in different years from various ISPs located on 3 different continents. Our experiments show that SeLeCT achieves overall accuracy close to 98%. Unlike state-of-the-art classifiers, the biggest advantage of SeLeCT is its ability to help discover new protocols and applications in an almost automated fashion.

    Self-learning classifier for internet traffic

    A method for classifying network traffic, including (1) processing a first working set portion of a flow batch for a first iteration by dividing the first working set portion into clusters and filtering a cluster by (i) identifying a first server port as the most frequently occurring compared to all other server ports in the cluster, (ii) in response to determining that a first frequency of occurrence of the first server port in the cluster exceeds a pre-determined threshold: (a) identifying the cluster as a dominatedPort cluster, (b) removing the cluster from the first working set portion to generate a remainder as a second working set portion, and (c) removing, from the cluster to be added to the second working set portion, one or more flows having a different server port than the first server port, and (2) processing the second working set portion for a second iteration.
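    The dominated-port filtering step in the claim above can be sketched as follows. This is a minimal, illustrative reading of the claim, not the patented implementation: the flow-record shape, the threshold value, and the behavior when no port dominates are all assumptions.

    ```python
    from collections import Counter

    # Assumed: a flow is a dict with a "server_port" key; the dominance
    # threshold is a free parameter in the claim, 0.5 is an arbitrary choice.
    DOMINANCE_THRESHOLD = 0.5

    def filter_dominated_port(cluster):
        """Split a cluster per the claim: if one server port dominates,
        keep only flows on that port and return the rest so they can be
        added to the second working set portion."""
        ports = Counter(flow["server_port"] for flow in cluster)
        top_port, count = ports.most_common(1)[0]
        if count / len(cluster) > DOMINANCE_THRESHOLD:
            kept = [f for f in cluster if f["server_port"] == top_port]
            leftover = [f for f in cluster if f["server_port"] != top_port]
            return kept, leftover  # cluster identified as "dominatedPort"
        # No dominant port: recycling the whole cluster is an assumption,
        # the claim does not spell out this branch.
        return [], list(cluster)

    cluster = [{"server_port": 443}] * 8 + [{"server_port": 80}] * 2
    kept, leftover = filter_dominated_port(cluster)
    ```

    Here 443 accounts for 80% of the cluster, so the cluster is kept as a dominatedPort cluster and the two port-80 flows are deferred to the next iteration.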

    DNS to the rescue: Discerning Content and Services in a Tangled Web

    A careful perusal of the Internet evolution reveals two major trends - the explosion of cloud-based services and video streaming applications. In both of the above cases, the owner (e.g., CNN, YouTube, or Zynga) of the content and the organization serving it (e.g., Akamai, Limelight, or Amazon EC2) are decoupled, thus making it harder to understand the association between the content, the owner, and the host where the content resides. This has created a tangled world wide web that is very hard to unwind, impairing ISPs' and network administrators' capabilities to control the traffic flowing on the network. In this paper, we present DN-Hunter, a system that leverages the information provided by DNS traffic to discern the tangle. Parsing through DNS queries, DN-Hunter tags traffic flows with the associated domain name. This association has several applications and reveals a large amount of useful information: (i) it provides fine-grained traffic visibility even when the traffic is encrypted (i.e., TLS/SSL flows), thus enabling more effective policy controls; (ii) it identifies flows even before the flows begin, thus providing superior network management capabilities to administrators; (iii) it understands and tracks (over time) the different CDNs and cloud providers that host content for a particular resource; (iv) it discerns all the services/content hosted by a given CDN or cloud provider in a particular geography and time; and (v) it provides insights into all applications/services running on any given layer-4 port number. We conduct extensive experimental analysis on real traffic traces, ranging from FTTH to 4G ISPs, and show results that support our hypothesis. Simply put, the information provided by DNS traffic is one of the key components required to unveil the tangled web, and to bring the capability of controlling the traffic back to the network carrier.
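    The core tagging idea can be sketched in a few lines: cache the IPs returned in DNS answers per client, then label later flows to those IPs with the queried domain. Class and method names here are illustrative, not DN-Hunter's actual interface.

    ```python
    # Minimal sketch of DNS-based flow tagging: because the DNS exchange
    # precedes the connection, a flow can be labeled before its first data
    # packet, and the label survives TLS/SSL encryption.

    class DnsTagger:
        def __init__(self):
            # cache of (client IP, answer IP) -> queried domain; keyed per
            # client because one server IP may serve many domains (CDNs)
            self._ip_to_domain = {}

        def observe_dns_answer(self, client_ip, domain, answer_ips):
            for ip in answer_ips:
                self._ip_to_domain[(client_ip, ip)] = domain

        def tag_flow(self, client_ip, server_ip):
            return self._ip_to_domain.get((client_ip, server_ip), "unknown")

    tagger = DnsTagger()
    tagger.observe_dns_answer("10.0.0.5", "cdn.example.com", ["93.184.216.34"])
    label = tagger.tag_flow("10.0.0.5", "93.184.216.34")  # "cdn.example.com"
    ```

    A real system would also handle cache expiry (TTLs) and CNAME chains; those are omitted here for brevity.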

    Automatic parsing of binary-based application protocols using network traffic

    A method for analyzing a binary-based application protocol of a network. The method includes obtaining conversations from the network, extracting the content of a candidate field from a message in each conversation, calculating a randomness measure of the content to represent a level of randomness of the content across all conversations, calculating a correlation measure of the content to represent a level of correlation, across all conversations, between the content and an attribute of the corresponding conversation where the message containing the candidate field is located, and selecting, based on the randomness measure and the correlation measure, and using a pre-determined field selection criterion, the candidate offset from a set of candidate offsets as the offset defined by the protocol.
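    The two measures in the claim could look like the following sketch. The patent does not specify the actual formulas, so normalized Shannon entropy for randomness and a simple match fraction for correlation are stand-in assumptions.

    ```python
    import math
    from collections import Counter

    def randomness(values):
        """Shannon entropy of the candidate-field contents, normalized by
        the maximum for the observed alphabet: near 1 means the field looks
        random (e.g., a nonce), near 0 means it is nearly constant (e.g.,
        a magic number or message-type code)."""
        counts = Counter(values)
        n = len(values)
        h = -sum(c / n * math.log2(c / n) for c in counts.values())
        max_h = math.log2(len(counts)) if len(counts) > 1 else 1.0
        return h / max_h

    def correlation(values, attributes):
        """Fraction of conversations where the field content equals the
        paired conversation attribute (e.g., payload length) - a toy
        stand-in for the patent's unspecified correlation measure."""
        matches = sum(v == a for v, a in zip(values, attributes))
        return matches / len(values)
    ```

    A field with low randomness is a candidate keyword field; a field that correlates strongly with, say, message length is a candidate length field. The pre-determined selection criterion would then pick the offset that scores best.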

    ABSTRACT Communication-Efficient Distributed Monitoring of Thresholded Counts

    Monitoring is an issue of primary concern in current and next generation networked systems. For example, the objective of sensor networks is to monitor their surroundings for a variety of different applications like atmospheric conditions, wildlife behavior, and troop movements among others. Similarly, monitoring in data networks is critical not only for accounting and management, but also for detecting anomalies and attacks. Such monitoring applications are inherently continuous and distributed, and must be designed to minimize the communication overhead that they introduce. In this context we introduce and study a fundamental class of problems called “thresholded counts”, where we must return the aggregate frequency count of an event that is continuously monitored by distributed nodes with a user-specified accuracy whenever the actual count exceeds a given threshold value. In this paper we propose to address the problem of thresholded counts by setting local thresholds at each monitoring node and initiating communication only when the locally observed data exceeds these local thresholds. We explore algorithms in two categories: static thresholds and adaptive thresholds. In the static case, we consider thresholds based on a linear combination of two alternate strategies, and show that there exists an optimal blend of the two strategies that results in minimum communication overhead. We further show that this optimal blend can be found using a steepest descent search. In the adaptive case, we propose algorithms that adjust the local thresholds based on the observed distributions of updated information in the distributed monitoring system. We use extensive simulations not only to verify the accuracy of our algorithms and validate our theoretical results, but also to evaluate the performance of the two approaches. We find that both approaches yield significant savings over the naive approach of performing processing at a centralized location.
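    The local-threshold idea can be sketched with the simplest static policy: split the global threshold T evenly across N nodes, and have a node contact the coordinator only when its local count crosses another multiple of its slice. This is a minimal sketch of the communication pattern, not the paper's blended or adaptive strategies.

    ```python
    # Sketch: N nodes share global threshold T; each gets slice T/N and
    # reports only on crossing the next multiple of its slice, so most
    # observations cost no communication at all.

    class Coordinator:
        def __init__(self, global_threshold, n_nodes):
            self.global_threshold = global_threshold
            self.local_threshold = global_threshold / n_nodes
            self.reported = {}      # latest count reported by each node
            self.messages = 0       # communication cost incurred so far

        def update(self, node, count):
            self.messages += 1
            self.reported[id(node)] = count

        def alarm(self):
            # conservative: fire only when known counts already exceed T
            return sum(self.reported.values()) >= self.global_threshold

    class Node:
        def __init__(self, coordinator, local_threshold):
            self.coordinator = coordinator
            self.local_threshold = local_threshold
            self.count = 0
            self.last_reported = 0

        def observe(self, k=1):
            self.count += k
            if self.count - self.last_reported >= self.local_threshold:
                self.coordinator.update(self, self.count)
                self.last_reported = self.count

    coord = Coordinator(global_threshold=100, n_nodes=4)
    nodes = [Node(coord, coord.local_threshold) for _ in range(4)]
    for node in nodes:
        for _ in range(30):
            node.observe()
    ```

    With 120 observations across 4 nodes, only 4 messages are sent (one per local-threshold crossing) versus 120 under the naive centralized scheme, and the coordinator still detects that the global count has crossed 100.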

    SeLeCT: Self-Learning Classifier for Internet Traffic

    Network visibility is a critical part of traffic engineering, network management, and security. The most popular current solutions - Deep Packet Inspection (DPI) and statistical classification - deeply rely on the availability of a training set. Besides the cumbersome need to regularly update the signatures, their visibility is limited to the classes the classifier has been trained for. Unsupervised algorithms have been envisioned as a viable alternative to automatically identify classes of traffic. However, the accuracy achieved so far does not allow their use for traffic classification in practical scenarios. To address the above issues, we propose SeLeCT, a Self-Learning Classifier for Internet Traffic. It uses unsupervised algorithms along with an adaptive seeding approach to automatically let classes of traffic emerge, be identified, and be labeled. Unlike traditional classifiers, it requires neither a priori knowledge of signatures nor a training set to extract the signatures. Instead, SeLeCT automatically groups flows into pure (or homogeneous) clusters using simple statistical features. SeLeCT simplifies label assignment (which is still based on some manual intervention) so that proper class labels can be easily discovered. Furthermore, SeLeCT uses an iterative seeding approach to boost its ability to cope with new protocols and applications. We evaluate the performance of SeLeCT using traffic traces collected in different years from various ISPs located on 3 different continents. Our experiments show that SeLeCT achieves excellent precision and recall, with overall accuracy close to 98%. Unlike state-of-the-art classifiers, the biggest advantage of SeLeCT is its ability to discover new protocols and applications in an almost automated fashion.

    Profiling users in a 3g network using hourglass co-clustering

    With the widespread popularity of smart phones, more and more users are accessing the Internet on the go. Understanding mobile user browsing behavior is of great significance for several reasons. For example, it can help cellular (data) service providers (CSPs) improve service performance, thus increasing user satisfaction. It can also provide valuable insights about how to enhance the mobile user experience by providing dynamic content personalization and recommendation, or location-aware services. In this paper, we try to understand mobile user browsing behavior by investigating whether there exist distinct “behavior patterns” among mobile users. Our study is based on real mobile network data collected from a large 3G CSP in North America. We formulate this user behavior profiling problem as a co-clustering problem, i.e., we group both users (who share similar browsing behavior) and browsing profiles (of like-minded users) simultaneously. We propose and develop a scalable co-clustering methodology, Phantom, using a novel hourglass model. The proposed hourglass model first reduces the dimensions of the input data and performs divisive hierarchical co-clustering on the lower-dimensional data; it then carries out an expansion step that restores the original dimensions. Applying Phantom to the mobile network data, we find that there exist a number of prevalent and distinct behavior patterns that persist over time, suggesting that user browsing behavior in 3G cellular networks can be captured using a small number of co-clusters. For instance, the behavior of most users can be classified as either homogeneous (users with a very limited set of browsing interests) or heterogeneous (users with very diverse browsing interests), and such behavior profiles do not change significantly at either short (30-min) or long (6-hour) time scales.
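    The homogeneous/heterogeneous distinction reported above can be illustrated with a toy diversity score: the entropy of a user's browsing-category distribution. This is only a stand-in for reading off Phantom's co-clusters, not the hourglass algorithm itself; the category names and the 1-bit threshold are assumptions.

    ```python
    import math
    from collections import Counter

    def browsing_entropy(visits):
        """Shannon entropy (bits) of a user's category-visit distribution:
        0 for a single-interest user, higher for diverse interests."""
        counts = Counter(visits)
        n = len(visits)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def classify(visits, threshold=1.0):
        """Label a user homogeneous (narrow interests) or heterogeneous
        (diverse interests); the 1-bit cutoff is an arbitrary choice."""
        return "homogeneous" if browsing_entropy(visits) < threshold else "heterogeneous"

    narrow = classify(["news"] * 10)                              # homogeneous
    broad = classify(["news", "video", "social", "games"] * 3)    # heterogeneous
    ```

    Checking stability would amount to recomputing such profiles over 30-minute and 6-hour windows and comparing the resulting labels.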