5 research outputs found

    Probabilistic graphical models for semi-supervised traffic classification

    Traffic classification using machine learning continues to be an active research area. The majority of work in this area uses off-the-shelf machine learning tools and treats them as black-box classifiers. This approach turns all the modelling complexity into a feature selection problem. In this paper, we build a problem-specific solution to the traffic classification problem by designing a custom probabilistic graphical model. Graphical models are a modular framework to design classifiers which incorporate domain-specific knowledge. More specifically, our solution introduces semi-supervised learning which means we learn from both labelled and unlabelled traffic flows. We show that our solution performs competitively compared to previous approaches while using less data and simpler features. Copyright © 2010 ACM

    Automatic network traffic classification

    The thesis addresses a number of critical problems in fully automating network traffic classification and protocol identification. Several effective solutions based on statistical analysis and machine learning techniques are proposed, which significantly reduce the need for human intervention in network traffic classification systems.

    What's in a Name? Intelligent Classification and Identification of Online Media Content

    The sheer amount of content on the Internet poses a number of challenges for content providers and users alike. The providers want to classify and identify user downloads for market research, advertising and legal purposes. From the user’s perspective it is increasingly difficult to find interesting content online, hence content personalisation and media recommendation are expected by the public. An especially important (and also technically challenging) case is when a downloadable item has no supporting description or meta-data, as in the case of (normally illegal) torrent downloads, which comprise 10 to 30 percent of global traffic depending on the region. In this case, apart from its size, we have to rely entirely on the filename, which is often deliberately obfuscated, to identify or classify what the file really is. The Hollywood movie industry is sufficiently motivated by this problem that it has invested in significant research, through its company MovieLabs, to help understand more precisely what material is being illegally downloaded in order both to combat piracy and to exploit the extraordinary opportunities for future sales and marketing. This thesis was inspired, and partly supported, by MovieLabs, who recognised the limitations of their current purely data-driven algorithmic approach. The research hypothesis is that, by extending state-of-the-art information retrieval (IR) algorithms and by developing an underlying causal Bayesian Network (BN) incorporating expert judgment and data, it is possible to improve on the accuracy of MovieLabs’s benchmark algorithm for identifying and classifying torrent names. In addition to identification and standard classification (such as whether the file is Movie, Soundtrack, Book, etc.) we consider the crucial orthogonal classifications of pornography and malware.
The work in the thesis provides a number of novel extensions to the generic problem of classifying and personalising internet content based on minimal data and on validating the results in the absence of a genuine ‘oracle’. The system developed in the thesis (called Toran) is extensively validated using a sample of torrents classified by a panel of 3 human experts and the MovieLabs system, divided into knowledge and validation sets of 2,500 and 479 records respectively. In the absence of an automated classification oracle, we established manually the true classification for the test set of 121 records in order to be able to compare Toran, the human panel (HP) and the MovieLabs system (MVL). The results show that Toran performs better than MVL for the key medium categories that contain most items, such as music, software, movies, TV and other videos. Toran also has the ability to assess the risk of fakes and malware prior to download, and is on par with, or even surpasses, human experts in this capability. Thanks go to EPSRC for funding and to Queen Mary University of London for making this project possible. This work was also supported in part by European Research Council Advanced Grant ERC-2013-AdG339182-BAYES_KNOWLEDGE (April 2015-Dec 2015).