Hierarchical TCP network traffic classification with adaptive optimisation
Nowadays, with the increasing deployment of modern packet-switching networks, traffic classification plays an important role in network administration. Identifying the kinds of traffic transmitted across a network can improve network management in various ways, such as traffic shaping, differentiated services and enhanced security. By applying different policies to different kinds of traffic, Quality of Service (QoS) can be achieved at a granularity as fine as the flow level. Moreover, because illegal traffic can be identified and filtered, advanced traffic classification also enhances network security.
There are various traditional techniques for traffic classification. However, some cannot handle traffic generated by applications using non-registered or forged ports, some cannot deal with encrypted traffic, and some require too many computational resources. A technique recently proposed by other researchers, based on statistical methods, offers an alternative approach: it requires fewer resources, does not rely on port numbers and can deal with encrypted traffic. Nevertheless, the performance of classification using statistical methods can be further improved.
In this thesis, we aim to optimise network traffic classification based on the statistical approach. Because of the popularity of the TCP protocol, and the difficulties that TCP traffic controls introduce for classification, our work focuses on classifying network traffic carried over TCP. An architecture is proposed for improving classification performance in terms of accuracy and response time. Experiments have been carried out and their results evaluated to demonstrate the improved performance of the proposed optimised classifier.
In our work, network packets are reassembled into TCP flows. The statistical characteristics of each flow are then extracted. Finally, the class of an input flow is determined by comparing it with profiled samples. Instead of using only one algorithm for classifying all traffic flows, our proposed system employs a series of binary classifiers, which use optimised algorithms to detect different traffic classes separately. A decision-making mechanism deals with conflicting results from the binary classifiers. Machine learning algorithms including k-nearest neighbour, decision trees and artificial neural networks have been considered, together with a non-parametric statistical method, the Kolmogorov-Smirnov test. Besides the algorithms, some parameters are also optimised locally, such as detection windows and acceptance thresholds. This hierarchical architecture gives the traffic classifier more flexibility, higher accuracy and lower response time.
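The pipeline above can be illustrated with a minimal sketch: a two-sample Kolmogorov-Smirnov statistic compares a flow's feature distribution against per-class profiles, one binary classifier per class accepts a flow whose distance falls below its acceptance threshold, and a simple decision mechanism resolves conflicts by picking the closest accepting class. The class names, the packet-size feature and the threshold value here are illustrative assumptions, not the thesis's actual configuration.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        while i < na and a[i] <= x:
            i += 1
        while j < nb and b[j] <= x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d

class BinaryKSClassifier:
    """One detector per traffic class: accepts a flow whose feature
    distribution lies within `threshold` KS-distance of the profile."""
    def __init__(self, label, profile, threshold=0.3):  # threshold is a guess
        self.label = label
        self.profile = sorted(profile)
        self.threshold = threshold

    def distance(self, flow_feature):
        return ks_statistic(self.profile, flow_feature)

    def accepts(self, flow_feature):
        return self.distance(flow_feature) <= self.threshold

def classify(flow_feature, classifiers):
    """Decision mechanism: if several binary classifiers accept,
    choose the class with the smallest KS distance."""
    accepted = [(c.distance(flow_feature), c.label)
                for c in classifiers if c.accepts(flow_feature)]
    return min(accepted)[1] if accepted else "unknown"

# Hypothetical per-class profiles of packet sizes in bytes
classifiers = [
    BinaryKSClassifier("web", list(range(100, 200))),
    BinaryKSClassifier("bulk", list(range(1400, 1500))),
]
print(classify(list(range(110, 210)), classifiers))  # → web
```

Because each binary classifier can use its own algorithm, window and threshold, tuning one class's detector does not disturb the others, which is the flexibility the hierarchical architecture aims for.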
Source code authorship attribution
To attribute authorship means to identify the true author among many candidates for samples of work of unknown or contentious authorship. Authorship attribution is a prolific research area for natural language, but much less so for source code, with eight other research groups having published empirical results concerning the accuracy of their approaches to date. Authorship attribution of source code is the focus of this thesis.

We first review, reimplement, and benchmark all existing published methods to establish a consistent set of accuracy scores. This is done using four newly constructed and significant source code collections comprising samples from academic sources, freelance sources, and multiple programming languages. The collections developed are the most comprehensive to date in the field.

We then propose a novel information retrieval method for source code authorship attribution. In this method, source code features from the collection samples are tokenised, converted into n-grams, and indexed for stylistic comparison to query samples using the Okapi BM25 similarity measure. Authorship of the top ranked sample is used to classify authorship of each query, and the proportion of times that this is correct determines overall accuracy. The results show that this approach is more accurate than the best approach from the previous work for three of the four collections.

The accuracy of the new method is then explored in the context of author style evolving over time, by experimenting with a collection of student programming assignments that spans three semesters with established relative timestamps. We find that it takes one full semester for individual coding styles to stabilise, which is essential knowledge for ongoing authorship attribution studies and quality control in general. We conclude the research by extending both the new information retrieval method and previous methods to provide a complete set of benchmarks for advancing the field.
In the final evaluation, we show that the n-gram approaches lead the field, with accuracy scores for some collections around 90% on a one-in-ten classification problem.
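The retrieval method described above can be approximated in a short self-contained sketch: each candidate author's sample is tokenised into character n-grams and indexed, a query is scored against every sample with the Okapi BM25 formula, and authorship of the top-ranked sample is returned. The tokenisation scheme, the n-gram size of 3 and the two toy "authors" are illustrative assumptions; the thesis's actual feature extraction and collections are far richer.

```python
import math
import re
from collections import Counter

def ngrams(source, n=3):
    """Character n-grams over a lightly normalised token stream
    (this exact normalisation is an illustrative assumption)."""
    tokens = re.findall(r"\w+|[^\w\s]", source)
    stream = " ".join(tokens)
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]

class BM25Index:
    """One indexed sample per candidate author; a query is attributed
    to the author of the top-ranked sample under Okapi BM25."""
    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.docs = {name: Counter(ngrams(text)) for name, text in docs.items()}
        self.lens = {name: sum(c.values()) for name, c in self.docs.items()}
        self.avgdl = sum(self.lens.values()) / len(self.lens)
        self.N = len(self.docs)
        self.df = Counter()  # document frequency of each n-gram
        for grams in self.docs.values():
            self.df.update(grams.keys())

    def score(self, name, query_grams):
        doc, dl = self.docs[name], self.lens[name]
        s = 0.0
        for t in set(query_grams):
            f = doc.get(t, 0)
            if not f:
                continue
            idf = math.log((self.N - self.df[t] + 0.5) / (self.df[t] + 0.5) + 1)
            s += idf * f * (self.k1 + 1) / (
                f + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
        return s

    def attribute(self, query_source):
        grams = ngrams(query_source)
        return max(self.docs, key=lambda name: self.score(name, grams))

# Two toy authors with distinct surface styles (illustrative only)
idx = BM25Index({
    "alice": "for(int i=0;i<n;i++){total+=values[i];}",
    "bob": "for item in items:\n    running_sum = running_sum + item",
})
print(idx.attribute("for(int j=0;j<m;j++){count+=items[j];}"))  # → alice
```

Even with different identifier names, the query shares far more discriminative n-grams (punctuation and operator patterns) with the first sample, so BM25 ranks that author on top; this style-over-content behaviour is what makes n-gram indexing suitable for attribution.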