Streaming Network Traffic Analysis Using Active Learning
The aim of this thesis is to evaluate the performance of different budgeting strategies, as well as an Adaptive Neural Network, in analyzing streaming network traffic, specifically for the purpose of detecting malicious or botnet activity. In previous work, researchers have generally measured classification performance by the overall accuracy of their strategy. However, overall accuracy alone is not necessarily the most informative measure. Thus, in addition to accuracy, performance is measured by detection rate, prequential accuracy, and prequential detection rate. The detection rate of a strategy provides a performance metric that is not biased by class distribution. Prequential accuracy and prequential detection rate offer additional insight by showing how accuracy and detection rate change throughout the network stream. In a real-life scenario, network traffic is unending and constantly streamed, resulting in large datasets that require substantial resources to train a classifier on. Budgeting strategies have therefore been developed that select only a small portion of the data instances on which to train. In this thesis, five budgeting strategies are evaluated: Random, Fixed Uncertainty, Variable Uncertainty, Random Variable Uncertainty, and Select Sampling. Performance of the budgeting strategies is measured at budgets of 10% and 100%. These strategies are tested in conjunction with two different classifiers: Naive Bayes and Hoeffding Tree. In addition to the budgeting strategies, an Adaptive Neural Network strategy is also evaluated. The strategies are applied to six different streaming network traffic datasets that include different malicious or botnet activity.
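As a rough illustration of the prequential metrics described above, the following minimal sketch computes accuracy and detection rate over a sliding window of the most recent predictions. It rests on several assumptions that the abstract does not state: that detection rate means the fraction of malicious instances correctly flagged, that a fixed-size sliding window is used rather than, say, a fading factor, and that each prediction was made before the instance's label was shown to the classifier. The function name `prequential_metrics` and its parameters are hypothetical.

```python
from collections import deque


def prequential_metrics(pairs, window=1000, positive=1):
    """Return a list of (accuracy, detection_rate) tuples, one per instance.

    pairs    -- iterable of (true_label, predicted_label), in stream order
    window   -- number of most recent instances to evaluate over (assumed)
    positive -- label treated as the malicious class (assumed)
    """
    recent = deque(maxlen=window)  # old pairs drop out automatically
    history = []
    for y_true, y_pred in pairs:
        recent.append((y_true, y_pred))
        correct = sum(1 for t, p in recent if t == p)
        malicious = [(t, p) for t, p in recent if t == positive]
        detected = sum(1 for t, p in malicious if p == positive)
        accuracy = correct / len(recent)
        # Detection rate is undefined while the window holds no malicious instance.
        detection_rate = detected / len(malicious) if malicious else None
        history.append((accuracy, detection_rate))
    return history
```

Plotting the returned values against the instance index would show how both metrics change along the stream, which a single overall accuracy figure cannot show.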
The results demonstrate that all of the budgeting strategies, with the exception of Fixed Uncertainty, are suitable candidates for classifying streaming network traffic, with some of the state-of-the-art classifiers achieving accuracies of 90% or higher. Furthermore, limiting the labeling budget to 10% does not negatively affect performance, so its use is recommended to save computing resources.
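To make the idea of a labeling budget concrete, here is a minimal sketch of one way an uncertainty-based strategy with a variable threshold might decide which instances to label. It is not taken from the thesis. It assumes a commonly described formulation in which a label is requested only while the fraction of labeled instances is below the budget and the classifier's confidence falls below a threshold, and in which that threshold shrinks after each request and grows otherwise. The function name, the `step` parameter, and its default value are hypothetical.

```python
def variable_uncertainty(confidences, budget=0.1, step=0.01):
    """Return the indices of instances whose labels would be requested.

    confidences -- classifier confidence (max posterior) per instance, in stream order
    budget      -- maximum fraction of instances that may be labeled
    step        -- relative threshold adjustment (assumed value)
    """
    threshold = 1.0
    labeled = 0
    queried = []
    for seen, confidence in enumerate(confidences, start=1):
        # Only consider labeling while spending so far is under budget.
        if labeled / seen < budget:
            if confidence < threshold:
                labeled += 1
                queried.append(seen - 1)
                threshold *= 1 - step  # become more selective
            else:
                threshold *= 1 + step  # become less selective
    return queried


if __name__ == "__main__":
    # With uniformly uncertain predictions, a 10% budget labels every tenth instance.
    print(len(variable_uncertainty([0.6] * 100, budget=0.1)))
```

Under this sketch a budget of 100% never blocks a request, so the threshold alone decides which instances are labeled.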