Stream Genetic Programming for Botnet Detection
Algorithms for constructing classification models in streaming data scenarios are attracting more attention in the era of artificial intelligence and machine learning for data analysis. The huge volumes of streaming data necessitate a learning framework with timely and accurate processing. For a streaming classifier to be deployed in the real world, multiple challenges exist such as 1) Concept drift, 2) Imbalanced data; and 3) Costly labelling processes. These challenges become more crucial when they occur in sensitive fields of operation such as network security. The objective of this thesis is to provide a team-based genetic programming (GP) framework to explore and address these challenges with regard to network-based services. The GP classifier incrementally introduces changes to the model throughout the course of the stream to adapt to the content of the stream. The framework is based on an active learning approach where the learning process happens in interaction with a data subset to build a model. Thus, the design of the system is founded on the introduction of sampling and archiving policies to decouple the stream distribution from the training data subset. These policies work with no prior information on the distribution of classes and true labels. Benchmarking is conducted with real-world network security datasets with label budgets in the order of 5 to 0.5 percent and significant class imbalance. Evaluations for the detection of minor classes have been performed that represent the classifier behaviour in case of attacks. Comparisons to the current streaming algorithms and specifically network state-of-the-art frameworks for streaming processing under label budgets demonstrate the effectiveness of the proposed GP framework to address the challenges related to streaming data. Furthermore, the applicability of the proposed framework in network and security analytics is demonstrated.