Automatically Generating Robust Signatures Using a Machine Learning Approach to Unveil Encrypted VoIP Traffic Without Using Port Numbers, IP Addresses and Payload Inspection
The identification of encrypted network traffic is an important issue for network management tasks including quality of service, firewall enforcement and security. Traffic identification is becoming increasingly challenging as traditional techniques such as port-number matching and deep packet inspection become ineffective against applications such as Peer-to-Peer (P2P) Voice over Internet Protocol (VoIP), which use non-standard ports and encryption. Thus, different approaches such as machine learning (ML) are explored in the literature for traffic classification. However, traffic classification represents a particularly challenging application domain for ML. Ideally, solutions should be both simple (hence efficient to deploy) and accurate. Recent advances in ML provide the opportunity to decompose the original problem into a subset of classifiers with non-overlapping behaviours, in effect providing further insight into the problem domain and increasing the throughput of solutions. Thus, this thesis presents a novel approach for generating robust signatures to classify P2P VoIP traffic using an ML-based approach, specifically with the C5.0, GP and AdaBoost classification algorithms. In this research, simple packet header feature sets and statistical flow feature sets are explored without using IP addresses, source/destination ports or payload information to unveil the encrypted VoIP application in network traffic. In this context, robust signatures are those that, having been learned by training on one network, remain valid when applied to traffic from different time periods and different networks (locations), as well as under evasion attacks designed to bypass such a classifier. Results show that the performance of the automatically generated signatures does not degrade significantly when evaluated against these robustness criteria.
These results demonstrate that flow-based statistical features (temporal information), used with an ML-based approach, can achieve high classification accuracy and produce robust signatures. Furthermore, the results of the evasion experiments demonstrate that the performance of the signatures remains very promising when a malicious user tries to alter the characteristics of VoIP (specifically, Skype) traffic to evade the classifier.
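
As a minimal illustration of what statistical flow features of this kind might look like, the following Python sketch summarises one flow using only packet sizes and arrival times. The function name `flow_features` and the particular features chosen are assumptions made for illustration; they are not the feature sets evaluated in the thesis. IP addresses, port numbers and payload bytes are deliberately absent from the input.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Summarise one flow from (timestamp_seconds, size_bytes) pairs.

    Hypothetical feature set for illustration only: no IP address,
    port number or payload byte is used.
    """
    if not packets:
        raise ValueError("flow must contain at least one packet")
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Inter-arrival times carry the temporal information of the flow.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packet_count": len(packets),
        "total_bytes": sum(sizes),
        "duration": times[-1] - times[0],
        "mean_size": mean(sizes),
        "std_size": pstdev(sizes),
        "mean_iat": mean(gaps) if gaps else 0.0,
        "std_iat": pstdev(gaps) if gaps else 0.0,
    }


# Made-up flow: four packets in arrival order, 20 ms apart.
example_flow = [(0.00, 120), (0.02, 130), (0.04, 120), (0.06, 130)]
features = flow_features(example_flow)
```

A vector of this form, computed per flow and paired with a class label, is the sort of input a classifier such as AdaBoost could be trained on.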