An Investigation of Using Machine Learning with Distribution Based Flow Features for Classifying SSL Encrypted Network Traffic
Arndt, Daniel Joseph
Encrypted protocols, such as Secure Socket Layer (SSL), are becoming more prevalent because of the growing use of e-commerce, anonymity services, gaming and Peer-to-Peer (P2P) applications such as Skype and Gtalk. The objective of this work is two-fold. First, an investigation is provided into the identification of web browsing behaviour in SSL tunnels. To this end, C5.0, naive Bayesian, AdaBoost and Genetic Programming learning models are evaluated under training and test conditions from a network traffic capture. In these experiments, flow-based features are employed without using Internet Protocol (IP) addresses, source/destination ports or payload information. Results indicate that it is possible to identify web browsing behaviour in SSL encrypted tunnels. On test data, a C5.0 model achieves a ~95% detection rate and a ~2% false positive rate for identifying SSL, and an AdaBoost model achieves a ~98% detection rate and a ~3% false positive rate for identifying web browsing within these tunnels. Second, the identifying characteristics of SSL traffic are investigated, whereby a new tool is introduced to generate flow statistics that present the features in a different way, using bins to represent distributions of measurements. These new features are tested using the best performers from the previous experiments, C5.0 and AdaBoost. They increase detection rates by up to 32.40% and lower false positive rates by as much as 54.73% on data sets that contain traffic from a network other than the one on which the training set was captured. Furthermore, the new feature set outperforms the old feature set in every case.
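The idea of using bins to represent the distribution of a per-flow measurement can be illustrated with a minimal sketch. It is not the tool described in this work: the function name, the choice of packet size as the measurement, and the bin edges are all assumptions made for illustration only.

```python
from bisect import bisect_right

# Hypothetical bin edges in bytes; the bins used in the thesis are not stated here.
PACKET_SIZE_EDGES = [64, 128, 256, 512, 1024]


def binned_distribution(values, edges):
    """Return the fraction of values that fall into each bin.

    The bins are (-inf, edges[0]), [edges[0], edges[1]), ...,
    [edges[-1], +inf), so the result has len(edges) + 1 entries.
    """
    counts = [0] * (len(edges) + 1)
    for value in values:
        # bisect_right gives the index of the bin that contains the value.
        counts[bisect_right(edges, value)] += 1
    total = len(values)
    if total == 0:
        # An empty flow yields an all-zero feature vector.
        return [0.0] * len(counts)
    return [count / total for count in counts]


# Illustrative packet sizes for one flow (made-up values, not measured data).
packet_sizes = [40, 40, 1500, 1500, 1500, 576, 52, 1500]
features = binned_distribution(packet_sizes, PACKET_SIZE_EDGES)
print(features)
```

Each flow then contributes one fixed-length vector of bin fractions per measurement, rather than a single summary value such as the mean, which is one plausible way a classifier such as C5.0 or AdaBoost could be given the shape of the distribution.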