Machine Learning based Framework for User-Centered Insider Threat Detection
Abstract
Insider threat represents a major cyber-security challenge to companies, organizations, and government agencies. Harmful actions in insider threats are performed by authorized users in organizations. Due to the fact that an insider is authorized to access the organization's computer systems and has knowledge about the organization's security procedures, detecting insider threats is challenging. Many other challenges exist in this detection problem, including unbalanced data, limited ground truth, and possible user behaviour changes. This research proposes a comprehensive machine learning-based framework for insider threat detection, from data pre-processing, a combination of supervised and unsupervised learning, to deep analysis and meaningful result reporting.
For the data pre-processing step, the framework introduces a data extraction approach allowing extraction of numerical feature vectors representing user activities from heterogeneous data, with different data granularity levels and temporal data representations, and enabling applications of machine learning. In the initial detection step of the framework, assume no available ground truth, unsupervised learning methods with different working principles and unsupervised ensembles are explored for anomaly detection to identify anomalous user behaviours that may indicate insider threats. Furthermore, the framework employs supervised and semi-supervised machine learning under limited ground truth availability and real-world conditions to maximize the effectiveness of limited training data and detect insider threats with high precision. Throughout the thesis, realistic evaluation and comprehensive result reporting are performed to facilitate understanding of the framework's performance under real-world conditions.
Evaluation results on publicly available datasets show the effectiveness of the proposed approach. High insider threat detection rates are achieved at very low false positive rates. The robustness of the detection models is also demonstrated and comparisons with the state-of-the-art confirm the advantages of the approach.