
EXPLORING A MACHINE LEARNING BASED APPROACH FOR ANALYZING ANONYMIZED DATA

Date

2017-04-11T12:56:18Z

Authors

Nheiley, Derek

Abstract

Log files often store information as human-readable plain text. Research study participants do not want sensitive information recorded, and when data is made publicly available, they ask that it cannot be linked back to them individually. Researchers anonymize publicly available datasets by assigning users random encodings and by applying one-way hash functions to obscure other human-readable plain text. This research examines the effects of concatenating and hashing a list of nominal values to represent a single dataset feature, using password encryption as an example. Using decision trees, with classification accuracy as a measure of information leakage, I evaluate this approach on several publicly available mobile datasets. One contribution of this research is identifying fine-grained application usage details as a suitable candidate for device fingerprinting: user classification accuracy is maintained even when the names and number of applications are obscured.
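
As a rough illustration of concatenating and hashing a list of nominal values into a single feature, the following minimal Python sketch may help. It is not taken from the thesis: the function name `hash_nominal_list`, the `|` separator, the sorting step, the choice of SHA-256, and the sample application lists are all assumptions made for this example.

```python
import hashlib


def hash_nominal_list(values, salt=""):
    """Concatenate a list of nominal values and return one SHA-256 hex digest.

    Hypothetical helper: the separator, sorting, and hash function are
    assumptions for illustration, not the procedure used in the thesis.
    """
    # Sort so the same set of values yields the same digest in any order.
    joined = "|".join(sorted(values))
    return hashlib.sha256((salt + joined).encode("utf-8")).hexdigest()


# Hypothetical per-user lists of installed applications.
users = {
    "u1": ["mail", "maps", "camera"],
    "u2": ["camera", "mail", "maps"],
    "u3": ["mail", "maps"],
}

# Each user's list collapses into one opaque nominal feature value.
features = {user: hash_nominal_list(apps) for user, apps in users.items()}

for user, digest in features.items():
    print(user, digest[:16])
```

In this sketch the digest hides both the names and the number of applications, yet two users with the same set of applications still receive the same feature value, so a classifier such as a decision tree could in principle still split on it.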

Keywords

anonymized data, machine learning, entropy, information security, log files, hashing, encryption, pattern recognition
