EXPLORING A MACHINE LEARNING BASED APPROACH FOR ANALYZING ANONYMIZED DATA
Date
2017-04-11T12:56:18Z
Authors
Nheiley, Derek
Abstract
Information found in log files is often stored in human-readable plain text. Research study participants do not want sensitive information recorded, and when data is made publicly available, they ask that it cannot be linked back to them individually. Researchers anonymize publicly available datasets by assigning random encodings to users and applying one-way hashing functions to obscure other human-readable plain text.
This research examines the effects of concatenating and hashing a list of nominal values to represent a single dataset feature, using password encryption as an example. I evaluate this representation on several publicly available mobile datasets, using decision trees and treating classification accuracy as a measure of information leakage.
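As an illustration of the concatenate-and-hash step described above, here is a minimal sketch. The function name `hash_feature`, the choice of SHA-256, the `|` separator, and the sorting of values before joining are all assumptions made for this example; the abstract does not specify them.

```python
import hashlib

def hash_feature(values, salt=""):
    """Collapse a list of nominal values into one opaque token (hypothetical helper)."""
    # Assumption: sort first so the same set of values always maps to the same token.
    joined = "|".join(sorted(values))
    # One-way hash of the concatenated string; the digest is a 64-character hex string.
    return hashlib.sha256((salt + joined).encode("utf-8")).hexdigest()

# Hypothetical per-user lists of installed applications.
apps_a = ["mail", "maps", "camera"]
apps_b = ["camera", "mail", "maps"]   # same set as apps_a, different order
apps_c = ["mail", "maps"]             # one application fewer

token_a = hash_feature(apps_a)
token_b = hash_feature(apps_b)
token_c = hash_feature(apps_c)

print(token_a == token_b)  # identical sets give identical tokens
print(token_a == token_c)  # different sets give different tokens
```

The resulting token hides the names and the number of values, yet it remains a stable nominal label that a classifier such as a decision tree can still split on, which is why classification accuracy can serve as a proxy for information leakage.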
One contribution of this research is the identification of fine-grained application usage details as a suitable candidate for device fingerprinting: user classification accuracy is maintained even when the names and number of applications are obscured.
Keywords
anonymized data, machine learning, entropy, information security, log files, hashing, encryption, pattern recognition