Data Exhaust in Voice Assistants: Analysis and Mitigation Approaches
Abstract
Voice assistants (VAs) such as Siri, Google Assistant, Cortana, and Alexa are increasingly
integrated into smartphones, smart home devices, and Internet of Things
(IoT) platforms. While offering convenience, these technologies generate significant
data exhaust, consisting of background data captured during both active use and
passive listening. This passive data generation often occurs without users’ awareness,
raising critical privacy, data governance, and security concerns. Despite their ubiquity,
a systematic understanding of how, when, and to what extent voice assistants
transmit data in real-world settings remains limited.
The objective of this thesis is to examine voice assistant privacy policies and
network traffic, develop a mobile application to notify users of security risks, and
propose mitigation methods. First, we conducted a systematic survey of the privacy
policies of four major VAs, focusing on data collection, retention, third-party sharing,
and transparency, and explored mitigation methods to limit unnecessary data collection.
Based on these findings, Google Assistant was selected for detailed analysis due to
its deep integration with Google services and extensive data collection.
We subsequently developed an Android application to analyze PCAP files and
classify network traffic generated by voice assistants, particularly in background or
passive modes. The application identifies active background services, extracts Domain
Name System (DNS) queries, and detects unexpected third-party communications.
A built-in risk assessment system categorizes background activity into high, medium,
or low risk, providing users with clear, contextual explanations.
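The DNS extraction and risk categorization steps described above can be illustrated with a minimal sketch. It is not the thesis's Android implementation: the domain lists, the function names, and the rule that maps a domain to high, medium, or low risk are all assumptions made for illustration. The sketch reads the first question name from a raw DNS message (the UDP payload of a packet already pulled from a PCAP file) and assigns a risk level depending on whether the user was interacting with the assistant at the time.

```python
# Hypothetical domain lists, for illustration only; not taken from the thesis.
FIRST_PARTY_SUFFIXES = ("google.com", "googleapis.com", "gstatic.com")
AD_ANALYTICS_SUFFIXES = ("doubleclick.net", "google-analytics.com")


def parse_dns_query_name(payload: bytes) -> str:
    """Return the first question name from a raw DNS message (UDP payload)."""
    if len(payload) < 12:
        raise ValueError("DNS header is 12 bytes; payload too short")
    # QDCOUNT is the 16-bit big-endian field at byte offset 4.
    qdcount = int.from_bytes(payload[4:6], "big")
    if qdcount == 0:
        raise ValueError("message has no question section")

    labels = []
    offset = 12  # the question section starts right after the header
    while True:
        if offset >= len(payload):
            raise ValueError("truncated question name")
        length = payload[offset]
        if length == 0:  # a zero-length label terminates the name
            break
        if length & 0xC0:
            # Top bits set means a compression pointer or reserved label type,
            # which this sketch does not handle.
            raise ValueError("unsupported label type in question name")
        offset += 1
        label = payload[offset:offset + length]
        if len(label) < length:
            raise ValueError("truncated label")
        labels.append(label.decode("ascii", errors="replace"))
        offset += length
    return ".".join(labels).lower()


def matches_suffix(domain: str, suffixes) -> bool:
    """True if domain equals a suffix or is a subdomain of one."""
    return any(domain == s or domain.endswith("." + s) for s in suffixes)


def classify_risk(domain: str, user_interacting: bool) -> str:
    """Assumed rule set: ad/analytics traffic without interaction is high risk."""
    if matches_suffix(domain, AD_ANALYTICS_SUFFIXES):
        return "medium" if user_interacting else "high"
    if matches_suffix(domain, FIRST_PARTY_SUFFIXES):
        return "low"
    return "medium"  # unknown third party: flag for review
```

A caller would pass each DNS payload to `parse_dns_query_name` and then feed the resulting name to `classify_risk`; a real classifier would likely rely on a maintained tracker list rather than the two short tuples assumed here.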
We further performed technical traffic analysis using tools such as Wireshark,
evaluating encryption patterns and traffic bursts to better understand behavioral
signatures. Our findings confirm that voice assistants can transmit user-related data
even without explicit interaction, often to external analytics and ad services.
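One simple way to look for traffic bursts in a capture is to group packets whose inter-arrival gaps stay below a threshold. The sketch below shows that idea only; it is not the analysis procedure used in the thesis, and the function name, the 1-second gap, and the 3-packet minimum are assumed values chosen for illustration. Input is a list of `(timestamp_seconds, size_bytes)` pairs such as could be exported from Wireshark.

```python
from typing import List, Tuple

Packet = Tuple[float, int]              # (timestamp in seconds, size in bytes)
Burst = Tuple[float, float, int, int]   # (start, end, packet count, total bytes)


def find_bursts(packets: List[Packet],
                max_gap: float = 1.0,
                min_packets: int = 3) -> List[Burst]:
    """Group packets into bursts separated by gaps longer than max_gap seconds.

    Groups with fewer than min_packets packets are discarded. Both thresholds
    are assumed defaults, not values from the thesis.
    """
    bursts: List[Burst] = []
    current: List[Packet] = []

    def flush() -> None:
        # Record the current group if it is large enough to count as a burst.
        if len(current) >= min_packets:
            start, end = current[0][0], current[-1][0]
            total = sum(size for _, size in current)
            bursts.append((start, end, len(current), total))

    for ts, size in sorted(packets):
        # A gap larger than max_gap closes the current group.
        if current and ts - current[-1][0] > max_gap:
            flush()
            current = []
        current.append((ts, size))
    flush()
    return bursts
```

Comparing the burst start times against the moments when the user actually spoke to the assistant is one way to separate interaction-driven traffic from background transmissions.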
This thesis presents a hybrid framework to uncover hidden data behaviors in voice
assistants and proposes mitigation strategies to reduce passive data leakage, enabling
more privacy-aware and transparent smart environments.
Keywords
voice assistants
