DETECTING NO-OPINION RESPONSES IN THE CANADIAN LONGITUDINAL STUDY ON AGING (CLSA) DATASET USING UNSUPERVISED METHODS AND ACTIVE LEARNING
In open-ended surveys, responses that give no legitimate answer or opinion to the question asked are called no-opinion responses. We consider the problem of detecting no-opinion responses in the CLSA dataset using a machine learning approach. The CLSA dataset contains verbatim responses from over 51,000 participants to the question of what promotes healthy aging. Our foremost goal is to clean the CLSA dataset in order to support the study of healthy aging and pave a healthier way forward for future generations. This thesis investigates the performance of existing state-of-the-art approaches, which use distance measures coupled with embeddings, and Active Learning, to cluster and classify no-opinion responses. Among the unsupervised techniques, BERT embeddings with Euclidean distance gave the best performance. We also show that Active Learning is a viable approach for identifying no-opinion responses in a large survey; in our experiments, the SVM-based classifier performed best, with an area under the precision-recall (PR) curve of 0.97. Using this approach, we identified 1,157 no-opinion responses in the CLSA dataset.
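To make the reported metric concrete, the following is a minimal sketch, not the thesis's implementation, of scoring responses by Euclidean distance in an embedding space and summarizing the ranking with the area under the PR curve. Everything here is an assumption for illustration: the 2-D vectors stand in for real BERT embeddings, the centroid, example responses, and labels are hypothetical, and the area is computed as step-wise average precision, which may differ from the interpolation used in the thesis.

```python
import math


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    labels: 1 for a no-opinion response, 0 otherwise.
    scores: higher means "more likely no-opinion".
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank responses from most to least likely no-opinion.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / total_pos


# Hypothetical toy data: (text, 2-D stand-in embedding, label).
NO_OPINION_CENTROID = (0.0, 0.0)  # assumed centroid of known no-opinion answers
responses = [
    ("don't know", (0.1, 0.0), 1),
    ("no idea", (0.0, 0.2), 1),
    ("exercise daily", (1.0, 1.0), 0),
    ("n/a", (0.5, 0.5), 1),
    ("eat well", (0.3, 0.3), 0),
]

# A smaller Euclidean distance to the centroid gives a higher score.
scores = [-math.dist(vec, NO_OPINION_CENTROID) for _, vec, _ in responses]
labels = [label for _, _, label in responses]

ap = average_precision(labels, scores)
print(round(ap, 3))  # 0.917
```

In this toy ranking, one legitimate answer ("eat well") sits closer to the centroid than one no-opinion answer ("n/a"), which is what pulls the score below 1.0.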