N-gram based keyword topic modelling for Canadian Longitudinal Study on Aging survey data
The Canadian Longitudinal Study on Aging (CLSA) is a study and research platform funded by the Canadian Institutes of Health Research (CIHR) that investigates why some people stay healthy as they age while others do not. To this end, the research team conducted a population-based study of adults aged 45-85 across Canada. During the interview, participants were asked for their opinion on what promotes healthy aging. The responses to this question are plain, unstructured text. They are short and informal, which makes them challenging for text mining. Traditional topic modelling algorithms treat each document as a bag of words and rely on the intra-document frequency of words, an approach that does not seem to work well with our dataset. This thesis focuses on identifying the themes present in the responses with the help of a novel topic modelling algorithm that uses character n-grams and inter-document frequency to address the problems posed by short and noisy documents.
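To illustrate the two ingredients named above, the following is a minimal sketch of character n-gram extraction and inter-document frequency counting. It is not the algorithm developed in this thesis: the function names `char_ngrams` and `document_frequency`, the choice of n = 4, and the three sample responses are all assumptions made for illustration only.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased text.

    Runs of whitespace are collapsed to a single space first.
    (Hypothetical helper, not taken from the thesis.)
    """
    cleaned = " ".join(text.lower().split())
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def document_frequency(docs, n=3):
    """Count, for each n-gram, how many documents contain it.

    Each document contributes at most 1 per n-gram, so this is an
    inter-document frequency rather than an intra-document count.
    """
    df = Counter()
    for doc in docs:
        df.update(char_ngrams(doc, n))
    return df


# Invented sample responses; the first contains a deliberate misspelling.
responses = [
    "Staying active and excercise",
    "exercise daily",
    "good food and exercising",
]

df = document_frequency(responses, n=4)

# The 4-gram "rcis" occurs in all three responses, including the
# misspelled one, whereas the whole word "exercise" occurs in only one.
print(df["rcis"])
```

Because a misspelled or inflected word still shares most of its character n-grams with the standard form, counts of this kind are less sensitive to noise in short responses than whole-word counts.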