Ordinal Variable Imputation for Health Survey Data: A Comparison between Machine Learning and non-Machine Learning Methods
Introduction: Large amounts of data are available for analyses from survey datasets. However, missing data can reduce statistical power and/or introduce bias into analyses when not addressed correctly. Data imputation methods replace missing data with estimated values that are informed by the existing data. Machine learning algorithms can improve the efficiency and accuracy of data imputation by automatically generating models that fit complex associations that may exist between variables in a dataset.
Methods: This thesis uses a cross-sectional simulation study of the Canadian Community Health Survey 2014 public use microdata file to induce missingness into annual total household income, an ordinal variable with 5 classes. The simulation study applies 5 imputation models from machine learning algorithms and 2 non-machine learning imputation models (ordinal logistic regression and predictive mean matching) to each simulated dataset, for a total of 84 imputed datasets. The evaluation uses an ordinal-sensitive distance measure and class transition tables to compare imputation model performance.
Results: The imputation models from machine learning algorithms performed better than the non-machine learning imputation models with regard to the ordinal-sensitive distance measure (0.5-0.6 for machine learning vs 0.65-0.75 for non-machine learning; lower values indicate better performance). The class transition tables indicate that, while scoring above 80% accuracy in one class, machine learning models tend to overrepresent income classes that are easier to classify and produce imputed values that do not reflect the original class structure of the income variable. The machine learning models had very low accuracy (less than 5% for all algorithms except one) for the income class that was the most underrepresented in the imputed data. The non-machine learning models produced imputed values that reflected the original income class structure well but had poor accuracy (15-55% depending on the class) and showed less ordinality than the imputed values from the machine learning models.
Conclusion: Machine learning algorithms provide improvements in imputation accuracy for specific groups of observations and exhibit stronger ordinality in imputed data. However, the overrepresentation of specific classes in the imputed datasets may reduce the generalizability of machine learning imputation models. Machine learning imputation is situationally suitable for variables with specific classes that hold high value, or for variables where the ordinal structure is important. Future research on addressing the bias in machine learning algorithms has the potential to further improve the performance and generalizability of machine learning methods for data imputation.
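The two evaluation tools named in the abstract can be illustrated with a minimal sketch. The abstract does not specify which ordinal-sensitive distance measure the thesis uses, so the mean absolute difference in class rank below is an assumed stand-in, not the thesis's measure. The function names and the toy income data are hypothetical and exist only for illustration.

```python
def mean_ordinal_distance(true_classes, imputed_classes):
    """Mean absolute difference in class rank (assumed stand-in measure).

    Lower values mean the imputed classes sit closer to the true classes.
    """
    if not true_classes or len(true_classes) != len(imputed_classes):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - i) for t, i in zip(true_classes, imputed_classes))
    return total / len(true_classes)


def class_transition_table(true_classes, imputed_classes, n_classes=5):
    """Count matrix: table[t - 1][i - 1] = cases with true class t imputed as class i."""
    table = [[0] * n_classes for _ in range(n_classes)]
    for t, i in zip(true_classes, imputed_classes):
        table[t - 1][i - 1] += 1
    return table


def per_class_accuracy(table):
    """Share of each true class that was imputed correctly (None if the class is empty)."""
    return [row[k] / sum(row) if sum(row) else None for k, row in enumerate(table)]


# Hypothetical toy data: income classes coded 1 (lowest) to 5 (highest).
true_income = [1, 2, 3, 4, 5, 5, 3, 2]
imputed_income = [1, 3, 3, 4, 4, 5, 1, 2]

distance = mean_ordinal_distance(true_income, imputed_income)
table = class_transition_table(true_income, imputed_income)
accuracy = per_class_accuracy(table)

print(distance)
for row in table:
    print(row)
print(accuracy)
```

Reading the table by row shows where each true class ends up after imputation, and the column totals show which classes are overrepresented in the imputed data. Per-class accuracy is the diagonal entry divided by the row total.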