Multimodal Representation Learning for Mental Health: Transfer Learning, PEFT, and Contrastive Learning

| Field | Value |
| --- | --- |
| dc.contributor.author | Naderi Khorshidi, Habibeh |
| dc.contributor.copyright-release | Not Applicable |
| dc.contributor.degree | Doctor of Philosophy |
| dc.contributor.department | Faculty of Computer Science |
| dc.contributor.ethics-approval | Not Applicable |
| dc.contributor.external-examiner | Dr. Diana Inkpen |
| dc.contributor.manuscripts | Not Applicable |
| dc.contributor.thesis-reader | Dr. Evangelos Milios |
| dc.contributor.thesis-reader | Dr. Sageev Oore |
| dc.contributor.thesis-reader | Dr. Stan Matwin |
| dc.contributor.thesis-supervisor | Dr. Frank Rudzicz |
| dc.date.accessioned | 2026-04-28T17:30:19Z |
| dc.date.available | 2026-04-28T17:30:19Z |
| dc.date.defence | 2026-04-15 |
| dc.date.issued | 2026-04-28 |
| dc.description | This thesis addresses the challenge of learning effective multimodal representations in low-resource settings by leveraging implicit supervision from paired modalities and introducing adaptive fusion mechanisms based on Mixture-of-Experts (MoE). Focusing on mental health assessment, it utilizes clinical interviews from parents and children to model both speech and language signals. The work systematically investigates transfer learning, parameter-efficient fine-tuning (PEFT), and contrastive audio-text learning for predicting emotion, sentiment, and higher-level psychological and cognitive outcomes under conditions of data scarcity and class imbalance. It proposes and evaluates a series of multimodal deep learning frameworks that integrate speech and language representations, demonstrating improved performance on a range of clinically relevant prediction tasks. |
| dc.description.abstract | This thesis addresses the challenge of learning effective multimodal representations in low-resource clinical settings by leveraging implicit supervision from paired modalities and introducing adaptive fusion mechanisms based on Mixture-of-Experts (MoE). Focusing on mental health assessment, it utilizes clinical audio interviews from parents and children to jointly model speech and language signals for predicting emotion, sentiment, and higher-level psychological and cognitive outcomes under conditions of data scarcity and class imbalance. In collaboration with the Department of Psychiatry at Dalhousie University, we curate a dataset of parent-child clinical interviews annotated at both segment and document levels, including fine-grained labels for emotion recognition, sentiment analysis, and criticism detection, as well as higher-level diagnostic outcomes such as ADHD, depression, and bipolar disorder. The thesis systematically investigates transfer learning, parameter-efficient fine-tuning (PEFT), and contrastive audio-text learning for multimodal representation learning. We first conduct a comprehensive study of transfer learning using 29 pre-trained language models and 22 audio models, evaluating their effectiveness under various adaptation strategies and analyzing the impact of limited and imbalanced data. Building on this, we explore PEFT as a more efficient alternative, applying six techniques (LoRA, AdaLoRA, OFT, LoHa, LoKr, and IA3) across multiple models. PEFT consistently outperforms conventional transfer learning, yielding improvements of +1.41% and +0.8% in average AUC on parent and offspring data, respectively. Finally, we propose a three-stage contrastive learning framework for multimodal representation learning. The approach combines unimodal encoding, cross-modal contrastive alignment, and downstream task learning with a MoE architecture that adaptively integrates modality-specific information. Across all evaluated tasks, the proposed frameworks consistently outperform strong unimodal and multimodal baselines, demonstrating the effectiveness of adaptive multimodal learning for mental health assessment in low-resource settings. |
| dc.identifier.uri | https://hdl.handle.net/10222/86048 |
| dc.language.iso | en |
| dc.subject | Multimodal Representation Learning |
| dc.subject | Transfer Learning |
| dc.subject | Parameter-Efficient Fine-Tuning (PEFT) |
| dc.subject | Contrastive Learning |
| dc.subject | Mixture of Experts |
| dc.subject | Cognitive and Emotion Recognition |
| dc.subject | Mental Health Prediction |
| dc.title | Multimodal Representation Learning for Mental Health: Transfer Learning, PEFT, and Contrastive Learning |
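
The abstract reports results for six PEFT techniques. As a concrete illustration of the general pattern, the sketch below applies one of them (LoRA) to a pre-trained text classifier with the Hugging Face `peft` library. The checkpoint name, label count, and hyperparameters are placeholder assumptions for illustration, not the models or settings evaluated in the thesis.

```python
# Minimal sketch: parameter-efficient fine-tuning with LoRA via Hugging Face `peft`.
# The base checkpoint and num_labels are placeholders, not the thesis's setup.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # placeholder; the thesis evaluates 29 language models
    num_labels=2,
)

config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # rank of the low-rank weight update
    lora_alpha=16,                      # scaling factor for the LoRA update
    lora_dropout=0.1,
    target_modules=["query", "value"],  # inject adapters into attention projections
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of weights are trainable
```

The same pattern extends to the other adapters the abstract lists (AdaLoRA, OFT, LoHa, LoKr, IA3) by swapping in the corresponding config class; because only a small fraction of weights is updated, PEFT is attractive under the data-scarce, imbalanced conditions the thesis studies.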

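The abstract's third contribution is a three-stage framework: unimodal encoding, cross-modal contrastive alignment, and downstream learning with a MoE fusion head. The PyTorch sketch below illustrates one plausible shape for stages two and three, using a symmetric InfoNCE loss and a softly gated expert mixture. All class names, dimensions, the temperature, and the expert count are assumptions for illustration and do not reproduce the thesis's actual architecture.

```python
# Illustrative sketch (not the thesis's implementation): CLIP-style audio-text
# contrastive alignment followed by a soft mixture-of-experts fusion head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalContrastive(nn.Module):
    """Project both modalities into a shared space and pull matched
    (audio, text) pairs together with a symmetric InfoNCE loss."""
    def __init__(self, audio_dim=768, text_dim=768, proj_dim=256, temperature=0.07):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, proj_dim)
        self.text_proj = nn.Linear(text_dim, proj_dim)
        self.temperature = temperature

    def forward(self, audio_emb, text_emb):
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        logits = a @ t.T / self.temperature              # (B, B) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        # symmetric loss: audio->text and text->audio retrieval
        loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2
        return loss, a, t

class MoEFusion(nn.Module):
    """Soft mixture-of-experts head: a gate weights expert predictions
    per example, adaptively mixing modality-specific information."""
    def __init__(self, dim=256, num_experts=4, num_classes=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                          nn.Linear(dim, num_classes))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(2 * dim, num_experts)

    def forward(self, a, t):
        x = torch.cat([a, t], dim=-1)                              # fused features
        weights = F.softmax(self.gate(x), dim=-1)                  # (B, E) gate
        expert_out = torch.stack([e(x) for e in self.experts], 1)  # (B, E, C)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)     # (B, C) logits

# Toy usage with random features standing in for encoder outputs (stage one).
B = 8
audio, text = torch.randn(B, 768), torch.randn(B, 768)
align = CrossModalContrastive()
loss, a, t = align(audio, text)
logits = MoEFusion()(a, t)
```

A soft gate lets each interview segment weight audio-leaning and text-leaning experts differently, which is one plausible reading of how the abstract's MoE "adaptively integrates modality-specific information".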