Multimodal Representation Learning for Mental Health: Transfer Learning, PEFT, and Contrastive Learning

| Field | Value |
| --- | --- |
| dc.contributor.author | Naderi Khorshidi, Habibeh |
| dc.contributor.copyright-release | Not Applicable |
| dc.contributor.degree | Doctor of Philosophy |
| dc.contributor.department | Faculty of Computer Science |
| dc.contributor.ethics-approval | Not Applicable |
| dc.contributor.external-examiner | Dr. Diana Inkpen |
| dc.contributor.manuscripts | Not Applicable |
| dc.contributor.thesis-reader | Dr. Evangelos Milios |
| dc.contributor.thesis-reader | Dr. Sageev Oore |
| dc.contributor.thesis-reader | Dr. Stan Matwin |
| dc.contributor.thesis-supervisor | Dr. Frank Rudzicz |
| dc.date.accessioned | 2026-04-28T17:30:19Z |
| dc.date.available | 2026-04-28T17:30:19Z |
| dc.date.defence | 2026-04-15 |
| dc.date.issued | 2026-04-28 |
| dc.description | This thesis addresses the challenge of learning effective multimodal representations in low-resource settings by leveraging implicit supervision from paired modalities and introducing adaptive fusion mechanisms based on Mixture-of-Experts (MoE). Focusing on mental health assessment, it utilizes clinical interviews from parents and children to model both speech and language signals. The work systematically investigates transfer learning, parameter-efficient fine-tuning (PEFT), and contrastive audio-text learning for predicting emotion, sentiment, and higher-level psychological and cognitive outcomes under conditions of data scarcity and class imbalance. It proposes and evaluates a series of multimodal deep learning frameworks that integrate speech and language representations, demonstrating improved performance on a range of clinically relevant prediction tasks. |
| dc.description.abstract | This thesis addresses the challenge of learning effective multimodal representations in low-resource clinical settings by leveraging implicit supervision from paired modalities and introducing adaptive fusion mechanisms based on Mixture-of-Experts (MoE). Focusing on mental health assessment, it utilizes clinical audio interviews from parents and children to jointly model speech and language signals for predicting emotion, sentiment, and higher-level psychological and cognitive outcomes under conditions of data scarcity and class imbalance. In collaboration with the Department of Psychiatry at Dalhousie University, we curate a dataset of parent-child clinical interviews annotated at both segment and document levels, including fine-grained labels for emotion recognition, sentiment analysis, and criticism detection, as well as higher-level diagnostic outcomes such as ADHD, depression, and bipolar disorder. The thesis systematically investigates transfer learning, parameter-efficient fine-tuning (PEFT), and contrastive audio-text learning for multimodal representation learning. We first conduct a comprehensive study of transfer learning using 29 pre-trained language models and 22 audio models, evaluating their effectiveness under various adaptation strategies and analyzing the impact of limited and imbalanced data. Building on this, we explore PEFT as a more efficient alternative, applying six techniques (LoRA, AdaLoRA, OFT, LoHa, LoKr, and IA3) across multiple models. PEFT consistently outperforms conventional transfer learning, yielding improvements of +1.41% and +0.8% in average AUC on parent and offspring data, respectively. Finally, we propose a three-stage contrastive learning framework for multimodal representation learning. The approach combines unimodal encoding, cross-modal contrastive alignment, and downstream task learning with a MoE architecture that adaptively integrates modality-specific information. Across all evaluated tasks, the proposed frameworks consistently outperform strong unimodal and multimodal baselines, demonstrating the effectiveness of adaptive multimodal learning for mental health assessment in low-resource settings. |
| dc.identifier.uri | https://hdl.handle.net/10222/86048 |
| dc.language.iso | en |
| dc.subject | Multimodal Representation Learning |
| dc.subject | Transfer Learning |
| dc.subject | Parameter-Efficient Fine-Tuning (PEFT) |
| dc.subject | Contrastive Learning |
| dc.subject | Mixture of Experts |
| dc.subject | Cognitive and Emotion Recognition |
| dc.subject | Mental Health Prediction |
| dc.title | Multimodal Representation Learning for Mental Health: Transfer Learning, PEFT, and Contrastive Learning |
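
The abstract reports results for six PEFT techniques. As a concrete illustration of the general pattern, the sketch below applies one of them (LoRA) to a pre-trained text classifier with the Hugging Face `peft` library. The checkpoint name, label count, and hyperparameters are placeholder assumptions for illustration, not the models or settings evaluated in the thesis.

```python
# Minimal sketch: parameter-efficient fine-tuning with LoRA via Hugging Face `peft`.
# The base checkpoint and num_labels are placeholders, not the thesis's setup.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # placeholder; the thesis evaluates 29 language models
    num_labels=2,
)

config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # rank of the low-rank weight update
    lora_alpha=16,                      # scaling factor for the LoRA update
    lora_dropout=0.1,
    target_modules=["query", "value"],  # inject adapters into attention projections
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of weights are trainable
```

The same pattern extends to the other adapters the abstract lists (AdaLoRA, OFT, LoHa, LoKr, IA3) by swapping in the corresponding config class; because only a small fraction of weights is updated, PEFT is attractive under the data-scarce, imbalanced conditions the thesis studies.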

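The abstract's third contribution is a three-stage framework: unimodal encoding, cross-modal contrastive alignment, and downstream learning with a MoE fusion head. The PyTorch sketch below illustrates one plausible shape for stages two and three, using a symmetric InfoNCE loss and a softly gated expert mixture. All class names, dimensions, the temperature, and the expert count are assumptions for illustration and do not reproduce the thesis's actual architecture.

```python
# Illustrative sketch (not the thesis's implementation): CLIP-style audio-text
# contrastive alignment followed by a soft mixture-of-experts fusion head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalContrastive(nn.Module):
    """Project both modalities into a shared space and pull matched
    (audio, text) pairs together with a symmetric InfoNCE loss."""
    def __init__(self, audio_dim=768, text_dim=768, proj_dim=256, temperature=0.07):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, proj_dim)
        self.text_proj = nn.Linear(text_dim, proj_dim)
        self.temperature = temperature

    def forward(self, audio_emb, text_emb):
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        logits = a @ t.T / self.temperature              # (B, B) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        # symmetric loss: audio->text and text->audio retrieval
        loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2
        return loss, a, t

class MoEFusion(nn.Module):
    """Soft mixture-of-experts head: a gate weights expert predictions
    per example, adaptively mixing modality-specific information."""
    def __init__(self, dim=256, num_experts=4, num_classes=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                          nn.Linear(dim, num_classes))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(2 * dim, num_experts)

    def forward(self, a, t):
        x = torch.cat([a, t], dim=-1)                              # fused features
        weights = F.softmax(self.gate(x), dim=-1)                  # (B, E) gate
        expert_out = torch.stack([e(x) for e in self.experts], 1)  # (B, E, C)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)     # (B, C) logits

# Toy usage with random features standing in for encoder outputs (stage one).
B = 8
audio, text = torch.randn(B, 768), torch.randn(B, 768)
align = CrossModalContrastive()
loss, a, t = align(audio, text)
logits = MoEFusion()(a, t)
```

A soft gate lets each interview segment weight audio-leaning and text-leaning experts differently, which is one plausible reading of how the abstract's MoE "adaptively integrates modality-specific information".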