Multimodal Representation Learning for Mental Health: Transfer Learning, PEFT, and Contrastive Learning
Abstract
This thesis addresses the challenge of learning effective multimodal representations in low-resource clinical settings by leveraging implicit supervision from paired modalities and introducing adaptive fusion mechanisms based on Mixture-of-Experts (MoE). Focusing on mental health assessment, it utilizes clinical audio interviews from parents and children to jointly model speech and language signals for predicting emotion, sentiment, and higher-level psychological and cognitive outcomes under conditions of data scarcity and class imbalance.
In collaboration with the Department of Psychiatry at Dalhousie University, we curate a dataset of parent-child clinical interviews annotated at both segment and document levels, including fine-grained labels for emotion recognition, sentiment analysis, and criticism detection, as well as higher-level diagnostic outcomes such as ADHD, depression, and bipolar disorder.
The thesis systematically investigates transfer learning, parameter-efficient fine-tuning (PEFT), and contrastive audio-text learning for multimodal representation learning. We first conduct a comprehensive study of transfer learning using 29 pre-trained language models and 22 audio models, evaluating their effectiveness under various adaptation strategies and analyzing the impact of limited and imbalanced data. Building on this, we explore PEFT as a more efficient alternative, applying six techniques (LoRA, AdaLoRA, OFT, LoHa, LoKr, and IA3) across multiple models. PEFT consistently outperforms conventional transfer learning, yielding improvements of +1.41% and +0.8% in average AUC on parent and offspring data, respectively.
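To make the PEFT setup concrete, below is a minimal sketch of LoRA fine-tuning with the Hugging Face peft library. The base checkpoint (roberta-base), label count, and the rank, alpha, and dropout values are illustrative assumptions rather than the thesis's actual configuration; the same pattern extends to the other five adapters (AdaLoRA, OFT, LoHa, LoKr, IA3) by swapping in the corresponding peft config class.

```python
# Minimal LoRA sketch with Hugging Face peft; the checkpoint, label count,
# and all hyperparameters (r, alpha, dropout) are illustrative assumptions.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",   # hypothetical base language model
    num_labels=2,     # e.g. a binary clinical label
)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # keeps the classification head trainable
    r=8,                         # low-rank update dimension (assumed)
    lora_alpha=16,               # scaling factor for the LoRA updates
    lora_dropout=0.1,
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of weights train
```

Because only the low-rank adapter weights receive gradients, this style of adaptation keeps the memory and compute cost of fitting each backbone far below full fine-tuning, which is what makes sweeping many pre-trained models practical on a small, imbalanced clinical dataset.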
Finally, we propose a three-stage contrastive learning framework for multimodal representation learning. The approach combines unimodal encoding, cross-modal contrastive alignment, and downstream task learning with a MoE architecture that adaptively integrates modality-specific information. Across all evaluated tasks, the proposed frameworks consistently outperform strong unimodal and multimodal baselines, demonstrating the effectiveness of adaptive multimodal learning for mental health assessment in low-resource settings.
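As a rough illustration of the second and third stages, the PyTorch sketch below pairs a symmetric InfoNCE-style loss for cross-modal contrastive alignment with a soft Mixture-of-Experts layer that fuses audio and text embeddings per example. The module name MoEFusion, the expert count, and the temperature are hypothetical placeholders assuming same-dimensional unimodal embeddings; the thesis's actual architecture may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text segments.

    Matched pairs lie on the diagonal of the cosine-similarity matrix,
    so each modality is trained to retrieve its counterpart.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                      # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

class MoEFusion(torch.nn.Module):
    """Soft mixture-of-experts fusion (hypothetical layout).

    A learned gate weighs modality-mixing experts per example, so the
    fused representation adapts to whichever signal is informative.
    """
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(2 * dim, dim) for _ in range(n_experts))
        self.gate = torch.nn.Linear(2 * dim, n_experts)

    def forward(self, audio_emb, text_emb):
        x = torch.cat([audio_emb, text_emb], dim=-1)    # (B, 2*dim)
        weights = F.softmax(self.gate(x), dim=-1)       # (B, E) gate scores
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)         # (B, dim)
```

The soft gating here is one simple reading of "adaptively integrates modality-specific information": every expert contributes, weighted per example, rather than a hard top-k routing scheme.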
Description
This thesis addresses the challenge of learning effective multimodal representations in low-resource settings by leveraging implicit supervision from paired modalities and introducing adaptive fusion mechanisms based on Mixture-of-Experts (MoE). Focusing on mental health assessment, it utilizes clinical interviews from parents and children to model both speech and language signals. The work systematically investigates transfer learning, parameter-efficient fine-tuning (PEFT), and contrastive audio-text learning for predicting emotion, sentiment, and higher-level psychological and cognitive outcomes under conditions of data scarcity and class imbalance. It proposes and evaluates a series of multimodal deep learning frameworks that integrate speech and language representations, demonstrating improved performance on a range of clinically relevant prediction tasks.
Keywords
Multimodal Representation Learning, Transfer Learning, Parameter-Efficient Fine-Tuning (PEFT), Contrastive Learning, Mixture of Experts, Cognitive and Emotion Recognition, Mental Health Prediction
