Multimodal Representation Learning for Mental Health: Transfer Learning, PEFT, and Contrastive Learning
Abstract
This thesis addresses the challenge of learning effective multimodal representations in low-resource clinical settings by leveraging implicit supervision from paired modalities and introducing adaptive fusion mechanisms based on Mixture-of-Experts (MoE). Focusing on mental health assessment, it utilizes clinical audio interviews from parents and children to jointly model speech and language signals for predicting emotion, sentiment, and higher-level psychological and cognitive outcomes under conditions of data scarcity and class imbalance.
In collaboration with the Department of Psychiatry at Dalhousie University, we curate a dataset of parent-child clinical interviews annotated at both segment and document levels, including fine-grained labels for emotion recognition, sentiment analysis, and criticism detection, as well as higher-level diagnostic outcomes such as ADHD, depression, and bipolar disorder.
The thesis systematically investigates transfer learning, parameter-efficient fine-tuning (PEFT), and contrastive audio-text learning for multimodal representation learning. We first conduct a comprehensive study of transfer learning using 29 pre-trained language models and 22 audio models, evaluating their effectiveness under various adaptation strategies and analyzing the impact of limited and imbalanced data. Building on this, we explore PEFT as a more efficient alternative, applying six techniques (LoRA, AdaLoRA, OFT, LoHa, LoKr, and IA3) across multiple models. PEFT consistently outperforms conventional transfer learning, yielding improvements of +1.41% and +0.8% in average AUC on parent and offspring data, respectively.
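To make the PEFT setup concrete, below is a minimal sketch of LoRA fine-tuning with the Hugging Face peft library. The base checkpoint (roberta-base), label count, and the rank, alpha, and dropout values are illustrative assumptions rather than the thesis's actual configuration; the same pattern extends to the other five adapters (AdaLoRA, OFT, LoHa, LoKr, IA3) by swapping in the corresponding peft config class.

```python
# Minimal LoRA sketch with Hugging Face peft; the checkpoint, label count,
# and all hyperparameters (r, alpha, dropout) are illustrative assumptions.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",   # hypothetical base language model
    num_labels=2,     # e.g. a binary clinical label
)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # keeps the classification head trainable
    r=8,                         # low-rank update dimension (assumed)
    lora_alpha=16,               # scaling factor for the LoRA updates
    lora_dropout=0.1,
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of weights train
```

Because only the low-rank adapter weights receive gradients, this style of adaptation keeps the memory and compute cost of fitting each backbone far below full fine-tuning, which is what makes sweeping many pre-trained models practical on a small, imbalanced clinical dataset.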
Finally, we propose a three-stage contrastive learning framework for multimodal representation learning. The approach combines unimodal encoding, cross-modal contrastive alignment, and downstream task learning with a MoE architecture that adaptively integrates modality-specific information. Across all evaluated tasks, the proposed frameworks consistently outperform strong unimodal and multimodal baselines, demonstrating the effectiveness of adaptive multimodal learning for mental health assessment in low-resource settings.
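As a rough illustration of the second and third stages, the PyTorch sketch below pairs a symmetric InfoNCE-style loss for cross-modal contrastive alignment with a soft Mixture-of-Experts layer that fuses audio and text embeddings per example. The module name MoEFusion, the expert count, and the temperature are hypothetical placeholders assuming same-dimensional unimodal embeddings; the thesis's actual architecture may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text segments.

    Matched pairs lie on the diagonal of the cosine-similarity matrix,
    so each modality is trained to retrieve its counterpart.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                      # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

class MoEFusion(torch.nn.Module):
    """Soft mixture-of-experts fusion (hypothetical layout).

    A learned gate weighs modality-mixing experts per example, so the
    fused representation adapts to whichever signal is informative.
    """
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(2 * dim, dim) for _ in range(n_experts))
        self.gate = torch.nn.Linear(2 * dim, n_experts)

    def forward(self, audio_emb, text_emb):
        x = torch.cat([audio_emb, text_emb], dim=-1)    # (B, 2*dim)
        weights = F.softmax(self.gate(x), dim=-1)       # (B, E) gate scores
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)         # (B, dim)
```

The soft gating here is one simple reading of "adaptively integrates modality-specific information": every expert contributes, weighted per example, rather than a hard top-k routing scheme.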
Description
This thesis addresses the challenge of learning effective multimodal representations in low-resource settings by leveraging implicit supervision from paired modalities and introducing adaptive fusion mechanisms based on Mixture-of-Experts (MoE). Focusing on mental health assessment, it utilizes clinical interviews from parents and children to model both speech and language signals. The work systematically investigates transfer learning, parameter-efficient fine-tuning (PEFT), and contrastive audio-text learning for predicting emotion, sentiment, and higher-level psychological and cognitive outcomes under conditions of data scarcity and class imbalance. It proposes and evaluates a series of multimodal deep learning frameworks that integrate speech and language representations, demonstrating improved performance on a range of clinically relevant prediction tasks.
Keywords
Multimodal Representation Learning, Transfer Learning, Parameter-Efficient Fine-Tuning (PEFT), Contrastive Learning, Mixture of Experts, Cognitive and Emotion Recognition, Mental Health Prediction
