Towards a Label-Free and Representation-Based Metric for Evaluating Machine Learning Models
Abstract
In this work, we explore the viability of label-free metrics for evaluating
models. We begin by examining how different viable label schemes applied to an
identical dataset affect linear probe accuracy. We show that, in a toy setting,
a notion of “complexity” for distinguishing classes can predict the relative
“difficulty” a population of models will encounter when comparing classification
tasks. By establishing that valid formulations of an evaluation task can differ
arbitrarily in this way, we motivate the search for a label-scheme-independent
means of evaluating learning. To this end, we examine label-free clustering-based
metrics and entropy, computed on representational spaces at progressive milestones
during self-supervised learning and on pre-trained representational spaces. While
clustering-based metrics show mixed success, entropy may be viable for monitoring
learning and for cross-architectural comparisons, despite displaying instability
early in training and differing trends under certain learning methodologies.
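The abstract does not specify how the entropy metric is computed over a representational space; the sketch below is only an illustrative, hedged example of one label-free formulation, in which representations are clustered without labels and the Shannon entropy of the resulting cluster-assignment distribution is reported. The choice of k-means and the number of clusters are assumptions, not the paper's stated method.

```python
# Illustrative sketch (assumed formulation): label-free entropy of a
# representation space via k-means cluster-assignment frequencies.
import numpy as np
from sklearn.cluster import KMeans


def representation_entropy(features: np.ndarray, n_clusters: int = 10, seed: int = 0) -> float:
    """Cluster representations without labels and return the Shannon entropy
    (in nats) of the cluster-assignment distribution."""
    assignments = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(features)
    counts = np.bincount(assignments, minlength=n_clusters)
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # drop empty clusters to avoid log(0)
    return float(-np.sum(probs * np.log(probs)))


if __name__ == "__main__":
    # Hypothetical usage: compare two checkpoints of a self-supervised model
    # by their representation entropy, without using any labels.
    rng = np.random.default_rng(0)
    early_ckpt = rng.normal(size=(1000, 128))
    late_ckpt = rng.normal(size=(1000, 128)) * np.linspace(0.1, 1.0, 128)
    print(representation_entropy(early_ckpt), representation_entropy(late_ckpt))
```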