Learning Optical Flow with Auxiliary Cost Aggregation
Abstract
Optical flow represents the per-pixel motion between two adjacent frames in a video sequence. Over the past few years, deep learning-based approaches to optical flow estimation have overshadowed variational approaches, as they achieve real-time estimation with lower estimation error. Deep learning-based estimation models rely heavily on the cost volume, which is constructed through matrix multiplication and encodes dense matching information between the given inputs. Long-range correlation and occlusion, however, remain challenging, as information drawn from the cost volume is heavily weighted by the local correlation defined over a fixed window size. In this thesis, we propose to enrich the information used for the iterative residual flow decoding process with an Auxiliary Cost Aggregation (ACA) unit, which constructs an auxiliary cost volume from the top-k matches in the 4D cost volume and then augments it using Transformers. We also propose a post-refinement module that refines the predicted residual flow at the end of each iteration based on local feature coherence. Extensive experiments indicate that our model achieves better cross-dataset generalization than two baseline models, RAFT and GMA. On the Sintel and KITTI benchmarks, our model outperforms RAFT and performs comparably to other state-of-the-art (SOTA) models.
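To make the two core operations named above concrete, the following is a minimal NumPy sketch of how an all-pairs 4D cost volume can be built by matrix multiplication of per-pixel features, and how the top-k matches for each source pixel can then be extracted. The function names (`build_cost_volume`, `topk_matches`), shapes, and the 1/sqrt(D) scaling are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def build_cost_volume(f1, f2):
    """All-pairs correlation: dot product of every pixel feature in
    frame 1 with every pixel feature in frame 2, computed as one
    matrix multiplication. f1, f2 have shape (H, W, D)."""
    H, W, D = f1.shape
    a = f1.reshape(H * W, D)
    b = f2.reshape(H * W, D)
    cost = (a @ b.T) / np.sqrt(D)        # (H*W, H*W) correlation matrix
    return cost.reshape(H, W, H, W)      # 4D cost volume

def topk_matches(cost, k):
    """For each source pixel, keep the k highest correlation scores
    and their flattened target-pixel indices."""
    H, W = cost.shape[:2]
    flat = cost.reshape(H * W, H * W)
    idx = np.argsort(flat, axis=1)[:, ::-1][:, :k]   # top-k indices, descending
    val = np.take_along_axis(flat, idx, axis=1)      # corresponding scores
    return val, idx

# Toy example: random features for two tiny 4x4 frames with 8 channels.
rng = np.random.default_rng(0)
f1, f2 = rng.standard_normal((2, 4, 4, 8))
cv = build_cost_volume(f1, f2)
vals, idx = topk_matches(cv, k=3)
print(cv.shape, vals.shape)  # (4, 4, 4, 4) (16, 3)
```

In an actual model the top-k scores and target coordinates would then be fed to the aggregation unit; the fixed-window limitation discussed above arises when, instead of a global top-k, only a local neighborhood of the cost volume is sampled.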