Trainable Weight Averaging: Efficient Training by Optimizing Historical Solutions
A parallel framework for large-scale training, efficient in both memory and computation, is designed for TWA or EMA, and it adapts better to different stages of training.
Feb 26, 2023