T2I-ConBench: Text-to-Image Benchmark for Continual Post-training

May 22, 2025 · 3 min read

Zhehao Huang, Yuhang Liu, Yixin Lou, Zhengbao He, Mingzhen He, Wenxing Zhou, Tao Li, Kehan Li, Zeyi Huang, Xiaolin Huang
Figure: Overview of the T2I-ConBench benchmark for continual post-training of text-to-image diffusion models.
Abstract
Continual post-training adapts a single text-to-image diffusion model to a stream of new tasks without training separate models, but naive post-training causes severe forgetting of pretrained knowledge and harms zero-shot compositionality. To address the lack of a standardized evaluation protocol in this setting, this paper introduces T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and evaluates methods along four key dimensions: retention of generality, target-task performance, catastrophic forgetting, and cross-task generalization. The benchmark integrates automated metrics, human-preference modeling, and vision-language question answering into a comprehensive evaluation pipeline. Across three realistic task sequences and ten representative continual post-training methods, the authors show that no single approach dominates on all criteria, that oracle joint training is not universally optimal, and that cross-task generalization remains an open challenge. The authors release datasets, code, and tools to support future research in continual post-training for text-to-image diffusion models.
Type: Publication
Published in: arXiv preprint arXiv:2505.16875

Overview

This paper tackles continual post-training of large text-to-image diffusion models, where a single pretrained model must be sequentially adapted to new tasks while preserving its original capabilities and avoiding catastrophic forgetting.

To overcome the lack of a standard evaluation protocol, the authors propose T2I-ConBench, a unified benchmark for continual post-training that systematically measures how well methods retain general generative abilities, adapt to new tasks, prevent forgetting, and generalize across tasks.

Key Contributions

  • Unified benchmark for continual post-training: Introduces T2I-ConBench as a comprehensive benchmark specifically designed for continual post-training of text-to-image diffusion models, rather than generic continual learning or one-shot fine-tuning.
  • Two practical scenarios: Covers personalized item customization and domain enhancement as two realistic and complementary post-training settings, reflecting common deployment needs for T2I models.
  • Four evaluation dimensions: Evaluates methods along four axes: (1) retention of pretrained generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization, enabling nuanced comparison of methods’ stability-plasticity trade-offs.
  • Comprehensive automated evaluation pipeline: Combines standard image-text metrics, a learned human-preference model, and vision-language QA into an automated pipeline to approximate human judgment and assess both quality and alignment.
  • Systematic benchmarking of methods: Benchmarks ten representative continual post-training methods on three realistic task sequences, revealing that no single method is best on all metrics, that oracle joint training is not always ideal, and that cross-task generalization remains unsolved.
  • Open resources: Releases datasets, code, and evaluation tools to support future research and reproducible comparisons in continual post-training for text-to-image diffusion models.

Method

The T2I-ConBench framework formalizes continual post-training as the sequential adaptation of a single pretrained text-to-image diffusion model on a series of disjoint task datasets, without revisiting earlier data.
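In code, this protocol might look like the sketch below. All names here (the model, method, and `evaluate` helpers) are hypothetical placeholders rather than the benchmark's actual API; the point is the shape of the loop, not the implementation.

```python
# Hypothetical sketch of the continual post-training protocol; the model,
# task, and helper objects below are illustrative stubs, not the benchmark API.

def continual_post_train(model, method, tasks, evaluate):
    """Sequentially adapt one pretrained T2I model over a task stream.

    tasks are disjoint datasets seen once each; earlier data is not
    revisited unless the method itself maintains a replay buffer.
    Returns scores[t][j]: score on task j after finishing stage t.
    """
    scores = []
    for t, task in enumerate(tasks):
        method.adapt(model, task.train_data)  # method under test, in place
        # Evaluate on every task seen so far to expose forgetting.
        scores.append([evaluate(model, prior.eval_prompts)
                       for prior in tasks[: t + 1]])
    return scores
```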

Core components include:

  • Task sequences: Carefully designed continual task sequences that mix item customization and domain enhancement tasks, reflecting realistic streams of user demands and domain shifts.
  • Curated datasets: Diverse item-level datasets for personalized concept learning, and domain-specific datasets (e.g., synthetic or specialized domains) for enhancing generation quality and alignment.
  • Evaluation pipeline: An automated pipeline (one possible instantiation is sketched just after this list) that:
    • Uses standard T2I metrics for fidelity and alignment,
    • Employs a human-preference model to approximate subjective judgments,
    • Uses vision-language QA to test semantic consistency and compositional understanding.
  • Assessment across four axes: For each method and task sequence, T2I-ConBench measures the following, quantified in the sketch at the end of this section:
    1. Retention of generality – how well pretrained capabilities are preserved,
    2. Target-task performance – quality and alignment on the current task,
    3. Catastrophic forgetting – degradation on earlier tasks,
    4. Cross-task generalization – ability to compose concepts from multiple tasks in novel prompts.
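One possible instantiation of this pipeline, using off-the-shelf CLIP and BLIP-VQA checkpoints from Hugging Face as stand-ins for the first and third components (the paper's actual metric and QA models may differ, and the learned human-preference scorer is left as a comment):

```python
import torch
from PIL import Image
from transformers import (BlipForQuestionAnswering, BlipProcessor,
                          CLIPModel, CLIPProcessor)

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
vqa_proc = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")

def clip_alignment(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = clip_proc(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip_model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

def vqa_check(image: Image.Image, question: str) -> str:
    """Ask a question about the image to probe semantic consistency."""
    inputs = vqa_proc(image, question, return_tensors="pt")
    with torch.no_grad():
        ids = vqa_model.generate(**inputs)
    return vqa_proc.decode(ids[0], skip_special_tokens=True)

image = Image.open("generated_sample.png").convert("RGB")
alignment = clip_alignment(image, "a red backpack on a wooden table")
answer = vqa_check(image, "Is the backpack red?")  # expect "yes"
# A learned human-preference model would contribute a third score here.
```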

By fixing the base model, tasks, and evaluation protocol, T2I-ConBench isolates the effect of the continual post-training algorithm itself, enabling fair and reproducible comparison across methods.
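Given the score matrix collected in the training-loop sketch above, plus scores on pretraining-style prompts and on cross-task prompts, the four axes reduce to simple bookkeeping. A minimal sketch, assuming higher scores are better; the paper's exact metric definitions may differ:

```python
def summarize(scores, general, cross):
    """scores[t][j]: score on task j after stage t (for j <= t).
    general[t]: score on pretraining-style prompts after stage t,
    with general[0] measured on the unmodified pretrained model.
    cross[t]: score on prompts composing concepts from several tasks."""
    T = len(scores)
    retention = general[-1] / general[0]              # 1. generality preserved
    target = sum(scores[t][t] for t in range(T)) / T  # 2. average on-task score
    forgetting = sum(                                 # 3. average drop from peak
        max(scores[t][j] for t in range(j, T)) - scores[-1][j]
        for j in range(T - 1)
    ) / max(T - 1, 1)
    generalization = cross[-1]                        # 4. cross-task composition
    return retention, target, forgetting, generalization
```

The forgetting term follows the standard continual-learning convention of measuring the drop from each task's best score during the sequence to its score after the final stage.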

Citation

If you find this work useful for your research, please consider citing:

@article{huang2025t2iconbench,
  title   = {T2I-ConBench: Text-to-Image Benchmark for Continual Post-training},
  author  = {Huang, Zhehao and Liu, Yuhang and Lou, Yixin and He, Zhengbao and He, Mingzhen and Zhou, Wenxing and Li, Tao and Li, Kehan and Huang, Zeyi and Huang, Xiaolin},
  journal = {arXiv preprint arXiv:2505.16875},
  year    = {2025},
  doi     = {10.48550/arXiv.2505.16875}
}