T2I-ConBench: Text-to-Image Benchmark for Continual Post-training

¹Shanghai Jiao Tong University, ²Huawei
*Equal Contribution, ^Project Leader, $Corresponding Author

Abstract

Continual post‑training adapts a single text‑to‑image diffusion model to learn new tasks without incurring the cost of separate models, but naïve post-training causes forgetting of pretrained knowledge and undermines zero‑shot compositionality. We observe that the absence of a standardized evaluation protocol hampers research on continual post‑training. To address this, we introduce T2I‑ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human‑preference modeling, and vision‑language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed on every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post‑training for text‑to‑image models.

Key Takeaways

No single method excels everywhere.

"Oracle" joint learning is not a panacea.

Cross-task generalization remains an open challenge.

Experiment Results

Performance of continual post‑training methods on the sequential item‑customization task using PixArt‑α. ↑: higher is better. ↓: lower is better. "I+I" denotes Item‑Item cross‑task generalization. Excluding Base and Joint, the best result is shown in bold and the second‑best is underlined. For all metrics except Forget, red cells mark significant degradation (a drop of more than 5% below Base), while green cells mark significant outperformance of the traditional "oracle" (an increase of more than 5% above Joint).

Performance of continual post‑training methods on the sequential domain‑enhancement task using PixArt‑α. ↑: higher is better. ↓: lower is better. "D+D" denotes Domain‑Domain cross‑task generalization. Excluding Base and Joint, the best result is shown in bold and the second‑best is underlined. For all metrics except Forget, red cells mark significant degradation (a drop of more than 5% below Base), while green cells mark significant outperformance of the traditional "oracle" (an increase of more than 5% above Joint).
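The cell-marking conventions described in the table captions can be sketched in a few lines of Python. This is only an illustrative reading, not the benchmark's released code: the function names `highlight` and `rank_marks` are hypothetical, the 5% threshold is assumed to be relative to the reference value (the captions do not say whether it is relative or absolute), red is assumed to take precedence when both conditions hold, and the method names and scores in the usage example are made up.

```python
from typing import Dict, Optional


def highlight(value: float, base: float, joint: Optional[float] = None,
              higher_is_better: bool = True, threshold: float = 0.05) -> str:
    """Return "red", "green", or "none" for one table cell.

    Assumption: the 5% margin is relative to the reference (Base or Joint).
    """
    # Flip signs for lower-is-better metrics so that "larger" always means "better".
    sign = 1.0 if higher_is_better else -1.0
    v, b = sign * value, sign * base
    # Red: more than `threshold` worse than Base.
    if v < b - threshold * abs(b):
        return "red"
    # Green: more than `threshold` better than Joint (skipped when Joint is not a target).
    if joint is not None:
        j = sign * joint
        if v > j + threshold * abs(j):
            return "green"
    return "none"


def rank_marks(scores: Dict[str, float], higher_is_better: bool = True) -> Dict[str, str]:
    """Mark the best method "bold" and the second-best "underline", excluding Base and Joint."""
    candidates = {k: v for k, v in scores.items() if k not in ("Base", "Joint")}
    ordered = sorted(candidates, key=candidates.get, reverse=higher_is_better)
    marks = {name: "" for name in scores}
    if ordered:
        marks[ordered[0]] = "bold"
    if len(ordered) > 1:
        marks[ordered[1]] = "underline"
    return marks


# Illustrative values only; these are not results from the benchmark.
scores = {"Base": 0.80, "Joint": 0.85, "MethodA": 0.70, "MethodB": 0.95, "MethodC": 0.78}
for name, score in scores.items():
    print(name, highlight(score, scores["Base"], scores["Joint"]), rank_marks(scores)[name])
```

For the mixed item-domain tables, where Joint is not used as the target to surpass, passing `joint=None` disables the green rule and leaves only the red check against Base.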

Item-domain adaptation (Order 1) results

Performance of continual post‑training methods on the sequential item‑domain adaptation task (Order 1) using PixArt-α. ↑: higher is better. ↓: lower is better. "I" and "D" denote Item and Domain, respectively, and their combinations denote cross‑task generalization evaluations. Excluding Base and Joint, the best result is shown in bold and the second‑best is underlined. For all metrics except Forget, red cells mark significant degradation (a drop of more than 5% below Base). Because the traditional "oracle" Joint performs poorly in this mixed adaptation scenario, it is not used as the target to surpass.

Item-domain adaptation (Order 2) results

Performance of continual post‑training methods on the sequential item‑domain adaptation task (Order 2) using PixArt-α. ↑: higher is better. ↓: lower is better. "I" and "D" denote Item and Domain, respectively, and their combinations denote cross‑task generalization evaluations. Excluding Base and Joint, the best result is shown in bold and the second‑best is underlined. For all metrics except Forget, red cells mark significant degradation (a drop of more than 5% below Base). Because the traditional "oracle" Joint performs poorly in this mixed adaptation scenario, it is not used as the target to surpass.

Evaluation Pipeline

Evaluation pipeline of item customization.

Evaluation pipeline of cross-task generalization.

Continual Post-training Methods

Overview of the continual post‑training baselines evaluated in this work, encompassing rehearsal‑based, regularization‑based, and parameter‑isolation methods (sparse fine‑tuning and low‑rank adaptation).
