CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning
ICCV 2025

CaptionSmiths is a controllable image captioning framework that allows smooth adjustment of caption properties, such as length, descriptiveness, and word uniqueness, within a single model. Existing models lack explicit conditioning and struggle to transition smoothly between styles. CaptionSmiths instead quantifies each property as a continuous scalar value and interpolates between learned endpoint representations (e.g., very short ↔ very long), which enables fine-grained control over caption style. Experiments show that CaptionSmiths not only improves lexical alignment, but also reduces the error in controlling caption length by over 500% compared to strong baselines.
Figure: Overview of training (left) and inference (right). Left: the Condition Calculator computes conditioning values for each caption. Condition Encoding converts these values into token embeddings, and the learnable parameters are trained to control the language pattern of the output caption. Right: at inference, users can either specify the conditioning scalars manually or provide an example sentence whose language pattern the caption should follow.
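The conditioning pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes caption length is measured as a whitespace word count, that raw values are min-max normalized to [0, 1], and that the two endpoint representations are combined by linear interpolation. The names `ConditionEndpoints`, `caption_length`, `normalize`, and `encode_condition` are hypothetical, and the endpoint vectors are plain lists standing in for learnable embeddings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConditionEndpoints:
    """Two endpoint vectors for one property (hypothetical stand-in
    for learnable embeddings, e.g. 'very short' and 'very long')."""
    low: List[float]
    high: List[float]


def caption_length(caption: str) -> int:
    """Assumed length measure: number of whitespace-separated words."""
    return len(caption.split())


def normalize(value: float, vmin: float, vmax: float) -> float:
    """Map a raw property value to [0, 1], clamping out-of-range inputs."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    t = (value - vmin) / (vmax - vmin)
    return min(1.0, max(0.0, t))


def encode_condition(t: float, ends: ConditionEndpoints) -> List[float]:
    """Linearly interpolate between the two endpoint vectors (assumed scheme)."""
    return [(1.0 - t) * lo + t * hi for lo, hi in zip(ends.low, ends.high)]


# Toy 2-dimensional endpoints; real embeddings would match the model width.
length_ends = ConditionEndpoints(low=[0.0, 1.0], high=[1.0, 0.0])

# Training side: derive the condition from a ground-truth caption.
caption = "a brown dog runs across a grassy field"
t_train = normalize(caption_length(caption), vmin=0, vmax=16)
print(t_train, encode_condition(t_train, length_ends))

# Inference side: a user picks the scalar directly.
print(encode_condition(0.25, length_ends))
```

Because the condition is a continuous scalar rather than a discrete style label, any value between the endpoints yields a valid conditioning vector, which is what allows smooth transitions such as gradually lengthening a caption.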
Evaluating descriptiveness control
Figure: Increasing the input value corresponding to descriptiveness improves CLIPScore.
Figure: Results of varying descriptiveness score in caption generation.
Evaluating uniqueness control
Figure: Results of varying uniqueness score in caption generation.
Evaluation on lexical alignment
Models | Parameter Size | MSCOCO (Short) | LN COCO (Middle) | Docci (Long) |
---|---|---|---|---|
LLaVA-1.5 | 7.1B | 0.0 | 1.1 | 2.4 |
Blip-3 | 4.6B | 57.5 | 2.0 | 3.6 |
Qwen2-VL-7B | 8.3B | 84.3 | 3.8 | 5.5 |
Vanilla Supervised | 21.3B | 96.6 | 23.6 | 1.4 |
Fine-tuned LLaVA-1.5 | 21.3B | 98.3 | 23.9 | 9.1 |
Vanilla | 7.1B | 13.2 | 21.7 | 7.5 |
Concap | 7.1B | 95.9 | 23.5 | 8.3 |
CaptionSmiths | 7.1B | 104.8 | 37.4 | 29.7 |
# arXiv version