CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning

ICCV 2025
OMRON SINIC X Corporation · Korea University · University of Seoul

Overview

CaptionSmiths is a controllable image captioning framework that allows smooth adjustment of caption properties such as length, descriptiveness, and word uniqueness within a single model. Unlike existing models, which lack explicit conditioning and struggle to transition smoothly between styles, CaptionSmiths quantifies these properties as continuous scalar values and interpolates between learned endpoint representations (e.g., very short ↔ very long), enabling fine-grained control over caption style. Experiments show that CaptionSmiths not only improves lexical alignment but also achieves more than a fivefold reduction in caption length control error compared to strong baselines.
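As a rough illustration of this interpolation, here is a minimal PyTorch sketch; `ConditionEncoder` and its endpoint parameters are hypothetical names for exposition, not the paper's implementation:

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Hypothetical sketch of the interpolation idea: map a scalar
    condition in [0, 1] to a single token embedding by linearly
    interpolating two learned endpoint embeddings
    (e.g., "very short" <-> "very long")."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.endpoint_low = nn.Parameter(torch.randn(embed_dim))   # e.g., very short
        self.endpoint_high = nn.Parameter(torch.randn(embed_dim))  # e.g., very long

    def forward(self, value: torch.Tensor) -> torch.Tensor:
        # value: (batch,) scalars in [0, 1] -> (batch, embed_dim) embeddings.
        w = value.clamp(0.0, 1.0).unsqueeze(-1)
        return (1.0 - w) * self.endpoint_low + w * self.endpoint_high

encoder = ConditionEncoder(embed_dim=4096)
length_tokens = encoder(torch.tensor([0.1, 0.5, 0.9]))  # short, medium, long
```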

CaptionSmiths

Method

Overview during training (left) and inference (right). Left: we compute conditioning values for each caption with the Condition Calculator and convert them into token embeddings via Condition Encoding; the learnable parameters are then trained to control the language pattern of the output caption. Right: at inference, users can either manually specify the conditioning scalars or provide an example sentence whose language pattern should be imitated.
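To make the Condition Calculator concrete, the sketch below computes two illustrative conditioning values per caption, assuming length is the token count and uniqueness is the mean inverse document frequency over a reference corpus; the paper's exact definitions may differ:

```python
import math
from collections import Counter

def condition_values(caption: str, doc_freq: Counter, num_docs: int) -> dict:
    """Illustrative Condition Calculator (not the paper's exact definitions).
    Length is the token count; uniqueness is the mean inverse document
    frequency of the tokens over a reference corpus, so rarer (more
    fine-grained) words yield a higher score."""
    tokens = caption.lower().split()
    idf = [math.log(num_docs / (1 + doc_freq[t])) for t in tokens]
    return {
        "length": len(tokens),
        "uniqueness": sum(idf) / max(len(idf), 1),
    }

# Toy corpus statistics: doc_freq[word] = number of captions containing it.
doc_freq = Counter({"a": 90, "dog": 40, "dachshund": 2})
print(condition_values("a dachshund", doc_freq, num_docs=100))
```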

Results

Evaluating descriptiveness control

Figure: Increasing the input value corresponding to descriptiveness improves CLIPScore.

Figure: Results of varying the descriptiveness score in caption generation.


The graph above shows the quantitative results of controlling the descriptiveness of captions: increasing the conditioning value improves CLIPScore. The figure shows examples of captions generated while varying the descriptiveness condition; larger values tend to produce more descriptive captions.
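For context, CLIPScore (Hessel et al., 2021) scores a caption by the CLIP image-text cosine similarity, scaled and clipped at zero. A minimal sketch with Hugging Face Transformers follows; the `openai/clip-vit-base-patch32` checkpoint is an assumed example backbone, not necessarily the one used in the paper:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """CLIPScore = 2.5 * max(cos(image_embed, text_embed), 0)."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img, txt).item()
    return 2.5 * max(cos, 0.0)
```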

Evaluating uniqueness control

Figure: Results of varying the uniqueness score in caption generation.


CaptionSmiths also controls the uniqueness of the vocabulary in output captions. The example above shows the model producing fine-grained words to describe entities in the images.
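One simple way to verify that raising the uniqueness condition changes the vocabulary is to measure n-gram diversity of the generated captions. The sketch below uses distinct-n, a generic diversity measure and not necessarily the paper's uniqueness definition:

```python
def distinct_n(captions: list[str], n: int = 1) -> float:
    """Generic vocabulary-diversity check: the ratio of unique n-grams
    to total n-grams across a set of generated captions."""
    ngrams = []
    for cap in captions:
        toks = cap.lower().split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

# Higher uniqueness conditions should push distinct-1 upward.
print(distinct_n(["a dog on grass", "a dachshund on a lawn"]))
```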

Evaluation on lexical alignment

| Models | Parameter Size | MSCOCO (Short) | LN COCO (Middle) | DOCCI (Long) |
| --- | --- | --- | --- | --- |
| LLaVA-1.5 | 7.1B | 0.0 | 1.1 | 2.4 |
| BLIP-3 | 4.6B | 57.5 | 2.0 | 3.6 |
| Qwen2-VL-7B | 8.3B | 84.3 | 3.8 | 5.5 |
| Vanilla Supervised | 21.3B | 96.6 | 23.6 | 1.4 |
| Fine-tuned LLaVA-1.5 | 21.3B | 98.3 | 23.9 | 9.1 |
| Vanilla | 7.1B | 13.2 | 21.7 | 7.5 |
| Concap | 7.1B | 95.9 | 23.5 | 8.3 |
| CaptionSmiths | 7.1B | 104.8 | 37.4 | 29.7 |
CaptionSmiths is a single model with 7.1B parameters, yet it handles the generation of captions across all three styles; the 21.3B baselines amount to three separate 7.1B models, one per dataset.

Citation

# arXiv version