After some experiments at a somewhat larger scale, the situation is unfortunately a bit more complex.
Mis-captioning still has a relatively low cost: even extreme mis-captioning has only a modest effect on the final result. However, the effect is noticeable when you prompt for your instance, even without the caption.
Also, when you use a caption in a prompt, the model basically reproduces the version it learned from your training data, so you need several examples of each caption.
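As a practical consequence, it can help to sanity-check the dataset before training. The sketch below groups (image, caption) pairs and flags captions with too few examples; the file names, trigger phrase, and threshold are illustrative assumptions, not from my actual runs.

```python
# Hypothetical dataset check: since the model tends to learn each caption
# from the specific images it appears with, flag captions that have too
# few training examples. MIN_EXAMPLES is an assumed threshold.
from collections import defaultdict

MIN_EXAMPLES = 4  # assumption; tune for your own setup

def under_represented(pairs, min_examples=MIN_EXAMPLES):
    """Group (image, caption) pairs and return captions with fewer
    than min_examples images."""
    by_caption = defaultdict(list)
    for image, caption in pairs:
        by_caption[caption].append(image)
    return {c: imgs for c, imgs in by_caption.items()
            if len(imgs) < min_examples}

# Toy manifest standing in for real training data.
pairs = [
    ("img01.png", "photo of sks person smiling"),
    ("img02.png", "photo of sks person smiling"),
    ("img03.png", "photo of sks person outdoors"),
]
print(under_represented(pairs))
```

Captions that show up only once or twice are the ones most likely to lock in a single training image's version.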