IIRC, DETR generates a sequence of queries to predict object boxes. I think this paradigm can be applied to such models; "Think before you locate" could be a new path to explore.
I know we are moving away from Reddit. However, if I don't link, I feel like we may miss out on good threads on r/MachineLearning. Moreover, the authors don't only post arXiv links; they also post other stuff such as summaries, key points, etc. (e.g. this).
So can I at least put them in the posts instead of posting in a comment?
The idea is similar to BLIP-2. Both papers use learnable tokens as queries for a transformer decoder, which attends over the vision feature space based on those trainable queries and the prompt.
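To make the mechanism concrete, here's a minimal numpy sketch of one cross-attention step where learnable query tokens attend over frozen vision features (the dimensions are illustrative assumptions, not the papers' exact configs; a real Q-Former also stacks self-attention, FFNs, and multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes (assumptions, not the papers' exact values).
d = 64            # embedding dimension
num_queries = 32  # learnable query tokens
num_patches = 257 # tokens from a frozen image encoder

rng = np.random.default_rng(0)
queries = rng.normal(size=(num_queries, d))  # learnable, trained end-to-end
vision = rng.normal(size=(num_patches, d))   # frozen vision features

# One cross-attention step: each query token attends over all vision tokens.
attn = softmax(queries @ vision.T / np.sqrt(d))  # (num_queries, num_patches)
out = attn @ vision                              # (num_queries, d)

print(out.shape)  # a fixed-size set of query outputs, regardless of image size
```

The point is that the number of query tokens, not the number of image patches, fixes the size of what gets passed downstream, which is what lets the prompt and queries jointly pull out just the relevant visual information.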