Missing text decoder weights for caption generation in released conch_ViT-B-16 checkpoint?

#35

by prakashroy - opened 7 days ago

Hi CONCH team, thank you for the fantastic work on this foundation model.
I have been experimenting with the conch_ViT-B-16 checkpoint and the custom OpenCLIP implementation provided in the repository. While the image/text embeddings and contrastive retrieval tasks work perfectly, I am running into issues when attempting to use the model for caption generation.
The Issue:
The published paper details that CONCH was trained on 1.17M image-caption pairs and evaluated on captioning tasks. Furthermore, the model object initialized via conch.open_clip_custom successfully exposes the text_decoder module and the .generate() method.
However, when I pass an image tensor through .generate() and decode the resulting token IDs with the CONCH tokenizer, the output is not a usable caption. It appears that the generative text decoder weights are either omitted, frozen, or not functioning as intended in the checkpoint currently available on Hugging Face.
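A minimal sketch of how one might check this: compare the parameter names the model expects against the names stored in the checkpoint, and report any under the decoder prefix that are absent. The `text_decoder.` prefix is an assumption based on the module name mentioned above, the helper name `missing_keys_by_prefix` is hypothetical, and the toy key names are made up for illustration. With a real model, the two key lists would come from `model.state_dict().keys()` and from the keys of the loaded checkpoint.

```python
def missing_keys_by_prefix(expected_keys, checkpoint_keys, prefix="text_decoder."):
    """Return the expected parameter names under `prefix` that the checkpoint lacks.

    `prefix` is an assumption taken from the module name in the post,
    not a confirmed naming scheme of the released checkpoint.
    """
    present = set(checkpoint_keys)
    return sorted(
        key for key in expected_keys
        if key.startswith(prefix) and key not in present
    )


if __name__ == "__main__":
    # Toy key names for illustration only; real names will differ.
    expected = [
        "visual.proj",
        "text.token_embedding.weight",
        "text_decoder.layers.0.weight",
        "text_decoder.layers.0.bias",
    ]
    checkpoint = ["visual.proj", "text.token_embedding.weight"]

    missing = missing_keys_by_prefix(expected, checkpoint)
    print(f"{len(missing)} decoder parameter(s) missing from checkpoint:")
    for name in missing:
        print("  ", name)
```

If the list is non-empty and the checkpoint was loaded non-strictly, those parameters would keep their initial values, which could explain unusable output from .generate().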

mdhrbajpai

6 days ago

The Note mentions that the decoder was removed from the public release.
