Missing text decoder weights for caption generation in released conch_ViT-B-16 checkpoint?
Hi CONCH team, thank you for the fantastic work on this foundation model.
I have been experimenting with the conch_ViT-B-16 checkpoint and the custom OpenCLIP implementation provided in the repository. While the image/text embeddings and contrastive retrieval tasks work perfectly, I am running into issues when attempting to use the model for caption generation.
The Issue:
The published paper details that CONCH was trained on 1.17M image-caption pairs and evaluated on captioning tasks. Furthermore, the model object initialized via conch.open_clip_custom successfully exposes the text_decoder module and the .generate() method.
However, when passing an image tensor through .generate() and decoding the resulting token IDs using the CONCH tokenizer, the output does not reflect functional caption generation. It appears the generative text decoder weights are either omitted, frozen, or not functioning as intended in the currently available Hugging Face checkpoint.
The Note states that the decoder has been removed from the public release.
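One way to check this directly, rather than inferring it from the generated text, is to compare the parameter names the model expects against the names stored in the checkpoint. The sketch below is only an illustration: the helper `summarize_prefix_coverage` and the toy key names are hypothetical, and the `text_decoder` prefix is assumed from the module name mentioned above. It works on plain lists of key names, so it runs without downloading the model.

```python
def summarize_prefix_coverage(model_keys, checkpoint_keys, prefix="text_decoder"):
    """Report which parameters under `prefix` the model expects but the checkpoint lacks.

    model_keys:      parameter names the instantiated model exposes
    checkpoint_keys: parameter names stored in the checkpoint file
    prefix:          module prefix to inspect (assumed to be "text_decoder" here)
    """
    checkpoint_keys = set(checkpoint_keys)
    # Parameters the model defines under the given prefix.
    expected = {k for k in model_keys if k.startswith(prefix)}
    # Of those, the ones the checkpoint does not provide.
    missing = sorted(expected - checkpoint_keys)
    return {
        "expected": len(expected),
        "present": len(expected) - len(missing),
        "missing": missing,
    }


# Toy key names for illustration only; real names will differ.
model_keys = [
    "visual.proj",
    "text.token_embedding.weight",
    "text_decoder.layers.0.attn.weight",
    "text_decoder.layers.0.mlp.weight",
]
checkpoint_keys = [
    "visual.proj",
    "text.token_embedding.weight",
]

report = summarize_prefix_coverage(model_keys, checkpoint_keys)
print(report)
```

With the real model, the two inputs would presumably be `model.state_dict().keys()` and the keys of the state dict loaded from the checkpoint file (for example via `torch.load(path, map_location="cpu")`, assuming the file is a plain PyTorch state dict). If every expected `text_decoder` key is reported as missing, the decoder is running on randomly initialized weights, which would explain the non-functional captions. Note that `load_state_dict(strict=False)` does not raise on missing keys; it only returns them, so a loader that uses it can skip the decoder silently.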