Florence-2 Finetuned for Medieval Illustration Description
This model is a finetuned version of the Microsoft Florence-2-base model, specifically trained to generate detailed descriptions of medieval illustrations based on provided image and text data.
Model Description
The base model, Florence-2, is a versatile visual language model. This finetuned version has been trained on a dataset of medieval illustrations. The dataset contained only key descriptive terms, which were transformed into plausible captions with OpenAI's gpt-4o-mini model. The goal of the finetuning was to improve the model's ability to generate accurate and relevant captions for this specific domain.
Training Data
The model was finetuned using a dataset containing images of medieval illustrations and key descriptive terms. The dataset was processed to normalize the terms and prepare the data for training.
Preprocessing:
- Key descriptive terms from the Timel thesaurus were translated from French into English using the Mistral-Small-24B-Instruct-2501 LLM.
- These terms were then transformed into structured sentences (plausible captions) using OpenAI's gpt-4o-mini model.
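The preprocessing steps can be pictured with a small sketch. The original pipeline code is not part of this card, so everything below is an assumption made for illustration: the helper names `normalize_terms` and `build_caption_prompt`, the normalization rules, and the prompt wording are hypothetical, and no translation or LLM call is actually made.

```python
def normalize_terms(terms):
    """Lowercase, collapse whitespace, and drop empty or duplicate terms, keeping order."""
    seen = set()
    cleaned = []
    for term in terms:
        t = " ".join(term.split()).lower()
        if t and t not in seen:
            seen.add(t)
            cleaned.append(t)
    return cleaned


def build_caption_prompt(terms):
    """Build a hypothetical instruction asking an LLM to turn key terms into one caption."""
    term_list = ", ".join(normalize_terms(terms))
    return (
        "Write one plausible caption for a medieval illustration "
        f"that depicts the following elements: {term_list}."
    )


# Illustrative terms only; these are not taken from the actual dataset.
example_terms = ["Knight ", "horse", "  castle", "knight"]
prompt = build_caption_prompt(example_terms)
print(prompt)
```

The resulting string would then be sent to a caption-writing model such as gpt-4o-mini; that call is left out here because its exact form in the original pipeline is not documented.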
Usage
This model can be used for generating descriptions of medieval illustrations.
```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image
from IPython.display import display  # notebook-only helper used at the end

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and processor
model = AutoModelForCausalLM.from_pretrained("francipaolo/florence-2-pal-comp-v1", trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("francipaolo/florence-2-pal-comp-v1", trust_remote_code=True)

# Run one Florence-2 task on a single image
def run_example(task_prompt, text_input, image):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input

    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    # Special tokens are kept so that post_process_generation can parse and strip them
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

# `data` is assumed to be an already loaded dataset with a 'train' split and an 'image' column
for idx in range(2):
    image = data['train'][idx]['image']  # pass any PIL Image object (converted to RGB)
    description = run_example('<MORE_DETAILED_CAPTION>', '', image)
    print(f"Generated Description: {description}")
    display(image.resize([350, 350]))
```
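`post_process_generation` returns a dictionary keyed by the task prompt rather than a bare string, so the caption usually has to be pulled out before it is stored or compared. A minimal sketch, assuming the output has the shape `{'<MORE_DETAILED_CAPTION>': '...'}`; the helper name `extract_caption` and the sample text are hypothetical.

```python
def extract_caption(parsed_answer, task_prompt="<MORE_DETAILED_CAPTION>"):
    """Return the caption string from a parsed answer, or '' if it is missing."""
    if isinstance(parsed_answer, dict):
        value = parsed_answer.get(task_prompt, "")
    else:
        # Fall back to treating the answer itself as the caption text
        value = parsed_answer
    if value is None:
        return ""
    return str(value).strip()


# Made-up sample output, used only to show the expected shape
sample = {"<MORE_DETAILED_CAPTION>": "  A knight on horseback approaches a castle. "}
caption = extract_caption(sample)
print(caption)
```

With the usage code, this would be called as `extract_caption(description)` on the value returned by `run_example`.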
Model tree for francipaolo/florence-2-pal-comp-v2
Base model
microsoft/Florence-2-base