Florence-2 Finetuned for Medieval Illustration Description

This model is a finetuned version of the Microsoft Florence-2-base model, specifically trained to generate detailed descriptions of medieval illustrations based on provided image and text data.

Model Description

The base model, Florence-2, is a versatile vision-language model. This finetuned version was trained on a dataset of medieval illustrations. The dataset contained only key descriptive terms, which were transformed into plausible captions with OpenAI's gpt-4o-mini model. The goal of the finetuning was to improve the model's ability to generate accurate and relevant captions for this specific domain.

Training Data

The model was finetuned using a dataset containing images of medieval illustrations and key descriptive terms. The dataset was processed to normalize the terms and prepare the data for training.

Preprocessing:

Key descriptive terms from the Timel thesaurus were translated from French into English using the LLM Mistral-Small-24B-Instruct-2501.
These terms were then transformed into structured sentences (plausible captions) with OpenAI's gpt-4o-mini model.
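
The two preprocessing steps can be pictured with a minimal sketch. The model card does not publish the actual normalization rules or the prompts sent to the LLMs, so the function names `normalize_terms` and `build_caption_prompt`, the cleaning rules, and the prompt wording below are all illustrative assumptions. No API call is made.

```python
def normalize_terms(terms):
    """Lowercase, trim, and de-duplicate key terms, keeping their order (assumed rules)."""
    seen = set()
    cleaned = []
    for term in terms:
        t = " ".join(term.lower().split())  # collapse stray whitespace
        if t and t not in seen:
            seen.add(t)
            cleaned.append(t)
    return cleaned


def build_caption_prompt(terms):
    """Build a hypothetical instruction asking an LLM to turn key terms into one caption."""
    joined = ", ".join(normalize_terms(terms))
    return (
        "Write one plausible caption for a medieval illustration "
        f"that depicts the following elements: {joined}."
    )


# Made-up example terms, not taken from the training data.
example_terms = ["Knight ", "horse", "knight", "  castle"]
print(build_caption_prompt(example_terms))
```

The resulting string would then be sent to whichever chat model is used for caption generation; that call is left out because the source does not describe it.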

Usage

This model can be used for generating descriptions of medieval illustrations.

from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and processor
model = AutoModelForCausalLM.from_pretrained("francipaolo/florence-2-pal-comp-v1", trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("francipaolo/florence-2-pal-comp-v1", trust_remote_code=True)

# Run a Florence-2 task prompt on one image and return the parsed output
def run_example(task_prompt, text_input, image):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input

    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer


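# `data` is assumed to be a Hugging Face dataset loaded beforehand, with an "image" column.
# Any PIL Image object works instead, e.g. one opened with Image.open(...).
for idx in range(2):
    image = data["train"][idx]["image"]
    description = run_example("<MORE_DETAILED_CAPTION>", "", image)
    print(f"Generated Description: {description}")
    display(image.resize([350, 350]))  # display() is only available in Jupyter/IPython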
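
For caption tasks, `post_process_generation` typically returns a dictionary keyed by the task prompt rather than a bare string. A small helper, sketched here under that assumption, pulls out the caption text. `extract_caption` is a hypothetical name and not part of the Florence-2 processor.

```python
def extract_caption(parsed_answer, task_prompt="<MORE_DETAILED_CAPTION>"):
    """Return the caption string from a parsed answer (assumed shape: {task_prompt: text})."""
    if isinstance(parsed_answer, dict):
        text = parsed_answer.get(task_prompt, "")
    else:
        text = parsed_answer  # fall back to treating the value as plain text
    return str(text).strip()


# Stand-in for the value returned by run_example(); the caption text is made up.
sample = {"<MORE_DETAILED_CAPTION>": " A knight on horseback rides toward a castle. "}
print(extract_caption(sample))
```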


Model size: 0.3B params
Tensor type: F32
Format: Safetensors


Model tree for francipaolo/florence-2-pal-comp-v2

Base model: microsoft/Florence-2-base
Finetuned: this model