---
language: en
tags:
  - image-scoring
  - aesthetics
  - siglip2
  - pytorch
  - vision
license: apache-2.0
---

# AestheticSigLIP

An image aesthetic scorer that outputs a score from 1 to 10. It is built on **SigLIP 2 So400m NaFlex** (~430M params) and fine-tuned with multi-layer feature tapping, a SORD (soft ordinal) loss, and an auxiliary ranking loss.

## Usage

```bash
pip install torch torchvision pillow huggingface_hub
```

```python
# AestheticScorer lives in predict.py: clone the repo, or copy
# predict.py, model/, naflex.py, and config.py next to this script
from predict import AestheticScorer

scorer = AestheticScorer.from_pretrained("somepago/AestheticSigLIP")

# single image
score = scorer.rate("photo.jpg")          # float, e.g. 7.42

# batch
scores = scorer.rate(["a.jpg", "b.jpg"])  # [7.42, 3.81]

# PIL image directly
from PIL import Image
score = scorer.rate(Image.open("photo.jpg"))
```

**CLI:**
```bash
python predict.py photo.jpg photo2.jpg --repo somepago/AestheticSigLIP
```
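
To rank a whole folder, the batch form of `rate` shown above can be wrapped in a small helper. This is a minimal sketch, not part of the repo: `rank_images`, its `batch_size` default, and the extension list are hypothetical, and it assumes `rate` returns one float per path whenever it is given a list.

```python
from pathlib import Path


def rank_images(scorer, folder, batch_size=16,
                exts=(".jpg", ".jpeg", ".png", ".webp")):
    """Score every image in `folder`; return (path, score) pairs, best first.

    `scorer` is any object exposing rate(list_of_paths) -> list_of_floats,
    e.g. the AestheticScorer from the usage snippet (assumed behaviour).
    """
    paths = sorted(
        str(p) for p in Path(folder).iterdir() if p.suffix.lower() in exts
    )
    results = []
    # Score in chunks so a large folder does not become one huge batch.
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        scores = scorer.rate(batch)
        results.extend(zip(batch, scores))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

With a loaded scorer, `rank_images(scorer, "photos/")[:10]` would then give the ten highest-scoring files.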

## Score scale

| Score | Meaning |
|-------|---------|
| 1–2 | Blurry, broken, heavy watermarks |
| 3–4 | Bad framing, low effort, obvious AI slop |
| 5–6 | Generic, forgettable — typical web image |
| 7 | Good — clear intent, solid composition |
| 8 | Very good — strong visual impact |
| 9–10 | Exceptional — award-level work |

## Architecture

```
Image (any aspect ratio)
  ↓
NaFlex preprocessing — aspect-ratio-aware patching (max 256 patches, patch=16px)
  ↓
SigLIP 2 So400m encoder (27 transformer blocks, 1152-d, 16 heads)
  ├── tap layer  8 → masked mean pool → 1152-d
  ├── tap layer 17 → masked mean pool → 1152-d
  └── final pooled (multi-head attention probe) → 1152-d
  ↓
Concatenate [pool | tap8 | tap17] → 3456-d
  ↓
MLP head: 3456 → 768 → 256 → 8 bucket logits
  ↓
softmax → expected value over bucket centers → score ∈ [1, 10]
```
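
The last step of the diagram (softmax, then expected value over bucket centers) can be sketched in a few lines of dependency-free Python. The bucket centers below are placeholders chosen only to be non-uniform and to span 1 to 10; the values actually used by this checkpoint are not stated in this card.

```python
import math

# ASSUMPTION: illustrative centers for the 8 non-uniform buckets.
# The real centers belong to the repo's config and may differ.
BUCKET_CENTERS = [1.0, 3.0, 4.5, 5.5, 6.5, 7.5, 8.5, 10.0]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_score(logits, centers=BUCKET_CENTERS):
    """Turn 8 bucket logits into one score: sum_i p_i * center_i."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers))
```

Because the score is a probability-weighted average of the centers, it always lies between the smallest and largest center, which is why the output stays inside the 1 to 10 range.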

Multi-layer tapping lets early layers contribute low-level cues (color, sharpness) while later layers contribute composition and semantic quality.
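
As a rough illustration of the tapping step, the sketch below pools per-patch features with a mask and concatenates them in the `[pool | tap8 | tap17]` order from the diagram. It uses plain lists for readability and assumes the mask marks real patches with 1 and padded patch slots with 0; the function names are hypothetical, and the actual model would do this on batched tensors.

```python
def masked_mean_pool(tokens, mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [tok for tok, keep in zip(tokens, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(tok[d] for tok in kept) / len(kept) for d in range(dim)]


def build_head_input(pooled, tap8_tokens, tap17_tokens, mask):
    """Concatenate [pool | tap8 | tap17] into a single feature vector.

    With 1152-d features per branch this yields the 3456-d MLP input.
    """
    return (
        list(pooled)
        + masked_mean_pool(tap8_tokens, mask)
        + masked_mean_pool(tap17_tokens, mask)
    )
```

Masking matters because images with fewer patches than the budget leave padded slots; averaging over those would pull the pooled features toward the padding values.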

## Training

- **Base model**: google/siglip2-so400m-patch16-naflex (pretrained weights)
- **Training data**: ~134k images hand-curated from diverse web sources, with aesthetic scores annotated by Gemini Flash, spanning the following categories:
  - **Curated photography** — 21 thematic categories: landscape, portrait, wildlife, macro, architecture, fashion, food, street, product, astrophotography, and more
  - **General photography** — community-shared and web-sourced real photos (both high and low quality)
  - **AI-generated images** — outputs from multiple text-to-image diffusion model families
  - **AI community content** — social/community-shared AI art and renders
  - **Traditional & digital art** — oil paintings, watercolors, charcoal drawings, digital illustrations
  - **Graphic design & typography** — posters, logos, typographic layouts
  - **Stock/commercial imagery** — professional stock photography and product shots
  - **Negative examples** — images with heavy text overlays, watermarked content, low-quality web images, and broken/corrupt images
- **Loss**: SORD (soft ordinal) on 8 non-uniform score buckets + auxiliary ranking loss (λ=0.3)
- **Optimizer**: AdamW with layer-wise learning-rate decay (LLRD, factor 0.7); backbone LR 1e-5, head LR 1e-3
- **EMA**: exponential moving average of the weights (decay=0.9998); this checkpoint contains the EMA weights
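
The loss, LLRD, and EMA settings listed above can be made concrete with a small dependency-free sketch. Several details are assumptions rather than facts about this checkpoint: the bucket centers, squared distance as the SORD metric, a pairwise hinge as the ranking loss, and decay applied once per encoder block with the last block at the backbone LR. Only λ=0.3, the 0.7 factor, the learning rates, and the EMA decay come from the list.

```python
import math

# ASSUMPTION: illustrative centers for the 8 non-uniform buckets.
CENTERS = [1.0, 3.0, 4.5, 5.5, 6.5, 7.5, 8.5, 10.0]
RANK_WEIGHT = 0.3    # lambda from the training list
EMA_DECAY = 0.9998   # from the training list


def log_softmax(logits):
    peak = max(logits)
    lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - lse for x in logits]


def sord_targets(score, centers=CENTERS):
    """Soft ordinal label: softmax of negative squared distance to each center."""
    return [math.exp(v) for v in log_softmax([-(score - c) ** 2 for c in centers])]


def sord_loss(logits, score):
    """Cross-entropy between the soft target and the predicted distribution."""
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(sord_targets(score), log_probs))


def ranking_loss(pred_a, pred_b, true_a, true_b, margin=0.0):
    """Pairwise hinge: penalise a pair whose predicted order contradicts the labels."""
    if true_a == true_b:
        return 0.0
    sign = 1.0 if true_a > true_b else -1.0
    return max(0.0, margin - sign * (pred_a - pred_b))


def total_loss(sord, rank):
    return sord + RANK_WEIGHT * rank


def llrd_rates(num_layers=27, top_lr=1e-5, decay=0.7):
    """LR per encoder block (index 0 = earliest); the last block gets top_lr."""
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def ema_update(ema_value, param_value, decay=EMA_DECAY):
    """One EMA step for a single weight."""
    return decay * ema_value + (1.0 - decay) * param_value
```

Unlike a one-hot label, the SORD target gives neighbouring buckets non-zero mass, so predicting bucket 6 for a true 7 costs less than predicting bucket 2. In PyTorch, per-layer rates like these would typically be passed to the optimizer as separate parameter groups.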

## Evaluation vs Qwen3-VL-32B harsh critic (1967 images, 18 sources)

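SRCC is the Spearman rank correlation coefficient and MAE is the mean absolute error in score points, both computed against the Qwen3-VL-32B scores.

| Model | SRCC | MAE |
|---|---|---|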
| AestheticSigLIP | 0.671 | 1.04 |

Per-source SRCC highlights: `real-lq` 0.78, `real-hq` 0.64, `pinterest-curated` 0.65, `bad-text` 0.86. The model is weakest on generic AI art (`pickapic` 0.24, `playground` 0.15); these categories are underrepresented in training.
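
For readers who want to run the same kind of comparison on their own image set, the two metrics can be computed as below. This is a minimal stdlib sketch, not the evaluation script used for the table; SRCC is taken as the Pearson correlation of average ranks, which handles tied scores.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srcc(pred, ref):
    """Spearman rank correlation between model scores and reference scores."""
    return pearson(average_ranks(pred), average_ranks(ref))


def mae(pred, ref):
    """Mean absolute error in score points."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)
```

SRCC only looks at ordering, so a model can rank images well while being offset in absolute terms; reporting MAE alongside it captures that calibration gap.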

## Citation

If you would like to cite this work in an academic context, you can use this BibTeX snippet:

```bibtex
@misc{aestheticsiglip2026,
  author    = {Somepalli, Gowthami},
  title     = {AestheticSigLIP: Image Aesthetic Scoring on SigLIP 2 NaFlex},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/somepago/AestheticSigLIP}
}
```