---
language: en
tags:
- image-scoring
- aesthetics
- siglip2
- pytorch
- vision
license: apache-2.0
---

# AestheticSigLIP

Image aesthetic scorer (1–10) built on **SigLIP 2 So400m NaFlex** (~430M params), fine-tuned with multi-layer feature tapping, SORD loss, and ranking loss.

## Usage

```bash
pip install torch torchvision pillow huggingface_hub
```

```python
from huggingface_hub import hf_hub_download

# clone or copy predict.py + model/ + naflex.py + config.py from the repo
from predict import AestheticScorer

scorer = AestheticScorer.from_pretrained("somepago/AestheticSigLIP")

# single image
score = scorer.rate("photo.jpg")  # float, e.g. 7.42

# batch
scores = scorer.rate(["a.jpg", "b.jpg"])  # [7.42, 3.81]

# PIL image directly
from PIL import Image
score = scorer.rate(Image.open("photo.jpg"))
```

**CLI:**

```bash
python predict.py photo.jpg photo2.jpg --repo somepago/AestheticSigLIP
```

## Score scale

| Score | Meaning |
|-------|---------|
| 1–2 | Blurry, broken, heavy watermarks |
| 3–4 | Bad framing, low effort, obvious AI slop |
| 5–6 | Generic, forgettable: typical web image |
| 7 | Good: clear intent, solid composition |
| 8 | Very good: strong visual impact |
| 9–10 | Exceptional: award-level work |

## Architecture

```
Image (any aspect ratio)
  ↓
NaFlex preprocessing: aspect-ratio-aware patching (max 256 patches, patch=16px)
  ↓
SigLIP 2 So400m encoder (27 transformer blocks, 1152-d, 16 heads)
  ├── tap layer 8  → masked mean pool → 1152-d
  ├── tap layer 17 → masked mean pool → 1152-d
  └── final pooled (multi-head attention probe) → 1152-d
  ↓
Concatenate [pool | tap8 | tap17] → 3456-d
  ↓
MLP head: 3456 → 768 → 256 → 8 bucket logits
  ↓
softmax → expected value over bucket centers → score ∈ [1, 10]
```

Multi-layer tapping lets early layers contribute low-level cues (color, sharpness) while later layers contribute composition and semantic quality.

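The last step of the pipeline (softmax over the 8 bucket logits, then an expected value over bucket centers) can be sketched in plain Python. This is a minimal illustration and not the repository's implementation: the bucket centers are hypothetical, since this card only states that there are 8 non-uniform buckets, and `softmax` and `expected_score` are illustrative names.

```python
import math

# Hypothetical bucket centers. The model card says only that there are
# 8 non-uniform buckets on the 1-10 scale; these values are made up.
BUCKET_CENTERS = [1.0, 2.5, 4.0, 5.5, 6.5, 7.5, 8.5, 10.0]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(logits, centers=BUCKET_CENTERS):
    """Turn bucket logits into one score: sum of p_i * center_i."""
    if len(logits) != len(centers):
        raise ValueError("need exactly one logit per bucket center")
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers))


# Uniform logits give the plain mean of the centers.
print(expected_score([0.0] * 8))  # 5.6875
```

Because the score is a probability-weighted average, it always lies between the smallest and largest center, which is how a classification-style head can produce a continuous value on the 1–10 scale.
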
## Training

- **Base model**: google/siglip2-so400m-patch16-naflex (pretrained weights)
- **Training data**: ~134k images hand-curated from diverse web sources, with aesthetic scores annotated by Gemini Flash, spanning the following categories:
  - **Curated photography**: 21 thematic categories (landscape, portrait, wildlife, macro, architecture, fashion, food, street, product, astrophotography, and more)
  - **General photography**: community-shared and web-sourced real photos (both high and low quality)
  - **AI-generated images**: outputs from multiple text-to-image diffusion model families
  - **AI community content**: social/community-shared AI art and renders
  - **Traditional & digital art**: oil paintings, watercolors, charcoal drawings, digital illustrations
  - **Graphic design & typography**: posters, logos, typographic layouts
  - **Stock/commercial imagery**: professional stock photography and product shots
  - **Negative examples**: images with heavy text overlays, watermarked content, low-quality web images, and broken/corrupt images
- **Loss**: SORD (soft ordinal) on 8 non-uniform score buckets + auxiliary ranking loss (λ=0.3)
- **Optimizer**: AdamW with LLRD (layer-wise LR decay 0.7×), backbone LR 1e-5, head LR 1e-3
- **EMA**: exponential moving average (decay=0.9998), used for this checkpoint

## Evaluation

Compared against a Qwen3-VL-32B harsh critic (1967 images, 18 sources):

| Model | SRCC | MAE |
|---|---|---|
| AestheticSigLIP | 0.671 | 1.04 |

Per-source highlights: `real-lq/hq` (SRCC 0.78/0.64), `pinterest-curated` (0.65), `bad-text` (0.86). Weakest on generic AI art (`pickapic` 0.24, `playground` 0.15); these categories are underrepresented in training.

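For readers who want to compute the same two metrics on their own image set, here is a minimal, dependency-free sketch of SRCC (Spearman rank correlation, with average ranks for ties) and MAE. The function names are illustrative and the evaluation script actually used for the numbers above is not shown in this card, so treat this as one standard way to compute them and not as the author's code.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def srcc(pred, ref):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(pred), average_ranks(ref))


def mae(pred, ref):
    """Mean absolute error between predicted and reference scores."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)
```

SRCC only looks at ordering, so it rewards ranking images correctly even when the absolute scores are offset, while MAE measures how far the scores are from the reference on the 1–10 scale. Reporting both separates ranking quality from calibration.
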
## Citation

If you would like to cite this work in an academic context, you can use this BibTeX snippet:

```bibtex
@misc{aestheticsiglip2026,
  author = {Somepalli, Gowthami},
  title = {AestheticSigLIP: Image Aesthetic Scoring on SigLIP 2 NaFlex},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/somepago/AestheticSigLIP}
}
```