MuRIL Indian Address NER v1

Fine-tuned google/muril-base-cased for Indian address component detection in Hindi (Devanagari), English, and Hinglish (Roman-script Hindi-English code-mix).

Labels

| Tag | Meaning | Example |
|---|---|---|
| ADDRESS_HOUSE | House / flat / door / plot number | "H.No. 12", "मकान नं. 21", "Flat 4B" |
| ADDRESS_BUILDING | Building / apartment / society name | "Prestige Residency" |
| ADDRESS_STREET | Street, road, lane, gali, marg | "MG Road", "गली नं. 4" |
| ADDRESS_LANDMARK | Landmark anchor ("near / opposite X") | "near Apollo Hospital", "मंदिर के पास" |
| ADDRESS_LOCALITY | Area, colony, nagar, mohalla, sector | "Koramangala", "Gandhi Nagar" |
| ADDRESS_CITY | City or town | "Bengaluru", "बेंगलुरु" |
| ADDRESS_STATE | State or union territory | "Karnataka", "कर्नाटक" |
| ADDRESS_PIN | 6-digit Indian PIN code (optional) | "560001" |

A PIN code is not required: the model recognises addresses that do not include one.
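
Since ADDRESS_PIN is defined as a 6-digit code, a downstream consumer may want to sanity-check predicted spans before trusting them. The following is a minimal sketch, not part of the model or its repository: the helper name `looks_like_pin` is hypothetical, it checks format only (not whether the code is actually assigned), and it handles ASCII digits only. It relies on the fact that Indian PIN codes do not begin with 0, and it tolerates the common "560 001" spacing.

```python
import re

# Six ASCII digits, first digit 1-9, with an optional single space
# after the third digit (e.g. "560001" or "560 001").
_PIN_RE = re.compile(r"[1-9][0-9]{2} ?[0-9]{3}")


def looks_like_pin(text: str) -> bool:
    """Return True if `text` is shaped like an Indian PIN code (format check only)."""
    return _PIN_RE.fullmatch(text.strip()) is not None


print(looks_like_pin("560001"))   # well-formed
print(looks_like_pin("56001"))    # too short
```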

Performance (Benchmark v1 — 2026-04-20)

Evaluated on a held-out slice of data/generated/address_benchmark_v1 (synthetic Indian addresses across Devanagari, English, and Hinglish).

| Entity | Precision | Recall | F1 |
|---|---|---|---|
| ADDRESS_LANDMARK | 0.998 | 1.000 | 0.999 |
| ADDRESS_STREET | 0.996 | 1.000 | 0.998 |
| ADDRESS_HOUSE | 0.995 | 1.000 | 0.998 |
| ADDRESS_PIN | 0.995 | 1.000 | 0.997 |
| ADDRESS_BUILDING | 0.991 | 1.000 | 0.996 |
| ADDRESS_CITY | 0.968 | 0.981 | 0.974 |
| ADDRESS_STATE | 0.752 | 0.858 | 0.801 |
| ADDRESS_LOCALITY | 0.702 | 0.851 | 0.769 |
| Overall | 0.864 | 0.925 | 0.815 |
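
For readers who want to check the per-entity figures, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values in the table; the results agree with the reported F1 to within about 0.001, which is what one would expect when the inputs are themselves rounded to three decimals. The Overall row is left out because the card does not state how it is aggregated.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the per-entity rows of the table.
REPORTED = {
    "ADDRESS_LANDMARK": (0.998, 1.000, 0.999),
    "ADDRESS_STREET": (0.996, 1.000, 0.998),
    "ADDRESS_HOUSE": (0.995, 1.000, 0.998),
    "ADDRESS_PIN": (0.995, 1.000, 0.997),
    "ADDRESS_BUILDING": (0.991, 1.000, 0.996),
    "ADDRESS_CITY": (0.968, 0.981, 0.974),
    "ADDRESS_STATE": (0.752, 0.858, 0.801),
    "ADDRESS_LOCALITY": (0.702, 0.851, 0.769),
}

# Largest gap between the recomputed and the reported F1 across entities.
max_gap = max(abs(f1(p, r) - reported) for p, r, reported in REPORTED.values())
print(f"largest gap between recomputed and reported F1: {max_gap:.4f}")
```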

Training: A100-SXM4-40GB, 4 epochs, 26,728 examples, no overfitting detected.

Limitations

  • Trained on synthetic data only (v1), so the benchmark figures have not been validated on real-world addresses. Real-world performance is expected to improve in v2, which is planned to add ai4bharat/naamapadam and lince-benchmark/lince supervision.
  • ADDRESS_LOCALITY precision (0.702) is lower than that of most other entities: the model over-predicts locality in Devanagari prose outside an address context.
  • Coverage is limited to the in-repo gazetteer (~50 cities, 33 states, 100+ localities).
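
One possible way to soften the locality over-prediction noted above is to discard ADDRESS_LOCALITY spans that appear with no other address entity in the same text. This is only a heuristic sketch and has not been evaluated against the benchmark: the function name `drop_isolated_localities` is hypothetical, and it assumes the input is a list of dicts with an `entity_group` key, as returned by the `transformers` token-classification pipeline when an aggregation strategy is set, with group names matching the tags in the Labels table.

```python
from typing import Any, Dict, List

LOCALITY = "ADDRESS_LOCALITY"


def drop_isolated_localities(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Remove locality spans when the text contains no other address entity.

    Rationale (assumption): a lone locality hit in running prose is more
    likely to be a false positive than one that sits next to a street,
    city, or PIN.
    """
    has_other_address_entity = any(
        e["entity_group"].startswith("ADDRESS_") and e["entity_group"] != LOCALITY
        for e in entities
    )
    if has_other_address_entity:
        return list(entities)
    return [e for e in entities if e["entity_group"] != LOCALITY]
```

The trade-off is that a genuine address consisting only of a locality name would also be dropped, so the filter suits prose-heavy inputs better than form fields.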

Usage

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="mukuls9971/muril-indian-address-ner-v1",
    aggregation_strategy="simple",
)

results = ner("H.No. 12, MG Road, Koramangala, Bengaluru - 560034")
# or Devanagari
results = ner("मकान नं. 21, गांधी नगर, भोपाल - 462001")
# or Hinglish
results = ner("Makan No. 4, Gandhi Nagar ke paas, Bhopal")
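
With `aggregation_strategy="simple"`, the pipeline returns a list of dicts, one per merged span, with keys such as `entity_group`, `word`, `score`, `start`, and `end`. The sketch below shows one way to fold such a list into a dict keyed by tag. The helper `group_by_label` is hypothetical, and the `sample` list is hand-written for illustration; it is not actual output from this model.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_label(entities: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Collect span texts under their entity label, preserving input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written, illustrative spans in the pipeline's output shape
# (scores and offsets are made up, not produced by the model).
sample = [
    {"entity_group": "ADDRESS_HOUSE", "word": "H.No. 12", "score": 0.99, "start": 0, "end": 8},
    {"entity_group": "ADDRESS_STREET", "word": "MG Road", "score": 0.99, "start": 10, "end": 17},
    {"entity_group": "ADDRESS_CITY", "word": "Bengaluru", "score": 0.98, "start": 32, "end": 41},
]

for label, spans in group_by_label(sample).items():
    print(label, spans)
```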

Training Details

  • Base model: google/muril-base-cased
  • Dataset: Synthetic Indian address corpus v1 (seed 42)
  • Epochs: 4, batch size 8, max_length 192
  • Learning rate: 2e-5, warmup ratio 0.1, weight decay 0.01
  • Weighted loss: enabled (class imbalance handling)
  • Run ID: 20260420_030651_muril-address-benchmark-v1
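
To make the schedule concrete, the hyperparameters above imply roughly the following step counts. This is back-of-the-envelope arithmetic under stated assumptions: that the 26,728 examples quoted in the Performance section are the training set, that the batch size of 8 is the effective batch size (single device, no gradient accumulation), and that the warmup step count is rounded up. None of these is confirmed by the card.

```python
import math

num_examples = 26_728   # from the card; assumed to be the training-set size
batch_size = 8          # assumed to be the effective batch size
epochs = 4
warmup_ratio = 0.1

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
# Warmup covers the first 10% of optimizer steps, rounded up here.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"steps per epoch: {steps_per_epoch}")   # 3341
print(f"total steps:     {total_steps}")       # 13364
print(f"warmup steps:    {warmup_steps}")      # 1337
```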
