---
language: ar
language_name: Arabic
language_family: arabic
tags:
- wikilangs
- nlp
- tokenizer
- embeddings
- n-gram
- markov
- wikipedia
- feature-extraction
- sentence-similarity
- tokenization
- n-grams
- markov-chain
- text-mining
- fasttext
- babelvec
- vocabulous
- vocabulary
- monolingual
- family-arabic
license: mit
library_name: wikilangs
pipeline_tag: text-generation
datasets:
- omarkamali/wikipedia-monthly
dataset_info:
name: wikipedia-monthly
description: Monthly snapshots of Wikipedia articles across 300+ languages
metrics:
- name: best_compression_ratio
type: compression
value: 4.347
- name: best_isotropy
type: isotropy
value: 0.8111
- name: best_alignment_r10
type: alignment
value: 0.7660
- name: vocabulary_size
type: vocab
value: 986324
generated: 2026-03-04
---
# Arabic – Wikilangs Models
Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on **Arabic** Wikipedia by [Wikilangs](https://wikilangs.org).
๐ŸŒ [Language Page](https://wikilangs.org/languages/ar/) ยท ๐ŸŽฎ [Playground](https://wikilangs.org/playground/?lang=ar) ยท ๐Ÿ“Š [Full Research Report](RESEARCH_REPORT.md)
## Language Samples
Example sentences drawn from the Arabic Wikipedia corpus:
> تصغير K \ كي \ هو الحرف الحادي العشر في الأبجدية The Oxford English Dictionary, 2nd ed., online ويمثل هذا الحرف الصوت الطبقي الوقفي المهموس في الكيمياء، يرمز K لعنصر البوتاسيوم مراجع لاتينية
> : إحدى ولايات الولايات المتحدة الأمريكية. مدينة نيويورك: أكبر مدن الولايات المتحدة الأمريكية وإحدى أكبرها في العالم. مقاطعة نيويورك: إحدى مقاطعات ولاية نيويورك. توضيح أسماء أماكن
> أبو إبراهيم الفارابي أديب نحوي لغوي أبو نصر محمد الفارابي فيلسوف مشائي مسلم وطبيب
> إسحاق نيوتن عالم إنجليزي نيوتن وحدة قياس القوة. ذكور إنجليزية توضيح أسماء أماكن
> بوتان (مملكة) بوتان مملكة في جبال الهمالايا بين الهند والصين. بوتان (كيمياء) أحد الألكانات، يتكون من أربع ذرات كربون.
## Quick Start
### Load the Tokenizer
```python
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("ar_tokenizer_32k.model")
text = "استوديوهات أفلام والت ديزني أفلام والت ديزني منتجع والت ديزني العالمي ديزني لاند"
tokens = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)
print(tokens) # subword pieces
print(ids) # integer ids
# Decode back
print(sp.DecodeIds(ids))
```
<details>
<summary><b>Tokenization examples (click to expand)</b></summary>
**Sample 1:** `استوديوهات أفلام والت ديزني أفلام والت ديزني منتجع والت ديزني العالمي ديزني لاند…`
| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁است ودي وه ات ▁أفلام ▁والت ▁دي ز ني ▁أفلام … (+22 more)` | 32 |
| 16k | `▁است ودي وهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منت … (+10 more)` | 20 |
| 32k | `▁استوديوهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منتجع ▁والت ▁ديزني … (+7 more)` | 17 |
| 64k | `▁استوديوهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منتجع ▁والت ▁ديزني … (+7 more)` | 17 |
**Sample 2:** `باسكال قد تعني: الباسكال، وحدة قياس الضغط لغة باسكال، لغة برمجة الفيلسوف باسكال،…`
| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁با سك ال ▁قد ▁تعني : ▁البا سك ال ، … (+29 more)` | 39 |
| 16k | `▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+18 more)` | 28 |
| 32k | `▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+15 more)` | 25 |
| 64k | `▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+15 more)` | 25 |
**Sample 3:** `جمهورية الكونغو الديمقراطية، زائير سابقًا، عاصمتها كينشاسا. جمهورية الكونغو، عاص…`
| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁جمهورية ▁الكون غو ▁الديمقراطية ، ▁ز ائ ير ▁سابق ًا … (+21 more)` | 31 |
| 16k | `▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁ز ائ ير ▁سابقًا ، ▁عاصمتها … (+16 more)` | 26 |
| 32k | `▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁زائ ير ▁سابقًا ، ▁عاصمتها ▁كينشاسا … (+12 more)` | 22 |
| 64k | `▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁زائير ▁سابقًا ، ▁عاصمتها ▁كينشاسا . … (+10 more)` | 20 |
</details>
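The vocab-size trade-off in the tables above is easy to reproduce. Below is a minimal sketch; it assumes the sibling tokenizers follow the same naming pattern as the 32k file shown in Quick Start (only `ar_tokenizer_32k.model` appears above, the other names are inferred):
```python
import sentencepiece as spm

text = "استوديوهات أفلام والت ديزني أفلام والت ديزني منتجع والت ديزني العالمي ديزني لاند"

# Filenames for 8k/16k/64k are assumed to mirror the 32k pattern.
for size in ("8k", "16k", "32k", "64k"):
    sp = spm.SentencePieceProcessor()
    sp.Load(f"ar_tokenizer_{size}.model")
    print(f"{size:>4}: {len(sp.EncodeAsPieces(text))} tokens")
```
Larger vocabularies keep frequent words such as ديزني intact as single pieces, which is where the token-count savings in the tables come from.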
### Load Word Embeddings
```python
from gensim.models import KeyedVectors
# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("ar_embeddings_128d_aligned.kv")
similar = wv.most_similar("مدينة", topn=5)  # illustrative query word ("city"); any in-vocabulary word works
for word, score in similar:
    print(f"  {word}: {score:.3f}")
```
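The loaded `KeyedVectors` object supports the rest of the standard gensim API, e.g. pairwise similarity and vector arithmetic. A short sketch; the query words are illustrative and assumed to be among the 986k vocabulary entries:
```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load("ar_embeddings_128d_aligned.kv")

# Cosine similarity between two words ("city" vs. "capital");
# example words are assumptions, not taken from this card.
print(wv.similarity("مدينة", "عاصمة"))

# Analogy via vector arithmetic: capital + France - city ≈ Paris (illustrative)
print(wv.most_similar(positive=["عاصمة", "فرنسا"], negative=["مدينة"], topn=3))
```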
### Load N-gram Model
```python
import pyarrow.parquet as pq
# Word-level 3-gram counts stored as a Parquet table
df = pq.read_table("ar_3gram_word.parquet").to_pandas()
print(df.head())
```
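This card does not document the Parquet schema, but n-gram count tables conventionally hold one n-gram per row with a frequency column. A sketch of next-word probabilities under that assumption (the column names `ngram` and `count` are hypothetical; adjust them to the real schema):
```python
import pyarrow.parquet as pq

df = pq.read_table("ar_3gram_word.parquet").to_pandas()

# Hypothetical schema: "ngram" = space-separated words, "count" = frequency.
context = "الولايات المتحدة"  # "the United States"
cand = df[df["ngram"].str.startswith(context + " ")].copy()
cand["p"] = cand["count"] / cand["count"].sum()  # P(w3 | w1 w2)
print(cand.nlargest(5, "p")[["ngram", "p"]])
```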
## Models Overview
![Performance Dashboard](visualizations/performance_dashboard.png)
| Category | Assets |
|----------|--------|
| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
| Markov chains | Context 1–5 (word & subword); see the sampling sketch below |
| Embeddings | 32d, 64d, 128d (mono & aligned) |
| Vocabulary | Full frequency list + Zipf analysis |
| Statistics | Corpus & model statistics JSON |
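The Markov-chain assets have no Quick Start snippet and their on-disk format is not documented here, but a context-1 word chain is equivalent to sampling from bigram counts, so the n-gram tables can stand in. A sketch under the same hypothetical `ngram`/`count` schema as above; the 2-gram filename is assumed to mirror `ar_3gram_word.parquet`:
```python
import random
import pyarrow.parquet as pq

df = pq.read_table("ar_2gram_word.parquet").to_pandas()  # assumed filename

word = "مدينة"  # illustrative seed word ("city")
out = [word]
for _ in range(10):
    # All bigrams whose first word is the current word
    cand = df[df["ngram"].str.startswith(word + " ")]
    if cand.empty:
        break  # no observed continuation
    # Sample the next word proportionally to its bigram count
    pick = random.choices(cand["ngram"].tolist(), weights=cand["count"].tolist())[0]
    word = pick.split()[-1]
    out.append(word)
print(" ".join(out))
```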
## Metrics Summary
| Component | Model | Key Metric | Value |
|-----------|-------|------------|-------|
| Tokenizer | 8k BPE | Compression | 3.25x |
| Tokenizer | 16k BPE | Compression | 3.65x |
| Tokenizer | 32k BPE | Compression | 4.03x |
| Tokenizer | 64k BPE | Compression | 4.35x 🏆 |
| N-gram | 2-gram (subword) | Perplexity | 426 🏆 |
| N-gram | 2-gram (word) | Perplexity | 359,826 |
| N-gram | 3-gram (subword) | Perplexity | 4,163 |
| N-gram | 3-gram (word) | Perplexity | 775,988 |
| N-gram | 4-gram (subword) | Perplexity | 27,277 |
| N-gram | 4-gram (word) | Perplexity | 1,494,234 |
| N-gram | 5-gram (subword) | Perplexity | 133,736 |
| N-gram | 5-gram (word) | Perplexity | 1,059,510 |
| Markov | ctx-1 (subword) | Predictability | 0.0% |
| Markov | ctx-1 (word) | Predictability | 0.0% |
| Markov | ctx-2 (subword) | Predictability | 17.3% |
| Markov | ctx-2 (word) | Predictability | 67.4% |
| Markov | ctx-3 (subword) | Predictability | 29.5% |
| Markov | ctx-3 (word) | Predictability | 89.5% |
| Markov | ctx-4 (subword) | Predictability | 35.2% |
| Markov | ctx-4 (word) | Predictability | 96.5% 🏆 |
| Vocabulary | full | Size | 986,324 |
| Vocabulary | full | Zipf R² | 0.9920 |
| Embeddings | mono_32d | Isotropy | 0.8111 |
| Embeddings | mono_64d | Isotropy | 0.7841 |
| Embeddings | mono_128d | Isotropy | 0.7556 |
| Embeddings | aligned_32d | Isotropy | 0.8111 🏆 |
| Embeddings | aligned_64d | Isotropy | 0.7841 |
| Embeddings | aligned_128d | Isotropy | 0.7556 |
| Alignment | aligned_32d | R@1 / R@5 / R@10 | 13.4% / 35.0% / 48.6% |
| Alignment | aligned_64d | R@1 / R@5 / R@10 | 28.6% / 54.0% / 65.6% |
| Alignment | aligned_128d | R@1 / R@5 / R@10 | 37.2% / 65.0% / 76.6% 🏆 |
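The compression figures above presumably measure input characters per emitted token (so 4.35x means roughly 4.35 characters per token on the evaluation text). A quick sanity check on your own text, assuming the 64k filename mirrors the 32k one from Quick Start:
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("ar_tokenizer_64k.model")  # assumed name, mirroring ar_tokenizer_32k.model

text = "جمهورية الكونغو الديمقراطية، زائير سابقًا، عاصمتها كينشاسا."
print(f"{len(text) / len(sp.EncodeAsIds(text)):.2f} chars/token")
```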
📊 **[Full ablation study, per-model breakdowns, and interpretation guide →](RESEARCH_REPORT.md)**
---
## About
Trained on [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly), monthly snapshots of 300+ Wikipedia languages.
A project by **[Wikilangs](https://wikilangs.org)** · Maintainer: [Omar Kamali](https://omarkamali.com) · [Omneity Labs](https://omneitylabs.com)
### Citation
```bibtex
@misc{wikilangs2025,
author = {Kamali, Omar},
title = {Wikilangs: Open NLP Models for Wikipedia Languages},
year = {2025},
doi = {10.5281/zenodo.18073153},
publisher = {Zenodo},
url = {https://huggingface.co/wikilangs},
institution = {Omneity Labs}
}
```
### Links
- ๐ŸŒ [wikilangs.org](https://wikilangs.org)
- ๐ŸŒ [Language page](https://wikilangs.org/languages/ar/)
- ๐ŸŽฎ [Playground](https://wikilangs.org/playground/?lang=ar)
- ๐Ÿค— [HuggingFace models](https://huggingface.co/wikilangs)
- ๐Ÿ“Š [wikipedia-monthly dataset](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
- ๐Ÿ‘ค [Omar Kamali](https://huggingface.co/omarkamali)
- ๐Ÿค Sponsor: [Featherless AI](https://featherless.ai)
**License:** MIT, free for academic and commercial use.
---
*Generated by Wikilangs Pipeline · 2026-03-04 13:56:39*