---
language: ar
language_name: Arabic
language_family: arabic
tags:
- wikilangs
- nlp
- tokenizer
- embeddings
- n-gram
- markov
- wikipedia
- feature-extraction
- sentence-similarity
- tokenization
- n-grams
- markov-chain
- text-mining
- fasttext
- babelvec
- vocabulous
- vocabulary
- monolingual
- family-arabic
license: mit
library_name: wikilangs
pipeline_tag: text-generation
datasets:
- omarkamali/wikipedia-monthly
dataset_info:
  name: wikipedia-monthly
  description: Monthly snapshots of Wikipedia articles across 300+ languages
metrics:
- name: best_compression_ratio
  type: compression
  value: 4.347
- name: best_isotropy
  type: isotropy
  value: 0.8111
- name: best_alignment_r10
  type: alignment
  value: 0.7660
- name: vocabulary_size
  type: vocab
  value: 986324
generated: 2026-03-04
---

# Arabic – Wikilangs Models

Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on **Arabic** Wikipedia by [Wikilangs](https://wikilangs.org).

[Language Page](https://wikilangs.org/languages/ar/) · 🎮 [Playground](https://wikilangs.org/playground/?lang=ar) · [Full Research Report](RESEARCH_REPORT.md)

## Language Samples

Example sentences drawn from the Arabic Wikipedia corpus:

> ุชุตุบูุฑ K \ ูู \ ูู ุงูุญุฑู ุงูุญุงุฏู ุงูุนุดุฑ ูู ุงูุฃุจุฌุฏูุฉ The Oxford English Dictionary, 2nd ed., online ููู ุซู ูุฐุง ุงูุญุฑู ุงูุตูุช ุงูุทุจูู ุงููููู ุงูู ูู ูุณ ูู ุงูููู ูุงุกุ ูุฑู ุฒ K ูุนูุตุฑ ุงูุจูุชุงุณููู ู ุฑุงุฌุน ูุงุชูููุฉ

> : ุฅุญุฏู ููุงูุงุช ุงูููุงูุงุช ุงูู ุชุญุฏุฉ ุงูุฃู ุฑูููุฉ. ู ุฏููุฉ ูููููุฑู: ุฃูุจุฑ ู ุฏู ุงูููุงูุงุช ุงูู ุชุญุฏุฉ ุงูุฃู ุฑูููุฉ ูุฅุญุฏู ุฃูุจุฑูุง ูู ุงูุนุงูู . ู ูุงุทุนุฉ ูููููุฑู: ุฅุญุฏู ู ูุงุทุนุงุช ููุงูุฉ ูููููุฑู. ุชูุถูุญ ุฃุณู ุงุก ุฃู ุงูู

> ุฃุจู ุฅุจุฑุงููู ุงููุงุฑุงุจู ุฃุฏูุจ ูุญูู ูุบูู ุฃุจู ูุตุฑ ู ุญู ุฏ ุงููุงุฑุงุจู ูููุณูู ู ุดุงุฆู ู ุณูู ูุทุจูุจ

> ุฅุณุญุงู ูููุชู ุนุงูู ุฅูุฌููุฒู ูููุชู ูุญุฏุฉ ููุงุณ ุงูููุฉ. ุฐููุฑ ุฅูุฌููุฒูุฉ ุชูุถูุญ ุฃุณู ุงุก ุฃู ุงูู

> ุจูุชุงู (ู ู ููุฉ) ุจูุชุงู ู ู ููุฉ ูู ุฌุจุงู ุงููู ุงูุงูุง ุจูู ุงูููุฏ ูุงูุตูู. ุจูุชุงู (ููู ูุงุก) ุฃุญุฏ ุงูุฃููุงูุงุชุ ูุชููู ู ู ุฃุฑุจุน ุฐุฑุงุช ูุฑุจูู.

## Quick Start

### Load the Tokenizer

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("ar_tokenizer_32k.model")

text = "ุงุณุชูุฏูููุงุช ุฃููุงู ูุงูุช ุฏูุฒูู ุฃููุงู ูุงูุช ุฏูุฒูู ู ูุชุฌุน ูุงูุช ุฏูุฒูู ุงูุนุงูู ู ุฏูุฒูู ูุงูุฏ"
tokens = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)

print(tokens)  # subword pieces
print(ids)     # integer ids

# Decode back
print(sp.DecodeIds(ids))
```
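A tokenizer "compression" figure like those reported below is usually defined as characters of raw text per token produced. A minimal sketch of that computation; `compression_ratio` is a hypothetical helper of ours, not part of any wikilangs package, and a plain whitespace split stands in for the trained BPE model:

```python
# Sketch only: illustrates the usual chars-per-token definition behind
# figures like "4.35x"; not an actual wikilangs API.
def compression_ratio(text, tokens):
    """Characters of raw text per token produced (higher = denser)."""
    return len(text) / len(tokens)

# Toy example: a whitespace split stands in for the trained BPE model.
text = "the quick brown fox"
ratio = compression_ratio(text, text.split())  # 19 chars / 4 tokens = 4.75
```

With a real subword tokenizer you would pass `sp.EncodeAsPieces(text)` as the token list instead of `text.split()`.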
<details>
<summary><b>Tokenization examples (click to expand)</b></summary>

**Sample 1:** `ุงุณุชูุฏูููุงุช ุฃููุงู ูุงูุช ุฏูุฒูู ุฃููุงู ูุงูุช ุฏูุฒูู ู ูุชุฌุน ูุงูุช ุฏูุฒูู ุงูุนุงูู ู ุฏูุฒูู ูุงูุฏ…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁ุงุณุช ูุฏู ูู ุงุช ▁ุฃููุงู ▁ูุงูุช ▁ุฏู ุฒ ูู ▁ุฃููุงู … (+22 more)` | 32 |
| 16k | `▁ุงุณุช ูุฏู ููุงุช ▁ุฃููุงู ▁ูุงูุช ▁ุฏูุฒูู ▁ุฃููุงู ▁ูุงูุช ▁ุฏูุฒูู ▁ู ูุช … (+10 more)` | 20 |
| 32k | `▁ุงุณุชูุฏูููุงุช ▁ุฃููุงู ▁ูุงูุช ▁ุฏูุฒูู ▁ุฃููุงู ▁ูุงูุช ▁ุฏูุฒูู ▁ู ูุชุฌุน ▁ูุงูุช ▁ุฏูุฒูู … (+7 more)` | 17 |
| 64k | `▁ุงุณุชูุฏูููุงุช ▁ุฃููุงู ▁ูุงูุช ▁ุฏูุฒูู ▁ุฃููุงู ▁ูุงูุช ▁ุฏูุฒูู ▁ู ูุชุฌุน ▁ูุงูุช ▁ุฏูุฒูู … (+7 more)` | 17 |

**Sample 2:** `ุจุงุณูุงู ูุฏ ุชุนูู: ุงูุจุงุณูุงูุ ูุญุฏุฉ ููุงุณ ุงูุถุบุท ูุบุฉ ุจุงุณูุงูุ ูุบุฉ ุจุฑู ุฌุฉ ุงููููุณูู ุจุงุณูุงูุ…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁ุจุง ุณู ุงู ▁ูุฏ ▁ุชุนูู : ▁ุงูุจุง ุณู ุงู ุ … (+29 more)` | 39 |
| 16k | `▁ุจุงุณูุงู ▁ูุฏ ▁ุชุนูู : ▁ุงูุจุงุณู ุงู ุ ▁ูุญุฏุฉ ▁ููุงุณ ▁ุงูุถุบุท … (+18 more)` | 28 |
| 32k | `▁ุจุงุณูุงู ▁ูุฏ ▁ุชุนูู : ▁ุงูุจุงุณู ุงู ุ ▁ูุญุฏุฉ ▁ููุงุณ ▁ุงูุถุบุท … (+15 more)` | 25 |
| 64k | `▁ุจุงุณูุงู ▁ูุฏ ▁ุชุนูู : ▁ุงูุจุงุณู ุงู ุ ▁ูุญุฏุฉ ▁ููุงุณ ▁ุงูุถุบุท … (+15 more)` | 25 |

**Sample 3:** `ุฌู ููุฑูุฉ ุงููููุบู ุงูุฏูู ูุฑุงุทูุฉุ ุฒุงุฆูุฑ ุณุงุจููุงุ ุนุงุตู ุชูุง ูููุดุงุณุง. ุฌู ููุฑูุฉ ุงููููุบูุ ุนุงุต…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁ุฌู ููุฑูุฉ ▁ุงูููู ุบู ▁ุงูุฏูู ูุฑุงุทูุฉ ุ ▁ุฒ ุงุฆ ูุฑ ▁ุณุงุจู ูุง … (+21 more)` | 31 |
| 16k | `▁ุฌู ููุฑูุฉ ▁ุงููููุบู ▁ุงูุฏูู ูุฑุงุทูุฉ ุ ▁ุฒ ุงุฆ ูุฑ ▁ุณุงุจููุง ุ ▁ุนุงุตู ุชูุง … (+16 more)` | 26 |
| 32k | `▁ุฌู ููุฑูุฉ ▁ุงููููุบู ▁ุงูุฏูู ูุฑุงุทูุฉ ุ ▁ุฒุงุฆ ูุฑ ▁ุณุงุจููุง ุ ▁ุนุงุตู ุชูุง ▁ูููุดุงุณุง … (+12 more)` | 22 |
| 64k | `▁ุฌู ููุฑูุฉ ▁ุงููููุบู ▁ุงูุฏูู ูุฑุงุทูุฉ ุ ▁ุฒุงุฆูุฑ ▁ุณุงุจููุง ุ ▁ุนุงุตู ุชูุง ▁ูููุดุงุณุง . … (+10 more)` | 20 |

</details>

### Load Word Embeddings

```python
from gensim.models import KeyedVectors

# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("ar_embeddings_128d_aligned.kv")

similar = wv.most_similar("word", topn=5)
for word, score in similar:
    print(f"  {word}: {score:.3f}")
```
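The alignment R@k scores reported below measure whether a source vector retrieves its gold counterpart among the k nearest neighbors in the target space. A self-contained sketch of that evaluation; the function name and the toy vectors are ours (the target space is simply a copy of the source space, so every word retrieves itself):

```python
import numpy as np

# Sketch only: recall@k over cosine similarity, with invented toy vectors.
def recall_at_k(src, tgt, gold, k):
    """Fraction of source rows whose gold target index is in the top-k matches."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T                       # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]  # best k target indices per row
    return float(np.mean([g in row for g, row in zip(gold, topk)]))

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 8))  # 5 toy "words" in 8 dimensions
tgt = src.copy()               # identical spaces: every word is its own match
score = recall_at_k(src, tgt, gold=[0, 1, 2, 3, 4], k=1)  # 1.0
```

Real evaluation replaces the toy matrices with the aligned Arabic vectors and their English counterparts from a bilingual lexicon.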
### Load N-gram Model

```python
import pyarrow.parquet as pq

df = pq.read_table("ar_3gram_word.parquet").to_pandas()
print(df.head())
```
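The parquet files hold raw n-gram counts; turning counts into next-word probabilities just means normalizing by the context total. A toy sketch with an invented English corpus (plain maximum likelihood, no smoothing, which the released models may handle differently):

```python
from collections import Counter, defaultdict

# Sketch only: maximum-likelihood trigram probabilities from raw counts.
corpus = "the cat sat on the mat the cat ran".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

# Total count of each 2-word context, for normalization.
context_totals = defaultdict(int)
for (w1, w2, w3), count in trigrams.items():
    context_totals[(w1, w2)] += count

def prob(w1, w2, w3):
    """P(w3 | w1 w2) estimated directly from trigram counts."""
    total = context_totals.get((w1, w2), 0)
    return trigrams[(w1, w2, w3)] / total if total else 0.0

p = prob("the", "cat", "sat")  # "the cat" is followed by "sat" 1 of 2 times
```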
## Models Overview


| Category | Assets |
|----------|--------|
| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
| Markov chains | Context 1–5 (word & subword) |
| Embeddings | 32d, 64d, 128d – mono & aligned |
| Vocabulary | Full frequency list + Zipf analysis |
| Statistics | Corpus & model statistics JSON |
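The Markov chains listed above are essentially context-to-next-token count tables, and sampling from them takes only a few lines. A hedged sketch of a context-2 word chain; the training sequence and function names are illustrative, not taken from the released files:

```python
import random
from collections import Counter, defaultdict

# Sketch only: a context-2 word Markov chain built from an invented sequence.
def build_chain(words, ctx=2):
    """Map each length-ctx context to a Counter of observed next words."""
    chain = defaultdict(Counter)
    for i in range(len(words) - ctx):
        chain[tuple(words[i:i + ctx])][words[i + ctx]] += 1
    return chain

def generate(chain, seed, steps, rng, ctx=2):
    """Extend seed by sampling next words proportionally to their counts."""
    out = list(seed)
    for _ in range(steps):
        counts = chain.get(tuple(out[-ctx:]))
        if not counts:
            break  # unseen context: stop generating
        choices, weights = zip(*counts.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out

words = "a b c a b d a b c".split()
chain = build_chain(words)  # e.g. ("a", "b") -> Counter({"c": 2, "d": 1})
sample = generate(chain, seed=("a", "b"), steps=3, rng=random.Random(0))
```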
## Metrics Summary

| Component | Model | Key Metric | Value |
|-----------|-------|------------|-------|
| Tokenizer | 8k BPE | Compression | 3.25x |
| Tokenizer | 16k BPE | Compression | 3.65x |
| Tokenizer | 32k BPE | Compression | 4.03x |
| Tokenizer | 64k BPE | Compression | 4.35x ★ |
| N-gram | 2-gram (subword) | Perplexity | 426 ★ |
| N-gram | 2-gram (word) | Perplexity | 359,826 |
| N-gram | 3-gram (subword) | Perplexity | 4,163 |
| N-gram | 3-gram (word) | Perplexity | 775,988 |
| N-gram | 4-gram (subword) | Perplexity | 27,277 |
| N-gram | 4-gram (word) | Perplexity | 1,494,234 |
| N-gram | 5-gram (subword) | Perplexity | 133,736 |
| N-gram | 5-gram (word) | Perplexity | 1,059,510 |
| Markov | ctx-1 (subword) | Predictability | 0.0% |
| Markov | ctx-1 (word) | Predictability | 0.0% |
| Markov | ctx-2 (subword) | Predictability | 17.3% |
| Markov | ctx-2 (word) | Predictability | 67.4% |
| Markov | ctx-3 (subword) | Predictability | 29.5% |
| Markov | ctx-3 (word) | Predictability | 89.5% |
| Markov | ctx-4 (subword) | Predictability | 35.2% |
| Markov | ctx-4 (word) | Predictability | 96.5% ★ |
| Vocabulary | full | Size | 986,324 |
| Vocabulary | full | Zipf R² | 0.9920 |
| Embeddings | mono_32d | Isotropy | 0.8111 |
| Embeddings | mono_64d | Isotropy | 0.7841 |
| Embeddings | mono_128d | Isotropy | 0.7556 |
| Embeddings | aligned_32d | Isotropy | 0.8111 ★ |
| Embeddings | aligned_64d | Isotropy | 0.7841 |
| Embeddings | aligned_128d | Isotropy | 0.7556 |
| Alignment | aligned_32d | R@1 / R@5 / R@10 | 13.4% / 35.0% / 48.6% |
| Alignment | aligned_64d | R@1 / R@5 / R@10 | 28.6% / 54.0% / 65.6% |
| Alignment | aligned_128d | R@1 / R@5 / R@10 | 37.2% / 65.0% / 76.6% ★ |
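The Zipf R² row measures how well the vocabulary's frequency-rank curve fits a power law: regress log-frequency on log-rank and report the coefficient of determination. A self-contained sketch on an invented, perfectly Zipfian frequency list (the helper name is ours):

```python
import math

# Sketch only: least-squares fit of log(frequency) against log(rank),
# returning R^2. Real frequencies come from the released vocabulary list.
def zipf_r2(freqs):
    ys = [math.log(f) for f in sorted(freqs, reverse=True)]
    xs = [math.log(r) for r in range(1, len(ys) + 1)]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# A perfect power law (f = 1000 / rank) fits exactly, so R^2 is 1.0.
r2 = zipf_r2([1000 / r for r in range(1, 51)])
```

An R² of 0.9920, as reported above, indicates the Arabic vocabulary tracks a Zipfian distribution closely.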

**[Full ablation study, per-model breakdowns, and interpretation guide →](RESEARCH_REPORT.md)**

---

## About

Trained on [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly) – monthly snapshots of 300+ Wikipedia languages.

A project by **[Wikilangs](https://wikilangs.org)** · Maintainer: [Omar Kamali](https://omarkamali.com) · [Omneity Labs](https://omneitylabs.com)

### Citation

```bibtex
@misc{wikilangs2025,
  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
  doi = {10.5281/zenodo.18073153},
  publisher = {Zenodo},
  url = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}
```

### Links

- [wikilangs.org](https://wikilangs.org)
- [Language page](https://wikilangs.org/languages/ar/)
- 🎮 [Playground](https://wikilangs.org/playground/?lang=ar)
- [HuggingFace models](https://huggingface.co/wikilangs)
- [wikipedia-monthly dataset](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
- [Omar Kamali](https://huggingface.co/omarkamali)
- Sponsor: [Featherless AI](https://featherless.ai)

**License:** MIT – free for academic and commercial use.

---
*Generated by Wikilangs Pipeline · 2026-03-04 13:56:39*