Release 0.1.0

Files changed (4)

README.md ADDED

+---
+language: ml
+license: mit
+tags:
+  - malayalam
+  - tokenizer
+  - bpe
+library_name: tokenizers
+version: 0.1.0
+---
+# Malayalam BPE Tokenizer
+A Byte Pair Encoding (BPE) tokenizer trained on a Malayalam text corpus.
+It was trained with the [HuggingFace tokenizers](https://github.com/huggingface/tokenizers) library,
+using Metaspace pre-tokenization and NFC normalization so that canonically equivalent
+Malayalam sequences map to the same code points.
+## Details
+| Property | Value |
+|---|---|
+| Algorithm | BPE (Byte Pair Encoding) |
+| Vocabulary size | 16,000 |
+| Pre-tokenizer | Metaspace (`▁`) |
+| Normalizer | NFC + Strip |
+| Special tokens | `<s>`, `</s>`, `<unk>`, `<pad>`, `<mask>` |
+## Usage
+```python
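+from transformers import AutoTokenizer
+
+# Load the tokenizer from the Hub.
+tokenizer = AutoTokenizer.from_pretrained("smc/malayalam-bpe-tokenizer")
+text = "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്"
+
+# Split the text into subword tokens.
+tokens = tokenizer.tokenize(text)
+print(tokens)
+
+# Encode to input IDs and an attention mask as PyTorch tensors (requires torch).
+encoded = tokenizer(text, return_tensors="pt")
+print(encoded)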
+```
+## Notes
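+- Use Metaspace (not ByteLevel) pre-tokenization. ByteLevel operates on individual UTF-8
+  bytes, and every Malayalam code point is three bytes long, so byte-level tokens can end
+  in the middle of a character.
+- NFC normalization maps canonically equivalent sequences, such as a two-part vowel sign
+  written as `െ` + `ാ` instead of `ൊ`, to a single form. It does not add or remove
+  ZWJ/ZWNJ, so legacy ZWJ-based chillu sequences and atomic chillu characters remain distinct.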
+- Trained and published from [smc/malayalam-tokenizer](https://github.com/smc/malayalam-tokenizer).
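
Review note: what NFC does and does not unify, and how many bytes a byte-level pre-tokenizer would see, can be checked with the Python standard library alone. The sketch below is illustrative and not part of the commit. The sample strings are chosen for this note; it assumes only that the tokenizer's NFC step behaves like `unicodedata.normalize("NFC", ...)`.

```python
import unicodedata

# Two-part vowel sign: KA + VOWEL SIGN E + VOWEL SIGN AA is canonically
# equivalent to KA + VOWEL SIGN O, so NFC composes it.
decomposed = "\u0d15\u0d46\u0d3e"
composed = "\u0d15\u0d4a"
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Chillu-n: the legacy NA + VIRAMA + ZWJ sequence and the atomic U+0D7B
# are not canonically equivalent, so NFC leaves them distinct.
legacy_chillu = "\u0d28\u0d4d\u200d"
atomic_chillu = "\u0d7b"
chillu_unified = (
    unicodedata.normalize("NFC", legacy_chillu)
    == unicodedata.normalize("NFC", atomic_chillu)
)

# The word "Malayalam" in Malayalam script, written with escapes:
# each code point occupies three bytes in UTF-8.
word = "\u0d2e\u0d32\u0d2f\u0d3e\u0d33\u0d02"
code_points = len(word)
utf8_bytes = len(word.encode("utf-8"))

print(same_after_nfc, chillu_unified, code_points, utf8_bytes)
```

If chillu variants need to be unified, that would take an extra replacement step in the normalizer; nothing in this diff indicates that one is configured.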

special_tokens_map.json ADDED

+{
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "unk_token": "<unk>",
+  "pad_token": "<pad>",
+  "mask_token": "<mask>"
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

+{
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "model_max_length": 1000000000,
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "unk_token": "<unk>",
+  "pad_token": "<pad>",
+  "mask_token": "<mask>",
+  "clean_up_tokenization_spaces": false
+}
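
Review note: `special_tokens_map.json` and `tokenizer_config.json` both declare the special tokens, so it is worth confirming that they agree. The sketch below is one possible check and is not part of the commit. It inlines the two JSON bodies as they appear in this diff; in a checkout they would be read from the files instead.

```python
import json

# JSON bodies copied from the diff above (inlined so the sketch is self-contained).
special_tokens_map = json.loads("""
{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "mask_token": "<mask>"
}
""")

tokenizer_config = json.loads("""
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "model_max_length": 1000000000,
  "bos_token": "<s>",
  "eos_token": "</s>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "mask_token": "<mask>",
  "clean_up_tokenization_spaces": false
}
""")

# Collect every special token whose value differs between the two files.
mismatches = {
    name: (value, tokenizer_config.get(name))
    for name, value in special_tokens_map.items()
    if tokenizer_config.get(name) != value
}

print(mismatches)
```

An empty dictionary means every token in the map has the same value in the config.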