santhosh committed · Commit 8f93d1a · verified · 1 parent: 28ccf78

Release 0.1.0

Files changed (4)
  1. README.md +49 -0
  2. special_tokens_map.json +7 -0
  3. tokenizer.json +0 -0
  4. tokenizer_config.json +10 -0
README.md ADDED
@@ -0,0 +1,49 @@
---
language: ml
license: mit
tags:
- malayalam
- tokenizer
- bpe
library_name: tokenizers
version: 0.1.0
---

# Malayalam BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on a Malayalam text corpus.
Trained using the [HuggingFace tokenizers](https://github.com/huggingface/tokenizers) library
with Metaspace pre-tokenization and NFC normalization for correct handling of Malayalam
Unicode conjuncts.

## Details

| Property | Value |
|---|---|
| Algorithm | BPE (Byte Pair Encoding) |
| Vocabulary size | 16,000 |
| Pre-tokenizer | Metaspace (`▁`) |
| Normalizer | NFC + Strip |
| Special tokens | `<s>`, `</s>`, `<unk>`, `<pad>`, `<mask>` |
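The `NFC + Strip` normalizer matters for Malayalam because some vowel signs have canonical two-part decompositions; a minimal standard-library sketch (this illustrates standard Unicode behavior, not code from this repository):

```python
# Standard-library sketch: Malayalam vowel sign O (U+0D4A) is canonically
# equivalent to the two-part sequence U+0D46 + U+0D3E. NFC composes the
# decomposed form, so both input spellings reach the BPE model identically.
import unicodedata

decomposed = "\u0d2a\u0d46\u0d3e"  # പ + െ + ാ  (decomposed spelling of "പൊ")
composed = unicodedata.normalize("NFC", decomposed)

print(composed == "\u0d2a\u0d4a")             # True: vowel sign is now one code point
print([f"U+{ord(c):04X}" for c in composed])  # ['U+0D2A', 'U+0D4A']
```

Without this step, the two spellings would produce different token sequences for visually identical text.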

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("smc/malayalam-bpe-tokenizer")

text = "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്"
tokens = tokenizer.tokenize(text)
print(tokens)

encoded = tokenizer(text, return_tensors="pt")
print(encoded)
```

## Notes

- Use Metaspace rather than ByteLevel pre-tokenization: ByteLevel splits Malayalam's
  multibyte UTF-8 characters into individual bytes, producing tokens that are not valid
  standalone characters.
- NFC normalization ensures canonically equivalent Malayalam sequences, such as two-part
  vowel signs entered as separate code points, are encoded consistently before tokenization.
- Trained and published from [smc/malayalam-tokenizer](https://github.com/smc/malayalam-tokenizer).
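The configuration in the Details table can be sketched with the tokenizers library as follows. This is a minimal reconstruction under the stated settings, not the repository's actual training script (which lives in the linked repo), and it uses a toy corpus for illustration:

```python
# Hypothetical sketch of the training recipe described above: BPE model,
# NFC + Strip normalization, Metaspace pre-tokenization, and the five
# special tokens from the Details table.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.normalizer = normalizers.Sequence([normalizers.NFC(), normalizers.Strip()])
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()  # word-boundary marker "▁"

trainer = trainers.BpeTrainer(
    vocab_size=16000,  # matches the table; a tiny corpus stops far earlier
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
)

# Toy corpus for illustration only; the released tokenizer was trained on
# a full Malayalam text corpus.
corpus = ["മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്", "മലയാളം"]
tokenizer.train_from_iterator(corpus, trainer)

# Tokens come out prefixed with the Metaspace marker "▁".
print(tokenizer.encode("മലയാളം").tokens)
```

The trained tokenizer can then be saved with `tokenizer.save("tokenizer.json")`, which produces the `tokenizer.json` file added in this commit.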
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "mask_token": "<mask>"
}
tokenizer.json ADDED
The diff for this file is too large to render.
 
tokenizer_config.json ADDED
@@ -0,0 +1,10 @@
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "model_max_length": 1000000000,
  "bos_token": "<s>",
  "eos_token": "</s>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "mask_token": "<mask>",
  "clean_up_tokenization_spaces": false
}