# Talkie 1930 13B Instruct – MLX

MLX port of `lewtun/talkie-1930-13b-it-hf` for Apple Silicon. Refer to the upstream model card for training data, evaluation, and provenance details; this card covers only the MLX conversion.

Talkie is a 13B instruction-tuned decoder-only transformer whose outputs are styled as pre-1930s English prose. It uses a custom architecture (nonstandard RoPE convention, weightless RMSNorm, per-head and per-layer scalar gains, embedding-skip residuals, scaled `lm_head` weights) that is not currently in `transformers/models/`.
Native Talkie support was added to mlx-lm in PR #1231.
## Variants

| Repo | Quantization | bpw | Approx. size |
|---|---|---|---|
| warshanks/talkie-1930-13b-it-mlx-bf16 | none (bf16) | 16 | 25 GB |
| warshanks/talkie-1930-13b-it-mlx-8bit | affine 8-bit, group 64 | 8.5 | 13 GB |
| warshanks/talkie-1930-13b-it-mlx-6bit | affine 6-bit, group 64 | 6.5 | 10 GB |
| warshanks/talkie-1930-13b-it-mlx-4bit | mixed 4-bit (lm_head=q8, embed=bf16, blocks 14/37/38=q8, rest q4) | 5.18 | 8 GB |
| warshanks/talkie-1930-13b-it-mlx-4bit-DWQ | DWQ-calibrated 4-bit | 4.5 | 7 GB |
For 4-bit, prefer the DWQ build. Bare q4 of this model degrades into repetition on long generations; DWQ calibration recovers clean output (validation loss 0.037 vs. ≈0.25 for bare q4 in our run).
## Installation

```sh
pip install -U mlx-lm
```

Talkie support requires an mlx-lm release that includes PR #1231. Until that release ships, install from source:

```sh
pip install -U git+https://github.com/ml-explore/mlx-lm
```
## Basic generation

```python
from mlx_lm import load, generate

model, tokenizer = load("warshanks/talkie-1930-13b-it-mlx-4bit-DWQ")

messages = [{"role": "user", "content": "Write an essay predicting what life will be like in the year 1960."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```
CLI:

```sh
mlx_lm.generate \
  --model warshanks/talkie-1930-13b-it-mlx-4bit-DWQ \
  --prompt "<|user|>What were the causes of the French Revolution?<|end|><|assistant|>" \
  --max-tokens 512 --temp 0.7
```
## Multi-turn chat

```python
from mlx_lm import load, generate

model, tokenizer = load("warshanks/talkie-1930-13b-it-mlx-4bit-DWQ")

# First turn
messages = [{"role": "user", "content": "What were the causes of the French Revolution?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
reply = generate(model, tokenizer, prompt=prompt, max_tokens=512)

# Feed the reply back into the history for the second turn
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Which of those causes was the most significant?"})
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```
## Chat template

```
<|system|>{system_message}<|end|><|user|>{user_message}<|end|><|assistant|>{assistant_message}<|end|>
```

Applied automatically by `tokenizer.apply_chat_template()`.
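For illustration, with the tokenizer loaded as in the snippets above, a system + user exchange should render like this (the expected output below is read off the template string, not captured from a run):

```python
messages = [
    {"role": "system", "content": "Answer in the manner of the 1920s."},
    {"role": "user", "content": "Good day!"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <|system|>Answer in the manner of the 1920s.<|end|><|user|>Good day!<|end|><|assistant|>
```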
## Architecture (as observed in the source checkpoint and modeling code)
| Component | Value |
|---|---|
| Parameters | 13B |
| Layers | 40 |
| Attention heads | 40 (MHA, no GQA) |
| Hidden size | 5120 |
| Head dimension | 128 |
| Intermediate size (MLP) | 13696 |
| Position encoding | RoPE (θ = 1,000,000), inverse-rotation convention |
| Activation | SwiGLU |
| Normalization | weightless RMSNorm (pre-norm) |
| Context length | 2048 |
| Vocabulary | 65,540 |
| Precision | bfloat16 |
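As a sanity check, the dimensions in the table roughly reproduce the advertised parameter count. A back-of-the-envelope tally (ignoring the scalar gains, and assuming untied embeddings, which the raw `lm_head` parameter described below suggests):

```python
# Rough parameter count from the table above; scalar gains are negligible.
V, H, L, I = 65_540, 5_120, 40, 13_696

embed   = V * H          # token embeddings
attn    = 4 * H * H      # Q, K, V, O projections (MHA: 40 heads x 128 dims = 5120)
mlp     = 3 * H * I      # SwiGLU: gate, up, and down projections
lm_head = V * H          # untied output projection

total = embed + L * (attn + mlp) + lm_head
print(f"{total / 1e9:.2f}B parameters")  # ~13.28B, i.e. the advertised 13B
```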
Architectural quirks the MLX port reproduces:

- **Custom RoPE** – the rotation is `y1 = x1*cos + x2*sin`, `y2 = -x1*sin + x2*cos`, i.e. rotation by −θ, the inverse of the HF/Llama convention. `mx.fast.rope` is not directly usable; the port ships a small `TalkieRoPE` class (see the sketch after this list).
- **Weightless RMSNorm** – applied at the embedding output, before each attention block, before each MLP block, on the post-RoPE Q and K tensors, and before the final `lm_head`. No learned scale; the reduction runs in fp32, then casts back.
- **Per-head Q gain** – a learnable scalar per attention head, applied to queries after RoPE and Q-norm.
- **Per-layer scalar gains** – `attn_gain` and `mlp_gain` (initialized to `(2L)^-0.5`) scale the residual contributions; `embed_skip` (initialized to `0.0`) scales an extra residual from the post-first-norm embedding into every block.
- **lm_head with weight gain** – stored as a raw `(vocab, hidden)` parameter plus a scalar `lm_head_gain`, folded into a regular `nn.Linear` weight in `sanitize()` so quantization treats it normally.
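A minimal sketch of the inverse-rotation RoPE, assuming a half-split pairing of head dimensions (the actual `TalkieRoPE` in mlx-lm may organize this differently):

```python
import mlx.core as mx

class TalkieRoPE:
    """Rotation by -theta: y1 = x1*cos + x2*sin, y2 = -x1*sin + x2*cos."""

    def __init__(self, dims: int, base: float = 1_000_000.0):
        # Standard RoPE frequencies; only the rotation direction differs.
        self.inv_freq = mx.power(base, -mx.arange(0, dims, 2).astype(mx.float32) / dims)

    def __call__(self, x: mx.array, offset: int = 0) -> mx.array:
        # x: (batch, heads, seq, head_dim); angles computed in fp32.
        positions = mx.arange(offset, offset + x.shape[2]).astype(mx.float32)
        theta = positions[:, None] * self.inv_freq[None, :]
        cos, sin = mx.cos(theta), mx.sin(theta)
        x1, x2 = mx.split(x, 2, axis=-1)  # half-split pairing (assumed)
        # Inverse of the HF/Llama convention (y1 = x1*cos - x2*sin, ...).
        y1 = x1 * cos + x2 * sin
        y2 = -x1 * sin + x2 * cos
        return mx.concatenate([y1, y2], axis=-1).astype(x.dtype)
```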
## Conversion details

These weights were produced by running `mlx_lm.convert` on `lewtun/talkie-1930-13b-it-hf` after adding the new `talkie` model module to mlx-lm. The conversion was generated and validated with the transformers-to-mlx skill.
Numerical agreement vs the upstream transformers model on a 94-token paragraph prompt (CPU, bf16 both sides):
```
Logits diff:     max=2.0000  mean=0.0785  median=0.0625
Top-10 overlap:  10/10 (last position)
Top-1 agreement: 98.9% (across all 94 positions)
```

This is within typical bf16 transformers/MLX disagreement.
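A minimal sketch of that kind of logit comparison, for anyone wanting to reproduce it. The actual harness from the transformers-to-mlx skill may differ; the prompt placeholder and `trust_remote_code=True` are assumptions here:

```python
import numpy as np
import torch
import mlx.core as mx
from mlx_lm import load
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "..."  # the 94-token paragraph prompt

# Upstream transformers logits (bf16, CPU). trust_remote_code is assumed,
# since the architecture is not in transformers/models/.
hf_tok = AutoTokenizer.from_pretrained("lewtun/talkie-1930-13b-it-hf")
hf_model = AutoModelForCausalLM.from_pretrained(
    "lewtun/talkie-1930-13b-it-hf", torch_dtype=torch.bfloat16, trust_remote_code=True
)
ids = hf_tok(PROMPT, return_tensors="pt").input_ids
with torch.no_grad():
    hf_logits = hf_model(ids).logits.float().numpy()[0]

# MLX logits from the bf16 port; calling the model returns (batch, seq, vocab).
mlx_model, _ = load("warshanks/talkie-1930-13b-it-mlx-bf16")
mlx_logits = np.array(mlx_model(mx.array(ids.numpy()))[0].astype(mx.float32))

diff = np.abs(hf_logits - mlx_logits)
print(f"max={diff.max():.4f} mean={diff.mean():.4f} median={np.median(diff):.4f}")
print(f"top-1 agreement: {(hf_logits.argmax(-1) == mlx_logits.argmax(-1)).mean():.1%}")
```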
The 4-bit variants required architecture-aware tuning. Bare q4 produced repetition on long greedy decoding, so two recovery paths are shipped:
- `-mlx-4bit` – mixed-precision recipe via a custom `quant_predicate`. A per-block sensitivity scan (in-memory `mx.quantize` → `mx.dequantize`, then logit MSE vs bf16) flagged blocks 14, 37, and 38 as outliers. Final config: `lm_head=q8`, `embed=bf16`, blocks {14, 37, 38} at q8, all other Linear layers at q4. A sketch of this recipe follows this list.
- `-mlx-4bit-DWQ` – `mlx_lm.dwq` distillation calibration with the default learning rate (1e-6; 512 samples, 512-token sequences, batch 1, gradient checkpointing). 512 iterations, final validation loss 0.037. Beats the mixed-q4 build on long-form generation.
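A hedged sketch of the mixed-precision recipe. The `quant_predicate` hook in recent mlx-lm accepts a callable that returns `False` to skip a layer or a dict of quantization overrides; the module-path strings below assume Llama-style naming and are illustrative:

```python
from mlx_lm import convert

SENSITIVE_BLOCKS = {14, 37, 38}  # flagged by the per-block sensitivity scan

def quant_predicate(path, module, config):
    if "embed_tokens" in path:
        return False                          # keep embeddings in bf16
    if "lm_head" in path:
        return {"bits": 8, "group_size": 64}  # q8 head
    if any(f"layers.{i}." in path for i in SENSITIVE_BLOCKS):
        return {"bits": 8, "group_size": 64}  # q8 for the outlier blocks
    return {"bits": 4, "group_size": 64}      # q4 everywhere else

convert(
    "lewtun/talkie-1930-13b-it-hf",
    mlx_path="talkie-1930-13b-it-mlx-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
    quant_predicate=quant_predicate,
)
```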
`mlx_lm.awq` is not yet supported for `talkie`: the AWQ scaling step requires absorbing an input scale into the upstream norm's weight, but Talkie's RMSNorms have no learned weight.
## License

Apache 2.0, same as upstream.