# Talkie 1930 13B Instruct – MLX

MLX port of `lewtun/talkie-1930-13b-it-hf` for Apple Silicon. Refer to the upstream model card for training data, evaluation, and provenance details; this card covers only the MLX conversion.

Talkie is a 13B instruction-tuned decoder-only transformer whose outputs are styled as pre-1930s English prose. It uses a custom architecture (nonstandard RoPE convention, weightless RMSNorm, per-head and per-layer scalar gains, embedding-skip residuals, scaled `lm_head` weights) that is not currently in `transformers/models/`.
Native Talkie support was added to mlx-lm in PR #1231.
## Variants

| Repo | Quantization | bpw | Approx. size |
|---|---|---|---|
| warshanks/talkie-1930-13b-it-mlx-bf16 | none (bf16) | 16 | 25 GB |
| warshanks/talkie-1930-13b-it-mlx-8bit | affine 8-bit, group 64 | 8.5 | 13 GB |
| warshanks/talkie-1930-13b-it-mlx-6bit | affine 6-bit, group 64 | 6.5 | 10 GB |
| warshanks/talkie-1930-13b-it-mlx-4bit | mixed 4-bit (lm_head=q8, embed=bf16, blocks 14/37/38=q8, rest q4) | 5.18 | 8 GB |
| warshanks/talkie-1930-13b-it-mlx-4bit-DWQ | DWQ-calibrated 4-bit | 4.5 | 7 GB |
For 4-bit, prefer the DWQ build. Bare q4 of this model degrades into repetition on long generations; DWQ calibration recovers clean output (validation loss 0.037 vs. ≈0.25 for bare q4 in our run).
## Installation

```sh
pip install -U mlx-lm
```

Talkie support requires an mlx-lm release that includes PR #1231. Until that release ships, install from source:

```sh
pip install -U git+https://github.com/ml-explore/mlx-lm
```
## Basic generation

```python
from mlx_lm import load, generate

model, tokenizer = load("warshanks/talkie-1930-13b-it-mlx-4bit-DWQ")

messages = [{"role": "user", "content": "Write an essay predicting what life will be like in the year 1960."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```
CLI:

```sh
mlx_lm.generate \
  --model warshanks/talkie-1930-13b-it-mlx-4bit-DWQ \
  --prompt "<|user|>What were the causes of the French Revolution?<|end|><|assistant|>" \
  --max-tokens 512 --temp 0.7
```
## Multi-turn chat

```python
from mlx_lm import load, generate

model, tokenizer = load("warshanks/talkie-1930-13b-it-mlx-4bit-DWQ")

# First turn
messages = [{"role": "user", "content": "What were the causes of the French Revolution?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
reply = generate(model, tokenizer, prompt=prompt, max_tokens=512)

# Feed the reply back into the history for the second turn
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Which of those causes was the most significant?"})
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```
## Chat template

```
<|system|>{system_message}<|end|><|user|>{user_message}<|end|><|assistant|>{assistant_message}<|end|>
```

Applied automatically by `tokenizer.apply_chat_template()`.
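For illustration, with the tokenizer loaded as in the snippets above, a system + user exchange should render like this (the expected output below is read off the template string, not captured from a run):

```python
messages = [
    {"role": "system", "content": "Answer in the manner of the 1920s."},
    {"role": "user", "content": "Good day!"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <|system|>Answer in the manner of the 1920s.<|end|><|user|>Good day!<|end|><|assistant|>
```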
## Architecture (as observed in the source checkpoint and modeling code)
| Component | Value |
|---|---|
| Parameters | 13B |
| Layers | 40 |
| Attention heads | 40 (MHA, no GQA) |
| Hidden size | 5120 |
| Head dimension | 128 |
| Intermediate size (MLP) | 13696 |
| Position encoding | RoPE (θ = 1,000,000), inverse-rotation convention |
| Activation | SwiGLU |
| Normalization | weightless RMSNorm (pre-norm) |
| Context length | 2048 |
| Vocabulary | 65,540 |
| Precision | bfloat16 |
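As a sanity check, the dimensions in the table roughly reproduce the advertised parameter count. A back-of-the-envelope tally (ignoring the scalar gains, and assuming untied embeddings, which the raw `lm_head` parameter described below suggests):

```python
# Rough parameter count from the table above; scalar gains are negligible.
V, H, L, I = 65_540, 5_120, 40, 13_696

embed   = V * H          # token embeddings
attn    = 4 * H * H      # Q, K, V, O projections (MHA: 40 heads x 128 dims = 5120)
mlp     = 3 * H * I      # SwiGLU: gate, up, and down projections
lm_head = V * H          # untied output projection

total = embed + L * (attn + mlp) + lm_head
print(f"{total / 1e9:.2f}B parameters")  # ~13.28B, i.e. the advertised 13B
```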
Architectural quirks the MLX port reproduces:

- **Custom RoPE** – the rotation is `y1 = x1*cos + x2*sin`, `y2 = -x1*sin + x2*cos`, i.e. rotation by −θ, the inverse of the HF/Llama convention. `mx.fast.rope` is not directly usable; the port ships a small `TalkieRoPE` class (see the sketch after this list).
- **Weightless RMSNorm** – applied at the embedding output, before each attention block, before each MLP block, on the post-RoPE Q and K tensors, and before the final `lm_head`. No learned scale; the reduction runs in fp32, then casts back.
- **Per-head Q gain** – a learnable scalar per attention head, applied to queries after RoPE and Q-norm.
- **Per-layer scalar gains** – `attn_gain` and `mlp_gain` (initialized to `(2L)^-0.5`) scale the residual contributions; `embed_skip` (initialized to `0.0`) scales an extra residual from the post-first-norm embedding into every block.
- **lm_head with weight gain** – stored as a raw `(vocab, hidden)` parameter plus a scalar `lm_head_gain`, folded into a regular `nn.Linear` weight in `sanitize()` so quantization treats it normally.
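A minimal sketch of the inverse-rotation RoPE, assuming a half-split pairing of head dimensions (the actual `TalkieRoPE` in mlx-lm may organize this differently):

```python
import mlx.core as mx

class TalkieRoPE:
    """Rotation by -theta: y1 = x1*cos + x2*sin, y2 = -x1*sin + x2*cos."""

    def __init__(self, dims: int, base: float = 1_000_000.0):
        # Standard RoPE frequencies; only the rotation direction differs.
        self.inv_freq = mx.power(base, -mx.arange(0, dims, 2).astype(mx.float32) / dims)

    def __call__(self, x: mx.array, offset: int = 0) -> mx.array:
        # x: (batch, heads, seq, head_dim); angles computed in fp32.
        positions = mx.arange(offset, offset + x.shape[2]).astype(mx.float32)
        theta = positions[:, None] * self.inv_freq[None, :]
        cos, sin = mx.cos(theta), mx.sin(theta)
        x1, x2 = mx.split(x, 2, axis=-1)  # half-split pairing (assumed)
        # Inverse of the HF/Llama convention (y1 = x1*cos - x2*sin, ...).
        y1 = x1 * cos + x2 * sin
        y2 = -x1 * sin + x2 * cos
        return mx.concatenate([y1, y2], axis=-1).astype(x.dtype)
```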
## Conversion details

These weights were produced by running `mlx_lm.convert` on `lewtun/talkie-1930-13b-it-hf` after adding the new `talkie` model module to mlx-lm. The conversion was generated and validated with the transformers-to-mlx skill.
Numerical agreement vs the upstream transformers model on a 94-token paragraph prompt (CPU, bf16 both sides):
```
Logits diff:     max=2.0000  mean=0.0785  median=0.0625
Top-10 overlap:  10/10 (last position)
Top-1 agreement: 98.9% (across all 94 positions)
```

This is within typical bf16 transformers/MLX disagreement.
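A minimal sketch of that kind of logit comparison, for anyone wanting to reproduce it. The actual harness from the transformers-to-mlx skill may differ; the prompt placeholder and `trust_remote_code=True` are assumptions here:

```python
import numpy as np
import torch
import mlx.core as mx
from mlx_lm import load
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "..."  # the 94-token paragraph prompt

# Upstream transformers logits (bf16, CPU). trust_remote_code is assumed,
# since the architecture is not in transformers/models/.
hf_tok = AutoTokenizer.from_pretrained("lewtun/talkie-1930-13b-it-hf")
hf_model = AutoModelForCausalLM.from_pretrained(
    "lewtun/talkie-1930-13b-it-hf", torch_dtype=torch.bfloat16, trust_remote_code=True
)
ids = hf_tok(PROMPT, return_tensors="pt").input_ids
with torch.no_grad():
    hf_logits = hf_model(ids).logits.float().numpy()[0]

# MLX logits from the bf16 port; calling the model returns (batch, seq, vocab).
mlx_model, _ = load("warshanks/talkie-1930-13b-it-mlx-bf16")
mlx_logits = np.array(mlx_model(mx.array(ids.numpy()))[0].astype(mx.float32))

diff = np.abs(hf_logits - mlx_logits)
print(f"max={diff.max():.4f} mean={diff.mean():.4f} median={np.median(diff):.4f}")
print(f"top-1 agreement: {(hf_logits.argmax(-1) == mlx_logits.argmax(-1)).mean():.1%}")
```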
The 4-bit variants required architecture-aware tuning. Bare q4 produced repetition on long greedy decoding, so two recovery paths are shipped:
- `-mlx-4bit` – mixed-precision recipe via a custom `quant_predicate`. A per-block sensitivity scan (in-memory `mx.quantize` → `mx.dequantize`, then logit MSE vs bf16) flagged blocks 14, 37, and 38 as outliers. Final config: `lm_head=q8`, `embed=bf16`, blocks {14, 37, 38} at q8, all other Linear layers at q4. A sketch of this recipe follows this list.
- `-mlx-4bit-DWQ` – `mlx_lm.dwq` distillation calibration with the default learning rate (1e-6; 512 samples, 512-token sequences, batch 1, gradient checkpointing). 512 iterations, final validation loss 0.037. Beats the mixed-q4 build on long-form generation.
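A hedged sketch of the mixed-precision recipe. The `quant_predicate` hook in recent mlx-lm accepts a callable that returns `False` to skip a layer or a dict of quantization overrides; the module-path strings below assume Llama-style naming and are illustrative:

```python
from mlx_lm import convert

SENSITIVE_BLOCKS = {14, 37, 38}  # flagged by the per-block sensitivity scan

def quant_predicate(path, module, config):
    if "embed_tokens" in path:
        return False                          # keep embeddings in bf16
    if "lm_head" in path:
        return {"bits": 8, "group_size": 64}  # q8 head
    if any(f"layers.{i}." in path for i in SENSITIVE_BLOCKS):
        return {"bits": 8, "group_size": 64}  # q8 for the outlier blocks
    return {"bits": 4, "group_size": 64}      # q4 everywhere else

convert(
    "lewtun/talkie-1930-13b-it-hf",
    mlx_path="talkie-1930-13b-it-mlx-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
    quant_predicate=quant_predicate,
)
```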
`mlx_lm.awq` is not yet supported for `talkie`: the AWQ scaling step requires absorbing an input scale into the upstream norm's weight, but Talkie's RMSNorms have no learned weight.
## License

Apache 2.0, same as upstream.