# Talkie 1930 13B Instruct – MLX

MLX port of lewtun/talkie-1930-13b-it-hf for Apple Silicon. Refer to the upstream model card for training-data, evaluation, and provenance details; this card covers only the MLX conversion.

Talkie is a 13B instruction-tuned decoder-only transformer whose outputs are styled as pre-1930s English prose. It uses a custom architecture (nonstandard RoPE convention, weightless RMSNorm, per-head and per-layer scalar gains, embedding-skip residuals, scaled lm_head weights) that is not currently in transformers/models/.

Native Talkie support was added to mlx-lm in PR #1231.

## Variants

| Repo | Quantization | bpw | Approx. size |
| --- | --- | --- | --- |
| warshanks/talkie-1930-13b-it-mlx-bf16 | none (bf16) | 16 | 25 GB |
| warshanks/talkie-1930-13b-it-mlx-8bit | affine 8-bit, group 64 | 8.5 | 13 GB |
| warshanks/talkie-1930-13b-it-mlx-6bit | affine 6-bit, group 64 | 6.5 | 10 GB |
| warshanks/talkie-1930-13b-it-mlx-4bit | mixed 4-bit (lm_head=q8, embed=bf16, blocks 14/37/38=q8, rest q4) | 5.18 | 8 GB |
| warshanks/talkie-1930-13b-it-mlx-4bit-DWQ | DWQ-calibrated 4-bit | 4.5 | 7 GB |

For 4-bit, prefer the DWQ build. Bare q4 of this model degrades into repetition on long generations; DWQ calibration recovers clean output (validation loss 0.037 vs ≈0.25 for bare q4 in our run).

## Installation

```bash
pip install -U mlx-lm
```

Talkie support is in any mlx-lm release that includes PR #1231. Until such a release is out, install from source:

```bash
pip install -U git+https://github.com/ml-explore/mlx-lm
```

## Basic generation

```python
from mlx_lm import load, generate

model, tokenizer = load("warshanks/talkie-1930-13b-it-mlx-4bit-DWQ")

messages = [{"role": "user", "content": "Write an essay predicting what life will be like in the year 1960."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```

CLI:

```bash
mlx_lm.generate \
  --model warshanks/talkie-1930-13b-it-mlx-4bit-DWQ \
  --prompt "<|user|>What were the causes of the French Revolution?<|end|><|assistant|>" \
  --max-tokens 512 --temp 0.7
```

## Multi-turn chat

```python
from mlx_lm import load, generate

model, tokenizer = load("warshanks/talkie-1930-13b-it-mlx-4bit-DWQ")

messages = [{"role": "user", "content": "What were the causes of the French Revolution?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
reply = generate(model, tokenizer, prompt=prompt, max_tokens=512)

messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Which of those causes was the most significant?"})
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```

## Chat template

```
<|system|>{system_message}<|end|><|user|>{user_message}<|end|><|assistant|>{assistant_message}<|end|>
```

Applied automatically by `tokenizer.apply_chat_template()`.
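
If you assemble prompts outside the tokenizer (as the CLI example above does), the string can be built by hand directly from the template. A minimal sketch; the special-token names come verbatim from the template, and the messages are placeholders:

```python
# Manual prompt assembly for a single system + user turn, matching the
# template above with add_generation_prompt=True semantics.
system = "You are a helpful assistant."  # example system message
user = "What were the causes of the French Revolution?"
prompt = f"<|system|>{system}<|end|><|user|>{user}<|end|><|assistant|>"
```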

## Architecture (as observed in the source checkpoint and modeling code)

| Component | Value |
| --- | --- |
| Parameters | 13B |
| Layers | 40 |
| Attention heads | 40 (MHA, no GQA) |
| Hidden size | 5120 |
| Head dimension | 128 |
| Intermediate size (MLP) | 13696 |
| Position encoding | RoPE (θ = 1,000,000), inverse-rotation convention |
| Activation | SwiGLU |
| Normalization | weightless RMSNorm (pre-norm) |
| Context length | 2048 |
| Vocabulary | 65,540 |
| Precision | bfloat16 |

Architectural quirks the MLX port reproduces (several are sketched in code after this list):

- **Custom RoPE** – `y1 = x1*cos + x2*sin`, `y2 = -x1*sin + x2*cos` (rotation by −θ, the inverse of the HF/Llama convention). `mx.fast.rope` is not directly usable; the port ships a small `TalkieRoPE` class.
- **Weightless RMSNorm** – applied at the embedding output, before each attention block, before each MLP block, on the post-RoPE Q and K tensors, and before the final lm_head. No learned scale; the reduction runs in fp32, then casts back.
- **Per-head Q gain** – a learnable scalar per attention head, applied to queries after RoPE and Q-norm.
- **Per-layer scalar gains** – `attn_gain` and `mlp_gain` (initialized to `(2L)^-0.5`) scale the residual contributions; `embed_skip` (initialized to 0.0) scales an extra residual from the post-first-norm embedding into every block.
- **lm_head with weight gain** – stored as a raw `(vocab, hidden)` parameter plus a scalar `lm_head_gain`, folded into a regular `nn.Linear` weight in `sanitize()` so quantization treats it normally.
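
For concreteness, here is a minimal MLX sketch of the RoPE, norm, and residual-gain quirks. The function names are illustrative (the port's real class is `TalkieRoPE`); the split-half pairing of rotary dimensions and the exact placement of `embed_skip` in the residual stream are assumptions. Only the rotation formula, the fp32-reduction norm, and the gain roles come from the notes above.

```python
import math

import mlx.core as mx


def weightless_rms_norm(x: mx.array, eps: float = 1e-6) -> mx.array:
    # No learned scale: reduce in fp32, then cast back to the input dtype.
    h = x.astype(mx.float32)
    h = h * mx.rsqrt(mx.mean(h * h, axis=-1, keepdims=True) + eps)
    return h.astype(x.dtype)


def talkie_rope(x: mx.array, offset: int = 0, theta: float = 1e6) -> mx.array:
    # x: (..., seq, head_dim). The sin terms have the opposite sign to the
    # HF/Llama convention (rotation by -theta), which is why mx.fast.rope
    # does not apply directly. Split-half dimension pairing is assumed.
    D = x.shape[-1]
    pos = mx.arange(offset, offset + x.shape[-2]).astype(mx.float32)
    inv_freq = mx.exp(-(math.log(theta) / D) * mx.arange(0, D, 2).astype(mx.float32))
    angles = pos[:, None] * inv_freq[None, :]  # (seq, D/2)
    cos, sin = mx.cos(angles), mx.sin(angles)
    x1, x2 = x[..., : D // 2], x[..., D // 2 :]
    # y1 = x1*cos + x2*sin,  y2 = -x1*sin + x2*cos
    y = mx.concatenate([x1 * cos + x2 * sin, -x1 * sin + x2 * cos], axis=-1)
    return y.astype(x.dtype)


def block_forward(x, e0, attn, mlp, attn_gain, mlp_gain, embed_skip):
    # Pre-norm block with scalar-gained residuals. e0 is the post-first-norm
    # embedding stream re-injected into every block (placement assumed).
    h = x + attn_gain * attn(weightless_rms_norm(x)) + embed_skip * e0
    return h + mlp_gain * mlp(weightless_rms_norm(h))
```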

## Conversion details

These weights were produced by running `mlx_lm.convert` on lewtun/talkie-1930-13b-it-hf after adding the new `talkie` model module to mlx-lm. The conversion was generated and validated with the transformers-to-mlx skill.
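
For the bf16 build this reduces to the standard CLI invocation, roughly as below (the output path is illustrative; the plain affine quants add the `-q`, `--q-bits`, and `--q-group-size` flags):

```bash
mlx_lm.convert --hf-path lewtun/talkie-1930-13b-it-hf \
  --mlx-path talkie-1930-13b-it-mlx-bf16
```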

Numerical agreement vs the upstream transformers model on a 94-token paragraph prompt (CPU, bf16 both sides):

```
Logits diff:     max=2.0000   mean=0.0785   median=0.0625
Top-10 overlap:  10/10 (last position)
Top-1 agreement: 98.9% (across all 94 positions)
```

This is within typical bf16 transformers/MLX numerical disagreement.
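
A sketch of that check, assuming the upstream checkpoint loads with `trust_remote_code=True` and comparing against the bf16 MLX build; the prompt is a placeholder for the 94-token paragraph:

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import mlx.core as mx
from mlx_lm import load

prompt = "..."  # the 94-token paragraph used for the comparison

# Reference logits from the upstream transformers model (bf16, CPU).
tok = AutoTokenizer.from_pretrained("lewtun/talkie-1930-13b-it-hf")
hf = AutoModelForCausalLM.from_pretrained(
    "lewtun/talkie-1930-13b-it-hf",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # custom architecture, not in transformers/models/
)
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    ref = hf(ids).logits[0].float().numpy()

# Logits from the MLX bf16 conversion on the same token ids.
model, _ = load("warshanks/talkie-1930-13b-it-mlx-bf16")
out = np.array(model(mx.array(ids.numpy()))[0].astype(mx.float32))

diff = np.abs(ref - out)  # stats across all positions
print(f"max={diff.max():.4f}  mean={diff.mean():.4f}  median={np.median(diff):.4f}")
print("top-1 agreement:", (ref.argmax(-1) == out.argmax(-1)).mean())
top10_ref = set(np.argsort(ref[-1])[-10:].tolist())
top10_out = set(np.argsort(out[-1])[-10:].tolist())
print("top-10 overlap (last position):", len(top10_ref & top10_out))
```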

The 4-bit variants required architecture-aware tuning. Bare q4 produced repetition on long greedy decoding, so two recovery paths are shipped:

- **-mlx-4bit** – mixed-precision recipe via a custom `quant_predicate` (a sketch follows this list). A per-block sensitivity scan (in-memory `mx.quantize` → `mx.dequantize`, then logit MSE vs bf16) flagged blocks 14, 37, and 38 as outliers. Final config: lm_head=q8, embed=bf16, blocks {14, 37, 38} at q8, all other Linear layers at q4.
- **-mlx-4bit-DWQ** – `mlx_lm.dwq` distillation calibration with the default learning rate (1e-6, 512 samples, 512-token sequences, batch 1, gradient checkpointing). 512 iterations, final validation loss 0.037. Beats the mixed-q4 build on long-form generation.
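
A sketch of the mixed recipe as a `quant_predicate` passed to `mlx_lm.convert`'s Python API. The module-path strings (`lm_head`, `embed_tokens`, `layers.N.`) are assumptions about the port's parameter naming; the bool-or-dict return convention follows current mlx-lm:

```python
from mlx_lm import convert

Q8 = {"bits": 8, "group_size": 64}

def talkie_mixed_q4(path, module, config):
    # lm_head and the scan-flagged outlier blocks go to 8-bit; embeddings
    # stay unquantized (bf16); everything else gets the default 4-bit.
    if path == "lm_head":
        return Q8
    if "embed_tokens" in path:  # assumed embedding parameter name
        return False
    if any(f"layers.{i}." in path for i in (14, 37, 38)):
        return Q8
    return True  # default 4-bit, group 64

convert(
    "lewtun/talkie-1930-13b-it-hf",
    mlx_path="talkie-1930-13b-it-mlx-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
    quant_predicate=talkie_mixed_q4,
)
```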

`mlx_lm.awq` is not yet supported for Talkie: the AWQ scaling step requires absorbing an input scale into the upstream norm's weight, but Talkie's RMSNorms have no learned weight.

## License

Apache 2.0, same as upstream.
