ConvGPT 164M SYNTH EC 250B TOKENS
This is an Early Checkpoint (EC) of ConvGPT, a novel architecture designed for maximal hidden-size compression.
Model Details
- Architecture: ConvGPT
- Checkpoint Step: 172,000
- Parameters: 163,952,769
- Num layers: 32
- Hidden size: 1296
- Transformer dimension: 144
- Vocab size: 65538
- Intermediate size: 3072
- Num attention heads: 16
- Num kv heads: 8
- Head dim: 128
- Tie word embeddings: True
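For orientation, the hyperparameters above map onto a configuration roughly like the one below. The key names are illustrative assumptions for readability, not the exact schema of the model's config.json.

# Illustrative only: key names are assumptions, not the shipped config.json schema.
convgpt_config = {
    "num_hidden_layers": 32,
    "hidden_size": 1296,        # pre-compression embedding width
    "transformer_dim": 144,     # compressed width used inside the decoder layers
    "vocab_size": 65538,
    "intermediate_size": 3072,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "tie_word_embeddings": True,
}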
Architecture Highlights
ConvGPT introduces a novel approach to Large Language Model compression by integrating 2D convolutional networks directly into the pre-training architecture, rather than relying on post-training quantization or pruning. Designed specifically for Mobile/Edge (SLM) use cases, it achieves significant parameter reduction while maintaining high reasoning capabilities.
- Convolutional Embedding Compression: Unlike standard Transformers that maintain a constant hidden size throughout, ConvGPT uses a Conv2D + average-pooling layer to compress the input hidden-state vector by a factor of 9x (1296 → 144) before it enters the residual stream. This lets the model keep high-dimensional information in the embedding layer and prediction head while operating on a much smaller, more efficient vector in the decoder layers (a rough sketch of this compression path appears after this list).
- Causal Masking in 2D: The architecture implements specialized padding and reshaping mechanisms during the convolution steps to strictly preserve autoregressive causality. This eliminates "token leakage" (look-ahead bias), ensuring the model remains robust during generation and preventing the test-time degradation often seen in naive convolutional language models.
- Extreme Parameter Efficiency:
  - Current Model: 164M parameters (comparable performance to a standard 722M-parameter architecture) - a ~4.4x size reduction.
  - Scaling Potential: The architecture scales efficiently; a configuration with hidden_size=2048 results in just 266M parameters compared to a 1.7B-parameter baseline (a 6.5x reduction).
  - Performance-to-Size Ratio: Trained on 250B tokens (PleIAs/SYNTH), this 164M model achieves >30% on GPQA-Diamond, a significant outlier for its size class, demonstrating that logic and reasoning capabilities can be preserved even with aggressive vector compression.
- Normalization Stability: Includes post-convolution normalization to manage vector value scaling, ensuring training stability and consistent generation output.
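To make the compression and 2D causal-masking ideas above concrete, here is a minimal PyTorch sketch. It is an interpretation of this card's description only: the class name, kernel sizes, single-channel convolution, and LayerNorm choice are assumptions, not the actual modeling_convgpt.py implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvCompressor(nn.Module):
    # Hypothetical sketch of the compression path: a Conv2D over the
    # (sequence, hidden) plane with left-only ("past"-side) padding along the
    # sequence axis, followed by average pooling along the hidden axis
    # (1296 -> 144, i.e. 9x) and a post-convolution normalization.
    def __init__(self, hidden_size=1296, transformer_dim=144, kernel_t=3):
        super().__init__()
        assert hidden_size % transformer_dim == 0
        self.kernel_t = kernel_t
        self.pool = hidden_size // transformer_dim   # 9x compression
        self.conv = nn.Conv2d(1, 1, kernel_size=(kernel_t, 3), padding=(0, 1))
        self.norm = nn.LayerNorm(transformer_dim)    # post-convolution normalization

    def forward(self, hidden_states):                # (batch, seq_len, hidden_size)
        x = hidden_states.unsqueeze(1)               # (B, 1, T, H): treat as a 2D "image"
        # Causal padding: pad only the past side of the sequence axis, so the
        # convolution at position t never sees positions > t (no token leakage).
        x = F.pad(x, (0, 0, self.kernel_t - 1, 0))
        x = self.conv(x)                             # (B, 1, T, H)
        x = F.avg_pool2d(x, kernel_size=(1, self.pool))  # (B, 1, T, H // 9)
        return self.norm(x.squeeze(1))               # (B, T, transformer_dim)

compressor = CausalConvCompressor()
print(compressor(torch.randn(2, 16, 1296)).shape)    # torch.Size([2, 16, 144])

In this sketch the causal constraint comes entirely from the asymmetric padding along the sequence axis; the real implementation may use different kernel shapes, channel counts, reshaping, or normalization.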
Training Details
This model is currently being trained using the Google TPU Research Cloud (TRC).
- Dataset: PleIAs/SYNTH
- Tokens Processed: ~250 Billion
- Hardware: TPUv4-16
- Training Time: ~30 Days
- Effective Batch Size: 512
- Context Length: 4096 tokens
- Learning rate: Phase 1: 1e-3 (first 75B tokens), Phase 2: 1e-4 (remaining 175B tokens)
- Weight decay: Phase 1: 0.0, Phase 2: 0.01
- Optimizer: AdamW
- Precision: BFloat16
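The two-phase schedule above amounts to a simple switch on the token count. The snippet below is only a sketch of that logic under the stated numbers; the actual training loop, framework, and phase-switching mechanics are not published here.

import torch

def make_optimizer(model, tokens_seen):
    # Phase 1: first ~75B tokens  -> lr 1e-3, weight decay 0.0
    # Phase 2: remaining ~175B tokens -> lr 1e-4, weight decay 0.01
    if tokens_seen < 75_000_000_000:
        lr, weight_decay = 1e-3, 0.0
    else:
        lr, weight_decay = 1e-4, 0.01
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)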
Usage
Note: You must use trust_remote_code=True as this model utilizes custom modeling code (modeling_convgpt.py).
import torch
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM

model_id = "mkurman/ConvGPT-SYNTH-250B-EC"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model with custom code trust
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
).eval()

# Stream decoded tokens to stdout as they are generated (keep special tokens visible)
streamer = TextStreamer(tokenizer, skip_prompt=False, skip_special_tokens=False)

# Prepare input
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "what is hypertension?"}],
    tokenize=True,
    return_tensors="pt",
    add_generation_prompt=True,
)
print(f"Input IDs: {input_ids}")

# Generate
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids.to(model.device),
        max_new_tokens=128,
        streamer=streamer,
        use_cache=True,
        # Important: keep repetition_penalty at 1.0 for this early checkpoint
        repetition_penalty=1.0,
    )
You can also find support for vLLM and SGLang in my GitHub repository.
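If the GitHub repository registers the ConvGPT architecture with vLLM, offline inference could look roughly like the following. This is a sketch under that assumption, not a confirmed integration, and it passes a raw prompt rather than applying the chat template.

from vllm import LLM, SamplingParams

# Assumes a ConvGPT registration/plugin from the GitHub repository is installed;
# vanilla vLLM will not recognize this custom architecture on its own.
llm = LLM(model="mkurman/ConvGPT-SYNTH-250B-EC", trust_remote_code=True)
sampling = SamplingParams(max_tokens=128, repetition_penalty=1.0)
outputs = llm.generate(["what is hypertension?"], sampling)
print(outputs[0].outputs[0].text)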
Acknowledgments
This model was trained using Cloud TPUs provided by Google's TPU Research Cloud (TRC) program.
Special thanks to Pierre-Carl Langlais and the PleIAs team for the high-quality SYNTH dataset.