Training, Distilling, and Embedding Tiny Models in Video Games

Community Article · Published January 1, 2026

( In which we teach Stockfish how to trash-talk like a chess park hustler. )

Written by Ben from Yellowjacket Games R&D

Modern gameplay engines play at super-grand-master level and can run on your smart toaster, but do they make entertaining opponents? Typically, no; they have to be dumbed down in some way to match their human counterpart's skill level. They have to be configured to move "slowly" sometimes if you want to emulate human behavior, and getting an engine to play in a "human style" remains one of the highest bars to clear.

The remainder of this essay will center on the game of Chess; future essays will address other game genres like the Card-Battler ( MTG, Manacaster, Hearthstone, Pokemon, etc. ) and the Tabletop Wargame ( WH40k, Littoral Commander, Stratego ). These are only a few of many potential genres that stand to benefit disproportionately from embedded intelligence. By nature, they are logic-heavy, turn-based, and quantified, which offers us a perfect sandbox for benchmarking digital intelligence. This sandbox is part of a bigger "machines playing games" playground that encompasses realtime games as well, but Yellowjacket focuses entirely on turn-based strategy games.

Nothing New Under the Sun or Moon

As one Twitter poster astutely observed, Chess players have been dealing with automation for decades. The game, as always, adapts and evolves as a matter of course, and its players must adapt alongside it or perish. Some of these adaptations are quite clever, and they span a broad spectrum of approaches to humanizing how machines play. An interested individual could spend ages researching these approaches and come out the other side with more questions than answers.

Smol & Mighty, We Are the 1%

I am told by unbiased sources that a certain reputable "Hugging Face" company has trained up a tiny, powerful 3B-param LLM that fits in the majority of gaming PCs' VRAM. As a game studio, we cite the Steam Hardware Survey to approximate that 75% of Steam gamers have at least 6 GB of VRAM. This is enough to serve up a reasonable quant of a 3B model while still leaving a few gigs of VRAM for textures. Turn-based games are not particularly demanding on GPUs, which is part of why they are such a perfect playground for hardware-accelerated intelligence.
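For the curious, the napkin math looks like this ( a sketch; the bytes-per-parameter and overhead figures are our working assumptions for a ~Q4 quant, not vendor specs ):

```python
# Back-of-envelope VRAM budget for a quantized 3B model on a 6 GB card.
PARAMS = 3e9                  # 3B parameters
BYTES_PER_PARAM_Q4 = 0.5      # ~4 bits per weight at a Q4-ish quant
KV_CACHE_GB = 0.5             # modest context window
RUNTIME_OVERHEAD_GB = 0.5     # driver context, scratch buffers, etc.

weights_gb = PARAMS * BYTES_PER_PARAM_Q4 / 1e9
total_gb = weights_gb + KV_CACHE_GB + RUNTIME_OVERHEAD_GB
print(f"model footprint: ~{total_gb:.1f} GB")                # ~2.5 GB
print(f"left for textures on 6 GB: ~{6 - total_gb:.1f} GB")  # ~3.5 GB
```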

Serving the bottom quartile with on-device models is a challenge, especially since the bottom decile are running on an iGPU with 512 MB of, and I use the term loosely, "VRAM". It is tempting to decree these people "GPU Paupers", fit to eat only bread and drink only leftover lukewarm coolant, but this is not in the radically inclusive spirit of Chess. Therefore, we are left with only one choice: make it work on anything.

We can understand "On Anything" to mean "the floor of the Steam Hardware Survey". This is not an exercise in "can we run Doom on a smart toaster"; it's a practical application with a well-defined minimum. This means we must figure out what the literal 1% of worst GPUs are, and ensure we can serve up something instead of nothing.

A Fool's Errand

SmolLM and its cohort are indeed tiny and mighty, but there is a vast difference between the bottom quartile of GPUs and the bottom decile. We had good luck testing these 3B models running at conversational speeds on a modest GTX 1660 Super 6 GB card in a low-end consumer PC. Without overstating, I can say that the performance was ten times faster than I expected on a five-year-old Turing-architecture card, and the quality was on another level compared to 3B models even 12 months ago.
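If you want to run the same kind of smoke test, here is roughly what ours looked like using llama-cpp-python; the GGUF filename is a placeholder, so substitute whichever quantized 3B model you're poking at:

```python
# Smoke test: load a quantized 3B model and time one short chat turn.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/smollm3-3b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=2048,        # a small context keeps the KV cache cheap
    verbose=False,
)

start = time.time()
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Trash-talk my opening move, e4."}],
    max_tokens=64,
)
elapsed = time.time() - start
print(f"{elapsed:.1f}s -> {out['choices'][0]['message']['content']}")
```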

In other words, the vast majority of GPUs in Steam users' PCs can run a 3B model at a good quant and enjoy a high-quality conversation partner using only their own on-premises, offline device, at no cost to the game studio. Even if the studio only needed to serve tiny LLMs to the bottom decile on Steam, any reasonable indie game launch would quickly require a datacenter buildout to support that level of concurrency. We did the math on serving up fleets of tiny LLMs to users and, well, that's a subject for another article. I'll spoil the ending; it's madness. Don't do it.

So once again, we're left with that bottom decile problem; anything we can serve up to 90% of Steam users is by definition unusable for 10% of them. The gap between even 2 GB of VRAM on a discrete GPU and a 512 MB iGPU is immense and cannot be overstated. In summary, trying to get any real LLM onto an iGPU is an errand for a damned fool, to put it politely.

It's Not Rocket Science

The miracle of these tiny LLMs is partly their depth of knowledge and partly their ability to perform multi-stage reasoning. It's like listening to a genius polymath extemporize on their subject of expertise, even if they stumble occasionally, mis-cite something, or somewhat misunderstand the assignment. We should not overlook the fundamental truth that a tiny sliver of shiny hot rock inside a box in your office is talking to you, writing your code, and singing songs for you.

These miracles of technology are, regrettably, totally useless for playing a tabletop strategy game.

When was the last time you played Chess and decided to pause mid-game and discuss rocket science with your opponent? Probably never. If you did, your opponent would probably say "hey, can we focus on the game at hand?" It is exactly this rational and human response that gives us a key insight into the next step, and indeed it is not rocket science; in fact, it involves making sure we never think about rocket science ever again.

Can a Chess Player Compose a Symphony? Can You?

Poor Will Smith. A grown man trash-talked & taunted to tears by a robot. We at Yellowjacket can only aspire to such great heights of machine-driven verbal abuse... but with only 512 MB of questionable "V"RAM to work with, we can't afford to include weights for composing symphonies, engineering rockets, or decoding Linear A. In fact, we probably can't spend a single bit on weights not somehow related to acting out the chosen role.

Enter Distillation: the technique of using a big, smart teacher model to bake a specific subset of its capabilities into a much smaller student, to the exclusion of everything else. In a way, you can consider these to be application-specific models in the way that ASICs were Application-Specific Integrated Circuits designed for only one task.
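Mechanically, the textbook recipe is not exotic: run your domain-specific prompts through both the teacher and the student, then nudge the student's next-token distribution toward the teacher's. A minimal PyTorch sketch of that loss follows; the temperature and mixing weight are illustrative defaults, not tuned values:

```python
# Classic knowledge-distillation loss: soften both distributions with a
# temperature, then penalize the student for diverging from the teacher.
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale to compensate for the temperature's gradient damping
    # Hard targets: plain next-token cross-entropy on chess-only data.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```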

We are fortunate to exist in a regime where the hardware layer of compute is sufficiently abstracted that our "ASICs" are more like application-specific instruction contexts; anything more than one or two degrees of separation from the core competence of the new model is never even brought into its purview. Therefore, a tiny model cannot become "distracted" by attention activations that are too far afield from the primary subject matter.

All of this, and for what?

For the lulz, that's what. No, seriously. Our company is a game studio. We make Games. Games are meant to be Played. Play is meant to be Fun. Laughing aloud is a typical human indicator of enjoyment. Therefore, games are meant to be fun, and people laughing indicates we have done our job successfully. So yes indeed, we are attempting to distill ludicrously small models capable of surface-level strategic reasoning when backed by a non-GPU-driven gameplay engine, and we are optimizing them to taunt you, trash-talk you, offer you insane mid-game wagers, and generally "remember" you.
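Concretely, the division of labor is simple: the engine owns the moves, the tiny model owns the mouth. Here's a sketch with python-chess driving Stockfish; the banter() function is a stand-in for whatever distilled on-device model you've embedded, and we assume a stockfish binary on your PATH:

```python
# The engine plays the game; the tiny LLM only supplies the commentary.
import chess
import chess.engine

def banter(fen: str, move: str) -> str:
    # Stand-in for the distilled on-device model's trash talk.
    return f"{move}? My toaster saw that coming."

board = chess.Board()
board.push_san("e4")  # the human opens

engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # path assumption
reply = engine.play(board, chess.engine.Limit(time=0.05))
board.push(reply.move)

print(banter(board.fen(), reply.move.uci()))
engine.quit()
```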

This has been the promise of "LLMs in Games" for quite some time, but the brutal truth comes down to, as usual in game design, The Maths. The Almighty Maths never lie. To serve up an LLM in a game, you must make one of three choices:

  1. Get it to run on everyone's toaster.
  2. Provide cloud-side offload for the GPU-Poor.
  3. Exclude some bottom % of the GPU-Poor.

Engineering Mantra says you cannot have all three of Good, Cheap, and Fast. The only way to get LLMs as real in-game "characters" is to either 1) get them running at the edge ( Cheap and Good but Not Fast ) or 2) host your own data center ( Fast and Good but Not Cheap ). #3 is not Good, but it is Fast and Cheap, which is why it is excluded from rational consideration. #2 is not practical unless you are already managing fleets of hardware for other reasons. Our studio has a dedicated GPU rig with some A6000s; we're not managing racks here, and we don't intend to start**.

And in the end, nothing even matters ( except Chess )

There is a certain poetry in using a "teacher" LLM to distill a one-dimensional Chess opponent with no strategic abilities whatsoever, optimized purely to be a mid-game convo partner. It feels like one of those things we were promised when those "Metaverse" projects were screaming into the void for relevance, and it seems like something that has been missing in games, broadly.

Now we know why: the bitter truth is that you're either training your own stupidly-small models, running your own data center, or screwing over the GPU-Poor.

The only way to get models small enough to run at the true "edge floor", the true down-down-to-goblin-town iGPUs at the bottom of the Steam hardware survey, is to distill the hell out of an already-tiny model such that its final form has no understanding of a world beyond Chess.

The poetry is self-recursive. Chess itself is an incredible distillation of so many concepts into a tiny ruleset; it only stands to reason that any language model trained to "only talk about Chess" may in fact retain a large amount of its core reasoning abilities by pure coincidence. After enough trash-talk and banter about a game in progress, some actual logic traces of If This Then That may begin to bubble up.

It would be a rather fitting way to humanize a digital opponent; simply never teach it about anything except Chess. Perhaps ignorance truly is bliss.


** We won't be managing server racks, but we may end up starting a Twitch stream called "Jackass Builds" where we revive old hardware to do things you can't believe it can do these days. All your 3060-12's are belong to us!!! Stay tuned for more details.

( Lack Of ) AI DISCLOSURE: This essay was written, edited, and proofread by a human, without LLM assistance. When the LLM can write better than me, I'll let it write for me. Till then, this is 100% human-written slop! Enjoy, and if you don't, you can't blame ChatGPT for this one.
