TurboTensors: Optimizing CPU LLM Performance

I developed an open-source CPU-based inference engine called TurboTensors:

- Written in Python with Numba JIT-compiled kernels

- Optimized memory access patterns, kernel fusion, and separate prefill/decode paths (a simplified fusion example follows this list)

- Significant speedups on low-to-mid-range CPUs for Turkish LLMs (e.g., Kayra-1)
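To make the kernel-fusion point concrete, here is a heavily simplified sketch of the pattern (illustrative only, not the repo's actual code): a matrix-vector product, bias add, and SiLU activation computed in a single Numba-parallel pass, so the pre-activation vector never round-trips through memory between separate kernels.

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True, cache=True)
def fused_matvec_bias_silu(W, x, b, out):
    # One fused pass: matvec + bias + SiLU. The intermediate result stays
    # in a register instead of being written out and re-read by a second kernel.
    rows, cols = W.shape
    for i in prange(rows):              # rows split across CPU threads
        acc = 0.0
        for j in range(cols):
            acc += W[i, j] * x[j]
        acc += b[i]
        out[i] = acc / (1.0 + np.exp(-acc))  # SiLU: acc * sigmoid(acc)
```

The same idea can extend to fusing dequantization into the matmul loop, which is typically where most of the memory-bandwidth savings come from on low-end CPUs.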

Looking for technical feedback on:

1. Kernel fusion and memory alignment strategies (a simplified alignment example follows this list)

2. KV caching and parallel execution optimizations (a simplified KV-cache example follows this list)

3. Real-world applicability for edge or CPU-heavy systems
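To give the discussion something concrete to poke at, here are two simplified sketches; both are illustrative rather than the repo's actual code. For point 1, a common pure-NumPy trick is to over-allocate a byte buffer and slice into it so the data pointer lands on a cache-line/SIMD boundary (the helper name below is hypothetical):

```python
import numpy as np

def aligned_empty(shape, dtype=np.float32, alignment=64):
    """Return an uninitialized array whose data pointer is aligned to
    `alignment` bytes (e.g. a 64-byte cache line), by over-allocating a
    byte buffer and slicing into it."""
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    raw = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-raw.ctypes.data) % alignment          # bytes to next boundary
    return raw[offset:offset + nbytes].view(dtype).reshape(shape)
```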
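For point 2, the kind of structure in question is a preallocated, append-in-place KV cache: prefill does one bulk contiguous write for the whole prompt, and each decode step fills a single position without any reallocation. The sketch below is a hypothetical simplification, not the repo's actual data structure.

```python
import numpy as np

class LayerKVCache:
    """Preallocated key/value cache for one transformer layer."""

    def __init__(self, max_seq, n_heads, head_dim, dtype=np.float32):
        self.k = np.zeros((max_seq, n_heads, head_dim), dtype=dtype)
        self.v = np.zeros((max_seq, n_heads, head_dim), dtype=dtype)
        self.pos = 0  # number of positions currently filled

    def prefill(self, k, v):
        # k, v: (prompt_len, n_heads, head_dim) -- one contiguous bulk write
        t = k.shape[0]
        self.k[:t] = k
        self.v[:t] = v
        self.pos = t

    def append(self, k, v):
        # k, v: (n_heads, head_dim) -- single decode step, no reallocation
        self.k[self.pos] = k
        self.v[self.pos] = v
        self.pos += 1

    def filled(self):
        # Contiguous views over the populated prefix, ready for attention
        return self.k[:self.pos], self.v[:self.pos]
```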

Project link: [sixfingerdev/TurboTensors on GitHub](https://github.com/sixfingerdev/TurboTensors): a maximum-performance CPU inference engine for LLMs, built with Numba-JIT kernels and custom memory management to outperform standard implementations on edge devices, specifically optimized for Turkish LLMs (Kayra).

Any feedback or suggestions are highly appreciated.
