I developed an open-source CPU-based inference engine called TurboTensors:
- Python + Numba JIT
- Optimized memory access, kernel fusion, and separate prefill/decode paths (a minimal fusion sketch follows this list)
- Significant speed improvements on low- to mid-range CPUs for Turkish LLMs (e.g., Kayra-1)
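
To make the fusion point concrete, here is a minimal sketch of the general pattern (the names, shapes, and choice of SiLU activation are illustrative, not TurboTensors' actual API): a decode-path matvec with the bias add and activation fused into a single Numba kernel, with the weight stored transposed so the inner loop streams contiguous memory:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True, cache=True)
def fused_matvec_bias_silu(w_t, x, b, out):
    # w_t: (n, k) weight stored transposed so each output's dot product
    # reads one contiguous row; x: (k,); b, out: (n,)
    n, k = w_t.shape
    for j in prange(n):  # output rows parallelized across cores
        acc = 0.0
        for p in range(k):
            acc += w_t[j, p] * x[p]
        acc += b[j]
        # activation applied in-register: the (n,) pre-activation vector
        # is never written to memory and re-read
        out[j] = acc / (1.0 + np.exp(-acc))

# usage
n, k = 1024, 512
w_t = np.random.rand(n, k)
x = np.random.rand(k)
b = np.random.rand(n)
out = np.empty(n)
fused_matvec_bias_silu(w_t, x, b, out)
```

Fusing the activation avoids a round trip through memory for the intermediate vector, and the transposed layout keeps reads unit-stride; I'm especially curious whether explicit alignment or padding buys anything measurable on top of that under Numba.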
Looking for technical feedback on:
1. Kernel fusion and memory alignment strategies
2. KV caching and parallel execution optimizations (a toy cache layout is sketched after this list)
3. Real-world applicability to edge or CPU-heavy systems
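
On point 2, this is roughly the cache layout I mean, as a toy sketch (plain NumPy, all names hypothetical, simplified from anything production-grade): preallocated per-layer K/V buffers where prefill does one batched write for the whole prompt and decode appends a single position in place, so the hot loop never allocates:

```python
import numpy as np

class KVCache:
    """Preallocated K/V buffers; prefill writes in bulk, decode appends in place."""

    def __init__(self, n_layers, n_heads, max_seq, head_dim, dtype=np.float32):
        shape = (n_layers, n_heads, max_seq, head_dim)
        self.k = np.zeros(shape, dtype=dtype)
        self.v = np.zeros(shape, dtype=dtype)
        self.pos = 0  # number of positions filled so far

    def prefill(self, layer, k_new, v_new):
        # batched write for the whole prompt: k_new is (n_heads, T, head_dim)
        t = k_new.shape[1]
        self.k[layer, :, :t] = k_new
        self.v[layer, :, :t] = v_new
        return t  # caller invokes advance(t) once every layer is filled

    def append(self, layer, k_tok, v_tok):
        # single-token write on the decode path: k_tok is (n_heads, head_dim)
        self.k[layer, :, self.pos] = k_tok
        self.v[layer, :, self.pos] = v_tok

    def advance(self, n=1):
        self.pos += n  # once per generated token, after all layers have run

    def cached(self, layer):
        # contiguous views over the filled region, ready for attention
        return self.k[layer, :, :self.pos], self.v[layer, :, :self.pos]
```

The design choice I'd most like scrutinized: keeping K/V in one contiguous block maximizes locality for sequential attention reads, but a paged layout might parallelize better across heads; opinions welcome.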
Any feedback or suggestions are highly appreciated.