Dev-X25874 - Overview

Pinned Loading

LoRA fine-tuned a 120B LLM to classify LLVM compiler pass interactions. 5.4% → 75.0% accuracy.

Python
GPU-resident persistent kernel with CUDA Dynamic Parallelism. Zero CPU intervention, lock-free task queue.

Cuda
Hybrid KDA+MLA attention architecture with 75% memory reduction and 6x faster long-context inference.

Python
JAX-native reward modelling toolkit for RL fine-tuning of LLMs. Composable rewards, distributed via pjit.

Python
Quantisation-native LLM inference engine for 1-bit and ternary models. All matrix ops run via XNOR+POPCNT, never dequantized.

Rust
High-performance CUDA kernel library — matrix ops, fused attention, parallel reductions. 3+ TFLOPS on Ampere.

Cuda