Dev-X25874 - Overview
Pinned Loading
-
LoRA fine-tuned a 120B LLM to classify LLVM compiler pass interactions. 5.4% → 75.0% accuracy.
Python
-
GPU-resident persistent kernel with CUDA Dynamic Parallelism. Zero CPU intervention, lock-free task queue.
Cuda
-
Hybrid KDA+MLA attention architecture with 75% memory reduction and 6x faster long-context inference.
Python
-
JAX-native reward modelling toolkit for RL fine-tuning of LLMs. Composable rewards, distributed via pjit.
Python
-
Quantisation-native LLM inference engine for 1-bit and ternary models. All matrix ops run via XNOR+POPCNT, never dequantized.
Rust
-
High-performance CUDA kernel library — matrix ops, fused attention, parallel reductions. 3+ TFLOPS on Ampere.
Cuda