Job Description
You will perform experiments to understand GPU internals, find creative solutions to accelerate critical computational sections used in LLM inference, and write optimized GPU kernels accordingly. Then test, profile, and optimize again.
Contribute to our monokernel pipeline, the single persistent GPU program that covers the full decode pass from QKV projection to LM head sampling, across AMD and NVIDIA architectures.
Work on low-level GPU optimization, including impossibly fast grid synchronizations and inter-GPU collectives, as well as GEMM and attention kernels optimized for specific batch sizes and context lengths.
Build profiling infrastructure inside a monokernel, including custom instrumentation, device-timestamp frameworks, and per-stage analysis to translate machine behavior into concrete engineering decisions.
Scale the stack to third-party MoE models such as DeepSeek v4 and Qwen 3 to push generation speed on t...