TritonGym

A Benchmark for Agentic LLM Workflows in Triton GPU Code Generation

Yue Guan*, Yichen Lin*, Xu Zhao, Jianzhu Yao, Xinwei Qiang, Zhongkai Yu, Pramod Viswanath, Yufei Ding, Adnan Aziz

* Equal contribution

TritonGym is a benchmark and orchestration framework for evaluating how well large language models can write performant Triton GPU kernels — not just one-shot, but inside multi-step agent loops with compilation, verification, and iterative refinement. The benchmark spans a maintained operator set, out-of-distribution tasks, and DSL extensions (Gluon & TLX), and standardizes tool access so workflow design and intrinsic model capability can be cleanly separated.

164 operators
3 splits
4 agents

Official Leaderboard

Models are ranked by two metrics. Pass@1 measures correctness: a generated kernel passes when its maximum absolute error against the PyTorch reference is at most 0.01. Perf@1 measures speed: the ratio of oracle Triton latency to generated-kernel latency, averaged over all operators, with failed operators counted as 0. Results are reported for the standard, OOD, and DSL splits and for the full benchmark.

Leaderboard columns: #, Model, Agent, Pass@1, Perf@1.

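Pass@1 on the full benchmark (164 operators) is the fraction of operators, across all splits, for which the generated kernel produces correct outputs. A latency ratio above 1 means the generated kernel is faster than the hand-tuned Triton oracle for that operator; because Perf@1 averages these ratios and counts failures as 0, a Perf@1 above 1 requires speedups large enough to outweigh any failures. Numbers come from the published TritonGym evaluation.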
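The two leaderboard metrics can be sketched as simple aggregations over per-operator results. This is a minimal illustration of the definitions above, not the benchmark's own scoring code: the `OperatorResult` record, the function names, the operator names, and the latencies are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class OperatorResult:
    """Outcome of one generated kernel on one operator (hypothetical record)."""
    name: str
    passed: bool                                   # within tolerance of the reference
    oracle_latency_ms: float                       # latency of the oracle Triton kernel
    generated_latency_ms: Optional[float] = None   # None when the kernel failed


def pass_at_1(results: Sequence[OperatorResult]) -> float:
    """Fraction of operators whose generated kernel is correct."""
    return sum(r.passed for r in results) / len(results)


def perf_at_1(results: Sequence[OperatorResult]) -> float:
    """Mean of oracle/generated latency, with failed operators counted as 0."""
    total = 0.0
    for r in results:
        if r.passed and r.generated_latency_ms:
            total += r.oracle_latency_ms / r.generated_latency_ms
    return total / len(results)


# Illustrative values only: one speedup, one slowdown, one failure.
results = [
    OperatorResult("softmax", True, oracle_latency_ms=1.0, generated_latency_ms=0.8),
    OperatorResult("layernorm", True, oracle_latency_ms=1.0, generated_latency_ms=2.0),
    OperatorResult("matmul", False, oracle_latency_ms=1.0),
]
print(f"Pass@1 = {pass_at_1(results):.3f}")
print(f"Perf@1 = {perf_at_1(results):.3f}")
```

In this toy run the single failure pulls Perf@1 well below 1 even though one kernel beats the oracle, which is why the metric rewards correctness and speed together.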

Dataset

The benchmark spans 164 operators across three splits:

  • Standard — 139 common GPU kernels (matmul, attention, normalization, activations, quantization, …) with reference PyTorch implementations and oracle Triton kernels.
  • OOD — 13 novel operators less likely to be in pre-training corpora.
  • DSL — 12 operators specified in domain-specific languages (Gluon, TLX), evaluating the generalization of agents to new programming abstractions.

Each operator is evaluated on multiple input shapes; correctness uses a max-absolute-error threshold of 0.01 against the PyTorch reference, and performance is the latency ratio against the oracle Triton kernel.
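The correctness criterion can be sketched as follows. This is an assumption-laden illustration: plain Python lists stand in for tensors, the function names are hypothetical, and the rule that a kernel must pass on every input shape is an assumption rather than something stated above.

```python
from typing import Callable, Iterable, List, Sequence

TOLERANCE = 0.01  # max-absolute-error threshold quoted in the text

Kernel = Callable[[Sequence[float]], List[float]]


def max_abs_error(actual: Sequence[float], expected: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two outputs."""
    if len(actual) != len(expected):
        raise ValueError("output size mismatch")
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)


def is_correct(candidate: Kernel, reference: Kernel,
               inputs_per_shape: Iterable[Sequence[float]]) -> bool:
    """True only if the candidate stays within tolerance on every input.

    Requiring all shapes to pass is an assumption made for this sketch.
    """
    return all(
        max_abs_error(candidate(x), reference(x)) <= TOLERANCE
        for x in inputs_per_shape
    )


def reference_relu(xs: Sequence[float]) -> List[float]:
    return [max(x, 0.0) for x in xs]


def close_relu(xs: Sequence[float]) -> List[float]:
    return [max(x, 0.0) + 0.005 for x in xs]   # small error, inside tolerance


def drifting_relu(xs: Sequence[float]) -> List[float]:
    return [max(x, 0.0) + 0.02 for x in xs]    # error exceeds tolerance


# Two inputs of different lengths stand in for "multiple input shapes".
inputs = [[-1.0, 0.5], [2.0, -3.0, 4.0, 0.0]]
print(is_correct(close_relu, reference_relu, inputs))
print(is_correct(drifting_relu, reference_relu, inputs))
```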

Figure: Dataset composition. Operator coverage across the standard, OOD, and DSL splits.

Agentic Workflows

TritonGym standardizes tool access (compilation, verification, profiling) so workflow design can be evaluated independently of the underlying LLM. Four workflows ship out of the box:

One-shot

Single-pass generation from the operator spec — the lower bound on what an LLM can do without feedback.

Geak

Multi-agent pipeline with generator, compiler, verifier, optimizer, and reflector roles.

AlphaEvolve

Iterative refinement using evaluator feedback (up to 5 attempts). Currently the best workflow on the benchmark.
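A feedback-driven refinement loop with a fixed attempt budget might look like the sketch below. It illustrates only the one-line description above and is not the workflow's actual implementation: the `generate` and `evaluate` callables, their signatures, and the choice to stop at the first passing kernel are all assumptions.

```python
from typing import Callable, List, Optional, Tuple

MAX_ATTEMPTS = 5  # attempt budget quoted in the workflow description

# Hypothetical signatures: the generator maps (spec, feedback so far) to kernel
# source, and the evaluator maps kernel source to (passed, feedback message).
Generator = Callable[[str, List[str]], str]
Evaluator = Callable[[str], Tuple[bool, str]]


def refine(spec: str, generate: Generator, evaluate: Evaluator,
           max_attempts: int = MAX_ATTEMPTS) -> Tuple[Optional[str], int]:
    """Regenerate with accumulated feedback until a kernel passes or the budget ends.

    Returns (kernel source or None, number of attempts used).
    """
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        source = generate(spec, feedback)
        passed, message = evaluate(source)
        if passed:
            return source, attempt
        feedback.append(message)  # the next attempt sees every earlier failure
    return None, max_attempts


# Stubs standing in for an LLM call and for the compile/verify tools.
def stub_generate(spec: str, feedback: List[str]) -> str:
    return f"kernel_v{len(feedback)}"


def stub_evaluate(source: str) -> Tuple[bool, str]:
    return source == "kernel_v2", f"{source}: output mismatch"


print(refine("softmax", stub_generate, stub_evaluate))
```

With these stubs the third attempt is the first to pass, so the loop returns after using three of its five attempts.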

Leader

Diff-based iterative agent that proposes incremental code edits across rounds.
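Incremental editing can be illustrated with a small search-and-replace edit format applied round by round. The edit format Leader actually uses is not specified here, so the `Edit` record and `apply_edits` helper below are hypothetical stand-ins, and the kernel text is a toy configuration snippet rather than real Triton code.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Edit:
    """One incremental change: replace a unique snippet with new text (assumed format)."""
    search: str
    replace: str


def apply_edits(source: str, edits: Iterable[Edit]) -> str:
    """Apply edits in order, rejecting any edit whose target is missing or ambiguous."""
    for edit in edits:
        if source.count(edit.search) != 1:
            raise ValueError(f"edit target must match exactly once: {edit.search!r}")
        source = source.replace(edit.search, edit.replace)
    return source


# Toy kernel configuration, edited over two rounds.
kernel = "BLOCK_SIZE = 64\nnum_warps = 4\n"
round_1 = [Edit("BLOCK_SIZE = 64", "BLOCK_SIZE = 128")]
round_2 = [Edit("num_warps = 4", "num_warps = 8")]

for edits in (round_1, round_2):
    kernel = apply_edits(kernel, edits)

print(kernel)
```

Requiring each target to match exactly once keeps a stale or ambiguous edit from silently changing the wrong line, at the cost of rejecting edits that a fuzzier patcher could still apply.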

Figure: Agent workflows. Workflow comparison across the four agents.

Where Models Fail

Figure: Error breakdown. Failure mode breakdown by category.
Figure: Trial trend. Pass-rate trend across iterative attempts.

Paper


@inproceedings{tritongym2026,
  title     = {TritonGym: A Benchmark for Agentic LLM Workflows in Triton GPU Code Generation},
  author    = {Guan, Yue and Lin, Yichen and Zhao, Xu and Yao, Jianzhu and Qiang, Xinwei and Yu, Zhongkai and Viswanath, Pramod and Ding, Yufei and Aziz, Adnan},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
  note      = {Under review}
}