TritonGym

A Benchmark for Agentic LLM Workflows in Triton GPU Code Generation

Yue Guan^*, Yichen Lin^*, Xu Zhao, Jianzhu Yao, Xinwei Qiang, Zhongkai Yu, Pramod Viswanath, Yufei Ding, Adnan Aziz

^* Equal contribution

TritonGym is a benchmark and orchestration framework for evaluating how well large language models can write performant Triton GPU kernels — not just one-shot, but inside multi-step agent loops with compilation, verification, and iterative refinement. The benchmark spans a maintained operator set, out-of-distribution tasks, and DSL extensions (Gluon & TLX), and standardizes tool access so workflow design and intrinsic model capability can be cleanly separated.

—operators

3splits

4agents

—models evaluated

Read Paper (PDF) Code & Dataset

Official Leaderboard

Models are ranked by Pass@1 (correctness, max abs error ≤ 0.01 vs the PyTorch reference) and Perf@1 (oracle Triton latency / generated kernel latency, averaged over all operators with 0 for failures). Click a column header to sort. Toggle the split tabs to compare model behavior across the standard, OOD, DSL, and full benchmarks.

#	Model	Agent	Pass@1 ▼	Perf@1
Loading…

Pass@1 over the full benchmark (164 operators) means correct outputs across all splits. Perf@1 > 1 means the generated kernel is faster than the hand-tuned Triton oracle. Numbers come from the published TritonGym evaluation.

Dataset

The benchmark spans 164 operators across three splits:

Standard — 139 common GPU kernels (matmul, attention, normalization, activations, quantization, …) with reference PyTorch implementations and oracle Triton kernels.
OOD — 13 novel operators less likely to be in pre-training corpora.
DSL — 12 operators specified in domain-specific languages (Gluon, TLX), evaluating the generalization of agents to new programming abstractions.

Each operator is evaluated on multiple input shapes; correctness uses a max-absolute-error threshold of 0.01 against the PyTorch reference, and performance is the latency ratio against the oracle Triton kernel.

Dataset composition — Operator coverage across the standard / OOD / DSL splits.

Agentic Workflows

TritonGym standardizes tool access (compilation, verification, profiling) so workflow design can be evaluated independently of the underlying LLM. Four workflows ship out of the box:

One-shot

Single-pass generation from the operator spec — the lower bound on what an LLM can do without feedback.

Geak

Multi-agent pipeline with generator, compiler, verifier, optimizer, and reflector roles.

AlphaEvolve

Iterative refinement using evaluator feedback (up to 5 attempts). Currently the best workflow on the benchmark.

Leader

Diff-based iterative agent that proposes incremental code edits across rounds.

Agent workflows — Workflow comparison across the four agents.

Where Models Fail

Error breakdown — Failure mode breakdown by category.

Trial trend — Pass-rate trend across iterative attempts.

Paper

Download PDF GitHub

@inproceedings{tritongym2026,
  title     = {TritonGym: A Benchmark for Agentic LLM Workflows in Triton GPU Code Generation},
  author    = {Guan, Yue and Lin, Yichen and Zhao, Xu and Yao, Jianzhu and Qiang, Xinwei and Yu, Zhongkai and Viswanath, Pramod and Ding, Yufei and Aziz, Adnan},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
  note      = {Under review}
}