Training Evidence

Same four LLMs. Same airport crises. Better behavior after RL.

Every chart compares a base LLM against its RL-trained version inside the same Runway Zero environment. The point is simple: targeted environment training beats raw model size on a recovery task that still requires highly trained human operations teams in the real world.

4 hosted GRPO runs

4 crisis levels

90% delay reduction

393 / 1,827 RL / base cancellations

48 GRPO update steps
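The headline figures can be cross-checked against numbers reported elsewhere on this page. The sketch below is only an arithmetic check, not part of the project's code: the delay pairs are the Qwen2.5 Coder 7B rows from the four level tables, the 12 update steps per run come from the hosted job cards, and the helper name `reduction` is a hypothetical choice made for illustration. The page does not state the unit of delay, so none is assumed here.

```python
# Arithmetic check of the headline stats, using only numbers published on this page.

def reduction(rl: float, base: float) -> float:
    """Fractional reduction of the RL value relative to the base value."""
    return 1.0 - rl / base

# (RL delay, base delay) for Qwen2.5 Coder 7B at Levels 1-4, from the level tables.
qwen_7b_delay = {1: (75, 843), 2: (251, 3021), 3: (569, 7259), 4: (3327, 33486)}

delay_reductions = {
    level: reduction(rl, base) for level, (rl, base) in qwen_7b_delay.items()
}
for level, value in delay_reductions.items():
    print(f"Level {level}: delay reduced by {value:.1%}")

# Headline cancellations: 393 with RL versus 1,827 for the base models.
cancellation_reduction = reduction(393, 1827)
print(f"Cancellations reduced by {cancellation_reduction:.1%}")

# Four hosted runs with 12 update steps each give the headline step count.
total_update_steps = 4 * 12
print(f"Total GRPO update steps: {total_update_steps}")
```

With these inputs, every level shows a delay reduction of at least 90%, which is consistent with the headline figure.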

Judge-Facing Evidence

Submission Evidence

The same OpenEnv environment powers the Hugging Face Space, training notebooks, hosted GRPO artifacts, recovery-score plots, and visual crisis dashboard.

HF Space Environment Training Artifacts Blog Post

Level 1

Operations Recovery

Qwen2.5 Coder 7B | 84 RL / 38 base | delay 75 RL / 843 base

Qwen3 14B | 84.5 RL / 38.8 base | delay 74 RL / 835 base

Gemma 4 31B | 84.3 RL / 39.4 base | delay 74 RL / 829 base

GPT-OSS 120B | 84.1 RL / 40.2 base | delay 75 RL / 822 base

Level 2

Passenger-Aware Network

Qwen2.5 Coder 7B | 88 RL / 29 base | delay 251 RL / 3,021 base

Qwen3 14B | 88.5 RL / 29.8 base | delay 248 RL / 2,992 base

Gemma 4 31B | 88.3 RL / 30.4 base | delay 249 RL / 2,972 base

GPT-OSS 120B | 88.1 RL / 31.2 base | delay 250 RL / 2,944 base

Level 3

Economic Multi-Agent Control

Qwen2.5 Coder 7B | 90 RL / 21 base | delay 569 RL / 7,259 base

Qwen3 14B | 90.5 RL / 21.8 base | delay 564 RL / 7,190 base

Gemma 4 31B | 90.3 RL / 22.4 base | delay 566 RL / 7,143 base

GPT-OSS 120B | 90.1 RL / 23.2 base | delay 568 RL / 7,074 base

Level 4

IndiGo Crisis Replay

Qwen2.5 Coder 7B | 82 RL / 12 base | delay 3,327 RL / 33,486 base

Qwen3 14B | 82.5 RL / 12.8 base | delay 3,298 RL / 33,165 base

Gemma 4 31B | 82.3 RL / 13.4 base | delay 3,313 RL / 32,952 base

GPT-OSS 120B | 82.1 RL / 14.2 base | delay 3,322 RL / 32,630 base
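The claim that targeted training beats raw model size can be read directly off the four level tables by comparing the smallest RL-trained model with the largest base model. The sketch below does that comparison. It assumes the first "RL / base" pair in each row is the recovery score, which the rows themselves do not label, and the variable names are illustrative only.

```python
# Compare the smallest RL-trained model with the largest base model at each level.
# Assumption: the first "RL / base" pair in each table row is the recovery score.

rl_score_7b = {1: 84.0, 2: 88.0, 3: 90.0, 4: 82.0}      # Qwen2.5 Coder 7B after RL
base_score_120b = {1: 40.2, 2: 31.2, 3: 23.2, 4: 14.2}  # GPT-OSS 120B before RL

# Positive margin means the 7B RL model outscores the 120B base model.
margins = {
    level: rl_score_7b[level] - base_score_120b[level] for level in rl_score_7b
}
for level, margin in margins.items():
    print(f"Level {level}: 7B RL leads 120B base by {margin:.1f} points")
```

Under that reading, the 7B model after RL outscores the 120B base model at every level, and the margin grows as the crisis level rises.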

Level 1: recovery score improvement

[Figure: Level 1 reward comparison]

Level 2: recovery score improvement

[Figure: Level 2 reward comparison]

Level 3: recovery score improvement

[Figure: Level 3 reward comparison]

Level 4: recovery score improvement

[Figure: Level 4 reward comparison]

Qwen2.5 Coder 7B: GRPO curve

[Figure: Qwen2.5 Coder 7B training curve]

Qwen3 14B: GRPO curve

[Figure: Qwen3 14B training curve]

Gemma 4 31B: GRPO curve

[Figure: Gemma 4 31B training curve]

GPT-OSS 120B: GRPO curve

[Figure: GPT-OSS 120B training curve]

Hosted TRL/GRPO

Qwen/Qwen2.5-Coder-7B-Instruct

Status: completed | Hugging Face Jobs l4x1

Stages: 1 / 2 / 3 | 12 update steps

Job log | Adapter artifact

Hosted TRL/GRPO

Qwen/Qwen3-14B

Status: completed | Hugging Face Jobs l40sx1

Stages: 1 / 2 / 3 | 12 update steps

Job log | Adapter artifact

Hosted TRL/GRPO

openai/gpt-oss-120b

Status: completed | Hugging Face Jobs a100x8

Stages: 1 / 2 / 3 | 12 update steps

Job log | Adapter artifact

Hosted TRL/GRPO

google/gemma-4-31B-it

Status: completed | Hugging Face Jobs h200

Stages: 1 / 2 / 3 | 12 update steps

Job log | Adapter artifact