OpenAI Gym built the wrong abstraction for the wrong era
OpenAI Gym shipped in 2016 and did exactly what it needed to do: give researchers a standard API for training agents on simulated benchmark tasks. CartPole, Atari, MuJoCo. The interface was clean. reset() returns an observation, step(action) returns the next observation plus a reward, and that was enough to unify a decade of RL work under one import.
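For anyone who never wrote against the original API, the whole loop fit in a few lines. This is the classic pre-Gymnasium signature, where step() returns a four-tuple:

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()                       # a NumPy array describing cart and pole
done = False
while not done:
    action = env.action_space.sample()  # a random action from the env's action space
    obs, reward, done, info = env.step(action)
env.close()
```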
But Gym was built around a specific model: one process, one machine, observation spaces defined as NumPy arrays, environments imported as Python classes. That model was correct for the research it was designed to serve. It is the wrong model for what RL is used for now.
When you fine-tune a language model with GRPO, you are already in a distributed setting. Rollouts are happening across multiple GPUs, often across multiple machines, and the reward model may be running somewhere else entirely. In that world, the environment is not really a local Python class. It is a service. Pretending otherwise creates a lot of glue code that feels like engineering debt rather than research.
Gym was an API for a Python process. OpenEnv is an API for a network service. That distinction matters more than it sounds like it should.
The problem gets worse when you try to share environments. With Gym, sharing meant publishing a Python package, pinning versions, and hoping the other person had the same system dependencies. That is a bad foundation for an ecosystem of reusable training tasks. We have something like a hub for datasets, but nothing equivalent for environments, and the packaging model is a big reason why.
Three verbs, a Docker container, and a Hugging Face Space
OpenEnv is not a drop-in replacement for Gymnasium. It does not come with Atari wrappers or continuous control benchmarks. What it gives you instead is a pattern: each environment is a FastAPI server in a Docker container, exposing three endpoints, /reset, /step, and /state, over HTTP and WebSocket. The client is a typed Python class that can talk to that server from anywhere on the network.
The API shape is intentionally close to Gymnasium, so the quick-start feels familiar. But under the hood, it is a different model. You do not import the environment, you connect to it. The observation is not just a NumPy array, it is a validated Pydantic model. The action is not a loose gym.Space sample, it is a typed object. And the server can run anywhere: your laptop, a cloud VM, or a Hugging Face Space.
reset(): starts a new episode and returns a typed Observation plus a session identifier. It also accepts seed and episode_id so runs stay reproducible.
step(action): advances the episode by one action and returns a StepResult with the observation, reward, done flag, and optional metadata. The action is a validated Pydantic model, not a raw array.
state(): returns the current episode state without advancing it. That is useful for debugging, multi-client flows, and benchmark inspection when you do not want to spend a step.
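To make the shape concrete, here is a toy server with the three verbs. This is only an illustrative sketch: OpenEnv's own scaffolding generates this structure for you, and every model and field name below (EchoAction, EchoObservation, the in-memory session dictionary) is made up for the example.

```python
# Illustrative sketch only: a toy environment server exposing the three OpenEnv verbs.
# Model and field names here are hypothetical, not the real OpenEnv scaffolding.
import uuid
from fastapi import FastAPI
from pydantic import BaseModel

class EchoAction(BaseModel):
    message: str

class EchoObservation(BaseModel):
    last_message: str
    steps_taken: int

class StepResult(BaseModel):
    observation: EchoObservation
    reward: float
    done: bool

app = FastAPI()
sessions: dict[str, int] = {}  # session_id -> step counter (the server-side state)

@app.post("/reset")
def reset() -> dict:
    session_id = str(uuid.uuid4())
    sessions[session_id] = 0
    obs = EchoObservation(last_message="", steps_taken=0)
    return {"session_id": session_id, "observation": obs.model_dump()}

@app.post("/step")
def step(session_id: str, action: EchoAction) -> dict:
    sessions[session_id] += 1
    steps = sessions[session_id]
    obs = EchoObservation(last_message=action.message, steps_taken=steps)
    return StepResult(observation=obs, reward=1.0, done=steps >= 10).model_dump()

@app.get("/state")
def state(session_id: str) -> dict:
    # Inspect the episode without advancing it.
    return {"steps_taken": sessions[session_id]}
```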
The typed model is probably the biggest shift in how you design environments. In Gym, the observation was whatever you returned from step(), usually a NumPy array and sometimes a dict. In OpenEnv, you define an Observation class with named Pydantic fields. That class becomes the contract between the environment and any RL framework that connects to it. The server validates before sending, the client validates before handing it to the training loop, and a whole category of silent bugs disappears before it can turn into NaN losses three hours later.
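On the other side of the wire, the client can refuse anything that does not parse into that contract. Again a hypothetical sketch rather than the real generated client, but it shows where the failure surfaces:

```python
# Hypothetical client sketch: parse every response into the typed model before
# handing it to the training loop, so malformed payloads fail at the boundary.
import httpx
from pydantic import BaseModel, ValidationError

class EchoObservation(BaseModel):
    last_message: str
    steps_taken: int

class EchoClient:
    def __init__(self, base_url: str):
        self.http = httpx.Client(base_url=base_url)
        self.session_id: str | None = None

    def reset(self) -> EchoObservation:
        payload = self.http.post("/reset").json()
        self.session_id = payload["session_id"]
        return EchoObservation.model_validate(payload["observation"])

    def step(self, message: str) -> tuple[EchoObservation, float, bool]:
        payload = self.http.post(
            "/step",
            params={"session_id": self.session_id},
            json={"message": message},
        ).json()
        try:
            obs = EchoObservation.model_validate(payload["observation"])
        except ValidationError as err:
            # A schema mismatch surfaces here, not three hours into training.
            raise RuntimeError(f"environment returned a malformed observation: {err}")
        return obs, payload["reward"], payload["done"]
```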
What designing a cyber-range benchmark taught me
Pwned is a deterministic, partially observable cyber-range built on OpenEnv. It has two public suites: a Pentesting suite where agents pivot through a hidden network topology using commands like nmap, ssh, and exfiltrate; and a SecOps suite where agents respond to incident tickets using evidence they must observe and cite. Both suites share the same HTTP transport, the same session model, and the same deterministic reward engine. Building on OpenEnv forced three design decisions I would not have thought to make starting from a Gym subclass.
The session model is not optional when you run concurrent episodes
OpenEnv’s reset() returns a session_id, and every later call includes it. That is the right default because HTTP is stateless. Without a session identifier flowing through every request, you cannot run multiple episodes concurrently against the same server, which is exactly what distributed RL training wants to do. In Gym, state was implicit because it lived inside the environment object. If you wanted two episodes at once, you were reaching for threading locks or separate processes. OpenEnv makes the session boundary explicit and bakes it into the transport, which is why one server can handle hundreds of concurrent episodes cleanly.
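A rough sketch of what that buys you, using an async HTTP client against the toy server above (names are still hypothetical): each coroutine owns its own session_id, so many episodes can interleave against one server with no locks on the trainer side.

```python
# Hedged sketch: run many episodes against one environment server concurrently.
# The endpoints match the toy server above; only the session-per-episode idea matters.
import asyncio
import httpx

async def run_episode(http: httpx.AsyncClient, policy) -> float:
    payload = (await http.post("/reset")).json()
    session_id = payload["session_id"]        # this episode's identity on the server
    obs, total, done = payload["observation"], 0.0, False
    while not done:
        action = policy(obs)                  # policy maps an observation dict to an action payload
        payload = (await http.post(
            "/step", params={"session_id": session_id}, json=action
        )).json()
        obs, done = payload["observation"], payload["done"]
        total += payload["reward"]
    return total

async def collect_rollouts(url: str, policy, n: int) -> list[float]:
    async with httpx.AsyncClient(base_url=url) as http:
        # n episodes in flight at once, multiplexed through their session ids.
        return await asyncio.gather(*[run_episode(http, policy) for _ in range(n)])

# returns = asyncio.run(collect_rollouts("http://localhost:8000", my_policy, 128))
```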
The public/private state boundary is an architectural guarantee, not a naming convention
Pwned has a hidden network topology. The real graph of hosts, subnets, and pivot paths is never sent to the agent. In a local Gym environment, hidden state and observed state usually live in the same Python object. That makes it easy to accidentally log the full state to metrics and leak information the agent should not have. OpenEnv’s client/server boundary gives you a stronger guarantee. PwnedPublicState is what the server serialises. The hidden PwnedEnvironment.state stays in-process and never goes over the wire. Partial observability stops being a convention and becomes part of the architecture.
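Compressed into code, the boundary looks something like this. The class and field names are illustrative rather than Pwned's actual internals; the point is that the hidden object has no serialiser and never touches a response.

```python
# Illustrative only: hidden state lives in a plain in-process object,
# while the wire format is a separate, deliberately smaller Pydantic model.
from pydantic import BaseModel

class HiddenNetworkState:
    """Full topology: hosts, subnets, pivot paths. Never serialised."""
    def __init__(self, topology: dict):
        self.topology = topology
        self.compromised: set[str] = set()

class PublicObservation(BaseModel):
    """Only what the agent has actually discovered so far."""
    visible_hosts: list[str]
    last_command_output: str

def observe(hidden: HiddenNetworkState, output: str) -> PublicObservation:
    # The projection from hidden to public is the one place information crosses over.
    return PublicObservation(
        visible_hosts=sorted(hidden.compromised),
        last_command_output=output,
    )
```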
Evidence minting is a temporal contract, not a data structure
The SecOps suite requires agents to cite evidence by ID in the final report. Evidence entries are minted at observation time, the moment an agent receives a StepResult. That is when observable facts get stamped with deterministic IDs like ev-0001. They cannot be backfilled later. If an agent cites ev-0023 when the episode only produced twelve entries, it loses score. This fits naturally with OpenEnv’s step-based model because each step() call returns one observation, which is the atomic unit of what the agent knows. Minting evidence at that moment gives you a clean record of when each fact became available, so benchmark grading stays reproducible.
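The ledger itself is tiny; what matters is where mint() gets called. A hedged sketch with invented names:

```python
# Hedged sketch: evidence IDs are minted the moment a fact enters an observation,
# in deterministic order, and never created retroactively. Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class EvidenceLedger:
    entries: dict[str, str] = field(default_factory=dict)
    _counter: int = 0

    def mint(self, fact: str) -> str:
        # Called exactly once, at observation time, when the agent first sees the fact.
        self._counter += 1
        evidence_id = f"ev-{self._counter:04d}"
        self.entries[evidence_id] = fact
        return evidence_id

    def cited_validly(self, evidence_id: str) -> bool:
        # Citing an ID that was never minted (ev-0023 after twelve entries)
        # simply fails this check and costs score.
        return evidence_id in self.entries
```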
The HTTP latency trade-off only cuts one way
Classic RL objections to HTTP-based environments are fair. An in-process env.step() call takes microseconds, while an HTTP round-trip takes milliseconds. For Atari or continuous control tasks that need millions of steps, that gap matters. Vectorised environments, with 64 or 256 copies of the same Gym environment on one machine, exist for exactly this reason: they amortise per-step compute and keep the GPU busy.
But post-training RL for language models flips the math. Each "step" in a language model episode means generating tokens from a model with billions of parameters, so you are already spending hundreds of milliseconds or even seconds. In that setting, the HTTP round-trip is background noise. The bottleneck is the model, not the transport. And unlike Atari, where you can vectorise 256 environments on one CPU, language model training already spans multiple GPU nodes. The trainer and the environment need to live on different hosts by default.
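A back-of-envelope comparison with made-up but representative numbers makes the asymmetry obvious:

```python
# Illustrative numbers only: transport overhead in the two regimes.
http_round_trip_ms = 3.0    # assumed network overhead per step
llm_step_ms        = 800.0  # assumed token-generation time per rollout step
atari_step_ms      = 0.05   # assumed in-process env.step() cost (~50 microseconds)

print(f"LLM rollout: {http_round_trip_ms / llm_step_ms:.2%} overhead")            # ~0.38%
print(f"Atari local: {http_round_trip_ms / atari_step_ms:.0f}x the step itself")  # ~60x
```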
OpenEnv RFC 004 adds trajectory-level scoring, where the reward signal is computed over the whole episode instead of at each step. That matches how GRPO actually works: generate a rollout, score it as a whole, then compute the gradient from the outcome. Making this a first-class protocol feature means training frameworks do not need awkward workarounds just to express outcome-based reward.
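I will not guess at the RFC's exact interface, but the shape it enables on the training side is roughly this: collect the whole episode first, then attach a single outcome reward to it. The final_score() call below is hypothetical.

```python
# Conceptual sketch of outcome-based rollouts (not the RFC 004 interface itself):
# play the episode to completion, then score the trajectory as a whole.
def collect_trajectory(client, policy) -> tuple[list, float]:
    obs = client.reset()
    trajectory, done = [], False
    while not done:
        action = policy(obs)
        trajectory.append((obs, action))
        obs, _step_reward, done = client.step(action)
    # One reward for the whole episode, e.g. the environment's final grading,
    # instead of per-step shaping. final_score() is a placeholder name.
    episode_reward = client.final_score()
    return trajectory, episode_reward

# GRPO-style use: sample a group of trajectories for one prompt, score each as a
# whole, and compute advantages from the group's outcome rewards.
```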
The Hugging Face Spaces deployment model pushes this further. Once an OpenEnv environment is on HF Spaces, it is available at a stable URL. A TRL training script on a GPU cluster, an Oumi run on a university machine, or a SkyRL job on a cloud VM can all connect to the same environment without special infrastructure coordination. The person operating the environment does not have to be the same person training the model. Gym never really supported that way of working.
The architectural reckoning
The two models reflect different assumptions about what RL training looks like. Neither is universally better. But one of them matches how frontier model training is actually done.