OpenAI Gym built the wrong abstraction for the wrong era
OpenAI Gym shipped in 2016 and did exactly what it needed to do: give researchers a standard API for training agents on simulated benchmark tasks. CartPole, Atari, MuJoCo. The interface was clean. reset() returns an observation, step(action) returns the next observation plus a reward, and that was enough to unify a decade of RL work under one import.
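For anyone who never wrote against the original API, the whole loop fit in a few lines. This is the classic pre-Gymnasium signature, where step() returns a four-tuple:

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()                       # a NumPy array describing cart and pole
done = False
while not done:
    action = env.action_space.sample()  # a random action from the env's action space
    obs, reward, done, info = env.step(action)
env.close()
```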
But Gym was built around a specific model: one process, one machine, observation spaces defined as NumPy arrays, environments imported as Python classes. That model was correct for the research it was designed to serve. It is the wrong model for what RL is used for now.
When you fine-tune a language model with GRPO, you are already in a distributed setting. Rollouts are happening across multiple GPUs, often across multiple machines, and the reward model may be running somewhere else entirely. In that world, the environment is not really a local Python class. It is a service. Pretending otherwise creates a lot of glue code that feels like engineering debt rather than research.
Gym was an API for a Python process. OpenEnv is an API for a network service. That distinction matters more than it sounds like it should.
The problem gets worse when you try to share environments. With Gym, sharing meant publishing a Python package, pinning versions, and hoping the other person had the same system dependencies. That is a bad foundation for an ecosystem of reusable training tasks. We have something like a hub for datasets, but nothing equivalent for environments, and the packaging model is a big reason why.
Three verbs, a Docker container, and a Hugging Face Space
OpenEnv is not a drop-in replacement for Gymnasium. It does not come with Atari wrappers or continuous control benchmarks. What it gives you instead is a pattern: each environment is a FastAPI server in a Docker container, exposing three endpoints, /reset, /step, and /state, over HTTP and WebSocket. The client is a typed Python class that can talk to that server from anywhere on the network.
The API shape is intentionally close to Gymnasium, so the quick-start feels familiar. But under the hood, it is a different model. You do not import the environment, you connect to it. The observation is not just a NumPy array, it is a validated Pydantic model. The action is not a loose gym.Space sample, it is a typed object. And the server can run anywhere: your laptop, a cloud VM, or a Hugging Face Space.
reset(): starts a new episode and returns a typed Observation plus a session identifier. It also accepts seed and episode_id so runs stay reproducible.
step(action): advances the episode by one action and returns a StepResult with the observation, reward, done flag, and optional metadata. The action is a validated Pydantic model, not a raw array.
state(): returns the current episode state without advancing it. That is useful for debugging, multi-client flows, and benchmark inspection when you do not want to spend a step.
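To make the shape concrete, here is a toy server with the three verbs. This is only an illustrative sketch: OpenEnv's own scaffolding generates this structure for you, and every model and field name below (EchoAction, EchoObservation, the in-memory session dictionary) is made up for the example.

```python
# Illustrative sketch only: a toy environment server exposing the three OpenEnv verbs.
# Model and field names here are hypothetical, not the real OpenEnv scaffolding.
import uuid
from fastapi import FastAPI
from pydantic import BaseModel

class EchoAction(BaseModel):
    message: str

class EchoObservation(BaseModel):
    last_message: str
    steps_taken: int

class StepResult(BaseModel):
    observation: EchoObservation
    reward: float
    done: bool

app = FastAPI()
sessions: dict[str, int] = {}  # session_id -> step counter (the server-side state)

@app.post("/reset")
def reset() -> dict:
    session_id = str(uuid.uuid4())
    sessions[session_id] = 0
    obs = EchoObservation(last_message="", steps_taken=0)
    return {"session_id": session_id, "observation": obs.model_dump()}

@app.post("/step")
def step(session_id: str, action: EchoAction) -> dict:
    sessions[session_id] += 1
    steps = sessions[session_id]
    obs = EchoObservation(last_message=action.message, steps_taken=steps)
    return StepResult(observation=obs, reward=1.0, done=steps >= 10).model_dump()

@app.get("/state")
def state(session_id: str) -> dict:
    # Inspect the episode without advancing it.
    return {"steps_taken": sessions[session_id]}
```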
The typed model is probably the biggest shift in how you design environments. In Gym, the observation was whatever you returned from step(), usually a NumPy array and sometimes a dict. In OpenEnv, you define an Observation class with named Pydantic fields. That class becomes the contract between the environment and any RL framework that connects to it. The server validates before sending, the client validates before handing it to the training loop, and a whole category of silent bugs disappears before it can turn into NaN losses three hours later.
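On the other side of the wire, the client can refuse anything that does not parse into that contract. Again a hypothetical sketch rather than the real generated client, but it shows where the failure surfaces:

```python
# Hypothetical client sketch: parse every response into the typed model before
# handing it to the training loop, so malformed payloads fail at the boundary.
import httpx
from pydantic import BaseModel, ValidationError

class EchoObservation(BaseModel):
    last_message: str
    steps_taken: int

class EchoClient:
    def __init__(self, base_url: str):
        self.http = httpx.Client(base_url=base_url)
        self.session_id: str | None = None

    def reset(self) -> EchoObservation:
        payload = self.http.post("/reset").json()
        self.session_id = payload["session_id"]
        return EchoObservation.model_validate(payload["observation"])

    def step(self, message: str) -> tuple[EchoObservation, float, bool]:
        payload = self.http.post(
            "/step",
            params={"session_id": self.session_id},
            json={"message": message},
        ).json()
        try:
            obs = EchoObservation.model_validate(payload["observation"])
        except ValidationError as err:
            # A schema mismatch surfaces here, not three hours into training.
            raise RuntimeError(f"environment returned a malformed observation: {err}")
        return obs, payload["reward"], payload["done"]
```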
What designing a cyber-range benchmark taught me
Pwned is a deterministic, partially observable cyber-range built on OpenEnv. It has two public suites: a Pentesting suite where agents pivot through a hidden network topology using commands like nmap, ssh, and exfiltrate; and a SecOps suite where agents respond to incident tickets using evidence they must observe and cite. Both suites share the same HTTP transport, the same session model, and the same deterministic reward engine. Building on OpenEnv forced three design decisions I would not have thought to make starting from a Gym subclass.
The session model is not optional when you run concurrent episodes
OpenEnv’s reset() returns a session_id, and every later call includes it. That is the right default because HTTP is stateless. Without a session identifier flowing through every request, you cannot run multiple episodes concurrently against the same server, which is exactly what distributed RL training wants to do. In Gym, state was implicit because it lived inside the environment object. If you wanted two episodes at once, you were reaching for threading locks or separate processes. OpenEnv makes the session boundary explicit and bakes it into the transport, which is why one server can handle hundreds of concurrent episodes cleanly.
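A rough sketch of what that buys you, using an async HTTP client against the toy server above (names are still hypothetical): each coroutine owns its own session_id, so many episodes can interleave against one server with no locks on the trainer side.

```python
# Hedged sketch: run many episodes against one environment server concurrently.
# The endpoints match the toy server above; only the session-per-episode idea matters.
import asyncio
import httpx

async def run_episode(http: httpx.AsyncClient, policy) -> float:
    payload = (await http.post("/reset")).json()
    session_id = payload["session_id"]        # this episode's identity on the server
    obs, total, done = payload["observation"], 0.0, False
    while not done:
        action = policy(obs)                  # policy maps an observation dict to an action payload
        payload = (await http.post(
            "/step", params={"session_id": session_id}, json=action
        )).json()
        obs, done = payload["observation"], payload["done"]
        total += payload["reward"]
    return total

async def collect_rollouts(url: str, policy, n: int) -> list[float]:
    async with httpx.AsyncClient(base_url=url) as http:
        # n episodes in flight at once, multiplexed through their session ids.
        return await asyncio.gather(*[run_episode(http, policy) for _ in range(n)])

# returns = asyncio.run(collect_rollouts("http://localhost:8000", my_policy, 128))
```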
The public/private state boundary is an architectural guarantee, not a naming convention
Pwned has a hidden network topology. The real graph of hosts, subnets, and pivot paths is never sent to the agent. In a local Gym environment, hidden state and observed state usually live in the same Python object. That makes it easy to accidentally log the full state to metrics and leak information the agent should not have. OpenEnv’s client/server boundary gives you a stronger guarantee. PwnedPublicState is what the server serialises. The hidden PwnedEnvironment.state stays in-process and never goes over the wire. Partial observability stops being a convention and becomes part of the architecture.
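Compressed into code, the boundary looks something like this. The class and field names are illustrative rather than Pwned's actual internals; the point is that the hidden object has no serialiser and never touches a response.

```python
# Illustrative only: hidden state lives in a plain in-process object,
# while the wire format is a separate, deliberately smaller Pydantic model.
from pydantic import BaseModel

class HiddenNetworkState:
    """Full topology: hosts, subnets, pivot paths. Never serialised."""
    def __init__(self, topology: dict):
        self.topology = topology
        self.compromised: set[str] = set()

class PublicObservation(BaseModel):
    """Only what the agent has actually discovered so far."""
    visible_hosts: list[str]
    last_command_output: str

def observe(hidden: HiddenNetworkState, output: str) -> PublicObservation:
    # The projection from hidden to public is the one place information crosses over.
    return PublicObservation(
        visible_hosts=sorted(hidden.compromised),
        last_command_output=output,
    )
```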
Evidence minting is a temporal contract, not a data structure
The SecOps suite requires agents to cite evidence by ID in the final report. Evidence entries are minted at observation time, the moment an agent receives a StepResult. That is when observable facts get stamped with deterministic IDs like ev-0001. They cannot be backfilled later. If an agent cites ev-0023 when the episode only produced twelve entries, it loses score. This fits naturally with OpenEnv’s step-based model because each step() call returns one observation, which is the atomic unit of what the agent knows. Minting evidence at that moment gives you a clean record of when each fact became available, so benchmark grading stays reproducible.
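The ledger itself is tiny; what matters is where mint() gets called. A hedged sketch with invented names:

```python
# Hedged sketch: evidence IDs are minted the moment a fact enters an observation,
# in deterministic order, and never created retroactively. Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class EvidenceLedger:
    entries: dict[str, str] = field(default_factory=dict)
    _counter: int = 0

    def mint(self, fact: str) -> str:
        # Called exactly once, at observation time, when the agent first sees the fact.
        self._counter += 1
        evidence_id = f"ev-{self._counter:04d}"
        self.entries[evidence_id] = fact
        return evidence_id

    def cited_validly(self, evidence_id: str) -> bool:
        # Citing an ID that was never minted (ev-0023 after twelve entries)
        # simply fails this check and costs score.
        return evidence_id in self.entries
```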
The HTTP latency trade-off only cuts one way
Classic RL objections to HTTP-based environments are fair. An in-process env.step() call takes microseconds, while an HTTP round-trip takes milliseconds. For Atari or continuous control tasks that need millions of steps, that gap matters. Vectorised environments, with 64 or 256 copies of the same Gym environment on one machine, exist for exactly this reason: they amortise per-step compute and keep the GPU busy.
But post-training RL for language models flips the math. Each "step" in a language model episode means generating tokens from a model with billions of parameters, so you are already spending hundreds of milliseconds or even seconds. In that setting, the HTTP round-trip is background noise. The bottleneck is the model, not the transport. And unlike Atari, where you can vectorise 256 environments on one CPU, language model training already spans multiple GPU nodes. The trainer and the environment need to live on different hosts by default.
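A back-of-envelope comparison with made-up but representative numbers makes the asymmetry obvious:

```python
# Illustrative numbers only: transport overhead in the two regimes.
http_round_trip_ms = 3.0    # assumed network overhead per step
llm_step_ms        = 800.0  # assumed token-generation time per rollout step
atari_step_ms      = 0.05   # assumed in-process env.step() cost (~50 microseconds)

print(f"LLM rollout: {http_round_trip_ms / llm_step_ms:.2%} overhead")            # ~0.38%
print(f"Atari local: {http_round_trip_ms / atari_step_ms:.0f}x the step itself")  # ~60x
```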
OpenEnv RFC 004 adds trajectory-level scoring, where the reward signal is computed over the whole episode instead of at each step. That matches how GRPO actually works: generate a rollout, score it as a whole, then compute the gradient from the outcome. Making this a first-class protocol feature means training frameworks do not need awkward workarounds just to express outcome-based reward.
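I will not guess at the RFC's exact interface, but the shape it enables on the training side is roughly this: collect the whole episode first, then attach a single outcome reward to it. The final_score() call below is hypothetical.

```python
# Conceptual sketch of outcome-based rollouts (not the RFC 004 interface itself):
# play the episode to completion, then score the trajectory as a whole.
def collect_trajectory(client, policy) -> tuple[list, float]:
    obs = client.reset()
    trajectory, done = [], False
    while not done:
        action = policy(obs)
        trajectory.append((obs, action))
        obs, _step_reward, done = client.step(action)
    # One reward for the whole episode, e.g. the environment's final grading,
    # instead of per-step shaping. final_score() is a placeholder name.
    episode_reward = client.final_score()
    return trajectory, episode_reward

# GRPO-style use: sample a group of trajectories for one prompt, score each as a
# whole, and compute advantages from the group's outcome rewards.
```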
The Hugging Face Spaces deployment model pushes this further. Once an OpenEnv environment is on HF Spaces, it is available at a stable URL. A TRL training script on a GPU cluster, an Oumi run on a university machine, or a SkyRL job on a cloud VM can all connect to the same environment without special infrastructure coordination. The person operating the environment does not have to be the same person training the model. Gym never really supported that way of working.
The architectural reckoning
The two models reflect different assumptions about what RL training looks like. Neither is universally better. But one of them matches how frontier model training is actually done.