rlstructures, Tutorial 5: A2C with Generalized Advantage Estimation

Ludovic Denoyer
Mar 1, 2021

Based on rlstructures v0.2

In this tutorial, we describe how to implement actor-critic methods with rlstructures. More precisely, we explain:

  • how to use auto-reset environments to avoid wasting computation time;
  • how the actor-critic loss can be computed with recurrent architectures.

Recurrent Architectures and Policies

A first step is to implement the underlying model and the corresponding agent. In our case, the model is a classic recurrent model that outputs, at each timestep, both action probabilities and a critic value. Note that the critic and action models share the same internal state.
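As an illustration, here is a minimal sketch of such a model. The layer sizes, the use of a GRU cell, and the names are our own choices for this sketch, not necessarily the ones of the original code:

```python
import torch
import torch.nn as nn

class RecurrentActorCriticModel(nn.Module):
    """Recurrent model producing action probabilities and a critic value
    from a shared internal state (a sketch, not the exact tutorial code)."""

    def __init__(self, n_observations, n_hidden, n_actions):
        super().__init__()
        # the GRU cell maintains the internal state shared by actor and critic
        self.rnn = nn.GRUCell(n_observations, n_hidden)
        self.action_head = nn.Linear(n_hidden, n_actions)
        self.critic_head = nn.Linear(n_hidden, 1)

    def forward(self, frame, state):
        # frame: B x n_observations, state: B x n_hidden
        new_state = self.rnn(frame, state)
        action_probabilities = torch.softmax(self.action_head(new_state), dim=-1)
        critic = self.critic_head(new_state).squeeze(-1)
        return action_probabilities, critic, new_state
```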

Based on this model, and as illustrated in the REINFORCE tutorial, we can define an agent working in both stochastic and deterministic modes. In that case, the agent state is a vector. We also include the agent_step information in the agent state to illustrate how more complex internal states can be implemented.

Note that the critic value is not computed by the agent since it is not needed for inference; it is only used at loss computation time.
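As a rough sketch, such an agent could be written as follows. The Agent API used here (the arguments of __call__ and the returned tuple) is the one we assume from the previous tutorials, and the observation field name "frame" is also an assumption:

```python
import torch
from rlstructures import Agent, DictTensor

class RecurrentA2CAgent(Agent):
    """A recurrent agent sketch working in stochastic or deterministic mode.
    The critic head of the model is not evaluated at inference time."""

    def __init__(self, model=None, n_actions=None):
        super().__init__()
        self.model = model            # a RecurrentActorCriticModel (see above)
        self.n_actions = n_actions

    def __call__(self, state, observation, agent_info=None, history=None):
        B = observation.n_elems()

        if agent_info is None:
            # default: stochastic mode for every environment instance
            agent_info = DictTensor({"stochastic": torch.ones(B).bool()})

        if state is None:
            # initial internal state: the recurrent vector and the agent_step counter
            state = DictTensor({
                "agent_state": torch.zeros(B, self.model.rnn.hidden_size),
                "agent_step": torch.zeros(B).long(),
            })

        frame = observation["frame"]
        action_probabilities, _, new_agent_state = self.model(frame, state["agent_state"])

        # stochastic mode samples the action, deterministic mode takes the argmax
        sampled = torch.distributions.Categorical(action_probabilities).sample()
        greedy = action_probabilities.argmax(dim=-1)
        action = torch.where(agent_info["stochastic"], sampled, greedy)

        new_state = DictTensor({
            "agent_state": new_agent_state,
            "agent_step": state["agent_step"] + 1,
        })
        output = DictTensor({
            "action": action,
            "action_probabilities": action_probabilities,
        })
        return state, output, new_state
```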

Environment Creation

In A2C, we do not need to sample complete episodes; we can just sample T timesteps at each acquisition step. Moreover, an interesting aspect is to work with environments that automatically restart when an episode is finished, so that we do not acquire trajectories of different lengths. To do so, we use the GymEnvInf wrapper, which implements the auto-reset functionality: it models n_envs single gym instances that auto-reset and thus never stop.
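For instance, with CartPole as a placeholder environment (the environment name and the helper name are only for illustration), the training environment can be built like this:

```python
import gym
from gym.wrappers import TimeLimit
from rlstructures.env_wrappers import GymEnvInf

def create_train_env(n_envs=4, max_episode_steps=100, seed=0):
    """Build n_envs gym instances wrapped into a single auto-resetting
    GymEnvInf (a sketch; the exact wrapper arguments may differ)."""
    envs = []
    for _ in range(n_envs):
        e = gym.make("CartPole-v0")
        e = TimeLimit(e, max_episode_steps=max_episode_steps)
        envs.append(e)
    return GymEnvInf(envs, seed=seed)
```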

Learning Batcher

As usual, we need to create the learning batcher in charge of sampling T timesteps. In our case, the value of T is self.config["a2c_timesteps"] and the batcher is created as follows:
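A sketch of this creation is given below, written as a standalone helper for readability (the Batcher argument names are the ones we remember from the previous tutorials and should be checked against the rlstructures source):

```python
from rlstructures.batchers import Batcher

def build_train_batcher(config, create_agent, agent_args, create_env, env_args):
    """Create the learning batcher sampling T = config["a2c_timesteps"] steps
    at each acquisition (argument names are assumptions)."""
    return Batcher(
        n_timesteps=config["a2c_timesteps"],
        n_slots=config["n_envs"] * config["n_threads"],
        create_agent=create_agent,
        agent_args=agent_args,
        create_env=create_env,
        env_args=env_args,
        n_threads=config["n_threads"],
        seed=config["env_seed"],
    )
```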

Note that the evaluation batcher still works on full episodes (without auto-reset) and makes use of a different environment creation method (see the source code) that uses the GymEnv wrapper instead of GymEnvInf.

Learning Loop

The learning loop is very similar to the REINFORCE loop, with one main difference: the batcher is reset only once, at the beginning, since we are using auto-reset environments (and thus we do not need to reset the environments at each acquisition step).
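A sketch of this loop is given below, assuming the model, batcher and optimizer built above and the loss and replay helpers sketched later in this post. The reset/update/execute/get method names follow the batcher API of the previous tutorials, and the exact arguments and return values may differ:

```python
import torch
from rlstructures import DictTensor

n_slots = config["n_envs"] * config["n_threads"]

# the batcher is reset only once: the environments auto-reset by themselves
train_batcher.reset(agent_info=DictTensor({"stochastic": torch.ones(n_slots).bool()}))

for epoch in range(config["max_epochs"]):
    # send the current model parameters to the acquisition processes
    train_batcher.update(learning_model.state_dict())

    # acquire T timesteps from the auto-reset environments (no reset here)
    train_batcher.execute()
    trajectories = train_batcher.get(blocking=True)

    # compute the A2C loss on these trajectories (see below) and update the model
    loss = a2c_loss(replay_trajectories(agent, trajectories), trajectories, config)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```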

Actor-Critic Loss

When computing the A2C loss, given one transition (s, a, s'), we need to compute the critic values V(s) and V(s'). This cannot be obtained by just replaying the agent as in REINFORCE, since we need to compute more variables than what was computed by the agent at inference time. To do so, rlstructures allows one to define a replay function that is different from the agent's __call__ function. In the Agent declaration, let us add the following method:

The replay function used for A2C
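A minimal sketch of such a method, added to the agent class above, is shown below. The method name replay, the temporal_index helper and the trajectory field names ("observation/frame", "_observation/frame") are assumptions; the exact conventions are in the rlstructures source code:

```python
import torch
from rlstructures import Agent, DictTensor

class RecurrentA2CAgent(Agent):
    # ... __call__ as defined previously ...

    def replay(self, trajectories, t, state):
        """Recompute, for time index t, the variables needed by the A2C loss:
        V(s), V(s') and the action probabilities (a sketch)."""
        info = trajectories.temporal_index(t)     # variables of all slots at time t
        B = info.n_elems()

        if state is None:                         # t == 0: fresh internal state
            state = DictTensor({
                "agent_state": torch.zeros(B, self.model.rnn.hidden_size)
            })

        frame = info["observation/frame"]
        next_frame = info["_observation/frame"]

        # V(s) and the action distribution from the shared recurrent state
        action_probabilities, critic, new_agent_state = self.model(frame, state["agent_state"])
        # V(s') from the updated recurrent state (no gradient needed on the target)
        with torch.no_grad():
            _, next_critic, _ = self.model(next_frame, new_agent_state)

        values = DictTensor({
            "critic": critic,
            "_critic": next_critic,
            "action_probabilities": action_probabilities,
        })
        new_state = DictTensor({"agent_state": new_agent_state})
        return values, new_state
```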

This function takes as an input a trajectories variable corresponding to the trajectories returned by the batcher. The argument t is the time index in these trajectories, and state is the internal state at time t-1, or None if t == 0.

Like the __call__ function, this function outputs two DictTensors: the first one contains the computed values (the critic values V(s) and V(s') in our case) and the second one is the updated agent state.

Now, in the learning loop, this function can be used as a replay function through:
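The real code calls into rlstructures' replay mechanism; the hand-rolled sketch below shows what that call amounts to (the lengths attribute and the TemporalDictTensor constructor used here are assumptions):

```python
import torch
from rlstructures import TemporalDictTensor

def replay_trajectories(agent, trajectories):
    """Replay complete trajectories through the agent's replay method and
    stack the per-timestep outputs into a single TemporalDictTensor
    (a sketch of what the replay call in the learning loop does)."""
    T = trajectories.lengths.max().item()   # all slots have the same length here
    outputs, state = [], None
    for t in range(T):
        values, state = agent.replay(trajectories, t, state)
        outputs.append(values)

    # stack each field along a time dimension: B x T x ...
    stacked = {
        key: torch.stack([o[key] for o in outputs], dim=1)
        for key in outputs[0].keys()
    }
    return TemporalDictTensor(stacked, lengths=trajectories.lengths)
```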

As an output, the replayed variable is a TemporalDictTensor containing the "critic", "_critic" and "action_probabilities" fields produced by our replay function.

Then the loss can be easily computed:
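The sketch below shows the shape such a loss can take. The GAE helper here is a generic implementation, not the get_gae method of the library, and the trajectory field names ("_observation/reward", "_observation/done", "action/action") as well as the coefficient names in config are assumptions:

```python
import torch

def gae(critic, next_critic, reward, done, discount_factor=0.95, gae_coef=0.9):
    """Generalized Advantage Estimation over B x T tensors, resetting the
    accumulator at episode boundaries (generic sketch)."""
    td = reward + discount_factor * next_critic * (1.0 - done.float()) - critic
    advantage = torch.zeros_like(td)
    acc = torch.zeros_like(td[:, 0])
    for t in reversed(range(td.size(1))):
        acc = td[:, t] + discount_factor * gae_coef * (1.0 - done[:, t].float()) * acc
        advantage[:, t] = acc
    return advantage

def a2c_loss(replayed, trajectories, config):
    """A2C loss with GAE (a sketch; field and coefficient names are assumptions)."""
    critic = replayed["critic"]                        # B x T, V(s)
    next_critic = replayed["_critic"]                  # B x T, V(s')
    probabilities = replayed["action_probabilities"]   # B x T x n_actions
    action = trajectories["action/action"]             # B x T
    reward = trajectories["_observation/reward"]       # B x T
    done = trajectories["_observation/done"]           # B x T

    advantage = gae(critic, next_critic, reward, done,
                    config["discount_factor"], config["gae_coef"])

    log_probabilities = torch.log(
        probabilities.gather(2, action.unsqueeze(-1)).squeeze(-1)
    )
    entropy = torch.distributions.Categorical(probabilities).entropy()

    policy_loss = -(log_probabilities * advantage.detach()).mean()
    critic_loss = (advantage ** 2).mean()
    return (policy_loss
            + config["critic_coef"] * critic_loss
            - config["entropy_coef"] * entropy.mean())
```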

This loss function makes use of the get_gae method, not described in this tutorial. The complete code can be found in rlstructures/rlalgos/a2c_gae.

Note that the replay function we declared also works on GPU, allowing the loss to be computed on GPU. More information about how to use GPUs is given in another tutorial.

Ludovic Denoyer

Research Scientist at Facebook/FAIR -- publications are my own