rlstructures, Tutorial 2: Understanding the library

Based on rlstructures v0.2

In this article, we detail the different concepts used by rlstructures that will allow anyone to implement their own RL algorithms. The concepts are:

  1. Data structures: rlstructures provides two main data structures that are used everywhere, namely DictTensor and TemporalDictTensor (and Trajectories, which is just a pair of one DictTensor and one TemporalDictTensor)
  2. Agent API: the agent API allows one to implement policies acting on a batch of environments in a simple way
  3. Batcher principles: rlstructures provides a Batcher class that executes multiple policies over multiple environments using multiple processes; it is a key element of the library that allows easy scaling. We will show you how to use batchers in this tutorial.

Data structures: DictTensor and TemporalDictTensor

Agents and Environments interact by exchanging DictTensor objects, while a Batcher additionally builds a TemporalDictTensor to store the trace of their interactions.

DictTensor

A DictTensor is a dictionary of tensors with two constraints:

  • the first dimension of all the tensors in the dictionary is the batch dimension, and all the tensors must have the same first dimension size, accessible through d.n_elems()
  • all the tensors must be on the same device, so that d.device() returns the device of the dictionary; d.to(…) allows one to move the whole dictionary to a different device

Each tensor in a DictTensor is indexed by a key such that d[key] returns the corresponding tensor.

Figure 1: Example of use of DictTensor

Note that, by convention, the tensors can be organized in a simple hierarchy by using prefixes as illustrated in the figure.

Note: a DictTensor can be empty (a=DictTensor({})), in which case a.empty()==True and a.device()==None.
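To make this API concrete, here is a minimal sketch; it assumes, as in the tutorial files, that DictTensor can be imported directly from the rlstructures package.

```python
import torch
from rlstructures import DictTensor  # assumed import path

# A batch of 3 elements: every tensor shares the same first (batch) dimension.
d = DictTensor({
    "x": torch.zeros(3, 5),
    "agent/timestep": torch.zeros(3, dtype=torch.int64),  # prefix-based hierarchy
})

print(d.n_elems())    # 3 -- size of the batch dimension
print(d.device())     # the single device shared by all tensors
print(d["x"].size())  # torch.Size([3, 5]) -- access one tensor by its key

# d = d.to("cuda:0")  # would move the whole dictionary to another device

a = DictTensor({})    # an empty DictTensor
assert a.empty() and a.device() is None
```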

TemporalDictTensor

A TemporalDictTensor is a dictionary of tensors representing a batch of sequences:

  • the first dimension is the batch dimension, accessible through td.n_elems()
  • the second dimension of the tensors is the temporal dimension

Since a TemporalDictTensor represents n_elems() sequences, these sequences are allowed to have different lengths.

  • td.lengths is a tensor of size n_elems() that contains the length of each of the n_elems() sequences
  • td.mask() returns a binary masking matrix telling which elements belong to a sequence and which do not. Note that td.lengths and td.mask() carry the same information about the length of each sequence, but in a different format

Again, elementary tensors can be accessed by their respective keys using td[key], and a TemporalDictTensor can be moved to a different device through td.to(…).

Figure 2 provides a simple example of how a TemporalDictTensor can be used.

Figure 2: Using TemporalDictTensor
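Since Figure 2 is an image, here is a hedged sketch in the same spirit; it assumes that a TemporalDictTensor is built from a dictionary of (batch, time, …) tensors plus a lengths tensor, which should be checked against the tutorial files.

```python
import torch
from rlstructures import TemporalDictTensor  # assumed import path

# Two sequences of maximum length 4 over a single variable "x".
x = torch.arange(8, dtype=torch.float32).reshape(2, 4)
lengths = torch.tensor([4, 2])  # the second sequence only has 2 valid timesteps

td = TemporalDictTensor({"x": x}, lengths)  # assumed constructor: (dict of tensors, lengths)

print(td.n_elems())    # 2 -- number of sequences (batch dimension)
print(td.lengths)      # tensor([4, 2])
print(td.mask())       # a (2, 4) binary matrix: 1 where a timestep belongs to the sequence
print(td["x"].size())  # torch.Size([2, 4]) -- (batch, time)

# td = td.to("cuda:0")  # a TemporalDictTensor can also be moved to another device
```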

Trajectories

As mentioned above, a Trajectories object is simply a pair made of a DictTensor (trajectories.info) and a TemporalDictTensor (trajectories.trajectories); its content is detailed in the Batcher section below.

Agent and Environment

Environments using openAI gym

Figure 3: Creating a rlstructures environment composed of 4 openAI gym environments

Note that the wrapper works with environments that return a list/numpy.array or a dictionary of lists/numpy.arrays as observations. To implement more complex environments, one has to override the rlstructures.VecEnv class.

The resulting class is a container of env.n_envs() instances (4 instances in our example) of the openAI gym environment executed simultaneously (i.e. a batch of environments). Note that, when creating k environments, they are initialized with the seed values seed+0, seed+1, seed+2, …, seed+(k-1).

When wrapping an openAI gym environment into a rlstructures environment, the wrapper will also produce additional observations such as the last action taken, the initial_state flag, etc. (see the tutorial/tutorial_environments.py and tutorial/playing_with_envs.py files for more details about the output produced by a rlstructures.VecEnv object).
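As an illustration of the setup shown in Figure 3, here is a hedged sketch; it assumes the wrapper is rlstructures.env_wrappers.GymEnv and that it takes a list of gym environments plus a base seed (the exact call is given in tutorial/tutorial_environments.py).

```python
import gym
from rlstructures.env_wrappers import GymEnv  # assumed import path

def create_env(seed=0, n_envs=4):
    # Build n_envs gym environments and wrap them into a single rlstructures
    # environment, i.e. a batch of environments executed simultaneously.
    envs = [gym.make("CartPole-v0") for _ in range(n_envs)]
    return GymEnv(envs, seed=seed)  # assumed signature: list of gym envs + base seed

env = create_env(seed=42)
print(env.n_envs())  # 4 -- the environments are seeded with 42, 43, 44 and 45
```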

Agent

Figure 4: Agent inputs and outputs

Conceptually, an Agent receives a batch of agent states and a batch of observations, and returns a batch of actions and a batch of (next) agent states (see Figure 4). In addition, as an input, the Agent may receive an agent_info, which is typically user-chosen information that parameterizes the behaviour of the agent (e.g. stochastic or deterministic mode, value of epsilon for epsilon-greedy policies, etc.). The observation, state and agent_info have the same size, i.e. observation.n_elems()==state.n_elems()==agent_info.n_elems().

Moreover, the agent may have access to its whole history (e.g. to implement transformer-based policies where maintaining an agent state is not enough), but this functionality is not presented here; it is deactivated when Agent.require_history() returns False, which is the default. When using a gym wrapper, the Agent has to produce an agent_do['action'] output, but it can also produce any other relevant information such as action probabilities, the timestep, etc. (e.g. for debugging or for facilitating loss computation).

Example 1: A simple uniform Agent

Let us implement a simple agent sampling random actions in a discrete set of actions.

Figure 4: A simple uniform agent that uses the episode timestep as an internal state.

Note: the masked_dicttensor function is a helper that mixes two DictTensors using a mask value for each of their n_elems() elements. In the illustrated case, it resets the agent state of an element to its initial value whenever the agent is facing an initial state, i.e. the corresponding environment has just restarted.
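Since Figure 4 is an image, here is a hedged re-creation of the same idea: a uniform agent whose internal state is the episode timestep. The RL_Agent base class, the __call__ signature (state, observation, agent_info, history) and the (agent_do, new_state) return value are assumed from the descriptions above; for clarity, the sketch resets the timestep manually instead of using masked_dicttensor, and the exact conventions should be checked against the tutorial files.

```python
import torch
from rlstructures import DictTensor, RL_Agent  # assumed import paths

class UniformAgent(RL_Agent):
    """Samples uniform random actions; its state is the current episode timestep."""

    def __init__(self, n_actions):
        super().__init__()
        self.n_actions = n_actions

    def initial_state(self, agent_info, B):
        # Assumed signature: one initial state per element of the batch.
        return DictTensor({"timestep": torch.zeros(B, dtype=torch.int64)})

    def __call__(self, state, observation, agent_info=None, history=None):
        B = observation.n_elems()

        # Reset the timestep of environments that have just restarted
        # (Figure 4 does this with the masked_dicttensor helper).
        restart = observation["initial_state"].bool()
        timestep = torch.where(restart, torch.zeros(B, dtype=torch.int64), state["timestep"])

        action = torch.randint(low=0, high=self.n_actions, size=(B,))

        agent_do = DictTensor({"action": action})           # sent to the environment
        new_state = DictTensor({"timestep": timestep + 1})  # passed to the next call
        return agent_do, new_state
```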

What is agent_info? Implementing multiple policies in one Agent

As you can see, the agent also receives an agent_info value as input. This input can be used to implement different behaviours within the same agent (e.g. epsilon-greedy with different values of epsilon, deterministic and stochastic agents in one class, etc.), and thus to simulate multiple different policies acting simultaneously on a batch of environments. As an example, we modify the previous agent so that it now implements two distinct policies depending on the agent_info["which_policy"] value: a first policy that returns a random action, and a second policy that always returns action 0.
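The corresponding code figure is an image too, so here is a hedged sketch of the modified agent; the "which_policy" encoding (0 for the random policy, 1 for the constant policy) and the signatures are assumptions consistent with the previous sketch.

```python
import torch
from rlstructures import DictTensor, RL_Agent  # assumed import paths

class TwoPoliciesAgent(RL_Agent):
    """agent_info["which_policy"] selects, for each environment of the batch,
    either the uniform policy (value 0) or the 'always action 0' policy (value 1)."""

    def __init__(self, n_actions):
        super().__init__()
        self.n_actions = n_actions

    def initial_state(self, agent_info, B):
        return DictTensor({"timestep": torch.zeros(B, dtype=torch.int64)})

    def __call__(self, state, observation, agent_info=None, history=None):
        B = observation.n_elems()

        random_action = torch.randint(low=0, high=self.n_actions, size=(B,))
        constant_action = torch.zeros(B, dtype=torch.int64)

        which = agent_info["which_policy"].bool()  # one flag per environment
        action = torch.where(which, constant_action, random_action)

        # (The timestep-reset logic of the previous sketch is omitted for brevity.)
        agent_do = DictTensor({"action": action})
        new_state = DictTensor({"timestep": state["timestep"] + 1})
        return agent_do, new_state
```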

This ability to have multiple policies in one agent is a key characteristic of rlstructures that is very useful when implementing complex policies and models, for instance when learning multiple policies at the same time. We will present more complex uses of this characteristic in future tutorials; basically, rlstructures allows one to run multiple policies (parameterized by agent_info) over multiple environments (parameterized by env_info) in batch mode, which is a powerful functionality.

What is the ‘’history’’ argument in the Agent function?

The __call__ function also takes as input a history argument that may contain the whole history of the agent (i.e. all previous observations and actions in the trajectory) and is useful for implementing transformer-like policies. By default, this argument's value is None; we will come back to it in future tutorials.

Batchers

To keep the library as simple as possible, we provide a single Batcher class that works as follows (a minimal usage sketch is given after the list):

  1. When Batcher.reset is called, a batch of agents and environments is reset (i.e. agent states are initialized through Agent.initial_state and environments are initialized through Env.reset).
  2. When Batcher.execute is called, the acquisition of the next T timesteps is launched.
  3. When Batcher.get is called, the batcher returns the acquired Trajectories, together with a value telling us how many environments are still running (since some environments may have stopped during the T-timestep acquisition). Note that Batcher.get can be used in blocking or non-blocking mode, depending on whether you want to wait until the acquisition is finished or do some computation in parallel while the acquisition process is running.
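A minimal sketch of this cycle, assuming a batcher, agent_info and env_info built as in the next subsection (the exact argument names of reset are an assumption):

```python
# Step 1: reset the agents and the environments.
batcher.reset(agent_info=agent_info, env_info=env_info)

# Step 2: launch the acquisition of the next T timesteps.
batcher.execute()

# Step 3: retrieve the acquired Trajectories and the number of environments still running.
trajectories, n_running = batcher.get(blocking=True)

# Environments whose episode ended during the T timesteps stop; one can keep
# calling execute/get until no environment is running anymore.
while n_running > 0:
    batcher.execute()
    trajectories, n_running = batcher.get(blocking=True)
```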

Building a Batcher

To build a batcher, one has to provide (a construction sketch is given after Figure 5):

  • the number of processes
  • the number of timesteps T that will be acquired at each call
  • the function and arguments that create the rlstructures.VecEnv (note that one such env will be created per batcher process). The environment creation function must have a seed argument to configure the seed of the environments.
  • the seeds that will be transmitted to each rlstructures.VecEnv in each process
  • the function and arguments that create the rlstructures.RL_Agent (note that one such agent will be created per batcher process)
  • two examples of the agent_info and env_info structures that the agent and the env will receive (note that we have not yet described the env_info argument, which is always an empty DictTensor when using openAI gym environments). These values are used to configure internal data structures and must be either empty or such that n_elems()==1. When using gym environments, we always have env_info=DictTensor({}).

Figure 5: Create a batcher with 4 processes, each process executing 4 single environments.
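Since Figure 5 is an image, here is a hedged reconstruction of the corresponding construction; the keyword names follow the list above but are hypothetical, and the exact arguments are given in tutorial/playing_with_rlstructures.py.

```python
from rlstructures import DictTensor
from rlstructures.batchers import Batcher  # assumed import path

# Example agent_info / env_info structures (empty or with n_elems()==1),
# only used to size the batcher's internal data structures.
agent_info = DictTensor({})
env_info = DictTensor({})   # always empty with openAI gym environments

batcher = Batcher(                        # hypothetical keyword names
    n_processes=4,                        # 4 worker processes
    n_timesteps=100,                      # T timesteps acquired at each execute() call
    seeds=[42, 43, 44, 45],               # one seed per process, passed to create_env
    create_env=create_env,                # the function defined above; must accept a seed
    env_args={"n_envs": 4},               # each rlstructures.VecEnv holds 4 gym environments
    create_agent=lambda **args: UniformAgent(**args),  # one RL_Agent per process
    agent_args={"n_actions": 2},
    agent_info=agent_info,
    env_info=env_info,
)
# The first execute/get call will therefore return 4*4 = 16 trajectories.
```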

Important: the number of trajectories returned by a batcher (at the first execute call) is always n_processes*n_env which is 4*4=16 in the example provided in the figure.

Using Batchers

Blocking mode: the Batcher.get method can be executed with blocking=True or blocking=False. When blocking=True, the call waits until the T timesteps have been acquired. When blocking=False, Batcher.get immediately returns a pair of values, which is None, None if the T timesteps have not yet been completed by the batcher. This allows one, for instance, to acquire trajectories while doing expensive computations without blocking the learning process; in our implementation, it is typically used to do evaluation while learning (see the next tutorials).
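A hedged sketch of the non-blocking pattern described above:

```python
batcher.execute()  # launch the acquisition of the next T timesteps

trajectories, n_running = batcher.get(blocking=False)
while trajectories is None:
    # (None, None) means the acquisition is not finished yet: keep computing
    # (e.g. gradient steps, evaluation) and poll the batcher again later.
    trajectories, n_running = batcher.get(blocking=False)
```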

Batcher Trajectories

In trajectories.info, one can access:

  1. The agent_info and env_info values used to sample the trajectories — through info.truncate_key(“agent_info/”) or info.truncate_key(“env_info/”)
  2. The agent state at the moment the acquisition started — through info.truncate_key(“agent_state/”)

In the trajectories.trajectories variable, one can access:

  1. The observation received at each timestep t (key is “observation/…”)
  2. The action taken by the agent (key is “action/…”)
  3. The observation received once the action has been executed (key is “_observation/…”)

Importantly (see the GymEnvInf wrapper for instance), the value of trajectories.trajectories[“_observation/…”] at time t can be different from the value of trajectories.trajectories[“observation/…”] at time t+1. Indeed, when an episode stops, trajectories.trajectories[“_observation/…”] contains the last observation of the episode, while trajectories.trajectories[“observation/…”] at time t+1 contains the first observation of the next episode (and contains no observation for environments that have stopped previously). The trajectories.trajectories variable thus represents a sequence of transitions (s,a,s’) where s is trajectories.trajectories[“observation/…”], a is trajectories.trajectories[“action/…”] and s’ is trajectories.trajectories[“_observation/…”].

Also note that the action an agent produces is a DictTensor and can contain many different values (the action to send to the environment, but also action probabilities, baseline values, debugging information, etc.). The same applies to the agent state, which can contain any useful information as a DictTensor. We will show in a future tutorial how this can be used, for instance, to easily implement recurrent and/or hierarchical policies.
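To make this layout concrete, here is a hedged sketch of reading a Trajectories object back; the observation sub-key ("frame") and the action sub-key ("action") are hypothetical and depend on the wrapped environment and on what the agent writes into agent_do.

```python
# trajectories.info: a DictTensor with one entry per acquired trajectory.
agent_info = trajectories.info.truncate_key("agent_info/")    # agent_info used at sampling time
start_state = trajectories.info.truncate_key("agent_state/")  # agent state when acquisition started

# trajectories.trajectories: a TemporalDictTensor of shape (n_trajectories, T, ...).
tt = trajectories.trajectories
print(tt.n_elems(), tt.lengths)    # number of trajectories and their individual lengths

# A transition at time t is (s, a, s'):
s = tt["observation/frame"]        # "frame" is a hypothetical observation sub-key
a = tt["action/action"]            # the action actually sent to the environment
s_next = tt["_observation/frame"]  # the observation obtained after executing the action
```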

Conclusion

Note that rlstructures also contains a tutorial/playing_with_rlstructures.py file to allow you to explore these different aspects.

Don’t hesitate to come back to us with any questions on the Facebook group https://www.facebook.com/groups/834804787067021
