6.6. Training and Tuning

In RL, the agent and the environment interact over a sequence of time steps. At each time step, the agent receives an observation of the current state of the environment and selects an action. The environment then transitions to a new state and returns a reward signal to the agent. This process continues until some terminal state is reached.

The agent uses the observed state-action-reward sequences to update its policy, either through model-based methods that estimate the underlying dynamics of the environment or through model-free methods that directly estimate the value function or the policy. The policy is used to select actions in subsequent interactions with the environment, allowing the agent to learn from its mistakes and improve over time.
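
As a primer, here is a minimal, framework-agnostic sketch of this interaction loop; the toy environment, the random agent, and the reset/step interface are illustrative assumptions and are not part of the MLPro-RL API:

    # Framework-agnostic illustration of the agent-environment loop; not MLPro-RL code.
    import random

    class ToyEnv:
        """Toy 1-D environment: the agent starts at position 0 and tries to reach +5."""

        def reset(self):
            self.pos = 0
            return self.pos

        def step(self, action):                      # action is -1 or +1
            self.pos += action
            reward = 1.0 if self.pos == 5 else -0.1
            done   = (self.pos == 5)                 # terminal (success) state
            return self.pos, reward, done

    class RandomAgent:
        """Toy model-free agent: selects actions at random; adaptation is a stub."""

        def compute_action(self, state):
            return random.choice([-1, +1])

        def adapt(self, state, action, reward, next_state):
            # A real agent would update its value estimates or policy here
            # (model-free), or refine a learned dynamics model (model-based).
            pass

    def run_episode(env, agent, max_cycles=100):
        """Observe, act, receive reward, adapt; repeated until a terminal state."""
        state = env.reset()
        for cycle in range(max_cycles):
            action = agent.compute_action(state)
            next_state, reward, done = env.step(action)
            agent.adapt(state, action, reward, next_state)
            state = next_state
            if done:
                break
        return cycle + 1

    print('episode length:', run_episode(ToyEnv(), RandomAgent()))

MLPro-RL encapsulates this loop in its scenario and training classes, which are described in the remainder of this section.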

In MLPro-RL, the class RLTraining inherits its functionality from the class Training of the basic function level and is used for training and hyperparameter tuning of RL agents. It implements episodic training algorithms and stores the extended training data and results, as well as the trained agents, in the file system. Training always starts from a defined random initial state of the environment and evaluates at each time step whether one of the following three events has occurred:

  1. Event Success: the defined target state has been reached and the current episode ends.

  2. Event Broken: the defined target state can no longer be reached and the current episode ends.

  3. Event Timeout: the maximum number of training cycles for the episode has been reached and the current episode ends.

If none of these events occurs, the episode continues. The goal of the training is to maximize the score of the repeated evaluations. A stagnation detection functionality can be incorporated to avoid long training times without further improvement: once stagnation is detected, the training can be ended. For more information, see Section 4.3 of the MLPro 1.0 paper.
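
The sketch below illustrates the per-cycle event evaluation and a simple stagnation check; the function names, thresholds, and score handling are assumptions for illustration only and do not reflect the actual MLPro-RL API:

    # Illustrative sketch only; not MLPro-RL code.
    EVENT_SUCCESS = 'SUCCESS'   # defined target state reached
    EVENT_BROKEN  = 'BROKEN'    # target state no longer reachable
    EVENT_TIMEOUT = 'TIMEOUT'   # maximum cycles per episode reached

    def evaluate_cycle(state_is_success, state_is_broken, cycle, cycle_limit):
        """Return the event that ends the current episode, or None to continue."""
        if state_is_success:
            return EVENT_SUCCESS
        if state_is_broken:
            return EVENT_BROKEN
        if cycle >= cycle_limit:
            return EVENT_TIMEOUT
        return None

    def stagnation_detected(scores, horizon=20, min_improvement=1e-3):
        """Flag stagnation when the best evaluation score stops improving."""
        if len(scores) <= horizon:
            return False
        return max(scores[-horizon:]) - max(scores[:-horizon]) < min_improvement

    # Example: the episode ends by timeout after 100 cycles without success or damage,
    # and a flat score history is flagged as stagnating.
    print(evaluate_cycle(False, False, 100, 100))   # -> 'TIMEOUT'
    print(stagnation_detected([0.5] * 30))          # -> True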

In MLPro-RL, we simplify the process of setting up and training an RL scenario for both single-agent and multi-agent RL, as shown below:

  • Single-Agent Scenario Creation

    from mlpro.rl.models import *
    
    class MyScenario(Scenario):
    
        C_NAME      = 'MyScenario'
    
        def _setup(self, p_mode, p_ada:bool, p_logging:bool):
            """
            Here's the place to explicitly set up the entire RL scenario. Please bind your env to
            self._env and your agent to self._agent.
    
            Parameters:
                p_mode              Operation mode of environment (see Environment.C_MODE_*)
                p_ada               Boolean switch for adaptivity of agent
                p_logging           Boolean switch for logging functionality
           """
    
           # Setup environment
           self._env    = MyEnvironment(....)
    
           # Setup an agent with selected policy
           self._agent = Agent(
               p_policy=MyPolicy(
                p_state_space=self._env.get_state_space(),
                p_action_space=self._env.get_action_space(),
                ....
                ),
                ....
            )
    
    # Instantiate scenario
    myscenario  = MyScenario(....)
    
    # Train agent in scenario
    training    = RLTraining(p_scenario=myscenario, ....)
    training.run()
    
  • Multi-Agent Scenario Creation

    from mlpro.rl.models import *
    
    class MyScenario(Scenario):
    
        C_NAME      = 'MyScenario'
    
        def _setup(self, p_mode, p_ada:bool, p_logging:bool):
            """
            Here's the place to explicitly set up the entire RL scenario. Please bind your env to
            self._env and your agent to self._agent.
    
            Parameters:
                p_mode              Operation mode of environment (see Environment.C_MODE_*)
                p_ada               Boolean switch for adaptivity of agent
                p_logging           Boolean switch for logging functionality
           """
    
           # Setup environment
           self._env    = MyEnvironment(....)
    
           # Create an empty mult-agent
           self._agent     = MultiAgent(....)
    
           # Add Single-Agent #1 with own policy (controlling sub-environment #1)
           self._agent.add_agent = Agent(
               self._agent = Agent(
                   p_policy=MyPolicy(
                    p_state_space=self._env.get_state_space().spawn[....],
                    p_action_space=self._env.get_action_space().spawn[....],
                    ....
                    ),
                    ....
                ),
                ....
            )
    
           # Add Single-Agent #2 with own policy (controlling sub-environment #2)
           self._agent.add_agent = Agent(....)
    
           ....
    
    # Instantiate scenario
    myscenario  = MyScenario(....)
    
    # Train agent in scenario
    training    = RLTraining(p_scenario=myscenario, ....)
    training.run()
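
As noted above, RLTraining is also used for hyperparameter tuning. Conceptually, a tuner repeats training runs with different hyperparameter values and keeps the best-scoring configuration; the naive grid-search sketch below illustrates this idea with a hypothetical run_training callable and parameter names that are not part of the MLPro-RL API:

    # Conceptual illustration of hyperparameter tuning; not MLPro-RL code.
    def tune(run_training, param_grid):
        """Naive grid search: return the best-scoring parameter set and its score."""
        best_params, best_score = None, float('-inf')
        for params in param_grid:
            score = run_training(params)        # one full training run per candidate
            if score > best_score:
                best_params, best_score = params, score
        return best_params, best_score

    # Example with a dummy objective standing in for a real training run
    grid = [{'learning_rate': lr} for lr in (1e-3, 1e-2, 1e-1)]
    print(tune(lambda params: -abs(params['learning_rate'] - 1e-2), grid))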
    

Cross Reference