Howto RL-011: Train a wrapped Stable Baselines 3 policy on MLPro’s native UR5 environment (Paper)

Prerequisites

Please install the following packages to run this example properly:

- MLPro
- Stable Baselines 3

In addition, the UR5 environment is based on ROS and the Gazebo simulator, which must be available on the system.

Executable code

## -------------------------------------------------------------------------------------------------
## -- Project : MLPro - A Synoptic Framework for Standardized Machine Learning Tasks
## -- Package : mlpro
## -- Module  : howto_rl_011_train_ur5_environment_with_wrapped_sb3_policy.py
## -------------------------------------------------------------------------------------------------
## -- History :
## -- yyyy-mm-dd  Ver.      Auth.    Description
## -- 2021-11-18  0.0.0     MRD      Creation
## -- 2021-11-18  1.0.0     MRD      Initial Release
## -- 2021-12-07  1.0.1     DA       Refactoring
## -- 2022-02-11  1.1.0     DA       Special derivate for publication
## -- 2022-05-23  1.2.0     MRD      Add visualize toggle on UR5JointControl for gazebo GUI
## -- 2022-06-06  1.2.1     MRD      Add real connection option
## -- 2022-06-13  1.2.2     MRD      Update possibility to run separate simulator and training
## --                                setting separate ROS Server IP
## -------------------------------------------------------------------------------------------------

"""
Ver. 1.2.2 (2022-06-13)

This module shows how to use the SB3 wrapper to train the UR5 robot (special derivate for publication).
"""

from mlpro.rl.models import *
from mlpro.rl.pool.envs.ur5jointcontrol import UR5JointControl
from stable_baselines3 import PPO
from mlpro.wrappers.sb3 import WrPolicySB32MLPro
from pathlib import Path


# 1 Implement your own RL scenario
class ScenarioUR5PPO(RLScenario):
    C_NAME = 'Matrix'

    def _setup(self, p_mode, p_ada, p_logging):
        # 1.1 Setup environment
        self._env = UR5JointControl(
            p_build=True, 
            p_real=p_mode,
            p_start_simulator=True,
            p_start_ur_driver=True,
            # p_ros_server_ip="172.19.10.199",
            p_net_interface="enp0s31f6",
            p_robot_ip="172.19.10.41",
            # p_reverse_ip="172.19.10.140", 
            p_visualize=self._visualize, 
            p_logging=p_logging)

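        # 1.2 Setup the SB3 policy (PPO)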
        policy_sb3 = PPO(
            policy="MlpPolicy",
            n_steps=20,
            env=None,
            _init_setup_model=False,
            device="cpu",
            seed=1)

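        # 1.3 Wrap the SB3 policy for use in MLPro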
        policy_wrapped = WrPolicySB32MLPro(
            p_sb3_policy=policy_sb3,
            p_cycle_limit=self._cycle_limit,
            p_observation_space=self._env.get_state_space(),
            p_action_space=self._env.get_action_space(),
            p_ada=p_ada,
            p_logging=p_logging)

        # 1.4 Setup standard single-agent with the wrapped policy
        return Agent(
            p_policy=policy_wrapped,
            p_envmodel=None,
            p_name='Smith',
            p_ada=p_ada,
            p_logging=p_logging
        )


# 2 Train agent in scenario
training = RLTraining(
    p_scenario_cls=ScenarioUR5PPO,
    p_env_mode=Mode.C_MODE_SIM,
    p_cycle_limit=5000,
    p_cycles_per_epi_limit=-1,
    p_collect_states=True,
    p_collect_actions=True,
    p_collect_rewards=True,
    p_collect_training=True,
    p_visualize=True,
    p_path=str(Path.home()),
    p_logging=Log.C_LOG_WE)

training.run()
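
The same scenario can also drive the physical robot instead of the simulator (see the real-connection parameters in the environment setup above and the history entry of 2022-06-06). The following is a minimal sketch of that variant, not part of the original howto: it assumes that passing Mode.C_MODE_REAL as p_env_mode is sufficient to switch the environment to the real robot, and that the connection parameters (robot IP, network interface) in ScenarioUR5PPO match the actual setup.

# Hedged sketch: train against the real robot instead of the simulator.
# Assumption: Mode.C_MODE_REAL is forwarded to the environment and activates
# the real-connection options configured in ScenarioUR5PPO above.
training_real = RLTraining(
    p_scenario_cls=ScenarioUR5PPO,
    p_env_mode=Mode.C_MODE_REAL,
    p_cycle_limit=5000,
    p_cycles_per_epi_limit=-1,
    p_collect_states=True,
    p_collect_actions=True,
    p_collect_rewards=True,
    p_collect_training=True,
    p_visualize=False,
    p_path=str(Path.home()),
    p_logging=Log.C_LOG_WE)

training_real.run()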

Results

[Animation: the UR5 robot during training in the Gazebo simulator (../../../_images/ur5simulation.gif)]

The Gazebo GUI should be the first thing to appear. While the training runs, the UR5 robot moves according to the actions provided by the agent. When the training is done, the logged rewards are plotted using the matplotlib library.
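
If the logged rewards should be inspected again after the run, they can be re-plotted from the files that RLTraining writes below p_path. The snippet below is a minimal sketch under stated assumptions, not the howto's own plotting code: the folder name, the file name env_rewards.csv, its column layout, and the use of pandas are all assumptions, so check the training folder created under your home directory for the actual names.

# Hedged sketch: re-plot the logged rewards from the training results folder.
# Folder name, file name, and column layout are assumptions; adapt them to the
# files actually written by RLTraining under p_path.
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

training_path = Path.home() / 'training'                  # hypothetical results folder
rewards = pd.read_csv(training_path / 'env_rewards.csv')  # file name is an assumption

plt.plot(rewards.index, rewards.iloc[:, -1])              # assumes the last column holds the reward
plt.xlabel('Cycle')
plt.ylabel('Reward')
plt.title('Logged rewards of the UR5 training')
plt.show()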

The plotted figure is not exactly reproducible, because the simulator mimics a real-world scenario: although seeds can be set for the random generators, the sampling cannot happen at exactly the same time in different runs. For more reproducible results, Howto RL-012 is more appropriate.
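
Seeding can still narrow the run-to-run variation even if exact reproducibility is out of reach. The snippet below is a sketch of global seeding that could be placed at the top of the script, in addition to the seed=1 already passed to PPO above; note that the simulator's timing remains a source of nondeterminism regardless.

# Hedged sketch: seed the common random sources before the training starts.
# This reduces run-to-run variation but cannot remove the nondeterminism that
# stems from the simulator's timing.
import random

import numpy as np
import torch

SEED = 1
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)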