Howto RL-015: Train a wrapped Stable-Baselines3 policy with stagnation detection
Prerequisites
- Please install the following packages to run this example properly:
  - MLPro
  - OpenAI Gym
  - Stable-Baselines3
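These packages can be installed, for example, via pip (version pins omitted; note that the code below was written against Gym 0.25):

pip install mlpro gym stable-baselines3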
Executable code
## -------------------------------------------------------------------------------------------------
## -- Project : MLPro - A Synoptic Framework for Standardized Machine Learning Tasks
## -- Package : mlpro
## -- Module  : howto_rl_015_train_wrapped_sb3_policy_with_stagnation_detection.py
## -------------------------------------------------------------------------------------------------
## -- History :
## -- yyyy-mm-dd  Ver.   Auth.  Description
## -- 2022-01-20  0.0.0  MRD    Creation
## -- 2022-01-20  1.0.0  MRD    Released first version
## -- 2022-05-17  1.0.1  DA     Just a little comment maintenance
## -- 2022-07-20  1.0.2  SY     Update due to the latest introduction of Gym 0.25
## -------------------------------------------------------------------------------------------------

"""
Ver. 1.0.2 (2022-07-20)

This module shows how to train an agent with a policy wrapped from Stable-Baselines3 and how to
stop the training automatically by stagnation detection.
"""
import gym
from stable_baselines3 import PPO
from mlpro.rl.models import *
from mlpro.wrappers.openai_gym import WrEnvGYM2MLPro
from mlpro.wrappers.sb3 import WrPolicySB32MLPro
from pathlib import Path

# 1 Implement your own RL scenario
class MyScenario (RLScenario):

    C_NAME = 'Matrix'

    def _setup(self, p_mode, p_ada, p_logging):
        # 1 Setup environment
        gym_env = gym.make('CartPole-v1', new_step_api=True, render_mode=None)
        self._env = WrEnvGYM2MLPro(gym_env, p_logging=p_logging)

        # 2 Instantiate PPO policy from SB3
        # env=None and _init_setup_model=False defer the model setup, which is
        # completed by the MLPro wrapper below
        policy_sb3 = PPO(
            policy="MlpPolicy",
            n_steps=5,
            env=None,
            _init_setup_model=False,
            device="cpu",
            seed=1)

        # 3 Wrap the policy
        policy_wrapped = WrPolicySB32MLPro(
            p_sb3_policy=policy_sb3,
            p_cycle_limit=self._cycle_limit,
            p_observation_space=self._env.get_state_space(),
            p_action_space=self._env.get_action_space(),
            p_ada=p_ada,
            p_logging=p_logging)

        # 4 Setup standard single-agent with own policy
        return Agent(
            p_policy=policy_wrapped,
            p_envmodel=None,
            p_name='Smith',
            p_ada=p_ada,
            p_logging=p_logging)

# 2 Create scenario and start training

if __name__ == "__main__":
    # 2.1 Parameters for demo mode
    cycle_limit      = 5000
    adaptation_limit = 50
    stagnation_limit = 5
    eval_frequency   = 5
    eval_grp_size    = 5
    logging          = Log.C_LOG_WE
    visualize        = True
    path             = str(Path.home())

else:
    # 2.2 Parameters for internal unit test
    cycle_limit      = 50
    adaptation_limit = 5
    stagnation_limit = 5
    eval_frequency   = 2
    eval_grp_size    = 1
    logging          = Log.C_LOG_NOTHING
    visualize        = False
    path             = None


# 2.3 Create and run training object
training = RLTraining(
    p_scenario_cls=MyScenario,
    p_cycle_limit=cycle_limit,
    p_adaptation_limit=adaptation_limit,
    p_stagnation_limit=stagnation_limit,
    p_eval_frequency=eval_frequency,
    p_eval_grp_size=eval_grp_size,
    p_path=path,
    p_visualize=visualize,
    p_logging=logging)

training.run()
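Beyond the console log, the outcome of a run can also be inspected programmatically. A minimal sketch, assuming the get_results() accessor of MLPro's Training API and the highscore attribute of its results class (both should be verified against the API of your MLPro version):

# Optional post-processing (sketch): access the result object of the finished run
results = training.get_results()     # accessor assumed from MLPro's Training API
print(results.highscore)             # attribute name assumed; see the training results class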
Results
After the environment has been initialised, the training runs until one of the configured limits (cycle limit, adaptation limit, or stagnation limit) is reached. Every eval_frequency training episodes the agent is evaluated over eval_grp_size episodes; as soon as the evaluation score no longer improves for stagnation_limit consecutive evaluations, stagnation is detected and the training stops early.
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: ------------------------------------------------------------------------------
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- Training Results of run 0
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: ------------------------------------------------------------------------------
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: ------------------------------------------------------------------------------
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- Scenario : RL-Scenario Matrix
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- Model : Agent Smith
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- Start time stamp : YYYY-MM-DD HH:MM:SS.SSSSSS
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- End time stamp : YYYY-MM-DD HH:MM:SS.SSSSSS
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- Duration : HH:MM:SS.SSSSSS
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- Start cycle id : 0
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- End cycle id :
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- Training cycles :
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- Evaluation cycles :
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- Adaptations :
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- High score :
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- Results stored in : "C:\Users\%username%\YYYY-MM-DD HH:MM:SS Training RL"
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- Training Episodes : 120
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: -- Evaluations : 25
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: ------------------------------------------------------------------------------
YYYY-MM-DD HH:MM:SS.SSSSSS W Results RL: ------------------------------------------------------------------------------
- After termination, the local result folder contains the training result files:
agent_actions.csv
env_rewards.csv
env_states.csv
evaluation.csv
summary.csv
trained model.pkl
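The CSV files can be analysed with standard tools. A short sketch, assuming pandas is installed; the folder name is a placeholder to be replaced by the path logged under "Results stored in", and the delimiter may differ depending on how your MLPro version writes the files:

import pandas as pd
from pathlib import Path

# Hypothetical example path: replace with the folder logged under "Results stored in"
result_path = Path.home() / "YYYY-MM-DD HH:MM:SS Training RL"

# evaluation.csv contains one row per evaluation; adjust sep if your files use another delimiter
evaluation = pd.read_csv(result_path / "evaluation.csv", sep="\t")
print(evaluation.head())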