RL-AGENTS - Agents

Figure: MLPro-RL-Agents class diagram (MLPro-RL-Agents_class_diagram.drawio.png)

Ver. 1.7.0 (2023-03-27)

This module provides model classes for policies, model-free and model-based agents and multi-agents.

class mlpro.rl.models_agents.Policy(p_observation_space: MSpace, p_action_space: MSpace, p_id=None, p_buffer_size: int = 1, p_ada: bool = True, p_visualize: bool = False, p_logging=True)

Bases: Model

This class represents the policy of a single agent. It is adaptive and can be trained with State-Action-Reward (SAR) data, which is expected as a SAR buffer object. All three main machine learning paradigms for training a policy are supported:

a) Training by supervised learning: the entire SAR data set inside the SAR buffer shall be adapted.

b) Training by reinforcement learning: the latest SAR data record inside the SAR buffer shall be adapted.

c) Training by unsupervised learning: all state data inside the SAR buffer shall be adapted.

Furthermore, a policy class can compute actions from states.

Hyperparameters of the policy should be stored in the internal object self._hp_list, so that they can be tuned from outside. Optionally a policy-specific callback method can be called on changes. For more information see class HyperParameterList.

Parameters:
  • p_observation_space (MSpace) – Subspace of an environment that is observed by the policy.

  • p_action_space (MSpace) – Action space object.

  • p_id – Optional external id

  • p_buffer_size (int) – Size of internal buffer. Default = 1.

  • p_ada (bool) – Boolean switch for adaptivity. Default = True.

  • p_visualize (bool) – Boolean switch for env/agent visualisation. Default = False.

  • p_logging – Log level (see constants of class Log). Default = Log.C_LOG_ALL.

C_TYPE = 'Policy'
C_NAME = '????'
C_BUFFER_CLS

alias of SARSBuffer

get_observation_space() MSpace
get_action_space() MSpace
compute_action(p_obs: State) Action

Specific action computation method to be redefined.

Parameters:

p_obs (State) – Observation data.

Returns:

action – Action object.

Return type:

Action

_adapt(p_sars_elem: SARSElement) bool

Adapts the policy based on State-Action-Reward-State (SARS) data.

Parameters:

p_sars_elem (SARSElement) – Object of type SARSElement.

Returns:

adapted – True, if something has been adapted. False otherwise.

Return type:

bool
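
A custom reinforcement learning algorithm is typically introduced by subclassing Policy and overriding compute_action() and _adapt(). The following minimal sketch, modeled on the MLPro how-tos, shows the structure only: the class MyRandomPolicy, its random action values, the use of get_id(), and the aggregate import path of State, Action and SARSElement are assumptions that may need adjustment to your MLPro version.

    import numpy as np

    from mlpro.rl.models_agents import Policy
    # Assumption: State, Action and SARSElement are re-exported by the mlpro.rl
    # package; otherwise import them from their defining sub-modules.
    from mlpro.rl import State, Action, SARSElement


    class MyRandomPolicy (Policy):
        """Illustrative policy that samples random actions and never adapts."""

        C_NAME = 'MyRandomPolicy'

        def compute_action(self, p_obs: State) -> Action:
            # One random value per dimension of the action space (placeholder logic)
            action_values = np.random.rand(self.get_action_space().get_num_dim())

            # Assumption: get_id() returns the policy id handed over via p_id
            return Action(self.get_id(), self.get_action_space(), action_values)

        def _adapt(self, p_sars_elem: SARSElement) -> bool:
            # A real algorithm would update its parameters from the SARS element here
            self.log(self.C_LOG_TYPE_I, 'Nothing to adapt in this random policy')
            return False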

class mlpro.rl.models_agents.ActionPlanner(p_state_thsld=1e-08, p_logging=True)

Bases: Log

Template class for action planning algorithms to be used as part of model-based planning agents. The goal is to find the shortest sequence of actions that leads to a maximum reward.

Parameters:
  • p_state_thsld (float) – Threshold for the metric difference below which two states are considered equal. Default = 0.00000001.

  • p_logging – Log level (see constants of class Log). Default = Log.C_LOG_ALL.

C_TYPE = 'Action Planner'
setup(p_policy: Policy, p_envmodel: EnvModel, p_prediction_horizon=0, p_control_horizon=0, p_width_limit=0)

Setup of the action planner object in a concrete planning scenario. Must be called before the first planning. The optional custom method _setup() is called at the end.

Parameters:
  • p_policy (Policy) – Policy of an agent.

  • p_envmodel (EnvModel) – Environment model.

  • p_prediction_horizon (int) – Optional static maximum planning depth (=length of action path to be predicted). Can be overridden by method compute_action(). Default=0.

  • p_control_horizon (int) – The length of predicted action path to be applied. Can be overridden by method compute_action(). Default=0.

  • p_width_limit (int) – Optional static maximum planning width (=number of alternative actions per planning level). Can be overridden by method compute_action(). Default=0.

_setup()

Optional custom setup method.

compute_action(p_obs: State, p_prediction_horizon=0, p_control_horizon=0, p_width_limit=0) Action

Computes a path of actions with defined length that maximizes the reward of the given environment model. The planning algorithm itself is to be implemented in the custom method _plan_action().

Parameters:
  • p_obs (State) – Observation data.

  • p_prediction_horizon (int) – Optional dynamic maximum planning depth (=length of action path to be predicted) that overrides the static limit of method setup(). Default=0 (no override).

  • p_control_horizon (int) – The length of predicted action path to be applied that overrides the static limit of method setup(). Default=0 (no override).

  • p_width_limit (int) – Optional dynamic maximum planning width (=number of alternative actions per planning level) that overrides the static limit of method setup(). Default=0 (no override).

Returns:

action – Best action as result of the planning process.

Return type:

Action

_plan_action(p_obs: State) SARSBuffer

Custom planning algorithm to fill the internal action path (self._action_path). Search width and depth are restricted by the attributes self._width_limit and self._prediction_horizon.

Parameters:

p_obs (State) – Observation data.

Returns:

action_path – Sequence of SARSElement objects with included actions that lead to the best possible reward.

Return type:

SARSBuffer

clear_action_path()
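
A concrete planner derives from ActionPlanner and implements _plan_action(); setup() must be called once to bind the policy and the environment model before compute_action() is used. The sketch below is a structural illustration only: the class MyGreedyPlanner, the attribute names self._policy and self._envmodel (assumed to be bound by setup()), the EnvBase services simulate_reaction()/compute_reward() used for the rollout, and the aggregate import path are assumptions to be checked against your MLPro version.

    from mlpro.rl.models_agents import ActionPlanner
    # Assumption: State, SARSBuffer and SARSElement are re-exported by mlpro.rl
    from mlpro.rl import State, SARSBuffer, SARSElement


    class MyGreedyPlanner (ActionPlanner):
        """Illustrative planner: rolls out one action path of length
        self._prediction_horizon by greedily following the wrapped policy."""

        C_NAME = 'MyGreedyPlanner'

        def _plan_action(self, p_obs: State) -> SARSBuffer:
            state = p_obs

            for _ in range(self._prediction_horizon):
                # Let the policy propose an action for the current (predicted) state
                action = self._policy.compute_action(state)

                # Predict successor state and reward with the environment model
                state_new = self._envmodel.simulate_reaction(state, action)
                reward    = self._envmodel.compute_reward(state, state_new)

                # Fill the internal action path (see method description above)
                self._action_path.add_element(SARSElement(state, action, reward, state_new))
                state = state_new

            return self._action_path
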
class mlpro.rl.models_agents.RLScenarioMBInt(p_mode=0, p_ada: bool = True, p_cycle_limit=0, p_visualize: bool = True, p_logging=True)

Bases: RLScenario

For internal use in class Agent. Intended for training the policy with the environment model of a model-based (single) agent.

C_NAME = 'MB(intern)'
_setup(**p_kwargs) Model

Custom method to set up the ML scenario. Please bind your environment to self._env and return the agent as model.

Parameters:
  • p_mode – Operation mode. See Mode.C_VALID_MODES for valid values. Default = Mode.C_MODE_SIM

  • p_ada (bool) – Boolean switch for adaptivity.

  • p_visualize (bool) – Boolean switch for env/agent visualisation. Default = True.

  • p_logging – Log level (see constants of class Log).

Returns:

agent – Agent model (object of type Agent or Multi-agent)

Return type:

Agent

setup_ext(p_env: EnvBase, p_policy: Policy, p_logging: Log)
class mlpro.rl.models_agents.Agent(p_policy: Policy, p_envmodel: EnvModel = None, p_em_acc_thsld=0.9, p_action_planner: ActionPlanner = None, p_predicting_horizon=0, p_controlling_horizon=0, p_planning_width=0, p_name='', p_ada=True, p_visualize: bool = True, p_logging=True, **p_mb_training_param)

Bases: Policy

This class represents a single agent model.

Parameters:
  • p_policy (Policy) – Policy object.

  • p_envmodel (EnvModel) – Optional environment model object. Default = None.

  • p_em_acc_thsld (float) – Optional threshold for environment model accuracy (whether the envmodel is ‘good’ enough to be used to train the policy). Default = 0.9.

  • p_action_planner (ActionPlanner) – Optional action planner object (obligatory for model based agents). Default = None.

  • p_predicting_horizon (int) – Optional predicting horizon (obligatory for model based agents). Default = 0.

  • p_controlling_horizon (int) – Optional controlling horizon (obligatory for model based agents). Default = 0.

  • p_planning_width (int) – Optional planning width (obligatory for model based agents). Default = 0.

  • p_name (str) – Optional name of agent. Default = ‘’.

  • p_ada (bool) – Boolean switch for adaptivity. Default = True.

  • p_visualize (bool) – Boolean switch for env/agent visualisation. Default = True.

  • p_logging – Log level (see constants of class mlpro.bf.various.Log). Default = Log.C_LOG_ALL.

  • p_mb_training_param (dict) – Optional parameters for internal policy training with environment model (see parameters of class RLTraining). Hyperparameter tuning and data logging is not supported here. The suitable scenario class is added internally.

C_TYPE = 'Agent'
C_NAME = ''
_init_hyperparam(**p_par)

Implementation-specific hyperparameters can be added here. Please follow these steps (see the sketch below):

a) Add each hyperparameter as an object of type HyperParam to the internal hyperparameter space object self._hyperparam_space.

b) Create a hyperparameter tuple and bind it to self._hyperparam_tuple.

c) Set a default value for each hyperparameter.

Parameters:

p_par (Dict) – Further model-specific hyperparameters that are passed through the constructor.
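
A minimal sketch of these steps, shown for the illustrative policy from the earlier sketch (the same pattern applies to an Agent subclass). The hyperparameter names, base sets and default values are purely illustrative, and the import path of HyperParam/HyperParamTuple from mlpro.bf.ml is an assumption:

    # Assumption: HyperParam and HyperParamTuple are provided by mlpro.bf.ml
    from mlpro.bf.ml import HyperParam, HyperParamTuple


    class MyPolicyWithHP (MyRandomPolicy):   # illustrative policy from the earlier sketch

        C_NAME = 'MyPolicyWithHP'

        def _init_hyperparam(self, **p_par):
            # a) Describe each hyperparameter as a dimension of the hyperparameter space
            self._hyperparam_space.add_dim(HyperParam('buffer_size', 'Z'))
            self._hyperparam_space.add_dim(HyperParam('learning_rate', 'R'))

            # b) Create the hyperparameter tuple on top of that space
            self._hyperparam_tuple = HyperParamTuple(self._hyperparam_space)

            # c) Set a default value for each hyperparameter
            hp_ids = self._hyperparam_tuple.get_dim_ids()
            self._hyperparam_tuple.set_value(hp_ids[0], p_par.get('p_buffer_size', 100))
            self._hyperparam_tuple.set_value(hp_ids[1], p_par.get('p_learning_rate', 0.01))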

switch_logging(p_logging)

Sets new log level.

Parameters:

p_logging – Log level (constant C_LOG_LEVELS contains valid values)

switch_adaptivity(p_ada: bool)

Switches adaption functionality on/off.

Parameters:

p_ada (bool) – Boolean switch for adaptivity

set_log_level(p_level)
get_observation_space() MSpace
get_action_space() MSpace
_extract_observation(p_state: State) State
set_random_seed(p_seed=None)

Resets the internal random generator using the given seed.

compute_action(p_state: State) Action

Default implementation of a single agent.

Parameters:

p_state (State) – State of the related environment.

Returns:

action – Action object.

Return type:

Action

_adapt(p_state: State, p_reward: Reward) bool

Default adaptation implementation of a single agent.

Parameters:
  • p_state (State) – State object.

  • p_reward (Reward) – Reward object.

Returns:

result – True, if something has been adapted. False otherwise.

Return type:

bool

_adapt_policy_by_model()
clear_buffer()

Clears internal buffer (if buffering is active).
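
For a model-free single agent it is sufficient to wrap a policy; the model-based options (p_envmodel, p_action_planner and the related horizons) stay at their defaults. A minimal construction sketch, assuming an already instantiated MLPro environment p_env and the illustrative policy class MyRandomPolicy from the earlier sketch (get_state_space()/get_action_space() are assumed to be the usual EnvBase services):

    from mlpro.bf.various import Log
    from mlpro.rl.models_agents import Agent


    def make_model_free_agent(p_env):
        """Builds a model-free single agent for a given MLPro environment (sketch)."""

        policy = MyRandomPolicy( p_observation_space=p_env.get_state_space(),
                                 p_action_space=p_env.get_action_space(),
                                 p_buffer_size=1,
                                 p_ada=True,
                                 p_logging=Log.C_LOG_ALL )

        return Agent( p_policy=policy,
                      p_envmodel=None,      # no environment model -> model-free agent
                      p_name='MyAgent',
                      p_ada=True,
                      p_logging=Log.C_LOG_ALL )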

class mlpro.rl.models_agents.MultiAgent(p_name: str = '', p_ada: bool = True, p_visualize: bool = False, p_logging=True)

Bases: Agent

Multi-Agent.

Parameters:
  • p_name (str) – Name of agent. Default = ‘’.

  • p_ada (bool) – Boolean switch for adaptivity. Default = True.

  • p_visualize (bool) – Boolean switch for env/agent visualisation. Default = False.

  • p_logging – Log level (see constants of class Log). Default = Log.C_LOG_ALL.

C_TYPE = 'Multi-Agent'
C_NAME = ''
switch_logging(p_logging) None

Sets new log level.

Parameters:

p_logging – Log level (constant C_LOG_LEVELS contains valid values)

switch_adaptivity(p_ada: bool)

Switches adaption functionality on/off.

Parameters:

p_ada (bool) – Boolean switch for adaptivity

set_log_level(p_level)
add_agent(p_agent: Agent, p_weight=1.0) None

Adds agent object to internal list of agents.

Parameters:
  • p_agent (Agent) – Agent object to be added.

  • p_weight (float) – Optional weight for the agent. Default = 1.0.

get_agents()
get_agent(p_agent_id)

Returns information of a single agent.

Returns:

agent_info – agent_info[0] is the agent object itself and agent_info[1] its weight.

Return type:

tuple

get_observation_space() MSpace
get_action_space() MSpace
set_random_seed(p_seed=None)

Resets the internal random generator using the given seed.

compute_action(p_state: State) Action

Default multi-agent implementation.

Parameters:

p_state (State) – State of the related environment.

Returns:

action – Action object.

Return type:

Action

_adapt(p_state: State, p_reward: Reward) bool

Default multi-agent adaptation implementation.

Parameters:
  • p_state (State) – State object.

  • p_reward (Reward) – Reward object.

Returns:

result – True, if something has been adapted. False otherwise.

Return type:

bool

clear_buffer()

Clears internal buffer (if buffering is active).

init_plot(p_figure: Figure = None, p_plot_settings: list = Ellipsis, p_plot_depth: int = 0, p_detail_level: int = 0, p_step_rate: int = 0, **p_kwargs)

Doesn’t support embedded plot of underlying agent hierarchy.

update_plot(**p_kwargs)

Updates the plot.

Parameters:

**p_kwargs – Implementation-specific plot data and/or parameters.
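
A multi-agent is composed by instantiating MultiAgent and registering single agents with optional weights via add_agent(). The sketch below assumes two previously built Agent objects (agent_1, agent_2), e.g. created with make_model_free_agent() from the earlier sketch:

    from mlpro.bf.various import Log
    from mlpro.rl.models_agents import MultiAgent

    multi_agent = MultiAgent( p_name='My multi-agent',
                              p_ada=True,
                              p_logging=Log.C_LOG_ALL )

    multi_agent.add_agent( p_agent=agent_1, p_weight=0.7 )
    multi_agent.add_agent( p_agent=agent_2, p_weight=0.3 )

    # A multi-agent exposes the same interface as a single agent, e.g.:
    # action = multi_agent.compute_action( p_state=current_state )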