Double Pendulum

[Figure: MLPro-RL Double Pendulum class diagram]

Ver. 3.0.0 (2023-05-30)

The Double Pendulum environment is an implementation of the classic control problem of a double pendulum system. The dynamics of the system are based on the double pendulum implementation by Matplotlib. The double pendulum is a system of two poles, where the inner pole is connected to a fixed point at one end and to the outer pole at the other end. The native implementation of the Double Pendulum includes an input motor providing torque in either direction to actuate the system.

class mlpro.rl.pool.envs.doublependulum.DoublePendulumRoot(p_id=None, p_name=None, p_buffer_size=0, p_range_max=0, p_autorun=0, p_class_shared=None, p_mode=0, p_latency=None, p_max_torque=20, p_l1=1.0, p_l2=1.0, p_m1=1.0, p_m2=1.0, p_g=9.8, p_init_angles='random', p_history_length=5, p_fct_strans: FctSTrans = None, p_fct_success: FctSuccess = None, p_fct_broken: FctBroken = None, p_fct_reward: FctReward = None, p_mujoco_file=None, p_frame_skip=None, p_state_mapping=None, p_action_mapping=None, p_camera_conf=None, p_visualize: bool = False, p_plot_level: int = 2, p_rst_balancing='rst_002', p_rst_swinging='rst_003', p_rst_swinging_outer_pole='rst_004', p_reward_weights: list = None, p_reward_trend: bool = False, p_reward_window: int = 0, p_random_range: list = None, p_balancing_range: list = (-0.2, 0.2), p_swinging_outer_pole_range=(0.2, 0.5), p_break_swinging: bool = False, p_logging=True)

Bases: DoublePendulumSystemRoot, Environment

This is the root Double Pendulum environment class, inherited from the Environment class, with a four-dimensional state space, the underlying implementation of the Double Pendulum dynamics, and the default reward strategy.

Parameters:
  • p_mode – Mode of environment. Possible values are Mode.C_MODE_SIM(default) or Mode.C_MODE_REAL.

  • p_latency (timedelta) – Optional latency of environment. If not provided, the internal value of constant C_LATENCY is used by default.

  • p_max_torque (float, optional) – Maximum torque applied to pendulum. The default is 20.

  • p_l1 (float, optional) – Length of pendulum 1 in m. The default is 1.0.

  • p_l2 (float, optional) – Length of pendulum 2 in m. The default is 1.0.

  • p_m1 (float, optional) – Mass of pendulum 1 in kg. The default is 1.0.

  • p_m2 (float, optional) – Mass of pendulum 2 in kg. The default is 1.0.

  • p_g (float, optional) – Gravitational acceleration. The default is 9.8

  • p_init_angles (str, optional) – C_ANGLES_UP starts the pendulum in an upright position; C_ANGLES_DOWN starts the pendulum in a downward position; C_ANGLES_RND starts the pendulum from a random position.

  • p_history_length (int, optional) – Historical trajectory points to display. The default is 5.

  • p_fct_strans (FctSTrans, optional) – Custom state transition function.

  • p_fct_success (FctSuccess, optional) – Custom success function.

  • p_fct_broken (FctBroken, optional) – Custom broken function.

  • p_fct_reward (FctReward, optional) – Custom reward function.

  • p_mujoco_file (optional) – The corresponding MuJoCo file.

  • p_frame_skip (optional) – Number of frames to be skipped for visualization.

  • p_state_mapping (optional) – State mapping configurations.

  • p_action_mapping (optional) – Action mapping configurations.

  • p_camera_conf (optional) – Camera configurations for MuJoCo-specific visualization.

  • p_visualize (bool) – Boolean switch for visualisation. Default = False.

  • p_plot_level (int) – Types and number of plots to be plotted. C_PLOT_DEPTH_ENV only plots the environment, C_PLOT_DEPTH_REWARD only plots the reward, and C_PLOT_DEPTH_ALL plots both the reward and the environment. Default = C_PLOT_DEPTH_ALL.

  • p_rst_balancing – Reward strategy to be used for the balancing region of the environment

  • p_rst_swinging – Reward strategy to be used for the swinging region of the environment

  • p_rst_swinging_outer_pole – Reward strategy to be used for swinging up the outer pole

  • p_reward_weights (list) – List of weights to be added to the dimensions of the state space for reward computation

  • p_reward_trend (bool) – Boolean value stating whether to plot reward trend

  • p_reward_window (int) – The number of latest rewards to be shown in the plot. Default is 0

  • p_random_range (list) – The boundaries of the state space for random initialization of the environment

  • p_balancing_range (list) – The boundaries of the state space of the environment in the balancing region

  • p_swinging_outer_pole_range – The boundaries of the state space of the environment in the outer pole swinging region

  • p_break_swinging (bool) – Boolean value stating whether the environment shall be broken outside the balancing region

  • p_logging – Log level (see constants of class mlpro.bf.various.Log). Default = Log.C_LOG_WE.

C_REWARD_TYPE = 0
C_PLOT_DEPTH_ENV = 0
C_PLOT_DEPTH_REWARD = 1
C_PLOT_DEPTH_ALL = 2
C_VALID_DEPTH = [0, 1, 2]
C_RST_BALANCING_001 = 'rst_001'
C_RST_BALANCING_002 = 'rst_002'
C_RST_SWINGING_001 = 'rst_003'
C_RST_SWINGING_OUTER_POLE_001 = 'rst_004'
C_VALID_RST_BALANCING = ['rst_001', 'rst_002']
C_VALID_RST_SWINGING = ['rst_003']
C_VALID_RST_SWINGING_OUTER_POLE = ['rst_004']
_compute_reward(p_state_old: State = None, p_state_new: State = None) → Reward

This method calculates the reward for the C_TYPE_OVERALL reward type. The current reward is based on the worst possible distance between two states and the best possible minimum distance between the current and the goal state.

Parameters:
  • p_state_old (State) – Previous state.

  • p_state_new (State) – New state.

Returns:

current_reward – Currently calculated Reward values.

Return type:

Reward

compute_reward_001(p_state_old: State = None, p_state_new: State = None)

Reward strategy based only on the new normalized state.

Parameters:
  • p_state_old (State) – Normalized old state.

  • p_state_new (State) – Normalized new state.

Returns:

current_reward – Currently calculated Reward values.

Return type:

Reward

compute_reward_002(p_state_old: State = None, p_state_new: State = None)

Reward strategy based on the Euclidean distance of both the new and the old normalized state from the goal state. Designed for the balancing zone.

Parameters:
  • p_state_old (State) – Normalized old state.

  • p_state_new (State) – Normalized new state.

Returns:

current_reward – Currently calculated Reward values.

Return type:

Reward
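
The exact weighting of the state dimensions is internal to the environment (see p_reward_weights). As a rough, hypothetical sketch of a distance-based reward in the spirit of this strategy, the following assumes the reward grows when the new normalized state is closer to the goal state than the old one; the names old_state, new_state and goal_state are illustrative and not part of the MLPro API.

    import numpy as np

    def distance_based_reward_sketch(old, new, goal):
        # Positive reward when the normalized new state moved closer
        # (in Euclidean distance) to the goal state than the old one.
        d_old = np.linalg.norm(np.asarray(old) - np.asarray(goal))
        d_new = np.linalg.norm(np.asarray(new) - np.asarray(goal))
        return d_old - d_new

    # Illustrative normalized 4D states; goal = equilibrium at zero
    old_state  = [0.30, 0.10, 0.25, 0.05]
    new_state  = [0.20, 0.08, 0.15, 0.02]
    goal_state = [0.0, 0.0, 0.0, 0.0]
    print(distance_based_reward_sketch(old_state, new_state, goal_state))  # > 0: moved closer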

compute_reward_003(p_state_old: State = None, p_state_new: State = None)

Reward strategy based on the Euclidean distance of both the new and the old normalized state from the goal state, designed for the swinging of the outer pole. Both angles as well as the velocity and acceleration of the outer pole are considered for the reward computation.

Parameters:
  • p_state_old (State) – Normalized old state.

  • p_state_new (State) – Normalized new state.

Returns:

current_reward – Currently calculated Reward values.

Return type:

Reward

compute_reward_004(p_state_old: State = None, p_state_new: State = None)

Reward strategy based on the Euclidean distance of both the new and the old normalized state from the goal state, designed for the swing-up region. Both angles as well as the velocity and acceleration of the outer pole are considered for the reward computation.

The reward strategy is as follows:

reward = (|old_theta1n| - |new_theta1n|) + (|new_omega1n + new_alpha1n| - |old_omega1n + old_alpha1n|)

Parameters:
  • p_state_old (State) – Normalized old state.

  • p_state_new (State) – Normalized new state.

Returns:

current_reward – Currently calculated Reward values.

Return type:

Reward
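
The formula above can be translated one-to-one into code. The following is a minimal stand-alone sketch on plain numbers; inside the environment the values come from normalized State objects and the result is wrapped in a Reward object, so the dictionary-based signature here is purely illustrative.

    def reward_004_sketch(old, new):
        # reward = (|old_theta1n| - |new_theta1n|)
        #        + (|new_omega1n + new_alpha1n| - |old_omega1n + old_alpha1n|)
        swing_term  = abs(old['theta1n']) - abs(new['theta1n'])
        energy_term = abs(new['omega1n'] + new['alpha1n']) - abs(old['omega1n'] + old['alpha1n'])
        return swing_term + energy_term

    # Illustrative normalized values
    old = dict(theta1n=0.8, omega1n=0.10, alpha1n=0.05)
    new = dict(theta1n=0.6, omega1n=0.30, alpha1n=0.10)
    print(reward_004_sketch(old, new))  # (0.8 - 0.6) + (0.4 - 0.15) = 0.45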

_init_plot_2d(p_figure: Figure, p_settings: PlotSettings)

Custom method to initialize a 2D plot. If the attribute p_settings.axes is not None, the initialization shall be done there. Otherwise, a new Matplotlib Axes object shall be created in the given figure and stored in p_settings.axes.

Parameters:
  • p_figure (Matplotlib.figure.Figure) – Matplotlib figure object to host the subplot(s).

  • p_settings (PlotSettings) – Object with further plot settings.

_update_plot_2d(p_settings: PlotSettings, **p_kwargs)

This method updates the plot figure of each episode. When the figure is detected to be an embedded figure, this method will only set up the necessary data of the figure.

class mlpro.rl.pool.envs.doublependulum.DoublePendulumS4(p_id=None, p_name=None, p_buffer_size=0, p_range_max=0, p_autorun=0, p_class_shared=None, p_mode=0, p_latency=None, p_max_torque=20, p_l1=1.0, p_l2=1.0, p_m1=1.0, p_m2=1.0, p_g=9.8, p_init_angles='random', p_history_length=5, p_fct_strans: FctSTrans = None, p_fct_success: FctSuccess = None, p_fct_broken: FctBroken = None, p_fct_reward: FctReward = None, p_mujoco_file=None, p_frame_skip=None, p_state_mapping=None, p_action_mapping=None, p_camera_conf=None, p_visualize: bool = False, p_plot_level: int = 2, p_rst_balancing='rst_002', p_rst_swinging='rst_003', p_rst_swinging_outer_pole='rst_004', p_reward_weights: list = None, p_reward_trend: bool = False, p_reward_window: int = 0, p_random_range: list = None, p_balancing_range: list = (-0.2, 0.2), p_swinging_outer_pole_range=(0.2, 0.5), p_break_swinging: bool = False, p_logging=True)

Bases: DoublePendulumRoot, DoublePendulumSystemS4

This is the Double Pendulum static four-dimensional environment, inheriting the dynamics and the default reward strategy from the Double Pendulum root class. A minimal instantiation sketch follows the parameter list below.

Parameters:
  • p_mode – Mode of environment. Possible values are Mode.C_MODE_SIM(default) or Mode.C_MODE_REAL.

  • p_latency (timedelta) – Optional latency of environment. If not provided, the internal value of constant C_LATENCY is used by default.

  • p_max_torque (float, optional) – Maximum torque applied to pendulum. The default is 20.

  • p_l1 (float, optional) – Length of pendulum 1 in m. The default is 1.0.

  • p_l2 (float, optional) – Length of pendulum 2 in m. The default is 1.0.

  • p_m1 (float, optional) – Mass of pendulum 1 in kg. The default is 1.0.

  • p_m2 (float, optional) – Mass of pendulum 2 in kg. The default is 1.0.

  • p_init_angles (str, optional) – C_ANGLES_UP starts the pendulum in an upright position; C_ANGLES_DOWN starts the pendulum in a downward position; C_ANGLES_RND starts the pendulum from a random position.

  • p_g (float, optional) – Gravitational acceleration. The default is 9.8

  • p_history_length (int, optional) – Historical trajectory points to display. The default is 5.

  • p_visualize (bool) – Boolean switch for visualisation. Default = False.

  • p_plot_level (int) – Types and number of plots to be plotted. C_PLOT_DEPTH_ENV only plots the environment, C_PLOT_DEPTH_REWARD only plots the reward, and C_PLOT_DEPTH_ALL plots both the reward and the environment. Default = C_PLOT_DEPTH_ALL.

  • p_rst_balancing – Reward strategy to be used for the balancing region of the environment

  • p_rst_swinging – Reward strategy to be used for the swinging region of the environment

  • p_reward_weights (list) – List of weights to be added to the dimensions of the state space for reward computation

  • p_reward_trend (bool) – Boolean value stating whether to plot reward trend

  • p_reward_window (int) – The number of latest rewards to be shown in the plot. Default is 0

  • p_random_range (list) – The boundaries of the state space for random initialization of the environment

  • p_balancing_range (list) – The boundaries of the state space of the environment in the balancing region

  • p_break_swinging (bool) – Boolean value stating whether the environment shall be broken outside the balancing region

  • p_logging – Log level (see constants of class mlpro.bf.various.Log). Default = Log.C_LOG_WE.
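
A minimal instantiation sketch, assuming the import path shown in the class header above; the parameter values are illustrative only, and the class constants used for p_init_angles, p_plot_level and p_rst_balancing are assumed to be accessible on the class as documented.

    from mlpro.rl.pool.envs.doublependulum import DoublePendulumS4
    from mlpro.bf.various import Log

    env = DoublePendulumS4(
        p_init_angles=DoublePendulumS4.C_ANGLES_DOWN,         # start hanging downwards
        p_max_torque=10,                                      # limit the input torque
        p_plot_level=DoublePendulumS4.C_PLOT_DEPTH_ALL,       # plot environment and reward
        p_rst_balancing=DoublePendulumS4.C_RST_BALANCING_002, # default balancing strategy
        p_balancing_range=(-0.2, 0.2),                        # documented default
        p_visualize=False,
        p_logging=Log.C_LOG_WE,                               # documented default log level
    )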

C_NAME = 'DoublePendulumS4'
_normalize(p_state: list)

Method for normalizing the state values of the Double Pendulum environment using MinMax normalization, based on the static boundaries provided by MLPro.

Parameters:

p_state – The state to be normalized

Returns:

Normalized state values

Return type:

state
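
As a rough illustration of MinMax normalization with static boundaries, the following sketch maps each state value from its boundary interval into [-1, 1]; the boundary values and the target interval used internally by MLPro are assumptions here, not documented facts.

    import numpy as np

    def minmax_normalize_sketch(values, boundaries):
        # Map each value from [low, high] to [-1, 1] (illustrative boundaries only)
        values = np.asarray(values, dtype=float)
        low  = np.array([b[0] for b in boundaries], dtype=float)
        high = np.array([b[1] for b in boundaries], dtype=float)
        return 2.0 * (values - low) / (high - low) - 1.0

    # Hypothetical boundaries for [theta1, omega1, theta2, omega2]
    bounds = [(-180.0, 180.0), (-800.0, 800.0), (-180.0, 180.0), (-950.0, 950.0)]
    print(minmax_normalize_sketch([90.0, 0.0, -45.0, 475.0], bounds))  # [0.5, 0.0, -0.25, 0.5]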

_obs_to_mujoco(p_state)
class mlpro.rl.pool.envs.doublependulum.DoublePendulumS7(p_id=None, p_name=None, p_buffer_size=0, p_range_max=0, p_autorun=0, p_class_shared=None, p_mode=0, p_latency=None, p_max_torque=20, p_l1=1.0, p_l2=1.0, p_m1=1.0, p_m2=1.0, p_g=9.8, p_init_angles='random', p_history_length=5, p_fct_strans: FctSTrans = None, p_fct_success: FctSuccess = None, p_fct_broken: FctBroken = None, p_fct_reward: FctReward = None, p_mujoco_file=None, p_frame_skip=None, p_state_mapping=None, p_action_mapping=None, p_camera_conf=None, p_visualize: bool = False, p_plot_level: int = 2, p_rst_balancing='rst_002', p_rst_swinging='rst_003', p_rst_swinging_outer_pole='rst_004', p_reward_weights: list = None, p_reward_trend: bool = False, p_reward_window: int = 0, p_random_range: list = None, p_balancing_range: list = (-0.2, 0.2), p_swinging_outer_pole_range=(0.2, 0.5), p_break_swinging: bool = False, p_logging=True)

Bases: DoublePendulumS4, DoublePendulumSystemS7

This is the classic implementation of the Double Pendulum with a seven-dimensional state space, including the derived accelerations of both poles and the input torque. The dynamics of the system are inherited from the Double Pendulum root class. A short usage sketch follows below.

Parameters:
  • p_mode – Mode of environment. Possible values are Mode.C_MODE_SIM(default) or Mode.C_MODE_REAL.

  • p_latency (timedelta) – Optional latency of environment. If not provided, the internal value of constant C_LATENCY is used by default.

  • p_max_torque (float, optional) – Maximum torque applied to pendulum. The default is 20.

  • p_l1 (float, optional) – Length of pendulum 1 in m. The default is 1.0.

  • p_l2 (float, optional) – Length of pendulum 2 in m. The default is 1.0.

  • p_m1 (float, optional) – Mass of pendulum 1 in kg. The default is 1.0.

  • p_m2 (float, optional) – Mass of pendulum 2 in kg. The default is 1.0.

  • p_init_angles (str, optional) – C_ANGLES_UP starts the pendulum in an upright position; C_ANGLES_DOWN starts the pendulum in a downward position; C_ANGLES_RND starts the pendulum from a random position.

  • p_g (float, optional) – Gravitational acceleration. The default is 9.8

  • p_history_length (int, optional) – Historical trajectory points to display. The default is 5.

  • p_visualize (bool) – Boolean switch for visualisation. Default = False.

  • p_plot_level (int) – Types and number of plots to be plotted. C_PLOT_DEPTH_ENV only plots the environment, C_PLOT_DEPTH_REWARD only plots the reward, and C_PLOT_DEPTH_ALL plots both the reward and the environment. Default = C_PLOT_DEPTH_ALL.

  • p_rst_balancing – Reward strategy to be used for the balancing region of the environment

  • p_rst_swinging – Reward strategy to be used for the swinging region of the environment

  • p_reward_weights (list) – List of weights to be added to the dimensions of the state space for reward computation

  • p_reward_trend (bool) – Boolean value stating whether to plot reward trend

  • p_reward_window (int) – The number of latest rewards to be shown in the plot. Default is 0

  • p_random_range (list) – The boundaries of the state space for random initialization of the environment

  • p_balancing_range (list) – The boundaries of the state space of the environment in the balancing region

  • p_break_swinging (bool) – Boolean value stating whether the environment shall be broken outside the balancing region

  • p_logging – Log level (see constants of class mlpro.bf.various.Log). Default = Log.C_LOG_WE.

C_NAME = 'DoublePendulumS7'
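
A short usage sketch of the seven-dimensional variant. The accessors reset(), get_state() and get_values() are assumed here from MLPro's generic system/state interface and are not specific to this environment; treat the snippet as illustrative rather than authoritative.

    from mlpro.rl.pool.envs.doublependulum import DoublePendulumS7
    from mlpro.bf.various import Log

    env_s7 = DoublePendulumS7(p_init_angles=DoublePendulumS7.C_ANGLES_RND,
                              p_max_torque=20,
                              p_logging=Log.C_LOG_NOTHING)

    env_s7.reset(p_seed=1)                 # assumed generic Environment method
    state = env_s7.get_state()             # assumed generic accessor
    print(len(state.get_values()))         # expected: 7 state dimensions
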
class mlpro.rl.pool.envs.doublependulum.DoublePendulumOA4(p_id=None, p_name: str = None, p_buffer_size: int = 0, p_range_max: int = 0, p_autorun: int = 0, p_class_shared: Shared = None, p_mode=0, p_ada=True, p_latency=None, p_t_step: timedelta = None, p_max_torque=20, p_l1=1.0, p_l2=1.0, p_m1=1.0, p_m2=1.0, p_g=9.8, p_init_angles='random', p_history_length=5, p_fct_strans: FctSTrans = None, p_fct_success: FctSuccess = None, p_fct_broken: FctBroken = None, p_fct_reward: FctReward = None, p_wf: OAWorkflow = None, p_wf_reward: OAWorkflow = None, p_wf_success: OAWorkflow = None, p_wf_broken: OAWorkflow = None, p_plot_level: int = 2, p_rst_balancing='rst_002', p_rst_swinging='rst_003', p_rst_swinging_outer_pole='rst_004', p_reward_weights: list = None, p_reward_trend: bool = False, p_reward_window: int = 0, p_random_range: list = None, p_balancing_range: list = (-0.2, 0.2), p_swinging_outer_pole_range=(0.2, 0.5), p_break_swinging: bool = False, p_mujoco_file=None, p_frame_skip: int = 1, p_state_mapping=None, p_action_mapping=None, p_camera_conf: tuple = (None, None, None), p_visualize: bool = False, p_logging: bool = True, **p_kwargs)

Bases: OAEnvironment, DoublePendulumS4, DoublePendulumOA4

C_NAME = 'Double Pendulum A4'
class mlpro.rl.pool.envs.doublependulum.DoublePendulumOA7(p_id=None, p_name: str = None, p_buffer_size: int = 0, p_range_max: int = 0, p_autorun: int = 0, p_class_shared: Shared = None, p_mode=0, p_ada=True, p_latency=None, p_t_step: timedelta = None, p_max_torque=20, p_l1=1.0, p_l2=1.0, p_m1=1.0, p_m2=1.0, p_g=9.8, p_init_angles='random', p_history_length=5, p_fct_strans: FctSTrans = None, p_fct_success: FctSuccess = None, p_fct_broken: FctBroken = None, p_fct_reward: FctReward = None, p_wf: OAWorkflow = None, p_wf_reward: OAWorkflow = None, p_wf_success: OAWorkflow = None, p_wf_broken: OAWorkflow = None, p_plot_level: int = 2, p_rst_balancing='rst_002', p_rst_swinging='rst_003', p_rst_swinging_outer_pole='rst_004', p_reward_weights: list = None, p_reward_trend: bool = False, p_reward_window: int = 0, p_random_range: list = None, p_balancing_range: list = (-0.2, 0.2), p_swinging_outer_pole_range=(0.2, 0.5), p_break_swinging: bool = False, p_mujoco_file=None, p_frame_skip: int = 1, p_state_mapping=None, p_action_mapping=None, p_camera_conf: tuple = (None, None, None), p_visualize: bool = False, p_logging: bool = True, **p_kwargs)

Bases: DoublePendulumOA4, DoublePendulumOA7

C_NAME = 'Double Pendulum A7'
C_PLOT_ACTIVE: bool = True