State-based Potential Games

Ver. 1.0.0 (2025-04-09)

This module implements a dynamic game policy class for State-Based Potential Games (SbPG), including three learning algorithms:

  • Best Response (BR) - DOI: 10.1109/TCYB.2020.3006620

  • Gradient-Based (GB) - DOI: 10.1109/IECON55916.2024.10905619

  • Gradient-Based with Momentum (GB_MOM) - DOI: 10.1109/IECON55916.2024.10905619

The class SbPG supports learning in multi-agent environments where agents update their actions based on individual utility gradients or best-response dynamics over discretized states.

class mlpro.gt.pool.policies.sbpg.SbPG(p_observation_space: MSpace, p_action_space: MSpace, p_id: str = None, p_buffer_size: int = 1, p_ada: bool = True, p_visualize: bool = False, p_logging=True, p_algo: int = 0, p_num_states: int = None, p_exploration_decay: float = None, p_alpha: float = None, p_ou_noise: float = None, p_kick_off_eps: int = None, p_cycles_per_ep: int = None, p_smoothing: float = None, p_ep_max: int = None, p_beta: float = None)

Bases: Policy

State-Based Potential Games (SbPG) Policy Class.

This class implements a learning policy for multi-agent systems in SbPG. It supports three algorithms for policy adaptation:

  • ALG_SbPG_BR: Best Response Learning

  • ALG_SbPG_GB: Gradient-Based Learning

  • ALG_SbPG_GB_MOM: Gradient-Based Learning with Momentum

The observation space is discretized into a 2D grid of states. For each state, the player learns the best known action and its utility, stored in a performance map, through the adaptation methods.

Parameters:
  • p_observation_space (MSpace) – Observation space (state space) of the player.

  • p_action_space (MSpace) – Action space of the player.

  • p_id (str, optional) – Identifier of the policy.

  • p_buffer_size (int, optional) – Size of the internal buffer for learning, default is 1.

  • p_ada (bool, optional) – Whether adaptation (learning) is enabled, default is True.

  • p_visualize (bool, optional) – Whether visualization is enabled, default is False.

  • p_logging (optional) – Logging level, default is True.

  • p_algo (int, optional) – Algorithm selection (0: ALG_SbPG_BR, 1: ALG_SbPG_GB, 2: ALG_SbPG_GB_MOM), default is 0.

  • p_num_states (int, optional) – Number of discretized states per axis.

  • p_exploration_decay (float, optional) – Decay rate of exploration over time.

  • p_alpha (float, optional) – Learning rate for GB adaptation.

  • p_ou_noise (float, optional) – OU noise factor for exploration in GB mode.

  • p_kick_off_eps (int, optional) – Number of episodes before exploitation begins in GB.

  • p_cycles_per_ep (int, optional) – Steps per episode.

  • p_smoothing (float, optional) – Smoothing factor in interpolation.

  • p_ep_max (int, optional) – Maximum number of episodes.

  • p_beta (float, optional) – Momentum coefficient for GB_MOM.

performance_map

Stores actions and utilities for each state in a 2D grid.

Type:

torch.Tensor

map_nxt_action

Stores next recommended action for each state (GB only).

Type:

torch.Tensor

exploration

Current exploration probability.

Type:

float

_counter

Step counter within an episode.

Type:

int

_current_ep

Current episode number.

Type:

int

C_NAME = 'SbPG'
ALG_SbPG_BR = 0
ALG_SbPG_GB = 1
ALG_SbPG_GB_MOM = 2

_init_hyperparam()

Initializes the hyperparameter space with default values.

get_hyperparam() → HyperParamTuple

Returns the current set of hyperparameters used by the policy.

Returns:

A tuple containing all hyperparameter values.

Return type:

HyperParamTuple

compute_action(p_state: State) → Action

Computes the next action based on the current state and learning strategy.

Parameters:

p_state (State) – Current environment state.

Returns:

Selected action according to the chosen algorithm (BR or GB).

Return type:

Action
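
The dispatch described above can be pictured with the following minimal sketch. It is not taken from the MLPro source: the helper select_action_routine is hypothetical, and it assumes that GB_MOM reuses the GB action routine, since only compute_action_br and compute_action_gb are documented.

```python
# Algorithm constants as listed for the SbPG class
ALG_SbPG_BR, ALG_SbPG_GB, ALG_SbPG_GB_MOM = 0, 1, 2


def select_action_routine(p_algo: int) -> str:
    """Return the name of the action routine assumed to serve the given algorithm."""
    if p_algo == ALG_SbPG_BR:
        return "compute_action_br"
    if p_algo in (ALG_SbPG_GB, ALG_SbPG_GB_MOM):
        # Assumption: momentum only changes the adaptation step, not action selection
        return "compute_action_gb"
    raise ValueError(f"unknown algorithm id: {p_algo}")
```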

compute_action_br(p_state: State) → Action

Computes the player’s action using BR learning strategy.

With probability equal to the current exploration rate, the agent takes a random action. Otherwise, it chooses the best known action by interpolating the performance map.

Parameters:

p_state (State) – Current environment state.

Returns:

Computed action for the current state.

Return type:

Action
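
The explore-or-exploit choice described for BR can be sketched as an epsilon-greedy rule. This is an illustration only: the function br_action_sketch, the scalar action, and the bounds low and high are assumptions, and best_action stands in for the value the policy would obtain from interpolating the performance map.

```python
import random


def br_action_sketch(exploration, best_action, low=0.0, high=1.0, rng=random):
    """Pick a random action with probability `exploration`, else the best known one."""
    if rng.random() < exploration:
        return rng.uniform(low, high)  # explore: uniform random action within bounds
    return best_action                 # exploit: best known (interpolated) action
```

With exploration set to 0.0 the sketch always returns best_action, and with 1.0 it always samples a random action.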

compute_action_gb(p_state: State) → Action

Computes the player’s action using GB learning strategy.

During the “kick-off” phase (see p_kick_off_eps), the player explores randomly. Afterwards, it exploits the learned gradient-based action, taken from the interpolated action values, optionally perturbed by Ornstein-Uhlenbeck (OU) noise.

Parameters:

p_state (State) – Current environment state.

Returns:

Computed action for the current state.

Return type:

Action
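
A minimal sketch of the two phases follows. It assumes a scalar action in [0, 1], a strict comparison of the episode counter against the kick-off length, and a textbook Ornstein-Uhlenbeck process. The names OUNoise and gb_action_sketch and the values of theta, sigma, and dt are illustrative and not taken from the MLPro source.

```python
import random


class OUNoise:
    """Minimal Ornstein-Uhlenbeck process (theta, sigma, dt are illustrative values)."""

    def __init__(self, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=None):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.x = mu
        self._rng = random.Random(seed)

    def sample(self):
        # Mean-reverting drift plus Gaussian diffusion
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * (self.dt ** 0.5) * self._rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


def gb_action_sketch(current_ep, kick_off_eps, learned_action, noise, ou_factor,
                     low=0.0, high=1.0, rng=random):
    """Random action during kick-off, otherwise the learned action plus scaled OU noise."""
    if current_ep < kick_off_eps:
        return rng.uniform(low, high)          # kick-off phase: pure random exploration
    action = learned_action + ou_factor * noise.sample()
    return min(high, max(low, action))         # clip to the action bounds
```

Setting ou_factor to 0.0 disables the perturbation, so the learned action is returned unchanged after kick-off.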

_adapt(**p_kwargs) → bool

Adapts the policy based on the selected algorithm (BR, GB, or GB_MOM).

Parameters:

p_kwargs (dict) – Keyword arguments, must include a SARSElement.

Returns:

True if the policy was updated successfully.

Return type:

bool

_adapt_br(p_kwargs) → bool

Performs policy update using BR learning.

Parameters:

p_kwargs (dict) – Dictionary containing SARSElement.

Returns:

True if a better reward was found and map updated.

Return type:

bool
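
The update rule implied by the return value (the map changes only when a better reward is found) might look like the sketch below. It uses a nested list of (action, utility) tuples in place of the torch.Tensor performance map, and the initial values are an assumption.

```python
def adapt_br_sketch(performance_map, ix, iy, action, utility):
    """Store (action, utility) for cell (ix, iy) only if the utility improved."""
    _, best_utility = performance_map[ix][iy]
    if utility > best_utility:
        performance_map[ix][iy] = (action, utility)
        return True   # map updated
    return False      # previous entry was at least as good


# 2x2 map with no known action and -inf utility (illustrative initial values)
pmap = [[(None, float("-inf")) for _ in range(2)] for _ in range(2)]
```

A first observation for a cell always updates the map, while a later observation with lower utility leaves it untouched.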

_adapt_gb(p_kwargs) → bool

Performs policy update using GB learning.

Updates the action using utility gradient and learning rate. Stores the best known action and utility in the performance map.

Parameters:

p_kwargs (dict) – Keyword arguments, must include a SARSElement.

Returns:

Always returns True to indicate the learning step was performed.

Return type:

bool
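
As a rough illustration of a gradient-based update, the sketch below performs one gradient-ascent step on a scalar action. The page does not state how the utility gradient is obtained, so the finite-difference estimate from two consecutive (action, utility) pairs is an assumption, as are the function name and the clipping bounds.

```python
def gb_step_sketch(action, utility, prev_action, prev_utility, alpha,
                   low=0.0, high=1.0, eps=1e-8):
    """One gradient-ascent step using a finite-difference gradient estimate."""
    delta_a = action - prev_action
    if abs(delta_a) < eps:
        return action, 0.0                      # no usable gradient information
    grad = (utility - prev_utility) / delta_a   # estimated dU/da
    nxt = action + alpha * grad                 # ascent: move towards higher utility
    return min(high, max(low, nxt)), grad
```

Here alpha plays the role of p_alpha, the learning rate for GB adaptation.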

_adapt_gb_mom(p_kwargs) → bool

Performs policy update using Gradient-Based learning with Momentum (GB_MOM).

A moving average of the gradient is computed using the beta parameter, which introduces momentum into the learning process.

Parameters:

p_kwargs (dict) – Keyword arguments, must include a SARSElement.

Returns:

Always returns True to indicate the learning step was performed.

Return type:

bool
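
The moving average mentioned above can be sketched as an exponential moving average of the gradient. The exact form used by the class is not given on this page, so the weighting beta * velocity + (1 - beta) * grad and the helper name are assumptions.

```python
def momentum_step_sketch(action, grad, velocity, alpha, beta, low=0.0, high=1.0):
    """Gradient-ascent step driven by an exponential moving average of the gradient."""
    velocity = beta * velocity + (1.0 - beta) * grad   # smoothed gradient
    nxt = action + alpha * velocity                    # step along the smoothed gradient
    return min(high, max(low, nxt)), velocity
```

A beta close to 1 gives a long memory of past gradients, while beta = 0 reduces the step to plain gradient ascent.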

_discretization(p_x_fill_level, p_y_fill_level)

Discretizes continuous state values into grid coordinates.

Parameters:
  • p_x_fill_level (float) – X-coordinate value (between 0 and 1).

  • p_y_fill_level (float) – Y-coordinate value (between 0 and 1).

Returns:

Discretized (x, y) indices into the performance map.

Return type:

tuple of int
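
One plausible way to map two values in [0, 1] to grid indices is uniform binning, sketched below. The binning scheme, the clipping of out-of-range inputs, and the explicit num_states argument are assumptions; the class may round differently.

```python
def discretize_sketch(p_x_fill_level, p_y_fill_level, num_states):
    """Map two values in [0, 1] to integer grid indices in [0, num_states - 1]."""
    def to_index(value):
        clipped = min(1.0, max(0.0, value))
        # The value 1.0 would land in bin `num_states`, so cap at the last index
        return min(num_states - 1, int(clipped * num_states))

    return to_index(p_x_fill_level), to_index(p_y_fill_level)
```

For example, with num_states set to 10 the pair (0.55, 0.25) falls into cell (5, 2).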

interpolate_maps(p_pos_x, p_pos_y)

Interpolates the performance map using inverse distance weighting.

Parameters:
  • p_pos_x (float) – X-position in the state space.

  • p_pos_y (float) – Y-position in the state space.

Returns:

Interpolated action value for the given state position.

Return type:

float
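
Inverse distance weighting averages the stored values with weights that shrink as the distance to the query position grows. The sketch below applies it to a nested list of action values, with positions expressed in grid coordinates. The power argument and the choice to weight every cell are assumptions, and how p_smoothing enters the real computation is not documented here.

```python
import math


def idw_interpolate_sketch(action_map, p_pos_x, p_pos_y, power=2.0):
    """Inverse-distance-weighted average of action_map[i][j] located at grid point (i, j)."""
    num = den = 0.0
    for i, row in enumerate(action_map):
        for j, value in enumerate(row):
            dist = math.hypot(p_pos_x - i, p_pos_y - j)
            if dist == 0.0:
                return float(value)      # query sits exactly on a grid point
            weight = 1.0 / dist ** power
            num += weight * value
            den += weight
    return num / den
```

A query at the centre of a 2x2 grid is equidistant from all four cells, so the result is their plain average.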