State-based Potential Games
Ver. 1.1.0 (2025-07-18)
This module implements a dynamic game policy class for State-Based Potential Games (SbPG), including three learning algorithms:
Best Response (BR) - DOI: 10.1109/TCYB.2020.3006620
Gradient-Based (GB) - DOI: 10.1109/IECON55916.2024.10905619
Gradient-Based with Momentum (GB_MOM) - DOI: 10.1109/IECON55916.2024.10905619
The class SbPG supports learning in multi-agent environments where agents update their actions based on individual utility gradients or best-response dynamics over discretized states.
- class mlpro.gt.pool.policies.sbpg.SbPG(p_observation_space: MSpace, p_action_space: MSpace, p_id: str = None, p_buffer_size: int = 1, p_ada: bool = True, p_visualize: bool = False, p_logging=True, p_algo: int = 0, p_num_states: int = None, p_exploration_decay: float = None, p_alpha: float = None, p_ou_noise: float = None, p_kick_off_eps: int = None, p_cycles_per_ep: int = None, p_smoothing: float = None, p_ep_max: int = None, p_beta: float = None)
Bases: Policy
State-Based Potential Games (SbPG) Policy Class.
This class implements a learning policy for multi-agent systems in SbPG. It supports three algorithms for policy adaptation:
ALG_SbPG_BR: Best Response Learning
ALG_SbPG_GB: Gradient-Based Learning
ALG_SbPG_GB_MOM: Gradient-Based Learning with Momentum
The environment is discretized into a 2D grid of states. For each state, the player learns the best known action and its utility (the performance map) using the selected adaptation method.
- Parameters:
p_observation_space (MSpace) – Observation space (state space) of the player.
p_action_space (MSpace) – Action space of the player.
p_id (str, optional) – Identifier of the policy.
p_buffer_size (int, optional) – Size of the internal buffer for learning, default is 1.
p_ada (bool, optional) – Whether adaptive learning is enabled.
p_visualize (bool, optional) – Enable visualization.
p_logging (int, optional) – Logging level.
p_algo (int, optional) – Algorithm selection (0: BR, 1: GB, 2: GB_MOM).
p_num_states (int, optional) – Number of discretized states per axis.
p_exploration_decay (float, optional) – Decay rate of exploration over time.
p_alpha (float, optional) – Learning rate for GB adaptation.
p_ou_noise (float, optional) – OU noise factor for exploration in GB mode.
p_kick_off_eps (int, optional) – Number of episodes before exploitation begins in GB.
p_cycles_per_ep (int, optional) – Steps per episode.
p_smoothing (float, optional) – Smoothing factor in interpolation.
p_ep_max (int, optional) – Maximum number of episodes.
p_beta (float, optional) – Momentum coefficient for GB_MOM.
- performance_map
Stores actions and utilities for each state in a 2D grid.
- Type:
torch.Tensor
- map_nxt_action
Stores next recommended action for each state (GB only).
- Type:
torch.Tensor
- exploration
Current exploration probability.
- Type:
float
- _counter
Step counter within an episode.
- Type:
int
- _current_ep
Current episode number.
- Type:
int
- C_NAME = 'SbPG'
- ALG_SbPG_BR = 0
- ALG_SbPG_GB = 1
- ALG_SbPG_GB_MOM = 2
- _init_hyperparam()
Initializes the hyperparameter space with default values.
- get_hyperparam() → HyperParamTuple
Returns the current set of hyperparameters used by the policy.
- Returns:
A tuple containing all hyperparameter values.
- Return type:
HyperParamTuple
- compute_action(p_state: State) → Action
Computes the next action based on the current state and learning strategy.
- compute_action_br(p_state: State) → Action
Computes the player’s action using the BR learning strategy.
With probability equal to the current exploration rate, the agent takes a random action. Otherwise, it selects the best known action, obtained by interpolating the performance map.
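The exploration rule described above can be sketched as an epsilon-greedy choice. This is a minimal illustration, not the MLPro implementation: the function name `choose_action_br`, the action bounds `low`/`high`, and the uniform random draw are assumptions, and the interpolated best action is passed in as a plain number.

```python
import random


def choose_action_br(exploration, best_known_action, rng, low=0.0, high=1.0):
    """Epsilon-greedy choice between a random and the best known action.

    All names and the [low, high] action range are illustrative assumptions.
    """
    if rng.random() < exploration:
        # Explore: draw a uniformly random action from the assumed range.
        return rng.uniform(low, high)
    # Exploit: reuse the (already interpolated) best known action.
    return best_known_action


rng = random.Random(0)
greedy = choose_action_br(0.0, 0.7, rng)    # never explores
explored = choose_action_br(1.0, 0.7, rng)  # always explores
```

With `exploration` set to 0.0 the best known action is always returned, and with 1.0 every action is random, which matches the two extremes of the description.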
- compute_action_gb(p_state: State) → Action
Computes the player’s action using the GB learning strategy.
While the player is still in the “kick-off” phase, it explores randomly. Once that phase ends, it exploits the learned gradient, optionally perturbed by Ornstein-Uhlenbeck (OU) noise, and uses interpolated action values.
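The two phases described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the library's code: the `OUNoise` class, its `theta`/`sigma`/`mu` defaults, the helper `choose_action_gb`, and the clipping to `[low, high]` are all hypothetical, and how `p_ou_noise` actually scales the noise is assumed to be a simple multiplicative factor.

```python
import random


class OUNoise:
    """Discrete-time Ornstein-Uhlenbeck process with unit time step (assumed form)."""

    def __init__(self, theta=0.15, sigma=0.2, mu=0.0, seed=None):
        self.theta, self.sigma, self.mu = theta, sigma, mu
        self.state = mu
        self._rng = random.Random(seed)

    def sample(self):
        # Mean-reverting drift towards mu plus a Gaussian perturbation.
        drift = self.theta * (self.mu - self.state)
        self.state += drift + self.sigma * self._rng.gauss(0.0, 1.0)
        return self.state


def choose_action_gb(current_ep, kick_off_eps, interpolated_action,
                     noise, ou_factor, rng, low=0.0, high=1.0):
    """Random action during kick-off, noisy interpolated action afterwards."""
    if current_ep < kick_off_eps:
        return rng.uniform(low, high)          # kick-off: pure exploration
    noisy = interpolated_action + ou_factor * noise.sample()
    return min(max(noisy, low), high)          # keep the action in range


rng = random.Random(1)
early = choose_action_gb(0, 5, 0.3, OUNoise(seed=1), 0.1, rng)
late = choose_action_gb(5, 5, 0.3, OUNoise(sigma=0.0), 0.1, rng)
```

Setting `sigma` to zero disables the perturbation, so the second call returns the interpolated action unchanged.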
- _adapt(**p_kwargs) → bool
Adapts the policy based on the selected algorithm (BR, GB, or GB_MOM).
- Parameters:
p_kwargs (dict) – Keyword arguments, must include a SARSElement.
- Returns:
True if the policy was updated successfully.
- Return type:
bool
- _adapt_br(p_kwargs) → bool
Performs policy update using BR learning.
- Parameters:
p_kwargs (dict) – Dictionary containing SARSElement.
- Returns:
True if a better reward was found and map updated.
- Return type:
bool
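The BR update can be pictured as a "keep the best" rule per grid state. The sketch below is an assumption-laden illustration rather than the actual `_adapt_br`: it uses nested Python lists instead of a `torch.Tensor`, and the names `adapt_br`, `ix`, `iy` and the `[action, utility]` cell layout are hypothetical.

```python
def adapt_br(performance_map, ix, iy, action, utility):
    """Store (action, utility) at grid cell (ix, iy) if the utility improved.

    performance_map[ix][iy] is assumed to hold [best_action, best_utility].
    Returns True only when the cell was overwritten.
    """
    _, best_utility = performance_map[ix][iy]
    if utility > best_utility:
        performance_map[ix][iy] = [action, utility]
        return True
    return False


# 2 x 2 map, initialised so that any first observation counts as an improvement.
pmap = [[[0.0, float("-inf")] for _ in range(2)] for _ in range(2)]
first = adapt_br(pmap, 0, 1, action=0.3, utility=1.5)   # improvement
second = adapt_br(pmap, 0, 1, action=0.9, utility=1.0)  # worse, ignored
```

The return value mirrors the documented behaviour: it is `True` only when a better reward was found and the map was updated.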
- _adapt_gb(p_kwargs) → bool
Performs policy update using GB learning.
Updates the action using utility gradient and learning rate. Stores the best known action and utility in the performance map.
- Parameters:
p_kwargs (dict) – Keyword arguments, must include a SARSElement.
- Returns:
Always returns True to indicate the learning step was performed.
- Return type:
bool
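One way to read the GB step is as a finite-difference gradient ascent on the utility. The sketch below makes that assumption explicit. The source does not state how the gradient is estimated, so the difference quotient, the helper name `adapt_gb`, the `eps` guard, and the clipping to `[0, 1]` are all hypothetical choices.

```python
def adapt_gb(best_action, best_utility, action, utility, alpha, eps=1e-8):
    """One gradient-based step for a single grid state (illustrative only).

    Returns (best_action, best_utility, next_action, gradient).
    """
    delta_a = action - best_action
    # Assumed estimator: utility change per unit of action change.
    gradient = 0.0 if abs(delta_a) < eps else (utility - best_utility) / delta_a

    # Keep the best known action and utility, as in the performance map.
    if utility > best_utility:
        best_action, best_utility = action, utility

    # Move from the best known action along the estimated gradient.
    next_action = min(max(best_action + alpha * gradient, 0.0), 1.0)
    return best_action, best_utility, next_action, gradient


best_a, best_u, next_a, grad = adapt_gb(0.4, 1.0, 0.5, 1.2, alpha=0.1)
```

In this example the utility rose by 0.2 for an action change of 0.1, so the estimated gradient is about 2 and the next action moves from 0.5 towards 0.7.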
- _adapt_gb_mom(p_kwargs) → bool
Performs policy update using Gradient-Based learning with Momentum (GB_MOM).
A moving average of the gradient is computed using the beta parameter, which introduces momentum into the learning process.
- Parameters:
p_kwargs (dict) – Keyword arguments, must include a SARSElement.
- Returns:
Always returns True to indicate the learning step was performed.
- Return type:
bool
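The moving average mentioned above is commonly written as an exponential moving average of the gradient. The sketch below assumes that form; the exact weighting used by `_adapt_gb_mom` is not given in this reference, and the name `momentum_step` is hypothetical.

```python
def momentum_step(prev_avg, gradient, beta):
    """Exponential moving average of the gradient (assumed momentum form)."""
    return beta * prev_avg + (1.0 - beta) * gradient


# With a constant gradient, the average converges to that gradient.
avg = 0.0
history = []
for _ in range(200):
    avg = momentum_step(avg, 2.0, beta=0.9)
    history.append(avg)
```

A larger `beta` gives more weight to past gradients, which smooths noisy utility measurements at the cost of slower reaction to changes.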
- _discretization(p_x_fill_level, p_y_fill_level)
Discretizes continuous state values into grid coordinates.
- Parameters:
p_x_fill_level (float) – X-coordinate value (between 0 and 1).
p_y_fill_level (float) – Y-coordinate value (between 0 and 1).
- Returns:
Discretized (x, y) indices into the performance map.
- Return type:
tuple of int
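As an illustration of the mapping from fill levels to grid indices, the sketch below uses equal-width binning. This is only one plausible scheme: the library may round to the nearest grid node instead, and the function name `discretize` and the explicit `num_states` argument are assumptions.

```python
from typing import Tuple


def discretize(x_fill_level: float, y_fill_level: float, num_states: int) -> Tuple[int, int]:
    """Map two values in [0, 1] to integer indices in [0, num_states - 1]."""

    def to_index(value: float) -> int:
        clipped = min(max(value, 0.0), 1.0)  # guard against out-of-range inputs
        # The upper bound 1.0 is folded into the last cell.
        return min(int(clipped * num_states), num_states - 1)

    return to_index(x_fill_level), to_index(y_fill_level)


corner = discretize(0.0, 1.0, 10)
inner = discretize(0.55, 0.25, 10)
```

The resulting pair can be used directly as a row and column index into a `num_states` by `num_states` performance map.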
- interpolate_maps(p_pos_x, p_pos_y)
Interpolates the performance map using inverse distance weighting.
- Parameters:
p_pos_x (float) – X-position in the state space.
p_pos_y (float) – Y-position in the state space.
- Returns:
Interpolated action value for the given state position.
- Return type:
float
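Inverse distance weighting averages the stored values, giving nearby grid nodes more influence than distant ones. The sketch below is a generic version under stated assumptions: nodes are placed evenly on the unit square, the `smoothing` term is added to the denominator to avoid division by zero, and `power` is a hypothetical parameter. How `p_smoothing` enters the actual `interpolate_maps` is not documented here.

```python
import math


def idw_interpolate(grid, pos_x, pos_y, smoothing=1e-3, power=2.0):
    """Inverse-distance-weighted estimate at (pos_x, pos_y) in the unit square.

    grid[i][j] is assumed to hold the value at node (i/(n-1), j/(n-1)), n >= 2.
    """
    n = len(grid)
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(n):
        for j in range(n):
            dist = math.hypot(pos_x - i / (n - 1), pos_y - j / (n - 1))
            # Closer nodes get larger weights; smoothing keeps the weight finite.
            weight = 1.0 / (dist ** power + smoothing)
            weighted_sum += weight * grid[i][j]
            weight_total += weight
    return weighted_sum / weight_total


flat = idw_interpolate([[0.5] * 3 for _ in range(3)], 0.2, 0.8)
near_node = idw_interpolate([[0.0, 0.0], [0.0, 1.0]], 1.0, 1.0, smoothing=1e-9)
```

On a constant grid the estimate equals that constant, and with a very small `smoothing` a query placed on a node returns almost exactly the value stored there.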