State-based Potential Games

Ver. 1.0.0 (2025-04-09)

This module implements a dynamic game policy class for State-Based Potential Games (SbPG), including three learning algorithms:

Best Response (BR) - DOI: 10.1109/TCYB.2020.3006620

Gradient-Based (GB) - DOI: 10.1109/IECON55916.2024.10905619

Gradient-Based with Momentum (GB_MOM) - DOI: 10.1109/IECON55916.2024.10905619

The class SbPG supports learning in multi-agent environments where agents update their actions based on individual utility gradients or best-response dynamics over discretized states.

class mlpro.gt.pool.policies.sbpg.SbPG(p_observation_space: MSpace, p_action_space: MSpace, p_id: str = None, p_buffer_size: int = 1, p_ada: bool = True, p_visualize: bool = False, p_logging=True, p_algo: int = 0, p_num_states: int = None, p_exploration_decay: float = None, p_alpha: float = None, p_ou_noise: float = None, p_kick_off_eps: int = None, p_cycles_per_ep: int = None, p_smoothing: float = None, p_ep_max: int = None, p_beta: float = None)

Bases: Policy

State-Based Potential Games (SbPG) Policy Class.

This class implements a learning policy for multi-agent systems in SbPG. It supports three algorithms for policy adaptation:

ALG_SbPG_BR: Best Response Learning

ALG_SbPG_GB: Gradient-Based Learning

ALG_SbPG_GB_MOM: Gradient-Based Learning with Momentum

The state space is discretized into a 2D grid. For each grid cell, the player learns a performance map entry and the best known action through the selected adaptation method.

Parameters:

p_observation_space (MSpace) – Observation space (state space) of the player.
p_action_space (MSpace) – Action space of the player.
p_id (str, optional) – Identifier of the policy.
p_buffer_size (int, optional) – Size of the internal buffer for learning, default is 1.
p_ada (bool, optional) – Whether adaptive learning is enabled, default is True.
p_visualize (bool, optional) – Enable visualization, default is False.
p_logging (int, optional) – Logging level.
p_algo (int, optional) – Algorithm selection (0: BR, 1: GB, 2: GB_MOM), default is 0.
p_num_states (int, optional) – Number of discretized states per axis.
p_exploration_decay (float, optional) – Decay rate of exploration over time.
p_alpha (float, optional) – Learning rate for GB adaptation.
p_ou_noise (float, optional) – OU noise factor for exploration in GB mode.
p_kick_off_eps (int, optional) – Number of episodes before exploitation begins in GB.
p_cycles_per_ep (int, optional) – Steps per episode.
p_smoothing (float, optional) – Smoothing factor in interpolation.
p_ep_max (int, optional) – Maximum number of episodes.
p_beta (float, optional) – Momentum coefficient for GB_MOM.

performance_map

Stores actions and utilities for each state in a 2D grid.

Type: torch.Tensor

map_nxt_action

Stores the next recommended action for each state (GB only).

Type: torch.Tensor

exploration

Current exploration probability.

Type: float

_counter

Step counter within an episode.

Type: int

_current_ep

Current episode number.

Type: int

C_NAME = 'SbPG'

ALG_SbPG_BR = 0

ALG_SbPG_GB = 1

ALG_SbPG_GB_MOM = 2

_init_hyperparam(): Initializes the hyperparameter space with default values.

get_hyperparam() → HyperParamTuple

Returns the current set of hyperparameters used by the policy.

Returns: A tuple containing all hyperparameter values.
Return type: HyperParamTuple

compute_action(p_state: State) → Action

Computes the next action based on the current state and learning strategy.

Parameters: p_state (State) – Current environment state.
Returns: Selected action according to the chosen algorithm (BR or GB).
Return type: Action

compute_action_br(p_state: State) → Action

Computes the player’s action using the BR learning strategy.

With a probability equal to the current exploration value, the agent takes a random action. Otherwise, it chooses the best known action, obtained by interpolating the performance map.

Parameters: p_state (State) – Current environment state.
Returns: Computed action for the current state.
Return type: Action
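
The selection rule described for compute_action_br can be illustrated with a minimal, standalone sketch. It is not the MLPro implementation: the function name choose_action_br, the scalar action bounds, and the assumption that the interpolated best action is already available as a float are all hypothetical simplifications.

```python
import random

def choose_action_br(best_action, exploration, action_low, action_high, rng=random):
    """Epsilon-greedy style selection (illustrative only).

    best_action : best known action for the current state, assumed to be
                  already interpolated from the performance map
    exploration : current exploration probability in [0, 1]
    """
    # Explore: draw a uniformly random action within the action bounds.
    if rng.random() < exploration:
        return rng.uniform(action_low, action_high)
    # Exploit: use the best known action.
    return best_action
```

With exploration set to 0.0 the sketch always returns best_action; with 1.0 it always returns a random action within the bounds.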

compute_action_gb(p_state: State) → Action

Computes the player’s action using the GB learning strategy.

While the player is still in the “kick-off” phase, it explores randomly. After that, it exploits the learned gradient, optionally perturbed by Ornstein-Uhlenbeck (OU) noise. Outside the exploration phase, it uses interpolated action values.

Parameters: p_state (State) – Current environment state.
Returns: Computed action for the current state.
Return type: Action
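
The two phases described for compute_action_gb might look roughly like the following standalone sketch. The class OUNoise, the function choose_action_gb, the parameters theta, sigma, and noise_scale, and the clipping to [low, high] are assumptions made for illustration; how p_ou_noise actually enters the MLPro implementation is not stated here.

```python
import random

class OUNoise:
    """Discrete-time Ornstein-Uhlenbeck process (illustrative parameters)."""

    def __init__(self, theta=0.15, sigma=0.2, mu=0.0, seed=None):
        self.theta, self.sigma, self.mu = theta, sigma, mu
        self.state = mu
        self._rng = random.Random(seed)

    def sample(self):
        # Mean-reverting drift towards mu plus a Gaussian perturbation.
        self.state += (self.theta * (self.mu - self.state)
                       + self.sigma * self._rng.gauss(0.0, 1.0))
        return self.state

def choose_action_gb(episode, kick_off_eps, learned_action, noise,
                     noise_scale, low=0.0, high=1.0, rng=random):
    # Kick-off phase: purely random exploration.
    if episode < kick_off_eps:
        return rng.uniform(low, high)
    # Afterwards: exploit the learned action, perturbed by scaled OU noise.
    noisy = learned_action + noise_scale * noise.sample()
    # Keep the result inside the action bounds.
    return min(max(noisy, low), high)
```

Because OU noise is temporally correlated, consecutive perturbations drift smoothly rather than jumping independently at every step.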

_adapt(**p_kwargs) → bool

Adapts the policy based on the selected algorithm (BR, GB, or GB_MOM).

Parameters: p_kwargs (dict) – Keyword arguments; must include a SARSElement.
Returns: True if the policy was updated successfully.
Return type: bool

_adapt_br(p_kwargs) → bool

Performs policy update using BR learning.

Parameters: p_kwargs (dict) – Dictionary containing a SARSElement.
Returns: True if a better reward was found and the map was updated.
Return type: bool
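
The return value described above suggests a "keep the best so far" update. The sketch below illustrates that idea with plain nested lists instead of a torch.Tensor; make_map and adapt_br are hypothetical helpers, and the cell layout (action, utility) is an assumption.

```python
def make_map(num_states):
    """Create a num_states x num_states grid of (action, utility) cells, initially empty."""
    return [[(None, None) for _ in range(num_states)] for _ in range(num_states)]

def adapt_br(performance_map, ix, iy, action, utility):
    """Store (action, utility) in cell (ix, iy) only if the utility improves."""
    _, best_utility = performance_map[ix][iy]
    if best_utility is None or utility > best_utility:
        performance_map[ix][iy] = (action, utility)
        return True   # the map was updated
    return False      # the existing entry is at least as good
```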

_adapt_gb(p_kwargs) → bool

Performs policy update using GB learning.

Updates the action using utility gradient and learning rate. Stores the best known action and utility in the performance map.

Parameters: p_kwargs (dict) – Keyword arguments; must include a SARSElement.
Returns: Always returns True to indicate the learning step was performed.
Return type: bool
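
As a rough illustration of a gradient-based action update, the sketch below estimates the utility gradient by a finite difference between two consecutive (action, utility) pairs and takes a step of size alpha. The finite-difference estimate, the function name adapt_gb, and the clipping to [low, high] are assumptions; the actual gradient computation is defined in the referenced paper and may differ. Storing the best action and utility in the performance map is left out.

```python
def adapt_gb(action, prev_action, utility, prev_utility, alpha,
             low=0.0, high=1.0, eps=1e-8):
    """One gradient-ascent step on the utility (illustrative only)."""
    delta_a = action - prev_action
    # Without a change in action there is no gradient information.
    if abs(delta_a) < eps:
        return action
    # Finite-difference estimate of dU/da.
    grad = (utility - prev_utility) / delta_a
    # Move the action in the direction of increasing utility.
    new_action = action + alpha * grad
    return min(max(new_action, low), high)
```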

_adapt_gb_mom(p_kwargs) → bool

Performs policy update using Gradient-Based learning with Momentum (GB_MOM).

A moving average of the gradient is computed using the beta parameter, which introduces momentum into the learning process.

Parameters: p_kwargs (dict) – Keyword arguments; must include a SARSElement.
Returns: Always returns True to indicate the learning step was performed.
Return type: bool
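
The moving average mentioned above can be sketched as an exponential moving average of the gradient. The exact form used by the class is not given here, so the weighting beta * velocity + (1 - beta) * grad, the function name momentum_step, and the clipping are assumptions.

```python
def momentum_step(action, grad, velocity, alpha, beta, low=0.0, high=1.0):
    """One gradient step with momentum (illustrative only).

    velocity : running average of past gradients
    beta     : momentum coefficient in [0, 1); larger values smooth more
    """
    # Exponential moving average of the gradient.
    velocity = beta * velocity + (1.0 - beta) * grad
    # Step along the smoothed gradient and clip to the action bounds.
    new_action = min(max(action + alpha * velocity, low), high)
    return new_action, velocity
```

The caller keeps the returned velocity and passes it back in on the next step, so that earlier gradients continue to influence later updates.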

_discretization(p_x_fill_level, p_y_fill_level)

Discretizes continuous state values into grid coordinates.

Parameters:

p_x_fill_level (float) – X-coordinate value (between 0 and 1).
p_y_fill_level (float) – Y-coordinate value (between 0 and 1).

Returns:

Discretized (x, y) indices into the performance map.

Return type:

tuple of int
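
One plausible way to map two values in [0, 1] to grid indices is shown below. The floor-based binning, the clamping of out-of-range inputs, and the function name discretize are assumptions; the class may round or bin differently.

```python
def discretize(x_fill_level, y_fill_level, num_states):
    """Map two values in [0, 1] to integer indices in [0, num_states - 1]."""
    def to_index(value):
        # Clamp to the valid range, then floor into one of num_states bins.
        value = min(max(value, 0.0), 1.0)
        # The upper bound 1.0 falls into the last bin.
        return min(int(value * num_states), num_states - 1)

    return to_index(x_fill_level), to_index(y_fill_level)
```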

interpolate_maps(p_pos_x, p_pos_y)

Interpolates the performance map using inverse distance weighting.

Parameters:

p_pos_x (float) – X-position in the state space.
p_pos_y (float) – Y-position in the state space.

Returns:

Interpolated action value for the given state position.

Return type:

float
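
Inverse distance weighting can be illustrated independently of the class. In the sketch below, known points are given as (x, y, value) triples; the function name idw_interpolate and the power parameter are assumptions, and the role of p_smoothing in the actual method is not modeled.

```python
def idw_interpolate(points, pos_x, pos_y, power=2.0, eps=1e-12):
    """Inverse distance weighted average of values at known points.

    points : non-empty iterable of (x, y, value) triples
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in points:
        dist_sq = (x - pos_x) ** 2 + (y - pos_y) ** 2
        # Query position coincides with a known point: return its value.
        if dist_sq < eps:
            return value
        # Weight is 1 / distance**power, so closer points count more.
        weight = 1.0 / dist_sq ** (power / 2.0)
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total
```

A query point halfway between two known points receives equal weights and therefore the mean of their values.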