This tutorial is designed to provide a step-by-step mathematical explanation of the key concepts in the whitepaper "Adaptive Multi-Agent Negotiation Framework for Decentralized Markets: A Mean-Field-Type Game Approach with Uncertainty and Reinforcement Learning." It builds from foundational ideas in game theory and stochastic processes to advanced topics like typed mean-field-type games (MFTGs), reinforcement learning (RL) integration, risk measures, and forecasting under uncertainty.
1. Foundations: From n-Player Games to Mean-Field Limits
1.1 n-Player Games and Empirical Measures
In traditional game theory, consider \(N\) agents (e.g., prosumers in an energy market) with states \(x_i \in \mathbb{R}^d\) and controls \(u_i \in U\) (e.g., bid quantities). Agents interact through couplings, often via the empirical measure (or empirical law):
\[ \mu^N(t) = \frac{1}{N} \sum_{i=1}^N \delta_{x_i(t)}, \]
where \(\delta_x\) is the Dirac delta at \(x\). This summarizes the state distribution of the population. As \(N \to \infty\), \(\mu^N \Rightarrow \mu\) (weak convergence). Because each agent then interacts with the aggregate \(\mu\) rather than with every other agent, the cost of evaluating interactions drops from \(O(N^2)\) (pairwise) to \(O(N)\).
Intuition: In large markets, individual agents have negligible impact, so we model interactions via the population distribution \(\mu\) instead of tracking every pair.
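The reduction from pairwise to aggregate interactions can be checked on a toy coupling. The sketch below assumes a hypothetical linear attraction term \(\frac{1}{N}\sum_j (x_j - x_i)\), which is not taken from the whitepaper. It computes the term once by looping over all pairs and once from the empirical mean.

```python
import random


def pairwise_coupling(states):
    """O(N^2): average pull of every agent j on agent i."""
    n = len(states)
    return [sum(xj - xi for xj in states) / n for xi in states]


def mean_field_coupling(states):
    """O(N): the same quantity written through the empirical mean."""
    mean = sum(states) / len(states)
    return [mean - xi for xi in states]


rng = random.Random(0)
states = [rng.gauss(0.0, 1.0) for _ in range(200)]
slow = pairwise_coupling(states)
fast = mean_field_coupling(states)
max_gap = max(abs(a - b) for a, b in zip(slow, fast))
print(f"max difference between the two computations: {max_gap:.2e}")
```

Both functions return the same values up to floating-point rounding; only the second scales linearly in the number of agents.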
1.2 Mean-Field Games (MFGs)
Classical MFGs assume homogeneous, anonymous agents. The dynamics for a representative agent are stochastic differential equations (SDEs):
\[ dx_t = b(x_t, u_t, \mu(t)) \, dt + \sigma(x_t) \, dW_t, \]
where \(W_t\) is Brownian motion, \(b\) is the drift (e.g., state evolution based on control and market price influenced by \(\mu\)), and \(\sigma\) is volatility.
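A minimal particle simulation of this SDE, as a sketch under stated assumptions: the drift \(b(x,u,\mu) = -x + u + \kappa \bar{\mu}\) with zero control \(u = 0\), the constant volatility, and all parameter values are illustrative choices, not the whitepaper's model. The law \(\mu(t)\) is replaced by the empirical measure of the simulated agents, and time is discretized with the Euler-Maruyama scheme.

```python
import math
import random


def simulate_mean_field_sde(n_agents=500, horizon=1.0, n_steps=100,
                            kappa=0.5, sigma=0.3, seed=0):
    """Euler-Maruyama for dx = (-x + kappa * mean(x)) dt + sigma dW."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    # Initial states drawn i.i.d. from N(1, 0.5^2) (illustrative choice).
    x = [rng.gauss(1.0, 0.5) for _ in range(n_agents)]
    for _ in range(n_steps):
        mean_x = sum(x) / n_agents  # first moment of the empirical measure
        x = [xi + (-xi + kappa * mean_x) * dt
             + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
             for xi in x]
    return x


final_states = simulate_mean_field_sde()
final_mean = sum(final_states) / len(final_states)
# For this linear drift the mean obeys dm/dt = -(1 - kappa) m,
# so m(T) = m(0) * exp(-(1 - kappa) * T) with m(0) = 1.
predicted_mean = 1.0 * math.exp(-0.5)
print(f"simulated mean {final_mean:.3f}, mean-field prediction {predicted_mean:.3f}")
```

The simulated population mean should land close to the closed-form mean-field value, with a gap that shrinks as the number of agents and time steps grows.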
The cost functional to minimize is:
\[ J[u] = \mathbb{E} \left[ \int_0^T L(x_t, u_t, \mu(t)) \, dt + \Phi(x_T, \mu(T)) \right], \]
with running cost \(L\) (e.g., trading penalties) and terminal cost \(\Phi\).
Equilibria solve a coupled system:
- Backward HJB equation (optimal control):
\[ -\partial_t v(t,x) = \inf_u \left[ L(x,u,\mu(t)) + b(x,u,\mu(t)) \cdot \nabla v + \frac{1}{2} \mathrm{tr} \left( a(x) \nabla^2 v \right) \right], \quad v(T,x) = \Phi(x,\mu(T)), \]
where \(a = \sigma \sigma^\top\).
- Forward FP equation (population evolution):
\[ \partial_t \mu = -\nabla \cdot (b^*(t,x,\mu) \mu) + \frac{1}{2} \nabla^2 : (a(x) \mu), \quad \mu(0,\cdot) = \mu_0, \]
with optimal drift \(b^*\) from the HJB minimizer.
Rule of Thumb: Use MFGs when agents are many, interactions are via aggregates (e.g., prices), and individuals are small.
2. Extending to Typed Mean-Field-Type Games (MFTGs)
Real markets have heterogeneity (e.g., consumers vs. PV owners). MFTGs introduce types \(\tau \in \mathcal{T}\) with proportions \(\lambda_\tau\) (\(\sum_\tau \lambda_\tau = 1\)).
2.1 Type-Specific Dynamics and Costs
For type \(\tau\):
\[ dx^\tau_t = b_\tau(x^\tau_t, u^\tau_t, \mu(t)) \, dt + \sigma_\tau(x^\tau_t) \, dW^\tau_t, \quad a_\tau = \sigma_\tau \sigma_\tau^\top, \]
\[ J_\tau[u^\tau] = \mathbb{E} \left[ \int_0^T L_\tau(x^\tau_t, u^\tau_t, \mu(t)) \, dt + \Phi_\tau(x^\tau_T, \mu(T)) \right]. \]
The mixture law is \(\mu(t) = \sum_\tau \lambda_\tau \mu_\tau(t,\cdot)\), where \(\mu_\tau\) is the type-conditional law.
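In sample form, integrating a function against the mixture law amounts to a proportion-weighted average of per-type averages. The sketch below uses hypothetical proportions and Gaussian type-conditional laws purely for illustration.

```python
import random


def mixture_expectation(samples_by_type, proportions, f):
    """Integrate f against mu = sum_tau lambda_tau * mu_tau using per-type samples."""
    total = 0.0
    for tau, samples in samples_by_type.items():
        type_average = sum(f(x) for x in samples) / len(samples)
        total += proportions[tau] * type_average
    return total


rng = random.Random(1)
proportions = {"C": 0.7, "P": 0.3}  # hypothetical shares: consumers, PV owners
samples_by_type = {
    "C": [rng.gauss(2.0, 0.5) for _ in range(5000)],   # positive net demand
    "P": [rng.gauss(-1.0, 0.8) for _ in range(5000)],  # net supply
}
mean_net_demand = mixture_expectation(samples_by_type, proportions, lambda x: x)
# The population value for these choices is 0.7 * 2.0 + 0.3 * (-1.0) = 1.1.
print(f"mixture mean: {mean_net_demand:.3f}")
```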
2.2 Equilibrium Equations
For each \(\tau\), solve type-specific HJB:
\[ -\partial_t v_\tau(t,x) = \inf_{u \in U_\tau} \left[ L_\tau(x,u,\mu(t)) + b_\tau(x,u,\mu(t)) \cdot \nabla v_\tau + \frac{1}{2} \mathrm{tr} \left( a_\tau(x) \nabla^2 v_\tau \right) \right], \]
and FP:
\[ \partial_t \mu_\tau = -\nabla \cdot (b^*_\tau(t,x,\mu) \mu_\tau) + \frac{1}{2} \nabla^2 : (a_\tau(x) \mu_\tau). \]
Intuition: Types allow modeling groups (e.g., residential vs. industrial) while keeping tractability.
3. Finite-Sample Convergence and Propagation of Chaos
3.1 Theorem: O(1/√N) Rate
Under Lipschitz assumptions on \(b_\tau\), \(\sigma_\tau\), and independent Brownian motions, with type-exchangeable initial states:
\[ \mathbb{E} \left[ \sup_{t \leq T} W_2 \left( \mu^N(t), \mu(t) \right) \right] \leq \frac{C}{\sqrt{N}}, \]
where \(W_2\) is the 2-Wasserstein distance.
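Derivation Sketch: Couple each finite-\(N\) trajectory with a mean-field copy driven by the same Brownian motion and initial state. Apply Itô's formula to the squared difference, bound the drift and diffusion terms using the Lipschitz assumptions, and close the estimate with Grönwall's inequality. A concentration bound for the empirical measure of the mean-field copies then gives the rate. With types, exchangeability is needed only within each type.
What it Means: For finite \(N\), the error of the mean-field approximation shrinks like \(N^{-1/2}\); with \(N = 10{,}000\) agents it is of order \(C/100\), which justifies using the mean-field model in simulations.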
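The rate can be probed numerically in the simplest setting. The sketch below is not a test of the full theorem: it assumes a static 1D population with law \(N(0,1)\), draws i.i.d. samples, and approximates \(W_2\) by comparing sorted samples with midpoint normal quantiles (an approximation of the exact quantile coupling).

```python
import random
from statistics import NormalDist


def w2_to_standard_normal(samples):
    """Approximate W_2 between a 1D empirical measure and N(0, 1).

    In 1D the optimal coupling matches quantiles, so the squared distance is
    approximated by comparing sorted samples with midpoint normal quantiles.
    """
    n = len(samples)
    normal = NormalDist()
    ordered = sorted(samples)
    squared = sum((x - normal.inv_cdf((i + 0.5) / n)) ** 2
                  for i, x in enumerate(ordered))
    return (squared / n) ** 0.5


def mean_error(n, trials=20, seed=0):
    """Average the W_2 estimate over independent draws of size n."""
    rng = random.Random(seed)
    errors = [w2_to_standard_normal([rng.gauss(0.0, 1.0) for _ in range(n)])
              for _ in range(trials)]
    return sum(errors) / trials


errors = {n: mean_error(n) for n in (100, 400, 1600)}
for n, err in errors.items():
    print(f"N={n:5d}  mean W2 error={err:.4f}  sqrt(N)*error={err * n ** 0.5:.3f}")
```

If the error scales like \(N^{-1/2}\), quadrupling \(N\) should roughly halve it, and the last column should stay of similar size across rows.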
4. Reinforcement Learning Integration
An MFTG equilibrium is computed for a fixed model; RL lets agents adapt online to drifting prices and uncertainties.
4.1 Mean-Field-Conditioned Policy Gradient
Parameterize policy \(\pi_\theta(u_t | x_t, \mu^N(t))\). The gradient for \(J(\theta)\) is:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{h_{0:T} \sim d^{\pi_\theta, \mu^N}} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(u_t | x_t, \mu^N(t)) \, A^{\pi_\theta}(x_t, u_t, \mu^N(t)) \right], \]
where \(A\) is the advantage function, \(h_{0:T}\) is a trajectory, and \(d^{\pi_\theta, \mu^N}\) is the occupancy measure.
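A minimal score-function (REINFORCE-style) sketch of this gradient, under strong simplifying assumptions that are not from the whitepaper: a one-step horizon, a Gaussian policy whose mean is linear in the state and in the empirical mean \(m\) of \(\mu^N\) (a scalar summary standing in for the full measure), a toy cost \((x+u)^2\), and an advantage estimated as cost minus the batch average. Because costs are minimized, the update steps against the estimated gradient.

```python
import random


def run_policy_gradient(n_agents=256, n_iters=200, lr=0.05, policy_std=0.5, seed=0):
    """One-step REINFORCE with a Gaussian policy u ~ N(th0 * x + th1 * m, std^2).

    m is the empirical mean of the agents' states, a scalar summary of mu^N.
    The cost (x + u)^2 is a toy choice; the advantage is cost minus batch mean.
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    avg_costs = []
    for _ in range(n_iters):
        states = [rng.gauss(1.0, 1.0) for _ in range(n_agents)]
        m = sum(states) / n_agents
        records = []
        for x in states:
            mean_u = theta[0] * x + theta[1] * m
            u = rng.gauss(mean_u, policy_std)
            cost = (x + u) ** 2
            # Score function: gradient of log N(u; mean_u, std^2) w.r.t. theta.
            scale = (u - mean_u) / policy_std ** 2
            records.append((cost, scale * x, scale * m))
        baseline = sum(r[0] for r in records) / n_agents
        grad = [sum((c - baseline) * g0 for c, g0, _ in records) / n_agents,
                sum((c - baseline) * g1 for c, _, g1 in records) / n_agents]
        # Costs are minimized, so step against the estimated gradient.
        theta = [theta[0] - lr * grad[0], theta[1] - lr * grad[1]]
        avg_costs.append(baseline)
    return theta, avg_costs


theta, avg_costs = run_policy_gradient()
print(f"theta={theta}, first cost={avg_costs[0]:.3f}, last cost={avg_costs[-1]:.3f}")
```

For this toy cost the best policy mean is \(u = -x\), so the first parameter should drift toward \(-1\) and the average cost should fall toward the exploration-noise floor `policy_std ** 2`.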
4.2 Two-Timescale Learning with Wasserstein Modulation
Use critic steps \(\eta_t\) (fast) and actor steps \(\alpha_t\) (slow, \(\alpha_t / \eta_t \to 0\)), modulated by market drift:
\[ \alpha_t = \alpha_0 \min \left( 1, \frac{\tau_0}{t} \right) \left( 1 + \beta W_1(\mu^N(t), \mu^N(t-1)) \right)^{-1}, \]
where \(W_1\) is 1-Wasserstein distance.
Lemma (Dynamic Regret): For convex losses \(\ell_t(\theta)\) with drifting minimizers \(\|\theta^*_t - \theta^*_{t-1}\| \leq L_\mu W_1(\mu^N(t), \mu^N(t-1))\), regret is \(\tilde{O}(\sqrt{T})\).
Intuition: Slow actor adapts to non-stationary environments (e.g., renewable shifts); Wasserstein slows updates during high drift for stability.
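The step-size rule can be written directly from the formula. In the sketch below, the 1D sorted-matching shortcut for \(W_1\) assumes two empirical measures with the same number of points, and all numeric values are illustrative.

```python
def wasserstein1_1d(xs, ys):
    """W_1 between two equal-size 1D empirical measures (sorted matching)."""
    if len(xs) != len(ys):
        raise ValueError("this shortcut needs samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def actor_step_size(t, alpha0, tau0, beta, drift):
    """alpha_t = alpha0 * min(1, tau0 / t) / (1 + beta * drift), for t >= 1."""
    return alpha0 * min(1.0, tau0 / t) / (1.0 + beta * drift)


previous = [0.0, 1.0, 2.0, 3.0]
calm = [0.1, 1.1, 2.1, 3.1]     # population shifted by 0.1
shock = [2.0, 3.0, 4.0, 5.0]    # population shifted by 2.0
for label, current in (("calm", calm), ("shock", shock)):
    drift = wasserstein1_1d(current, previous)
    alpha = actor_step_size(t=5, alpha0=0.1, tau0=10.0, beta=2.0, drift=drift)
    print(f"{label}: W1={drift:.2f}, actor step={alpha:.4f}")
```

A larger shift in the population produces a larger \(W_1\) and therefore a smaller actor step, which is the damping behavior described in the intuition above.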
5. Risk-Aware Objectives with CVaR
Agents minimize risk-adjusted costs \(c_i(q,p,\xi) = -u_i(q,p,\xi)\), the negative utility under a scenario \(\xi\) drawn from forecasts. Here \(u_i\) denotes a utility function, not the control of Sections 1-2.
5.1 Conditional Value-at-Risk (CVaR)
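At tail level \(\alpha \in (0,1)\), i.e. the worst \(\alpha\)-fraction of outcomes (this convention divides by \(\alpha\); the other common convention uses a confidence level and divides by \(1-\alpha\)):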
\[ \mathrm{CVaR}_\alpha(c_i) = \inf_z \left[ z + \frac{1}{\alpha} \mathbb{E}[(c_i - z)_+] \right]. \]
Objective:
\[ J_i = (1 - \gamma_i) \mathbb{E}[c_i] + \gamma_i \mathrm{CVaR}_\alpha(c_i), \quad \gamma_i \in [0,1]. \]
Why on Losses? Focuses on downside risk (e.g., high costs from shortages), not upside utilities.
Estimation (Rockafellar-Uryasev): For samples \(\{c_k\}^K_{k=1}\):
\[ \widehat{\mathrm{CVaR}}_\alpha(c) = \min_z \left[ z + \frac{1}{\alpha K} \sum_{k=1}^K (c_k - z)_+ \right]. \]
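The objective is convex and piecewise linear in \(z\), and a minimizer is the empirical \((1-\alpha)\)-quantile of the samples; solve by sorting, subgradient descent, or bisection.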
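A small sketch of the sample estimator and the blended objective \(J_i\). It minimizes the Rockafellar-Uryasev objective by evaluating it at every sample value, which is exact because a convex piecewise-linear function attains its minimum at a kink, but costs \(O(K^2)\); a sort-based version would be preferable for large \(K\). The scenario costs are made up.

```python
def cvar_objective(z, costs, alpha):
    """Rockafellar-Uryasev objective z + mean((c - z)_+) / alpha."""
    return z + sum(max(c - z, 0.0) for c in costs) / (alpha * len(costs))


def empirical_cvar(costs, alpha):
    """Minimize over z; the piecewise-linear objective attains its minimum at a sample."""
    return min(cvar_objective(z, costs, alpha) for z in costs)


def risk_adjusted_cost(costs, alpha, gamma):
    """J = (1 - gamma) * mean cost + gamma * CVaR_alpha."""
    mean_cost = sum(costs) / len(costs)
    return (1.0 - gamma) * mean_cost + gamma * empirical_cvar(costs, alpha)


costs = [float(k) for k in range(1, 11)]  # toy scenario costs 1, 2, ..., 10
tail = empirical_cvar(costs, alpha=0.2)   # averages the worst 20%: (9 + 10) / 2
blended = risk_adjusted_cost(costs, alpha=0.2, gamma=0.5)
print(tail, blended)
```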
6. Uncertainty-Aware Forecasting
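Renewable forecast errors are heavy-tailed. Use a heteroscedastic Student-t output head:
\[ \hat{y}_{t+h|t} \sim \mathcal{T}_{\nu(x_t)} \left( \mu_\theta(x_t), \sigma^2_\phi(x_t) \right), \]
where \(\mu_\theta(x_t)\) is the predicted location (not the population law \(\mu\)), \(\sigma_\phi(x_t)\) the scale, and \(\nu(x_t)\) the degrees of freedom. The model is trained by minimizing the negative log-likelihood \(-\log p(y | \mu_\theta, \sigma_\phi, \nu)\).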
Benefits: Better tail coverage than Gaussian (e.g., CRPS improvement 3-6%), reducing violations.
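The training loss for a single observation can be sketched as follows, using the standard location-scale Student-t density. In a real model the location, scale, and degrees of freedom would be network outputs; here they are plain numbers chosen for illustration.

```python
import math


def student_t_nll(y, loc, scale, nu):
    """Negative log-density of a location-scale Student-t at y."""
    z = (y - loc) / scale
    log_norm = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi) - math.log(scale))
    return -(log_norm - 0.5 * (nu + 1.0) * math.log1p(z * z / nu))


def gaussian_nll(y, loc, scale):
    """Negative log-density of a Gaussian at y, for comparison."""
    z = (y - loc) / scale
    return 0.5 * math.log(2.0 * math.pi) + math.log(scale) + 0.5 * z * z


# A forecast miss of five scale units is penalized far less under the heavy-tailed head.
print("Student-t (nu=3):", student_t_nll(5.0, 0.0, 1.0, 3.0))
print("Gaussian        :", gaussian_nll(5.0, 0.0, 1.0))
```

Because large misses cost less under the Student-t loss, a few extreme errors pull the fitted location and scale around less than they would under a Gaussian loss.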
7. Lightning Network: Routing Heuristics
Finding an optimal multi-part route under capacity constraints is NP-hard in general, so routing uses a prune-rank-route heuristic with multi-part payments (MPP).
7.1 Edge Weights
\[ w_{ij} = \alpha \cdot \mathrm{fee}_{ij} + \beta \cdot \left(1 - \frac{\mathrm{capacity}_{ij}}{\max_\mathrm{cap}} \right) + \gamma \cdot \mathrm{latency}_{ij}. \]
Prune edges with capacity < \(\theta \cdot\) amount. Compute \(k\)-shortest paths (Yen's algorithm: \(O(k n (m + n \log n))\)).
Intuition: Balances fees, liquidity, and speed for P2P settlements.
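The weighting and pruning steps can be sketched on a toy graph. This is a simplification: it uses Dijkstra's algorithm to find a single cheapest path (the \(k = 1\) case) instead of Yen's \(k\)-shortest paths, and it does not split payments. The graph, the field names, and the unit coefficients are hypothetical. Dijkstra requires non-negative weights, which holds when fees, latencies, and coefficients are non-negative.

```python
import heapq


def edge_weight(edge, max_cap, alpha=1.0, beta=1.0, gamma=1.0):
    """w = alpha * fee + beta * (1 - capacity / max_cap) + gamma * latency."""
    return (alpha * edge["fee"]
            + beta * (1.0 - edge["capacity"] / max_cap)
            + gamma * edge["latency"])


def prune_and_route(edges, source, target, amount, theta=1.0):
    """Drop edges with capacity < theta * amount, then run Dijkstra."""
    max_cap = max(e["capacity"] for e in edges.values())
    graph = {}
    for (u, v), e in edges.items():
        if e["capacity"] >= theta * amount:
            graph.setdefault(u, []).append((v, edge_weight(e, max_cap)))
    best = {source: 0.0}
    parent = {}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            path = [node]
            while path[-1] != source:
                path.append(parent[path[-1]])
            return path[::-1], dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, w in graph.get(node, []):
            cand = dist + w
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                parent[nxt] = node
                heapq.heappush(queue, (cand, nxt))
    return None, float("inf")


# Hypothetical three-node channel graph (fees, capacities, latencies are made up).
edges = {
    ("A", "B"): {"fee": 1.0, "capacity": 100.0, "latency": 1.0},
    ("B", "C"): {"fee": 1.0, "capacity": 100.0, "latency": 1.0},
    ("A", "C"): {"fee": 1.0, "capacity": 10.0, "latency": 1.0},
}
small_path, small_cost = prune_and_route(edges, "A", "C", amount=5.0)
large_path, large_cost = prune_and_route(edges, "A", "C", amount=50.0)
print(small_path, small_cost)
print(large_path, large_cost)
```

The small payment can use the direct low-capacity channel, while the large payment has that channel pruned and is routed through the intermediate node.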
8. Worked Example: Linear-Quadratic MFTG
Two types: Consumers (\(\tau=C\)), PV+Storage (\(\tau=P\)). 1D state \(x^\tau_t\) (net demand), control \(u^\tau_t\) (buy/sell).
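Dynamics (here \(a_\tau\), \(b_\tau\), \(\kappa_\tau\) are scalar coefficients; \(a_\tau\) is not the diffusion matrix \(\sigma_\tau \sigma_\tau^\top\) of Section 2):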
\[ dx^\tau_t = (a_\tau x^\tau_t + b_\tau u^\tau_t + \kappa_\tau \bar{x}_t) \, dt + \sigma_\tau dW^\tau_t, \quad \bar{x}_t = \sum_\tau \lambda_\tau \mathbb{E}[x^\tau_t]. \]
Costs:
\[ L_\tau = \frac{1}{2} q_\tau (x^\tau_t)^2 + \frac{1}{2} r_\tau (u^\tau_t)^2 + s_\tau x^\tau_t \bar{x}_t, \quad \Phi_\tau = \frac{1}{2} q_{\tau,T} (x^\tau_T)^2. \]
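HJB guess: \(v_\tau(t,x) = \frac{1}{2} P_\tau(t) x^2 + \xi_\tau(t) x + \zeta_\tau(t)\). Matching powers of \(x\) yields a Riccati ODE for each \(P_\tau\) and linear ODEs for \(\xi_\tau\) that are coupled across types through \(\bar{x}_t\). The minimizer is \(u^*_\tau = -r_\tau^{-1} b_\tau (P_\tau x + \xi_\tau)\): linear in \(x\), with an offset \(\xi_\tau\) that carries the dependence on \(\bar{x}_t\). (Here \(\xi_\tau(t)\) is a value-function coefficient, not the scenario \(\xi\) of Section 5.)
FP: Under this linear feedback the closed-loop state is an Ornstein-Uhlenbeck-type (linear Gaussian) process, so Gaussian initial laws stay Gaussian.
Takeaway: LQ gives closed-form linear policies, which makes it a good test case for code.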
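As a starting point for such a test, the quadratic coefficient can be integrated numerically. Substituting the quadratic ansatz into the type-specific HJB and matching the \(x^2\) terms gives \(-\dot{P}_\tau = q_\tau + 2 a_\tau P_\tau - (b_\tau^2 / r_\tau) P_\tau^2\) with \(P_\tau(T) = q_{\tau,T}\); this equation is derived here from the formulas above, not quoted from the whitepaper. The sketch integrates it backward with explicit Euler steps, using hypothetical coefficients, and leaves out the linear term \(\xi_\tau\) and the mean \(\bar{x}_t\).

```python
def solve_riccati(a, b, q, r, q_terminal, horizon=1.0, n_steps=1000):
    """Integrate -dP/dt = q + 2 a P - (b^2 / r) P^2 backward from P(T) = q_terminal.

    Returns P on the grid t_k = k * horizon / n_steps (explicit Euler steps).
    """
    dt = horizon / n_steps
    p_grid = [0.0] * (n_steps + 1)
    p_grid[-1] = q_terminal
    for k in range(n_steps, 0, -1):
        p = p_grid[k]
        p_grid[k - 1] = p + dt * (q + 2.0 * a * p - (b * b / r) * p * p)
    return p_grid


# Hypothetical coefficients for the two types (not from the whitepaper).
types = {
    "C": dict(a=-0.5, b=1.0, q=1.0, r=0.5, q_terminal=1.0),
    "P": dict(a=-0.2, b=1.5, q=0.5, r=1.0, q_terminal=0.5),
}
for name, coeffs in types.items():
    p0 = solve_riccati(**coeffs)[0]
    gain = coeffs["b"] * p0 / coeffs["r"]  # u* = -gain * x + offset term
    print(f"type {name}: P(0)={p0:.4f}, feedback gain={gain:.4f}")
```

A convenient sanity check is the case \(a = 0\), \(b = q = r = 1\), \(q_T = 0\), where the exact solution is \(P(t) = \tanh(T - t)\).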
9. Evaluation Metrics and Reproducibility
Key metrics:
- Efficiency: \% of Pareto optimum (MILP benchmark).
- Latency: Lognormal percentiles (median 47 ms).
- CRPS for forecasts: Lower is better; Student-t beats Gaussian.
Validate forecasts on ENTSO-E data; per the whitepaper, Diebold-Mariano tests confirm that the forecast improvements are statistically significant.
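For the CRPS metric, a common sample-based estimator is \(\mathbb{E}|X - y| - \frac{1}{2}\mathbb{E}|X - X'|\), where \(X, X'\) are independent draws from the forecast distribution and \(y\) is the observation. The sketch below applies it to small made-up ensembles; whether the whitepaper uses this estimator or a closed form is not stated.

```python
def crps_ensemble(ensemble, observation):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| over ensemble members."""
    n = len(ensemble)
    spread_to_obs = sum(abs(x - observation) for x in ensemble) / n
    internal_spread = sum(abs(xi - xj) for xi in ensemble for xj in ensemble) / (n * n)
    return spread_to_obs - 0.5 * internal_spread


sharp = [0.9, 1.0, 1.1]    # tight ensemble around the outcome
vague = [-1.0, 1.0, 3.0]   # same center, much wider
observation = 1.0
print("sharp:", crps_ensemble(sharp, observation))
print("vague:", crps_ensemble(vague, observation))
```

Both ensembles are centered on the observation, but the sharper one scores lower, which is the sense in which lower CRPS is better.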
This tutorial covers the core math; refer to the whitepaper for implementation details. For deeper dives, simulate the LQ example using libraries like JAX or PyTorch.