Continuous-time Risk-sensitive Reinforcement Learning via Quadratic Variation Penalty

ArXiv ID: 2404.12598

Authors: Unknown

Abstract

This paper studies continuous-time risk-sensitive reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation with the exponential-form objective. The risk-sensitive objective arises either from the agent's risk attitude or as a distributionally robust approach against model uncertainty. Owing to the martingale perspective in Jia and Zhou (2023), the risk-sensitive RL problem is shown to be equivalent to ensuring the martingale property of a process involving both the value function and the q-function, augmented by an additional penalty term: the quadratic variation of the value process, which captures the variability of the value-to-go along the trajectory. This characterization allows existing RL algorithms developed for non-risk-sensitive settings to be adapted straightforwardly to incorporate risk sensitivity by adding the realized variance of the value process. Additionally, I highlight that the conventional policy gradient representation is inadequate for risk-sensitive problems because of the nonlinear nature of the quadratic variation; q-learning, however, offers a solution and extends to infinite-horizon settings. Finally, I prove the convergence of the proposed algorithm for Merton's investment problem and quantify the impact of the temperature parameter on the behavior of the learning procedure. I also conduct simulation experiments to demonstrate how risk-sensitive RL improves finite-sample performance in the linear-quadratic control problem.
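To make the abstract's characterization concrete, the following is a schematic sketch rather than the paper's exact statement: the entropy-regularization and terminal-reward terms are omitted, the sign convention for the risk-sensitivity parameter β is assumed, and the symbols r, V, and q simply denote the running reward, value function, and q-function mentioned above.

```latex
% Schematic sketch only: entropy regularization and terminal reward omitted;
% the sign convention for the risk-sensitivity parameter \beta is assumed.
\[
  \text{Exponential-form objective:}\qquad
  \max_{\pi}\ \frac{1}{\beta}\,\log \mathbb{E}
  \!\left[\exp\!\Big(\beta \int_0^T r\big(t, X_t^{\pi}, a_t^{\pi}\big)\,\mathrm{d}t\Big)\right]
\]
\[
  \text{Penalized martingale condition (schematic):}\qquad
  V(t, X_t) \;+\; \int_0^t \big(r_s - q(s, X_s, a_s)\big)\,\mathrm{d}s
  \;+\; \frac{\beta}{2}\,\big\langle V(\cdot, X_\cdot)\big\rangle_t
  \quad\text{is a martingale.}
\]
```

Read this way, the only departure from the risk-neutral martingale characterization of Jia and Zhou (2023) is the quadratic-variation term, which the proposed algorithms approximate by the realized variance of the value process along sampled trajectories.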

Keywords: Reinforcement Learning (RL), Risk-Sensitive Optimization, Entropy-Regularized RL, Q-Learning, Martingale Theory, Investment Management

Complexity vs Empirical Score

  • Math Complexity: 8.5/10
  • Empirical Rigor: 3.0/10
  • Quadrant: Lab Rats
  • Why: The paper is highly mathematical, featuring advanced stochastic calculus, martingale theory, and convergence proofs for complex continuous-time problems, while its empirical support is limited to simulation experiments on synthetic data, with no backtesting or real-world datasets.
```mermaid
flowchart TD
  R["Research Goal: Continuous-time risk-sensitive RL with entropy regularization"]
  M["Key Methodology: Martingale formulation with quadratic variation penalty"]
  D["Data/Inputs: Merton investment problem & linear-quadratic control simulations"]
  C["Computational Process: Adapted Q-learning algorithm for risk-sensitive objectives"]
  O["Outcomes: Proof of convergence, quantitative impact of temperature parameter, improved finite-sample performance"]
  R --> M
  M --> D
  D --> C
  C --> O
```
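As a rough illustration of the "Adapted Q-learning" step in the flowchart, the sketch below shows how a quadratic-variation penalty could be attached to an otherwise standard value/q-function residual on a time-discretized trajectory. It is a minimal sketch under stated assumptions, not the paper's algorithm: it uses a plain squared-residual loss instead of the paper's martingale-based conditions, and the names (`risk_sensitive_td_residuals`, `beta`) are hypothetical.

```python
import numpy as np

def risk_sensitive_td_residuals(values, rewards, q_values, beta, dt):
    """Per-step residuals of a quadratic-variation-penalized martingale
    condition on a time-discretized trajectory (schematic, not the paper's
    exact algorithm).

    values   -- V(t_i, x_i) along the trajectory, shape (N + 1,)
    rewards  -- running rewards r(t_i, x_i, a_i), shape (N,)
    q_values -- q(t_i, x_i, a_i), shape (N,)
    beta     -- risk-sensitivity parameter (sign convention assumed)
    dt       -- time-step size
    """
    dV = np.diff(values)          # increments of the value process
    realized_qv = dV ** 2         # realized quadratic variation per step
    # Risk-neutral part: dV + (r - q) dt; risk sensitivity adds the
    # realized-variance penalty 0.5 * beta * (dV)^2.
    return dV + (rewards - q_values) * dt + 0.5 * beta * realized_qv

# Toy usage on a synthetic trajectory (purely illustrative).
rng = np.random.default_rng(0)
N, dt, beta = 100, 0.01, 0.5
values = np.cumsum(rng.normal(scale=np.sqrt(dt), size=N + 1))
rewards = rng.normal(size=N)
q_values = rng.normal(size=N)
loss = np.mean(risk_sensitive_td_residuals(values, rewards, q_values, beta, dt) ** 2)
print(f"mean squared residual: {loss:.4f}")
```

In a learning loop, `values` and `q_values` would come from parameterized approximators updated to drive these residuals toward zero; the realized-variance term is the only addition relative to the risk-neutral case, which is the adaptation the abstract describes.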