
Continuous-time reinforcement learning for optimal switching over multiple regimes

ArXiv ID: 2512.04697 · View on arXiv
Authors: Yijie Huang, Mengge Li, Xiang Yu, Zhou Zhou

Abstract: This paper studies continuous-time reinforcement learning (RL) for optimal switching problems across multiple regimes. We consider an exploratory formulation under entropy regularization in which the agent randomizes both the timing of switches and the selection of regimes through the generator matrix of an associated continuous-time finite-state Markov chain. We establish the well-posedness of the associated system of Hamilton-Jacobi-Bellman (HJB) equations and characterize the optimal policy. Policy improvement and the convergence of the policy iterations are rigorously established by analyzing the system of equations. We also show that the value function of the exploratory formulation converges to the value function of the classical formulation as the temperature parameter vanishes. Finally, a reinforcement learning algorithm is devised and implemented, with policy evaluation based on the martingale characterization. Numerical examples using neural networks illustrate the effectiveness of the proposed RL algorithm.
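As a rough illustration of the entropy-regularized regime randomization described in the abstract, the sketch below draws a Gibbs (softmax) distribution over candidate regimes from estimated switching values; the function name, inputs, and temperature scaling are illustrative assumptions, not the paper's generator-matrix parameterization.

```python
import numpy as np

def gibbs_regime_distribution(switch_values, temperature):
    """Softmax (Gibbs) distribution over candidate regimes.

    switch_values[j] is an estimated value of switching into regime j
    (net of switching costs).  A smaller temperature concentrates mass
    on the best regime; a larger temperature encourages exploration.
    This is a hypothetical stand-in for the paper's randomization via
    the generator matrix of a continuous-time finite-state Markov chain.
    """
    v = np.asarray(switch_values, dtype=float)
    logits = (v - v.max()) / temperature  # shift by max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical example: three regimes, moderate exploration.
print(gibbs_regime_distribution([1.0, 1.2, 0.8], temperature=0.5))
```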

December 4, 2025 · 2 min · Research Team

Wasserstein Robust Market Making via Entropy Regularization

ArXiv ID: 2503.04072 · View on arXiv
Authors: Unknown

Abstract: In this paper, we introduce a robust market making framework based on the Wasserstein distance, using a stochastic policy approach enhanced by entropy regularization. We demonstrate that, under mild assumptions, the robust market making problem can be reformulated as a convex optimization problem. Additionally, we outline a methodology for selecting the optimal radius of the Wasserstein ball, further refining the framework's effectiveness.
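The abstract does not spell out the reformulation, but a standard device in Wasserstein distributionally robust optimization is the duality bound that penalizes the empirical loss by the ball radius times a Lipschitz constant. The sketch below shows that generic bound, with names and inputs chosen for illustration; it is not claimed to be the paper's convex reformulation.

```python
import numpy as np

def wasserstein_robust_upper_bound(sample_losses, lipschitz_const, radius):
    """Generic Wasserstein-1 DRO bound (illustrative, not the paper's model):

        sup_{W1(Q, P_hat) <= radius} E_Q[loss]
            <= mean(sample_losses) + radius * lipschitz_const,

    valid when the loss is Lipschitz in the underlying data with constant
    lipschitz_const; P_hat is the empirical distribution of the samples.
    """
    return float(np.mean(sample_losses)) + radius * lipschitz_const

# Hypothetical example: three sample losses, Lipschitz constant 2, radius 0.1.
print(wasserstein_robust_upper_bound([0.3, 0.5, 0.4], lipschitz_const=2.0, radius=0.1))
```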

March 6, 2025 · 1 min · Research Team

Discrete-Time Mean-Variance Strategy Based on Reinforcement Learning

ArXiv ID: 2312.15385 · View on arXiv
Authors: Unknown

Abstract: This paper studies a discrete-time mean-variance model based on reinforcement learning. Compared with its continuous-time counterpart in \cite{zhou2020mv}, the discrete-time model makes more general assumptions about the asset's return distribution. Using entropy to measure the cost of exploration, we derive the optimal investment strategy, whose density function is also of Gaussian type. Additionally, we design the corresponding reinforcement learning algorithm. Both simulation experiments and empirical analysis indicate that our discrete-time model is better suited to analyzing real-world data than the continuous-time model.
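Since the abstract states that the optimal exploratory strategy has a Gaussian density, a minimal sketch of sampling from such a policy is given below; the linear-in-wealth mean and the temperature-scaled variance are assumptions borrowed from the related continuous-time literature, not the paper's exact coefficients.

```python
import numpy as np

def sample_exploratory_allocation(wealth, target, slope, temperature, scale, rng=None):
    """Draw a risky-asset allocation from a Gaussian exploratory policy.

    The Gaussian form follows the abstract; the specific parameterization
    (mean linear in the gap between a wealth target and current wealth,
    variance growing with the exploration temperature) is illustrative only.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = slope * (target - wealth)
    std = np.sqrt(temperature * scale)
    return rng.normal(mean, std)

# Hypothetical example: wealth 1.0, target 1.5, exploration temperature 0.2.
print(sample_exploratory_allocation(1.0, 1.5, slope=0.8, temperature=0.2, scale=1.0))
```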

December 24, 2023 · 2 min · Research Team

Continuous-time q-learning for mean-field control problems

ArXiv ID: 2306.16208 · View on arXiv
Authors: Unknown

Abstract: This paper studies q-learning, recently coined as the continuous-time counterpart of Q-learning by Jia and Zhou (2023), for continuous-time McKean-Vlasov control problems in the setting of entropy-regularized reinforcement learning. In contrast to the single-agent control problem in Jia and Zhou (2023), the mean-field interaction of agents makes the definition of the q-function more subtle, and we reveal that two distinct q-functions naturally arise: (i) the integrated q-function (denoted by $q$), the first-order approximation of the integrated Q-function introduced in Gu, Guo, Wei and Xu (2023), which can be learnt through a weak martingale condition involving test policies; and (ii) the essential q-function (denoted by $q_e$), which is employed in the policy improvement iterations. We show that the two q-functions are related via an integral representation under all test policies. Based on the weak martingale condition and our proposed method for searching over test policies, several model-free learning algorithms are devised. In two examples, one within the LQ control framework and one beyond it, we obtain the exact parameterization of the optimal value function and the q-functions and illustrate our algorithms with simulation experiments.
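To make the role of a martingale condition in policy evaluation concrete, the sketch below computes a discretized "martingale loss" along one simulated trajectory: under the current policy, the value estimate plus the accumulated reward and entropy bonus should have mean-zero increments. The Euler discretization and variable names are assumptions for illustration, not the paper's weak-martingale estimator with test policies.

```python
import numpy as np

def martingale_loss(values, rewards, entropies, temperature, dt):
    """Squared one-step increments of the candidate martingale
    V(t_i, X_{t_i}) + sum_{j < i} (r_j + temperature * entropy_j) * dt
    along a single trajectory.  values has length N+1; rewards and
    entropies have length N.  Driving this loss toward zero is one simple
    surrogate for the martingale condition used in policy evaluation;
    it is a sketch, not the paper's exact algorithm.
    """
    values = np.asarray(values, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    entropies = np.asarray(entropies, dtype=float)
    increments = np.diff(values) + (rewards + temperature * entropies) * dt
    return float(np.mean(increments ** 2))

# Hypothetical example: a short trajectory with roughly constant reward and entropy.
print(martingale_loss(values=[1.0, 0.98, 0.97, 0.95],
                      rewards=[0.02, 0.01, 0.02],
                      entropies=[0.5, 0.5, 0.5],
                      temperature=0.1, dt=0.05))
```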

June 28, 2023 · 2 min · Research Team