
Deep Generative Models for Synthetic Financial Data: Applications to Portfolio and Risk Modeling

ArXiv ID: 2512.21798 · View on arXiv
Authors: Christophe D. Hounwanou, Yae Ulrich Gaba
Abstract: Synthetic financial data provides a practical solution to the privacy, accessibility, and reproducibility challenges that often constrain empirical research in quantitative finance. This paper investigates the use of deep generative models, specifically Time-series Generative Adversarial Networks (TimeGAN) and Variational Autoencoders (VAEs), to generate realistic synthetic financial return series for portfolio construction and risk modeling applications. Using historical daily returns from the S&P 500 as a benchmark, we generate synthetic datasets under comparable market conditions and evaluate them using statistical similarity metrics, temporal structure tests, and downstream financial tasks. The study shows that TimeGAN produces synthetic data whose distributional shapes, volatility patterns, and autocorrelation behaviour are close to those observed in real returns. When applied to mean–variance portfolio optimization, the resulting synthetic datasets lead to portfolio weights, Sharpe ratios, and risk levels that remain close to those obtained from real data. The VAE trains more stably but tends to smooth extreme market movements, which affects risk estimation. Overall, the analysis supports the use of synthetic datasets as substitutes for real financial data in portfolio analysis and risk simulation, particularly when models are able to capture temporal dynamics. Synthetic data therefore provides a privacy-preserving, cost-effective, and reproducible tool for financial experimentation and model development. ...
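As a minimal illustration of the downstream task used here, the sketch below computes unconstrained mean–variance weights and an annualized Sharpe ratio on simulated stand-ins for the "real" and "synthetic" return panels (the means, covariance, and risk-aversion value are invented for illustration; the paper's actual data and constraints are not reproduced):

```python
import numpy as np

def mean_variance_weights(returns, risk_aversion=5.0):
    """Unconstrained mean-variance solution w proportional to inv(Sigma) @ mu,
    rescaled to sum to one (no long-only or leverage constraints enforced)."""
    mu = returns.mean(axis=0)
    sigma = np.cov(returns, rowvar=False)
    raw = np.linalg.solve(sigma, mu) / risk_aversion
    return raw / raw.sum()

def sharpe_ratio(returns, weights, periods_per_year=252):
    """Annualized Sharpe ratio of the weighted portfolio (zero risk-free rate)."""
    port = returns @ weights
    return np.sqrt(periods_per_year) * port.mean() / port.std()

# Stand-ins for real and synthetic daily return panels (2 assets, 1000 days).
rng = np.random.default_rng(0)
mean = [0.001, 0.0006]
cov = [[1e-4, 2e-5], [2e-5, 8e-5]]
real = rng.multivariate_normal(mean, cov, size=1000)
synthetic = rng.multivariate_normal(mean, cov, size=1000)

w_real = mean_variance_weights(real)
w_syn = mean_variance_weights(synthetic)
print("weights:", np.round(w_real, 3), np.round(w_syn, 3))
print("Sharpe (real data, real weights):", round(sharpe_ratio(real, w_real), 2))
```

Comparing `w_real` against `w_syn`, and the corresponding Sharpe ratios, is the kind of similarity check the paper performs on TimeGAN- and VAE-generated panels.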

December 25, 2025 · 2 min · Research Team

Synthetic Financial Data Generation for Enhanced Financial Modelling

ArXiv ID: 2512.21791 · View on arXiv
Authors: Christophe D. Hounwanou, Yae Ulrich Gaba, Pierre Ntakirutimana
Abstract: Data scarcity and confidentiality in finance often impede model development and robust testing. This paper presents a unified multi-criteria evaluation framework for synthetic financial data and applies it to three representative generative paradigms: a statistical ARIMA-GARCH baseline, Variational Autoencoders (VAEs), and Time-series Generative Adversarial Networks (TimeGAN). Using historical S&P 500 daily data, we evaluate fidelity (Maximum Mean Discrepancy, MMD), temporal structure (autocorrelation and volatility clustering), and practical utility in downstream tasks, specifically mean-variance portfolio optimization and volatility forecasting. Empirical results indicate that ARIMA-GARCH captures linear trends and conditional volatility but fails to reproduce nonlinear dynamics; VAEs produce smooth trajectories that underestimate extreme events; and TimeGAN achieves the best trade-off between realism and temporal coherence (e.g., the lowest MMD: 1.84e-3, averaged over 5 seeds). Finally, we articulate practical guidelines for selecting generative models according to application needs and computational constraints. Our unified evaluation protocol and reproducible codebase aim to standardize benchmarking in synthetic financial data research. ...
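The fidelity metric used here, MMD, can be estimated with a standard RBF-kernel V-statistic. The sketch below is a generic one-dimensional implementation on toy return samples, with the bandwidth set by the median heuristic (an assumption; the paper's exact kernel settings are not given here):

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth=None):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel.
    Bandwidth defaults to the median pairwise-distance heuristic."""
    z = np.concatenate([x, y])[:, None]
    d2 = (z - z.T) ** 2                      # all pairwise squared distances
    if bandwidth is None:
        bandwidth = np.sqrt(np.median(d2[d2 > 0]))
    k = np.exp(-d2 / (2.0 * bandwidth ** 2))
    n = len(x)
    return k[:n, :n].mean() + k[n:, n:].mean() - 2.0 * k[:n, n:].mean()

rng = np.random.default_rng(1)
real = rng.standard_t(df=4, size=500) * 0.01   # heavy-tailed "real" returns
same = rng.standard_t(df=4, size=500) * 0.01   # a faithful generator
diff = rng.normal(0.0, 0.03, size=500)         # a badly scaled generator

print(rbf_mmd2(real, same), rbf_mmd2(real, diff))
```

A faithful generator should score close to zero, while a mismatched one scores visibly higher, which is how an MMD of 1.84e-3 becomes a comparable figure across models.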

December 25, 2025 · 2 min · Research Team

Forecasting implied volatility surface with generative diffusion models

ArXiv ID: 2511.07571 · View on arXiv
Authors: Chen Jin, Ankush Agarwal
Abstract: We introduce a conditional Denoising Diffusion Probabilistic Model (DDPM) for generating arbitrage-free implied volatility (IV) surfaces, offering a more stable and accurate alternative to existing GAN-based approaches. To capture the path-dependent nature of volatility dynamics, our model is conditioned on a rich set of market variables, including exponentially weighted moving averages (EWMAs) of historical surfaces, returns and squared returns of the underlying asset, and scalar risk indicators such as the VIX. Empirical results demonstrate that our model significantly outperforms leading GAN-based models in capturing the stylized facts of IV dynamics. A key challenge is that the earlier part of the historical training data often contains small arbitrage opportunities, which conflicts with the goal of generating arbitrage-free surfaces. We address this by incorporating a standard arbitrage penalty into the loss function, applied with a novel, parameter-free weighting scheme based on the signal-to-noise ratio (SNR) that dynamically adjusts the penalty's strength across the diffusion process. We also present a formal analysis of this trade-off and a proof of convergence showing that the penalty introduces a small, controllable bias that steers the model toward the manifold of arbitrage-free surfaces while ensuring the generated distribution remains close to the real-world data. ...
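The SNR-based weighting idea can be illustrated on a standard linear noise schedule. The specific weight below, SNR/(1+SNR) (which simplifies algebraically to the signal retention ᾱ_t), is an assumed example of an SNR-driven, parameter-free weight, not necessarily the paper's exact scheme:

```python
import numpy as np

# Illustrative SNR-based penalty weighting across diffusion timesteps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # standard linear beta schedule
alpha_bar = np.cumprod(1.0 - betas)       # signal retention after t steps
snr = alpha_bar / (1.0 - alpha_bar)       # signal-to-noise ratio at step t
penalty_weight = snr / (1.0 + snr)        # strong near t=0 (nearly clean),
                                          # vanishing near t=T (pure noise)
print(round(float(penalty_weight[0]), 4), float(penalty_weight[-1]))
```

The qualitative point survives any reasonable choice: the arbitrage penalty should bite where the sample is informative (high SNR) and relax where it is dominated by noise.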

November 10, 2025 · 2 min · Research Team

Nested Optimal Transport Distances

ArXiv ID: 2509.06702 · View on arXiv
Authors: Ruben Bontorno, Songyan Hou
Abstract: Simulating realistic financial time series is essential for stress testing, scenario generation, and decision-making under uncertainty. Despite advances in deep generative models, there is no consensus metric for their evaluation. We focus on generative AI for financial time series in decision-making applications and employ the nested optimal transport distance, a time-causal variant of the optimal transport distance that is robust for downstream tasks such as hedging, optimal stopping, and reinforcement learning. Moreover, we propose a statistically consistent, naturally parallelizable algorithm for its computation, achieving substantial speedups over existing approaches. ...
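For intuition, the nested distance admits a tiny brute-force computation on two-stage scenario trees with equally likely nodes, where by Birkhoff's theorem an optimal coupling is a permutation. This toy sketch (absolute-value cost, a construction of ours for illustration) is not the paper's parallelizable algorithm:

```python
import numpy as np
from itertools import permutations

def w1_equal_weight(xs, ys):
    """1-D Wasserstein-1 distance between equal-weight empirical samples."""
    return float(np.mean(np.abs(np.sort(xs) - np.sort(ys))))

def nested_distance(tree_a, tree_b):
    """Nested (time-causal) OT distance between two-stage scenario trees.
    Each tree is a list of (first_stage_value, leaf_values) with equally
    likely nodes: the stage-1 cost between two nodes adds the stage-2
    Wasserstein distance between their child distributions."""
    n = len(tree_a)
    cost = np.array([[abs(xa - xb) + w1_equal_weight(la, lb)
                      for xb, lb in tree_b] for xa, la in tree_a])
    return min(cost[range(n), list(p)].mean() for p in permutations(range(n)))

tree_a = [(0.0, [1.0, 2.0]), (1.0, [3.0, 4.0])]
tree_b = [(0.0, [1.0, 2.0]), (1.0, [3.0, 4.0])]   # identical to tree_a
tree_c = [(0.0, [2.0, 3.0]), (1.0, [4.0, 5.0])]   # shifted second stage

print(nested_distance(tree_a, tree_b))  # → 0.0
print(nested_distance(tree_a, tree_c))
```

Unlike a plain Wasserstein distance on flattened paths, the recursion above charges trees for disagreeing in their conditional (stage-2) distributions, which is exactly the time-causality that matters for hedging and stopping problems.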

September 8, 2025 · 2 min · Research Team

Prospects of Imitating Trading Agents in the Stock Market

ArXiv ID: 2509.00982 · View on arXiv
Authors: Mateusz Wilinski, Juho Kanniainen
Abstract: In this work we show how generative tools, which have been successfully applied to limit order book data, can be utilized for the task of imitating trading agents. To this end, we propose a modified generative architecture based on a state-space model and apply it to limit order book data with identified investors. The model is trained on synthetic data generated from a heterogeneous agent-based model. Finally, we compare the model's predicted distributions over different aspects of investors' actions with the ground truths known from the agent-based model. ...

August 31, 2025 · 2 min · Research Team

Causal Interventions in Bond Multi-Dealer-to-Client Platforms

ArXiv ID: 2506.18147 · View on arXiv
Authors: Paloma Marín, Sergio Ardanza-Trevijano, Javier Sabio
Abstract: The digitalization of financial markets has shifted trading from voice to electronic channels, with Multi-Dealer-to-Client (MD2C) platforms now enabling clients to request quotes (RfQs) for financial instruments like bonds from multiple dealers simultaneously. In this competitive landscape, dealers cannot see each other's prices, making a rigorous analysis of the negotiation process crucial to ensure their profitability. This article introduces a novel general framework for analyzing the RfQ process using probabilistic graphical models and causal inference. Within this framework, we explore inferential questions relevant to dealers participating in MD2C platforms, such as computing optimal prices, estimating potential revenues, and identifying clients that might be interested in trading the dealer's axes. We then analyze two approaches to model specification: a generative model built on the work of (Fermanian, Guéant, & Pu, 2017), and discriminative models utilizing machine learning techniques. Our results show that generative models can match the predictive accuracy of leading discriminative algorithms such as LightGBM (ROC-AUC: 0.742 vs. 0.743) while simultaneously enforcing critical business requirements, notably spread monotonicity. ...
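As a toy version of the optimal-price question, the sketch below maximizes expected markup under a hypothetical logistic hit-rate model; the parameters `k` and `s0` are invented for illustration and are not estimated from RfQ data. A hit rate that decreases monotonically in the quoted spread is the kind of business requirement the abstract's "spread monotonicity" refers to:

```python
import numpy as np

# Hypothetical logistic model for P(win | quoted spread).
def hit_probability(spread, k=8.0, s0=0.5):
    return 1.0 / (1.0 + np.exp(k * (spread - s0)))

def optimal_spread(grid):
    """Spread maximizing expected markup = spread * P(win | spread)."""
    expected_pnl = grid * hit_probability(grid)
    i = np.argmax(expected_pnl)
    return grid[i], expected_pnl[i]

grid = np.linspace(0.0, 2.0, 201)
probs = hit_probability(grid)            # monotonically decreasing in spread
s_star, pnl_star = optimal_spread(grid)
print("optimal spread:", round(float(s_star), 2),
      "expected markup:", round(float(pnl_star), 3))
```

The trade-off is visible directly: quoting tighter wins more RfQs at lower margin, quoting wider earns more per trade but wins fewer, and the optimum sits in between.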

June 22, 2025 · 2 min · Research Team

LOB-Bench: Benchmarking Generative AI for Finance – an Application to Limit Order Book Data

ArXiv ID: 2502.09172 · View on arXiv
Authors: Unknown
Abstract: While financial data presents one of the most challenging and interesting sequence modelling tasks due to high noise, heavy tails, and strategic interactions, progress in this area has been hindered by the lack of consensus on quantitative evaluation paradigms. To address this, we present LOB-Bench, a benchmark implemented in Python and designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB) in the LOBSTER format. Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data, supporting flexible multivariate statistical evaluation. The benchmark also includes commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times, along with scores from a trained discriminator network. Lastly, LOB-Bench contains “market impact metrics”, i.e. the cross-correlations and price response functions for specific events in the data. We benchmark generative autoregressive state-space models, a (C)GAN, as well as a parametric LOB model, and find that the autoregressive GenAI approach beats traditional model classes. ...
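Two of the commonly used LOB statistics named above, spread and order imbalance, together with a generic two-sample distance (a Kolmogorov-Smirnov statistic here, as a simple stand-in for the benchmark's distributional metrics, not LOB-Bench's actual API), can be sketched as:

```python
import numpy as np

def spread_and_imbalance(best_bid, best_ask, bid_vol, ask_vol):
    """Two standard top-of-book LOB statistics."""
    spread = best_ask - best_bid
    imbalance = (bid_vol - ask_vol) / (bid_vol + ask_vol)
    return spread, imbalance

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov distance between empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

s, imb = spread_and_imbalance(99.98, 100.02, 500, 300)
rng = np.random.default_rng(2)
real_spreads = rng.exponential(0.010, 1000)    # toy "real" spread samples
gen_spreads = rng.exponential(0.012, 1000)     # toy "generated" samples
print("spread:", round(s, 2), "imbalance:", imb)
print("KS distance:", round(ks_statistic(real_spreads, gen_spreads), 3))
```

Comparing such statistics distributionally between generated and real books, rather than as point estimates, is the core idea of the benchmark.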

February 13, 2025 · 2 min · Research Team

Synthetic Data for Portfolios: A Throw of the Dice Will Never Abolish Chance

ArXiv ID: 2501.03993 · View on arXiv
Authors: Unknown
Abstract: Simulation methods have always been instrumental in finance, and data-driven methods with minimal model specification, commonly referred to as generative models, have attracted increasing attention, especially after the success of deep learning in a broad range of fields. However, the adoption of these models in financial applications has not matched the growing interest, probably due to the unique complexities and challenges of financial markets. This paper contributes to a deeper understanding of the limitations of generative models, particularly in portfolio and risk management. To this end, we begin by presenting theoretical results on the importance of initial sample size, and point out the potential pitfalls of generating far more data than originally available. We then highlight the inseparable nature of model development and the intended use by touching on a paradox: usual generative models inherently care less about what matters most for constructing portfolios (in particular long-short ones). Based on these findings, we propose a pipeline for the generation of multivariate returns that meets conventional evaluation standards on a large universe of US equities while complying with the stylized facts observed in asset returns and circumventing the pitfalls we previously identified. Moreover, we stress the need for more accurate evaluation methods and, through an example of mean-reversion strategies, suggest a method designed to identify poor models for a given application based on regurgitative training, i.e. retraining the model using the data it has itself generated, which is commonly referred to in statistics as identifiability. ...
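The regurgitative-training diagnostic can be illustrated with a deliberately simple stand-in model: fit, generate, refit on the generated data, and track how far the parameters drift. The Gaussian "model" and all numbers below are purely illustrative, not the paper's pipeline:

```python
import numpy as np

# Toy regurgitative-training loop on a two-parameter Gaussian model.
def fit(data):
    return data.mean(), data.std()

def generate(params, n, rng):
    mu, sigma = params
    return rng.normal(mu, sigma, n)

rng = np.random.default_rng(3)
real = rng.standard_t(df=3, size=2000) * 0.01   # heavy-tailed "returns"
params = fit(real)
initial_sigma = params[1]
for _ in range(20):                              # retrain on self-generated data
    params = fit(generate(params, 2000, rng))
drift = abs(params[1] - initial_sigma) / initial_sigma
print("sigma:", round(initial_sigma, 4), "->", round(params[1], 4),
      "relative drift:", round(drift, 3))
```

A model whose parameters wander far from their initial fit under this loop is, by the paper's argument, a poor candidate for the application at hand.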

January 7, 2025 · 3 min · Research Team

A Financial Time Series Denoiser Based on Diffusion Model

ArXiv ID: 2409.02138 · View on arXiv
Authors: Unknown
Abstract: Financial time series often exhibit a low signal-to-noise ratio, posing significant challenges for accurate data interpretation, prediction, and ultimately decision making. Generative models have gained attention as powerful tools for simulating and predicting intricate data patterns, with the diffusion model emerging as a particularly effective method. This paper introduces a novel approach that uses the diffusion model as a denoiser for financial time series in order to improve data predictability and trading performance. By leveraging the forward and reverse processes of the conditional diffusion model to progressively add and remove noise, we reconstruct original data from noisy inputs. Our extensive experiments demonstrate that diffusion-denoised time series significantly enhance performance on downstream future return classification tasks. Moreover, trading signals derived from the denoised data yield more profitable trades with fewer transactions, thereby minimizing transaction costs and increasing overall trading efficiency. Finally, we show that classifiers trained on denoised time series can recognize the noising state of the market and obtain excess returns. ...
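The forward noising step of a DDPM, and the closed-form reconstruction of the clean series when the injected noise is known, can be sketched as follows. This is a sanity check of the forward/reverse algebra on a toy return series, not the paper's learned conditional denoiser:

```python
import numpy as np

# Forward noising under a standard DDPM schedule.
T = 100
betas = np.linspace(1e-4, 0.05, T)
alpha_bar = np.cumprod(1.0 - betas)              # signal retention after t steps

rng = np.random.default_rng(4)
x0 = rng.standard_t(df=4, size=256) * 0.01       # heavy-tailed "return" series
t = 50
eps = rng.standard_normal(x0.shape)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# A trained network would predict eps from (xt, t); with the true eps the
# reconstruction is exact up to floating point:
x0_hat = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
print(np.allclose(x0_hat, x0))  # → True
```

In the actual denoiser, a network's noise estimate replaces `eps` in the last line, so the reconstruction is approximate and, usefully, smoother than the raw input.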

September 2, 2024 · 2 min · Research Team

Mean-Field Microcanonical Gradient Descent

ArXiv ID: 2403.08362 · View on arXiv
Authors: Unknown
Abstract: Microcanonical gradient descent is a sampling procedure for energy-based models that allows efficient sampling of high-dimensional distributions. It works by transporting samples from a high-entropy distribution, such as Gaussian white noise, to a low-energy region using gradient descent. We place this model in the framework of normalizing flows, showing how it can often overfit by losing an unnecessary amount of entropy during the descent. As a remedy, we propose a mean-field microcanonical gradient descent that samples several weakly coupled data points simultaneously, allowing for better control of the entropy loss while paying little in terms of likelihood fit. We study these models in the context of financial time series, illustrating the improvements on both synthetic and real data. ...
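A minimal sketch of the descent, assuming a toy energy built from batch moments so that the gradient couples every sample in the batch; the paper's energy functions are richer than these two moments, and the moment targets below are invented:

```python
import numpy as np

# Transport Gaussian white noise toward a low-energy region by gradient
# descent on a batch-coupled (illustrative) energy.
def energy(x, target_mean, target_var):
    return (x.mean() - target_mean) ** 2 + (x.var() - target_var) ** 2

def grad_energy(x, target_mean, target_var):
    n = len(x)
    g_mean = 2.0 * (x.mean() - target_mean) / n
    g_var = 4.0 * (x.var() - target_var) * (x - x.mean()) / n
    return g_mean + g_var

rng = np.random.default_rng(5)
x = rng.standard_normal(512)                    # high-entropy initialization
tm, tv = 0.001, 0.0004                          # toy daily mean / variance targets
e0 = energy(x, tm, tv)
for _ in range(2000):
    x = x - grad_energy(x, tm, tv)              # unit step size
e1 = energy(x, tm, tv)
print("energy:", float(e0), "->", float(e1))
```

Because the energy depends only on batch statistics, individual points retain freedom as the batch descends; controlling how much of that freedom (entropy) is lost is the concern the mean-field variant addresses.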

March 13, 2024 · 2 min · Research Team