Synthetic Financial Data Generation for Enhanced Financial Modelling

ArXiv ID: 2512.21791

Authors: Christophe D. Hounwanou, Yae Ulrich Gaba, Pierre Ntakirutimana

Abstract

Data scarcity and confidentiality in finance often impede model development and robust testing. This paper presents a unified multi-criteria evaluation framework for synthetic financial data and applies it to three representative generative paradigms: the statistical ARIMA-GARCH baseline, Variational Autoencoders (VAEs), and Time-series Generative Adversarial Networks (TimeGAN). Using historical S&P 500 daily data, we evaluate fidelity (Maximum Mean Discrepancy, MMD), temporal structure (autocorrelation and volatility clustering), and practical utility in downstream tasks, specifically mean-variance portfolio optimization and volatility forecasting. Empirical results indicate that ARIMA-GARCH captures linear trends and conditional volatility but fails to reproduce nonlinear dynamics; VAEs produce smooth trajectories that underestimate extreme events; and TimeGAN achieves the best trade-off between realism and temporal coherence (e.g., TimeGAN attained the lowest MMD: 1.84e-3, average over 5 seeds). Finally, we articulate practical guidelines for selecting generative models according to application needs and computational constraints. Our unified evaluation protocol and reproducible codebase aim to standardize benchmarking in synthetic financial data research.
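To make the fidelity criterion concrete, the sketch below computes a biased estimate of squared MMD between a "real" and a "synthetic" sample. The abstract does not state which kernel or bandwidth the authors use, so the Gaussian kernel and the median-heuristic bandwidth here are assumptions, as are the function name `rbf_mmd2` and the simulated stand-in data (no S&P 500 data is used).

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth=None):
    """Biased (V-statistic) estimate of squared MMD with a Gaussian kernel.

    Assumption: kernel choice and bandwidth rule are illustrative only.
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    z = np.vstack([x, y])

    # Pairwise squared Euclidean distances over the pooled sample.
    sq = np.sum(z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (z @ z.T), 0.0)

    if bandwidth is None:
        # Median heuristic on the pooled sample (an assumed default).
        bandwidth = np.sqrt(0.5 * np.median(d2[d2 > 0]))

    k = np.exp(-d2 / (2.0 * bandwidth ** 2))
    n = len(x)
    k_xx, k_yy, k_xy = k[:n, :n], k[n:, n:], k[:n, n:]
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

# Hypothetical stand-in data: heavy-tailed "real" returns,
# a second draw from the same law, and a mismatched Gaussian sample.
rng = np.random.default_rng(0)
real = 0.01 * rng.standard_t(df=4, size=(500, 1))
close = 0.01 * rng.standard_t(df=4, size=(500, 1))
far = rng.normal(0.0, 0.05, size=(500, 1))

mmd_close = rbf_mmd2(real, close)
mmd_far = rbf_mmd2(real, far)
```

A lower value means the two samples are harder to tell apart under the chosen kernel, so `mmd_close` should come out smaller than `mmd_far`. Values from this sketch are not comparable to the 1.84e-3 reported in the abstract, since the kernel, bandwidth, and data differ.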

Keywords: synthetic data, generative models, portfolio optimization, evaluation framework, risk modeling

Complexity vs Empirical Score

  • Math Complexity: 8.5/10
  • Empirical Rigor: 9.0/10
  • Quadrant: Holy Grail
  • Why: The paper employs advanced mathematical frameworks including generative models (VAE, TimeGAN) with formal objectives (ELBO, adversarial losses) and statistical measures (MMD, autocorrelation), while also presenting rigorous empirical benchmarks with specific quantitative results on S&P 500 data and a reproducible codebase.
  flowchart TD
    Start["Research Goal: Evaluate Synthetic Financial Data Generators for Financial Modelling"]
    
    Data["Data: Historical S&P 500 Daily Data"]
    
    Models["Methodology: Three Generative Paradigms<br/>1. ARIMA-GARCH (Statistical Baseline)<br/>2. Variational Autoencoder (VAE)<br/>3. Time-series GAN (TimeGAN)"]
    
    Process["Computational Evaluation<br/>Fidelity: Maximum Mean Discrepancy (MMD)<br/>Structure: Autocorrelation & Volatility Clustering<br/>Utility: Portfolio Optimization & Volatility Forecasting"]
    
    Outcomes["Key Findings:<br/>ARIMA-GARCH: Captured linear trends, missed nonlinear dynamics<br/>VAE: Smooth trajectories, underestimated extreme events<br/>TimeGAN: Best trade-off (Lowest MMD: 1.84e-3), high temporal coherence"]
    
    Start --> Data
    Data --> Models
    Models --> Process
    Process --> Outcomes
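
The "Structure" step in the flowchart checks autocorrelation and volatility clustering. A minimal sketch of such a check, assuming it is done with the sample autocorrelation of returns and of squared returns, is shown below. The GARCH(1,1) simulator, its parameter values, and the helper names `acf`, `simulate_garch11`, and `vol_clustering_score` are all hypothetical stand-ins, not the paper's code or data.

```python
import numpy as np

def acf(series, max_lag=20):
    """Sample autocorrelation at lags 1..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom
                     for k in range(1, max_lag + 1)])

def simulate_garch11(n, omega=1e-6, alpha=0.1, beta=0.85, seed=0):
    """Simulate GARCH(1,1) returns (illustrative parameters, assumed)."""
    rng = np.random.default_rng(seed)
    r = np.empty(n)
    var = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    for t in range(n):
        r[t] = np.sqrt(var) * rng.standard_normal()
        var = omega + alpha * r[t] ** 2 + beta * var
    return r

def vol_clustering_score(returns, max_lag=10):
    """Mean autocorrelation of squared returns over the first max_lag lags."""
    return acf(np.asarray(returns) ** 2, max_lag).mean()

# Clustered series versus an i.i.d. Gaussian series of matching scale.
garch_returns = simulate_garch11(5000)
iid_returns = np.random.default_rng(1).normal(
    0.0, garch_returns.std(), size=5000)

score_garch = vol_clustering_score(garch_returns)
score_iid = vol_clustering_score(iid_returns)
```

Under this sketch, a generator that reproduces volatility clustering should give squared-return autocorrelations close to those of the real series, while one that smooths out clustering should score near the i.i.d. level.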