Synthetic Data for Portfolios: A Throw of the Dice Will Never Abolish Chance
ArXiv ID: 2501.03993
Authors: Unknown
Abstract
Simulation methods have always been instrumental in finance, and data-driven methods with minimal model specification, commonly referred to as generative models, have attracted increasing attention, especially after the success of deep learning in a broad range of fields. However, the adoption of these models in financial applications has not matched the growing interest, probably due to the unique complexities and challenges of financial markets. This paper contributes to a deeper understanding of the limitations of generative models, particularly in portfolio and risk management. To this end, we begin by presenting theoretical results on the importance of initial sample size, and point out the potential pitfalls of generating far more data than originally available. We then highlight the inseparable nature of model development and the desired uses by touching on a paradox: usual generative models inherently care less about what is important for constructing portfolios (in particular the long-short ones). Based on these findings, we propose a pipeline for the generation of multivariate returns that meets conventional evaluation standards on a large universe of US equities while being compliant with stylized facts observed in asset returns and turning around the pitfalls we previously identified. Moreover, we insist on the need for more accurate evaluation methods, and suggest, through an example of mean-reversion strategies, a method designed to identify poor models for a given application based on regurgitative training, i.e. retraining the model using the data it has itself generated, which is commonly referred to in statistics as identifiability.
Keywords: Generative Models, Portfolio Management, Risk Management, Multivariate Returns, Regurgitative Training, Equities / General Asset Management
Complexity vs Empirical Score
- Math Complexity: 7.0/10
- Empirical Rigor: 8.0/10
- Quadrant: Holy Grail
- Why: The paper draws on advanced mathematics, including generative models (GANs, VAEs), high-dimensional statistics, and theoretical results on sample size and identifiability, which accounts for the high math-complexity score. Its empirical rigor comes from proposing a specific pipeline, evaluating it on a large universe of US equities (S&P 500), and introducing an evaluation method based on regurgitative training, backtested on mean-reversion strategies.
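The abstract's point about initial sample size can be illustrated with a toy experiment. The sketch below is not the paper's theorem or pipeline: it assumes a one-dimensional Gaussian "generative model" fitted by sample mean and standard deviation, and the function name `synthetic_mean_error` and all parameter values are hypothetical. It compares the error of the mean estimated from the original sample with the error of the mean estimated from a much larger synthetic sample drawn from the fitted model.

```python
import numpy as np


def synthetic_mean_error(n_real, n_synth, n_trials=2000, seed=0):
    """Return (RMSE of real-sample mean, RMSE of synthetic-sample mean).

    The true distribution is N(0, 1), so the true mean is 0 and each
    estimated mean is itself the estimation error.
    """
    rng = np.random.default_rng(seed)
    err_real = np.empty(n_trials)
    err_synth = np.empty(n_trials)
    for t in range(n_trials):
        # Small "historical" sample from the true distribution.
        real = rng.standard_normal(n_real)
        # Toy generative model: a Gaussian fitted to the small sample.
        mu_hat, sigma_hat = real.mean(), real.std(ddof=1)
        # Generate far more synthetic points than were originally available.
        synth = rng.normal(mu_hat, sigma_hat, size=n_synth)
        err_real[t] = real.mean()
        err_synth[t] = synth.mean()

    def rmse(e):
        return float(np.sqrt(np.mean(e ** 2)))

    return rmse(err_real), rmse(err_synth)


if __name__ == "__main__":
    rmse_real, rmse_synth = synthetic_mean_error(n_real=50, n_synth=5000)
    print(f"RMSE from 50 real points:        {rmse_real:.4f}")
    print(f"RMSE from 5000 synthetic points: {rmse_synth:.4f}")
```

Under these assumptions both errors come out at roughly 1/sqrt(50), about 0.14, rather than the 1/sqrt(5000) one might naively expect from the larger sample: the synthetic data inherits the estimation error of the model that produced it.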
```mermaid
flowchart TD
A["Research Goal:<br>Understand limitations of<br>generative models in finance"] --> B{"Literature &<br>Theoretical Analysis"}
B --> C["Pitfalls of Oversampling<br>and Mismatched Objectives"]
B --> D["Key Paradox Identified:<br>Generative models ignore<br>portfolio-specific features"]
C --> E["Proposed Solution:<br>Application-Compliant<br>Generative Pipeline"]
D --> E
E --> F{"Validation &<br>Evaluation"}
F --> G["Regurgitative Training<br>Method"]
F --> H["Conventional Metrics<br>on US Equities"]
G & H --> I["Key Findings:<br>1. Model & Use-case are inseparable<br>2. Generation should preserve<br>stylized facts<br>3. Regurgitative testing is essential"]
style A fill:#e1f5fe
style I fill:#e8f5e8
```
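Regurgitative training, described in the abstract as retraining a model on the data it has itself generated, can be sketched as a simple loop. The example below is only a toy under stated assumptions: it uses an AR(1) process with a negative coefficient as a stand-in for mean-reverting returns, fits it by least squares, and the helper names (`simulate_ar1`, `fit_ar1`, `regurgitate`) are hypothetical. The paper's actual models and evaluation criteria are not specified in this summary.

```python
import numpy as np


def simulate_ar1(phi, sigma, n, rng):
    """Simulate x[t] = phi * x[t-1] + eps[t], with eps ~ N(0, sigma^2)."""
    x = np.empty(n)
    x[0] = rng.normal(0.0, sigma)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)
    return x


def fit_ar1(x):
    """Least-squares estimate of (phi, sigma) for a zero-mean AR(1)."""
    prev, curr = x[:-1], x[1:]
    phi = float(prev @ curr / (prev @ prev))
    resid = curr - phi * prev
    sigma = float(resid.std(ddof=1))
    return phi, sigma


def regurgitate(x, generations, rng):
    """Fit on x, then repeatedly generate from the fit and refit.

    Returns the list of fitted (phi, sigma) pairs, starting with the
    fit on the original data.
    """
    params = [fit_ar1(x)]
    for _ in range(generations):
        phi, sigma = params[-1]
        x = simulate_ar1(phi, sigma, len(x), rng)  # model's own output
        params.append(fit_ar1(x))                  # retrain on it
    return params


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = simulate_ar1(phi=-0.3, sigma=0.01, n=2000, rng=rng)
    for g, (phi, sigma) in enumerate(regurgitate(original, 5, rng)):
        print(f"generation {g}: phi={phi:+.3f}, sigma={sigma:.4f}")
```

In this toy setting the quantity a mean-reversion strategy depends on is `phi`, so tracking how it moves across generations gives a rough sense of whether the model preserves the feature the application needs. A model whose refitted parameters move far from the original fit would be a candidate for rejection for that use.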