Holdout cross-validation for large non-Gaussian covariance matrix estimation using Weingarten calculus
ArXiv ID: 2509.13923
Authors: Lamia Lamrani, Benoît Collins, Jean-Philippe Bouchaud
Abstract
Cross-validation is one of the most widely used methods for model selection and evaluation; its efficiency for large covariance matrix estimation appears robust in practice, but little is known about the theoretical behavior of its error. In this paper, we derive the expected Frobenius error of the holdout method, a particular cross-validation procedure that involves a single train and test split, for a generic rotationally invariant multiplicative noise model, thereby extending previous results to non-Gaussian data distributions. Our approach uses the Weingarten calculus and the Ledoit-Péché formula to derive the oracle eigenvalues in the high-dimensional limit. When the population covariance matrix follows an inverse Wishart distribution, we approximate the expected holdout error by approximating the oracle eigenvalues, first with a linear shrinkage, then with a quadratic shrinkage. Under the linear approximation, we find that the optimal train-test split ratio is proportional to the square root of the matrix dimension. We then run Monte Carlo simulations of the holdout error for different distributions of the norm of the noise, such as the Gaussian, Student, and Laplace distributions, and observe that the quadratic approximation yields a substantial improvement, especially around the optimal train-test split ratio. We also observe that a higher fourth-order moment of the Euclidean norm of the noise vector sharpens the holdout error curve near the optimal split and lowers the ideal train-test ratio, making the choice of the train-test ratio more important when performing the holdout method.
Keywords: cross-validation, covariance matrix estimation, high-dimensional statistics, Monte Carlo simulation, Weingarten calculus, General
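To make the setup in the abstract concrete, here is a minimal Monte Carlo sketch of a holdout error computation. It is an illustration under stated assumptions, not the authors' code: the estimator form (eigenvectors from the train covariance, eigenvalues re-estimated on the test covariance) is the usual holdout construction and is assumed here, the population matrix is only inverse-Wishart-like (the inverse of a normalized Wishart draw, rescaled to unit average eigenvalue), and every function name and default parameter is hypothetical.

```python
import numpy as np


def sample_noise(rng, n_samples, dim, dof=None):
    """Rotationally invariant noise rows with identity covariance.

    dof=None gives Gaussian noise; dof > 2 gives a multivariate Student
    law (Gaussian direction times a random radial scale), rescaled so
    that its covariance is the identity.
    """
    g = rng.standard_normal((n_samples, dim))
    if dof is None:
        return g
    scale = np.sqrt((dof - 2.0) / rng.chisquare(dof, size=(n_samples, 1)))
    return g * scale


def inverse_wishart_like(rng, dim, n_aux):
    """Assumed population covariance: inverse of a Wishart draw, trace = dim."""
    g = rng.standard_normal((dim, n_aux))
    sigma = np.linalg.inv(g @ g.T / n_aux)
    return sigma * dim / np.trace(sigma)


def holdout_error(x, sigma, n_train):
    """Normalized squared Frobenius error of the holdout estimator."""
    dim = sigma.shape[0]
    x_train, x_test = x[:n_train], x[n_train:]
    e_train = x_train.T @ x_train / len(x_train)
    e_test = x_test.T @ x_test / len(x_test)
    # Eigenvectors come from the train split only.
    _, vecs = np.linalg.eigh(e_train)
    # Holdout eigenvalues: xi_i = v_i^T E_test v_i for each train eigenvector.
    xi = np.einsum("ji,jk,ki->i", vecs, e_test, vecs)
    estimate = (vecs * xi) @ vecs.T
    return np.sum((estimate - sigma) ** 2) / dim


def mean_holdout_error(n_train, dim=50, n_samples=400, dof=None,
                       n_trials=20, seed=0):
    """Average holdout error over independent draws of sigma and data."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        sigma = inverse_wishart_like(rng, dim, 2 * dim)
        chol = np.linalg.cholesky(sigma)
        # Multiplicative noise model: x_t = sigma^{1/2} z_t.
        x = sample_noise(rng, n_samples, dim, dof) @ chol.T
        errors.append(holdout_error(x, sigma, n_train))
    return float(np.mean(errors))


if __name__ == "__main__":
    # Scan a few train sizes for Gaussian and Student (dof=5) noise.
    for n_train in (200, 300, 350, 380):
        gauss = mean_holdout_error(n_train)
        student = mean_holdout_error(n_train, dof=5.0)
        print(n_train, round(gauss, 4), round(student, 4))
```

Scanning `n_train` at fixed `n_samples` traces out an empirical holdout error curve, which is the kind of curve the abstract compares against the linear and quadratic shrinkage approximations. The dimensions and trial counts above are arbitrary small values chosen so the script runs quickly.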
Complexity vs Empirical Score
- Math Complexity: 8.5/10
- Empirical Rigor: 3.0/10
- Quadrant: Lab Rats
- Why: The paper employs advanced random matrix theory, Weingarten calculus, and detailed asymptotic derivations for error estimation and optimal split ratios, indicating high mathematical complexity. However, it relies solely on Monte Carlo simulations for validation without presenting backtests, real financial datasets, or implementation-heavy metrics, resulting in lower empirical rigor.
flowchart TD
Goal["Research Goal:<br>Assess theoretical behavior of<br>holdout cross-validation error<br>for large non-Gaussian<br>covariance matrix estimation"]
Methodology["Key Methodology:<br>Weingarten calculus &<br>Ledoit-Péché formula<br>for oracle eigenvalue derivation"]
Inputs["Data/Inputs:<br>Rotationally invariant<br>multiplicative noise model<br>(Gaussian, Student, Laplace)"]
Process["Computational Process:<br>1. Derive expected Frobenius error<br>2. Apply linear/quadratic shrinkage<br>3. Monte Carlo simulations"]
Outcomes["Key Findings:<br>• Optimal split ratio ∝ √dimension<br>• Quadratic approx. improves accuracy<br>• Higher 4th moment sharpens error curve<br>• Higher 4th moment lowers optimal split ratio"]
Goal --> Methodology
Methodology --> Inputs
Inputs --> Process
Process --> Outcomes