High-Dimensional Learning in Finance

ArXiv ID: 2506.03780

Authors: Hasan Fallahgoul

Abstract

Recent advances in machine learning have shown promising results for financial prediction using large, over-parameterized models. This paper provides theoretical foundations and empirical validation for understanding when and how these methods achieve predictive success. I examine two key aspects of high-dimensional learning in finance. First, I prove that within-sample standardization in Random Fourier Features implementations fundamentally alters the underlying Gaussian kernel approximation, replacing shift-invariant kernels with training-set-dependent alternatives. Second, I establish information-theoretic lower bounds that identify when reliable learning is impossible no matter how sophisticated the estimator. A detailed quantitative calibration of the polynomial lower bound shows that with typical parameter choices, e.g., 12,000 features, 12 monthly observations, and an R² of 2-3%, the required sample size to escape the bound exceeds 25-30 years of data, well beyond any rolling window actually used. Thus, observed out-of-sample success must originate from lower-complexity artefacts rather than from the intended high-dimensional mechanism.
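The first result can be illustrated numerically. The sketch below (a minimal, hypothetical setup, not the paper's code) builds standard Random Fourier Features for a Gaussian kernel, then z-scores each feature column over the sample, mimicking the within-sample standardization step. The standardized Gram matrix visibly departs from the shift-invariant Gaussian kernel, since the column means and variances it subtracts and divides by depend on the training set itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D = 200, 5, 4000            # samples, input dim, number of random features
sigma = np.sqrt(d)                # bandwidth chosen so kernel values are O(1)

X = rng.normal(size=(n, d))

# Random Fourier Features for k(x, y) = exp(-||x - y||^2 / (2 sigma^2)):
# z(x) = sqrt(2/D) * cos(W^T x + b), with W_ij ~ N(0, 1/sigma^2), b ~ U[0, 2pi]
W = rng.normal(scale=1.0 / sigma, size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Exact Gaussian Gram matrix for comparison
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma**2))

# (1) Plain RFF: Z Z^T converges to the shift-invariant kernel as D grows
err_plain = np.abs(Z @ Z.T - K_exact).max()

# (2) Within-sample standardization: z-score each feature column over the
# sample, then rescale so the Gram diagonal stays O(1)
Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0)
K_std = (Zs @ Zs.T) / D
err_std = np.abs(K_std - K_exact).max()

print(f"max |K_rff - K_exact| (plain):        {err_plain:.3f}")
print(f"max |K_rff - K_exact| (standardized): {err_std:.3f}")
```

With these settings the plain RFF Gram matrix tracks the Gaussian kernel closely, while the standardized one does not: standardization implicitly centers and rescales the kernel using sample statistics, which is the training-set dependence the paper formalizes.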

Keywords: Random Fourier Features, kernel approximation, information-theoretic bounds, high-dimensional learning, sample size requirements, General Financial Markets

Complexity vs Empirical Score

  • Math Complexity: 9.0/10
  • Empirical Rigor: 4.0/10
  • Quadrant: Lab Rats
  • Why: The paper is dense with advanced theoretical constructs including PAC-learning theory, information-theoretic bounds, random matrix theory, and detailed proofs on kernel approximation properties, placing it at the high end of mathematical complexity. While it includes quantitative calibration and numerical validation, the summary and excerpt focus on theoretical proofs and bounds rather than extensive backtesting, code, or dataset implementation, resulting in moderate empirical rigor.
```mermaid
flowchart TD
    A["Research Goal<br>Understanding High-Dim ML Success in Finance"] --> B{"Key Methodology<br>1. RFF Kernel Analysis<br>2. Info-Theoretic Bounds"}
    B --> C["Data/Inputs<br>12k Features, 12 Monthly Observations"]
    C --> D["Computational Process<br>Quantitative Calibration of Bounds"]
    D --> E{"Outcome: Required Sample Size"}
    E -->|"&gt; 25-30 Years"| F["Key Finding<br>Success from Low-Dim Artefacts"]
    E -->|"&lt;= 12 Months"| G["False Success Scenario<br>Unachievable Requirements"]
    F --> H["Conclusion<br>Targeted Low-Complexity Models"]
    G --> H
```
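The calibration gap summarized above reduces to simple arithmetic using only the figures quoted in the abstract: the bound requires 25-30 years of data, while a typical rolling window supplies 12 monthly observations.

```python
# Back-of-envelope comparison using the paper's quoted calibration figures.
required_years = (25, 30)    # sample size needed to escape the lower bound
window_months = 12           # typical rolling estimation window

required_months = tuple(12 * y for y in required_years)
shortfall = tuple(m / window_months for m in required_months)

print(required_months)   # (300, 360)
print(shortfall)         # (25.0, 30.0): the window is 25-30x too short
```

In other words, the data actually available to each rolling fit falls short of the information-theoretic requirement by more than an order of magnitude, which is why the paper attributes observed out-of-sample gains to lower-complexity artefacts.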