Variable selection for minimum-variance portfolios

ArXiv ID: 2508.14986 (https://arxiv.org/abs/2508.14986)

Authors: Guilherme V. Moura, André P. Santos, Hudson S. Torrent

Abstract

Machine learning (ML) methods have been successfully employed in identifying variables that can predict the equity premium of individual stocks. In this paper, we investigate whether ML can also be helpful in selecting variables relevant for optimal portfolio choice. To address this question, we parameterize minimum-variance portfolio weights as a function of a large pool of firm-level characteristics as well as their second-order and cross-product transformations, yielding a total of 4,610 predictors. We find that the gains from employing ML to select relevant predictors are substantial: minimum-variance portfolios achieve lower risk relative to sparse specifications commonly considered in the literature, especially when non-linear terms are added to the predictor space. Moreover, some of the selected predictors that help decrease portfolio risk also increase returns, leading to minimum-variance portfolios with good performance in terms of Sharpe ratios in some situations. Our evidence suggests that ad-hoc sparsity can be detrimental to the performance of minimum-variance characteristics-based portfolios.
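
To make the weight parameterization described in the abstract concrete, here is a minimal sketch: weights are an equal-weight benchmark tilted by a linear function of expanded firm characteristics (levels, squares, and pairwise cross-products), and the coefficient vector is chosen to minimize in-sample portfolio variance. The toy data, the specific weight rule w_{i,t} = 1/N + θ'z_{i,t}/N, and the use of scipy.optimize are illustrative assumptions, not the paper's exact estimator or predictor construction.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy panel: T months, N stocks, K standardized firm characteristics.
T, N, K = 120, 50, 5
X = rng.standard_normal((T, N, K))        # characteristics (assumed cross-sectionally standardized)
R = 0.01 * rng.standard_normal((T, N))    # stock returns

def expand(x):
    """Augment characteristics with squares and pairwise cross-products
    (the second-order / cross-product transformations described in the abstract)."""
    n, k = x.shape
    squares = x ** 2
    crosses = np.stack([x[:, i] * x[:, j] for i in range(k) for j in range(i + 1, k)], axis=1)
    return np.concatenate([x, squares, crosses], axis=1)

Z = np.stack([expand(X[t]) for t in range(T)])   # (T, N, P) expanded predictors
P = Z.shape[2]

def port_returns(theta):
    # Parametric weights: equal-weight benchmark tilted by characteristics.
    w = 1.0 / N + Z @ theta / N                  # (T, N)
    return np.sum(w * R, axis=1)                 # (T,)

def objective(theta, lam=0.0):
    rp = port_returns(theta)
    return rp.var() + lam * np.abs(theta).sum()  # sample variance (+ optional L1 penalty, off by default)

theta_hat = minimize(objective, np.zeros(P), method="L-BFGS-B").x
print("in-sample portfolio volatility:", port_returns(theta_hat).std())
```

In the paper's setting the base characteristic set is much larger, so the expanded predictor space grows to 4,610 candidate terms and variable selection becomes essential.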

Keywords: Machine Learning (ML), Minimum-Variance Portfolio, Variable Selection, Predictor Transformations, Risk-Return Optimization, Equities

Complexity vs Empirical Score

  • Math Complexity: 8.0/10
  • Empirical Rigor: 7.0/10
  • Quadrant: Holy Grail
  • Why: The paper employs advanced econometric techniques such as Bayesian variable selection with horseshoe priors and high-dimensional penalized regressions, which accounts for the high math complexity. Empirical rigor is strong thanks to the large predictor set (4,610 predictors), out-of-sample testing, and direct application to portfolio construction, though it remains a research paper rather than a fully production-ready system. A stylized sketch of the penalized selection step is given below.
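
The sketch below casts the variance-minimization problem as a penalized regression: under the parametric weight rule, portfolio variance depends on the returns of "characteristic portfolios" f_t, and minimizing it over θ is equivalent to a least-squares regression of the demeaned benchmark return on the demeaned f_t (with a sign flip), so an L1 penalty performs variable selection. The toy data and the use of scikit-learn's LassoCV are assumptions for illustration; the paper's own approach relies on horseshoe priors and related high-dimensional penalized estimators.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Toy panel mirroring the earlier sketch: T months, N stocks, P expanded predictors.
rng = np.random.default_rng(1)
T, N, P = 120, 50, 200
Z = rng.standard_normal((T, N, P))            # expanded (and standardized) predictors
R = 0.01 * rng.standard_normal((T, N))        # stock returns

bench = R.mean(axis=1)                        # equal-weight benchmark return each month
F = np.einsum("tnp,tn->tp", Z, R) / N         # characteristic portfolio returns f_t = Z_t' r_t / N

# Minimizing Var(bench_t + theta' f_t) over theta is a least-squares problem:
# regress the demeaned benchmark on the demeaned characteristic returns (sign-flipped).
# Adding an L1 penalty then performs variable selection, loosely in the spirit of the
# paper's penalized regressions (its Bayesian horseshoe-prior variant is not shown).
y = bench - bench.mean()
Xf = F - F.mean(axis=0)

lasso = LassoCV(cv=5, fit_intercept=False).fit(Xf, y)
theta_sparse = -lasso.coef_                   # map regression coefficients back to theta
selected = np.flatnonzero(theta_sparse)
print(f"{selected.size} of {P} candidate predictors retained")
```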

Paper Overview

```mermaid
flowchart TD
    A["Research Goal: Does ML improve<br>Minimum-Variance Portfolio selection?"]

    subgraph B["Data & Inputs"]
        B1["Large pool of<br>firm-level characteristics"]
        B2["Transformations:<br>Second-order & cross-products"]
        B3["Total: 4,610 Predictors"]
        B1 --> B2
        B2 --> B3
    end

    subgraph C["Methodology"]
        C1["Parameterize portfolio weights<br>as function of predictors"]
        C2["Apply Machine Learning<br>for variable selection"]
        C3["Compare against<br>sparse benchmark models"]
        C1 --> C2
        C2 --> C3
    end

    subgraph D["Outcomes"]
        D1["Substantial risk reduction<br>vs. sparse specifications"]
        D2["Non-linear terms provide<br>additional gains"]
        D3["Some predictors boost both<br>risk reduction and returns"]
        D4["Improved Sharpe ratios<br>in some cases"]
        D5["Ad-hoc sparsity is<br>detrimental to performance"]
    end

    A --> B
    B --> C
    C --> D

    style A fill:#e1f5fe
    style D fill:#e8f5e8
```