Scaling Conditional Autoencoders for Portfolio Optimization via Uncertainty-Aware Factor Selection

ArXiv ID: 2511.17462

Authors: Ryan Engel, Yu Chen, Pawel Polak, Ioana Boier

Abstract

Conditional Autoencoders (CAEs) offer a flexible, interpretable approach for estimating latent asset-pricing factors from firm characteristics. However, existing studies usually limit the latent factor dimension to around K=5 due to concerns that larger K can degrade performance. To overcome this limitation, we propose a scalable framework that couples a high-dimensional CAE with an uncertainty-aware factor selection procedure. We employ three models for quantile prediction: zero-shot Chronos, a pretrained time-series foundation model (ZS-Chronos); gradient-boosted quantile regression trees using XGBoost and RAPIDS (Q-Boost); and an i.i.d. bootstrap-based sample mean model (IID-BS). For each model, we rank factors by forecast uncertainty and retain the top-k most predictable factors for portfolio construction, where k denotes the number of factors retained. This pruning strategy delivers substantial gains in risk-adjusted performance across all forecasting models. Furthermore, because the models' predictions are uncorrelated with one another, a performance-weighted ensemble consistently outperforms the individual models, achieving higher Sharpe, Sortino, and Omega ratios.
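
The risk-adjusted metrics and the performance-weighted ensemble mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the weighting scheme, so everything here is an assumption made for illustration: a zero risk-free rate, monthly returns (`periods_per_year=12`), and ensemble weights proportional to each model's Sharpe ratio clipped at zero. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def sharpe(returns, periods_per_year=12):
    """Annualized Sharpe ratio (assumes a zero risk-free rate)."""
    r = np.asarray(returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)

def sortino(returns, periods_per_year=12, target=0.0):
    """Annualized Sortino ratio: mean excess over target / downside deviation."""
    r = np.asarray(returns, dtype=float)
    downside = np.minimum(r - target, 0.0)          # keep only shortfalls
    downside_dev = np.sqrt(np.mean(downside ** 2))
    return np.sqrt(periods_per_year) * (r.mean() - target) / downside_dev

def omega(returns, threshold=0.0):
    """Omega ratio: total gains above the threshold / total losses below it."""
    r = np.asarray(returns, dtype=float) - threshold
    return r[r > 0].sum() / -r[r < 0].sum()

def performance_weights(model_returns):
    """Assumed scheme: weights proportional to each model's Sharpe ratio,
    clipped at zero and normalized to sum to 1 (equal weights as a fallback)."""
    scores = np.array([max(sharpe(r), 0.0) for r in model_returns])
    if scores.sum() == 0:
        return np.full(len(scores), 1.0 / len(scores))
    return scores / scores.sum()

# Usage with synthetic return series standing in for the three models.
rng = np.random.default_rng(0)
model_returns = [rng.normal(0.01, 0.04, size=120) for _ in range(3)]
weights = performance_weights(model_returns)
ensemble = sum(w * r for w, r in zip(weights, model_returns))
```

In a real backtest the weights would have to be estimated on a trailing window that ends before the period in which they are applied, otherwise the ensemble's performance is inflated by look-ahead.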

Keywords: Conditional Autoencoders (CAEs), Asset-pricing factors, Quantile regression, XGBoost, Ensemble learning, Equities (Stocks)

Complexity vs Empirical Score

  • Math Complexity: 8.5/10
  • Empirical Rigor: 9.0/10
  • Quadrant: Holy Grail
  • Why: The paper combines advanced statistical learning concepts, including a high-dimensional conditional autoencoder (CAE) and uncertainty-aware factor selection based on quantile forecasts, one source of which is a pretrained foundation model, which makes it mathematically dense. It is also highly empirical: it backtests specific models (Chronos, XGBoost, bootstrap) on financial time-series data, reports risk-adjusted metrics (Sharpe, Sortino, Omega ratios), and gives implementation details (RAPIDS, ensemble methods).

  flowchart TD
    A["Research Goal: Scale Conditional Autoencoders<br>for Portfolio Optimization via<br>Uncertainty-Aware Factor Selection"] --> B["Input: High-Dimensional Firm Characteristics<br>and Asset Returns"]
    B --> C["Step 1: Generate Latent Factors<br>using Conditional Autoencoder (CAE)"]
    C --> D["Step 2: Quantile Prediction & Uncertainty Estimation"]
    subgraph D ["Forecasting Models"]
        D1["Zero-shot Chronos (ZS-Chronos)<br>Pretrained Time-Series Foundation"]
        D2["Quantile Gradient Boosting (Q-Boost)<br>XGBoost & RAPIDS"]
        D3["IID Bootstrap Sample Mean (IID-BS)<br>Baseline Model"]
    end
    D --> E["Step 3: Uncertainty-Aware Factor Selection<br>Rank factors by forecast uncertainty<br>Retain top-k most predictable factors"]
    E --> F["Step 4: Portfolio Construction & Evaluation"]
    F --> G["Key Outcomes & Findings"]
    G --> G1["Substantial gains in risk-adjusted performance<br>across all models"]
    G --> G2["Performance-weighted ensemble<br>outperforms individuals"]
    G --> G3["Higher Sharpe, Sortino, and Omega ratios<br>achieved via pruning strategy"]
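
Steps 2 and 3 of the flowchart can be sketched in Python for the simplest of the three forecasters, the i.i.d. bootstrap sample mean (IID-BS). This is a minimal illustration under stated assumptions, not the authors' implementation: forecast uncertainty is taken to be the width of the interval between the lowest and highest predicted quantiles, the quantile levels (0.1, 0.5, 0.9) and the number of bootstrap draws are arbitrary choices, and the function names are hypothetical.

```python
import numpy as np

def iid_bootstrap_mean_quantiles(history, quantiles=(0.1, 0.5, 0.9),
                                 n_boot=1000, seed=0):
    """Quantiles of the i.i.d.-bootstrap distribution of the sample mean.

    history: array of shape (T, K) holding T past returns of K latent factors.
    Returns an array of shape (len(quantiles), K).
    """
    rng = np.random.default_rng(seed)
    T, _ = history.shape
    idx = rng.integers(0, T, size=(n_boot, T))   # resample periods with replacement
    boot_means = history[idx].mean(axis=1)       # shape (n_boot, K)
    return np.quantile(boot_means, quantiles, axis=0)

def select_top_k(quantile_forecasts, k):
    """Indices of the k factors with the narrowest forecast interval
    (assumed uncertainty measure: highest minus lowest predicted quantile)."""
    width = quantile_forecasts[-1] - quantile_forecasts[0]
    return np.argsort(width)[:k]

# Usage: 10 synthetic factors whose volatility increases with the factor index.
rng = np.random.default_rng(42)
vols = np.linspace(0.01, 0.10, 10)
history = rng.normal(0.0, vols, size=(120, 10))
q = iid_bootstrap_mean_quantiles(history)
selected = select_top_k(q, k=3)                  # indices passed on to Step 4
```

The same `select_top_k` step would apply unchanged to quantile forecasts from ZS-Chronos or Q-Boost, since it only needs an array of predicted quantiles per factor.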