Is All the Information in the Price? LLM Embeddings versus the EMH in Stock Clustering

ArXiv ID: 2509.01590

Authors: Bingyang Wang, Grant Johnson, Maria Hybinette, Tucker Balch

Abstract

This paper investigates whether artificial intelligence can enhance stock clustering compared to traditional methods. We consider this in the context of the semi-strong Efficient Markets Hypothesis (EMH), which posits that prices fully reflect all public information and, accordingly, that clusters based on price information cannot be improved upon. We benchmark three clustering approaches: (i) price-based clusters derived from historical return correlations, (ii) human-informed clusters defined by the Global Industry Classification Standard (GICS), and (iii) AI-driven clusters constructed from large language model (LLM) embeddings of stock-related news headlines. At each date, each method provides a classification in which each stock is assigned to a cluster. To evaluate a clustering, we transform it into a synthetic factor model following the Arbitrage Pricing Theory (APT) framework. This enables consistent evaluation of predictive performance in a roll-forward, out-of-sample test. Using S&P 500 constituents from 2022 through 2024, we find that price-based clustering consistently outperforms both rule-based and AI-based methods, reducing root mean squared error (RMSE) by 15.9% relative to GICS and 14.7% relative to LLM embeddings. Our contributions are threefold: (i) a generalizable methodology that converts any equity grouping (manual, machine, or market-driven) into a real-time factor model for evaluation; (ii) the first direct comparison of price-based, human rule-based, and AI-based clustering under identical conditions; and (iii) empirical evidence reinforcing that short-horizon return information is largely contained in prices. These results support the EMH while offering practitioners a practical diagnostic for monitoring evolving sector structures and providing academics with a framework for testing alternative hypotheses about how quickly markets absorb information.

Keywords: Clustering Algorithms, Large Language Models (LLMs), Arbitrage Pricing Theory (APT), Efficient Market Hypothesis (EMH), Factor Models, Equities
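
To make the evaluation idea in the abstract concrete, here is a minimal sketch of turning a clustering into a synthetic factor model and scoring it out of sample. The abstract does not spell out the paper's exact specification, so everything below is an assumption made for illustration: each cluster's factor is taken to be the equal-weighted mean return of its members, each stock is regressed on its own cluster's factor over a training window, and RMSE is measured on a later window using that window's factor returns. The function names (`cluster_factor`, `ols`, `out_of_sample_rmse`) and the simulated data are hypothetical, and only the Python standard library is used.

```python
import math
import random
from statistics import fmean


def cluster_factor(returns, labels, cluster):
    """Equal-weighted mean return per date of the stocks labelled `cluster`.

    `returns` is a list of rows (one per date), each a list of stock returns.
    Equal weighting is an assumption, not necessarily the paper's choice.
    """
    members = [j for j, c in enumerate(labels) if c == cluster]
    return [fmean(row[j] for j in members) for row in returns]


def ols(x, y):
    """Intercept and slope of a one-variable least-squares fit of y on x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta = sxy / sxx
    return my - beta * mx, beta


def out_of_sample_rmse(train, test, labels):
    """Fit each stock on its cluster factor in `train`, then score on `test`."""
    sq_errors = []
    for c in sorted(set(labels)):
        f_train = cluster_factor(train, labels, c)
        f_test = cluster_factor(test, labels, c)
        for j, lab in enumerate(labels):
            if lab != c:
                continue
            alpha, beta = ols(f_train, [row[j] for row in train])
            sq_errors += [(row[j] - (alpha + beta * f)) ** 2
                          for row, f in zip(test, f_test)]
    return math.sqrt(fmean(sq_errors))


# Simulated data (hypothetical): stocks 0-4 follow one latent factor,
# stocks 5-9 follow another, plus idiosyncratic noise.
rng = random.Random(0)


def simulate(n_days):
    rows = []
    for _ in range(n_days):
        f = [rng.gauss(0, 0.01), rng.gauss(0, 0.01)]
        rows.append([f[j // 5] + rng.gauss(0, 0.002) for j in range(10)])
    return rows


train, test = simulate(250), simulate(60)
true_labels = [0] * 5 + [1] * 5            # matches the simulated structure
mixed_labels = [j % 2 for j in range(10)]  # deliberately scrambled grouping

rmse_true = out_of_sample_rmse(train, test, true_labels)
rmse_mixed = out_of_sample_rmse(train, test, mixed_labels)
print(f"RMSE with matching clusters:  {rmse_true:.5f}")
print(f"RMSE with scrambled clusters: {rmse_mixed:.5f}")
```

Because the same scoring function accepts any label vector, it can compare groupings from different sources (price-based, GICS, or embedding-based) on equal terms, which is the property the abstract emphasizes. In this sketch each stock also contributes to its own cluster factor; a stricter version might exclude it.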

Complexity vs Empirical Score

  • Math Complexity: 6.5/10
  • Empirical Rigor: 8.0/10
  • Quadrant: Holy Grail
  • Why: The paper uses advanced financial theory (APT, EMH) and embedding techniques, but the empirical setup is exceptionally rigorous with out-of-sample rolling tests, specific dataset details (S&P 500 2022-2024), and clear error metrics (RMSE).
  flowchart TD
    A["Research Question:<br>Can AI improve stock clustering<br>over price-based methods under EMH?"] --> B["Methodology Setup<br>Rolling Out-of-Sample Test"]
    B --> C{"Inputs"}
    C --> C1["S&P 500 Constituents<br>2022-2024"]
    C --> C2["Three Clustering Methods<br>1. Price-Based<br>2. GICS (Human Rules)<br>3. LLM Embeddings"]
    C --> C3["APT Factor Model Framework<br>Consistent Evaluation"]
    C1 & C2 & C3 --> D["Computational Process<br>Convert clusters to synthetic factors<br>Calculate RMSE in out-of-sample rolls"]
    D --> E{"Key Findings"}
    E --> E1["Price-Based Clustering<br>Best Performance"]
    E --> E2["15.9% RMSE Reduction<br>vs GICS"]
    E --> E3["14.7% RMSE Reduction<br>vs LLM Embeddings"]
    E --> E4["Conclusion:<br>Short-horizon returns<br>contained in prices (EMH supported)"]
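
The price-based method listed above groups stocks by their historical return correlations. As a rough illustration, the sketch below builds such clusters with average-linkage agglomerative clustering on the distance 1 - correlation. The source does not state which clustering algorithm or distance the authors use, so both are assumptions here, and the helper names (`correlation`, `correlation_clusters`) and the simulated returns are hypothetical.

```python
import math
import random
from statistics import fmean


def correlation(x, y):
    """Pearson correlation of two equal-length return series."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_clusters(returns, n_clusters):
    """Assign each stock a cluster label from its return correlations.

    `returns` is a list of rows (one per date), each a list of stock returns.
    Repeatedly merges the two clusters with the smallest average pairwise
    distance, where distance = 1 - correlation (an assumed choice).
    """
    n = len(returns[0])
    cols = [[row[j] for row in returns] for j in range(n)]
    dist = [[1.0 - correlation(cols[i], cols[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[j] for j in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = fmean(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    labels = [0] * n
    for k, members in enumerate(clusters):
        for j in members:
            labels[j] = k
    return labels


# Simulated data (hypothetical): two blocks of five stocks, each block
# driven by its own latent factor plus idiosyncratic noise.
rng = random.Random(1)
history = []
for _ in range(250):
    f = [rng.gauss(0, 0.01), rng.gauss(0, 0.01)]
    history.append([f[j // 5] + rng.gauss(0, 0.002) for j in range(10)])

labels = correlation_clusters(history, n_clusters=2)
print(labels)
```

In a roll-forward setting, a function like this would be re-run on each date using only returns available up to that date, so the resulting labels never use future information.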