THEME: Enhancing Thematic Investing with Semantic Stock Representations and Temporal Dynamics
ArXiv ID: 2508.16936
Authors: Hoyoung Lee, Wonbin Ahn, Suhwan Park, Jaehoon Lee, Minjae Kim, Sungdong Yoo, Taeyoon Lim, Woohyung Lim, Yongjae Lee
Abstract
Thematic investing, which aims to construct portfolios aligned with structural trends, remains a challenging endeavor due to overlapping sector boundaries and evolving market dynamics. A promising direction is to build semantic representations of investment themes from textual data. However, despite their power, general-purpose LLM embedding models are not well-suited to capture the nuanced characteristics of financial assets, since the semantic representation of investment assets may differ fundamentally from that of general financial text. To address this, we introduce THEME, a framework that fine-tunes embeddings using hierarchical contrastive learning. THEME aligns themes and their constituent stocks using their hierarchical relationship, and subsequently refines these embeddings by incorporating stock returns. This process yields representations effective for retrieving thematically aligned assets with strong return potential. Empirical results demonstrate that THEME excels in two key areas. For thematic asset retrieval, it significantly outperforms leading large language models. Furthermore, its constructed portfolios demonstrate compelling performance. By jointly modeling thematic relationships from text and market dynamics from returns, THEME generates stock embeddings specifically tailored for a wide range of practical investment applications.
Keywords: Thematic investing, Hierarchical contrastive learning, LLM embedding fine-tuning, Portfolio construction, Asset retrieval, Equity Portfolio Management
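To make the alignment idea in the abstract concrete, the sketch below shows a softmax-based contrastive loss (InfoNCE-style) that pulls a theme embedding toward its constituent stocks and away from other stocks. It is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `cosine` and `theme_stock_contrastive_loss`, the temperature value, the averaging over multiple positives, and the toy vectors are all hypothetical, and the paper's hierarchical structure and return-based refinement are not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def theme_stock_contrastive_loss(theme_vec, stock_vecs, positive_idx, temperature=0.1):
    """InfoNCE-style loss for a single theme (hypothetical helper).

    Stocks listed in positive_idx are treated as the theme's constituents;
    every other stock in stock_vecs acts as a negative. The loss is the
    average negative log-softmax probability of the positives.
    """
    logits = [cosine(theme_vec, s) / temperature for s in stock_vecs]
    # Numerically stable log-sum-exp over all candidate stocks.
    max_logit = max(logits)
    log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    losses = [log_denom - logits[i] for i in positive_idx]
    return sum(losses) / len(losses)


# Toy 2-D embeddings (assumed values, for illustration only).
theme = [1.0, 0.0]
stocks = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]

aligned_loss = theme_stock_contrastive_loss(theme, stocks, positive_idx=[0, 1])
misaligned_loss = theme_stock_contrastive_loss(theme, stocks, positive_idx=[2])
```

In this toy setup the loss is lower when the labeled constituents are the stocks that already sit close to the theme, which is the behavior a fine-tuning objective of this kind would reward.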
Complexity vs Empirical Score
- Math Complexity: 6.5/10
- Empirical Rigor: 8.0/10
- Quadrant: Holy Grail
- Why: The paper employs advanced machine learning techniques like hierarchical contrastive learning and embedding fine-tuning, representing significant mathematical and methodological complexity. It is highly empirical, with a custom dataset (TRS), clear backtesting for portfolio performance, and quantitative comparisons to baseline models like LLMs.
flowchart TD
A["Research Goal"] --> B["Data Collection"]
subgraph B ["Inputs"]
B1["Thematic Text"]
B2["Stock Returns"]
end
B --> C{"THEME Framework"}
C --> D["Hierarchical Contrastive Learning"]
D --> E["Align Themes & Stocks"]
E --> F["Incorporate Returns"]
F --> G["Refined Embeddings"]
G --> H["Outcomes"]
subgraph H ["Findings"]
H1["Superior Asset Retrieval"]
H2["High-Performance Portfolios"]
end
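The last stages of the flowchart, from refined embeddings to asset retrieval and portfolio construction, can be illustrated with a minimal sketch. It assumes that retrieval ranks stocks by cosine similarity to a theme embedding and that the retrieved stocks are equally weighted; both choices, along with the helper names `retrieve_top_k` and `equal_weight` and the placeholder tickers, are assumptions made for illustration rather than details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(theme_vec, stock_embeddings, k=2):
    """Return the k (ticker, similarity) pairs closest to the theme (hypothetical helper)."""
    scored = [(ticker, cosine(theme_vec, vec)) for ticker, vec in stock_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def equal_weight(tickers):
    """Assign the same portfolio weight to every ticker (assumed weighting scheme)."""
    weight = 1.0 / len(tickers)
    return {ticker: weight for ticker in tickers}


# Placeholder tickers and toy embeddings, not real securities or model outputs.
embeddings = {
    "AAA": [0.9, 0.1],
    "BBB": [0.7, 0.3],
    "CCC": [0.0, 1.0],
}
theme = [1.0, 0.0]

top = retrieve_top_k(theme, embeddings, k=2)
portfolio = equal_weight([ticker for ticker, _ in top])
```

Equal weighting is used only because it is the simplest rule to state; any weighting scheme could be substituted without changing the retrieval step.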