Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction
ArXiv ID: 2510.15691
Authors: Tian Guo, Emmanuel Hauptmann
Abstract
In quantitative investing, return prediction supports various tasks, including stock selection, portfolio optimization, and risk management. Quantitative factors, such as valuation, quality, and growth, capture various characteristics of stocks. Unstructured data, like news and transcripts, has attracted growing attention, driven by recent advances in large language models (LLMs). This paper examines effective methods for leveraging multimodal factors and newsflow in return prediction and stock selection. First, we introduce a fusion learning framework to learn a unified representation from factors and newsflow representations generated by an LLM. Within this framework, we compare three methods of different architectural complexities: representation combination, representation summation, and attentive representations. Next, building on the limitation of fusion learning observed in empirical comparison, we explore the mixture model that adaptively combines predictions made by single modalities and their fusion. To mitigate the training instability of the mixture model, we introduce a decoupled training approach with theoretical insights. Finally, our experiments on real investment universes yield several insights into effective multimodal modeling of factors and news for stock return prediction and selection.
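The three fusion variants named in the abstract can be illustrated with a minimal sketch. The abstract does not give the architectures, so everything here is an assumption: "representation combination" is read as concatenation, "representation summation" as projecting both modalities to a shared dimension and adding them, and "attentive representations" as a softmax-weighted sum of the two projections scored against a query vector. The function names, dimensions, and random weights are hypothetical and stand in for learned parameters.

```python
import math
import random


def linear(x, w):
    """Project vector x with weight matrix w (a list of rows)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_combination(f, n):
    # Assumption: "combination" is read as plain concatenation.
    return f + n


def fuse_summation(f, n, w_f, w_n):
    # Project each modality to a shared dimension, then add element-wise.
    pf, pn = linear(f, w_f), linear(n, w_n)
    return [a + b for a, b in zip(pf, pn)]


def fuse_attentive(f, n, w_f, w_n, q):
    # Score each projected modality against a query vector q, then take
    # the softmax-weighted sum of the two projections.
    pf, pn = linear(f, w_f), linear(n, w_n)
    scores = [sum(a * b for a, b in zip(q, p)) for p in (pf, pn)]
    a_f, a_n = softmax(scores)
    fused = [a_f * x + a_n * y for x, y in zip(pf, pn)]
    return fused, (a_f, a_n)


rng = random.Random(0)


def rand_matrix(rows, cols):
    # Random weights stand in for parameters that would be learned.
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


factors = [rng.gauss(0.0, 1.0) for _ in range(4)]  # hypothetical factor vector
news = [rng.gauss(0.0, 1.0) for _ in range(8)]     # hypothetical LLM news embedding
d = 3                                              # shared dimension (assumed)
w_f, w_n = rand_matrix(d, 4), rand_matrix(d, 8)
q = [rng.gauss(0.0, 1.0) for _ in range(d)]

combined = fuse_combination(factors, news)
summed = fuse_summation(factors, news, w_f, w_n)
attended, weights = fuse_attentive(factors, news, w_f, w_n, q)
```

In this reading, concatenation keeps the full width of both inputs (12 values here), while the summation and attentive variants return a vector of the shared dimension; the attentive variant additionally exposes how much weight each modality receives.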
Keywords: Multimodal factors, Large language models (LLMs), Fusion learning, Representation learning, Mixture models, Equities
Complexity vs Empirical Score
- Math Complexity: 5.5/10
- Empirical Rigor: 7.5/10
- Quadrant: Street Traders
- Why: The paper builds on established concepts such as multimodal fusion and mixture models and avoids heavy derivations, which places it in the moderate range for math complexity. It is highly empirical: it runs experiments on real investment universes, evaluates long-only and long-short backtest portfolios, and compares model variants, indicating a strong focus on data and implementation.
flowchart TD
A["Research Goal: Compare & combine quantitative factors & LLM newsflow for stock return prediction"] --> B["Input Data: Quantitative Factors & LLM Newsflow Representations"]
B --> C["Method 1: Fusion Learning Framework"]
C --> D["Comparison: Representation Combination vs. Summation vs. Attentive"]
C --> E["Finding 1: Fusion Learning Limitations"]
E --> F["Method 2: Mixture Model (Adaptive Combination)"]
F --> G["Constraint: Training Instability"]
G --> H["Solution: Decoupled Training Approach"]
H --> I["Outcome: Optimized Multimodal Stock Return Prediction & Selection"]
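The mixture step in the flowchart (Method 2) can be sketched as a gated average of three predictions: factors only, news only, and their fusion. This is an illustrative reading, not the paper's implementation. The softmax gate, the function name `mixture_prediction`, and the example numbers are assumptions, and the decoupled training procedure is not shown because this summary does not specify it.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_prediction(preds, gate_logits):
    """Combine per-model return predictions with softmax gate weights.

    preds       : predictions from [factor-only, news-only, fusion] models
    gate_logits : unnormalized gate scores, one per model; a real gate
                  would be learned and could depend on the inputs
    """
    gate = softmax(gate_logits)
    return sum(w * p for w, p in zip(gate, preds)), gate


# Hypothetical predicted returns for one stock (made-up numbers).
preds = [0.012, -0.004, 0.007]

# Equal logits give each model the same weight, so the mixture is the mean.
y_equal, gate_equal = mixture_prediction(preds, [0.0, 0.0, 0.0])

# A larger logit shifts the mixture toward the factor-only prediction.
y_tilted, gate_tilted = mixture_prediction(preds, [2.0, 0.0, 0.0])
```

With equal gate logits the mixture reduces to a simple average of the three predictions. Raising one logit moves the output toward that model's prediction, which is the sense in which the combination is adaptive.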