Multimodal Language Models with Modality-Specific Experts for Financial Forecasting from Interleaved Sequences of Text and Time Series

ArXiv ID: 2509.19628 (https://arxiv.org/abs/2509.19628)

Authors: Ross Koval, Nicholas Andrews, Xifeng Yan

Abstract

Text and time series data offer complementary views of financial markets: news articles provide narrative context about company events, while stock prices reflect how markets react to those events. Despite this complementarity, effectively integrating the two interleaved modalities for improved forecasting remains challenging. In this work, we propose a unified neural architecture that models these interleaved sequences with modality-specific experts, allowing the model to learn patterns unique to time series while still enabling joint reasoning across modalities and preserving pretrained language-understanding capabilities. To further improve multimodal understanding, we introduce a cross-modal alignment framework with a salient token weighting mechanism that learns to align representations across modalities while focusing on the most informative tokens. We demonstrate the effectiveness of our approach on a large-scale financial forecasting task, achieving state-of-the-art performance against a wide variety of strong unimodal and multimodal baselines. We also develop an interpretability method that reveals the value of time-series context and reinforces the design of our cross-modal alignment objective. Finally, we demonstrate that these improvements translate into meaningful economic gains in investment simulations.
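The abstract describes an architecture in which attention is shared across the interleaved sequence (enabling joint reasoning) while per-token feed-forward computation is specialized by modality. The paper's exact design is not reproduced here; the following is a minimal PyTorch sketch of one way such modality-specific experts could be wired, with all class names, layer sizes, and the two-way routing being illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of a modality-routed transformer layer: shared
# self-attention lets text and time-series tokens attend to each other,
# while each token's feed-forward pass is handled by an expert for its
# modality (0 = text, 1 = time series). Names and sizes are illustrative.
import torch
import torch.nn as nn


class ModalityExpertLayer(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One feed-forward expert per modality; the text expert could be
        # initialized from the pretrained LM to preserve its capabilities.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(2)
        ])

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); modality: (batch, seq) with values {0, 1}.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # attention is shared across modalities
        x = x + attn_out
        h = self.norm2(x)
        ff = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            mask = (modality == m).unsqueeze(-1)   # route tokens by modality
            ff = torch.where(mask, expert(h), ff)  # keep only the matching expert's output
        return x + ff
```

For clarity this sketch runs every expert over the full sequence and masks afterward; a production version would gather each modality's tokens before the expert call to avoid the wasted compute.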

Keywords: Multimodal Forecasting, Cross-Modal Alignment, Modality-Specific Experts, Token Weighting Mechanism, Financial Time Series Integration, Equity (Stocks)

Complexity vs Empirical Score

  • Math Complexity: 6.5/10
  • Empirical Rigor: 7.5/10
  • Quadrant: Holy Grail
  • Why: The paper employs advanced neural architectures, including mixture-of-experts (MoE) routing and a cross-modal alignment framework with dedicated objective functions (SALMON, STW), indicating significant math complexity. It also demonstrates empirical rigor through a large-scale forecasting task, a released codebase, and investment simulations showing economic gains, making it backtest-ready. A hedged sketch of a saliency-weighted alignment objective follows this list.
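The objective names SALMON and STW come from the paper, but their exact formulations are not given in this summary. The sketch below is one plausible reading of a salient-token-weighted alignment objective: a learned scorer pools each modality's token states by saliency, and a symmetric InfoNCE loss aligns the pooled text and time-series views of the same example. Every name and hyperparameter here is a hypothetical stand-in.

```python
# Hypothetical sketch of a salient-token-weighted alignment objective.
# This is one plausible reading of the paper's STW/SALMON objectives,
# not their exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SalientPooling(nn.Module):
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)  # learns per-token saliency

    def forward(self, tokens: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, d_model); pad_mask: (batch, seq), True at padding.
        scores = self.scorer(tokens).squeeze(-1).masked_fill(pad_mask, -1e9)
        weights = scores.softmax(dim=-1)  # focus on the most informative tokens
        return torch.einsum("bs,bsd->bd", weights, tokens)


def weighted_alignment_loss(text_vec, ts_vec, temperature: float = 0.07):
    # Symmetric InfoNCE over a batch: matching (text, time-series) pairs
    # are positives; all other pairings in the batch act as negatives.
    text_vec = F.normalize(text_vec, dim=-1)
    ts_vec = F.normalize(ts_vec, dim=-1)
    logits = text_vec @ ts_vec.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

In this reading, `SalientPooling` would be applied separately to the text-token and time-series-token states produced by the shared encoder, and the resulting vectors fed to `weighted_alignment_loss` alongside the forecasting loss.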
```mermaid
flowchart TD
  G["Research Goal<br>How to effectively integrate interleaved<br>text & time series for financial forecasting?"] --> D
  subgraph D ["Inputs & Preprocessing"]
      D1["Text Data<br>(News Articles)"] --> D3["Interleaved Sequences"]
      D2["Time Series Data<br>(Stock Prices)"] --> D3
  end
  D --> M
  subgraph M ["Methodology"]
      direction LR
      M1["Modality-Specific Experts"] --> M2["Cross-Modal Alignment<br>with Salient Token Weighting"] --> M3["Joint Reasoning &<br>Pretrained Language Modeling"]
  end
  M --> F["Computational Process<br>Multimodal LLM Architecture"]
  F --> O
  subgraph O ["Key Findings & Outcomes"]
      O1["State-of-the-Art<br>Forecasting Performance"]
      O2["Interpretability<br>Insights"]
      O3["Meaningful Economic Gains<br>in Investment Simulations"]
  end
  O --> R["Research Impact<br>Validates Modality-Specific Experts &<br>Cross-Modal Alignment for Finance"]
```
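The flowchart's final outcome node refers to economic gains in investment simulations. The paper's simulation protocol is not described in this summary; as a generic illustration of how forecast quality is typically converted into such an evaluation, here is a minimal long-short backtest sketch in NumPy, with the function name, `k`, and the 252-day annualization all being standard but assumed choices.

```python
# Hypothetical sketch of an investment simulation: each day, go long the
# top-k names by predicted return and short the bottom-k, then track the
# spread. Not the paper's actual protocol.
import numpy as np


def long_short_backtest(preds: np.ndarray, realized: np.ndarray, k: int = 10):
    # preds, realized: (days, assets) arrays of predicted and realized returns.
    daily_pnl = []
    for p, r in zip(preds, realized):
        order = np.argsort(p)                  # ascending by predicted return
        longs, shorts = order[-k:], order[:k]
        daily_pnl.append(r[longs].mean() - r[shorts].mean())
    pnl = np.array(daily_pnl)
    sharpe = np.sqrt(252) * pnl.mean() / (pnl.std() + 1e-12)  # annualized Sharpe
    return pnl, sharpe
```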