Quantifying Semantic Shift in Financial NLP: Robust Metrics for Market Prediction Stability

ArXiv ID: 2510.00205

Authors: Zhongtian Sun, Chenghao Xiao, Anoushka Harit, Jongmin Yu

Abstract

Financial news is essential for accurate market prediction, but evolving narratives across macroeconomic regimes introduce semantic and causal drift that weakens model reliability. We present an evaluation framework to quantify robustness in financial NLP under regime shifts. The framework defines four metrics: (1) Financial Causal Attribution Score (FCAS) for alignment with causal cues, (2) Patent Cliff Sensitivity (PCS) for sensitivity to semantic perturbations, (3) Temporal Semantic Volatility (TSV) for drift in latent text representations, and (4) NLI-based Logical Consistency Score (NLICS) for entailment coherence. Applied to LSTM and Transformer models across four economic periods (pre-COVID, COVID, post-COVID, and rate hike), the metrics reveal performance degradation during crises. Semantic volatility and Jensen-Shannon divergence correlate with prediction error. Transformers are more affected by drift, while feature-enhanced variants improve generalisation. A GPT-4 case study confirms that alignment-aware models better preserve causal and logical consistency. The framework supports auditability, stress testing, and adaptive retraining in financial AI systems.
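For intuition, here is a minimal sketch of one such drift measure, not the authors' exact formulation: Temporal Semantic Volatility can be approximated as the average cosine distance between the mean sentence embeddings of consecutive time windows. The function name and interface below are hypothetical.

```python
import numpy as np

def temporal_semantic_volatility(embeddings_by_window):
    """Average cosine distance between mean embeddings of consecutive
    time windows -- a simple proxy for drift in latent text representations.

    embeddings_by_window: list of (n_i, d) arrays, one per time window,
    ordered chronologically. (Hypothetical interface; the paper's exact
    TSV definition may differ.)
    """
    centroids = [e.mean(axis=0) for e in embeddings_by_window]
    drifts = []
    for prev, curr in zip(centroids[:-1], centroids[1:]):
        cos_sim = np.dot(prev, curr) / (np.linalg.norm(prev) * np.linalg.norm(curr))
        drifts.append(1.0 - cos_sim)  # cosine distance between adjacent windows
    return float(np.mean(drifts)) if drifts else 0.0
```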

Keywords: Financial NLP, Regime Shifts, Causal Attribution, Transformer Models, Robustness Evaluation, Multi-Asset (Sentiment Analysis)

Complexity vs Empirical Score

  • Math Complexity: 7.5/10
  • Empirical Rigor: 6.8/10
  • Quadrant: Holy Grail
  • Why: The paper introduces formal mathematical constructs (e.g., distributional shifts, functional mappings for the metrics) and uses statistical measures such as Jensen-Shannon divergence, showing significant mathematical density. It employs rigorous empirical methods, including multi-period regime testing, model comparison (LSTM vs. Transformer), and a GPT-4 case study, though it lacks full backtest-ready implementation details. (A minimal Jensen-Shannon sketch follows this list.)
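As referenced above, the distributional-shift angle can be illustrated with a Jensen-Shannon divergence between token-frequency distributions from two regimes. This is a minimal sketch using SciPy, not the paper's implementation; the function name and toy counts are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def regime_js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence between two regimes' token-frequency
    distributions over a shared vocabulary (illustrative only)."""
    p = np.asarray(counts_a, dtype=float)
    q = np.asarray(counts_b, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    # scipy returns the JS *distance* (sqrt of the divergence), so square it
    return jensenshannon(p, q) ** 2

# Toy example: vocabulary counts for pre-COVID vs COVID news headlines
print(regime_js_divergence([120, 30, 5, 0], [40, 80, 60, 25]))
```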

```mermaid
flowchart TD
    A["Research Goal: Quantify robustness<br>of Financial NLP models<br>under regime shifts"] --> B["Methodology: Evaluation Framework"]
    B --> C["Define Metrics:<br>FCAS, PCS, TSV, NLICS"]
    B --> D["Data: Financial News across<br>4 Economic Regimes"]
    C --> E["Computational Process:<br>Apply Metrics to Models<br>LSTM vs Transformer"]
    D --> E
    E --> F["Key Findings:<br>Performance degradation during crises"]
    E --> G["Key Findings:<br>Transformers more affected by drift"]
    F --> H["Outcome:<br>Framework for stress testing,<br>auditability, & adaptive retraining"]
    G --> H
```
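The evaluation loop in the flowchart could be organised roughly as below: iterate over regimes and models, score predictions, and record error alongside drift so their correlation can be inspected. This is a hedged sketch under assumed interfaces; names such as stress_test, compute_error, and compute_drift are placeholders, not the paper's API.

```python
from typing import Callable, Dict, List

REGIMES = ["pre-COVID", "COVID", "post-COVID", "rate-hike"]

def stress_test(models: Dict[str, Callable],
                regime_data: Dict[str, dict],
                compute_error: Callable,
                compute_drift: Callable) -> List[dict]:
    """Evaluate each model on each regime and collect (error, drift) pairs.

    models:       name -> prediction function (e.g. LSTM or Transformer wrappers)
    regime_data:  regime name -> {"texts": ..., "targets": ...}
    compute_error / compute_drift: user-supplied metric callables (placeholders).
    """
    records = []
    for regime in REGIMES:
        data = regime_data[regime]
        drift = compute_drift(data["texts"])  # e.g. TSV for this regime's news
        for name, predict in models.items():
            preds = predict(data["texts"])
            records.append({
                "regime": regime,
                "model": name,
                "error": compute_error(preds, data["targets"]),
                "drift": drift,
            })
    return records
```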