Uncovering Representation Bias for Investment Decisions in Open-Source Large Language Models

ArXiv ID: 2510.05702

Authors: Fabrizio Dimino, Krati Saxena, Bhaskarjit Sarmah, Stefano Pasquali

Abstract

Large Language Models are increasingly adopted in financial applications to support investment workflows. However, prior studies have seldom examined how these models reflect biases related to firm size, sector, or financial characteristics, which can significantly impact decision-making. This paper addresses this gap by focusing on representation bias in open-source Qwen models. We propose a balanced round-robin prompting method over approximately 150 U.S. equities, applying constrained decoding and token-logit aggregation to derive firm-level confidence scores across financial contexts. Using statistical tests and variance analysis, we find that firm size and valuation consistently increase model confidence, while risk factors tend to decrease it. Confidence varies significantly across sectors, with the Technology sector showing the greatest variability. When models are prompted for specific financial categories, their confidence rankings best align with fundamental data, moderately with technical signals, and least with growth indicators. These results highlight representation bias in Qwen models and motivate sector-aware calibration and category-conditioned evaluation protocols for safe and fair financial LLM deployment.

Keywords: LLM bias, Representation bias, Constrained decoding, Confidence scores, Sector analysis, Equities

Complexity vs Empirical Score

  • Math Complexity: 5.5/10
  • Empirical Rigor: 7.0/10
  • Quadrant: Holy Grail
  • Why: The paper relies on established statistical methods (ANOVA; Pearson, Spearman, and Kendall correlations with FDR correction; confidence-interval analysis), which places its mathematical complexity in the moderate range. Its empirical rigor is high: a systematic, multi-model experiment on approximately 150 U.S. equities, with rigorous data collection, multiple testing protocols, and statistical validation of the findings.
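
To make the statistical vocabulary above concrete, here is a minimal pure-Python sketch of two of the named tools: a Spearman rank correlation and a Benjamini-Hochberg FDR adjustment. The summary does not say which FDR procedure the paper uses, so Benjamini-Hochberg is an assumption here, and the function names and input numbers are illustrative placeholders rather than values or code from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as Pearson correlation on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return (adjusted p-values, rejection flags) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted, [q <= alpha for q in adjusted]


# Illustrative inputs only: not data from the paper.
confidence = [0.62, 0.55, 0.71, 0.40]
market_cap = [310.0, 120.0, 900.0, 45.0]
print(spearman_rho(confidence, market_cap))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

In practice one would likely reach for a statistics library for these steps; the hand-rolled versions are shown only so that each computation is visible.
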
  flowchart TD
    A["Research Goal<br>Detect Representation Bias in LLMs for Finance"] --> B["Data & Methodology<br>~150 US Equities with Balanced Round-Robin Prompting"]
    B --> C["Computation<br>Constrained Decoding &<br>Token-Logit Aggregation"]
    C --> D["Outputs<br>Firm-Level Confidence Scores"]
    D --> E{"Statistical Analysis"}
    E --> F["Findings: Firm Size & Valuation ↑ Confidence"]
    E --> G["Findings: Risk Factors ↓ Confidence"]
    E --> H["Findings: Sector Variability<br>High in Tech"]
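
The prompting and scoring stages of the flowchart can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' implementation: it assumes a pairwise round-robin in which every firm is compared with every other firm in both orderings (to balance position effects), a softmax restricted to the candidate ticker tokens as the constrained-decoding step, and a simple mean of the resulting probabilities as the aggregation. The tickers, the `toy_logits` stand-in for a model call, and all function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import permutations


def constrained_softmax(logits, allowed):
    """Softmax over the allowed tokens only; every other token is masked out."""
    peak = max(logits[t] for t in allowed)  # subtract the max for stability
    exps = {t: math.exp(logits[t] - peak) for t in allowed}
    total = sum(exps.values())
    return {t: v / total for t, v in exps.items()}


def round_robin_confidence(tickers, logit_fn):
    """Average each ticker's constrained probability over all ordered pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    # permutations(..., 2) yields both (a, b) and (b, a), so each firm
    # appears equally often in the first and second prompt position.
    for first, second in permutations(tickers, 2):
        probs = constrained_softmax(logit_fn(first, second), (first, second))
        for ticker in (first, second):
            totals[ticker] += probs[ticker]
            counts[ticker] += 1
    return {t: totals[t] / counts[t] for t in tickers}


# Hypothetical stand-in for a model: fixed per-ticker logits, a bonus for the
# first-listed ticker (a position effect), and an unrelated high-logit token.
BASE_LOGITS = {"AAA": 2.0, "BBB": 1.0, "CCC": 0.0}


def toy_logits(first, second):
    return {
        first: BASE_LOGITS[first] + 0.5,
        second: BASE_LOGITS[second],
        "the": 5.0,  # ignored because it is not an allowed token
    }


scores = round_robin_confidence(list(BASE_LOGITS), toy_logits)
for ticker, score in scores.items():
    print(ticker, round(score, 3))
```

With a real model, `logit_fn` would return next-token logits for a prompt naming both firms; how the paper builds its prompts and aggregates multi-token tickers is not specified in this summary.
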