Deficiency of Large Language Models in Finance: An Empirical Examination of Hallucination

ArXiv ID: 2311.15548

Authors: Unknown

Abstract

The hallucination issue is recognized as a fundamental deficiency of large language models (LLMs), especially when they are applied to fields such as finance, education, and law. Despite growing concern, there has been little empirical investigation of the problem. In this paper, we provide an empirical examination of LLMs' hallucination behaviors in financial tasks. First, we empirically investigate LLMs' ability to explain financial concepts and terminology. Second, we assess LLMs' capacity to query historical stock prices. Third, to alleviate the hallucination issue, we evaluate the efficacy of four practical methods: few-shot learning, Decoding by Contrasting Layers (DoLa), Retrieval-Augmented Generation (RAG), and prompt-based tool learning, in which the model generates a query command for an external function. Finally, our major finding is that off-the-shelf LLMs exhibit serious hallucination behaviors in financial tasks. There is therefore an urgent need for research efforts aimed at mitigating LLMs' hallucination.
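To make the fourth method concrete, the sketch below shows one way prompt-based tool learning could route a stock-price question to an external lookup function instead of the model's parametric memory. It is a minimal illustration, not the paper's implementation: the function name `get_stock_price`, the JSON command schema, the prompt wording, and the placeholder price are all assumptions.

```python
import json

# Hypothetical price table standing in for a real market-data source;
# the value is an illustrative placeholder, not an actual quote.
PRICE_DB = {("AAPL", "2023-11-20"): 123.45}

def get_stock_price(ticker: str, date: str) -> float | None:
    """Return the closing price for (ticker, date) from the stand-in data source."""
    return PRICE_DB.get((ticker, date))

# Prompt instructing the model to emit a structured query command rather than
# answering the price question from memory; a real run would send this to an LLM API.
TOOL_PROMPT = (
    "Do not answer price questions from memory. Instead, emit a JSON command:\n"
    '{"tool": "get_stock_price", "ticker": "<symbol>", "date": "YYYY-MM-DD"}\n'
    "Question: What was AAPL's closing price on 2023-11-20?"
)

def run_tool_command(llm_output: str) -> str:
    """Parse the model's JSON command and dispatch it to the named tool."""
    cmd = json.loads(llm_output)
    if cmd.get("tool") == "get_stock_price":
        price = get_stock_price(cmd["ticker"], cmd["date"])
        if price is None:
            return "No data found."
        return f'{cmd["ticker"]} closed at {price} on {cmd["date"]}.'
    return "Unknown tool."

if __name__ == "__main__":
    # Stand-in for the LLM's response to TOOL_PROMPT.
    simulated_llm_output = '{"tool": "get_stock_price", "ticker": "AAPL", "date": "2023-11-20"}'
    print(run_tool_command(simulated_llm_output))
```

The point of the pattern is that the model's output is a verifiable command rather than a free-form (and potentially hallucinated) price.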

Keywords: large language models, hallucination mitigation, few-shot learning, retrieval-augmented generation, financial text analysis, equities

Complexity vs Empirical Score

  • Math Complexity: 1.0/10
  • Empirical Rigor: 8.0/10
  • Quadrant: Street Traders
  • Why: The paper is primarily empirical: it benchmarks LLM performance and evaluates mitigation techniques using datasets, API calls, and compute metrics, which demands substantial implementation and data effort. Mathematical complexity is low, as the paper contains no advanced theoretical derivations and focuses on experimental methodology rather than dense formulas.
  flowchart TD
    Start(["Research Goal:<br>Examine Hallucination in LLMs for Finance"]) --> Input1["Financial Concepts &<br>Terminologies"]
    Start --> Input2["Historical Stock<br>Price Queries"]
    
    Input1 --> Method["Methodology:<br>Empirical Testing of LLM Capabilities"]
    Input2 --> Method
    
    Method --> Mitigate["Mitigation Strategies:<br>1. Few-Shot Learning<br>2. DoLa Decoding<br>3. RAG<br>4. Tool Learning"]
    
    Mitigate --> Finding(["Major Finding:<br>Off-the-shelf LLMs show<br>serious hallucination in finance<br>Urgent need for mitigation"])
    
    style Start fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style Finding fill:#ffebee,stroke:#c62828,stroke-width:2px
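For the RAG branch of the mitigation strategies in the flowchart, the sketch below grounds a terminology question in retrieved text before the model answers. It assumes a toy glossary as the retrieval corpus and a simple word-overlap retriever; the glossary entries, scoring rule, and prompt template are illustrative assumptions, not the paper's pipeline.

```python
# Minimal RAG-style grounding step for financial terminology questions.
GLOSSARY = {
    "EBITDA": "Earnings before interest, taxes, depreciation, and amortization.",
    "P/E ratio": "Share price divided by earnings per share.",
    "duration": "A bond's price sensitivity to interest-rate changes.",
}

def retrieve(question: str, k: int = 1) -> list[tuple[str, str]]:
    """Rank glossary entries by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    scored = [
        (len(q_words & set((term + " " + defn).lower().split())), term, defn)
        for term, defn in GLOSSARY.items()
    ]
    scored.sort(reverse=True)
    return [(term, defn) for _, term, defn in scored[:k]]

def build_prompt(question: str) -> str:
    """Prepend retrieved context so the model answers from evidence, not memory."""
    context = "\n".join(f"- {term}: {defn}" for term, defn in retrieve(question))
    return f"Context:\n{context}\n\nAnswer using only the context above.\nQuestion: {question}"

if __name__ == "__main__":
    print(build_prompt("What does EBITDA stand for?"))
```

A production retriever would use embeddings and a curated financial corpus; the structure of the prompt, however, stays the same: retrieved evidence first, then the question.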