Semantic Faithfulness and Entropy Production Measures to Tame Your LLM Demons and Manage Hallucinations
arXiv ID: 2512.05156
Authors: Igor Halperin
Abstract
Evaluating faithfulness of Large Language Models (LLMs) to a given task is a complex challenge. We propose two new unsupervised metrics for faithfulness evaluation using insights from information theory and thermodynamics. Our approach treats an LLM as a bipartite information engine where hidden layers act as a Maxwell demon controlling transformations of context $C$ into answer $A$ via prompt $Q$. We model Question-Context-Answer (QCA) triplets as probability distributions over shared topics. Topic transformations from $C$ to $Q$ and $A$ are modeled as transition matrices $\mathbf{Q}$ and $\mathbf{A}$ encoding the query goal and actual result, respectively. Our semantic faithfulness (SF) metric quantifies faithfulness for any given QCA triplet by the Kullback-Leibler (KL) divergence between these matrices. Both matrices are inferred simultaneously via convex optimization of this KL divergence, and the final SF metric is obtained by mapping the minimal divergence onto the unit interval $[0,1]$, where higher scores indicate greater faithfulness. Furthermore, we propose a thermodynamics-based semantic entropy production (SEP) metric in answer generation, and show that high faithfulness generally implies low entropy production. The SF and SEP metrics can be used jointly or separately for LLM evaluation and hallucination control. We demonstrate our framework on LLM summarization of corporate SEC 10-K filings.
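The core scoring step described above can be illustrated with a minimal sketch: given two row-stochastic topic-transition matrices $\mathbf{A}$ and $\mathbf{Q}$, compute a KL divergence between them and map it onto $[0,1]$. This is not the paper's implementation (the paper infers both matrices jointly via convex optimization); the `exp(-KL)` mapping and the row-averaged KL are illustrative assumptions.

```python
import numpy as np

def row_kl(P, Q, eps=1e-12):
    """Mean row-wise KL divergence D(P_i || Q_i) between two
    row-stochastic transition matrices P and Q. A small eps keeps
    the logarithm finite when entries are zero."""
    P = np.asarray(P, dtype=float) + eps
    Q = np.asarray(Q, dtype=float) + eps
    P /= P.sum(axis=1, keepdims=True)
    Q /= Q.sum(axis=1, keepdims=True)
    return float(np.mean(np.sum(P * np.log(P / Q), axis=1)))

def semantic_faithfulness(A, Q):
    """Map a KL divergence onto [0, 1]; higher = more faithful.
    The exp(-KL) map is a placeholder assumption, not the paper's
    exact transformation of the minimal divergence."""
    return float(np.exp(-row_kl(A, Q)))
```

When the answer matrix reproduces the query matrix exactly, the divergence is zero and the score is 1; the score decays toward 0 as the answer's topic transitions drift from the query goal.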
Keywords: information theory, thermodynamics, Kullback-Leibler divergence, semantic faithfulness, convex optimization, Financial Text Analytics (SEC Filings)
Complexity vs Empirical Score
- Math Complexity: 7.5/10
- Empirical Rigor: 6.0/10
- Quadrant: Holy Grail
- Why: The paper employs advanced mathematical concepts including KL divergence, convex optimization, and thermodynamic entropy production, placing it in the high math complexity range. However, its empirical demonstration is limited to 10 SEC 10-K examples without code, backtesting, or statistical validation, placing it just above the threshold for high empirical rigor.
flowchart TD
A["Research Goal:<br>Evaluate LLM Faithfulness<br>without Supervision"] --> B["Data & Input:<br>Question-Context-Answer QCA triplets<br>from SEC 10-K Summarization"]
B --> C["Methodology:<br>Model QCA as probability distributions<br>and transition matrices via convex optimization"]
C --> D["Computational Processes:<br>1. Semantic Faithfulness SF<br>via KL divergence mapped to [0,1]<br>2. Semantic Entropy Production SEP<br>via thermodynamic analysis"]
D --> E["Key Outcomes:<br>SF & SEP metrics enable<br>unsupervised hallucination control<br>High Faithfulness correlates with Low Entropy"]