Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models

ArXiv ID: 2508.10192 (https://arxiv.org/abs/2508.10192)

Author: Igor Halperin

Abstract

The proliferation of Large Language Models (LLMs) is challenged by hallucinations, critical failure modes where models generate non-factual, nonsensical, or unfaithful text. This paper introduces Semantic Divergence Metrics (SDM), a novel lightweight framework for detecting Faithfulness Hallucinations, i.e., severe deviations of LLM responses from their input contexts. We focus on a specific manifestation of these LLM errors, confabulations, defined as responses that are arbitrary and semantically misaligned with the user's query. Existing methods like Semantic Entropy test for arbitrariness by measuring the diversity of answers to a single, fixed prompt. Our SDM framework improves upon this by being more prompt-aware: we test for a deeper form of arbitrariness by measuring response consistency not only across multiple answers but also across multiple, semantically equivalent paraphrases of the original prompt. Methodologically, our approach uses joint clustering on sentence embeddings to create a shared topic space for prompts and answers. A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue. We then compute a suite of information-theoretic metrics to measure the semantic divergence between prompts and responses. Our practical score, $\mathcal{S}_H$, combines the Jensen-Shannon divergence and the Wasserstein distance to quantify this divergence, with a high score indicating a Faithfulness hallucination. Furthermore, we identify the KL divergence KL(Answer || Prompt) as a powerful indicator of "Semantic Exploration," a key signal for distinguishing different generative behaviors. These metrics are further combined into the Semantic Box, a diagnostic framework for classifying LLM response types, including the dangerous, confident confabulation.
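
The joint-clustering step described above is straightforward to prototype. The minimal sketch below assumes choices the abstract does not specify: the sentence-transformers model all-MiniLM-L6-v2 for embeddings, k-means for clustering, and eight topics; the helper name `topic_distributions` and its defaults are likewise illustrative, and the paper's actual components may differ.

```python
# Minimal sketch of the joint-clustering step. Assumptions (not taken from
# the paper): "all-MiniLM-L6-v2" embeddings, k-means clustering, n_topics=8.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def topic_distributions(prompt_sentences, answer_sentences, n_topics=8, seed=0):
    """Embed both sides, cluster them jointly into one shared topic space,
    and return each side's empirical distribution over the topics."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(prompt_sentences + answer_sentences)

    # Joint clustering: prompt and answer sentences share one label space,
    # which is what makes the two topic distributions directly comparable.
    labels = KMeans(n_clusters=n_topics, random_state=seed,
                    n_init="auto").fit_predict(embeddings)
    p_labels = labels[: len(prompt_sentences)]
    a_labels = labels[len(prompt_sentences):]

    # Normalized topic histograms for the prompt side and the answer side.
    p_dist = np.bincount(p_labels, minlength=n_topics) / len(p_labels)
    a_dist = np.bincount(a_labels, minlength=n_topics) / len(a_labels)
    return p_dist, a_dist
```

Passing the sentences of all prompt paraphrases as `prompt_sentences` and the sentences of all sampled answers as `answer_sentences` reproduces the prompt-aware setup; counting joint topic labels across paraphrase-answer pairs would yield the co-occurrence heatmap described above.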

Keywords: Hallucination detection, Semantic divergence metrics, Large Language Models, Jensen-Shannon divergence, Wasserstein distance, Technology/AI

Complexity vs Empirical Score

  • Math Complexity: 6.5/10
  • Empirical Rigor: 3.0/10
  • Quadrant: Lab Rats
  • Why: The paper introduces several information-theoretic metrics (e.g., Jensen-Shannon divergence, Wasserstein distance, KL divergence) and uses joint clustering on embeddings, so the math is dense, but it lacks backtest-ready datasets, code, or heavy statistical validation for real-world trading applications.

```mermaid
flowchart TD
  A["Research Goal<br>Detect Faithfulness Hallucinations<br>& Confabulations"] --> B["Data Input<br>Dataset of LLM Responses<br>to Original & Paraphrased Prompts"]
  B --> C["Core Methodology<br>Joint Clustering of Sentence Embeddings<br>to Create Shared Topic Space"]
  C --> D["Computation: Heatmap<br>Visualize Topic Co-occurrences<br>between Prompts & Responses"]
  D --> E["Computation: Metrics<br>Calculate JSD, Wasserstein Distance<br>& KL(Answer || Prompt)"]
  E --> F["Outcome: Score S_H<br>Quantifies Semantic Divergence<br>High S_H = Faithfulness Hallucination"]
  E --> G["Outcome: Semantic Box<br>Diagnostic Framework<br>Classifies Response Types<br>e.g., Confident Confabulation"]
```