
Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models

ArXiv ID: 2508.10192 · View on arXiv
Authors: Igor Halperin
Abstract: The proliferation of Large Language Models (LLMs) is challenged by hallucinations, critical failure modes where models generate non-factual, nonsensical, or unfaithful text. This paper introduces Semantic Divergence Metrics (SDM), a novel lightweight framework for detecting Faithfulness Hallucinations, i.e., events of severe deviation of LLM responses from their input contexts. We focus on a specific implementation of these LLM errors, confabulations, defined as responses that are arbitrary and semantically misaligned with the user's query. Existing methods like Semantic Entropy test for arbitrariness by measuring the diversity of answers to a single, fixed prompt. Our SDM framework improves upon this by being more prompt-aware: we test for a deeper form of arbitrariness by measuring response consistency not only across multiple answers but also across multiple, semantically equivalent paraphrases of the original prompt. Methodologically, our approach uses joint clustering on sentence embeddings to create a shared topic space for prompts and answers. A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue. We then compute a suite of information-theoretic metrics to measure the semantic divergence between prompts and responses. Our practical score, $\mathcal{S}_H$, combines the Jensen-Shannon divergence and Wasserstein distance to quantify this divergence, with a high score indicating a Faithfulness hallucination. Furthermore, we identify the KL divergence KL(Answer $||$ Prompt) as a powerful indicator of Semantic Exploration, a key signal for distinguishing different generative behaviors. These metrics are further combined into the Semantic Box, a diagnostic framework for classifying LLM response types, including the dangerous, confident confabulation. ...
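The pipeline described in the abstract can be made concrete with a small sketch: embed the sentences of the prompt paraphrases and the sampled answers, cluster them jointly into a shared topic space, and compare the resulting topic distributions. The embedding backend, number of topics, mixing weight, and the exact combination rule for $\mathcal{S}_H$ below are illustrative assumptions, not the paper's reference implementation.

```python
# Hedged sketch of the SDM idea (assumed interfaces, not the paper's code).
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance
from sentence_transformers import SentenceTransformer  # assumed embedding backend


def topic_distribution(labels, n_topics):
    """Normalized histogram of cluster labels over the shared topic space."""
    counts = np.bincount(labels, minlength=n_topics).astype(float)
    return counts / counts.sum()


def sdm_scores(prompt_sentences, answer_sentences, n_topics=8, alpha=0.5):
    """Compute JSD, Wasserstein, KL(Answer || Prompt), and a combined score S_H."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    prompt_emb = model.encode(prompt_sentences)
    answer_emb = model.encode(answer_sentences)

    # Joint clustering: one topic space shared by prompt and answer sentences.
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=0)
    labels = km.fit_predict(np.vstack([prompt_emb, answer_emb]))
    p = topic_distribution(labels[: len(prompt_sentences)], n_topics)  # prompt topics
    q = topic_distribution(labels[len(prompt_sentences):], n_topics)   # answer topics

    eps = 1e-12  # smoothing so the KL term stays finite
    jsd = jensenshannon(p, q, base=2) ** 2  # scipy returns the JS distance; square it for the divergence
    # Treat cluster indices as a 1-D ordinal axis (a simplification for illustration).
    wass = wasserstein_distance(np.arange(n_topics), np.arange(n_topics), p, q)
    kl_a_p = float(np.sum((q + eps) * np.log((q + eps) / (p + eps))))  # KL(Answer || Prompt)

    s_h = alpha * jsd + (1 - alpha) * wass  # illustrative combination; the weighting is an assumption
    return {"JSD": jsd, "Wasserstein": wass, "KL(A||P)": kl_a_p, "S_H": s_h}
```

A high combined score flags responses whose topic distribution has drifted far from the prompt's, while a large KL(Answer || Prompt) alone signals the answer covering topics the prompt barely touches, the "Semantic Exploration" regime noted above.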

August 13, 2025 · 2 min · Research Team

A First Look at Financial Data Analysis Using ChatGPT-4o

ArXiv ID: ssrn-4849578 · View on arXiv
Authors: Unknown
Abstract: OpenAI's new flagship model, ChatGPT-4o, released on May 13, 2024, offers enhanced natural language understanding and more coherent responses. In this paper, we ...
Keywords: Large Language Models (LLMs), Natural Language Processing, Generative AI, AI Evaluation, Model Performance, Technology/AI
Complexity vs Empirical Score: Math Complexity: 4.0/10 · Empirical Rigor: 6.5/10 · Quadrant: Street Traders
Why: The paper involves implementing and comparing specific financial models like ARMA-GARCH, indicating moderate-to-high implementation complexity, but the core mathematics is largely descriptive and comparative rather than novel. Empirical rigor is high due to the use of real datasets (CRSP, Fama-French) and direct backtesting comparisons against Stata.

flowchart TD
  A["Research Goal: Evaluate ChatGPT-4o for Financial Data Analysis"] --> B["Methodology: Zero-shot vs. Chain-of-Thought"]
  B --> C["Input: Financial Statements & Market Data"]
  C --> D["Process: Text Generation & Sentiment Analysis"]
  D --> E["Output: Financial Predictions & Explanations"]
  E --> F["Key Findings: High Accuracy in NLP Tasks"]
  F --> G["Outcome: Strong Potential but Limited Numerical Reasoning"]
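As a rough illustration of the zero-shot vs. chain-of-thought comparison in the flowchart, the sketch below sends the same financial statement through both prompting styles via the OpenAI chat completions API. The prompts, example statement, and model settings are assumptions for illustration, not the paper's protocol.

```python
# Minimal sketch: zero-shot vs. chain-of-thought prompting of a chat model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STATEMENT = "Revenue grew 12% YoY while operating margin contracted by 150 bps."


def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content


# Zero-shot: ask for the label directly.
zero_shot = ask(
    "Classify the sentiment of this financial statement as positive, negative, "
    f"or neutral:\n{STATEMENT}"
)

# Chain-of-thought: ask the model to reason step by step before answering.
cot = ask(
    "Classify the sentiment of this financial statement as positive, negative, "
    f"or neutral. Think step by step, then give the final label.\n{STATEMENT}"
)

print("Zero-shot:", zero_shot)
print("Chain-of-thought:", cot)
```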

May 31, 2024 · 1 min · Research Team

FinBERT - A Large Language Model for Extracting Information from Financial Text

ArXiv ID: ssrn-3910214 · View on arXiv
Authors: Unknown
Abstract: We develop FinBERT, a state-of-the-art large language model that adapts to the finance domain. We show that FinBERT incorporates finance knowledge and can better ...
Keywords: FinBERT, Natural Language Processing, Large Language Models, Financial Text Analysis, Technology/AI
Complexity vs Empirical Score: Math Complexity: 2.0/10 · Empirical Rigor: 8.0/10 · Quadrant: Street Traders
Why: The paper focuses on fine-tuning a pre-existing transformer model (FinBERT) with specific financial datasets, which is primarily an empirical, implementation-heavy task with significant data preparation and evaluation metrics, while the underlying mathematics is standard deep learning rather than novel or dense derivations.

flowchart TD
  A["Research Goal:<br>Create domain-adapted LLM for finance"] --> B["Data:<br>Financial Documents & Corpora"]
  B --> C["Preprocessing:<br>Tokenization & Formatting"]
  C --> D["Core Methodology:<br>BERT Architecture Adaptation"]
  D --> E["Training:<br>Domain-specific Fine-tuning"]
  E --> F["Evaluation:<br>Benchmark Testing"]
  F --> G["Outcome:<br>FinBERT Model"]
  F --> H["Outcome:<br>Improved Performance vs. General LLMs"]
  G --> I["Final Result:<br>State-of-the-art Financial NLP"]
  H --> I
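In the spirit of the pipeline sketched in the flowchart, the following is a minimal, hypothetical fine-tuning sketch: adapt a generic BERT checkpoint to financial sentiment classification with Hugging Face transformers. The checkpoint, toy data, label scheme, and hyperparameters are illustrative assumptions rather than the paper's actual training setup.

```python
# Hedged sketch: tokenization -> BERT adaptation -> fine-tuning -> evaluation.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"  # stand-in; a domain-adapted checkpoint would go here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Toy financial sentences with labels 0=negative, 1=neutral, 2=positive (illustrative only).
raw = Dataset.from_dict({
    "text": [
        "The company reported a sharp decline in quarterly earnings.",
        "The board will meet next Tuesday to review the agenda.",
        "Revenue grew 18% year over year, beating analyst expectations.",
        "Margins compressed amid rising input costs.",
    ],
    "label": [0, 1, 2, 0],
})


def tokenize(batch):
    # Pad and truncate so the Trainer can batch examples directly.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)


splits = raw.map(tokenize, batched=True).train_test_split(test_size=0.25, seed=0)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finbert-sketch",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
)
trainer.train()
print(trainer.evaluate())
```

In practice the toy dataset would be replaced by a labeled financial corpus and the evaluation step by the benchmark comparisons against general-purpose models that the entry describes.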

August 27, 2021 · 1 min · Research Team