FinReflectKG - EvalBench: Benchmarking Financial KG with Multi-Dimensional Evaluation
ArXiv ID: 2510.05710
Authors: Fabrizio Dimino, Abhinav Arun, Bhaskarjit Sarmah, Stefano Pasquali
Abstract
Large language models (LLMs) are increasingly used to extract structured knowledge from unstructured financial text. Although prior studies have explored various extraction methods, there is no universal benchmark or unified evaluation framework for financial knowledge graph (KG) construction. We introduce FinReflectKG - EvalBench, a benchmark and evaluation framework for KG extraction from SEC 10-K filings. Building on the agentic and holistic evaluation principles of FinReflectKG - a financial KG linking audited triples to source chunks from S&P 100 filings and supporting single-pass, multi-pass, and reflection-agent-based extraction modes - EvalBench implements a deterministic commit-then-justify judging protocol with explicit bias controls that mitigate position effects, leniency, verbosity, and world-knowledge reliance. Each candidate triple receives binary judgments of faithfulness, precision, and relevance, while comprehensiveness is assessed on a three-level ordinal scale (good, partial, bad) at the chunk level. Our findings suggest that, when equipped with explicit bias controls, LLM-as-Judge protocols provide a reliable and cost-efficient alternative to human annotation while also enabling structured error analysis. Reflection-based extraction emerges as the superior approach, achieving the best performance in comprehensiveness, precision, and relevance, while single-pass extraction maintains the highest faithfulness. By aggregating these complementary dimensions, FinReflectKG - EvalBench enables fine-grained benchmarking and bias-aware evaluation, advancing transparency and governance in financial AI applications.
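The scoring scheme in the abstract (per-triple binary judgments plus a chunk-level three-level ordinal label) can be sketched as a simple aggregation. This is a minimal illustrative sketch, not the paper's implementation: the class, field names, and the 1.0/0.5/0.0 mapping of the ordinal labels are assumptions for demonstration only.

```python
from dataclasses import dataclass

@dataclass
class TripleJudgment:
    """Hypothetical record of the three binary verdicts for one extracted triple."""
    faithfulness: bool  # supported by the source chunk?
    precision: bool     # entities and relation stated exactly, no added detail?
    relevance: bool     # financially material to the filing?

# Assumed numeric mapping for the 3-level ordinal comprehensiveness scale.
COMPREHENSIVENESS = {"good": 1.0, "partial": 0.5, "bad": 0.0}

def aggregate(judgments: list[TripleJudgment], chunk_label: str) -> dict[str, float]:
    """Turn triple-level binary judgments and a chunk-level ordinal label
    into per-dimension rates for one source chunk."""
    n = len(judgments)
    return {
        "faithfulness": sum(j.faithfulness for j in judgments) / n,
        "precision": sum(j.precision for j in judgments) / n,
        "relevance": sum(j.relevance for j in judgments) / n,
        "comprehensiveness": COMPREHENSIVENESS[chunk_label],
    }

scores = aggregate(
    [TripleJudgment(True, True, True), TripleJudgment(True, False, True)],
    "partial",
)
print(scores["precision"])  # 0.5
```

The binary dimensions average naturally across triples, while comprehensiveness stays a single chunk-level value, matching the two granularities the benchmark distinguishes.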
Keywords: Financial knowledge graphs, LLM evaluation, SEC filings, LLM-as-Judge, Reflection extraction, General Financial Data
Complexity vs Empirical Score
- Math Complexity: 3.0/10
- Empirical Rigor: 7.5/10
- Quadrant: Street Traders
- Why: The paper focuses on benchmarking methodology and evaluation protocols with minimal advanced mathematics, relying instead on descriptive statistics and scoring metrics, while demonstrating high empirical rigor through a reproducible benchmark on real SEC filings with bias controls and multiple extraction modes.
```mermaid
flowchart TD
A["Research Goal<br>Develop unified KG<br>evaluation benchmark"] --> B["Data<br>SEC 10-K filings from<br>S&P 100 companies"]
B --> C["Methodology<br>FinReflectKG EvalBench<br>Deterministic commit-then-justify protocol"]
C --> D["Bias Controls<br>Position, Leniency,<br>Verbosity, World-Knowledge"]
D --> E["Evaluation Dimensions<br>Faithfulness (Binary)<br>Precision (Binary)<br>Relevance (Binary)<br>Comprehensiveness (3-level)"]
E --> F["Key Findings<br>1. Reflection extraction superior<br>2. LLM-as-Judge reliable with bias controls<br>3. Single-pass best for faithfulness"]
```