UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
arXiv ID: 2410.14059
Authors: Unknown
Abstract
This paper introduces the UCFE (User-Centric Financial Expertise) benchmark, a framework designed to evaluate the ability of large language models (LLMs) to handle complex, real-world financial tasks. The benchmark adopts a hybrid approach that combines human expert evaluations with dynamic, task-specific interactions to simulate the complexities of evolving financial scenarios. First, we conducted a user study involving 804 participants to collect feedback on financial tasks. Second, based on this feedback, we built a dataset that encompasses a wide range of user intents and interactions. This dataset serves as the foundation for benchmarking 11 LLM services using the LLM-as-Judge methodology. Our results show a significant alignment between benchmark scores and human preferences, with a Pearson correlation coefficient of 0.78, confirming the effectiveness of the UCFE dataset and our evaluation approach. The UCFE benchmark not only reveals the potential of LLMs in the financial domain but also provides a robust framework for assessing their performance and user satisfaction.
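The LLM-as-Judge step described in the abstract can be pictured as pairwise comparisons scored by a judge model and aggregated into per-model scores. The sketch below is a minimal illustration of that idea, not the authors' implementation: the prompt wording, the `ask_judge` callable, and the simple win-rate aggregation are all assumptions made for demonstration.

```python
from collections import defaultdict
from typing import Callable

def judge_prompt(task: str, answer_a: str, answer_b: str) -> str:
    """Build a pairwise LLM-as-Judge prompt (wording is illustrative only)."""
    return (
        "You are a financial expert acting as an impartial judge.\n"
        f"Task: {task}\n"
        f"Response A: {answer_a}\n"
        f"Response B: {answer_b}\n"
        "Which response better satisfies the user's intent? Answer 'A' or 'B'."
    )

def pairwise_win_rates(
    comparisons: list[dict],              # each: {"task", "model_a", "model_b", "answer_a", "answer_b"}
    ask_judge: Callable[[str], str],      # caller-supplied function that queries the judge model
) -> dict[str, float]:
    """Aggregate judge verdicts into per-model win rates."""
    wins, totals = defaultdict(int), defaultdict(int)
    for c in comparisons:
        verdict = ask_judge(judge_prompt(c["task"], c["answer_a"], c["answer_b"])).strip().upper()
        winner = c["model_a"] if verdict.startswith("A") else c["model_b"]
        wins[winner] += 1
        totals[c["model_a"]] += 1
        totals[c["model_b"]] += 1
    return {m: wins[m] / totals[m] for m in totals}

if __name__ == "__main__":
    # Stub judge that prefers the longer answer, only so the sketch runs end to end.
    demo = [{"task": "Summarize this earnings report.", "model_a": "model-x", "model_b": "model-y",
             "answer_a": "Short summary.", "answer_b": "A longer, more detailed summary."}]
    print(pairwise_win_rates(demo, lambda p: "B" if "longer" in p else "A"))
```

The paper's actual judging prompts and aggregation scheme may differ; a win rate is just one straightforward way to turn pairwise verdicts into a ranking of the 11 evaluated LLM services.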
Keywords: Large Language Models (LLMs), Financial Benchmarking, Human-AI Alignment, Task-Specific Interactions, LLM-as-Judge, General Financial Technology
Complexity vs Empirical Score
- Math Complexity: 2.0/10
- Empirical Rigor: 8.0/10
- Quadrant: Street Traders
- Why: The paper introduces a new evaluation benchmark with a user study, dataset creation, and a rigorous correlation analysis (Pearson r=0.78), but contains almost no advanced mathematics or formal modeling.
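The Pearson r = 0.78 cited above is the standard correlation between benchmark scores and human preference scores across models. The sketch below shows the computation on made-up numbers (not the paper's data) purely to make the quantity concrete.

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical benchmark scores vs. human preference scores for five models.
benchmark = [0.62, 0.71, 0.55, 0.80, 0.67]
human_pref = [0.60, 0.75, 0.50, 0.78, 0.70]
print(round(pearson_r(benchmark, human_pref), 2))  # ~0.95 on these toy numbers; the paper reports 0.78 on its real data
```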
```mermaid
flowchart TD
    A["Research Goal<br>Evaluate LLM Financial Expertise<br>User-Centric Approach"] --> B["Methodology: User Study<br>804 Participants & Feedback"]
    B --> C["Methodology: Dataset Creation<br>Hybrid User-Intent & Interaction Data"]
    C --> D["Benchmarking Process<br>11 LLM Services Evaluated"]
    D --> E["Computational Process<br>LLM-as-Judge Methodology"]
    E --> F["Key Outcome<br>High Alignment: Pearson r=0.78"]
    F --> G["Final Contribution<br>UCFE Framework & Benchmark"]
```