Construction of a Japanese Financial Benchmark for Large Language Models
ArXiv ID: 2403.15062
Authors: Unknown
Abstract
With the recent development of large language models (LLMs), the necessity of models that focus on specific domains and languages has been discussed, and there is a growing need for benchmarks that evaluate the performance of current LLMs in each domain. In this study, we therefore constructed a benchmark comprising multiple tasks specific to the Japanese and financial domains and measured the performance of several models on it. The results confirm that GPT-4 is currently outstanding and that the constructed benchmark functions effectively. According to our analysis, the benchmark can differentiate scores among models across all performance ranges by combining tasks of different difficulty.
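The abstract's key design claim, that combining tasks of different difficulty keeps the aggregate score discriminative across the whole performance range, is easy to illustrate. Below is a minimal Python sketch; the model names, task names, and accuracy values are hypothetical placeholders, not the paper's actual tasks or results.

```python
# Minimal sketch: aggregating per-task scores into one benchmark score.
# Model names, task names, and accuracies are illustrative placeholders,
# not the paper's actual tasks or results.
from statistics import mean

# Per-model accuracy on tasks of varying difficulty (hypothetical values).
results = {
    "model_a": {"easy_qa": 0.95, "medium_qa": 0.70, "hard_generation": 0.30},
    "model_b": {"easy_qa": 0.90, "medium_qa": 0.55, "hard_generation": 0.10},
    "model_c": {"easy_qa": 0.60, "medium_qa": 0.25, "hard_generation": 0.05},
}

def benchmark_score(task_scores: dict[str, float]) -> float:
    """Unweighted mean over tasks: easy tasks separate weak models,
    hard tasks separate strong ones."""
    return mean(task_scores.values())

# Rank models by aggregate score; the ranking stays well separated even
# though each individual task saturates at one end of the ability range.
for model, scores in sorted(results.items(),
                            key=lambda kv: benchmark_score(kv[1]),
                            reverse=True):
    print(f"{model}: {benchmark_score(scores):.3f}")
```

Because easy tasks still separate weak models while hard tasks remain unsaturated for strong ones, the averaged score differs between models at every level, which is the effect the abstract reports.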
Keywords: large language models, financial benchmarking, Japanese domain, natural language processing, AI evaluation
Complexity vs Empirical Score
- Math Complexity: 1.5/10
- Empirical Rigor: 6.0/10
- Quadrant: Street Traders
- Why: The paper centers on benchmark construction and model evaluation using existing datasets and public resources. Its empirical rigor is high, with careful data handling and backtest-like evaluation runs, while the mathematics involved is relatively low-level (mostly classification metrics and data processing).
```mermaid
flowchart TD
A["Research Goal:\nConstruct Japanese Financial Benchmark for LLMs"] --> B["Methodology:\nDevelop Multi-Task Dataset (QA, Generation, Summarization)"]
B --> C["Data & Inputs:\nJapanese Financial Documents & Financial QA"]
C --> D["Computational Process:\nEvaluate GPT-4 & Existing Models\nusing constructed tasks"]
D --> E["Outcome 1:\nGPT-4 Outperforms Other Models"]
D --> F["Outcome 2:\nBenchmark Differentiates Models\nAcross All Performance Ranges"]
```
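The computational process in node D of the flowchart amounts to a plain evaluation loop: run each model over each task and score the outputs. The sketch below shows what such a loop might look like for a multiple-choice QA task under exact-match scoring; the `evaluate_task` helper, the example format, and the stand-in model are assumptions for illustration, not the paper's actual harness.

```python
# Sketch of the evaluation loop in node D: score one model on one task.
# The example format, exact-match scoring, and the stand-in model below
# are assumptions for illustration, not the paper's actual harness.
from typing import Callable

def evaluate_task(query: Callable[[str], str], examples: list[dict]) -> float:
    """Exact-match accuracy over examples shaped like
    {"prompt": ..., "answer": ...}."""
    correct = sum(query(ex["prompt"]).strip() == ex["answer"]
                  for ex in examples)
    return correct / len(examples)

if __name__ == "__main__":
    # Two toy multiple-choice items; a real task would load hundreds.
    examples = [
        {"prompt": "Q1 ... Answer with A/B/C/D:", "answer": "A"},
        {"prompt": "Q2 ... Answer with A/B/C/D:", "answer": "C"},
    ]
    # Trivial stand-in "model" that always answers "A"; a real harness
    # would wrap an API call or local inference here.
    always_a = lambda prompt: "A"
    print(f"accuracy = {evaluate_task(always_a, examples):.2f}")  # 0.50
```

A real harness would also normalize answer formats and apply task-specific metrics for the generation and summarization tasks named in node B, but the loop structure stays the same.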