INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent
ArXiv ID: 2412.18174
Authors: Unknown
Abstract
Recent advancements have underscored the potential of large language model (LLM)-based agents in financial decision-making. Despite this progress, the field currently encounters two main challenges: (1) the lack of a comprehensive LLM agent framework adaptable to a variety of financial tasks, and (2) the absence of standardized benchmarks and consistent datasets for assessing agent performance. To tackle these issues, we introduce InvestorBench, the first benchmark specifically designed for evaluating LLM-based agents in diverse financial decision-making contexts. InvestorBench enhances the versatility of LLM-enabled agents by providing a comprehensive suite of tasks applicable to different financial products, including single equities (stocks), cryptocurrencies, and exchange-traded funds (ETFs). Additionally, we assess the reasoning and decision-making capabilities of our agent framework using thirteen different LLMs as backbone models, across various market environments and tasks. Furthermore, we have curated a diverse collection of open-source, multi-modal datasets and developed a comprehensive suite of environments for financial decision-making, establishing a highly accessible platform for evaluating financial agents' performance across various scenarios.
Keywords: LLM Agents, Benchmarking, Reinforcement Learning, Multi-modal Data, Financial Decision Making, Multi-Asset (Equities, Crypto, ETFs)
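The abstract describes the agent-environment setup only at a high level. As a minimal sketch of what a gym-style decision loop over multi-modal market observations could look like, here is an illustrative Python example; the names `MarketEnv` and `llm_decide` are hypothetical and not part of the InvestorBench API:

```python
# Minimal sketch of an LLM trading-agent loop, assuming a gym-style
# environment. `MarketEnv` and `llm_decide` are hypothetical names,
# not the InvestorBench API.
from dataclasses import dataclass

ACTIONS = ("BUY", "HOLD", "SELL")

@dataclass
class MarketEnv:
    prices: list[float]   # daily close prices for one asset
    news: list[str]       # aligned textual context (e.g. headlines)
    t: int = 0
    position: int = 0     # -1 short, 0 flat, +1 long

    def observe(self) -> dict:
        return {"price": self.prices[self.t], "news": self.news[self.t]}

    def step(self, action: str) -> tuple[dict, float, bool]:
        assert action in ACTIONS
        self.position = {"BUY": 1, "HOLD": self.position, "SELL": -1}[action]
        self.t += 1
        ret = self.prices[self.t] / self.prices[self.t - 1] - 1.0
        reward = self.position * ret           # position times next-step return
        done = self.t >= len(self.prices) - 1
        return self.observe(), reward, done

def llm_decide(observation: dict) -> str:
    """Stub for a call to an LLM backbone that maps the observation
    (price plus news text) to one of BUY/HOLD/SELL."""
    return "HOLD"

env = MarketEnv(prices=[100.0, 101.5, 99.8, 102.3], news=["..."] * 4)
obs, total, done = env.observe(), 0.0, False
while not done:
    obs, reward, done = env.step(llm_decide(obs))
    total += reward
print(f"cumulative reward: {total:+.4f}")
```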
Complexity vs Empirical Score
- Math Complexity: 2.5/10
- Empirical Rigor: 8.0/10
- Quadrant: Street Traders
- Why: The paper centers on constructing a practical benchmark (InvestorBench) and validating it empirically with real datasets and multiple LLMs, which gives it strong empirical rigor; its mathematical complexity is low because it primarily describes agent architecture rather than dense theoretical derivations.
```mermaid
flowchart TD
    A["Research Goal<br>Create comprehensive benchmark for LLM agents<br>in diverse financial decision-making tasks"] --> B["Methodology<br>Develop InvestorBench: multi-asset benchmark<br>with standardized environments"]
    B --> C["Data & Inputs<br>Curated open-source multi-modal datasets<br>(Stocks, Crypto, ETFs)"]
    C --> D["Computational Process<br>Evaluate 13 LLM backbone models<br>across varying market conditions"]
    D --> E["Key Outcomes<br>Standardized benchmark framework<br>Established accessible evaluation platform<br>Assessed reasoning capabilities"]
```