LOB-Bench: Benchmarking Generative AI for Finance – an Application to Limit Order Book Data

arXiv ID: 2502.09172

Authors: Unknown

Abstract

While financial data presents one of the most challenging and interesting sequence modelling tasks due to high noise, heavy tails, and strategic interactions, progress in this area has been hindered by the lack of consensus on quantitative evaluation paradigms. To address this, we present LOB-Bench, a benchmark, implemented in Python, designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB) in the LOBSTER format. Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data, supporting flexible multivariate statistical evaluation. The benchmark features commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times, along with scores from a trained discriminator network. Lastly, LOB-Bench contains "market impact metrics", i.e., the cross-correlations and price response functions for specific events in the data. We benchmark generative autoregressive state-space models, a (C)GAN, as well as a parametric LOB model, and find that the autoregressive GenAI approach beats traditional model classes.
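
To give a concrete flavour of the distributional evaluation described in the abstract, the sketch below compares two commonly used LOB statistics (bid-ask spread and level-1 order imbalance) between real and generated data using a 1-Wasserstein distance. This is a minimal illustration, not the LOB-Bench API: the column layout of the book snapshots, the choice of the Wasserstein metric, and the SciPy dependency are assumptions made for the example.

```python
# Minimal sketch of the distributional-evaluation idea, NOT the LOB-Bench API.
# Assumed snapshot layout: columns [ask_price, bid_price, ask_vol, bid_vol].
import numpy as np
from scipy.stats import wasserstein_distance


def spread(book: np.ndarray) -> np.ndarray:
    """Bid-ask spread per snapshot."""
    return book[:, 0] - book[:, 1]


def order_imbalance(book: np.ndarray) -> np.ndarray:
    """Level-1 volume imbalance in [-1, 1]."""
    ask_vol, bid_vol = book[:, 2], book[:, 3]
    return (bid_vol - ask_vol) / (bid_vol + ask_vol)


def distributional_score(real: np.ndarray, generated: np.ndarray) -> dict:
    """Wasserstein-1 distance between real and generated statistic distributions."""
    return {
        "spread": wasserstein_distance(spread(real), spread(generated)),
        "imbalance": wasserstein_distance(order_imbalance(real),
                                          order_imbalance(generated)),
    }


if __name__ == "__main__":
    # Toy data standing in for real vs. model-generated order book snapshots.
    rng = np.random.default_rng(0)

    def toy_book(spread_scale: float, n: int = 1000) -> np.ndarray:
        mid = 100.0 + rng.normal(0.0, 0.1, size=n)
        spr = rng.exponential(spread_scale, size=n)
        vols = rng.exponential(500.0, size=(n, 2))
        return np.column_stack([mid + spr / 2, mid - spr / 2, vols])

    print(distributional_score(toy_book(0.02), toy_book(0.03)))
```

The actual benchmark aggregates many such statistics, both unconditional and conditioned on market state; the sketch above only shows the unconditional case for two of them.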

Keywords: Limit Order Books (LOB), benchmarking, generative models, market impact metrics, Stocks

Complexity vs Empirical Score

  • Math Complexity: 6.5/10
  • Empirical Rigor: 7.0/10
  • Quadrant: Holy Grail
  • Why: The paper introduces a benchmark with a non-trivial mathematical formulation of its statistical and distributional evaluation metrics, while providing an open-source Python implementation, specific datasets (GOOG, INTC), and backtesting-ready empirical results comparing generative models.
  flowchart TD
    A["Research Goal: Develop a benchmark for generative models in finance"] --> B["Data Source: LOBSTER format Limit Order Book data"]
    B --> C["Methodology: LOB-Bench Framework"]
    C --> D["Computational Metrics: Distributional differences, LOB statistics, Market impact"]
    D --> E["Evaluation: (C)GAN, Autoregressive models, Parametric models"]
    E --> F["Findings: Autoregressive GenAI outperforms traditional models"]
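
The "Market impact" node above refers to the cross-correlation and price response functions mentioned in the abstract. The sketch below estimates an empirical price response function R(l), i.e., the average signed mid-price change l events after an order, as one hedged interpretation of such a metric; the event-sign convention (+1 buy, -1 sell) and the lag grid are assumptions, not the LOB-Bench implementation.

```python
# Hedged sketch of an empirical price response function R(l),
# in the spirit of the "market impact metrics" described above.
import numpy as np


def price_response(mid_price: np.ndarray, event_sign: np.ndarray,
                   lags: range = range(1, 51)) -> np.ndarray:
    """R(l) = E[(m_{t+l} - m_t) * sign_t], averaged over all events t."""
    responses = []
    for lag in lags:
        dm = mid_price[lag:] - mid_price[:-lag]
        responses.append(np.mean(dm * event_sign[:-lag]))
    return np.array(responses)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    signs = rng.choice([-1.0, 1.0], size=5000)
    # Toy mid-price path with a small persistent impact per signed event.
    mid = 100.0 + np.cumsum(0.001 * signs + rng.normal(0.0, 0.01, size=5000))
    print(price_response(mid, signs)[:5])
```

Comparing such response curves computed on real versus generated message streams is one way to check whether a generative model reproduces realistic impact dynamics, complementing the purely distributional scores.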