A statistical technique for cleaning option price data

ArXiv ID: 2501.11164

Authors: Unknown

Abstract

Recorded option pricing datasets are not always freely available. Additionally, these datasets often contain numerous prices that are either higher or lower than can reasonably be expected. Various reasons for these unexpected observations are possible, including human error in recording the details of the option in question. For analyses performed on these datasets to be reliable, it is necessary to identify and remove these options from the dataset. In this paper, we list three distinct problems often found in recorded option price datasets, along with means of addressing them. The methods used are justified using sound statistical reasoning and remove option prices that violate the standard assumption of no arbitrage. An attractive aspect of the proposed technique is that no option pricing model-based assumptions are used. Although the discussion is restricted to European options, the procedure is easily modified for use with exotic options as well. As a final contribution, the paper contains a link to six option pricing datasets that have already been cleaned using the proposed methods and can be freely used by researchers.

Keywords: option pricing datasets, arbitrage detection, data cleaning, statistical filtering, European options

Complexity vs Empirical Score

  • Math Complexity: 4.0/10
  • Empirical Rigor: 7.0/10
  • Quadrant: Street Traders
  • Why: The paper relies on standard statistical reasoning and arbitrage bounds (equations 1 and 2) rather than advanced mathematical derivations, so its mathematical complexity is moderate. Its empirical rigor is high: it provides specific datasets, details the data source (Yahoo Finance), specifies the processing tool (Matlab), and outlines concrete cleaning steps that are ready for implementation.
  flowchart TD
    A["Research Goal: Clean Option Pricing Data"] --> B["Identify Three Problems"]
    B --> C["Apply Statistical Arbitrage Tests"]
    C --> D{"Price Violates No-Arbitrage?"}
    D -- Yes --> E["Flag as Outlier"]
    D -- No --> F["Retain Data Point"]
    E --> G["Outcome: Cleaned Datasets"]
    F --> G
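
The flag-or-retain step in the flowchart can be sketched in a few lines of Python. This is only an illustration, not the paper's procedure: the paper's equations 1 and 2 are not reproduced on this page, so the sketch assumes the textbook no-arbitrage bounds for European options on a non-dividend-paying asset, namely max(S - K·exp(-rT), 0) ≤ C ≤ S for calls and max(K·exp(-rT) - S, 0) ≤ P ≤ K·exp(-rT) for puts. The function names, the record fields (`type`, `spot`, `strike`, `maturity`, `price`) and the single constant interest rate are all hypothetical choices made for the example, and the paper itself works in Matlab rather than Python.

```python
import math


def call_bounds(spot, strike, rate, maturity):
    """Assumed textbook bounds for a European call (no dividends)."""
    lower = max(spot - strike * math.exp(-rate * maturity), 0.0)
    upper = spot
    return lower, upper


def put_bounds(spot, strike, rate, maturity):
    """Assumed textbook bounds for a European put (no dividends)."""
    discounted_strike = strike * math.exp(-rate * maturity)
    lower = max(discounted_strike - spot, 0.0)
    upper = discounted_strike
    return lower, upper


def split_by_arbitrage(records, rate):
    """Split option records into (retained, flagged) lists.

    Each record is a dict with the hypothetical keys
    'type' ('call' or 'put'), 'spot', 'strike', 'maturity' (years), 'price'.
    A record is flagged when its price falls outside the assumed bounds.
    """
    retained, flagged = [], []
    for rec in records:
        bounds = call_bounds if rec["type"] == "call" else put_bounds
        lower, upper = bounds(rec["spot"], rec["strike"], rate, rec["maturity"])
        if lower <= rec["price"] <= upper:
            retained.append(rec)
        else:
            flagged.append(rec)
    return retained, flagged
```

Calling `split_by_arbitrage(records, rate=0.02)` on a list of such dicts returns the observations to keep and the ones to inspect or drop. Keeping the flagged list separate, rather than deleting rows in place, makes it possible to check afterwards which prices were rejected and why.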