Data-driven measures of high-frequency trading
ArXiv ID: 2405.08101
Authors: Unknown
Abstract
High-frequency trading (HFT) accounts for almost half of equity trading volume, yet it is not identified in public data. We develop novel data-driven measures of HFT activity that separate strategies that supply and demand liquidity. We train machine learning models to predict HFT activity observed in a proprietary dataset using concurrent public intraday data. Once trained on the dataset, these models generate HFT measures for the entire U.S. stock universe from 2010 to 2023. Our measures outperform conventional proxies, which struggle to capture HFT’s time dynamics. We further validate them using shocks to HFT activity, including latency arbitrage, exchange speed bumps, and data feed upgrades. Finally, our measures reveal how HFT affects fundamental information acquisition. Liquidity-supplying HFTs improve price discovery around earnings announcements while liquidity-demanding strategies impede it.
Keywords: high-frequency trading, machine learning, liquidity provision, price discovery, market microstructure, equity
Complexity vs Empirical Score
- Math Complexity: 3.5/10
- Empirical Rigor: 8.0/10
- Quadrant: Street Traders
- Why: The paper applies machine learning (ensemble models) to prediction but emphasizes application and validation over mathematical theory. Its empirical rigor comes from comparisons against conventional proxies, validation via natural experiments (latency arbitrage, exchange speed bumps, data feed upgrades), and out-of-sample performance metrics covering 2010 to 2023 and the entire U.S. stock universe.
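
The core idea in the abstract (fit a model where HFT activity is observed, then apply it wherever public features exist) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the data are synthetic, the four feature columns are placeholders, and closed-form ridge regression stands in for the paper's machine learning models, which this summary does not specify in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical public intraday features for 500 stock-days.
# The columns are synthetic placeholders, not the paper's feature set.
n_obs, n_features = 500, 4
X_all = rng.normal(size=(n_obs, n_features))

# Synthetic "true" HFT activity. In the paper this is observed only in the
# proprietary dataset, so labels are treated as known for a subset of rows.
true_w = np.array([0.8, -0.3, 0.5, 0.0])
y_all = X_all @ true_w + rng.normal(scale=0.1, size=n_obs)
labeled = np.arange(n_obs) < 100  # first 100 rows stand in for the labeled sample


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def predict(X, w, b):
    """Apply the fitted linear model to any feature matrix."""
    return X @ w + b


# Train only on the labeled subset, then generate the measure for every row.
w, b = fit_ridge(X_all[labeled], y_all[labeled])
hft_measure = predict(X_all, w, b)

# Check the fit on rows the model never saw.
oos_corr = np.corrcoef(hft_measure[~labeled], y_all[~labeled])[0, 1]
print(f"out-of-sample correlation: {oos_corr:.3f}")
```

The split between `labeled` and unlabeled rows mirrors the paper's setup only loosely: the real labeled sample is a proprietary dataset, while the unlabeled rows correspond to the wider U.S. stock universe from 2010 to 2023.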
flowchart TD
A["Research Goal:<br>Identify & Measure HFT in Public Data"] --> B["Methodology:<br>Machine Learning Model Training"]
B --> C{"Data/Inputs"}
C --> C1["Proprietary HFT Dataset<br>Ground Truth Labels"]
C --> C2["Public Intraday Data<br>Features for Prediction"]
C1 & C2 --> D["Computational Process:<br>Train ML Models"]
D --> E["Generate HFT Measures<br>for Full US Stock Universe<br>(2010-2023)"]
E --> F{"Validation & Outcomes"}
F --> F1["Superior Performance<br>vs. Conventional Proxies"]
F --> F2["Validated via Shocks<br>Latency, Speed Bumps, Feeds"]
F --> F3["Key Finding:<br>Liquidity Supplying HFT<br>Improves Price Discovery"]
F --> F4["Key Finding:<br>Liquidity Demanding HFT<br>Impedes Price Discovery"]
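
The validation step (node F2) checks whether the measures respond to shocks that should change HFT activity. One common way to test such a response is a difference-in-differences comparison. The sketch below assumes that design purely for illustration; the summary does not state which estimator the paper uses, and the panel, the treated group, and the effect size are all synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic balanced panel of an HFT measure: 200 stocks over 40 days.
n_stocks, n_days, event_day = 200, 40, 20
treated = np.arange(n_stocks) < 100   # hypothetical: stocks exposed to the shock
post = np.arange(n_days) >= event_day  # days on or after the shock
true_effect = -0.5                     # synthetic drop in the measure after the shock

stock_fe = rng.normal(size=(n_stocks, 1))  # stock-level differences
day_fe = rng.normal(size=(1, n_days))      # market-wide daily movements
noise = rng.normal(scale=0.2, size=(n_stocks, n_days))
measure = stock_fe + day_fe + true_effect * np.outer(treated, post) + noise


def diff_in_diff(y, treated, post):
    """(treated post - treated pre) minus (control post - control pre)."""
    t_change = y[treated][:, post].mean() - y[treated][:, ~post].mean()
    c_change = y[~treated][:, post].mean() - y[~treated][:, ~post].mean()
    return t_change - c_change


estimate = diff_in_diff(measure, treated, post)
print(f"estimated shock effect: {estimate:.3f} (synthetic truth: {true_effect})")
```

Because the panel is balanced, stock-level differences cancel within each group and market-wide daily movements cancel across groups, so the estimate isolates the synthetic shock.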