Exploring Microstructural Dynamics in Cryptocurrency Limit Order Books: Better Inputs Matter More Than Stacking Another Hidden Layer
arXiv ID: 2506.05764
Authors: Haochuan Wang
Abstract
Cryptocurrency price dynamics are driven largely by microstructural supply-demand imbalances in the limit order book (LOB), yet the highly noisy nature of LOB data complicates signal extraction. Prior research has demonstrated that deep-learning architectures can yield promising predictive performance on pre-processed equity and futures LOB data, but such studies often treat model complexity as an unqualified virtue. In this paper, we examine whether adding extra hidden layers or parameters to “black-box” neural networks genuinely enhances short-term price forecasting, or whether the gains are primarily attributable to data preprocessing and feature engineering. We benchmark a spectrum of models, from interpretable baselines (logistic regression, XGBoost) to deep architectures (DeepLOB, Conv1D+LSTM), on BTC/USDT LOB snapshots sampled at 100 ms to multi-second intervals using publicly available Bybit data. We introduce two data-filtering pipelines (Kalman, Savitzky-Golay) and evaluate both binary (up/down) and ternary (up/flat/down) labeling schemes. Our analysis compares models on out-of-sample accuracy, latency, and robustness to noise. Results reveal that, with data preprocessing and hyperparameter tuning, simpler models can match and even exceed the performance of more complex networks, while offering faster inference and greater interpretability.
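The two denoising pipelines named in the abstract are standard signal-processing tools. Below is a minimal, illustrative sketch of how a mid-price series derived from LOB snapshots could be smoothed with a Savitzky-Golay filter and a simple one-dimensional Kalman filter; the window length, polynomial order, and noise variances are placeholder assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import savgol_filter

def kalman_smooth(series, process_var=1e-5, obs_var=1e-2):
    """Minimal constant-level (random-walk) Kalman filter.

    process_var / obs_var are illustrative guesses, not the paper's values.
    """
    x, p = float(series[0]), 1.0          # state estimate and its variance
    out = np.empty(len(series))
    for i, z in enumerate(series):
        p += process_var                   # predict: variance grows by process noise
        k = p / (p + obs_var)              # Kalman gain
        x += k * (z - x)                   # update with the new observation
        p *= (1.0 - k)
        out[i] = x
    return out

# Synthetic mid-price path standing in for (best_bid + best_ask) / 2 per snapshot
rng = np.random.default_rng(0)
mid = 60_000 + np.cumsum(rng.normal(0.0, 5.0, 2_000))

mid_kf = kalman_smooth(mid)
mid_sg = savgol_filter(mid, window_length=21, polyorder=3)  # window/order chosen for illustration
```

Either smoothed series (or features derived from it) would then feed the labeling and model-training stages summarized in the flowchart below.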
Keywords: limit order book analysis, data preprocessing, Kalman filtering, Savitzky-Golay filter, model interpretability, cryptocurrency
Complexity vs Empirical Score
- Math Complexity: 5.5/10
- Empirical Rigor: 8.5/10
- Quadrant: Holy Grail
- Why: The paper employs moderately advanced mathematics in its features and neural architectures, but it is anchored in a rigorous empirical setup with specific public data, multiple models, and direct performance comparisons. This balance of theoretical depth and a strong, backtest-ready implementation places it in the Holy Grail quadrant.
```mermaid
flowchart TD
A["Research Goal<br>Does model complexity or data<br>preprocessing drive LOB forecasting gains?"] --> B["Data Acquisition<br>Bybit BTC/USDT LOB snapshots<br>100 ms - multi-second intervals"]
B --> C{"Data Preprocessing Pipelines"}
C --> C1["Kalman Filter"]
C --> C2["Savitzky-Golay Filter"]
C --> C3["Raw / Baseline"]
C --> D["Labeling Schemes<br>Binary (Up/Down)<br>Ternary (Up/Flat/Down)"]
D --> E["Model Training<br>Linear: Logistic Regression<br>Tree-based: XGBoost<br>Deep: DeepLOB, Conv1D+LSTM"]
E --> F["Out-of-Sample Evaluation<br>Accuracy | Latency | Robustness"]
F --> G["Key Findings"]
G --> G1["Simpler models (LR/XGBoost)<br>match/exceed deep nets<br>with proper preprocessing"]
G --> G2["Data filtering reduces noise<br>significantly more than<br>adding hidden layers"]
G --> G3["Trade-off: Speed & Interpretability<br>vs. Complexity"]
```
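To make the labeling and baseline-model stages concrete, here is a hedged, end-to-end toy run: ternary up/flat/down labels from the forward mid-price return, per-level order-book imbalance as features, and a logistic-regression baseline evaluated on a chronological out-of-sample split. The horizon, flat band, and synthetic data are assumptions for illustration only, not the paper's configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def ternary_labels(mid, horizon=10, flat_band=1e-4):
    """+1 (up), 0 (flat), -1 (down) from the relative mid-price move `horizon` steps ahead.

    horizon and flat_band are illustrative choices, not the paper's.
    """
    fwd_ret = mid[horizon:] / mid[:-horizon] - 1.0
    return np.where(fwd_ret > flat_band, 1, np.where(fwd_ret < -flat_band, -1, 0))

def imbalance_features(bid_sizes, ask_sizes):
    """Per-level queue imbalance: (bid - ask) / (bid + ask), one column per depth level."""
    return (bid_sizes - ask_sizes) / (bid_sizes + ask_sizes + 1e-12)

# --- toy run on synthetic data shaped like 10-level LOB snapshots ---
rng = np.random.default_rng(1)
n, levels, horizon = 5_000, 10, 10
bid_sizes = rng.gamma(2.0, 1.0, size=(n, levels))
ask_sizes = rng.gamma(2.0, 1.0, size=(n, levels))
mid = 60_000 + np.cumsum(rng.normal(0.0, 5.0, n))

X = imbalance_features(bid_sizes, ask_sizes)[:-horizon]   # drop rows with no forward label
y = ternary_labels(mid, horizon=horizon)

split = int(0.8 * len(X))                                  # chronological split, no shuffling
clf = LogisticRegression(max_iter=1_000).fit(X[:split], y[:split])
print("out-of-sample accuracy:", accuracy_score(y[split:], clf.predict(X[split:])))
```

The same scaffolding accommodates a tree-based or deep model by swapping the estimator line (with label re-encoding or sequence windowing as needed), which mirrors the comparison the paper describes.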