Alternative Loss Function in Evaluation of Transformer Models
ArXiv ID: 2507.16548
Authors: Jakub Michańków, Paweł Sakowski, Robert Ślepaczuk
Abstract
The proper design and architecture of testing machine learning models, especially in their application to quantitative finance problems, is crucial. The most important aspect of this process is selecting an adequate loss function for training, validation, estimation of results, and hyperparameter tuning. Therefore, in this research, through empirical experiments on equity and cryptocurrency assets, we apply the Mean Absolute Directional Loss (MADL) function, which is better suited to optimizing forecast-generating models used in algorithmic investment strategies. Results obtained with the MADL function are compared between Transformer and LSTM models, and we show that in almost every case the Transformer results are significantly better than those obtained with the LSTM.
Keywords: Transformer, LSTM, Mean Absolute Directional Loss (MADL), Loss Function Optimization, Algorithmic Investment Strategies, Equity and Cryptocurrency
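The abstract does not reproduce the MADL formula itself. A minimal sketch of the loss, assuming the definition from the authors' earlier MADL work, i.e. MADL = (1/N) Σ (−1) · sign(R_i · R̂_i) · |R_i|, where R_i is the observed return and R̂_i the model's predicted return (the function name and signature here are illustrative, not from the paper):

```python
import numpy as np

def madl(y_true, y_pred):
    """Mean Absolute Directional Loss (sketch).

    Assumes MADL = (1/N) * sum(-1 * sign(R_i * Rhat_i) * |R_i|):
    a correctly predicted direction contributes -|R_i| (reward),
    a wrong direction contributes +|R_i| (penalty), so minimizing
    MADL maximizes the return captured by directional bets.
    """
    y_true = np.asarray(y_true, dtype=float)   # observed returns R_i
    y_pred = np.asarray(y_pred, dtype=float)   # predicted returns Rhat_i
    return float(np.mean(-np.sign(y_true * y_pred) * np.abs(y_true)))
```

Unlike MSE or MAE, this loss ties the penalty to the magnitude of the missed return rather than to the numerical forecast error, which is why the authors argue it better matches the objective of an algorithmic trading strategy.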
Complexity vs Empirical Score
- Math Complexity: 6.0/10
- Empirical Rigor: 7.0/10
- Quadrant: Holy Grail
- Why: The paper presents advanced neural network architecture mathematics (Transformer attention mechanisms with formal equations) and demonstrates rigorous empirical testing including walk-forward procedures, extended out-of-sample periods, and risk-adjusted performance metrics across multiple asset classes.
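The "Why" note above mentions the Transformer's attention mechanism with formal equations. For orientation, a minimal sketch of standard scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, as defined in the original Transformer literature (this generic form is an assumption; the paper's exact architecture may differ):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns (n_queries, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: weights sum to 1
    return weights @ V                              # weighted average of values
```

In a time-series setting, each row of Q, K, V corresponds to a time step, so every forecast step can weight all past observations directly instead of passing information through a recurrent state as an LSTM does.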
```mermaid
flowchart TD
A["Research Goal:<br>Design ML testing for Finance"] --> B["Data Sources<br>(Equity & Crypto Assets)"]
B --> C{"Model Training & Tuning"}
C --> D["Transformer Model"]
C --> E["LSTM Model"]
D & E --> F["Apply MADL<br>Loss Function"]
F --> G["Comparison &<br>Evaluation Results"]
G --> H["Outcome:<br>Transformer outperforms LSTM"]
```