The Limits of Complexity: Why Feature Engineering Beats Deep Learning in Investor Flow Prediction

ArXiv ID: 2601.07131. Authors: Sungwoo Kang. Abstract: The application of machine learning to financial prediction has accelerated dramatically, yet the conditions under which complex models outperform simple alternatives remain poorly understood. This paper investigates whether advanced signal processing and deep learning techniques can extract predictive value from investor order flows beyond what simple feature engineering achieves. Using a comprehensive dataset of 2.79 million observations spanning 2,439 Korean equities from 2020–2024, we apply three methodologies: Independent Component Analysis (ICA) to recover latent market drivers, Wavelet Coherence analysis to characterize multi-scale correlation structure, and Long Short-Term Memory (LSTM) networks with attention mechanisms for non-linear prediction. Our results reveal a striking finding: a parsimonious linear model using market capitalization-normalized flows (“Matched Filter” preprocessing) achieves a Sharpe ratio of 1.30 and cumulative return of 272.6%, while the full ICA-Wavelet-LSTM pipeline generates a Sharpe ratio of only 0.07 with a cumulative return of -5.1%. The raw LSTM model collapsed to predicting the unconditional mean, achieving a hit rate of 47.5%, worse than random. We conclude that in low signal-to-noise financial environments, domain-specific feature engineering yields substantially higher marginal returns than algorithmic complexity. These findings establish important boundary conditions for the application of deep learning to financial prediction. ...

January 12, 2026 · 2 min · Research Team
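
The investor-flow abstract above reports two quantities that are easy to make concrete: a flow scaled by market capitalization, and a Sharpe ratio. The sketch below is illustrative only. The abstract does not define the paper's “Matched Filter” preprocessing beyond market-cap normalization, so the function names, the zero risk-free rate, and the 252-day annualization are all assumptions rather than the author's method.

```python
import math
import statistics


def cap_normalized_flow(net_flow, market_cap):
    """Scale a net investor flow by the stock's market capitalization.

    Both arguments are in the same currency unit (assumption), so the
    result is a dimensionless fraction that is comparable across stocks.
    """
    if market_cap <= 0:
        raise ValueError("market_cap must be positive")
    return net_flow / market_cap


def annualized_sharpe(period_returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    mean = statistics.fmean(period_returns)
    std = statistics.stdev(period_returns)  # sample standard deviation
    if std == 0:
        raise ValueError("returns have zero variance")
    return mean / std * math.sqrt(periods_per_year)
```

Dividing by market capitalization puts a large-cap and a small-cap stock on the same scale, which is the kind of domain-specific preprocessing the abstract credits for the simple model's performance.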

Enhancing OHLC Data with Timing Features: A Machine Learning Evaluation

ArXiv ID: 2509.16137. Authors: Ruslan Tepelyan. Abstract: OHLC bar data is a widely used format for representing financial asset prices over time due to its balance of simplicity and informativeness. Bloomberg has recently introduced a new bar data product that includes additional timing information, specifically the timestamps of the open, high, low, and close prices within each bar. In this paper, we investigate the impact of incorporating this timing data into machine learning models for predicting volume-weighted average price (VWAP). Our experiments show that including these features consistently improves predictive performance across multiple ML architectures. We observe gains across several key metrics, including log-likelihood, mean squared error (MSE), $R^2$, conditional variance estimation, and directional accuracy. ...

September 19, 2025 · 2 min · Research Team
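
One plausible way to turn the within-bar timestamps described in the OHLC entry above into model inputs is to express each timestamp as a fraction of the bar's duration. The abstract does not say how the paper encodes timing, so this encoding, the function name, and the dictionary layout are hypothetical.

```python
from datetime import datetime, timedelta


def timing_fractions(bar_start, bar_length, event_times):
    """Map each event timestamp to its relative position in the bar.

    Returns a dict with values in [0, 1], where 0 is the bar's start and
    1 is its end. `event_times` maps a label such as "high" or "low" to
    a datetime inside the bar (hypothetical layout).
    """
    total = bar_length.total_seconds()
    if total <= 0:
        raise ValueError("bar_length must be positive")
    fractions = {}
    for name, ts in event_times.items():
        frac = (ts - bar_start).total_seconds() / total
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"{name} timestamp lies outside the bar")
        fractions[name] = frac
    return fractions


# Usage: a 5-minute bar whose high printed early and whose low printed late.
start = datetime(2025, 1, 2, 9, 30)
features = timing_fractions(
    start,
    timedelta(minutes=5),
    {"high": start + timedelta(minutes=1), "low": start + timedelta(seconds=270)},
)
```

A bar whose high comes early and low comes late has a different intrabar path from one with the reverse ordering, even when the four OHLC prices are identical.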

Modern approaches to building interpretable models of the property market using machine learning on the base of mass cadastral valuation

ArXiv ID: 2506.15723. Authors: Irina G. Tanashkina, Alexey S. Tanashkin, Alexander S. Maksimchuik, Anna Yu. Poshivailo. Abstract: In this article, we review modern approaches to building interpretable models of property markets using machine learning on the basis of mass valuation of property in the Primorye region, Russia. A researcher who lacks expertise in this topic encounters numerous difficulties in the effort to build a good model. The main source of these difficulties is the huge difference between noisy real market data and the ideal data that is very common in all types of tutorials on machine learning. This paper covers all stages of modeling: the collection of initial data, identification of outliers, the search for and analysis of patterns in the data, the formation and final choice of price factors, the building of the model, and the evaluation of its efficiency. For each stage, we highlight potential issues and describe sound methods for overcoming emerging difficulties using actual examples. We show that the combination of classical linear regression with interpolation methods of geostatistics makes it possible to build an effective model for land parcels. For flats, where many objects are attributed to one spatial point, the application of geostatistical methods is difficult. Therefore we suggest linear regression with automatic generation and selection of additional rules on the basis of decision trees, the so-called RuleFit method. Thus we show that, despite such a strong restriction as the requirement of interpretability, which is important in practical aspects such as legal matters, it is still possible to build effective models of real property markets. ...

June 5, 2025 · 2 min · Research Team

Assets Forecasting with Feature Engineering and Transformation Methods for LightGBM

ArXiv ID: 2501.07580. Authors: Unknown. Abstract: Fluctuations in the stock market rapidly shape the economic world and consumer markets, impacting millions of individuals. Hence, accurately forecasting it is essential for mitigating risks, including those associated with inactivity. Although research shows that hybrid models of Deep Learning (DL) and Machine Learning (ML) yield promising results, their computational requirements often exceed the capabilities of average personal computers, rendering them inaccessible to many. To address this challenge, in this paper we optimize LightGBM (an efficient implementation of gradient-boosted decision trees (GBDT)) for maximum performance while maintaining low computational requirements. We introduce novel feature engineering techniques, including indicator-price slope ratios and differences of close and open prices divided by the corresponding 14-period Exponential Moving Average (EMA), designed to capture market dynamics and enhance predictive accuracy. Additionally, we test seven different feature and target variable transformation methods, including returns, logarithmic returns, EMA ratios and their standardized counterparts, as well as EMA difference ratios, so as to identify the most effective ones, weighing both efficiency and accuracy. The results demonstrate that Log Returns, Returns and EMA Difference Ratio constitute the best target variable transformation methods, with EMA ratios having a lower percentage of correct directional forecasts, and standardized versions of target variable transformations requiring significantly more training time. Moreover, the introduced features demonstrate high feature importance in predictive performance across all target variable transformation methods. This study highlights an accessible, computationally efficient approach to stock market forecasting using LightGBM, making advanced forecasting techniques more widely attainable. ...

December 27, 2024 · 2 min · Research Team
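
Two of the transformations named in the LightGBM entry above can be sketched in a few lines: the close-minus-open difference divided by a 14-period EMA, and logarithmic returns. This is a minimal sketch under stated assumptions. The abstract does not say which series the “corresponding” EMA is taken over or how it is seeded, so the EMA of closing prices, the smoothing factor 2 / (period + 1), and seeding with the first value are all assumptions, as are the function names.

```python
import math


def ema(values, period):
    """Exponential moving average with alpha = 2 / (period + 1).

    Seeded with the first observation (one common convention, assumed here).
    """
    if not values:
        raise ValueError("values must be non-empty")
    alpha = 2.0 / (period + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1.0 - alpha) * out[-1])
    return out


def close_open_over_ema(opens, closes, period=14):
    """(close - open) / EMA(close), computed bar by bar."""
    ema_close = ema(closes, period)
    return [(c - o) / e for o, c, e in zip(opens, closes, ema_close)]


def log_returns(closes):
    """Logarithmic return between consecutive closing prices."""
    return [math.log(b / a) for a, b in zip(closes, closes[1:])]
```

Dividing by the EMA expresses the intrabar move relative to the recent price level, so the feature stays comparable across assets and across price regimes.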

Hunting Tomorrow's Leaders: Using Machine Learning to Forecast S&P 500 Additions & Removal

ArXiv ID: 2412.12539. Authors: Unknown. Abstract: This study applies machine learning to predict S&P 500 membership changes: key events that profoundly impact investor behavior and market dynamics. Quarterly data from WRDS datasets (2013 onwards) was used, incorporating features such as industry classification, financial data, market data, and corporate governance indicators. Using a Random Forest model, we achieved a test F1 score of 0.85, outperforming logistic regression and SVC models. This research not only showcases the power of machine learning for financial forecasting but also emphasizes model transparency through SHAP analysis and feature engineering. The model’s real-world applicability is demonstrated with predicted changes for Q3 2023, such as the addition of Uber (UBER) and the removal of SolarEdge Technologies (SEDG). By incorporating these predictions into a trading strategy, i.e., buying stocks announced for addition and shorting those marked for removal, we anticipate capturing alpha and enhancing investment decision making, offering valuable insights into index dynamics ...

December 17, 2024 · 2 min · Research Team
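
The F1 score quoted in the S&P 500 membership entry above is the harmonic mean of precision and recall. The helper below shows the arithmetic from raw confusion-matrix counts. The function name is hypothetical, and the counts in the usage line are made up for illustration, not taken from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 = 2 * precision * recall / (precision + recall).

    Written in the equivalent count form 2*TP / (2*TP + FP + FN).
    """
    denom = 2 * true_pos + false_pos + false_neg
    if denom == 0:
        raise ValueError("F1 is undefined when there are no positives")
    return 2 * true_pos / denom


# Illustrative counts only: precision and recall are both 85 / 100 here.
example_f1 = f1_score(true_pos=85, false_pos=15, false_neg=15)
```

Because index additions and removals are rare relative to unchanged constituents, F1 is more informative than plain accuracy for this kind of task.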

S&P 500 Trend Prediction

ArXiv ID: 2412.11462. Authors: Unknown. Abstract: This project aims to predict short-term and long-term upward trends in the S&P 500 index using machine learning models and feature engineering based on the “101 Formulaic Alphas” methodology. The study employed multiple models, including Logistic Regression, Decision Trees, Random Forests, Neural Networks, K-Nearest Neighbors (KNN), and XGBoost, to identify market trends from historical stock data collected from Yahoo! Finance. Data preprocessing involved handling missing values, standardization, and iterative feature selection to ensure relevance and variability. For short-term predictions, KNN emerged as the most effective model, delivering robust performance with high recall for upward trends, while for long-term forecasts, XGBoost demonstrated the highest accuracy and AUC scores after hyperparameter tuning and class imbalance adjustments using SMOTE. Feature importance analysis highlighted the dominance of momentum-based and volume-related indicators in driving predictions. However, models exhibited limitations such as overfitting and low recall for positive market movements, particularly in imbalanced datasets. The study concludes that KNN is ideal for short-term alerts, whereas XGBoost is better suited for long-term trend forecasting. Future enhancements could include advanced architectures like Long Short-Term Memory (LSTM) networks and further feature refinement to improve precision and generalizability. These findings contribute to developing reliable machine learning tools for market trend prediction and investment decision-making. ...

December 16, 2024 · 2 min · Research Team

Detection of financial opportunities in micro-blogging data with a stacked classification system

ArXiv ID: 2404.07224. Authors: Unknown. Abstract: Micro-blogging sources such as the Twitter social network provide valuable real-time data for market prediction models. Investors’ opinions in this network follow the fluctuations of the stock markets and often include educated speculations on market opportunities that may have an impact on the actions of other investors. In view of this, we propose a novel system to detect positive predictions in tweets, a type of financial emotion which we term “opportunities” and which is akin to “anticipation” in Plutchik’s theory. Specifically, we seek a high detection precision to present a financial operator with a substantial number of such tweets while differentiating them from the rest of the financial emotions in our system. We achieve this with a three-layer stacked Machine Learning classification system with sophisticated features that result from applying Natural Language Processing techniques to extract valuable linguistic information. Experimental results on a dataset that has been manually annotated with financial emotion and ticker occurrence tags demonstrate that our system yields satisfactory and competitive performance in financial opportunity detection, with precision values up to 83%. This promising outcome endorses the usability of our system to support investors’ decision making. ...

March 29, 2024 · 2 min · Research Team

CNN-DRL with Shuffled Features in Finance

ArXiv ID: 2402.03338. Authors: Unknown. Abstract: In prior methods, it was observed that applying a Convolutional Neural Network agent in Deep Reinforcement Learning to financial data resulted in an enhanced reward. In this study, a specific permutation was applied to the feature vector, thereby generating a CNN matrix that strategically positions more pertinent features in close proximity. Our comprehensive experimental evaluations unequivocally demonstrate a substantial enhancement in reward attainment. ...

January 16, 2024 · 1 min · Research Team

Application of Machine Learning in Stock Market Forecasting: A Case Study of Disney Stock

ArXiv ID: 2401.10903. Authors: Unknown. Abstract: This document presents a stock market analysis conducted on a dataset consisting of 750 instances and 16 attributes donated on 2014-10-23. The analysis includes an exploratory data analysis (EDA) section, feature engineering, data preparation, model selection, and insights from the analysis. The Fama French 3-factor model is also utilized in the analysis. The results of the analysis are presented, with linear regression being the best-performing model. ...

December 31, 2023 · 2 min · Research Team

A Data-driven Deep Learning Approach for Bitcoin Price Forecasting

ArXiv ID: 2311.06280. Authors: Unknown. Abstract: Bitcoin as a cryptocurrency has been one of the most important digital coins and the first decentralized digital currency. Deep neural networks, on the other hand, have shown promising results recently; however, they require a huge amount of high-quality data to leverage their power. There are some techniques, such as augmentation, that can help increase the dataset size, but we cannot exploit them on historical bitcoin data. As a result, we propose a shallow Bidirectional-LSTM (Bi-LSTM) model, fed with feature-engineered data produced by our proposed method, to forecast bitcoin closing prices in a daily time frame. We compare the performance with that of other forecasting methods, and show that with the help of the proposed feature engineering method, a shallow deep neural network outperforms other popular price forecasting models. ...

October 27, 2023 · 2 min · Research Team