News Sentiment Embeddings for Stock Price Forecasting
ArXiv ID: 2507.01970
Authors: Ayaan Qayyum
Abstract
This paper discusses how headline data can be used to predict stock prices. The instrument in question is the SPDR S&P 500 ETF Trust (SPY), which tracks the performance of the 500 largest publicly traded corporations in the United States. A key focus is using news headlines from the Wall Street Journal (WSJ) to predict the movement of stock prices on a daily timescale. OpenAI-based text embedding models create a vector encoding of each headline, and principal component analysis (PCA) extracts the key features. The challenge of this work is to capture the nuanced time-dependent and time-independent impacts of news on stock prices while handling potential lag effects and market noise. Financial and economic data were collected to improve model performance; these sources include the U.S. Dollar Index (DXY) and Treasury yields. Over 390 machine-learning inference models were trained. The preliminary results show that headline data embeddings improve stock price prediction by at least 40% compared to training and optimizing a machine learning system without headline data embeddings.
Keywords: Text Embedding, Natural Language Processing, Principal Component Analysis (PCA), Machine Learning Inference, SPY Prediction, Equities
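The PCA step named in the abstract can be illustrated with a minimal sketch. It assumes the headline embeddings are already available as a matrix with one row per headline; the random matrix below is a synthetic stand-in, not output from an OpenAI model. The function name `pca_reduce`, the 64-dimensional input, and the choice of 8 components are hypothetical and are not taken from the paper.

```python
import numpy as np


def pca_reduce(embeddings: np.ndarray, n_components: int):
    """Project row vectors onto their top principal components.

    Returns (projected, explained_variance_ratio).
    """
    # PCA operates on mean-centered data.
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered matrix: the rows of vt are the principal axes,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    variance = singular_values ** 2
    ratio = variance[:n_components] / variance.sum()
    return projected, ratio


# Synthetic stand-in for headline embeddings (assumption):
# 200 headlines, 64 dimensions each.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 64))

features, explained = pca_reduce(fake_embeddings, n_components=8)
print(features.shape)  # (200, 8)
```

The reduced `features` matrix is what a downstream forecasting model would consume alongside indicators such as DXY and Treasury yields. How the paper aggregates multiple headlines per trading day is not stated in the abstract, so that step is left out here.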
Complexity vs Empirical Score
- Math Complexity: 3.0/10
- Empirical Rigor: 7.0/10
- Quadrant: Street Traders
- Why: The paper uses practical ML techniques like PCA and embeddings with substantial empirical testing (390 models, real WSJ/DXY data), but lacks advanced mathematical derivations, focusing on application over theoretical complexity.
flowchart TD
A["Research Goal:<br>Forecast SPY Stock Prices<br>using News Headlines"] --> B["Data Collection:<br>WSJ Headlines &<br>Financial Indicators (DXY, Treasury Yields)"]
B --> C["Feature Engineering:<br>OpenAI Text Embeddings<br>+ PCA for Dimensionality Reduction"]
C --> D["Model Training:<br>390+ ML Inference Models<br>trained on historical data"]
D --> E["Evaluation &<br>Comparison vs Baseline"]
E --> F["Key Finding:<br>40% Improvement in<br>Prediction Accuracy with Headline Data"]
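
The comparison against a baseline in the final steps of the flowchart can be read as a relative error reduction. The summary does not say which metric the 40% figure refers to, so the sketch below assumes an error metric where lower is better (for example, mean squared error). The function name and the two error values are hypothetical and were chosen only to make the arithmetic visible.

```python
def relative_improvement(baseline_error: float, model_error: float) -> float:
    """Fraction by which model_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


# Hypothetical error values (assumption), not results from the paper.
baseline_mse = 2.5        # model trained without headline embeddings
with_headlines_mse = 1.5  # model trained with headline embeddings

gain = relative_improvement(baseline_mse, with_headlines_mse)
print(f"{gain:.0%}")  # 40%
```

Under this reading, "at least 40%" would mean the headline-aware model's error is no more than 60% of the baseline's error.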