false

SAE-FiRE: Enhancing Earnings Surprise Predictions Through Sparse Autoencoder Feature Selection

SAE-FiRE: Enhancing Earnings Surprise Predictions Through Sparse Autoencoder Feature Selection ArXiv ID: 2505.14420 “View on arXiv” Authors: Huopu Zhang, Yanguang Liu, Miao Zhang, Zirui He, Mengnan Du Abstract Predicting earnings surprises from financial documents, such as earnings conference calls, regulatory filings, and financial news, has become increasingly important in financial economics. However, these financial documents present significant analytical challenges, typically containing over 5,000 words with substantial redundancy and industry-specific terminology that creates obstacles for language models. In this work, we propose the SAE-FiRE (Sparse Autoencoder for Financial Representation Enhancement) framework to address these limitations by extracting key information while eliminating redundancy. SAE-FiRE employs Sparse Autoencoders (SAEs) to decompose dense neural representations from large language models into interpretable sparse components, then applies statistical feature selection methods, including ANOVA F-tests and tree-based importance scoring, to identify the top-k most discriminative dimensions for classification. By systematically filtering out noise that might otherwise lead to overfitting, we enable more robust and generalizable predictions. Experimental results across three financial datasets demonstrate that SAE-FiRE significantly outperforms baseline approaches. ...

May 20, 2025 · 2 min · Research Team

BASIR: Budget-Assisted Sectoral Impact Ranking -- A Dataset for Sector Identification and Performance Prediction Using Language Models

BASIR: Budget-Assisted Sectoral Impact Ranking – A Dataset for Sector Identification and Performance Prediction Using Language Models ArXiv ID: 2504.13189 “View on arXiv” Authors: Unknown Abstract Government fiscal policies, particularly annual union budgets, exert significant influence on financial markets. However, real-time analysis of budgetary impacts on sector-specific equity performance remains methodologically challenging and largely unexplored. This study proposes a framework to systematically identify and rank sectors poised to benefit from India’s Union Budget announcements. The framework addresses two core tasks: (1) multi-label classification of excerpts from budget transcripts into 81 predefined economic sectors, and (2) performance ranking of these sectors. Leveraging a comprehensive corpus of Indian Union Budget transcripts from 1947 to 2025, we introduce BASIR (Budget-Assisted Sectoral Impact Ranking), an annotated dataset mapping excerpts from budgetary transcripts to sectoral impacts. Our architecture incorporates fine-tuned embeddings for sector identification, coupled with language models that rank sectors based on their predicted performances. Our results demonstrate 0.605 F1-score in sector classification, and 0.997 NDCG score in predicting ranks of sectors based on post-budget performances. The methodology enables investors and policymakers to quantify fiscal policy impacts through structured, data-driven insights, addressing critical gaps in manual analysis. The annotated dataset has been released under CC-BY-NC-SA-4.0 license to advance computational economics research. ...

April 2, 2025 · 2 min · Research Team

Market-Derived Financial Sentiment Analysis: Context-Aware Language Models for Crypto Forecasting

Market-Derived Financial Sentiment Analysis: Context-Aware Language Models for Crypto Forecasting ArXiv ID: 2502.14897 “View on arXiv” Authors: Unknown Abstract Financial Sentiment Analysis (FSA) traditionally relies on human-annotated sentiment labels to infer investor sentiment and forecast market movements. However, inferring the potential market impact of words based on their human-perceived intentions is inherently challenging. We hypothesize that the historical market reactions to words, offer a more reliable indicator of their potential impact on markets than subjective sentiment interpretations by human annotators. To test this hypothesis, a market-derived labeling approach is proposed to assign tweet labels based on ensuing short-term price trends, enabling the language model to capture the relationship between textual signals and market dynamics directly. A domain-specific language model was fine-tuned on these labels, achieving up to an 11% improvement in short-term trend prediction accuracy over traditional sentiment-based benchmarks. Moreover, by incorporating market and temporal context through prompt-tuning, the proposed context-aware language model demonstrated an accuracy of 89.6% on a curated dataset of 227 impactful Bitcoin-related news events with significant market impacts. Aggregating daily tweet predictions into trading signals, our method outperformed traditional fusion models (which combine sentiment-based and price-based predictions). It challenged the assumption that sentiment-based signals are inferior to price-based predictions in forecasting market movements. Backtesting these signals across three distinct market regimes yielded robust Sharpe ratios of up to 5.07 in trending markets and 3.73 in neutral markets. Our findings demonstrate that language models can serve as effective short-term market predictors. This paradigm shift underscores the untapped capabilities of language models in financial decision-making and opens new avenues for market prediction applications. ...

February 17, 2025 · 3 min · Research Team

Continuous Risk Factor Models: Analyzing Asset Correlations through Energy Distance

Continuous Risk Factor Models: Analyzing Asset Correlations through Energy Distance ArXiv ID: 2410.23447 “View on arXiv” Authors: Unknown Abstract This paper introduces a novel approach to financial risk analysis that does not rely on traditional price and market data, instead using market news to model assets as distributions over a metric space of risk factors. By representing asset returns as integrals over the scalar field of these risk factors, we derive the covariance structure between asset returns. Utilizing encoder-only language models to embed this news data, we explore the relationships between asset return distributions through the concept of Energy Distance, establishing connections between distributional differences and excess returns co-movements. This data-agnostic approach provides new insights into portfolio diversification, risk management, and the construction of hedging strategies. Our findings have significant implications for both theoretical finance and practical risk management, offering a more robust framework for modelling complex financial systems without depending on conventional market data. ...

October 30, 2024 · 2 min · Research Team

Numerical Claim Detection in Finance: A New Financial Dataset, Weak-Supervision Model, and Market Analysis

Numerical Claim Detection in Finance: A New Financial Dataset, Weak-Supervision Model, and Market Analysis ArXiv ID: 2402.11728 “View on arXiv” Authors: Unknown Abstract In this paper, we investigate the influence of claims in analyst reports and earnings calls on financial market returns, considering them as significant quarterly events for publicly traded companies. To facilitate a comprehensive analysis, we construct a new financial dataset for the claim detection task in the financial domain. We benchmark various language models on this dataset and propose a novel weak-supervision model that incorporates the knowledge of subject matter experts (SMEs) in the aggregation function, outperforming existing approaches. We also demonstrate the practical utility of our proposed model by constructing a novel measure of optimism. Here, we observe the dependence of earnings surprise and return on our optimism measure. Our dataset, models, and code are publicly (under CC BY 4.0 license) available on GitHub. ...

February 18, 2024 · 2 min · Research Team