GPT-InvestAR: Enhancing Stock Investment Strategies through Annual Report Analysis with Large Language Models

ArXiv ID: 2309.03079

Authors: Unknown

Abstract

Annual Reports of publicly listed companies contain vital information about their financial health, which can help assess the potential impact on the firm's stock price. These reports are comprehensive, often running to 100 pages or more. Analysing them is cumbersome even for a single firm, let alone the whole universe of firms that exist. Over the years, financial experts have become proficient in extracting valuable information from these documents relatively quickly. However, this requires years of practice and experience. This paper aims to simplify the process of assessing the Annual Reports of all firms by leveraging the capabilities of Large Language Models (LLMs). The insights generated by the LLM are compiled into a quant-style dataset and augmented with historical stock price data. A Machine Learning model is then trained with the LLM outputs as features. The walkforward test results show promising outperformance relative to S&P500 returns. This paper intends to provide a framework for future work in this direction. To facilitate this, the code has been released as open source.
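
The walkforward test mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's released code: the function names `walk_forward_splits` and `excess_return`, the `min_train` parameter, and the yearly granularity are all assumptions made for illustration. The idea shown is only that each test period is evaluated with a model trained on strictly earlier periods, and that results are compared against a benchmark such as the S&P500.

```python
from typing import Iterator, List, Sequence, Tuple


def walk_forward_splits(
    periods: Sequence[int], min_train: int = 3
) -> Iterator[Tuple[List[int], int]]:
    """Yield (training_periods, test_period) pairs in chronological order.

    Each test period is paired only with periods that precede it,
    so no future information leaks into training.
    `min_train` is a hypothetical parameter: the number of periods
    required before the first test period.
    """
    ordered = sorted(set(periods))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


def excess_return(portfolio: Sequence[float], benchmark: Sequence[float]) -> float:
    """Compounded portfolio return minus compounded benchmark return."""

    def compound(returns: Sequence[float]) -> float:
        total = 1.0
        for r in returns:
            total *= 1.0 + r
        return total - 1.0

    return compound(portfolio) - compound(benchmark)


if __name__ == "__main__":
    years = [2015, 2016, 2017, 2018, 2019, 2020]  # hypothetical report years
    for train, test in walk_forward_splits(years, min_train=3):
        print(f"train on {train[0]}-{train[-1]}, test on {test}")
```

In practice, a model would be refit on each training window and its predictions for the test period used to select stocks; the per-period portfolio and benchmark returns could then be passed to `excess_return`.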

Keywords: Annual Reports, Large Language Models (LLMs), Machine Learning, Walkforward test, Open source, Equities

Complexity vs Empirical Score

  • Math Complexity: 2.5/10
  • Empirical Rigor: 7.5/10
  • Quadrant: Street Traders
  • Why: The paper relies on standard machine learning and statistical measures without advanced mathematical derivations, but features a detailed backtest with walk-forward analysis, transaction cost considerations, and open-source code, making it implementation-heavy.

  flowchart TD
    A["Research Goal:<br>Automate Annual Report Analysis<br>for Stock Investment"] --> B["Data Collection:<br>S&P500 Annual Reports &<br>Historical Stock Prices"]

    B --> C["LLM Analysis:<br>Extract Insights & Sentiment<br>from 100+ Page Documents"]
    
    C --> D["Dataset Construction:<br>Combine LLM Outputs with<br>Quantitative Financial Data"]
    
    D --> E["Machine Learning:<br>Train Predictive Model<br>on Augmented Dataset"]
    
    E --> F["Validation:<br>Walkforward Testing<br>vs S&P500 Benchmark"]
    
    F --> G["Outcome:<br>Promising Outperformance<br>with Open-Source Framework"]
    
    style A fill:#e1f5fe
    style G fill:#e8f5e8
    style F fill:#fff3e0
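
The dataset construction step in the flowchart can be sketched as a join between LLM-derived scores and a return target computed from historical prices. This is a rough illustration under stated assumptions, not the paper's actual schema: the key layout `(ticker, year)`, the feature name `outlook_score`, the one-period holding window, and the helper names `forward_return` and `build_rows` are all hypothetical.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int]  # (ticker, report year); hypothetical key layout


def forward_return(price_at_filing: float, price_at_period_end: float) -> float:
    """Simple return over the holding period that follows the report."""
    return price_at_period_end / price_at_filing - 1.0


def build_rows(
    llm_scores: Dict[Key, Dict[str, float]],
    prices: Dict[Key, Tuple[float, float]],
) -> List[dict]:
    """Join LLM feature scores with the forward-return target.

    Only keys present in both inputs are kept, so every row has
    both features and a label.
    """
    rows = []
    for key in sorted(llm_scores.keys() & prices.keys()):
        ticker, year = key
        start_price, end_price = prices[key]
        row = {"ticker": ticker, "year": year}
        row.update(llm_scores[key])  # LLM outputs become feature columns
        row["target_return"] = forward_return(start_price, end_price)
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Made-up values for illustration only; not data from the paper.
    scores = {("AAA", 2020): {"outlook_score": 70.0}}
    price_pairs = {("AAA", 2020): (100.0, 110.0)}
    print(build_rows(scores, price_pairs))
```

Rows built this way could then be fed to any tabular regression model, with the LLM scores as features and `target_return` as the label.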