SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation

ArXiv ID: 2412.10906

Authors: Unknown

Abstract

The rapid growth of the financial sector and the rising focus on Environmental, Social, and Governance (ESG) considerations highlight the need for advanced NLP tools. However, open-source LLMs proficient in both finance and ESG domains remain scarce. To address this gap, we introduce SusGen-30K, a category-balanced dataset comprising seven financial NLP tasks and ESG report generation, and propose TCFD-Bench, a benchmark for evaluating sustainability report generation. Leveraging this dataset, we developed SusGen-GPT, a suite of models achieving state-of-the-art performance across six adapted and two off-the-shelf tasks, trailing GPT-4 by only 2% despite using 7-8B parameters compared to GPT-4’s 1,700B. Based on this, we propose the SusGen system, integrated with Retrieval-Augmented Generation (RAG), to assist in sustainability report generation. This work demonstrates the efficiency of our approach, advancing research in finance and ESG.
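The abstract calls SusGen-30K "category-balanced" but does not say how the balance is achieved. One common approach, shown here as a minimal sketch and not as the authors' procedure, is to cap every task category at the same number of examples. The field name `task`, the task labels, and the helper `balance_by_category` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def balance_by_category(records, per_category, seed=0):
    """Downsample each category to at most `per_category` records.

    Assumption: each record is a dict with a "task" key naming its category.
    This is an illustrative balancing scheme, not the SusGen-30K pipeline.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    buckets = defaultdict(list)
    for record in records:
        buckets[record["task"]].append(record)

    balanced = []
    for task in sorted(buckets):  # sorted for a deterministic output order
        items = buckets[task]
        k = min(per_category, len(items))  # small categories are kept whole
        balanced.extend(rng.sample(items, k))
    return balanced


# Toy, made-up data: three imbalanced categories.
toy = (
    [{"task": "sentiment", "text": f"s{i}"} for i in range(50)]
    + [{"task": "ner", "text": f"n{i}"} for i in range(10)]
    + [{"task": "tcfd_report", "text": f"r{i}"} for i in range(5)]
)

subset = balance_by_category(toy, per_category=5)

counts = {}
for record in subset:
    counts[record["task"]] = counts.get(record["task"], 0) + 1
print(counts)
```

With the toy data above, every category ends up with five examples. A real pipeline might instead upsample rare categories or weight them during training; the cap is just the simplest option to show.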

Keywords: Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), ESG Reporting, Financial NLP, Dataset Creation, Equities

Complexity vs Empirical Score

  • Math Complexity: 3.0/10
  • Empirical Rigor: 8.5/10
  • Quadrant: Street Traders

  flowchart TD
    A["Research Goal: Develop open-source LLM for finance & ESG"] --> B["Methodology: Dataset Creation & Model Training"]
    B --> C["Data/Inputs: SusGen-30K dataset"]
    C --> D["Computational Process: SusGen-GPT 7-8B training"]
    D --> E["Findings: State-of-the-art performance"]
    E --> F["Outcome: SusGen System w/ RAG for report generation"]
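The final step pairs the model with Retrieval-Augmented Generation for report writing. The sketch below illustrates the general RAG pattern only: retrieve the passages most similar to a question, then place them in the prompt. It is a toy under stated assumptions, using bag-of-words cosine similarity in place of a learned embedding model and stopping before any model call. The passages, the question, and the helpers `retrieve` and `build_prompt` are all hypothetical and are not taken from the SusGen system.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, passages, top_k=2):
    """Return up to `top_k` passages ranked by similarity to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:top_k] if score > 0]


def build_prompt(question, passages, top_k=2):
    """Assemble a prompt that places retrieved context before the question."""
    context = retrieve(question, passages, top_k)
    context_block = "\n".join(f"- {p}" for p in context)
    return f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"


# Made-up company disclosures standing in for a document store.
passages = [
    "The board reviews climate-related risks twice a year.",
    "Scope 1 emissions fell compared with the previous reporting year.",
    "The company opened a new office cafeteria.",
]
question = "How does the board oversee climate-related risks?"

prompt = build_prompt(question, passages, top_k=1)
print(prompt)
```

Here the governance passage is the one retrieved, because it shares the most terms with the question. In a production setting the prompt would then be passed to a generator such as a fine-tuned 7-8B model, and retrieval would typically use dense embeddings and a vector index.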