SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation
ArXiv ID: 2412.10906
Authors: Unknown
Abstract
The rapid growth of the financial sector and the rising focus on Environmental, Social, and Governance (ESG) considerations highlight the need for advanced NLP tools. However, open-source LLMs proficient in both finance and ESG domains remain scarce. To address this gap, we introduce SusGen-30K, a category-balanced dataset comprising seven financial NLP tasks and ESG report generation, and propose TCFD-Bench, a benchmark for evaluating sustainability report generation. Leveraging this dataset, we develop SusGen-GPT, a suite of models that achieves state-of-the-art performance across six adapted and two off-the-shelf tasks, trailing GPT-4 by only 2% despite using 7-8B parameters compared to GPT-4’s 1,700B. Building on these models, we propose the SusGen system, which integrates Retrieval-Augmented Generation (RAG) to assist in sustainability report generation. This work demonstrates the efficiency of our approach, advancing research in finance and ESG.
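The abstract describes SusGen-30K as category-balanced but does not say how the balancing was done. The following is a minimal sketch of one common approach, downsampling every task category to the same cap. The function name `balance_by_category`, the `task` field, and the toy records are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def balance_by_category(records, per_category, seed=0):
    """Downsample each category to at most `per_category` examples.

    Hypothetical helper: the paper does not specify its balancing procedure.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["task"]].append(rec)

    balanced = []
    for task in sorted(buckets):  # sorted for a deterministic iteration order
        items = buckets[task]
        k = min(per_category, len(items))
        balanced.extend(rng.sample(items, k))

    rng.shuffle(balanced)  # mix categories so training batches are not grouped
    return balanced


# Toy, made-up records with deliberately uneven category sizes.
records = (
    [{"task": "sentiment", "text": f"s{i}"} for i in range(50)]
    + [{"task": "ner", "text": f"n{i}"} for i in range(10)]
    + [{"task": "tcfd_report", "text": f"r{i}"} for i in range(20)]
)

balanced = balance_by_category(records, per_category=10)
counts = Counter(rec["task"] for rec in balanced)
print(dict(counts))
```

Capping at the size of the smallest category gives equal counts per task at the cost of discarding data. Upsampling the rare categories instead is an equally plausible choice.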
Keywords: Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), ESG Reporting, Financial NLP, Dataset Creation, Equities
Complexity vs Empirical Score
- Math Complexity: 3.0/10
- Empirical Rigor: 8.5/10
- Quadrant: Street Traders
flowchart TD
A["Research Goal: Develop open-source LLM for finance & ESG"] --> B["Methodology: Dataset Creation & Model Training"]
B --> C["Data/Inputs: SusGen-30K dataset"]
C --> D["Computational Process: SusGen-GPT 7-8B training"]
D --> E["Findings: State-of-the-art performance"]
E --> F["Outcome: SusGen System w/ RAG for report generation"]
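
The final stage of the flowchart pairs the trained model with Retrieval-Augmented Generation. As a rough illustration of the retrieval step only, the sketch below ranks passages by bag-of-words cosine similarity and assembles a prompt from the best matches. The passages, the query, the prompt layout, and every function name are assumptions for illustration. The actual SusGen system's retriever and prompt format are not described in this summary, and the call to a language model is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q_vec = Counter(tokenize(query))
    scored = [(cosine(q_vec, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(query, passages, k=2):
    """Assemble a prompt with retrieved context (hypothetical layout)."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return f"Context:\n{context}\n\nTask: {query}\nAnswer:"


# Made-up passages standing in for chunks of a company's disclosures.
passages = [
    "The board oversees climate-related risks and opportunities each quarter.",
    "Revenue grew eight percent year over year in the retail segment.",
    "Scope 1 and Scope 2 emissions are reported in metric tons of CO2 equivalent.",
]
query = "Describe the board's oversight of climate-related risks"

prompt = build_prompt(query, passages, k=1)
print(prompt)
```

A production retriever would more likely use dense embeddings and a vector index. Word overlap is used here only to keep the example dependency-free.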