Identifying Risk Variables From ESG Raw Data Using A Hierarchical Variable Selection Algorithm

ArXiv ID: 2508.18679

Authors: Zhi Chen, Zachary Feinstein, Ionut Florescu

Abstract

Environmental, Social, and Governance (ESG) factors aim to provide non-financial insights into corporations. In this study, we investigate whether we can extract relevant ESG variables to assess corporate risk, as measured by logarithmic volatility. We propose a novel Hierarchical Variable Selection (HVS) algorithm to identify a parsimonious set of variables from raw data that are most relevant to risk. HVS is specifically designed for ESG datasets characterized by a tree structure with significantly more variables than observations. Our findings demonstrate that HVS achieves significantly higher performance than models using pre-aggregated ESG scores. Furthermore, when compared with traditional variable selection methods, HVS achieves superior explanatory power using a more parsimonious set of ESG variables. We illustrate the methodology using company data from various sectors of the US economy.
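The abstract names logarithmic volatility as the risk measure but does not spell out how it is computed. A minimal sketch follows, assuming it is the natural log of the annualized sample standard deviation of daily log returns. The function names, the 252-day annualization factor, and the price series are illustrative assumptions, not details taken from the paper.

```python
import math
import statistics


def log_returns(prices):
    """Log returns r_t = ln(P_t / P_{t-1}) for consecutive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def log_volatility(prices, periods_per_year=252):
    """Natural log of the annualized sample standard deviation of log returns.

    Assumption: the 252-trading-day convention and this exact definition
    are illustrative; the paper may define the response differently.
    """
    returns = log_returns(prices)
    if len(returns) < 2:
        raise ValueError("need at least three prices")
    sigma = statistics.stdev(returns) * math.sqrt(periods_per_year)
    if sigma == 0:
        raise ValueError("volatility is zero; its logarithm is undefined")
    return math.log(sigma)


# Hypothetical closing prices, for illustration only.
prices = [100.0, 101.0, 99.5, 102.0, 101.2, 103.4]
print(log_volatility(prices))
```

Taking the logarithm maps a strictly positive, right-skewed quantity onto the whole real line, which is a common reason to model log volatility rather than volatility itself in a regression.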

Keywords: ESG variables, Hierarchical Variable Selection, Corporate risk assessment, Logarithmic volatility, Variable selection algorithm, Equity Risk Assessment/ESG

Complexity vs Empirical Score

  • Math Complexity: 6.5/10
  • Empirical Rigor: 7.0/10
  • Quadrant: Holy Grail
  • Why: The paper employs advanced statistical techniques like the Box-Cox transformation and a novel hierarchical selection algorithm, demonstrating moderate-to-high mathematical density. It validates the methodology on real-world US corporate data, compares it against benchmarks (Lasso, PCA), and presents robust metrics like R-squared improvements, showing significant empirical rigor.
  flowchart TD
    A["Research Goal: Identify relevant ESG variables<br>to assess corporate risk (logarithmic volatility)"] --> B["Input: US Company Data<br>Raw ESG Variables with Tree Structure"]
    B --> C["Hierarchical Variable Selection HVS Algorithm"]
    C --> D{"Comparison & Evaluation"}
    D --> E["Model 1: Pre-aggregated ESG Scores"]
    D --> F["Model 2: Traditional Variable Selection"]
    D --> G["Model 3: HVS Algorithm"]
    E & F & G --> H["Outcomes"]
    H --> I["HVS achieves superior explanatory power"]
    H --> J["HVS uses a more parsimonious set of variables"]
    H --> K["HVS significantly outperforms pre-aggregated scores"]
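The score rationale above cites the Box-Cox transformation without saying which variable it is applied to or how the parameter is chosen. As a reminder of the standard definition only, the sketch below implements the one-parameter Box-Cox transform and its inverse for a single positive value. The function names and the choice of `lam` are illustrative assumptions; in practice the parameter is usually estimated from data, for example by maximum likelihood as `scipy.stats.boxcox` does.

```python
import math


def box_cox(x, lam):
    """One-parameter Box-Cox transform of a strictly positive value x.

    (x**lam - 1) / lam for lam != 0, and ln(x) for lam == 0.
    """
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def inverse_box_cox(y, lam):
    """Invert box_cox for the same lam."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)


# Illustrative values: lam = 0.5 is a square-root-like transform.
print(box_cox(4.0, 0.5))          # (sqrt(4) - 1) / 0.5
print(inverse_box_cox(2.0, 0.5))  # maps the result back to the original scale
```

The case `lam == 0` is the limit of the general formula as `lam` approaches zero, so the family moves continuously between power transforms and the logarithm.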