NLP-Based ESG Rating Generation from Unstructured Text
A pipeline for generating ESG scores from news articles and sustainability reports using topic modeling, LLM-based classification, and signal aggregation across Environmental, Social, and Governance dimensions.
Overview
ESG (Environmental, Social, Governance) ratings have become central to sustainable investment decisions, yet the dominant rating agencies produce scores that are opaque, inconsistent across providers, and often months out of date. This project develops an alternative pipeline that derives ESG scores directly from unstructured text — news articles, social media, analyst reports, and sustainability disclosures — using a sequence of NLP techniques to produce timely, transparent, and dimension-specific ratings.
Rather than relying on proprietary questionnaires or structured data fields, the approach treats language itself as the primary evidence of corporate ESG performance.
Technical Approach
The pipeline moves through five stages.
Data Collection and Preprocessing: Unstructured text is collected from a range of sources, then prepared through normalization, tokenization, morphological analysis (using eKoNLPy for Korean-language content), and stop-word removal.
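The preprocessing steps can be sketched roughly as follows, using only the Python standard library. This is an illustrative sketch rather than the project's actual code: the `preprocess` function and the `STOP_WORDS` list are hypothetical names, and morphological analysis (the step eKoNLPy would handle for Korean text) is deliberately left out.

```python
import re
import unicodedata

# Hypothetical stop-word list for illustration only; a real pipeline would
# use a curated list per language.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "its", "on", "for", "by"}

URL_RE = re.compile(r"https?://\S+")
# Runs of letters and digits (Unicode-aware), excluding underscores.
TOKEN_RE = re.compile(r"[^\W_]+")


def preprocess(text: str) -> list[str]:
    """Normalize, tokenize, and filter one document."""
    # NFKC folds full-width and compatibility characters to canonical forms.
    text = unicodedata.normalize("NFKC", text).lower()
    # Drop URLs before tokenizing so their fragments do not become tokens.
    text = URL_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    # Remove stop words and purely numeric tokens.
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]


print(preprocess("The company cut its CO2 emissions by 30% in 2023: https://example.com/report"))
# → ['company', 'cut', 'co2', 'emissions']
```

For Korean-language input, the regular-expression tokenizer above would be replaced by a morphological analyzer, since whitespace and punctuation alone do not separate Korean morphemes.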
Key Theme Extraction: Topic modeling identifies the ESG-relevant themes present in the corpus. Each of the three ESG dimensions — Environmental, Social, Governance — has distinct topic signatures that the model learns to recognize.
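The text above does not name a specific topic model. As one possible illustration, the sketch below fits Latent Dirichlet Allocation (LDA) by collapsed Gibbs sampling in pure Python. The function name `fit_lda` and the hyperparameter defaults are assumptions made for this example; a production pipeline would more likely rely on an established library.

```python
import random
from collections import Counter


def fit_lda(docs, n_topics=3, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized documents with collapsed Gibbs sampling.

    Returns (theta, topic_word): per-document topic proportions and
    per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                        # tokens per topic
    assign = []                                         # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic proportions per document; each row sums to 1.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word
```

After fitting, `topic_word[k].most_common(5)` lists the highest-count words for topic `k`, which is the kind of output a reviewer would inspect to decide whether a topic corresponds to an Environmental, Social, or Governance theme.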
Data Labeling: Extracted themes are labeled using a generative language model (GPT-4) together with review by panels of domain experts. This hybrid approach is intended to maintain label quality while keeping the labeling process scalable.
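The text does not specify how model labels and expert review are reconciled. One common way to check label quality in a setup like this is to measure agreement between the model's labels and expert labels on a reviewed sample, for example with Cohen's kappa. The sketch below assumes that approach; `cohens_kappa` and `disagreements` are hypothetical helper names, not part of the described pipeline.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


def disagreements(labels_a, labels_b):
    """Indices where the two annotators differ, e.g. for expert follow-up."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


llm_labels = ["E", "E", "S", "G", "S", "G"]
expert_labels = ["E", "S", "S", "G", "S", "G"]
print(cohens_kappa(llm_labels, expert_labels))   # ≈ 0.75
print(disagreements(llm_labels, expert_labels))  # → [1]
```

A low kappa on the reviewed sample would suggest revising the labeling prompt or guidelines before scaling the model-generated labels to the full corpus.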
Classification and Sentiment Modeling: Separate classifiers handle topic classification and sentiment analysis. Pre-trained language models such as BERT and XLNet are fine-tuned on the labeled data, providing context-aware sentiment judgments that go beyond simple positive/negative polarity.
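One way to read "beyond simple positive/negative polarity" is that the classifier's output is kept as a graded score instead of a hard label. The sketch below illustrates that idea under stated assumptions: it takes the raw logits a fine-tuned three-class sentiment model might emit and converts them into a signed score in [-1, 1]. The label order in `LABELS` and the function names are hypothetical, and the fine-tuning itself is not shown.

```python
import math

# Hypothetical label order for a three-class sentiment head.
LABELS = ("negative", "neutral", "positive")


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_score(logits):
    """Signed sentiment in [-1, 1]: P(positive) minus P(negative)."""
    probs = dict(zip(LABELS, softmax(logits)))
    return probs["positive"] - probs["negative"]


print(sentiment_score([0.0, 0.0, 0.0]))  # → 0.0
```

A score near zero then reflects either neutral text or an uncertain prediction, which a hard positive/negative label would hide.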
Signal Aggregation: ESG signals are weighted and aggregated per dimension (E, S, G), then combined into a composite ESG rating score for each company. The pipeline is designed to run on fresh data continuously, producing ratings that update as new articles appear rather than on an annual or quarterly cycle.
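The aggregation step can be sketched as a weighted mean per dimension followed by a weighted combination across dimensions. The weighting scheme is not specified in the text, so everything here is an assumption for illustration: the `Signal` record, the equal `DIMENSION_WEIGHTS`, and the choice to renormalize over dimensions that actually have evidence.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Signal:
    company: str
    dimension: str       # "E", "S", or "G"
    sentiment: float     # signed score in [-1, 1]
    weight: float = 1.0  # e.g. classifier confidence or source reliability


# Hypothetical equal weighting of the three dimensions.
DIMENSION_WEIGHTS = {"E": 1 / 3, "S": 1 / 3, "G": 1 / 3}


def aggregate(signals, dimension_weights=DIMENSION_WEIGHTS):
    """Return {company: {"E": ..., "S": ..., "G": ..., "composite": ...}}."""
    weighted_sum = defaultdict(lambda: defaultdict(float))
    weight_total = defaultdict(lambda: defaultdict(float))
    for s in signals:
        weighted_sum[s.company][s.dimension] += s.weight * s.sentiment
        weight_total[s.company][s.dimension] += s.weight

    ratings = {}
    for company, sums in weighted_sum.items():
        # Weighted mean sentiment per dimension, skipping zero-weight ones.
        scores = {
            dim: sums[dim] / weight_total[company][dim]
            for dim in sums
            if weight_total[company][dim] > 0
        }
        # Renormalize over the dimensions that have evidence, so a missing
        # dimension does not pull the composite toward zero.
        covered = sum(dimension_weights[dim] for dim in scores)
        composite = (
            sum(dimension_weights[dim] * scores[dim] for dim in scores) / covered
            if covered > 0
            else None
        )
        ratings[company] = {**scores, "composite": composite}
    return ratings


signals = [
    Signal("Acme", "E", 0.5, weight=1.0),
    Signal("Acme", "E", -0.5, weight=3.0),
    Signal("Acme", "S", 1.0, weight=2.0),
    Signal("Acme", "G", 0.25),
]
print(aggregate(signals)["Acme"]["E"])  # → -0.25
```

Because the function only needs the current set of signals, it can be rerun whenever new articles are classified, which matches the continuous-update design described above.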
Significance
Traditional ESG rating methodologies suffer from well-documented inconsistency: ratings from different agencies for the same company often diverge dramatically. By grounding scores in observable language evidence and making the pipeline transparent, this work contributes to the emerging field of text-based ESG assessment that complements — and can audit — conventional rating approaches.
For investors and corporate sustainability teams, more frequent and explainable ESG signals enable faster responses to reputational and regulatory risks. For researchers, the framework provides a replicable methodology for studying how ESG narratives evolve in media across markets and time periods.