InProC: Industry and Product/Service Code Classification

ArXiv ID: 2305.13532

Authors: Unknown

Abstract

Determining industry and product/service codes for a company is an important real-world task that is typically very expensive because it involves manual curation of company data. Building an AI agent that can predict these codes automatically can significantly help reduce costs and eliminate human biases and errors. However, the unavailability of labeled datasets, together with the need for high-precision results in the financial domain, makes this a challenging problem. In this work, we propose a hierarchical multi-class industry code classifier paired with a targeted multi-label product/service code classifier, leveraging advances in unsupervised representation learning techniques. We demonstrate how a high-quality industry and product/service code classification system can be built using an extremely limited labeled dataset. We evaluate our approach on a dataset of more than 20,000 companies and achieve a classification accuracy of more than 92%. Additionally, we compare our approach against a dataset of 350 manually labeled product/service codes provided by Subject Matter Experts (SMEs) and obtain an accuracy of more than 96%, resulting in real-life adoption within the financial domain.
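
To make the two-stage design in the abstract concrete, here is a minimal Python sketch of a hierarchical industry classifier followed by an industry-restricted multi-label product/service classifier. It is an illustration under stated assumptions, not the authors' implementation: the taxonomy, the product codes, the `embed` function (a plain bag-of-words vector standing in for a learned unsupervised representation), the function names, and the 0.3 similarity threshold are all hypothetical. Reading "targeted" as "restricted to the predicted industry" is likewise an interpretation of the abstract.

```python
import re
import numpy as np

# Hypothetical two-level taxonomy: sector -> industry -> keyword description.
TAXONOMY = {
    "Technology": {
        "Software": "software cloud platform applications developers",
        "Semiconductors": "chips semiconductors wafers processors fabrication",
    },
    "Financials": {
        "Banking": "bank deposits loans mortgages branches",
        "Insurance": "insurance policies premiums claims underwriting",
    },
}

# Hypothetical product/service codes, keyed by industry.
PRODUCT_CODES = {
    "Software": {
        "CRM software": "crm customer relationship sales",
        "Cloud storage": "cloud storage backup files",
        "Developer tools": "developers compilers debugging tools",
    },
}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary built from every description above, so embeddings are deterministic.
_texts = [d for inds in TAXONOMY.values() for d in inds.values()]
_texts += [d for codes in PRODUCT_CODES.values() for d in codes.values()]
VOCAB = sorted({tok for t in _texts for tok in tokenize(t)})
INDEX = {tok: i for i, tok in enumerate(VOCAB)}

def embed(text):
    """Assumed stand-in for a learned representation: L2-normalized bag of words."""
    vec = np.zeros(len(VOCAB))
    for tok in tokenize(text):
        if tok in INDEX:
            vec[INDEX[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def best_match(query_vec, candidates):
    """Return the candidate name whose description is most cosine-similar to the query."""
    return max(candidates, key=lambda name: float(query_vec @ embed(candidates[name])))

def classify_industry(description):
    """Hierarchical multi-class step: choose a sector, then an industry within it."""
    q = embed(description)
    sector_texts = {s: " ".join(inds.values()) for s, inds in TAXONOMY.items()}
    sector = best_match(q, sector_texts)
    industry = best_match(q, TAXONOMY[sector])
    return sector, industry

def classify_products(description, industry, threshold=0.3):
    """Multi-label step: keep every code of the predicted industry above the threshold."""
    q = embed(description)
    codes = PRODUCT_CODES.get(industry, {})
    return [code for code, text in codes.items() if float(q @ embed(text)) >= threshold]

if __name__ == "__main__":
    text = ("Acme builds cloud software and a crm platform for sales teams "
            "with file storage and backup")
    sector, industry = classify_industry(text)
    print(sector, industry, classify_products(text, industry))
```

Restricting the second stage to the codes of the predicted industry keeps the multi-label candidate set small, which is one plausible way to favor precision when labeled data is scarce. A real system would replace `embed` with a pretrained text encoder and calibrate the threshold on held-out labels.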

Keywords: industry classification, product classification, hierarchical multi-class classifier, unsupervised representation learning

Complexity vs Empirical Score

  • Math Complexity: 3.0/10
  • Empirical Rigor: 7.0/10
  • Quadrant: Street Traders
  • Why: The paper uses relatively straightforward deep learning architectures with minimal advanced mathematical derivation, focusing instead on practical application with real-world financial datasets and reported high classification accuracies.

flowchart TD
    A["Research Goal: Automate Industry & Product/Service Code Classification"] --> B{"Key Challenge: Lack of Labeled Data"};
    B --> C["Methodology: Hierarchical Multi-Class & Multi-Label Classifier"];
    C --> D["Process: Unsupervised Representation Learning"];
    D --> E["Input: 20,000+ Company Dataset"];
    E --> F["Output: Classification System"];
    F --> G["Result 1: More than 92% Accuracy on 20,000+ Company Dataset"];
    F --> H["Result 2: More than 96% Accuracy on 350 SME-Labeled Codes"];
    G & H --> I["Outcome: High-Precision Adoption in the Financial Domain"];