Company2Vec – German Company Embeddings based on Corporate Websites
arXiv ID: 2307.09332
Authors: Unknown
Abstract
With Company2Vec, the paper proposes a novel application in representation learning. The model analyzes business activities from unstructured company website data using Word2Vec and dimensionality reduction. Company2Vec maintains semantic language structures and thus creates efficient company embeddings in fine-granular industries. These semantic embeddings can be used for various applications in banking. Direct relations between companies and words allow semantic business analytics (e.g., top-n words for a company). Furthermore, industry prediction is presented as a supervised learning application and evaluation method. The vectorized structure of the embeddings allows measuring company similarities with the cosine distance. Company2Vec hence offers a more fine-grained comparison of companies than the standard industry labels (NACE). This property is relevant for unsupervised learning tasks, such as clustering. An alternative industry segmentation is shown with k-means clustering on the company embeddings. Finally, this paper proposes three algorithms for (1) firm-centric, (2) industry-centric and (3) portfolio-centric peer-firm identification.
Keywords: representation learning, Word2Vec, company embeddings, unsupervised clustering, semantic analytics, Banking/Corporate Finance
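To make the similarity and top-n ideas from the abstract concrete, here is a minimal sketch in plain Python. It assumes that company vectors and word vectors live in the same space, as the abstract's "direct relations between companies and words" suggests. The toy vectors, the company names, and the helper names `cosine_similarity` and `top_n_words` are illustrative assumptions, not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (cosine distance = 1 - similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def top_n_words(company_vec, word_vecs, n=3):
    """Return the n words whose vectors are most similar to the company vector."""
    ranked = sorted(
        word_vecs,
        key=lambda w: cosine_similarity(company_vec, word_vecs[w]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical 3-dimensional toy embeddings (real embeddings would be much larger).
word_vecs = {
    "loan": [0.9, 0.1, 0.0],
    "credit": [0.8, 0.2, 0.1],
    "bread": [0.0, 0.9, 0.2],
    "oven": [0.1, 0.8, 0.3],
}
companies = {
    "bank_a": [0.85, 0.15, 0.05],
    "bakery_b": [0.05, 0.85, 0.25],
}

bank_words = top_n_words(companies["bank_a"], word_vecs, n=2)
bank_vs_bakery = cosine_similarity(companies["bank_a"], companies["bakery_b"])
print(bank_words, round(bank_vs_bakery, 3))
```

Because similarity is a continuous score rather than a shared label, two companies with the same NACE code can still be ranked as closer or further apart, which is the finer-grained comparison the abstract refers to.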
Complexity vs Empirical Score
- Math Complexity: 3.0/10
- Empirical Rigor: 7.5/10
- Quadrant: Street Traders
- Why: The paper relies on established, low-complexity methods like Word2Vec and k-means clustering, with minimal advanced mathematical derivations. However, it demonstrates high empirical rigor through a large-scale dataset (42k websites), multiple evaluation metrics (industry prediction, cosine similarity), and concrete banking applications, making it highly backtest-ready.
```mermaid
flowchart TD
A["Research Goal"] --> B["Data Collection<br>Corporate Websites"]
B --> C["Word2Vec<br>Word Embeddings"]
C --> D["Dimensionality Reduction<br>Company Embeddings"]
D --> E{"Application & Evaluation"}
E --> F["Industry Prediction<br>Supervised Learning"]
E --> G["Semantic Analytics<br>Top-N Words"]
E --> H["Peer Identification<br>Clustering & Algorithms"]
style A fill:#e1f5fe,stroke:#01579b
style F fill:#f3e5f5,stroke:#4a148c
style G fill:#e8f5e8,stroke:#1b5e20
style H fill:#fff3e0,stroke:#e65100
```
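The pipeline in the flowchart can be approximated end to end with a small sketch. It rests on two loud assumptions: company embeddings are built here by simply averaging the word vectors of a website's tokens, which stands in for the paper's own aggregation and dimensionality-reduction step (not described in this summary), and clustering uses a bare-bones Lloyd's k-means with hand-picked initial centroids. All data and the names `mean_pool` and `kmeans` are hypothetical.

```python
def mean_pool(tokens, word_vecs):
    """Average the vectors of known tokens into one company vector (assumed aggregation)."""
    known = [word_vecs[t] for t in tokens if t in word_vecs]
    if not known:
        raise ValueError("no token of this website has a word vector")
    return [sum(col) / len(known) for col in zip(*known)]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, init_centroids, n_iter=10):
    """Lloyd's algorithm with caller-supplied initial centroids (deterministic)."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical 2-dimensional word vectors and tokenized website texts.
word_vecs = {
    "loan": [1.0, 0.0],
    "credit": [0.9, 0.1],
    "bread": [0.0, 1.0],
    "oven": [0.1, 0.9],
}
websites = {
    "bank_a": ["loan", "credit", "loan"],
    "bank_b": ["credit", "credit", "loan"],
    "bakery_c": ["bread", "oven"],
    "bakery_d": ["oven", "oven", "bread", "unknownword"],  # unknown tokens are skipped
}

embeddings = {name: mean_pool(tokens, word_vecs) for name, tokens in websites.items()}
names = list(embeddings)
points = [embeddings[name] for name in names]

# Seed one centroid with a bank and one with a bakery to keep the toy run deterministic.
labels, centroids = kmeans(points, init_centroids=[points[0], points[2]])
clusters = dict(zip(names, labels))
print(clusters)
```

In practice the word vectors would come from a trained Word2Vec model and the clustering from a library implementation with multiple random restarts; the hand-seeded centroids are only there so the toy example gives a stable result.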