NMIXX: Domain-Adapted Neural Embeddings for Cross-Lingual eXploration of Finance

arXiv ID: 2507.09601

Authors: Hanwool Lee, Sara Yu, Yewon Hwang, Jonghyun Choi, Heejae Ahn, Sungbum Jung, Youngjae Yu

Abstract

General-purpose sentence embedding models often struggle to capture specialized financial semantics, especially in low-resource languages like Korean, due to domain-specific jargon, temporal meaning shifts, and misaligned bilingual vocabularies. To address these gaps, we introduce NMIXX (Neural eMbeddings for Cross-lingual eXploration of Finance), a suite of cross-lingual embedding models fine-tuned with 18.8K high-confidence triplets that pair in-domain paraphrases, hard negatives derived from a semantic-shift typology, and exact Korean-English translations. Concurrently, we release KorFinSTS, a 1,921-pair Korean financial STS benchmark spanning news, disclosures, research reports, and regulations, designed to expose nuances that general benchmarks miss. When evaluated against seven open-license baselines, NMIXX’s multilingual bge-m3 variant achieves Spearman’s rho gains of +0.10 on English FinSTS and +0.22 on KorFinSTS, outperforming its pre-adaptation checkpoint and surpassing other models by the largest margin, while revealing a modest trade-off in general STS performance. Our analysis further shows that models with richer Korean token coverage adapt more effectively, underscoring the importance of tokenizer design in low-resource, cross-lingual settings. By making both models and the benchmark publicly available, we provide the community with robust tools for domain-adapted, multilingual representation learning in finance.
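
To make the training recipe concrete, below is a minimal sketch of triplet-based contrastive fine-tuning in the spirit the abstract describes, using the sentence-transformers library and the public BAAI/bge-m3 checkpoint. The example triplet, the TripletLoss choice, and all hyperparameters are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: triplet fine-tuning of a bge-m3 encoder with
# (anchor, in-domain paraphrase, semantic-shift hard negative) triplets.
# The example triplet, loss choice, and hyperparameters are assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-m3")

# Each item pairs an anchor with a positive paraphrase and a hard negative
# whose surface form is close but whose financial meaning has shifted.
train_examples = [
    InputExample(texts=[
        "The firm revised its FY24 operating margin guidance upward.",    # anchor
        "Management raised its fiscal-2024 operating margin outlook.",    # positive
        "The firm revised its FY24 operating margin guidance downward.",  # hard negative
    ]),
    # ... ~18.8K such triplets in the paper's training set
]

loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.TripletLoss(model=model)  # pull anchor toward positive, away from negative

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("nmixx-bge-m3-sketch")
```

In the paper's setup, positives are in-domain paraphrases, hard negatives come from a semantic-shift typology, and exact Korean-English translation pairs anchor the cross-lingual alignment.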

Keywords: Cross-lingual embeddings, Sentence embedding models, FinBERT, Korean financial NLP, Semantic textual similarity

Complexity vs Empirical Score

  • Math Complexity: 2.0/10
  • Empirical Rigor: 7.5/10
  • Quadrant: Street Traders
  • Why: The paper focuses on empirical engineering: it adapts an existing architecture (bge-m3), builds a new dataset (KorFinSTS), and reports standard performance metrics (Spearman's rho; see the evaluation sketch below) rather than deriving new mathematics. Because it is heavily data- and implementation-focused and releases both models and a benchmark, it is directly actionable for practitioners.
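
Since every headline result is a Spearman's rho on an STS benchmark, the sketch below shows how such a score is typically computed: embed both sentences of each pair, take their cosine similarity, and rank-correlate the similarities against human gold scores. The function name and the pairs/gold inputs are hypothetical placeholders, not an interface from the paper.

```python
# Minimal STS evaluation sketch: Spearman's rho between model cosine
# similarities and human gold scores. `pairs` and `gold` are hypothetical
# placeholders for benchmark rows (e.g., KorFinSTS sentence pairs).
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

def sts_spearman(model_name: str, pairs: list[tuple[str, str]], gold: list[float]) -> float:
    model = SentenceTransformer(model_name)
    a = model.encode([s1 for s1, _ in pairs], normalize_embeddings=True)
    b = model.encode([s2 for _, s2 in pairs], normalize_embeddings=True)
    sims = (a * b).sum(axis=1)      # cosine similarity of unit vectors
    rho, _ = spearmanr(sims, gold)  # rank correlation vs. gold labels
    return rho

# Running this for a baseline checkpoint and a domain-adapted one on the same
# pairs would yield the kind of delta the paper reports (e.g., +0.22 rho).
```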

Paper Overview (Mermaid flowchart)

  flowchart TD
    A["Research Goal: Domain-Adapted Cross-Lingual<br>Financial Embeddings"] --> B["Data Collection & Construction<br>18.8K Triplet Training Set<br>1,921-Pair KorFinSTS Benchmark"]
    B --> C["Methodology: NMIXX Model Suite<br>Fine-tuned via Cross-Lingual<br>Hard Negatives & Semantic Shifts"]
    C --> D["Training & Evaluation<br>Evaluated against 7 Baselines<br>Spearman's rho Correlation"]
    D --> E{"Key Findings & Outcomes"}
    E --> F["Scores: +0.22 on KorFinSTS<br>+0.10 on English FinSTS"]
    E --> G["Analysis: Richer Korean Token<br>Coverage improves Adaptation"]
    E --> H["Deliverables: Publicly Available<br>Models & Benchmark"]