Multimodal Document Analytics for Banking Process Automation

ArXiv ID: 2307.11845

Authors: Unknown

Abstract

Traditional banks face increasing competition from FinTechs in the rapidly evolving financial ecosystem. Raising operational efficiency is vital to address this challenge. Our study aims to improve the efficiency of document-intensive business processes in banking. To that end, we first review the landscape of business documents in the retail segment. Banking documents often contain text, layout, and visuals, suggesting that document analytics and process automation require more than plain natural language processing (NLP). To verify this and assess the incremental value of visual cues when processing business documents, we compare a recently proposed multimodal model called LayoutXLM to powerful text classifiers (e.g., BERT) and large language models (e.g., GPT) in a case study related to processing company register extracts. The results confirm that incorporating layout information in a model substantially increases its performance. Interestingly, we also observed that more than 75% of the best model performance (in terms of the F1 score) can be achieved with as little as 30% of the training data. This shows that the demand for labeled data to set up a multimodal model can be moderate, which simplifies real-world applications of multimodal document analytics. Our study also sheds light on more specific practices in the scope of calibrating a multimodal banking document classifier, including the need for fine-tuning. In sum, the paper contributes original empirical evidence on the effectiveness and efficiency of multimodal models for document processing in the banking business and offers practical guidance on how to unlock this potential in day-to-day operations.

Keywords: document analytics, multimodal models, LayoutXLM, natural language processing (NLP), business process automation, Banking (Retail)
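
To make "text plus layout" concrete, here is a minimal sketch of how word-level layout is commonly prepared for models in the LayoutLM family, which conventionally take one bounding box per word rescaled to a 0 to 1000 grid. The `Token` class, the function names, and the pixel coordinates are hypothetical illustrations and are not taken from the paper, which may prepare its inputs differently.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


@dataclass
class Token:
    """Hypothetical container for one OCR word and its pixel bounding box."""
    text: str
    box: Box


def normalize_box(box: Box, page_width: int, page_height: int, scale: int = 1000) -> Box:
    """Rescale a pixel box to the 0..scale grid, independent of page resolution."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    x0, y0, x1, y1 = box
    return (
        scale * x0 // page_width,
        scale * y0 // page_height,
        scale * x1 // page_width,
        scale * y1 // page_height,
    )


def to_model_inputs(tokens: List[Token], page_width: int, page_height: int):
    """Split tokens into parallel lists of words and normalized boxes."""
    words = [t.text for t in tokens]
    boxes = [normalize_box(t.box, page_width, page_height) for t in tokens]
    return words, boxes


# Illustrative tokens only; the text and coordinates are made up.
tokens = [
    Token("Handelsregister", (100, 50, 300, 80)),
    Token("HRB", (100, 120, 150, 150)),
]
words, boxes = to_model_inputs(tokens, page_width=1000, page_height=500)
```

A text-only classifier would see only `words`; a layout-aware model additionally receives `boxes`, so the position of a field on the page can inform the prediction.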

Complexity vs Empirical Score

  • Math Complexity: 2.5/10
  • Empirical Rigor: 8.0/10
  • Quadrant: Street Traders
  • Why: The paper focuses on applied machine learning (LayoutXLM vs. BERT/GPT) with concrete performance metrics (F1 scores) and data efficiency findings, but lacks advanced mathematical derivations or theoretical proofs.
  flowchart TD
    A["Research Goal"] --> B["Improve efficiency of document-intensive<br>banking processes (Retail Segment)"]
    B --> C["Data & Inputs<br>- Company Register Extracts<br>- Dataset split: 70/30/0 (test)"]
    C --> D["Computational Process<br>- Model Comparison<br>  - Text: BERT<br>  - LLM: GPT<br>  - Multimodal: LayoutXLM (Text + Layout)"]
    D --> E["Key Outcomes"]
    E --> F["1. Layout incorporation significantly<br>boosts performance vs text-only models"]
    E --> G["2. High efficiency: more than 75% of best F1 score<br>achieved with only 30% of the training data"]
    E --> H["3. Practical guidance:<br>Need for fine-tuning in real-world deployment"]
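
The data-efficiency finding (more than 75% of the best F1 score with 30% of the training data) amounts to a simple ratio check over a learning curve. The sketch below shows one way such a check could be computed, assuming F1 scores have already been measured at several training-set fractions. The function names are hypothetical, and the scores are placeholder values chosen for illustration, not results from the paper.

```python
from typing import Dict, Optional


def relative_performance(f1_by_fraction: Dict[float, float]) -> Dict[float, float]:
    """Express each F1 score as a share of the best score, ordered by fraction."""
    if not f1_by_fraction:
        raise ValueError("need at least one (fraction, F1) pair")
    best = max(f1_by_fraction.values())
    if best <= 0:
        raise ValueError("best F1 must be positive")
    return {frac: f1 / best for frac, f1 in sorted(f1_by_fraction.items())}


def smallest_fraction_reaching(
    f1_by_fraction: Dict[float, float], threshold: float = 0.75
) -> Optional[float]:
    """Return the smallest training fraction whose relative F1 meets the threshold."""
    for frac, ratio in relative_performance(f1_by_fraction).items():
        if ratio >= threshold:
            return frac
    return None


# Placeholder learning curve: training fraction -> F1. NOT the paper's numbers.
placeholder_scores = {0.1: 0.40, 0.3: 0.62, 0.5: 0.70, 1.0: 0.80}

fraction_needed = smallest_fraction_reaching(placeholder_scores, threshold=0.75)
```

With real measurements in place of `placeholder_scores`, the same check indicates how much labeled data a team would need before committing to full annotation.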