Can LLMs Identify Tax Abuse?
ArXiv ID: 2508.20097
Authors: Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme
Abstract
We investigate whether large language models can discover and analyze U.S. tax-minimization strategies. This real-world domain challenges even seasoned human experts, and progress here can reduce the tax revenue lost to well-advised, wealthy taxpayers. We evaluate the most advanced LLMs on their ability to (1) interpret and verify tax strategies, (2) fill in gaps in partially specified strategies, and (3) generate complete, end-to-end strategies from scratch. This domain should be of particular interest to the LLM reasoning community: unlike synthetic challenge problems or scientific reasoning tasks, U.S. tax law involves navigating hundreds of thousands of pages of statutes, case law, and administrative guidance, all updated regularly. Notably, LLM-based reasoning identified an entirely novel tax strategy, highlighting these models’ potential to revolutionize tax agencies’ fight against tax abuse.
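The summary does not show how the three tasks are posed to a model, so here is a minimal sketch, assuming an OpenAI-style chat API. The model name, prompt templates, and the example fact pattern are illustrative assumptions, not the authors’ harness (their actual code and the Shelter Check dataset are on GitHub, per the note below).

```python
# Minimal sketch (not the authors' code) of the paper's three evaluation
# tasks posed as prompts. Assumes the `openai` package (>=1.0) and an
# OPENAI_API_KEY in the environment; prompts and model are placeholders.
from openai import OpenAI

client = OpenAI()

TASKS = {
    # (1) Interpret & verify: does a fully described strategy work under current law?
    "verify": (
        "Analyze the following U.S. tax strategy step by step and state whether "
        "it works under current law, citing relevant IRC sections:\n{strategy}"
    ),
    # (2) Fill gaps: complete a partially specified strategy.
    "fill_gap": (
        "The following tax strategy has a missing step marked [GAP]. "
        "Propose the step that would make the strategy work:\n{strategy}"
    ),
    # (3) Generate: devise an end-to-end strategy from scratch.
    "generate": (
        "Devise a complete strategy that minimizes U.S. federal income tax "
        "for this fact pattern:\n{strategy}"
    ),
}

def run_task(task: str, strategy_text: str, model: str = "gpt-4o") -> str:
    """Send one evaluation prompt to the model and return its answer text."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": TASKS[task].format(strategy=strategy_text)}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    example = "Taxpayer contributes appreciated stock to a partnership, then ... [GAP]"
    print(run_task("fill_gap", example))
```

In the paper itself, model outputs for tasks like these were graded by human tax experts rather than scored automatically.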
Keywords: Large Language Models (LLMs), Tax Strategy Optimization, Legal Reasoning, Information Extraction, Rule Compliance, Multi-Asset
Complexity vs Empirical Score
- Math Complexity: 2.0/10
- Empirical Rigor: 7.5/10
- Quadrant: Street Traders
- Why: The paper’s core mathematics is minimal, limited to prompting and legal analysis, but its empirical rigor is high: a handcrafted dataset (Shelter Check), released GitHub code, and expert grading of strategies generated by state-of-the-art LLMs (a toy sketch of the quadrant mapping appears below).
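For concreteness, here is a toy sketch of how the two scores above could map to a quadrant. The 5.0 cutoff and the three labels other than “Street Traders” are invented placeholders; only the “Street Traders” quadrant (low math complexity, high empirical rigor) appears in this review.

```python
def quadrant(math_complexity: float, empirical_rigor: float,
             cutoff: float = 5.0) -> str:
    """Map a (math complexity, empirical rigor) score pair to a quadrant label.

    The cutoff and the non-"Street Traders" labels are hypothetical.
    """
    low_math = math_complexity < cutoff
    high_rigor = empirical_rigor >= cutoff
    if low_math and high_rigor:
        return "Street Traders"              # simple math, strong experiments
    if not low_math and high_rigor:
        return "Quadrant B (hypothetical)"   # heavy math, strong experiments
    if low_math and not high_rigor:
        return "Quadrant C (hypothetical)"   # simple math, weak experiments
    return "Quadrant D (hypothetical)"       # heavy math, weak experiments

# The scores from this review land in "Street Traders":
assert quadrant(2.0, 7.5) == "Street Traders"
```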
```mermaid
flowchart TD
    A["Research Goal<br>Can LLMs Identify Tax Abuse?"] --> B["Methodology: Three Evaluation Tasks<br>(1) Interpret & Verify<br>(2) Fill Gaps<br>(3) Generate Strategies"]
    B --> C["Computational Process<br>LLM Reasoning on Tax Law<br>Simulated Expert Interactions"]
    C --> D["Key Findings & Outcomes<br>Novel Strategy Discovery<br>Verification Capability<br>Potential for Tax Agency Use"]
```