AI is changing the practice of tax law. This series examines the ethical, legal, and practical implications of AI across key areas of tax practice.
Generative artificial intelligence is now embedded in tax practice, and that trend is unlikely to reverse. What has changed is the risk profile. Large language models generate fluent legal analysis through probabilistic text prediction rather than authoritative reasoning. In tax law, that distinction is consequential because the practice depends on hierarchical authority, technical precision, and a self-assessment regime grounded in verifiable sources. Hallucinated authority isn’t merely a drafting defect; it is a substantive legal failure.
This article, the first of a two-part series, identifies two structural mismatches. First, probabilistic AI systems are epistemically incompatible with tax law’s demand for determinacy, particularly after the Supreme Court’s rejection of administrative deference in Loper Bright Enterprises v. Raimondo. Second, AI literalism is poorly suited to analyzing judicial anti-abuse doctrines, including economic substance and step transaction, which turn on purpose, intent, and context. The second article addresses professional responsibility, liability allocation, and governance responses, arguing that while AI may assist in drafting, it can’t discharge the non-delegable duty of verification required by tax law.
Tax professionals deploy LLMs to streamline work, especially time-intensive tasks such as parsing statutes and drafting preliminary memoranda. These efficiencies come with serious professional and legal risks. Because LLMs rely on statistical prediction rather than legal reasoning, they can’t assess precedent or apply interpretive canons, don’t reliably distinguish dicta from holdings, and ignore the hierarchical nature of tax authorities. See Artificial Intelligence (AI) Guidance for Judicial Office Holders, Courts and Tribunals Judiciary, 2 (Oct. 31, 2025); Shatrunjay Kumar Singh, Risk-Weighted Hallucination Scoring for Legal Answers: A Conceptual Framework for Trustworthy AI in Law, 10 Int’l J. Innovative Sci. & Res. Tech. 11, 2091 (2025); Jiaqi Wang et al., Legal Evaluations and Challenges of Large Language Models, 3 (2024); Joshua Kelsall et al., A Rapid Evidence Review of Evaluation Techniques for Large Language Models in Legal Use Cases: Trends, Gaps, and Recommendations for Future Research, AI & Society (2025); see also Abe Bohan Hou et al., CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation, 1 (2024); Eliza Mik, Caveat Lector: Large Language Models in Legal Practice, 19 Rutgers Bus. L. Rev. (2024); Tahir Khan, Oops! AI Made a Legal Mistake: Now What? AI Hallucinations, Professional Responsibility, and the Future of Legal Practice, The Barrister Group (Oct. 21, 2025); Matthew Dahl et al., Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models, 16 J. Legal Analysis 64, 66 (2024). Rather than genuinely analyzing the law, these tools predict what seems legally plausible, sometimes generating fabricated content, or “hallucinations,” that erodes confidence in AI-assisted legal research. Varun Magesh et al., Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, 22 J. Empir. Leg. Stud. 216, 218 (2025).
When prompted about complex provisions such as IRC §351 or §199A, an LLM answers via token probabilities rather than by retrieving and verifying controlling authority. It can fabricate safe harbors, misquote Treasury Regulations, or invent precedent. See Artificial Intelligence (AI) Guidance for Judicial Office Holders, Courts and Tribunals Judiciary, 5 (Oct. 31, 2025); see also CCBE Guide on the Use of Generative AI by Lawyers, 4 (2025). Its outputs look polished, fluent, and professionally formatted; that polish is precisely the risk. Victor Habib Lantyer, The Phantom Menace: Generative AI Hallucinations and Their Legal Implications, SSRN No. 5167036, 1 (2025). In tax practice, where even a comma can alter meaning, hallucinated authority is not a superficial flaw but a substantive legal error that can create liability.
In Mata v. Avianca, counsel submitted a brief relying on fictitious ChatGPT-generated cases, prompting Rule 11 sanctions and setting a key precedent for how courts treat AI-related risks. In Thomas v. Commissioner, the US Tax Court struck a pretrial memorandum that relied on fabricated cases, and in Delano Crossing v. Wright, the Minnesota Tax Court referred counsel to disciplinary authorities for submitting an AI-generated brief. See also Tax Ct. R. 33. These are breaches of the non-delegable duty to verify legal content before filing, not just citation mistakes. Ayinde v. London Borough of Haringey, High Ct. of Justice (London 2025); see also American Bar Association Standing Committee on Ethics and Professional Responsibility, Formal Opinion 512, Generative Artificial Intelligence Tools, 4 (July 29, 2024); State Bar of Cal. Standing Committee on Professional Responsibility and Conduct, Generative AI Practical Guidance, 1 (2023). Collectively, they underscore that attorneys and taxpayers can’t outsource their legal judgment to AI.
The risks are especially acute in tax law. See OECD, Governing with Artificial Intelligence (2025) (stating that tax administrations rely heavily on voluntary compliance and sustained trust); see also Shu-Yi Oei & Leigh Osofsky, Constituencies and Control in Statutory Drafting: Interviews with Government Tax Counsels, 104 Iowa L. Rev. 1291, 1319 (2018) (noting the unique complexity and technical language of the tax code). Unlike civil litigation, tax is built on self-assessment, so compliance turns on voluntary observance of an intricate statutory scheme. Hallucinated authorities distort that scheme, eroding the reliability of precedent and injecting false reference points into a regime already burdened by complexity. Borchuluun Yadamsuren, Steven Keith Platt & Miguel Diaz, LLM-Assisted Formalization Enables Deterministic Detection of Statutory Inconsistency in the Internal Revenue Code (Nov. 15, 2025).
As a descriptive matter, AI systems are structurally at odds with tax law. See Eljas Linna & Tuula Linna, Judicial Requirements for Generative AI in Legal Reasoning, at 1 (2025) (describing a “fundamental barrier” between the probabilistic design of language models and the choice-dependent nature of judicial reasoning); Yadamsuren et al., supra, at 8 (concluding that probabilistic prompting is fragile and often fails to conduct deep, structural reasoning). Tax law demands determinacy and proceeds through statutes, regulations, revenue rulings, and hierarchies of authority, whereas AI systems function probabilistically. They also can’t consistently differentiate binding precedent from merely persuasive sources. Yadamsuren et al., supra.
At the doctrinal level, AI literalism is incompatible with anti-abuse doctrines such as economic substance and step transaction, which hinge on purpose, intent, and economic reality, factors not recoverable from probabilistic text generation.
Epistemic Instability Post-Loper Bright
Tax law relies on verifiable authority, a baseline that generative AI unsettles. LLMs use statistical next-word prediction rather than retrieving authoritative sources; they mimic coherence without actually interpreting statutes. They forecast what sounds legally plausible. This gap between computation and cognition is key to the epistemic risk.
Mechanics of hallucination. When asked about provisions such as IRC §351 or IRC §199A, an LLM generates a response by predicting the most likely sequence of tokens based on its training data, rather than by retrieving and applying the text of the relevant source as a traditional legal research system would. Nothing in that process anchors the output to a verifiable source: the model can fabricate safe harbors, misquote Treasury Regulations, or invent precedent, and the fabrication arrives polished, fluent, and professionally formatted.
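To make the mechanism concrete, the sketch below reduces next-token prediction to a hand-built bigram table; the phrases, the probabilities, and the regulation citation (which does not exist) are all invented for illustration.

```python
import random

# Hypothetical next-token frequencies, standing in for the statistics a
# real model learns implicitly from its training corpus.
NEXT_TOKEN_PROBS = {
    "Section 351 provides a": {"safe": 0.6, "nonrecognition": 0.4},
    "safe": {"harbor": 1.0},
    "harbor": {"under Treas. Reg. 1.351-9.": 1.0},  # fictitious citation
}

def continue_text(prompt: str, max_steps: int = 3) -> str:
    """Extend the prompt one token at a time by weighted random sampling.
    Nothing here consults an authoritative source; fluency is the only
    criterion, which is why the output can be confidently wrong."""
    out, key = prompt, prompt
    for _ in range(max_steps):
        choices = NEXT_TOKEN_PROBS.get(key)
        if not choices:
            break
        key = random.choices(list(choices), weights=list(choices.values()))[0]
        out += " " + key
    return out

print(continue_text("Section 351 provides a"))
# Possible output: "Section 351 provides a safe harbor under
# Treas. Reg. 1.351-9." Polished, plausible, and false.
```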
Legal reasoning is constrained by authority: statutes, regulations, and case law. LLMs aren’t so constrained. This creates epistemic instability: practitioners can’t safely rely on text that merely appears correct. Ethical rules demand that every citation be independently checked.
This is a real and immediate operational risk. In one case, counsel relied on a non-existent safe harbor for deductions in a pretrial brief, and the IRS correctly found the position unsupported. An IRC §6662 penalty that otherwise would have been avoidable was imposed because of the hallucinated authority.
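That checking duty can be partly mechanized. The sketch below is a minimal pre-filing gate; the in-memory set stands in for a licensed citator or research database (no such simple lookup would suffice in practice), and the failing citation is fabricated for illustration.

```python
# Stand-in "authoritative database"; in practice this is a licensed
# citator or research service, not an in-memory set. The two entries
# are real reported cases, included only for illustration.
KNOWN_AUTHORITIES = {
    "Mata v. Avianca, Inc., 678 F. Supp. 3d 443 (S.D.N.Y. 2023)",
    "Loper Bright Enters. v. Raimondo, 603 U.S. 369 (2024)",
}

def lookup_citation(citation: str) -> bool:
    """Stand-in lookup; a real implementation queries a citator."""
    return citation in KNOWN_AUTHORITIES

def verify_before_filing(draft_citations: list[str]) -> list[str]:
    """Return every citation that fails verification. An empty result is
    a precondition for filing, not a substitute for reading the cited
    authority and confirming it says what the draft claims."""
    return [c for c in draft_citations if not lookup_citation(c)]

unresolved = verify_before_filing([
    "Loper Bright Enters. v. Raimondo, 603 U.S. 369 (2024)",
    "Smith v. Comm'r, 999 T.C. 999 (2099)",  # fabricated, for illustration
])
print(unresolved)  # -> only the fabricated cite; resolve it or strike it
```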
There are publicly available datasets that track AI mishaps caused by algorithmic failures in the tax and legal domain, a kind of crash log for generative AI similar to how self-driving accidents are tracked. Pramod K. Siva, Citing the Unseen: AI Hallucinations in Tax and Legal Practice: A Comparative Analysis of Professional Responsibility, Procedural Legitimacy, and Sanctions (Jan. 5, 2026); see also Damien Charlotin, AI Hallucination Cases Database (2025). These cases show what happens when generative AI tools are used without manual review.
The table below presents recent US tax and related cases involving AI-generated citations in legal filings. The errors, committed by both self-represented taxpayers and attorneys, went well beyond mere clerical mistakes and led to consequences ranging from warnings and fines to stricken briefs and disciplinary referrals.
Along with international examples, these matters expose a broader trend: AI hallucinations are capable of misleading both attorneys and pro se litigants into filing what one court labeled counterfeit briefs.
These episodes of AI-invented authorities in US tax proceedings underscore the rapid expansion and influence of this technology. In a UK tax tribunal appeal, a judge admonished an appellant for relying on an AI chatbot that “hallucinated” non-existent precedents, warning that “the accuracy of AI should not be relied upon without checking” and that fictitious submissions “may be generated” if one trusts an unchecked AI.
The Minnesota Tax Court similarly addressed a brief written by AI containing fictitious citations, concluding that the use of unverified AI output constitutes an inherently misleading practice and violates the duty of inquiry under Rule 11.
The distinctive problem is the rate of reproduction: once mistakes propagate at scale, traditional controls built for isolated incidents become structurally inadequate.
Loper Bright multiplier. The overruling of Chevron in Loper Bright Enterprises v. Raimondo removes the interpretive buffer. Agency interpretations no longer receive deference, and courts must independently construe ambiguous statutes.
In the post-Loper Bright landscape, a lawyer may need to sift through decades of congressional record materials, committee reports, and floor debates to establish statutory intent without relying on judicial deference to the IRS. While a human associate may miss details, a retrieval-augmented generation (RAG) system can digest the entire legislative corpus in seconds.
However, there is a distinction between retrieving a document and understanding it. Loper Bright places the burden of statutory interpretation on the court and, by extension, the lawyer. This burden can’t be delegated to an AI. Even the best retrieval-based systems still need humans to verify and interpret retrieved material.
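A minimal sketch of that division of labor follows; the corpus, retriever, and function names are hypothetical stand-ins, and the design point is simply that every drafted assertion carries a pointer to a document a human can open and read.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    source_id: str  # must resolve to a real, retrievable document
    text: str

# Toy corpus; a real system indexes an authoritative legislative archive.
CORPUS = [
    Passage("committee-report-001", "text of a committee report"),
    Passage("floor-debate-017", "text of a floor statement"),
]

def retrieve(query: str, k: int = 2) -> list[Passage]:
    """Toy retriever ranked by keyword overlap; real systems use
    vector search over an indexed corpus."""
    def score(p: Passage) -> int:
        return sum(word in p.text for word in query.lower().split())
    return sorted(CORPUS, key=score, reverse=True)[:k]

def draft_with_provenance(query: str) -> list[tuple[str, str]]:
    """Pair every drafted sentence with the source_id it rests on, so a
    human can open the underlying document before anything is filed."""
    return [(f"[draft summary of {p.source_id}]", p.source_id)
            for p in retrieve(query)]

for sentence, source in draft_with_provenance("committee report on section X"):
    print(sentence, "<- verify against", source)  # the non-delegable step
```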
Fabricated citations can be outcome-determinative, and AI increases that risk. If an LLM hallucinates a Senate report or a Treasury explanation to support a statutory reading, the error is irreparable.
Without Chevron, there is no agency cushion. The court must rely on text, structure, and verifiable history. Practitioners who file AI-generated legislative summaries without verification do more than act negligently; they corrupt the judicial function by feeding it unverified secondary sources.
In the post-Loper Bright regime, interpretive legitimacy turns on evidentiary discipline. Probabilistic outputs that can’t be traced to official records are destabilizing. They inject fiction into what must be a fact-based inquiry.
Algorithmic bias and jurisdictional flattening. LLMs are trained on corpora that overrepresent certain jurisdictions, doctrines, or lines of authority. In tax law, this risk is particularly acute because federal authorities dominate the training corpus, while state and international regimes remain underrepresented.
This imbalance forces the model to internalize a distorted representation of legal doctrine, leading to “jurisdictional flattening,” where models may, for example, apply Ninth Circuit precedent to Fifth Circuit taxpayers, contrary to the rule of Golsen v. Commissioner that the Tax Court follows the precedent of the circuit to which an appeal lies.
These systems also conflate OECD guidelines with US transfer pricing rules and incorrectly equate EU VAT frameworks with US sales tax structures. These are not hallucinations in the classic sense; they are regime errors, statistical artifacts presented as legal truth. The result is a regression to the mean: AI collapses nuance and defaults to the majority or most frequent rule.
But tax law operates on specificity, not averages. The proper answer usually depends on a precise jurisdiction, taxpayer profile, or factual exception. AI doesn’t recognize these constraints; it smooths them out, degrading the accuracy of legal advice.
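The Golsen problem, at least, admits a mechanical guardrail. The sketch below assumes citation metadata that records the deciding court and a simplified venue table; both are hypothetical, and the actual venue rules under IRC §7482 are more involved.

```python
# Hypothetical venue table; real analysis must handle entity taxpayers,
# stipulations, and multi-state facts.
APPEAL_CIRCUIT = {"TX": "5th Cir.", "CA": "9th Cir.", "NY": "2d Cir."}

def flag_out_of_circuit(citations: list[dict], taxpayer_state: str) -> list[dict]:
    """Under Golsen, the Tax Court follows the precedent of the circuit
    to which appeal lies. Flag circuit-court cites from anywhere else so
    a human can classify them as binding, persuasive, or inapposite."""
    home = APPEAL_CIRCUIT[taxpayer_state]
    return [c for c in citations
            if c["court"].endswith("Cir.") and c["court"] != home]

draft_citations = [
    {"cite": "Example v. Commissioner", "court": "9th Cir."},  # hypothetical
    {"cite": "Sample v. Commissioner", "court": "5th Cir."},   # hypothetical
]
print(flag_out_of_circuit(draft_citations, "TX"))  # flags the 9th Cir. cite
```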
AI Literalism vs. Judicial Doctrine
Tax law isn’t a codebook of mechanical steps; it is a body of doctrines grounded in statutory text and judicial interpretation. Generative AI misreads this structure. The model prioritizes surface fluency while disregarding substantive legal requirements. The system simulates form but lacks the capacity to infer underlying purpose. This creates a substantive mismatch: AI can replicate the steps of compliance but not the reason those steps matter.
Automaton’s blind spot. Asked to minimize tax, AI manipulates textual patterns. It has no conception of legal doctrines like business purpose, economic substance, or intent; these are qualitative constructs that exist outside token prediction. An LLM can draft an IRC §351 exchange that appears valid on its face: property transferred, stock received, 80% control achieved. But it can’t evaluate whether the transaction was orchestrated solely for tax deferral without economic substance. The form is satisfied; the doctrine isn’t.
Economic substance doctrine (IRC §7701(o)). The codified economic substance doctrine imposes a conjunctive test requiring meaningful economic change and substantial non-tax purpose. AI can’t meet this standard. The system simulates economic gain through projected minimal profits and fabricates business justifications using statistically common rationales.
But these aren’t factual representations; they are probabilistic fabrications. The result is the “technically perfect sham,” a transaction that satisfies the literal text of the Internal Revenue Code while violating its substantive guardrails. While AI can construct the transaction, a practitioner must dismantle it prior to filing.
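For concreteness, the conjunctive structure of §7701(o) can be written out as a two-input check; the function below is illustrative, and its boolean inputs are precisely the fact-bound judgments that only a human reviewer, and ultimately a court, can supply.

```python
def passes_economic_substance(
    meaningful_economic_change: bool,  # apart from federal income tax effects
    substantial_nontax_purpose: bool,  # apart from federal income tax effects
) -> bool:
    """The two prongs of IRC §7701(o), joined conjunctively: both must
    be satisfied. The inputs are qualitative, fact-bound findings that
    no token statistic can supply; a model can only make them LOOK true."""
    return meaningful_economic_change and substantial_nontax_purpose

# An AI-drafted structure can dress both inputs up on paper (projected
# de minimis profits, boilerplate rationales); the practitioner's job is
# to test whether they are true in fact before filing.
print(passes_economic_substance(False, True))  # -> False: form without substance
```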
Step transaction doctrine. The step transaction doctrine collapses formal steps into a single transaction where they are interdependent or prearranged. AI sequences operations independently: structure A, then structure B, then outcome C. This “chain-of-thought” reasoning reflects how LLMs solve problems.
But in tax, the court looks through this sequence and asks whether the taxpayer had a fixed plan. If so, the steps merge. An AI, unaware of the integration risk, may inadvertently script a series of transactions that courts treat as one abusive whole.
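A toy rendering of that integration risk follows; the step data and the prearranged flag are hypothetical simplifications of a doctrine that actually turns on the binding-commitment, mutual-interdependence, and end-result tests.

```python
# Steps an AI scripts one at a time merge into a single unit of analysis
# when they are prearranged. All step data here is hypothetical.
steps = [
    {"name": "A: contribute property", "prearranged_with_next": True},
    {"name": "B: issue stock", "prearranged_with_next": True},
    {"name": "C: redeem stock", "prearranged_with_next": False},
]

def collapse(steps: list[dict]) -> list[list[str]]:
    """Merge each chain of prearranged steps into one analytical unit."""
    units, current = [], []
    for step in steps:
        current.append(step["name"])
        if not step["prearranged_with_next"]:
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units

print(collapse(steps))
# -> [['A: contribute property', 'B: issue stock', 'C: redeem stock']]
# Viewed as one transaction, the intermediate forms drop out.
```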
Liability gap. AI can replicate the mechanics of compliance, but it can’t recognize legal doctrine, and that gap translates directly into legal liability. Practitioners who rely on AI to structure transactions risk accuracy-related penalties under IRC §6662 and potential promoter sanctions under IRC §6700 if the AI-generated plan is later deemed abusive. When the human element is absent, the legal standard isn’t met.
Takeaways
AI’s efficiencies are undeniable, but so are its risks. It introduces epistemic instability. Large language models repeatedly generate inaccurate legal content despite maintaining a facade of fluency. In a post-Loper Bright world, where courts no longer defer to agency guidance, this instability becomes a direct threat to tax adjudication.
AI literalism can’t satisfy anti-abuse doctrines. Tools that optimize for text patterns can’t recognize substance. They generate form, while the law requires purpose. This mismatch produces synthetic compliance: schemes that check statutory boxes while failing to meet doctrinal substance requirements.
As Part 2 of this series will demonstrate, these structural failures necessitate a professional response. The profession must move from passive reliance to active verification. The duty of competence now includes the technological capacity to distinguish statistical probability from legal authority.
This article does not necessarily reflect the opinion of Bloomberg Industry Group, Inc., the publisher of Bloomberg Law, Bloomberg Tax, and Bloomberg Government, or its owners.
Author Information
Pramod Kumar Siva is an international tax practitioner with over 25 years of cross-border experience spanning North America, Europe, the Middle East, and Asia. He is also a visiting adjunct faculty member at Texas A&M University.
To contact the editors responsible for this story: Soni Manickam at smanickam@bloombergindustry.com;