HIT Consultant

Hierarchical Condition Category (HCC) coding is one of the areas in healthcare where AI looks like the obvious solution. On paper, it makes perfect sense. Use it to scan charts, identify diagnoses, group them into risk categories, and more efficiently predict future healthcare costs. In Medicare Advantage and value-based care, where risk adjustment directly impacts reimbursement, it sounds like a natural fit. But reality tells a very different story.
If you want to understand what is actually happening in HCC coding today, you have to go beyond conference panels and vendor presentations. You have to go straight to the source: speak to the coders reviewing charts late into the evening, the CDI teams trying to close documentation gaps, and the physicians clicking through alerts in already overloaded clinic schedules.
Instead of simplifying workflows, many AI tools are overwhelming clinicians and coders with low-quality outputs. Providers describe being inundated with suggestions at the point of care that lack context or clinical relevance. After enough weak recommendations, trust erodes, alerts become noise, and the system that was meant to assist starts to feel like an interruption.
Coders share similar frustrations. Retrospective reviews are often flooded with false positives and low-precision suspects. Instead of eliminating manual work, AI shifts it. Coders become validators of questionable output, spending valuable time dismissing conditions that should never have surfaced in the first place.
We recently spoke with an Accountable Care Organization that experimented with a general-purpose large language model (LLM) for HCC suspecting. Nearly 70% of the AI-generated suspects were rejected by providers. That’s not a marginal tuning issue, but a clear indicator that the system fundamentally does not understand the clinical workflow it’s trying to augment.
When clinicians repeatedly override AI, frustration builds quickly. This breeds a lack of trust and eventually burnout, causing adoption to stall. But the deeper issue is that HCC coding has too often been treated like a generic natural language processing (NLP) problem. A growing number of vendors are building solutions on top of general-purpose LLMs and attempting to adapt them to healthcare through prompting alone.
To put it simply, this doesn’t work. HCC coding is not a chatbot exercise — it’s a high-precision, high-stakes reimbursement discipline. Every diagnosis submitted affects Risk Adjustment Factor (RAF) scores. Every unsupported condition carries audit exposure and every error has financial consequences.
General-purpose LLMs are remarkable tools, but they were never designed for the clinical nuance, regulatory constraints, and documentation rigor required under the Centers for Medicare & Medicaid Services (CMS) reimbursement rules. In other words, in risk adjustment, “almost right” is still wrong.
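To make the stakes concrete, here is a minimal sketch of how a RAF score aggregates: a demographic base plus a coefficient for each supported HCC. Every value below is invented for illustration; real coefficients come from the CMS-HCC model published for each payment year, and the actual model includes interaction terms and normalization not shown here.

```python
# Illustrative RAF arithmetic. All numbers are hypothetical,
# not actual CMS-HCC model coefficients.

# Hypothetical demographic factor for one member (age/sex/eligibility)
demographic_factor = 0.35

# Hypothetical coefficients keyed by HCC category (invented values)
hcc_coefficients = {
    "HCC18": 0.30,  # e.g., diabetes with chronic complications
    "HCC85": 0.33,  # e.g., congestive heart failure
}

# Only diagnoses the organization can defend with documentation count
supported_hccs = ["HCC18", "HCC85"]

# RAF is (roughly) additive: demographic base + each supported HCC
raf = demographic_factor + sum(hcc_coefficients[h] for h in supported_hccs)
print(round(raf, 2))  # 0.98
```

Because each unsupported HCC adds its full coefficient to the score, a single indefensible diagnosis directly inflates reimbursement, which is exactly why "almost right" carries audit and financial exposure.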
Another troubling pattern we’ve observed is “unlinked chart harvesting.” Some systems scan years of historical documentation and surface every condition they can find, regardless of whether it’s clearly tied to a documented encounter. The suspect lists may look impressive, but the burden of validation falls back on coders. Which encounter supports this diagnosis? Was the condition assessed and managed? Would the organization stand behind it in a Risk Adjustment Data Validation (RADV) audit?
As audit scrutiny increases, defensibility is no longer optional. Diagnoses must be supported by medical record documentation and clearly linked to legitimate encounters. When AI outputs are not encounter-aware, organizations do not just assume inefficiency — they assume risk. So what does responsible AI in HCC coding actually look like?
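The encounter-awareness requirement can be sketched as a simple gate: an AI-generated suspect is surfaced only if it cites a specific documented encounter, so every diagnosis that reaches a coder is already traceable for a RADV audit. The field names and data shapes below are hypothetical, purely to illustrate the filtering logic.

```python
# Minimal sketch of an encounter-aware gate for AI-generated HCC
# suspects. Field names ("code", "encounter_id") are hypothetical.

# Encounters the organization can actually point to in the record
documented_encounters = {"enc-1001", "enc-1002"}

ai_suspects = [
    {"code": "E11.9",  "encounter_id": "enc-1001"},  # linked: keep
    {"code": "I50.22", "encounter_id": None},        # unlinked harvest: drop
    {"code": "N18.4",  "encounter_id": "enc-9999"},  # unknown encounter: drop
]

def defensible(suspect):
    """A suspect is defensible only if it cites a documented encounter."""
    return suspect["encounter_id"] in documented_encounters

accepted = [s for s in ai_suspects if defensible(s)]
print([s["code"] for s in accepted])  # ['E11.9']
```

A filter like this runs before anything reaches a coder's queue, inverting the "harvest everything, let coders validate" pattern described above.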
It starts with acknowledging that this is not a generic problem. It requires models trained on real, de-identified patient charts. It requires healthcare-specific NLP and proprietary AI systems built for clinical reasoning, not repurposed from general internet data. It requires continuous tuning based on real-world coding workflows and feedback from providers and compliance teams. And above all, it requires a validation-first mindset before anything reaches production.
The goal shouldn’t be to generate more suspects. It’s to generate better ones. This means diagnoses that are encounter-linked, clinically supported, and defensible before they’re ever submitted. In one health system serving roughly 100,000 patients, longitudinal chart analysis combined with real-time, encounter-specific suggestions led to a provider selection rate of approximately 70%. That number reflects trust. Providers don’t select recommendations they don’t believe in.
In another case, a health plan reviewing its CMS submissions for the 2024 and 2025 payment years uncovered more than 2,000 undercoded or missed HCCs, each tied to defensible documentation. The outcome was not just a measurable revenue lift, but greater confidence heading into audits. These results weren’t from bigger models or more aggressive prompting. They came from understanding the complexity of the use case and building the technology accordingly.
HCC coding isn’t the place for experimentation. It is infrastructure-level technology that directly affects revenue integrity and compliance exposure. At its core, HCC is about trust: between clinicians and technology, between compliance teams and submitted codes, and between payers and regulators. Any system, AI or otherwise, that can’t clearly earn that trust doesn’t belong anywhere near reimbursement.
AI can absolutely improve HCC coding. But only if it is built for healthcare, validated to the core, and designed to work with clinicians and coders, not against them. In this space, accuracy is not a feature. It is the entire product.
So what should leaders look for? Start by asking whether the system was purpose-built for risk adjustment or simply adapted from a general-purpose model. Demand encounter-linked, evidence-traceable outputs, not broad historical harvesting. Examine how false positives are controlled and how provider trust is measured in real-world settings. And most importantly, ensure compliance and audit defensibility were built into the architecture from day one.
In HCC coding, precision isn’t optional. It’s protection.
About Ritwik Jain
Ritwik Jain is the Co-Founder and CRO of Martlet.AI, a John Snow Labs spin-off building products for the clinical coding and risk adjustment ecosystem to advance value-based care. He also serves as Senior Director of Healthcare Payers at John Snow Labs, where he leads the adoption of NLP and Generative AI across major U.S. health plans. With a background in AI, digital and cloud transformation, Ritwik focuses on aligning innovation with business outcomes to improve efficiency, accuracy, and scalability in healthcare.
