Grading Surgical Complications with an LLM

My thesis work had already put LLM pipelines inside Varha's secure environment, and nobody else had done that, so I was brought in to lead the foundational phase of a Sitra-funded pilot. Surgeons there already graded complications on the international Clavien-Dindo scale, but recorded the grade as free text, so it couldn't be analysed, compared, or even linked back to the patient. I led the early planning, shaped the technical approach with the surgeons, and built the initial pipeline: it assembles a patient's post-operative history and proposes the Clavien-Dindo grade, with a short justification, for the surgeon to confirm, all inside the hospital's secure environment. I set the foundations and then handed it on; the pilot is still running through 2026, so the measured results are honestly not mine to claim.

Context & Stakes

Some surgical complications can't be prevented, only learned from. But here the data that learning needs lived as free text nobody could analyse.

A large share of patient harm involves surgery, and some of those complications are preventable. To prevent the next one you have to be able to see patterns, which means the data has to be structured and comparable. The Clavien-Dindo classification is the international standard for grading how severe a surgical complication is, from grade I (a minor deviation) up to grade V (death). But in Finland it isn't applied systematically or recorded in structured form. In this county, surgeons did grade complications with Clavien-Dindo in gastrointestinal surgery, but they recorded the grade as free text inside the patient notes.

Free text is almost useless for learning. You can't aggregate it, you can't trend it, and because the complication data sitting in the county's data lake came without patient identifiers, you couldn't even follow a graded complication back to the patient or the operation it came from. All of the management and patient-safety value was locked behind a missing structure. The pilot, run by Varha with the Finnish Centre for Client and Patient Safety and funded by Sitra, set out to unlock it, with AI tested as support for the recording rather than a replacement for the clinician.

Role & Approach

I worked on this pilot embedded with Varha's surgical and IT teams. The reason I was brought in was specific: I'd already built LLM pipelines on real patient data inside their environment, and nobody else had. I came in to set the foundations.

That meant a few distinct things. I was responsible for the initial planning, including the bid for the Sitra grant the pilot ran on. I planned the technical approach with the surgeons: what data a complication grade actually depends on, and how a model could propose one without adding work to a surgeon's day. And I built the initial pipeline that does it. Around me, Varha was the lead implementer for the clinical content and the system connections, the Finnish Centre was the expert partner and the route to national scale, and an IT vendor handled the structured-recording side.

It's a design, code, and AI role more than a service-design one: the same person led the planning, defined the clinical data, and built the pipeline. I owned the foundational phase and then moved on, and the pilot has kept running since.

What I Found

The core insight came quickly and shaped everything: the information needed to grade a complication already exists in the records. It's just scattered across many systems and written as prose, so it's neither analysable nor linkable. The bottleneck is structure and consistency, not missing data.

So with the surgeons I defined the inputs a grade actually depends on: the patient notes, medication, endoscopies, re-operations, anaesthesias, dialyses, imaging, stoma procedures, lab results, and the fact and timing of death. The pipeline assembles those into a single 90-day post-operative timeline per patient.
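To make the assembly step concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the pilot's code: the `Event` shape, the source labels, and the function names are hypothetical, and the sample records are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One dated entry from any source system (hypothetical shape)."""
    when: datetime
    source: str  # e.g. "note", "medication", "lab", "reoperation"
    text: str


def assemble_timeline(events, operation_time, days=90):
    """Return events from the operation up to `days` later, oldest first."""
    end = operation_time + timedelta(days=days)
    in_window = [e for e in events if operation_time <= e.when <= end]
    return sorted(in_window, key=lambda e: e.when)


def render(timeline, operation_time):
    """Format the timeline as one line per event, keyed by post-operative day."""
    lines = []
    for e in timeline:
        day = (e.when - operation_time).days
        lines.append(f"Day {day:>2} [{e.source}] {e.text}")
    return "\n".join(lines)


# Invented sample data, for illustration only.
op = datetime(2024, 1, 10, 9, 0)
events = [
    Event(datetime(2024, 1, 14, 8, 0), "lab", "CRP elevated"),
    Event(datetime(2024, 1, 12, 15, 0), "note", "Wound redness noted"),
    Event(datetime(2024, 5, 1, 10, 0), "note", "Routine follow-up"),  # past day 90
    Event(datetime(2024, 1, 5, 10, 0), "note", "Pre-op assessment"),  # before surgery
]
timeline = assemble_timeline(events, op)
print(render(timeline, op))
```

The point of the sketch is the shape of the problem: many sources, one chronological view per patient, bounded by the operation date.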

How it works. The pipeline pulls the data a grade actually depends on, assembles a 90-day post-operative timeline for the patient, and prompts a locally hosted model for the Clavien-Dindo grade plus a one-to-two-sentence justification. In the intended workflow the surgeon confirms or overrides the proposal; for validation, the model's grade is compared with the grade the clinician had already recorded.
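The prompting step could look roughly like the sketch below. The prompt wording, the `GRADE:` / `JUSTIFICATION:` reply format, and the function names are all assumptions made for illustration; the call to the locally hosted model is deliberately left out, since the serving interface isn't described here. The grade list uses the standard Clavien-Dindo subgrades for III and IV.

```python
import re

# Clavien-Dindo grades, including the standard III and IV subgrades.
GRADES = ("I", "II", "IIIa", "IIIb", "IVa", "IVb", "V")

# Hypothetical prompt wording; the real prompt is not shown in this write-up.
PROMPT_TEMPLATE = """You are assisting a surgeon with complication grading.
Below is a patient's post-operative timeline.
Reply in exactly this form:
GRADE: <one of {grades}>
JUSTIFICATION: <one or two sentences>

Timeline:
{timeline}
"""


def build_prompt(timeline_text):
    """Fill the template with the allowed grades and the rendered timeline."""
    return PROMPT_TEMPLATE.format(grades=", ".join(GRADES), timeline=timeline_text)


def parse_reply(reply):
    """Return (grade, justification), or None if the reply is off-format."""
    grade_match = re.search(r"^GRADE:[ \t]*(\S+)[ \t]*$", reply, flags=re.MULTILINE)
    just_match = re.search(r"^JUSTIFICATION:[ \t]*(.+)$", reply, flags=re.MULTILINE)
    if not grade_match or not just_match:
        return None
    grade = grade_match.group(1)
    if grade not in GRADES:
        return None  # never pass an unknown grade on to the surgeon
    return grade, just_match.group(1).strip()


# Invented reply, for illustration only.
reply = "GRADE: IIIb\nJUSTIFICATION: Re-operation under general anaesthesia on day 6."
print(parse_reply(reply))
```

Returning `None` for anything off-format keeps a malformed reply from ever reaching the surgeon as if it were a valid proposal.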

One subtlety was worth getting right: the timeline's anchor. The obvious choice is the operation plus 90 days, but clinicians sometimes record the grade before that window has closed, which means the full 90-day timeline may not match what they had actually seen when they graded. So the pipeline also assembles a second timeline anchored on the moment the clinician recorded the grade, to make sure the model is reasoning from the same evidence the clinician had. The grade the surgeon had already written becomes the ground truth I validate against.
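One way to express the two anchors in code is sketched below. How the second window is bounded isn't spelled out above, so this assumes it runs from the operation to the moment the grade was recorded and never extends past the full 90 days; the function name and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def timeline_windows(operation_time, graded_at, days=90):
    """Return (full_window, as_seen_window), each a (start, end) pair."""
    full_end = operation_time + timedelta(days=days)
    # Assumption: the second window stops when the grade was recorded,
    # and never runs past the end of the full window.
    seen_end = min(graded_at, full_end)
    return (operation_time, full_end), (operation_time, seen_end)


# Invented dates, for illustration only.
op = datetime(2024, 1, 10, 9, 0)
graded = datetime(2024, 2, 1, 14, 30)
full, as_seen = timeline_windows(op, graded)
print("full window ends:   ", full[1])
print("as-seen window ends:", as_seen[1])
```

Under this assumption, an event on day 40 would appear in the full window but not in the as-seen window for a grade recorded on day 22, which is exactly the mismatch the second timeline is there to remove.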

Problem Definition

Asking whether a model can guess the grade skips what actually mattered: complication data that nobody can analyse, compare, or even link to a patient. The point was to make it structured and reusable without adding a minute to a surgeon's day.

That framing changes the design. The moment recording feels like extra, disconnected admin, it stops happening, no matter how clever the model is. So the model couldn't be a separate tool that produces a verdict; it had to be a suggestion that slots into the existing flow, proposing a grade the surgeon confirms or corrects in a moment. And because the whole reason to do this is national learning, it had to be scalable by design, built on a shared data model and standard classifications rather than a one-county workaround.

Key Decisions

The model proposes; the clinician decides

Discarded: an autonomous classifier that writes the grade into the record on its own. Tidy on a slide, untrustworthy in a clinic.

Chosen: the model proposes the most likely Clavien-Dindo grade with a short justification, and the surgeon confirms or overrides it. The human stays in the loop by design.

Tradeoff: it doesn't remove the human step, but that's the point. Clinical accountability stays with the clinician, the suggestion is something a busy surgeon will actually accept, and the accept-or-override action gives a real signal (an acceptance rate) to measure later.
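As a sketch of how that signal could be computed later, assuming each proposal is stored next to the grade the surgeon finally confirmed. The `Review` record and its field names are hypothetical, and the sample values are invented, not pilot results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Review:
    """One model proposal and the surgeon's decision (hypothetical shape)."""
    proposed: str
    confirmed: Optional[str]  # None until the surgeon has acted


def acceptance_rate(reviews):
    """Share of decided proposals the surgeon kept unchanged, or None."""
    decided = [r for r in reviews if r.confirmed is not None]
    if not decided:
        return None
    accepted = sum(r.proposed == r.confirmed for r in decided)
    return accepted / len(decided)


# Invented sample data, for illustration only.
reviews = [
    Review("II", "II"),      # accepted
    Review("IIIa", "IIIb"),  # overridden
    Review("I", "I"),        # accepted
    Review("II", None),      # not yet reviewed, so excluded
]
rate = acceptance_rate(reviews)
print(f"acceptance rate: {rate:.2f}")
```

Leaving undecided proposals out of the denominator keeps a backlog of unreviewed cases from reading as rejection.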

Open models, run inside the hospital walls

Discarded: a commercial cloud API, by far the easiest thing to call.

Chosen: open-weight models (gpt-oss-20b and gpt-oss-120b) running inside the county's own secure data-services environment, because real patient records cannot leave it.

Tradeoff: considerably more to stand up and operate than an API key, but it's the only architecture that is both lawful and trustworthy for this data. The privacy constraint wasn't a footnote; it set the whole shape of the build.

Validate against the grade clinicians already assigned

Discarded: scoring the model against a clean synthetic benchmark.

Chosen: compare the model's grade to the Clavien-Dindo grade the surgeons had recorded by hand, on real gastrointestinal-surgery data.

Tradeoff: that ground truth is only as consistent as the free-text recording it came from, which is the very weakness the project exists to fix. I treated it as a known limit to be honest about, not a clean accuracy number to wave around.
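A comparison of this kind can be sketched as below: exact agreement plus a tally of which disagreements occur. The function name and the sample pairs are invented for illustration and are not pilot results; given the limit just described, the tally is often more informative than the headline rate.

```python
from collections import Counter


def agreement(pairs):
    """pairs: iterable of (recorded_grade, model_grade).

    Returns (exact agreement rate, Counter of disagreeing pairs).
    """
    pairs = list(pairs)
    if not pairs:
        return None, Counter()
    matches = sum(rec == pred for rec, pred in pairs)
    disagreements = Counter((rec, pred) for rec, pred in pairs if rec != pred)
    return matches / len(pairs), disagreements


# Invented sample data, for illustration only.
pairs = [("II", "II"), ("IIIa", "IIIb"), ("I", "I"), ("IIIa", "IIIb")]
rate, disagreements = agreement(pairs)
print(f"exact agreement: {rate:.2f}")
for (recorded, predicted), n in disagreements.most_common():
    print(f"recorded {recorded}, model said {predicted}: {n}")
```

A recurring pair in the tally is a prompt to read those cases by hand, since the error could sit in the model or in the original free-text grade.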

Solution & Deliverables

What I built and tested was the pipeline above: pull the defined data for a patient, assemble the two 90-day post-operative timelines, prompt the locally hosted models for the Clavien-Dindo grade and a one-to-two-sentence justification, and compare the answer to the clinician's recorded grade.

Around it, the pilot aimed to produce an operating model for structured Clavien-Dindo recording that could live inside the patient-information system; the AI solution proposing the grade as recording support; a management dashboard aggregating the structured complication data for leadership and root-cause work; and an effectiveness evaluation. The structural payoff is the part that matters: with the grade captured as data rather than prose, serious complications (grades III to V) can for the first time be linked with patient identifiers and followed back to a pathway, instead of sitting as anonymous free text. It was designed from the start to scale, with the Finnish Centre as the route to other counties and other adverse-event types.

Outcomes

I set the foundations the pilot ran on

I owned the early planning, including the bid for the Sitra grant, then built the technical foundations: the data the model depends on, the approach for proposing a grade, and the initial working pipeline. The pilot could run because that groundwork was in place.

A working pipeline on real data

I built and tested the pipeline on real gastrointestinal-surgery records: assemble the post-operative timeline, prompt the locally hosted model for a Clavien-Dindo grade and a short justification, and check it against the grade the surgeon had recorded. It showed the approach was feasible, carrying my earlier thesis work forward onto a live clinical problem.

Still running, so the numbers aren't mine to claim

I owned the early, foundational phase and then handed it on; the pilot has continued through to 2026, beyond my involvement. So there's no measured accuracy or adoption figure for me to put here, and I won't invent one. What is honestly mine is the planning, the foundations, and a pipeline that worked.

Honest Reflection

The real risk was never the model's accuracy. It was whether surgeons would adopt the recording at all. A grading tool that sits to one side of the workflow, however accurate, is a tool nobody opens. That is why the whole thing is built as a suggestion that drops into the existing flow, and why I spent as much time with the surgeons on where the grade lands as on how the model produces it.

The limitation I would name first is the ground truth. I validated against the grades clinicians had already written, which means my reference inherits exactly the inconsistency the project set out to fix. A cleaner evaluation would need a curated, adjudicated reference set, which is a project in itself, and an honest next step rather than something to paper over.

What shaped the build most was the privacy constraint. "Patient data never leaves the building" isn't a compliance line at the end; it's the reason the models are open-weight and self-hosted, and that single decision rippled through the entire architecture.

The distilled version: in clinical AI, the goal is a good suggestion, not a verdict. The model earns its place by making the right record the easy one to keep, and by being humble enough to be overruled.