[{"data":1,"prerenderedAt":278},["ShallowReactive",2],{"work-intelligent-incident-reporting":3},{"id":4,"title":5,"body":6,"client":255,"deliverables":256,"description":257,"displayTitle":258,"extension":42,"featured":259,"hero":260,"heroAlign":261,"heroAspect":40,"heroComponent":262,"meta":263,"navigation":259,"path":264,"role":265,"seo":266,"stem":267,"tags":268,"team":273,"timeline":274,"tldr":275,"year":276,"__hash__":277},"work\u002Fwork\u002Fintelligent-incident-reporting.md","AI Isn't the Hard Part",{"type":7,"value":8,"toc":251},"minimark",[9,27,48,70,82,145,167,190,203,234],[10,11,14,21,24],"case-section",{"number":12,"title":13},"02","Context & Stakes",[15,16,17],"p",{},[18,19,20],"em",{},"Patient-safety incidents are inevitable in healthcare. Learning from them is not, and in Finland the data you would need to learn from was scattered across a dozen systems that don't talk to each other.",[15,22,23],{},"These incidents (the wrong dose, an avoidable infection, a complication that should have been caught) happen in every health system. Reporting them and learning from them is how you stop the avoidable harm from happening again. It's mandatory, but in practice often neglected. The estimates around the stakes are large: international figures put the cost of errors and harm at up to 13% of healthcare spending, and in Finland alone safety incidents are estimated to add up to roughly a billion euros a year. Reform of incident reporting is one of the 10 key metrics in Finland's national Client and Patient Safety Strategy for 2022 to 2026.",[15,25,26],{},"The problem is that the data needed to drive that learning barely exists in usable form. Mandatory adverse events flow into Hilmo, the national care registry, coded in ICD-10 and fed automatically from records, but near-misses never reach it. Voluntary reporting happens in local systems like HaiPro that stay at the level of a single organisation and never aggregate nationally. The result is incomplete, inconsistent, uncomparable data: enough to file, not enough to learn from, and not enough to steer a wellbeing services county's funding or safety work. That was the situation I was asked to study.",[10,28,31,34,37,45],{"number":29,"title":30},"03","Role & Approach",[15,32,33],{},"I ran the project end to end: the research and analysis, the process design, and the build. It was a research subproject inside Varha, the Wellbeing Services County of Southwest Finland, sitting under the national reform of incident reporting that the Ministry of Social Affairs and Health was funding. Varha's data-services team provided the secure environment the work had to run inside, and a Varha surgeon evaluated the reports the system generated. Being one person across the whole thing is why it spans research, design, and engineering rather than sitting in just one.",[15,35,36],{},"I framed the whole thing as design-science research, following the established six-stage model: identify the problem, define what a solution needs to do, design and build an artifact, demonstrate it, evaluate it, and communicate the result. In practice that meant interviewing my way to the requirements before I wrote a line of code, then building something real and putting it in front of a clinician to test, rather than arguing about AI in the abstract.",[38,39],"case-image",{"aspect":40,"caption":41,"size":42,"src":43,"align":44},"3\u002F2","The design-science research frame: from problem identification through to a built, evaluated artifact. Stage three is the GPT-4o proof-of-concept.","md","\u002Fimg\u002Fwork\u002Fintelligent-incident-reporting\u002Fmethodology.png","right",[15,46,47],{},"The interviews ran to 14 people: six doctors at Varha, all in senior surgical or clinical-information roles, and eight people across the national authorities (the ministry, the national health institute, the social-insurance institution, and the agency running the national data platform). I analysed them with reflexive thematic analysis, then used what I learned to design the process and the proof-of-concept.",[10,49,52,55,58,65],{"number":50,"title":51},"04","What I Found",[15,53,54],{},"Four requirements came up again and again: the systems need to integrate so the same thing isn't reported five times; classification needs to be standardised so data is comparable; AI could genuinely cut the reporting burden; and clinicians need feedback so reporting feels worth doing. The first three are technical. The fourth turned out to be the one that explains the others.",[15,56,57],{},"The sharpest recurring theme was that reporting feels futile. Clinicians are overworked and skip reporting unless an incident is clearly serious, partly because nothing visibly comes back when they do. You file into a system, and you never see it change anything, so the next time you don't bother. Underreporting isn't mostly a discipline problem; it's a feedback problem.",[59,60,62],"pull-quote",{"attribution":61},"Clinician, study interviews",[15,63,64],{},"THL collects adverse events in Hilmo, but I have never seen any feedback.",[38,66],{"aspect":40,"caption":67,"size":68,"src":69},"The synthesised findings: the four requirements (RQ1), the proof-of-concept's feasibility result (RQ2), and the national-level opportunities and barriers (RQ3). Underreporting and uncomparable data sit at the centre.","lg","\u002Fimg\u002Fwork\u002Fintelligent-incident-reporting\u002Fkey-findings.png",[10,71,74,79],{"number":72,"title":73},"05","Problem Definition",[15,75,76],{},[18,77,78],{},"Could a commercial LLM write the reports? Yes, and that turned out to be the simple bit. The harder, unfixed thing was the data beneath it: fragmented, uncomparable, and underreported.",[15,80,81],{},"This is the line the work lands on, in plain words: AI isn't the silver bullet here; a shift in the reporting culture is. A model that writes beautiful structured reports on top of data that is still entered five times, in five formats, into five systems that never aggregate, just produces nicer fragments. The real leverage is recording the event once, in the record where it already happens, auto-structuring it, and only then letting it flow somewhere it can be compared. The exact national report structure can't be settled by one project; that has to be defined with the authorities and clinicians, reusing international templates. But the shape of the fix was clear: change the process, and use the model inside it.",[10,83,86,109,127],{"number":84,"title":85},"06","Key Decisions",[87,88,90,97,103],"case-decision",{"title":89},"Human clinical evaluation, not automated text metrics",[15,91,92,96],{},[93,94,95],"strong",{},"Discarded:"," the convenient options. Automated NLP scores (BLEU, ROUGE, METEOR) measure textual overlap against a reference summary, not whether a clinical report is correct, and there is no established metric for structured output. Using another LLM as the judge is unreliable for factual medical knowledge.",[15,98,99,102],{},[93,100,101],{},"Chosen:"," a real clinician evaluating every generated report on a one-to-five scale across completeness, correctness, relevance, and coherence. The only evaluation that actually tells you if the output is clinically sound.",[15,104,105,108],{},[93,106,107],{},"Tradeoff:"," a single evaluator limits objectivity and reliability. I owned that as a known weakness rather than hiding behind a metric that looked rigorous but measured the wrong thing.",[87,110,112,117,122],{"title":111},"A general-purpose commercial model, run inside the county's walls",[15,113,114,116],{},[93,115,95],{}," building or fine-tuning a domain-specific or self-hosted model, which would have been a project in itself and was out of scope for a feasibility study.",[15,118,119,121],{},[93,120,101],{}," GPT-4o, at a low temperature (0.1) tuned for consistent clinical output, running inside Varha's own secure data environment so real patient records never left it. The pragmatic way to answer the actual question: can an off-the-shelf model do this at all?",[15,123,124,126],{},[93,125,107],{}," general-purpose models may not be the right long-term answer for healthcare. Whether a domain-specific or self-hosted open model would do better is a genuine open question I flagged for future work rather than pretended to have settled.",[87,128,130,135,140],{"title":129},"Design the process, don't just bolt on a tool",[15,131,132,134],{},[93,133,95],{}," treating the LLM as a standalone feature that summarises records on demand, which would have left the underlying duplicate-entry problem untouched.",[15,136,137,139],{},[93,138,101],{}," a single-entry reporting process with the model embedded in it and the clinician kept in the loop. Record the event once where care is documented, let the model generate a structured report, have the clinician review and sign it, then store it and send an anonymised copy onward.",[15,141,142,144],{},[93,143,107],{}," this only works if the national report structure gets defined centrally, which is beyond one project. But designing the process exposed exactly that dependency, which is more useful than a tool that hides it.",[10,146,149,152,156,159,164],{"number":147,"title":148},"07","Solution & Deliverables",[15,150,151],{},"The proof-of-concept ran on real, pseudonymised records inside Varha's environment: 40 patients, each with at least 10 free-text entries over five years, of which 21 had adverse-event codes and were used for the test. Per patient that was anywhere from 20,000 to over 700,000 characters of messy clinical text. The pipeline extracts the records flagged with adverse-event ICD-10 codes, combines them into one timestamped narrative per patient, and prompts GPT-4o with that text plus the required report structure, returning a structured incident report per distinct event.",[153,154],"poc-pipeline",{"caption":155},"The proof-of-concept pipeline: extract the adverse-event records (ICD-10 codes I26, J38.0, J93.8, L89, Y40–Y84, Y88), combine them into one timestamped narrative per patient, prompt GPT-4o with that text and the required report structure, and return one structured report per distinct event.",[15,157,158],{},"The designed artifact is the process the model belongs in: a single-entry reporting flow across three lanes (the clinician, the LLM, and the record system). The clinician records the event once as part of normal documentation; the system detects a likely incident; the model generates a prefilled, standardised report; the clinician reviews, edits if needed, and signs it; and the report is stored with the patient record while an anonymised copy is sent to a national database. Human oversight stays in; duplicate entry comes out.",[38,160],{"aspect":161,"caption":162,"size":68,"src":163},"2\u002F1","The single-entry process model (BPMN). Record once, auto-structure with the LLM, clinician reviews and signs, then store locally and send an anonymised copy to a national database. The fix is the process, not the model.","\u002Fimg\u002Fwork\u002Fintelligent-incident-reporting\u002Fprocess-model.png",[15,165,166],{},"What landed: a 113-page master's thesis; the single-entry process model; the working proof-of-concept and its clinical evaluation; and an international peer-reviewed publication of the interview insights and the PoC study.",[10,168,171,178,184],{"number":169,"title":170},"08","Outcomes",[172,173,175],"case-outcome",{"title":174},"The model held up clinically",[15,176,177],{},"Across the evaluation the generated reports scored a perfect five for coherence on every single one, with correctness and relevance consistently high and, notably, no hallucinations and no malformed structure. Completeness was the soft spot: on multi-faceted incidents the model sometimes missed details.",[172,179,181],{"title":180},"56 reports, 40 of them real",[15,182,183],{},"Across the 21 test patients the system generated 56 reports, of which 40 were genuinely treatment-related adverse events and 16 were not. So the quality of the structured writing was high; the real weakness was telling a care-related adverse event apart from other clinical issues. That looked like an artifact of asking the model to find many incidents at once, which the one-incident-at-a-time process is designed to avoid.",[172,185,187],{"title":186},"A validated concept that fed a national pilot",[15,188,189],{},"To the stakeholders' knowledge this was the first time a commercial LLM had been run on real patient data in this setting. It was published in an international peer-reviewed journal, and it fed directly into the Sitra-funded national pilot and architecture work that followed, and into the national vision for intelligent incident reporting. No national system has shipped: this is validated feasibility that set a direction, not a deployment, and I'm careful to call it that.",[10,191,194,197,200],{"number":192,"title":193},"09","Honest Reflection",[15,195,196],{},"The title is the reflection. The AI wasn't the hard part. The model did its job, and then the work argued that the model was the easy half: the bottleneck was the system around it, the fragmented data, the missing comparable structure, and a reporting culture where nothing comes back to the person who reported. That's still true for public healthcare, and it's the part I'd want anyone reading this to take away.",[15,198,199],{},"The honest limitations are real. I'd intended two or three clinicians to evaluate the output; delays meant a single evaluator, which is a genuine reliability caveat I own rather than gloss over. The research drew on one organisation and a narrow set of specialist surgeons, so the findings may not transfer cleanly to primary care or elsewhere. And a real share of the effort went not into the model but into territory nobody had charted, running a commercial LLM on real patient records under data-protection rules that partly pull against each other. None of that is a complaint; it's just where the difficulty actually lived.",[15,201,202],{},"It's also worth pinning to a date. This was 2024 into 2025, when doing this on real patient data was still close to uncharted, which is a big part of why the permissions and data-protection work above was so heavy. A couple of years on it already looks almost routine, and nobody would reach for GPT-4o for it now. The pace LLMs have moved at is genuinely wild, and a useful reminder that the model was always the part most likely to date, while the problem underneath it hasn't.",[204,205,207],"case-links",{"title":206},"Read more",[208,209,210,220,227],"ul",{},[211,212,213],"li",{},[214,215,219],"a",{"href":216,"rel":217},"https:\u002F\u002Furn.fi\u002FURN:NBN:fi:aalto-202503182886",[218],"nofollow","Master's thesis · Aalto University, 2025",[211,221,222],{},[214,223,226],{"href":224,"rel":225},"https:\u002F\u002Fpubmed.ncbi.nlm.nih.gov\u002F40588877\u002F",[218],"Peer-reviewed article · Studies in Health Technology and Informatics, 2025",[211,228,229],{},[214,230,233],{"href":231,"rel":232},"https:\u002F\u002Fasiakasjapotilasturvallisuuskeskus.fi\u002Fjusa-annevirran-blogikirjoitus-alykkaan-kansallisen-haitta-ja-vaaratapahtumien-raportointijarjestelman-kehittaminen-suomessa\u002F",[218],"Blog · Finnish Centre for Client and Patient Safety, 2025",[204,235,237],{"title":236},"Related work",[208,238,239,245],{},[211,240,241],{},[214,242,244],{"href":243},"\u002Fwork\u002Fpatient-safety-reform","Patient Safety Reform (the national reform this fed into)",[211,246,247],{},[214,248,250],{"href":249},"\u002Fwork\u002Fsurgical-complication-ai","Surgical Complication AI (the clinical pilot it seeded)",{"title":252,"searchDepth":253,"depth":253,"links":254},"",2,[],"Wellbeing Services County of Southwest Finland (Varha)","GPT-4o proof-of-concept, single-entry process model, clinical evaluation, peer-reviewed paper","A GPT-4o proof-of-concept turning free-text patient records into structured incident reports, for a Finnish healthcare authority. The model was the easy part.","Intelligent Incident Reporting",true,"\u002Fimg\u002Fwork\u002Fintelligent-incident-reporting\u002Fhero.png","full",null,{},"\u002Fwork\u002Fintelligent-incident-reporting","Researcher, designer & developer",{"title":5,"description":257},"work\u002Fintelligent-incident-reporting",[269,270,271,272],"Design Research","Healthcare","AI","LLM","Solo","2024–2025","Working solo inside a Finnish wellbeing services county, under the national patient-safety reform that funded it, I built a working proof-of-concept that prompts GPT-4o to turn free-text patient records into structured patient-safety incident reports. I ran the whole project end to end: research with clinicians and national authorities, a designed single-entry reporting process, and the model itself. In clinical evaluation it held up well, with no hallucinations. The harder finding was that the model was never the bottleneck: Finland's incident data is fragmented across a dozen systems and badly underreported, and a reporting culture where nothing comes back to the clinician is what actually needs fixing. The work was published in a peer-reviewed journal and fed directly into the Sitra-funded national pilot that followed.\n","2025","WkS8gWQrc2S4NeOEyRKc6upPH3YiMSyOup2fZdbN1ck",1780999146034]