A problem affecting 300 million surgeries per year
Every time a patient needs surgery, an anaesthetist assesses their health status and assigns a score: the ASA-PS classification (American Society of Anesthesiologists Physical Status). It's one of the most widely used scoring systems in medicine, in use for over 80 years.
The problem? Doctors disagree. Studies on hundreds of anaesthetists show the correct classification is assigned only 70% of the time. In a third of assessments, consensus isn’t even reached. A patient classified ASA 2 by one doctor may be classified ASA 3 by another — with real consequences for anaesthetic precautions, operating room preparation and post-operative management.
It’s not a competence issue: it’s a problem of inherent variability in a system based on subjective judgements.
The insight: AI reasons, it doesn’t guess
In 2024, HT-X started asking: can new-generation language models — those capable of structured reasoning (chain-of-thought) — do better?
Not better than the best specialists. Better than the average doctor, with a consistency no human can guarantee across thousands of assessments.
Answering this required scientific rigour, not a demo. It required validated data, a serious clinical partner, and a method publishable in a peer-reviewed journal.
The partner: Centro Ortopedico di Quadrante (Ramsay Santé)
HT-X collaborated with the Centro Ortopedico di Quadrante, part of the international Ramsay Santé group, one of Europe’s largest hospital groups. The clinical team — anaesthetists and hospital data scientists — worked with HT-X researchers to design a rigorous study.
The collaboration produced a scientific paper submitted to Informatics in Medicine Unlocked (Elsevier): “Improving ASA-PS Classification Accuracy Using Privacy-Preserving Large Language Models: A Multilingual On-Premise Evaluation”.
The study: 11 AI models, 20 clinical cases, 2 languages
The team tested 11 different AI models — from early ones (GPT-4, LLaMA, Mistral, Phi-4) to advanced reasoning models (GPT-o3, GPT-o4-mini, Claude Sonnet 3.7, Gemini 2.5, DeepSeek R1) — on 20 standardised clinical cases from the scientific literature.
Each case was evaluated in both English and Italian, to verify the AI works in the hospital’s language.
Results
| Metric | Human doctors | Early-gen LLMs | Reasoning LLMs |
|---|---|---|---|
| Mean accuracy | 7.7/10 (77%) | 7.7/10 (77%) | 9.75/10 (97.5%) |
| Errors per 10 cases | 2.3 | 2.3 | 0.25 |
| Error reduction | — | — | 89% |
Key figures:
- 97.5% accuracy for advanced models (95% CI: 92.9%–99.1%)
- 89% error reduction versus both doctors and early-generation models
- DeepSeek R1: perfect accuracy (10/10) with total reproducibility across repeated trials
- No difference between English and Italian evaluations
- Under 10 seconds per classification
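The headline numbers above are internally consistent: the per-10-case error counts follow from the mean accuracies, and the 89% reduction follows from the two error rates. A minimal Python sanity check:

```python
# Sanity check on the reported results: errors per 10 cases follow from
# mean accuracy, and the ~89% error reduction from the two error rates.
human_accuracy = 0.77       # 7.7/10 (same for early-generation LLMs)
reasoning_accuracy = 0.975  # 9.75/10

human_errors = 10 * (1 - human_accuracy)          # errors per 10 cases
reasoning_errors = 10 * (1 - reasoning_accuracy)  # errors per 10 cases

reduction = 1 - reasoning_errors / human_errors

print(f"{human_errors:.1f} vs {reasoning_errors:.2f} errors per 10 cases")
print(f"error reduction: {reduction:.0%}")  # → 89%
```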
The most relevant figure for a healthcare organisation: the difference between early and advanced models is statistically significant (p = 0.0008, Cohen’s d ≈ 1.21 — a “very large” effect).
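Cohen's d is simply the difference between two group means divided by their pooled standard deviation; values above 1.2 are conventionally called "very large". A minimal sketch of the computation, using hypothetical per-case scores for illustration (the paper's raw data is not reproduced here):

```python
import math

def cohens_d(a: list[float], b: list[float]) -> float:
    """Standardised mean difference: (mean_a - mean_b) / pooled SD."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Sample variances (n - 1 denominator), pooled across both groups.
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# Hypothetical scores out of 10, for illustration only.
reasoning = [10, 10, 9, 10]
early = [8, 7, 8, 8]
print(round(cohens_d(reasoning, early), 2))  # → 4.0
```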
Why on-premise and not ChatGPT
One of the paper’s central aspects — and of the KOI product that derives from it — is the choice of on-premise AI.
38% of LLM studies in healthcare don’t even address patient data privacy. HT-X made it central:
- DeepSeek R1 runs on EU cloud: no patient data leaves Europe
- GDPR and healthcare regulation compliant by design
- AI Act: the system is currently Research Use Only and undergoing medical device certification, with complete audit trail and human oversight
- Identical performance to cloud models: DeepSeek R1 (on-premise) achieves the same 10/10 as GPT-o3 and Claude Sonnet (cloud)
Using ChatGPT to classify patients would mean sending medical histories, diagnoses and clinical data to OpenAI’s servers. For a European hospital, that’s not an option.
From paper to product: how KOI was born
The scientific study wasn’t meant to stay in a journal. It’s the foundation on which HT-X built KOI, a clinical decision support system for anaesthesia classification.
The journey from problem to product:
1. Identifying the clinical need → Variability in ASA-PS classification has been documented for decades. Guidelines aren’t lacking — consistency in applying them is.
2. Rigorous scientific research → Benchmarks on standardised cases from literature, comparison with published human data, complete statistical analysis, peer review.
3. Technology choice → Open-source models (DeepSeek R1) installable on-premise, no cloud provider dependency, PRISMA infrastructure (Private Intelligence Stack for Modular AI).
4. Multilingual validation → AI must work in the hospital’s language. Italian results are identical to English ones.
5. Regulatory pathway → Medical device certification (MDR, IEC 62304). The system is a support tool: the anaesthetist decides.
6. Clinical deployment → On-premise installation in hospital infrastructure, integration with existing information systems.
What this means for healthcare organisations
This case demonstrates an approach HT-X applies systematically:
- Start from a real problem — not from technology
- Validate scientifically — with publishable studies, not demos
- Build on-premise — because healthcare data cannot leave the organisation
- Certify — because software touching clinical decisions is a medical device
If your facility has clinical processes where inter-operator variability is a known problem — classifications, triage, report interpretation — the approach is the same: start from data, validate rigorously, deploy with privacy.
The paper
The study “Improving ASA-PS Classification Accuracy Using Privacy-Preserving Large Language Models: A Multilingual On-Premise Evaluation” has been submitted for peer review to Informatics in Medicine Unlocked (Elsevier). Authors: Francesco Menegoni (HT-X), Claudio Trotti, Maria Beatrice Pagani, Paola Pisano.
For information about KOI or to assess AI opportunities in your healthcare facility, contact HT-X.
Frequently asked questions
Can't we just use ChatGPT for ASA-PS classification?
Early models like GPT-4 achieve about 77% accuracy — the same level as human doctors. But the real problem isn't accuracy: it's that ChatGPT sends patient clinical data to OpenAI's servers in the USA, conflicting with GDPR and European healthcare data-protection rules. HT-X's KOI uses on-premise AI models (like DeepSeek R1) achieving 97.5% accuracy without any data leaving the hospital.
What is ASA-PS classification and why does it matter?
ASA-PS (American Society of Anesthesiologists Physical Status) classification is the global standard for preoperative risk assessment. It ranges from ASA 1 (healthy patient) to ASA 5 (moribund patient). It's critical because it determines anaesthetic precautions, yet doctors agree on the correct class only 70% of the time — a problem AI can help address.
Is KOI a certified medical device?
KOI is undergoing certification as a medical device under the European MDR regulation and the IEC 62304 standard for medical software. The system is designed as a decision-support tool: the final classification remains the anaesthetist's responsibility. The scientific study has been submitted for peer review to Informatics in Medicine Unlocked (Elsevier).
Looking for a private ChatGPT for your business?
ORCA is the on-premise AI platform by HT-X (Human Technology eXcellence): your data stays yours, GDPR and AI Act compliant.
Discover ORCA