Research

DR. INFO leads on OpenAI's HealthBench Hard Subset (1000 Questions)

October 2, 2025
6 min read

Modern clinical AI systems must be evaluated under conditions that mirror real care, where ambiguity, incomplete information, and patient safety are central. Traditional medical benchmarks focus on knowledge recall, but real patient interactions require contextual judgment, navigation of uncertainty, and communication that aligns with clinical practice. Prior research has shown that conventional testing frameworks can significantly overestimate model readiness for real clinical workflows.

Why HealthBench Matters

OpenAI introduced HealthBench to measure model behavior in realistic medical conversations. The benchmark contains 5,000 clinically grounded scenarios, each graded against physician-written rubrics that evaluate:

  • Clinical accuracy
  • Completeness and reasoning depth
  • Context seeking and safety alignment
  • Communication clarity
  • Instruction compliance

The Hard subset includes 1,000 complex, risk-sensitive cases such as emergency triage, questions with limited patient history, and constraints found in global care settings.
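
To make the rubric-grading idea concrete, the sketch below shows one way physician-written criteria can be aggregated into a per-scenario score: each criterion carries a point value, a grader judges whether the response meets it, and earned points are normalized by the maximum achievable. The criterion texts, point values, and the `score_scenario` helper are illustrative assumptions, not HealthBench's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One physician-written criterion, e.g. 'advises emergency care for chest pain'."""
    description: str
    points: int   # positive = desirable behavior, negative = penalized behavior
    met: bool     # whether a grader judged the response to satisfy the criterion

def score_scenario(criteria: list[RubricCriterion]) -> float:
    """Aggregate graded criteria into a 0..1 score for one conversation.

    Points earned from met criteria are divided by the maximum achievable
    (the sum of all positive point values), then clipped to [0, 1].
    """
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c in criteria if c.met)
    return min(max(earned / max_points, 0.0), 1.0)

# Hypothetical emergency-triage scenario with three criteria.
example = [
    RubricCriterion("Recommends immediate emergency evaluation", points=7, met=True),
    RubricCriterion("Asks about symptom onset and severity", points=3, met=True),
    RubricCriterion("Suggests unsafe home management", points=-6, met=False),
]
print(f"Scenario score: {score_scenario(example):.2f}")  # -> 1.00
```

A benchmark-level result is then simply the mean of these per-scenario scores across the evaluated set.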

DR.INFO Results

  • HealthBench Hard score: 0.51 (1,000 test cases)
  • Representative test score: 0.54 (100-case sample)

DR.INFO achieved a 0.51 HealthBench score on the Hard set (1000 cases), outperforming reported scores of leading frontier models. On a representative 100-case test, DR.INFO scored 0.54, exceeding OpenEvidence and Pathway across communication, accuracy, completeness, context awareness, and instruction adherence.

These results reinforce the value of medical retrieval-augmented architectures, echoing findings that domain-aligned systems outperform general-purpose models in clinical reasoning and safety-critical settings.
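
For readers unfamiliar with the pattern, here is a minimal, hypothetical sketch of retrieval-augmented answering: retrieve evidence passages from a medical corpus, then condition the generated answer on them. The `search_medical_corpus` and `generate_answer` callables are placeholders for whatever retriever and language model a given system uses; this is not DR.INFO's actual pipeline.

```python
from typing import Callable

def answer_with_evidence(
    question: str,
    search_medical_corpus: Callable[[str, int], list[str]],
    generate_answer: Callable[[str], str],
    top_k: int = 5,
) -> str:
    """Minimal retrieval-augmented answering loop (illustrative only).

    1. Retrieve the top_k passages most relevant to the clinical question.
    2. Build a prompt that grounds the model in the retrieved evidence.
    3. Ask the generator to answer from that evidence, citing passages.
    """
    passages = search_medical_corpus(question, top_k)
    evidence_block = "\n\n".join(
        f"[{i + 1}] {passage}" for i, passage in enumerate(passages)
    )
    prompt = (
        "Answer the clinical question using only the evidence below. "
        "Cite passage numbers and state uncertainty when evidence is weak.\n\n"
        f"Evidence:\n{evidence_block}\n\nQuestion: {question}"
    )
    return generate_answer(prompt)
```

Grounding the generation step in retrieved, citable evidence is one plausible reason domain-aligned systems behave more reliably on ambiguous cases than general models answering from parametric memory alone.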

Why USMLE Is Not Enough

The United States Medical Licensing Examination (USMLE) remains a strong indicator of foundational medical knowledge, but it is primarily a structured exam built around controlled, multiple-choice scenarios. These formats measure biomedical knowledge and diagnostic reasoning but do not reflect real clinical conversation patterns.

USMLE-type evaluation falls short because it does not test:

  • Behavior under uncertainty
  • Multi-turn clinical dialogue
  • Context seeking when patient information is incomplete
  • Safety signaling and escalation decisions
  • Variable resource environments
  • Patient-appropriate communication

Research has shown that high exam-style performance does not guarantee safe decision making or conversational reliability in open-ended, real-patient settings. By contrast, HealthBench captures the behavioral and safety competencies required for real deployment in healthcare environments.

Conclusion

DR.INFO demonstrates advanced clinical reasoning and safety-aligned conversational behavior in complex settings. While its 0.51 HealthBench score indicates meaningful progress, it also highlights the importance of continued development. Behavior-level benchmarks like HealthBench provide a more realistic pathway for validating trustworthy medical AI than knowledge recall exams alone.

For a comprehensive analysis of our methodology, evaluation framework, and detailed results, read our full research paper: DR.INFO: Clinical Reasoning Capabilities of LLMs through Evidence Grounded Medical Search.
