Effective reasoning is essential for large language models (LLMs) to perform well on complex tasks and make interpretable decisions, especially in clinical applications. Yet it remains challenging to elicit without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.
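As a concrete illustration, here is a minimal sketch of what such a rule-based reward could look like for multiple-choice QA. The function name and the exact answer-extraction pattern are our assumptions for exposition, not the paper's verbatim implementation:

```python
import re

def rule_based_reward(response: str, gold_choice: str) -> float:
    """Illustrative minimalist rule-based reward for multiple-choice QA.

    The reward depends only on whether the model's final answer letter
    matches the gold choice; no CoT annotation is ever inspected.
    """
    # Extract the final answer letter, e.g. from "... the answer is (B)".
    match = re.search(r"answer is\s*\(?([A-E])\)?", response, re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).upper() == gold_choice.upper() else 0.0
```

Because the reward checks only the final choice, any intermediate reasoning the model produces is shaped entirely by the RL signal rather than by distilled CoT supervision.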
With only 1,200 samples from each subset and minimalist RL training, AlphaMed already outperforms strong baselines trained with distilled CoT data. Interestingly, we observe distinct training dynamics across different training sets, which we find are closely linked to dataset informativeness. In particular, datasets with longer questions tend to provide richer supervision signals, resulting in more stable training and delayed saturation, thus benefiting RL training.
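One simple proxy for this notion of informativeness (our own illustrative heuristic, not the paper's exact metric) is the average question length of a subset, which could be used to rank candidate subsets before sampling:

```python
def avg_question_length(questions: list[str]) -> float:
    """Average whitespace-token length of a QA subset; a crude proxy
    for how much supervision signal each question carries."""
    return sum(len(q.split()) for q in questions) / max(len(questions), 1)

# Hypothetical usage: rank subsets by the proxy before drawing 1,200 samples each.
subsets = {
    "medqa": ["A 45-year-old man presents with chest pain and ..."],
    "pubmedqa": ["Does aspirin reduce cardiovascular risk in ..."],
}
ranked = sorted(subsets, key=lambda k: avg_question_length(subsets[k]), reverse=True)
```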
We train AlphaMed purely with multiple-choice QA supervision, without any chain-of-thought annotations, to encourage the emergence of structured reasoning. Intriguingly, as illustrated in the figure, AlphaMed performs step-by-step clinical calculations and even aligns its answer with quantitative risk metrics. This showcases a form of emergent deliberate reasoning despite the absence of explicit reasoning supervision.
AlphaMed advances state-of-the-art performance across six medical QA benchmarks, including both in-domain tasks like MedQA and PubMedQA, and out-of-domain challenges such as MMLU-ProM, GPQA-M, and MedXpert. Remarkably, AlphaMed(8B) not only outperforms all models under 10B but also surpasses much larger models such as QwQ-32B. Meanwhile, AlphaMed(70B) sets a new open-source record, outperforming proprietary models like GPT-4o and much larger baselines such as DeepSeek-V3 (671B). Without using any distilled chain-of-thought data, AlphaMed demonstrates strong reasoning capabilities purely through minimalist reinforcement learning.
Explore our paper for more details of our approach, our analysis of learned rethinking behaviors, and insights into medical LLM training!
@article{alphamed,
  title={Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL},
  author={Che Liu and Haozhe Wang and Jiazhen Pan and Zhongwei Wan and Yong Dai and Fangzhen Lin and Wenjia Bai and Daniel Rueckert and Rossella Arcucci},
  journal={arXiv preprint arXiv:2505.17952},
  year={2025}
}