🌞 Intro

AnesBench is designed to assess anesthesiology-related reasoning capabilities of Large Language Models (LLMs). It contains 4,427 anesthesiology questions in English. Each question is labeled with a three-level categorization of cognitive demands and includes Chinese-English translations, enabling evaluation of LLMs’ knowledge, application, and clinical reasoning abilities across diverse linguistic contexts.

For dataset access, please refer to our Hugging Face repository: AnesBench, AnesQA and AnesCorpus.

For the overview of the dataset, including usage examples and code, please refer to the AnesBench GitHub repository.

🔍 Overview

⭐ Citation

If you find AnesBench helpful, please consider giving this repo a ⭐ and citing:

@article{AnesBench,
  title={AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology},
  author={Xiang Feng and Wentao Jiang and Zengmao Wang and Yong Luo and Pingbo Xu and Baosheng Yu and Hua Jin and Bo Du and Jing Zhang},
  journal={arXiv preprint arXiv:2504.02404},
  year={2025}
}