Leaderboard - AnesBench
The following leaderboard presents the performance of over fifty models on AnesBench and AMCQA, broken down by reasoning level (system1, system1.x, system2) alongside each benchmark's overall score. The "Average" column is the mean of a model's AnesBench overall and AMCQA overall scores; a short sketch of this computation follows the table.
LLM | AnesBench system1 | AnesBench system1.x | AnesBench system2 | AnesBench overall | AMCQA system1 | AMCQA system1.x | AMCQA system2 | AMCQA overall | Average |
---|---|---|---|---|---|---|---|---|---|
Qwen3-0.6B | 0.39 | 0.31 | 0.26 | 0.36 | 0.37 | 0.35 | 0.31 | 0.37 | 0.36 |
gemma-3-1b-it | 0.33 | 0.26 | 0.21 | 0.30 | 0.26 | 0.25 | 0.21 | 0.26 | 0.28 |
DeepSeek-R1-Distill-Qwen-1.5B | 0.32 | 0.28 | 0.27 | 0.31 | 0.26 | 0.26 | 0.22 | 0.26 | 0.28 |
Qwen3-1.7B | 0.48 | 0.37 | 0.30 | 0.44 | 0.54 | 0.50 | 0.44 | 0.53 | 0.48 |
Qwen3-4B | 0.60 | 0.46 | 0.34 | 0.54 | 0.54 | 0.48 | 0.48 | 0.53 | 0.53 |
gemma-3-4b-it | 0.46 | 0.36 | 0.35 | 0.42 | 0.43 | 0.41 | 0.37 | 0.43 | 0.42 |
chatglm3-6b | 0.37 | 0.28 | 0.25 | 0.34 | 0.36 | 0.36 | 0.34 | 0.36 | 0.35 |
Qwen2.5-7B-Instruct | 0.56 | 0.44 | 0.36 | 0.52 | 0.68 | 0.63 | 0.58 | 0.67 | 0.59 |
HuatuoGPT-o1-7B | 0.56 | 0.45 | 0.38 | 0.52 | 0.71 | 0.65 | 0.63 | 0.70 | 0.61 |
Baichuan2-7B-Chat | 0.39 | 0.31 | 0.30 | 0.37 | 0.44 | 0.41 | 0.41 | 0.43 | 0.40 |
BioMistral-7B | 0.43 | 0.30 | 0.32 | 0.39 | 0.26 | 0.25 | 0.28 | 0.26 | 0.32 |
DeepSeek-R1-Distill-Qwen-7B | 0.40 | 0.34 | 0.28 | 0.38 | 0.33 | 0.32 | 0.34 | 0.33 | 0.35 |
Qwen3-8B | 0.65 | 0.50 | 0.40 | 0.60 | 0.68 | 0.53 | 0.50 | 0.65 | 0.62 |
Meta-Llama-3-8B-Instruct | 0.54 | 0.42 | 0.39 | 0.50 | 0.49 | 0.47 | 0.47 | 0.49 | 0.49 |
Llama-3.1-8B-Instruct | 0.58 | 0.45 | 0.36 | 0.53 | 0.53 | 0.53 | 0.55 | 0.53 | 0.53 |
Llama-3.1-8B-UltraMedical | 0.63 | 0.47 | 0.41 | 0.57 | 0.54 | 0.52 | 0.50 | 0.54 | 0.56 |
HuatuoGPT-o1-8B | 0.58 | 0.46 | 0.39 | 0.53 | 0.57 | 0.53 | 0.57 | 0.56 | 0.55 |
Llama3-OpenBioLLM-8B | 0.44 | 0.35 | 0.30 | 0.41 | 0.26 | 0.25 | 0.19 | 0.25 | 0.33 |
FineMedLM | 0.40 | 0.35 | 0.27 | 0.38 | 0.30 | 0.34 | 0.38 | 0.31 | 0.34 |
FineMedLM-o1 | 0.43 | 0.34 | 0.26 | 0.39 | 0.34 | 0.38 | 0.40 | 0.35 | 0.37 |
Bio-Medical-Llama-3-8B | 0.53 | 0.41 | 0.38 | 0.49 | 0.48 | 0.47 | 0.49 | 0.48 | 0.48 |
internlm3-8b-instruct | 0.60 | 0.43 | 0.40 | 0.54 | 0.85 | 0.76 | 0.77 | 0.84 | 0.69 |
glm-4-9b-chat | 0.48 | 0.36 | 0.36 | 0.44 | 0.61 | 0.60 | 0.56 | 0.61 | 0.53 |
gemma-2-9b-it | 0.53 | 0.40 | 0.36 | 0.49 | 0.54 | 0.49 | 0.41 | 0.52 | 0.51 |
gemma-3-12b-it | 0.56 | 0.46 | 0.36 | 0.52 | 0.59 | 0.55 | 0.51 | 0.58 | 0.55 |
Baichuan2-13B-Chat | 0.42 | 0.31 | 0.34 | 0.39 | 0.48 | 0.47 | 0.46 | 0.48 | 0.43 |
phi-4 | 0.69 | 0.57 | 0.41 | 0.64 | 0.57 | 0.57 | 0.56 | 0.57 | 0.60 |
DeepSeek-R1-Distill-Qwen-14B | 0.64 | 0.51 | 0.40 | 0.59 | 0.62 | 0.66 | 0.61 | 0.63 | 0.61 |
Qwen2.5-14B-Instruct | 0.61 | 0.52 | 0.41 | 0.57 | 0.74 | 0.70 | 0.62 | 0.73 | 0.65 |
Qwen3-14B | 0.70 | 0.57 | 0.45 | 0.65 | 0.77 | 0.72 | 0.68 | 0.76 | 0.70 |
gemma-3-27b-it | 0.63 | 0.52 | 0.40 | 0.58 | 0.65 | 0.62 | 0.62 | 0.65 | 0.61 |
gemma-2-27b-it | 0.60 | 0.43 | 0.36 | 0.54 | 0.57 | 0.52 | 0.48 | 0.56 | 0.55 |
Qwen3-30B-A3B | 0.73 | 0.60 | 0.48 | 0.68 | 0.73 | 0.71 | 0.70 | 0.73 | 0.70 |
Qwen3-32B | 0.72 | 0.64 | 0.48 | 0.68 | 0.80 | 0.77 | 0.73 | 0.80 | 0.74 |
DeepSeek-R1-Distill-Qwen-32B | 0.67 | 0.56 | 0.45 | 0.63 | 0.66 | 0.71 | 0.65 | 0.67 | 0.65 |
Qwen2.5-32B-Instruct | 0.65 | 0.55 | 0.44 | 0.61 | 0.77 | 0.73 | 0.69 | 0.76 | 0.68 |
QwQ-32B-Preview | 0.69 | 0.58 | 0.44 | 0.64 | 0.74 | 0.70 | 0.68 | 0.73 | 0.68 |
Yi-1.5-34B-Chat | 0.54 | 0.44 | 0.35 | 0.50 | 0.65 | 0.64 | 0.64 | 0.65 | 0.57 |
Llama-3.3-70B-Instruct | 0.74 | 0.63 | 0.51 | 0.70 | 0.69 | 0.66 | 0.63 | 0.68 | 0.69 |
Llama-3-70B-UltraMedical | 0.73 | 0.60 | 0.47 | 0.68 | 0.72 | 0.68 | 0.62 | 0.71 | 0.69 |
Llama3-OpenBioLLM-70B | 0.68 | 0.55 | 0.44 | 0.63 | 0.65 | 0.60 | 0.60 | 0.64 | 0.64 |
Citrus1.0-llama-70B | 0.71 | 0.60 | 0.52 | 0.67 | 0.71 | 0.69 | 0.67 | 0.71 | 0.69 |
HuatuoGPT-o1-70B | 0.70 | 0.58 | 0.48 | 0.65 | 0.70 | 0.70 | 0.64 | 0.70 | 0.68 |
DeepSeek-R1-Distill-Llama-70B | 0.77 | 0.68 | 0.56 | 0.73 | 0.64 | 0.64 | 0.59 | 0.64 | 0.68 |
HuatuoGPT-o1-72B | 0.71 | 0.61 | 0.48 | 0.67 | 0.82 | 0.78 | 0.78 | 0.81 | 0.74 |
Qwen2.5-72B-Instruct | 0.72 | 0.60 | 0.48 | 0.67 | 0.82 | 0.77 | 0.76 | 0.81 | 0.74 |
Qwen3-235B-A22B | 0.78 | 0.67 | 0.57 | 0.74 | 0.76 | 0.73 | 0.69 | 0.75 | 0.74 |
Llama-4-Scout-17B-16E-Instruct | 0.77 | 0.66 | 0.55 | 0.72 | 0.80 | 0.73 | 0.68 | 0.78 | 0.75 |
deepseek-v3 | 0.77 | 0.69 | 0.55 | 0.73 | 0.79 | 0.77 | 0.70 | 0.78 | 0.76 |
deepseek-r1 | 0.85 | 0.78 | 0.70 | 0.82 | 0.88 | 0.85 | 0.81 | 0.87 | 0.85 |
gpt-4o | 0.81 | 0.72 | 0.59 | 0.77 | 0.78 | 0.77 | 0.68 | 0.78 | 0.77 |
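
Below is a minimal sketch of how the "Average" column appears to be computed, assuming it is the plain arithmetic mean of a model's AnesBench overall and AMCQA overall scores rounded to two decimals (the table rows are consistent with this reading). The function name `leaderboard_average` is our own label for illustration, not from the benchmark's code.

```python
def leaderboard_average(anesbench_overall: float, amcqa_overall: float) -> float:
    """Mean of the two overall scores, rounded to two decimal places."""
    return round((anesbench_overall + amcqa_overall) / 2, 2)

# Spot-checks against rows of the table above:
assert leaderboard_average(0.52, 0.70) == 0.61  # HuatuoGPT-o1-7B
assert leaderboard_average(0.68, 0.80) == 0.74  # Qwen3-32B
```

Note that for rows whose two overall scores average to an exact midpoint (e.g. 0.365), the displayed value depends on the rounding convention used; the table is consistent with round-half-to-even.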