LeaderBoard - AnesBench

The following leaderboard presents the performance of over fifty models on AnesBench and AMCQA. "Average" is the mean of each model's overall scores on the two benchmarks.

| LLM | AnesBench System 1 | AnesBench System 1.x | AnesBench System 2 | AnesBench Overall | AMCQA System 1 | AMCQA System 1.x | AMCQA System 2 | AMCQA Overall | Average |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-0.6B | 0.39 | 0.31 | 0.26 | 0.36 | 0.37 | 0.35 | 0.31 | 0.37 | 0.36 |
| gemma-3-1b-it | 0.33 | 0.26 | 0.21 | 0.30 | 0.26 | 0.25 | 0.21 | 0.26 | 0.28 |
| DeepSeek-R1-Distill-Qwen-1.5B | 0.32 | 0.28 | 0.27 | 0.31 | 0.26 | 0.26 | 0.22 | 0.26 | 0.28 |
| Qwen3-1.7B | 0.48 | 0.37 | 0.30 | 0.44 | 0.54 | 0.50 | 0.44 | 0.53 | 0.48 |
| Qwen3-4B | 0.60 | 0.46 | 0.34 | 0.54 | 0.54 | 0.48 | 0.48 | 0.53 | 0.53 |
| gemma-3-4b-it | 0.46 | 0.36 | 0.35 | 0.42 | 0.43 | 0.41 | 0.37 | 0.43 | 0.42 |
| chatglm3-6b | 0.37 | 0.28 | 0.25 | 0.34 | 0.36 | 0.36 | 0.34 | 0.36 | 0.35 |
| Qwen2.5-7B-Instruct | 0.56 | 0.44 | 0.36 | 0.52 | 0.68 | 0.63 | 0.58 | 0.67 | 0.59 |
| HuatuoGPT-o1-7B | 0.56 | 0.45 | 0.38 | 0.52 | 0.71 | 0.65 | 0.63 | 0.70 | 0.61 |
| Baichuan2-7B-Chat | 0.39 | 0.31 | 0.30 | 0.37 | 0.44 | 0.41 | 0.41 | 0.43 | 0.40 |
| BioMistral-7B | 0.43 | 0.30 | 0.32 | 0.39 | 0.26 | 0.25 | 0.28 | 0.26 | 0.32 |
| DeepSeek-R1-Distill-Qwen-7B | 0.40 | 0.34 | 0.28 | 0.38 | 0.33 | 0.32 | 0.34 | 0.33 | 0.35 |
| Qwen3-8B | 0.65 | 0.50 | 0.40 | 0.60 | 0.68 | 0.53 | 0.50 | 0.65 | 0.62 |
| Meta-Llama-3-8B-Instruct | 0.54 | 0.42 | 0.39 | 0.50 | 0.49 | 0.47 | 0.47 | 0.49 | 0.49 |
| Llama-3.1-8B-Instruct | 0.58 | 0.45 | 0.36 | 0.53 | 0.53 | 0.53 | 0.55 | 0.53 | 0.53 |
| Llama-3.1-8B-UltraMedical | 0.63 | 0.47 | 0.41 | 0.57 | 0.54 | 0.52 | 0.50 | 0.54 | 0.56 |
| HuatuoGPT-o1-8B | 0.58 | 0.46 | 0.39 | 0.53 | 0.57 | 0.53 | 0.57 | 0.56 | 0.55 |
| Llama3-OpenBioLLM-8B | 0.44 | 0.35 | 0.30 | 0.41 | 0.26 | 0.25 | 0.19 | 0.25 | 0.33 |
| FineMedLM | 0.40 | 0.35 | 0.27 | 0.38 | 0.30 | 0.34 | 0.38 | 0.31 | 0.34 |
| FineMedLM-o1 | 0.43 | 0.34 | 0.26 | 0.39 | 0.34 | 0.38 | 0.40 | 0.35 | 0.37 |
| Bio-Medical-Llama-3-8B | 0.53 | 0.41 | 0.38 | 0.49 | 0.48 | 0.47 | 0.49 | 0.48 | 0.48 |
| internlm3-8b-instruct | 0.60 | 0.43 | 0.40 | 0.54 | 0.85 | 0.76 | 0.77 | 0.84 | 0.69 |
| glm-4-9b-chat | 0.48 | 0.36 | 0.36 | 0.44 | 0.61 | 0.60 | 0.56 | 0.61 | 0.53 |
| gemma-2-9b-it | 0.53 | 0.40 | 0.36 | 0.49 | 0.54 | 0.49 | 0.41 | 0.52 | 0.51 |
| gemma-3-12b-it | 0.56 | 0.46 | 0.36 | 0.52 | 0.59 | 0.55 | 0.51 | 0.58 | 0.55 |
| Baichuan2-13B-Chat | 0.42 | 0.31 | 0.34 | 0.39 | 0.48 | 0.47 | 0.46 | 0.48 | 0.43 |
| phi-4 | 0.69 | 0.57 | 0.41 | 0.64 | 0.57 | 0.57 | 0.56 | 0.57 | 0.60 |
| DeepSeek-R1-Distill-Qwen-14B | 0.64 | 0.51 | 0.40 | 0.59 | 0.62 | 0.66 | 0.61 | 0.63 | 0.61 |
| Qwen2.5-14B-Instruct | 0.61 | 0.52 | 0.41 | 0.57 | 0.74 | 0.70 | 0.62 | 0.73 | 0.65 |
| Qwen3-14B | 0.70 | 0.57 | 0.45 | 0.65 | 0.77 | 0.72 | 0.68 | 0.76 | 0.70 |
| gemma-3-27b-it | 0.63 | 0.52 | 0.40 | 0.58 | 0.65 | 0.62 | 0.62 | 0.65 | 0.61 |
| gemma-2-27b-it | 0.60 | 0.43 | 0.36 | 0.54 | 0.57 | 0.52 | 0.48 | 0.56 | 0.55 |
| Qwen3-30B-A3B | 0.73 | 0.60 | 0.48 | 0.68 | 0.73 | 0.71 | 0.70 | 0.73 | 0.70 |
| Qwen3-32B | 0.72 | 0.64 | 0.48 | 0.68 | 0.80 | 0.77 | 0.73 | 0.80 | 0.74 |
| DeepSeek-R1-Distill-Qwen-32B | 0.67 | 0.56 | 0.45 | 0.63 | 0.66 | 0.71 | 0.65 | 0.67 | 0.65 |
| Qwen2.5-32B-Instruct | 0.65 | 0.55 | 0.44 | 0.61 | 0.77 | 0.73 | 0.69 | 0.76 | 0.68 |
| QwQ-32B-Preview | 0.69 | 0.58 | 0.44 | 0.64 | 0.74 | 0.70 | 0.68 | 0.73 | 0.68 |
| Yi-1.5-34B-Chat | 0.54 | 0.44 | 0.35 | 0.50 | 0.65 | 0.64 | 0.64 | 0.65 | 0.57 |
| Llama-3.3-70B-Instruct | 0.74 | 0.63 | 0.51 | 0.70 | 0.69 | 0.66 | 0.63 | 0.68 | 0.69 |
| Llama-3-70B-UltraMedical | 0.73 | 0.60 | 0.47 | 0.68 | 0.72 | 0.68 | 0.62 | 0.71 | 0.69 |
| Llama3-OpenBioLLM-70B | 0.68 | 0.55 | 0.44 | 0.63 | 0.65 | 0.60 | 0.60 | 0.64 | 0.64 |
| Citrus1.0-llama-70B | 0.71 | 0.60 | 0.52 | 0.67 | 0.71 | 0.69 | 0.67 | 0.71 | 0.69 |
| HuatuoGPT-o1-70B | 0.70 | 0.58 | 0.48 | 0.65 | 0.70 | 0.70 | 0.64 | 0.70 | 0.68 |
| DeepSeek-R1-Distill-Llama-70B | 0.77 | 0.68 | 0.56 | 0.73 | 0.64 | 0.64 | 0.59 | 0.64 | 0.68 |
| HuatuoGPT-o1-72B | 0.71 | 0.61 | 0.48 | 0.67 | 0.82 | 0.78 | 0.78 | 0.81 | 0.74 |
| Qwen2.5-72B-Instruct | 0.72 | 0.60 | 0.48 | 0.67 | 0.82 | 0.77 | 0.76 | 0.81 | 0.74 |
| Qwen3-235B-A22B | 0.78 | 0.67 | 0.57 | 0.74 | 0.76 | 0.73 | 0.69 | 0.75 | 0.74 |
| Llama-4-Scout-17B-16E-Instruct | 0.77 | 0.66 | 0.55 | 0.72 | 0.80 | 0.73 | 0.68 | 0.78 | 0.75 |
| deepseek-v3 | 0.77 | 0.69 | 0.55 | 0.73 | 0.79 | 0.77 | 0.70 | 0.78 | 0.76 |
| deepseek-r1 | 0.85 | 0.78 | 0.70 | 0.82 | 0.88 | 0.85 | 0.81 | 0.87 | 0.85 |
| gpt-4o | 0.81 | 0.72 | 0.59 | 0.77 | 0.78 | 0.77 | 0.68 | 0.78 | 0.77 |
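Since "Average" is described as the mean score across AnesBench and AMCQA, it can be read as the unweighted mean of the two overall columns. The sketch below (not the authors' evaluation code) shows that computation; note the published figures appear to be rounded from unrounded scores, so recomputing from the two-decimal table values can differ from the printed Average by about ±0.01.

```python
# Minimal sketch, assuming "Average" = unweighted mean of the two
# benchmark-level "Overall" scores. This is an illustration, not the
# leaderboard's actual scoring script.

def average_score(anesbench_overall: float, amcqa_overall: float) -> float:
    """Unweighted mean of the AnesBench and AMCQA overall scores."""
    return (anesbench_overall + amcqa_overall) / 2

# e.g. Qwen3-0.6B from the table above (overall 0.36 and 0.37):
print(average_score(0.36, 0.37))
```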