LLM Benchmark Arena

Rankings of LLMs by mathematical reasoning and Traditional Chinese understanding.

Last Updated: Aug 16, 2025
Total Models: 23
Active Models: 23
Top Score: 88.55%
Average: 52.89%

| Rank | Model | Source / Notes | Score | Category | Hardware |
|---|---|---|---|---|---|
| 1 | Gemma3-27B-IT-bnb-nf4 (huggingface) | Official Hugging Face model repository | 88.55% | gemma3 | RTX4090 |
| 2 | Gemma3-27B-IT-QAT-Q4_0 (ollama) | Official Ollama repository | 85.90% | gemma3 | RTX4090 |
| 3 | Gemma3-12B-IT-BF16 (huggingface) | Official Hugging Face model repository | 85.75% | gemma3 | RTX5090 |
| 4 | Gemma3-12B-IT-FP16 (ollama) | Official Ollama repository | 82.71% | gemma3 | RTX4090 |
| 5 | Gemma3n:E4B-IT-FP16 (ollama) | Official Ollama repository | 77.26% | gemma3n | RTX4090 |
| 6 | Gemma3-27B-PT-bnb-nf4 (huggingface) | Official Hugging Face model repository | 77.03% | gemma3 | RTX4090 |
| 7 | Gemma3-4B-IT-BF16 (huggingface) | Official Hugging Face model repository | 74.45% | gemma3 | RTX4090 |
| 8 | Llama-3-Taiwan-8B-Instruct-BF16 (huggingface) | Official Hugging Face model repository | 73.77% | llama3 | RTX4090 |
| 9 | Gemma3-12B-PT-BF16 (huggingface) | Official Hugging Face model repository | 71.34% | gemma3 | RTX5090 |
| 10 | Gemma3n:E4B-IT-BF16 (huggingface) | Official Hugging Face model repository | 71.11% | gemma3n | RTX4090 |
| 11 | Gemma3n:E2B-IT-FP16 (ollama) | Official Ollama repository | 69.07% | gemma3n | RTX4090 |
| 12 | Gemma3-12B-PT-bnb-nf4 (huggingface) | Official Hugging Face model repository | 66.87% | gemma3 | RTX5090 |
| 13 | Gemma3-4B-IT-FP16 (ollama) | Official Ollama repository | 61.26% | gemma3 | RTX4090 |
| 14 | GPT-OSS-20B-MXFP4 (llama.cpp) | HF: bartowski/openai_gpt-oss-20b-GGUF-MXFP4-Experimental, ThinkLevel: medium; run took about 1.5 hours | 43.44% | gpt-oss | RTX4090 |
| 15 | GPT-OSS-20B-MXFP4 (ollama) | Official Ollama repository, MXFP4, ThinkLevel: medium; run took about 1 hour | 39.50% | gpt-oss | RTX4090 |
| 16 | Gemma3-4B-PT-BF16 (huggingface) | Official Hugging Face model repository | 37.45% | gemma3 | RTX4090 |
| 17 | Gemma3-1B-IT-BF16 (huggingface) | Official Hugging Face model repository | 31.54% | gemma3 | RTX4090 |
| 18 | Gemma3-1B-IT-FP16 (ollama) | Official Ollama repository | 29.80% | gemma3 | RTX4090 |
| 19 | Gemma3n:E2B-BF16 (huggingface) | Official Hugging Face model repository | 24.11% | gemma3n | RTX4090 |
| 20 | Llama-3.1-TAIDE-LX-8B-Chat-BF16 (huggingface) | Official Hugging Face model repository | 19.41% | llama3.1 | RTX4090 |
| 21 | Gemma3-1B-PT-BF16 (huggingface) | Official Hugging Face model repository | 2.35% | gemma3 | RTX4090 |
| 22 | Gemma3-270M-IT-BF16 (huggingface) | Official Hugging Face model repository | 1.90% | gemma3 | RTX4090 |
| 23 | Gemma3-270M-BF16 (huggingface) | Official Hugging Face model repository | 1.82% | gemma3 | RTX4090 |
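
The summary figures at the top of the page (Total Models, Top Score, Average) can be cross-checked against the 23 listed scores. The sketch below assumes the Average is an unweighted arithmetic mean of those scores rounded to two decimals; the page does not state how it is computed, so treat that as an assumption.

```python
from statistics import mean

# Scores (%) copied from the leaderboard, in rank order (ranks 1 to 23).
scores = [
    88.55, 85.90, 85.75, 82.71, 77.26, 77.03, 74.45, 73.77,
    71.34, 71.11, 69.07, 66.87, 61.26, 43.44, 39.50, 37.45,
    31.54, 29.80, 24.11, 19.41, 2.35, 1.90, 1.82,
]

total_models = len(scores)
top_score = max(scores)
# Assumption: the page's "Average" is a plain mean over all listed models.
average = round(mean(scores), 2)

print(f"Total Models: {total_models}")   # 23
print(f"Top Score: {top_score:.2f}%")    # 88.55%
print(f"Average: {average:.2f}%")        # 52.89%
```

Under that assumption the recomputed values match the published header.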

Test Environment

Hardware and software specifications

OS

Ubuntu 22.04

FRAMEWORK

lm-evaluation-harness

GPU1

RTX4090 24GB

GPU2

RTX5090 32GB

Evaluation Benchmarks

Comprehensive assessment methodology

MATH

GSM8K: Expert-written math benchmark covering multi-step elementary-level word problems (English), ~8,500 questions (7.5K train, 1K test).

LANG

TMMLU+: Traditional Chinese multiple-choice benchmark spanning 66 subjects (elementary to professional level), ~22,690 questions; 6x larger and more balanced than the original TMMLU.

METHOD

Flexible: Lenient answer extraction from output
Strict: Requires output to match specified format
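
The difference between the two scoring methods can be illustrated for a numeric answer such as GSM8K's, whose reference solutions end with a line of the form `#### <number>`. This is a minimal sketch, not the filters that lm-evaluation-harness actually ships: the regular expressions and the helper names `extract_strict` and `extract_flexible` are hypothetical and chosen only for illustration.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; the harness defines its own filters.
STRICT_PATTERN = re.compile(r"#### (-?[0-9.,]+)")      # requires the "#### 18" format
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")   # any number anywhere in the text


def _normalize(num: str) -> str:
    """Drop thousands separators and a trailing period so '1,250.' becomes '1250'."""
    return num.replace(",", "").rstrip(".")


def extract_strict(output: str) -> Optional[str]:
    """Strict: accept the answer only if the output follows the required format."""
    match = STRICT_PATTERN.search(output)
    return _normalize(match.group(1)) if match else None


def extract_flexible(output: str) -> Optional[str]:
    """Flexible: take the last number that appears anywhere in the output."""
    matches = NUMBER_PATTERN.findall(output)
    return _normalize(matches[-1]) if matches else None


formatted = "16 - 3 - 4 = 9 eggs, and 9 * 2 = 18 dollars.\n#### 18"
free_form = "The answer is 18 dollars."

print(extract_strict(formatted), extract_flexible(formatted))
print(extract_strict(free_form), extract_flexible(free_form))
```

A model that reasons correctly but ignores the output format scores under the flexible method and fails under the strict one, which is why the two numbers can differ for the same run.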

License Information

Open source licensing details

LICENSE

MIT

COPYRIGHT

Copyright 2025 Xuan-You Lin

TERMS

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files.