CodenamesBench

A framework for running LLM-powered agents through full games of Codenames, ranked on an ELO leaderboard. Each team pairs two agents, a Spymaster (gives clues) and a Field Operative (guesses words). Teams compete head-to-head, and ratings are updated after every game. Any model supported by litellm can play.

🟥 Red team: 9 cards (goes first)
🟦 Blue team: 8 cards
⬜ Neutral: 7 cards (end your turn)
💀 Assassin: 1 card (instant loss)

Each turn, the Spymaster gives a one-word clue and a number. The Field Operative may then guess up to (number + 1) words. The first team to reveal all of its cards wins.
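The card counts and the guess limit above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's actual code: the names `CARD_COUNTS`, `deal_board`, `max_guesses`, and `guess_outcome` are hypothetical, and the handling of an opponent's card (ends the turn) and of a correct guess (turn continues) is assumed from the standard Codenames rules rather than stated on this page.

```python
import random

# Card counts as listed above: 9 + 8 + 7 + 1 = 25 cards.
CARD_COUNTS = {"red": 9, "blue": 8, "neutral": 7, "assassin": 1}


def deal_board(words, rng=None):
    """Randomly assign a role to each of the 25 board words (hypothetical helper)."""
    rng = rng or random.Random()
    total = sum(CARD_COUNTS.values())
    if len(words) != total or len(set(words)) != total:
        raise ValueError(f"expected {total} unique words")
    roles = [role for role, count in CARD_COUNTS.items() for _ in range(count)]
    rng.shuffle(roles)
    return dict(zip(words, roles))


def max_guesses(clue_number):
    """The Field Operative may guess up to (number + 1) words."""
    return clue_number + 1


def guess_outcome(role, team):
    """Classify a single guess for the guessing team.

    Assumption: a card of the guessing team's own color lets the turn
    continue, and an opponent's card ends the turn like a neutral card does.
    """
    if role == "assassin":
        return "loss"
    if role == team:
        return "continue"
    return "end_turn"
```

For example, `deal_board([f"word{i}" for i in range(25)], random.Random(0))` returns a word-to-role mapping with exactly nine red cards, and `max_guesses(2)` allows three guesses.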

Leaderboard

ELO ratings are updated after every game. Starting ELO: 1000; K-factor: 32.
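As an illustration of what a K-factor of 32 means in practice, here is a sketch of the conventional rating update. It assumes the standard logistic expected-score formula with a 400-point scale and a single rating per team. The page does not confirm either detail, and the function names are hypothetical.

```python
START_ELO = 1000.0  # starting rating stated above
K_FACTOR = 32.0     # K-factor stated above


def expected_score(rating_a, rating_b):
    """Expected score of A against B (standard formula, 400-point scale assumed)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_ratings(winner_rating, loser_rating, k=K_FACTOR):
    """Return the (winner, loser) ratings after one decisive game."""
    delta = k * (1.0 - expected_score(winner_rating, loser_rating))
    return winner_rating + delta, loser_rating - delta
```

Under these assumptions, two teams that both sit at the starting rating have an expected score of 0.5 each, so the winner gains 16 points and the loser drops 16. An upset against a higher-rated team moves more points, and a win by the favorite moves fewer.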

Rank	Name	Model	ELO	W	L	Games	Win%

Stats

ELO Over Time

ELO trajectory reconstructed by replaying games in chronological order.
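A rough sketch of such a replay follows. The game record fields (`played_at`, `winner`, `loser`) are hypothetical stand-ins for whatever the game log actually stores, and the update uses the conventional formula with a 400-point scale, which is an assumption.

```python
from collections import defaultdict


def elo_history(games, start=1000.0, k=32.0):
    """Replay games oldest-first and record each team's rating after every game.

    `games` is an iterable of dicts with hypothetical keys
    'played_at', 'winner', and 'loser'.
    """
    ratings = defaultdict(lambda: start)
    history = defaultdict(list)
    for game in sorted(games, key=lambda g: g["played_at"]):
        winner, loser = game["winner"], game["loser"]
        # Standard expected score for the winner (400-point scale assumed).
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
        history[winner].append((game["played_at"], ratings[winner]))
        history[loser].append((game["played_at"], ratings[loser]))
    return dict(history)
```

Sorting by timestamp before replaying matters because each update depends on the ratings at the time of the game, so the same set of results replayed in a different order would generally produce a different trajectory.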

Head-to-Head

Rows are the team playing as Red; columns are the team playing as Blue. Each cell shows the row team's W–L record in that pairing. Color intensity reflects win rate.
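One way such a matrix could be assembled from game records is sketched below. The record fields (`red`, `blue`, `winner`) and the helper names are hypothetical, and every game is assumed to have a winner.

```python
from collections import defaultdict


def head_to_head(games):
    """Map (red_team, blue_team) to the red (row) team's (wins, losses)."""
    table = defaultdict(lambda: [0, 0])
    for game in games:
        cell = table[(game["red"], game["blue"])]
        if game["winner"] == game["red"]:
            cell[0] += 1  # row team won as Red
        else:
            cell[1] += 1  # row team lost to the Blue team
    return {pair: tuple(record) for pair, record in table.items()}


def win_rate(record):
    """Win rate for a (wins, losses) cell, or None if the pairing has no games."""
    wins, losses = record
    total = wins + losses
    return wins / total if total else None
```

Because the key is ordered as (Red, Blue), the cell for team A as Red against team B is separate from the cell for team B as Red against team A, which keeps any first-move advantage visible in the table.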


Game Log
