FizzBuzz Bench

A (silly) benchmark for testing how well LLMs can play the children's game FizzBuzz.

By customizing the numbers for which 'fizz' or 'buzz' should be said, we can test whether LLMs generalize to play the game by the new rules, or just recite what they memorized about FizzBuzz from training. The benchmark has three difficulty levels: Easy, Medium, and Hard.

I tested each model for 200 turns of FizzBuzz at each level; a model's score for a level is the number of turns it completes before its first wrong answer. I then derived a composite score out of 100 across all three levels: 0.5 * (0.2 * easy + 0.35 * medium + 0.45 * hard)
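The composite formula can be written as a one-line helper; with 200 turns as the maximum per level, a perfect run at every level scores exactly 100 (0.5 × (0.2·200 + 0.35·200 + 0.45·200) = 100).

```python
def composite_score(easy: int, medium: int, hard: int) -> float:
    """Combine per-level turn counts (0-200) into a score out of 100.

    Weights favor the harder levels: 0.2 / 0.35 / 0.45, scaled by 0.5
    so that 200 turns at every level maps to exactly 100.
    """
    return 0.5 * (0.2 * easy + 0.35 * medium + 0.45 * hard)

print(composite_score(200, 200, 200))  # 100.0 — a perfect run
print(composite_score(100, 100, 100))  # 50.0
```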

The system prompt given to each model:
You are playing FizzBuzz with the following rules:
• If a number is divisible by {fizz_num}, say 'fizz'
• If a number is divisible by {buzz_num}, say 'buzz'
• If a number is divisible by both {fizz_num} and {buzz_num}, say 'fizzbuzz'
• Otherwise, say the number itself

I will give you a number, and you must respond with the NEXT number (or word) in the sequence following these rules. Respond with ONLY the answer — just the number, 'fizz', 'buzz', or 'fizzbuzz'. No explanations, no additional text, no punctuation.
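The rules in the prompt can be checked against a small reference oracle. This is a sketch, not the benchmark's actual grading code; the parameter names `fizz_num` and `buzz_num` simply mirror the prompt's placeholders.

```python
def expected_answer(n: int, fizz_num: int, buzz_num: int) -> str:
    """Return the correct FizzBuzz response for n under custom rules."""
    # Check the combined case first, since it subsumes the other two.
    if n % fizz_num == 0 and n % buzz_num == 0:
        return "fizzbuzz"
    if n % fizz_num == 0:
        return "fizz"
    if n % buzz_num == 0:
        return "buzz"
    return str(n)

# Classic rules (fizz on 3, buzz on 5):
print([expected_answer(n, 3, 5) for n in range(1, 16)])
# ['1', '2', 'fizz', '4', 'buzz', 'fizz', '7', '8', 'fizz', 'buzz',
#  '11', 'fizz', '13', '14', 'fizzbuzz']
```

Grading a model's turn then reduces to a string comparison between its reply and `expected_answer` for the current number.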
[Leaderboard: models from OpenAI, Anthropic, Google, and open-weights providers, with Easy, Medium, Hard, and Overall Score columns; filterable by difficulty and provider.]