A (silly) benchmark for testing how well LLMs can play the children's game FizzBuzz.
By customizing the numbers at which 'Fizz' or 'Buzz' should be said, we can test whether LLMs generalize to play the game under the new rules, or merely recite the FizzBuzz they memorized from training data. The benchmark has three difficulty levels: easy, medium, and hard.
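The customized rules can be sketched as a small reference oracle; the function name and the default divisors of 3 and 5 are illustrative assumptions, not part of the benchmark's actual harness.

```python
def fizzbuzz_turn(n, fizz=3, buzz=5):
    """Expected utterance on turn n, with customizable Fizz/Buzz divisors."""
    out = ""
    if n % fizz == 0:
        out += "Fizz"
    if n % buzz == 0:
        out += "Buzz"
    # If neither divisor matched, the player should just say the number.
    return out or str(n)
```

Passing non-standard divisors (say, `fizz=4, buzz=7`) produces the "new rules" a model must follow without leaning on memorized FizzBuzz sequences.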
I tested each model for 200 turns of FizzBuzz; a model's score on a level is the number of turns it completes before its first mistake. I then derived a composite score out of 100 across all three levels: 0.5 * (0.2 * easy + 0.35 * medium + 0.45 * hard)
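The composite formula above, written out as code (the function name is mine; the weights and the 200-turn cap come from the text, so a perfect run on all three levels yields exactly 100):

```python
def composite_score(easy, medium, hard):
    """Combine per-level scores (turns survived, max 200) into a 0-100 composite.

    Harder levels carry more weight; the leading 0.5 rescales the
    200-turn maximum down to a 100-point scale.
    """
    return 0.5 * (0.2 * easy + 0.35 * medium + 0.45 * hard)
```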