A (silly) benchmark for testing how well LLMs can play the children's game FizzBuzz.
By customizing the numbers at which 'Fizz' or 'Buzz' should be said, we can test whether LLMs generalize to play the game under the new rules, or merely recite the FizzBuzz they memorized from training data. The benchmark has three difficulty levels: easy, medium, and hard.
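The customized rules can be sketched as a small reference oracle; the function name and the default divisors of 3 and 5 are illustrative assumptions, not part of the benchmark's actual harness.

```python
def fizzbuzz_turn(n, fizz=3, buzz=5):
    """Expected utterance on turn n, with customizable Fizz/Buzz divisors."""
    out = ""
    if n % fizz == 0:
        out += "Fizz"
    if n % buzz == 0:
        out += "Buzz"
    # If neither divisor matched, the player should just say the number.
    return out or str(n)
```

Passing non-standard divisors (say, `fizz=4, buzz=7`) produces the "new rules" a model must follow without leaning on memorized FizzBuzz sequences.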
I tested each model for 200 turns of FizzBuzz; a model's score on a level is the number of turns it completes before its first mistake. I then derived a composite score out of 100 across all three levels: 0.5 * (0.2 * easy + 0.35 * medium + 0.45 * hard)
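The composite formula above, written out as code (the function name is mine; the weights and the 200-turn cap come from the text, so a perfect run on all three levels yields exactly 100):

```python
def composite_score(easy, medium, hard):
    """Combine per-level scores (turns survived, max 200) into a 0-100 composite.

    Harder levels carry more weight; the leading 0.5 rescales the
    200-turn maximum down to a 100-point scale.
    """
    return 0.5 * (0.2 * easy + 0.35 * medium + 0.45 * hard)
```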