Towards a Mathematically Optimal Scrabble Bag

There aren’t a lot of Ses in a Scrabble bag. Despite being the seventh most common letter in English, it is the joint-eighth rarest in Scrabble.

Excluding the two blank tiles, there are 98 letters in the bag, and for the most part, the number of each letter reflects how often it appears in English text. Z, Q, X and J are over-represented, but only because they wouldn’t appear at all otherwise. The other end of the scale is a bit more haphazard, though: S and T are roughly one third less common in Scrabble than in English text, while H, for some reason, is two thirds less common – 6% of any given book is the letter H, but Scrabble players get but two of them.

The original document that led to Scrabble letter bags

It’s not really clear why that should be the case. The original letter distribution was devised in 1938, based on the frequencies of each letter in different length words in newspapers and a dictionary, so perhaps that methodology shafted H, or perhaps it was the texts chosen. Maybe H just wasn’t used much in 1938. Who knows.

This is interesting, because it suggests the dearth of Ses is not an artificial limitation to counter their obvious usefulness: you can stick an S on the end of 77,392 different words, making them instantly valuable – but the similarly underrepresented T only goes on the end of 1,393 words. The second most useful letter for this is D, which can be added to the end of 8,441 words. (The least useful letter for this is Q, which goes on the end of TALA and TZADDI. J goes on HA, TA, BEN, HAD and HAJ, although there’s only one in the bag so HAJJ doesn’t come up much. There’s no particular pattern to what you can put on the start of a word.) That means that we can ignore the usefulness of letters, and apply the same model as Alfred Butts did in 1938 to create an optimal bag: two blanks, a Z, a Q, a J, an X, and then 94 other tiles in proportion to their frequency in English text. Right?

No. In general, it’s not possible to do that exactly. Consider three letters, A, B and C, which occur equally often, and suppose you want to put them in a two-tile Scrabble game. You can’t: one of the letters has to miss out. The rounding errors get smaller as you add more tiles, but they don’t go away. It’s the same problem faced by electoral systems that use proportional representation: you have to agree on an algorithm before you start or else you’ll end up in an impossible situation – if the vote is 60/40, and there are four seats, do you split them 50/50, or 75/25? But you can get it near enough for Scrabble.
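One common way to settle that kind of rounding argument is the largest-remainder (Hamilton) method: give every letter the whole-number part of its fair share, then hand the leftover tiles to whichever letters have the biggest fractional parts. The sketch below is only an illustration of that rule, not necessarily the method used for any of the bags in this post; the function name `apportion` and the alphabetical tie-break are my own assumptions.

```python
from math import floor

def apportion(weights, total):
    """Split `total` tiles between letters in proportion to `weights`,
    using the largest-remainder (Hamilton) method."""
    weight_sum = sum(weights.values())
    # Each letter's exact (usually fractional) fair share of the tiles.
    quotas = {k: w * total / weight_sum for k, w in weights.items()}
    # Start by giving every letter the whole-number part of its share.
    counts = {k: floor(q) for k, q in quotas.items()}
    leftover = total - sum(counts.values())
    # Hand out what is left to the largest fractional remainders.
    # Ties are broken alphabetically (an arbitrary, assumed choice).
    order = sorted(quotas, key=lambda k: (-(quotas[k] - counts[k]), k))
    for k in order[:leftover]:
        counts[k] += 1
    return counts

# Three equally common letters, two tiles: somebody has to lose out.
print(apportion({"A": 1, "B": 1, "C": 1}, 2))
# A 60/40 vote with four seats.
print(apportion({"X": 60, "Y": 40}, 4))
```

Under this particular rule the 60/40 vote comes out as two seats each. Other apportionment rules, such as the highest-averages family, can settle close cases differently, which is exactly why the rule has to be agreed in advance.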

Is that a good idea, though? After all, while the word “at” appears a lot in newspapers, it only comes up once in the Scrabble dictionary. And in fact, these things don’t correlate well at all: there are 5 times more Zs in CSW12 than in English, 13 times as many Js, and 7 times as many Ks and Bs; whereas English has 15 times as many Es as CSW12, and 13 times as many Is. Should we be playing with a tile bag containing 10 Ses, 8 Cs and 8 Ps – but only 4 Es? It sounds unthinkable, although at least you’d be less likely to get a rack full of Old MacDonald lyrics.

Alternative letter distributions, based on CSW12, normal English, and the current Scrabble bag:

Letter  CSW12  English  Scrabble
A       6      7        9
B       5      2        2
C       8      3        2
D       6      4        4
E       4      11       12
F       4      2        2
G       3      2        3
H       4      6        2
I       3      6        9
J       1      1        1
K       2      1        1
L       3      4        4
M       5      3        2
N       2      6        6
O       3      7        8
P       8      2        2
Q       1      1        1
R       5      6        6
S       10     6        4
T       5      8        6
U       3      3        4
V       2      1        2
W       2      2        2
X       1      1        1
Y       1      2        2
Z       1      1        1
Blank   2      2        2

Of course, players don’t know all 270,163 words in CSW12, and it’s likely that the obscurity of many of them will shift the letter distribution back towards that of normal English. In that respect, the ideal bag of tiles probably depends to some extent on the skill of the players. In the hardcore CSW12 bag, there are 19 vowels rather than 42, so it would change the game a lot.

This would probably also require different scores for each letter: X and Y currently score 8 and 4 respectively, but both are rarer in CSW12 than Z or Q, which score 10. You get 1 for an N, but in the CSW12 bag it’s rarer than G and D (2 each), B, M, P and C (3 each), and H and F (4 each). Thing is, if you scale each letter’s rarity to a range of 0 to 10 and then round up, this is the closest fit you can get:

Letter  CSW12  English  Scrabble
A       1      1        1
B       1      1        3
C       1      1        3
D       1      1        2
E       1      1        1
F       1      1        4
G       1      1        2
H       1      1        4
I       1      1        1
J       2      5        8
K       1      1        5
L       1      1        1
M       1      1        3
N       1      1        1
O       1      1        1
P       1      1        3
Q       3      8        10
R       1      1        1
S       1      1        1
T       1      1        1
U       1      1        1
V       1      1        4
W       1      1        4
X       10     5        8
Y       4      1        4
Z       3      10       10
Blank   0      0        0
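One plausible reading of "scale, then round up" is to give the rarest letter the full 10 points and every other letter a share of that in inverse proportion to its frequency, rounded up to a whole number. The sketch below shows that reading only; the function name `rarity_scores` and the toy frequencies are made up for illustration and are not the figures behind the table above.

```python
from math import ceil

def rarity_scores(freqs, max_score=10):
    """Score each letter as ceil(max_score * rarest_freq / freq).

    The rarest letter gets max_score; anything much more common
    collapses to 1 because of the rounding up.
    """
    rarest = min(freqs.values())
    return {letter: ceil(max_score * rarest / f) for letter, f in freqs.items()}

# Hypothetical relative frequencies, for illustration only.
toy = {"E": 50, "K": 8, "J": 2, "Z": 1}
print(rarity_scores(toy))
```

Even in this toy case you can see the problem: once a letter is more than ten times as common as the rarest one, it scores 1, so most of the alphabet ends up tied at the bottom.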

I think that looks a bit boring, to be honest, and the game would be more fun if the scores were more varied. Maybe some kind of gamma correction, like the curve used to adjust contrast in images, would help increase the contrast on the low scores and reduce the power of X. Still, though, I’d like to have a game with the alternative distribution and see if it works at all. I feel sure it can’t, with so few vowels, but then, maybe that’s just because I’m so used to the current, vowel-heavy version that I don’t use the vowel-light words much.
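As a rough idea of what a gamma-style adjustment could look like, the sketch below raises each letter's scaled rarity to a power before rounding up. An exponent below 1 lifts the low scores towards the top one, which narrows the lead of the rarest letter. The function name `gamma_scores`, the exponent of 0.5 and the toy frequencies are all assumptions chosen for illustration, not values I have tested in a real game.

```python
from math import ceil

def gamma_scores(freqs, gamma=0.5, max_score=10):
    """Score letters by rarity with a power-law ("gamma") curve.

    rarity runs from just above 0 (very common) to 1 (the rarest letter).
    gamma = 1 is plain linear scaling; gamma < 1 boosts the low scores.
    """
    rarest = min(freqs.values())
    scores = {}
    for letter, f in freqs.items():
        rarity = rarest / f
        scores[letter] = ceil(max_score * rarity ** gamma)
    return scores

# Hypothetical relative frequencies, for illustration only.
toy = {"E": 50, "K": 8, "J": 2, "Z": 1}
print(gamma_scores(toy, gamma=1.0))  # linear scaling
print(gamma_scores(toy, gamma=0.5))  # square-root curve
```

With the square-root curve the middle of the toy alphabet spreads out instead of bunching up at 1, which is the sort of variety the scores above are missing.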

Of course, if we really wanted to optimise the game, we would invent a new set of words to play with that were designed to be fun rather than similar to English. But that would be taking things a bit far, don’t you think?