It can be tough to pick the perfect gift. You’ve got it narrowed down to a few choices, but you can’t quite decide which is the right one. Wouldn’t it be nice to ask Santa’s Little Helper (in our case, a friendly LLM) to pick a gift for you? Let’s try the following prompt (against gpt-4o):
I need to bring a gift to a Secret Santa party. Please pick a gift at random from the following list:
1. Teddy Bear
2. Remote-controlled Car
3. Red Ryder BB Gun
4. Crayon Set
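A minimal harness for repeating this experiment might look like the following. This is a sketch that assumes the official `openai` Python SDK and an API key in the environment; the prompt and model name mirror the setup above, and the keyword-matching tally is a simplification:

```python
from collections import Counter

GIFTS = ["Teddy Bear", "Remote-controlled Car", "Red Ryder BB Gun", "Crayon Set"]

PROMPT = (
    "I need to bring a gift to a Secret Santa party. "
    "Please pick a gift at random from the following list:\n"
    "1. Teddy Bear\n"
    "2. Remote-controlled Car\n"
    "3. Red Ryder BB Gun\n"
    "4. Crayon Set"
)

def tally(responses):
    """Count which gift each raw model response mentions."""
    counts = Counter()
    for text in responses:
        for gift in GIFTS:
            if gift.lower() in text.lower():
                counts[gift] += 1
                break
    return counts

def run_experiment(iterations=200):
    """Ask the model repeatedly and tally its picks (requires an API key)."""
    from openai import OpenAI  # pip install openai
    client = OpenAI()
    responses = []
    for _ in range(iterations):
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": PROMPT}],
        )
        responses.append(completion.choices[0].message.content)
    return tally(responses)
```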
After 200 iterations, here are our results:
That’s pretty far from random! What’s going on here?
The Moral of the Story
Despite the instructions, an LLM can’t really pick an item at random. LLMs aren’t random number generators; they’re prediction engines. When asked to "pick randomly," they fall back on their training data and learned patterns. In this case, the model seems to have a strong preference for recommending remote-controlled cars and crayon sets, likely because these are commonly given as gifts, it assumes children are involved, and they’re generally considered safe, age-appropriate choices.
The extremely low selection rate of the Red Ryder BB Gun (only 3.5% of responses) is particularly telling. Anyone familiar with "A Christmas Story" knows the famous line "You'll shoot your eye out!" This bias towards "safer" options demonstrates how LLMs encode not just frequency of associations, but also implicit value judgments from their training data.
How far do we have to go to get a truly random selection regardless of any judgments the LLM might want to make about our options? Is it even possible?
Working Around Moral Bias
Let’s try telling it not to consider what kind of toys they are:
Please pick a random number between 1 and 4. Once you have that number, select the item that matches that number from the following list:
1. Teddy Bear
2. Remote-controlled Car
3. Red Ryder BB Gun
4. Crayon Set
That definitely changes the results:
We were able to get around its Christmas Story bias, but our results don’t look any more evenly distributed. This second distribution reveals another interesting pattern in how LLMs handle randomness. Even when instructed to pick a random number first, the model still shows strong biases, just different ones than before. The dramatic shift toward Red Ryder BB Guns (60% of selections) and Remote-controlled Cars (36%) suggests that the model may be drawing on numerical patterns in its training data when asked to generate numbers between 1 and 4.
Rather than producing truly uniform random selections, it appears to favor certain numbers in this range, which then map to specific gifts. This demonstrates how difficult it is to get genuine randomness from a system designed for pattern recognition and prediction, even when attempting to break the task into seemingly objective numerical steps.
Prompt Engineering to the Max
Okay, one last attempt at using prompt engineering to convince the LLM to give us some randomness:
First pick a random number between 1 and 1000. Then divide it by 4 and keep the remainder.
Based on that number, please pick a gift from the following list:
0. Teddy Bear
1. Remote-controlled Car
2. Red Ryder BB Gun
3. Crayon Set
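The arithmetic the prompt asks for is just a modulo operation, and if the initial number really were uniform over 1 to 1000, each remainder would occur exactly 250 times. In plain Python, no model involved:

```python
gifts = ["Teddy Bear", "Remote-controlled Car", "Red Ryder BB Gun", "Crayon Set"]

# Dividing any number from 1 to 1000 by 4 and keeping the remainder
# always yields 0, 1, 2, or 3 -- exactly the indices in the list above.
# A uniform draw from 1..1000 would produce each remainder 250 times.
for n in (1, 250, 999, 1000):
    print(n, "->", n % 4, "->", gifts[n % 4])
```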
This seems to help a bit but we’re still a long way from an even distribution:
Even mathematical abstraction fails to solve the LLM's randomness problem. Despite using a larger initial number range and a modulo operation to map to the gift options, we still see clear biases in the distribution.
Move the Computer Closer to the Fireplace
Maybe we need to get above the prompt. LLMs incorporate randomness when predicting tokens, controlled by a parameter called temperature. At high temperatures, the model is more likely to select from a broader range of possible tokens, even if they have lower probability scores, which introduces more variety in the outputs. At low temperatures, it sticks more closely to the highest probability choices.
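Mechanically, temperature divides the model’s logits before the softmax that turns them into probabilities, which flattens or sharpens the resulting distribution. A toy illustration, using hypothetical logit values for our four gift tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens.
logits = [2.0, 4.0, 1.0, 3.5]

low = softmax_with_temperature(logits, 0.5)   # sharp: the top token dominates
high = softmax_with_temperature(logits, 2.0)  # flat: closer to uniform
```

Higher temperature narrows the gap between the most and least likely tokens, which is why the outputs get more varied, but it never erases the underlying preferences.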
Temperature seems like just what we’re looking for! But will it really work? Let’s try increasing the temperature on our previous example. The limit of coherence seems to sit at a temperature of about 1.5, at which point we occasionally start getting nonsensical output:
19 Teddy Bears
81 Remote-controlled Cars
29 Red Ryder BB Guns
69 Crayon Sets
1 bytge.gvข.runטר":[],\r\nconstants dalamரோḷnatur.
Better, but we’re still quite a ways from a truly fair distribution. How can we get the LLM to ignore its training data and pick something truly random?
Tipping the Scales
There are a few ways to manipulate the token weights to get what we want, like OpenAI’s logit_bias parameter and Structured Outputs. In this case, we’ll use structured outputs, as they’re a little friendlier to work with. Structured outputs are OpenAI’s means of restricting the LLM output to match a provided JSON schema. They accomplish this by looking at the schema, figuring out all of the possible valid next tokens, and forcing any invalid tokens to have a probability of zero.
We can leverage this for our use case by asking the LLM to generate a number, but requiring in the schema that the output be a letter corresponding to one of our gifts. The LLM wants to choose a digit, so the probabilities it assigns to letters are all near zero. However, the schema requires the next token to match one of the gifts, so it has to choose a letter. The tokens matching the gift options get artificially raised in probability, and since they all started out nearly equally close to zero, they end up with roughly equal likelihoods.
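The effect can be illustrated with a toy renormalization. The probabilities below are hypothetical and this is not the actual OpenAI implementation, but it captures the mechanism: the digit tokens the model wanted are zeroed out, and the near-zero letter tokens are renormalized against each other:

```python
# Hypothetical next-token probabilities: the model wants to emit a digit.
probs = {
    "1": 0.40, "2": 0.25, "3": 0.20, "4": 0.14,
    # Letters for our gifts sit near zero -- and nearly equal to each other.
    "A": 0.0030, "B": 0.0025, "C": 0.0022, "D": 0.0023,
}

allowed = {"A", "B", "C", "D"}  # the only tokens the schema permits

# Zero out everything the schema forbids, then renormalize what remains.
masked = {t: p for t, p in probs.items() if t in allowed}
total = sum(masked.values())
renormalized = {t: p / total for t, p in masked.items()}
# Because the letter probabilities started out nearly equal, the
# renormalized distribution is roughly uniform (around 25% each).
```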
Our result:
Not bad! But while we finally achieved a relatively balanced distribution, we've just invented the world's most expensive and convoluted random number generator. We’re using a supercomputer to flip a coin!
The Moral of the Story
Even when explicitly instructed to make random selections, these models naturally gravitate toward patterns learned during training. This has important implications for applications requiring true randomness, from games and simulations to scientific sampling. While we’ve seen that we can force the model to ignore its learned preferences with some work, there are better ways to spend your token budget than asking a state-of-the-art AI model to pick between a teddy bear and a crayon set.
For truly random selections, maybe just stick to Math.random(): it’s faster, cheaper, and won’t try to talk you out of the BB gun.
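Or, in Python, the whole exercise collapses to a one-liner:

```python
import random

gifts = ["Teddy Bear", "Remote-controlled Car", "Red Ryder BB Gun", "Crayon Set"]
print(random.choice(gifts))  # uniform over all four gifts, no prompt required
```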
Learn more about LOGIC, Inc. at https://logic.inc