Three Years Of Predicting Football with LLMs
With the Super Bowl this weekend, what better time to reflect on this season's picks?
I don’t know much about American football. But this year, a bot I wrote outperformed 98.5% of the ~250,000 humans playing ESPN’s NFL Pick’ems against the spread.
For three years, this open-source project has sat near the top of the leaderboards. In previous seasons, I only tracked “straight” winners. This year, I let it handle the spread and confidence rankings too. It finished in the top 5.9% for straight picks, 5.5% for confidence picks, and 1.5% against the spread.
As a Seattleite, it pains me to report the bot’s prediction for Super Bowl LX: The Patriots win and cover the +4.5 spread.
Humans Still Got It
My friend and colleague, Mark, coordinates our local league. He lives and breathes the game.
This year, Mark did something extraordinary: he outperformed 99.7% of ESPN players against the spread. He has the kind of intuition that comes from a lifetime of lived experience, the one thing an LLM doesn’t have.
Interestingly, when looking at straight picks and confidence rankings, the roles flipped:
Straight Pick: Bot (94.1%) vs. Mark (86.9%)
Confidence: Bot (94.5%) vs. Mark (86.2%)
The machine is better at forecasting the macro direction of the game. But Mark won out, meaningfully, when the macro was neutralized and only the minutiae mattered.
It suggests that LLMs are excellent at predicting the broad strokes, and reasonably good with the nuanced ones too, but humans can incorporate matchup-specific nuance that an LLM just doesn’t “get” yet¹.
It’s a single data point, so take the result in that context, but I think it’s a fascinating outcome.
Did We Just “Show Up”?
It’s easy to look at the ranked results and conclude that LLMs are performing at superhuman levels here, but there are meaningful confounding factors we should at least acknowledge out loud.
Survivorship Bias
In any season-long competition, there is a massive drop-off in participation. People get busy, their teams start losing, and they stop making picks.
Simply “showing up” every week and making rational picks likely puts you in the top 30% automatically.
We have no reliable way to infer the drop-off rate on ESPN, though it seems reasonable to assume that a top-1.5% finish is too strong to be explained by mere participation.
Emotional Picking
Most human players are “heart-first” predictors. They refuse to pick against their home team, or they over-index on what they hope will make for a good story.
The LLM, conversely, is a cold machine². It doesn’t care about a team’s legacy or grit. It only cares about the data³.
Is it fair to call a machine’s performance “superhuman” if a median human could look at the same stat block and news articles and come to the same conclusions, if only they were blind to the teams involved?
Actual Performance
Ranking well is fun and all, but what does it mean to actually be in the top 1.5% or 0.3% of humans?
Well, it means you picked the spread correctly in 53.3% and 55.7% of the games, respectively. A bit better than a coin toss, but statistically significant.
The best picker on ESPN got the spread correct 62.2% of the time.
For straight picks, the bot’s 94.1% finish translates to predicting 64% of games correctly. The best picker on ESPN hit 71.5%.
What the Machine Actually Sees
That brings us to what we actually provide to the agent.
I intentionally keep the bot “blind” to public voting ratios. It doesn’t know who the “favorite” is. It consumes only two streams:
Team Stats: Anything ESPN publishes on its team stats pages.
Latest News: We crawl major NFL news on ESPN, classify each article, and assign it to the relevant teams.
In aggregate, we don’t give the LLM much data. This is mostly a vestige of the era of tiny context windows. Next year, we’ll likely give it a lot more to reference: explicit injury reports, per-player stats, and anything else we can grab.
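For the curious, here’s a minimal sketch of how those two streams could be stitched into a single against-the-spread prompt. The function name, prompt wording, and model choice are illustrative assumptions on my part, not the project’s actual code.

```python
# Hypothetical sketch: turn the two input streams (team stats + classified
# news) into a single against-the-spread prompt. All names here are
# illustrative; the real bot's interfaces may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def pick_against_spread(home: str, away: str, spread: float,
                        team_stats: dict[str, str],
                        team_news: dict[str, str]) -> str:
    """Ask the model for an against-the-spread pick; returns its raw answer."""
    prompt = (
        f"{away} at {home}, line: {home} {spread:+.1f}.\n\n"
        f"--- {home} stats ---\n{team_stats[home]}\n\n"
        f"--- {away} stats ---\n{team_stats[away]}\n\n"
        f"--- Recent news ---\n{team_news[home]}\n{team_news[away]}\n\n"
        "Which team covers the spread? Answer with the team name and a short rationale."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```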
⚠️ Sidebar: The Ethics of a Winning Bot
I originally built this bot as a litmus test for LLM versatility. It is valuable as a clear case of the LLM performing a quantifiable and relatable task that it couldn’t possibly have been trained on.
But last year, the inbound interest turned into an influx of messages from people wanting help using the bot to gamble.
We are in the midst of a gambling epidemic, with “gambling addiction” searches surging in legalized states and a third of adolescent boys gambling before they turn 18.
I’m not sure yet whether I’ll push this year’s improvements to GitHub.
Looking Ahead
For next season, I’m interested in “de-biasing” the agent and increasing its autonomy. I’m thinking about:
Anonymizing Teams: Feeding the bot “Team A” vs “Team B”, along with anonymized news articles, to remove training data bias (e.g. it knows the Patriots are historically a winning franchise).
Consensus Voting: Running a best-of-N across OpenAI, Gemini, and Anthropic models. Everyone has great models these days. Why not use them? (A rough sketch follows this list.)
True Agency: Instead of a pre-formatted data dump, I’m thinking about giving the bot tool-calling abilities to research games and pull data that it cares about most.
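As a taste of the consensus idea, a naive majority vote across a few model wrappers might look like the following. The picker callables are hypothetical stand-ins for whatever each provider’s SDK returns.

```python
# Naive consensus sketch: ask N models for the same pick and take the majority.
# The picker callables are hypothetical wrappers around each provider's API.
from collections import Counter
from typing import Callable


def consensus_pick(pickers: list[Callable[[str], str]], matchup: str) -> str:
    """Return the most common pick; ties go to whichever answer appeared first."""
    votes = [picker(matchup) for picker in pickers]
    return Counter(votes).most_common(1)[0][0]


# Usage (assuming ask_openai, ask_gemini, and ask_claude are defined elsewhere):
# pick = consensus_pick([ask_openai, ask_gemini, ask_claude], "NE @ SEA, NE +4.5")
```

If the models disagree, that disagreement could itself feed the confidence ranking.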
Intuitively, giving LLMs more degrees of freedom can often hurt performance, so I would expect some of these ideas to lead to regressions. But we’re doing this all for fun, and there’s something magical about letting a machine make all of the decisions.
1. Maybe they are capable of getting it, if given the right context.
2. Actually very warm.
3. Though there is almost certainly human favoritism in both the trained and in-context data.




