Horizon Fund Update: America's Next Top Model
Humans are extremely good at adapting. The idea of ChatGPT would have seemed fanciful as recently as 2021, as would the idea that, with no programming knowledge, you could conjure up a working app or video game consisting of hundreds of lines of code from a single one-sentence prompt in natural language.
There were two years between OpenAI's GPT-3 and GPT-4, and now, two years after that, we finally have GPT-5. The level of improvement between generations has accelerated, but to many it feels like the opposite, because we have become accustomed to interim model updates nearly every month, with frontier labs constantly leap-frogging each other to claim the top spot. Plotted on a graph, progress is still on an exponential curve in the key metrics that matter to those bought into the AI bull case.
OpenAI has had a busy week, not only re-establishing America's lead over China on the open-source AI model leaderboards (or at least drawing even), but also displacing Grok 4 at the frontier after less than a month at the top.
OpenAI's 120B parameter open-source model edges out Alibaba's Qwen 3 30B on intelligence, but sits below the larger 235B version. It scores higher than DeepSeek's R1, despite having 5x fewer parameters
Source: Artificial Analysis
GPT-5 only just takes the top spot over the other frontier models released in recent months, but is a huge leap over the original GPT-4 released in March 2023
Source: Artificial Analysis; Green Ash Partners. Artificial Analysis Intelligence Index v2.2 incorporates 8 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, IFBench, AA-LCR
Plotting model releases from the four frontier labs shows AI's advance as a continuum, rather than single leaps with each new model generation. xAI has had the steepest rate of improvement
Source: Artificial Analysis; Green Ash Partners. Artificial Analysis Intelligence Index v2.2 incorporates 8 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, IFBench, AA-LCR
It's clear from the chart above that there isn't much daylight between the four models at the frontier in terms of intelligence. This is partly due to benchmark saturation - the widely used public evals have been largely conquered, and today's leading models are essentially PhD-level experts in every domain. But they are also all of a similar generation in terms of compute: GPT-5 is rumoured to have been trained on 180-200k H100 GPUs - about the same as the 200k cluster used to train Grok 4.
There is one crucial point of difference, though: OpenAI seems to have made significant progress on reducing hallucinations. Not only does this immediately increase model utility in higher-value knowledge work (especially healthcare, law and financial services), but it also brings forward the arrival of AI agents, the highest-value unlock of all. Error rates compound exponentially over multiple steps, so reducing them is critical to realising the potential of asynchronous AI agents performing tasks over longer time horizons.
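To see why error rates matter so much for agents, here is a minimal sketch of the compounding arithmetic. The per-step reliability figures and the 50-step task are hypothetical illustrations, not numbers from OpenAI:

```python
# Illustrative only: how per-step error rates compound over a multi-step agent task.
# The reliability figures and step count below are hypothetical, not published numbers.
def task_success(per_step_success: float, steps: int) -> float:
    """Probability an agent completes every step without an error,
    assuming independent, identically reliable steps."""
    return per_step_success ** steps

for reliability in (0.98, 0.99, 0.999):
    print(f"{reliability:.1%} per step -> {task_success(reliability, 50):.1%} over 50 steps")
```

Under these assumptions, a 2% per-step error rate leaves only about a one-in-three chance of completing a 50-step task, whereas a 0.1% error rate keeps success above 95% - which is why seemingly small hallucination reductions translate into large gains in agent usefulness.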
GPT-5 has significantly fewer hallucinations than OpenAI's previous state of the art model
Source: Scala
GPT-5 is comparable to or better than human experts in roughly half of cases on OpenAI's internal benchmark measuring performance on complex, economically valuable knowledge work (spanning over 40 occupations including law, logistics, sales and engineering)
Source: OpenAI
GPT-5 sits on the exponential trendline identified by METR, in which the length of tasks AI models can complete doubles roughly every 7 months
Source: METR
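The METR trendline above is simple to state in code. This is a sketch of the doubling arithmetic only; the starting task horizon of 1 hour is an assumed baseline for illustration, not a METR figure:

```python
# Sketch of METR's reported trend: the length of tasks AI models can complete
# doubles roughly every 7 months. The 1-hour starting horizon is an assumption.
def task_horizon_hours(months_elapsed: float, start_hours: float = 1.0,
                       doubling_months: float = 7.0) -> float:
    """Task horizon after a given number of months on the doubling trendline."""
    return start_hours * 2 ** (months_elapsed / doubling_months)

print(task_horizon_hours(42))  # 42 months = six doublings -> 64x the starting horizon
```

On this curve, three and a half years of progress multiplies the feasible task length by 64x, which is what makes the trend so consequential for long-horizon agents.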
There is now a cohort of knowledge workers who have integrated AI into their work and pay close attention to new model releases - constantly probing the jagged frontier of intelligence and testing each release to see what new capabilities might have emerged to further augment their productivity.
But this is a relatively small subset of the world's c.1 billion knowledge workers, or of ChatGPT's 700 million users, the vast majority of whom have only played around with the ChatGPT 4o base model. For this majority, GPT-5 will be a major revelation, necessitating an update to their priors on what AI can do today. Even amongst the early adopters, few will have explored the very latest models like o3 Pro or Gemini 2.5 Deep Think, which use parallel compute, or even Grok 4, as these are all locked behind ~$200 per month subscription tiers.
This is perhaps the largest update of all - not GPT-5's performance on this or that benchmark, or its progress along the time-horizon curve towards AI agents, but the sudden availability of frontier AI to hundreds of millions of people, for free. This feat was only possible due to massive AI datacentre investments over the last year - one of OpenAI's infrastructure engineers tweeted that OpenAI has built 60+ clusters in the last 60 days, adding 200k GPUs. OpenAI's total compute has increased 15x this year versus 2024.
And for investors, the takeaway is that everyone is still compute-constrained. From hyperscalers like Microsoft Azure, Google Cloud and AWS, to frontier AI research labs, to neoclouds like Nebius - all could go faster and do more were there not bottlenecks in chips, energy and the time it takes to build the physical structures that house AI servers. It is too early to say whether GPT-5 represents the cost/performance trade-off needed to begin truly reshaping the economy, but it is certainly a major step along that path. And with each step comes higher conviction in the need for massive AI datacentre capacity, and the energy to power it.