|
Generated with ChatGPT
|
|
On the Horizon #8: The Age of Agents
|
|
"This is the way enterprise will work in the future: we'll have AI agents which are part of our digital workforce. There are a billion knowledge workers in the world - there are probably going to be 10 billion digital workers working with us side by side. 100% of software engineers in the future, there are 30 million of them around the world, 100% of them are going to be AI assisted, I'm certain of that. 100% of NVIDIA software engineers will be AI assisted by the end of this year" - NVIDIA CEO Jensen Huang, GTC Keynote March 2025
'Agent' is the latest buzzword in AI. It aims to describe the transition from turn-based chatbots to more task-orientated models capable of taking action in the world. Put simply, agents are models that use tools in a loop. There are some prerequisite model capabilities for agents to work - they must understand complex inputs, engage in reasoning and planning, use tools reliably, and be able to recover from errors. It is only in the last few months that frontier models have started to cross these thresholds of capability.
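As a rough illustration of the 'tools in a loop' idea (not any particular lab's implementation - the call_model function and tools registry below are hypothetical placeholders), an agent loop can be sketched in a few lines of Python:

```python
# Minimal sketch of 'tools in a loop'. call_model() and the tools registry are
# hypothetical placeholders for a real LLM API and real integrations (search,
# code execution, a browser, etc.); this is not any particular lab's agent.

def run_agent(task, tools, call_model, max_steps=20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(history)             # model plans the next step
        if action["type"] == "final_answer":     # model decides the task is done
            return action["content"]
        tool = tools[action["tool_name"]]        # look up the requested tool
        result = tool(**action["arguments"])     # take an action in the world
        history.append({"role": "tool", "content": str(result)})  # observe
    return "Stopped: step limit reached without a final answer."
```

The loop runs until the model judges the task complete (or hits a step limit); the capability thresholds described above determine how many steps it can take before errors compound.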
The implications of agents are profound. Anthropic CEO Dario Amodei expects 'a country of geniuses in a datacentre' to enter the global labour market by 2027. Millions of AI agent 'instances' can be spun up arbitrarily, limited only by the world's computation infrastructure and energy production. OpenAI see a future of 'agent swarms' collaborating to accelerate scientific progress and forming new types of organisation. Many in the field now think we may see such agents contributing to AI research by the end of this year, potentially leading to recursively self-improving intelligence.
This future also has important implications for investors. After 18 months of outperformance in AI infrastructure-related themes, everyone is acutely focused on the risk of overbuilding. Agents are central to the case for explosive demand for inference capacity over the next few years.
This piece is a little technical, and intentionally so. The voices from prominent frontier AI labs are getting louder, and for many, these can be easily dismissed as hype. Self-interest. Marketing. Maybe they're right, but the alternative take is that those at the leading edge of the field are extrapolating clear trends and seeing the potential for large-scale disruption to the economy just a couple of years out - something society is not adequately preparing for.
|
|
OpenAI's Stages of Artificial Intelligence
|
|
|
Source: OpenAI
|
|
The application of reinforcement learning in post-training has driven rapid improvements in areas where the answers can be easily verified. This has put software engineering at the forefront of AI agent progress. In the words of OpenAI's Chief Product Officer, Kevin Weil, "this is the year that AIs get better than humans at programming, forever".
|
|
In just nine months, leading LLMs' ability to solve real-world software engineering issues has doubled to 70%
|
|
|
Source: Green Ash Partners
|
|
However, when adding a time dimension - i.e. comparing model performance to human task completion times - there is a clear drop-off in performance for tasks that take a human longer than an hour
|
|
This is improving rapidly - the length of tasks AIs can do is doubling every 7 months, and is on course to extend to days by 2028 and weeks by 2030
|
|
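As a back-of-the-envelope check on what that doubling rate implies (assuming, purely for illustration, a task horizon of around one hour in early 2025 - the starting point and the 8-hour working day are assumptions, not measured figures):

```python
# Back-of-the-envelope check of the trend cited above: task horizon doubling
# every 7 months. The ~1 hour starting point for early 2025 is an illustrative
# assumption, as is the 8-hour working day used for the conversion.

start_hours = 1.0      # assumed task horizon (in human time) in early 2025
doubling_months = 7    # doubling period cited above

for year in (2026, 2028, 2030):
    months_elapsed = (year - 2025) * 12
    horizon = start_hours * 2 ** (months_elapsed / doubling_months)
    print(f"{year}: ~{horizon:.0f} hours of human task time "
          f"(~{horizon / 8:.1f} working days)")

# Roughly ~35 hours (several working days) by 2028 and ~380 hours (many working
# weeks) by 2030 - consistent with the 'days by 2028, weeks by 2030' trajectory.
```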
In terms of knowledge work, while software engineering is at the higher-value end of the scale, it only represents around 3% of the estimated 1 billion knowledge workers globally. By far the most general agent is one that can use a browser. This gives access to the internet, as well as countless enterprise SaaS applications and databases. An agent that can reliably use a browser can effectively perform any task that a human remote worker can. Computer use agents (CUAs) are under development by Anthropic (Computer Use), OpenAI (Operator) and Google DeepMind (Project Mariner); however, these suffer from the same reliability issues over longer time horizons (for now).
The earliest glimpse of 'web agents' has come in the form of the deep research products released in one form or another by OpenAI, Google DeepMind and xAI. These pair reasoning models with web search capability to trawl through dozens of websites and produce detailed, high-quality analysis in a report that contains citations to the source material and can extend to many pages in length. OpenAI's CEO Sam Altman claims their deep research model can accomplish "a single-digit percentage of all economically valuable tasks in the world".
|
|
Adding 'agentic' capabilities such as web search and tool use to reasoning models more than doubles performance on the hardest benchmarks
|
|
|
* Model is not multi-modal, evaluated on text-only subset
** with browsing + python tools
Source: Humanity's Last Exam; Green Ash Partners.
|
|
Google DeepMind are now taking this further with a multi-agent 'co-scientist' system, which organises seven agents into various roles to work on research problems. In one case study, run in partnership with the Fleming Initiative and Imperial College London, co-scientist independently generated a hypothesis regarding antimicrobial resistance, which the team were able to confirm against wet-lab results from their own unpublished work (co-scientist took two days to reach the hypothesis, versus around ten years for the human experimental research team).
|
|
Illustration of the different components in the AI co-scientist multi-agent system and the interaction paradigm between the system and the scientist.
|
|
AI co-scientist system overview. Specialised agents (red boxes, with unique roles and logic); scientist input and feedback (blue boxes); system information flow (dark gray arrows); inter-agent feedback (red arrows within the agent section).
|
|
There are two main efforts underway to usher in the Age of Agents - scaffolding and scaling.
|
|
Scaffolding, in the context of AI agents, refers to frameworks written in code that chain together LLM outputs in ways that create autonomous behaviours, improve tool use and reduce errors. Start-ups have been working on this for some time, aiming to build useful systems despite some of the reliability shortcomings of earlier foundation models.
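As a simple illustration of the pattern (not any specific framework's API - call_model below is again a hypothetical placeholder for a provider's completion API), a scaffold might chain a draft step, a self-critique step and a revision step around the same underlying model:

```python
# Illustrative scaffolding pattern: chain LLM calls so the model drafts an
# answer, critiques its own draft, and revises until the critique passes -
# trading extra compute for fewer errors. call_model() is a hypothetical
# stand-in for any provider's completion API.

def draft_critique_revise(task, call_model, max_rounds=3):
    draft = call_model(f"Complete this task:\n{task}")
    for _ in range(max_rounds):
        critique = call_model(
            f"Task: {task}\nDraft answer: {draft}\n"
            "List any factual or logical errors, or reply 'OK' if there are none."
        )
        if critique.strip().upper().startswith("OK"):
            break
        draft = call_model(
            f"Task: {task}\nDraft answer: {draft}\n"
            f"Issues found: {critique}\nProduce a corrected answer."
        )
    return draft
```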
The frontier labs are helping to support this ecosystem.
Just recently, Manus AI released an agent that combines browser and tool use with deep research-style functionality. It has been fairly well received and, because it comes from a Chinese start-up, is often referred to as another 'DeepSeek Moment' for the industry. This isn't quite accurate, as the system is built on top of Anthropic's Claude 3.5 Sonnet, and therefore relies on API calls to AWS or Google Cloud to serve it (there are currently 2 million people on the waiting list). These kinds of developer successes are the bull case for hyperscale cloud infrastructure and for the frontier labs offering foundation models via APIs.
Scaffolding has almost unlimited potential to increase in complexity and sophistication. Multi-agent systems, as previewed by Google DeepMind's co-scientist, could be assembled in their thousands or even millions, with countless agent roles and specialisms operating within. This is the 'Level 5' scenario in OpenAI's Stages of Artificial Intelligence.
|
|
We have written a lot about scaling over the last two years, though more recently it has taken on a few new dimensions:
- Scaling Pre-training - This has been doing the heavy lifting, from the 1.5 billion parameter GPT-2 in 2019 through to GPT-4.5 and Grok 3, which are likely to have trillions of parameters and require multi-billion dollar datacentres to train. From a technical perspective, this 'law' - that training loss falls predictably (a straight line on a log-log plot) as parameter count and training tokens are scaled up - still holds; however, pursuing the next iteration requires upping the ante from a 100k GPU cluster to a 1 million GPU cluster (tens of billions of dollars)
|
|
Pre-training FLOPs have been increasing by 10x every 2 years, and the next generation will require 1GW training clusters
|
|
|
Source: Green Ash Partners
|
|
- Scaling Post-training - Post-training has been scaling in numerous ways, from the generation of synthetic data to self-critique/self-improvement loops. Reinforcement learning techniques are being applied to teach models to generate 'chains of thought' before responding. This works particularly well in easily verifiable domains such as maths, coding and science and is both easy to implement and much cheaper than pre-training (DeepSeek's RL phase likely only cost ~20% of the base model pre-training). Researchers are increasingly designing ways to expand RL into more open-ended domains like computer use and web search
|
|
DeepSeek achieved comparable competition maths performance to OpenAI's o1 reasoning model with just 8,000 reinforcement learning steps
|
|
- Scaling Inference - This has been a hot topic since the launch of o1, OpenAI's first reasoning model, last autumn. Research from OpenAI's Noam Brown has shown that similar performance gains can be achieved by scaling 'thinking time' by 10x as by scaling pre-training by 10x - a big deal, given a model output at inference is 10 orders of magnitude cheaper in terms of compute than a pre-training run
|
|
OpenAI showed o1 performance improves smoothly with scaling of both train-time and test-time compute
|
|
- Scaling Search - this one is brand new. Google Research show that even an old model (Gemini 1.5 Pro from February 2024), without any RL reasoning steps in post-training, can achieve o1-level performance in competition maths by generating many answers (200 in this case) and then using another instance of the same model to pick the correct one. This is essentially a parallel approach, versus the serial approach taken by reasoning models, and, if compute is abundant, the two can be combined (a minimal sketch of this sample-and-select pattern follows the chart below). In a more realistic scenario with cost considerations, there are trade-offs between scaling inference versus scaling search - to quote the lead researcher on the paper, "in my experience scaling Gemini 2.0 thinking [inference] from 60k -> 600k tokens is not as efficient [as] doing search by sampling 60k tokens 10x, but scaling [inference] 8k->60k is worth it"
|
|
Generating 200 Gemini 1.5 outputs per question and then using another instance of Gemini 1.5 to select the best answer increases performance on reasoning benchmarks by +20ppts and competition maths (AIME) by +50ppts
|
|
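A minimal sketch of the sample-and-select pattern referenced in the 'Scaling Search' point above - call_model is a hypothetical stand-in for a model API, and the selection step is simplified to a single judging call rather than the paper's actual verification procedure:

```python
# Minimal sketch of sample-and-select ('scaling search'): draw many candidate
# answers in parallel, then use another instance of the same model to pick the
# best one. call_model() and n=200 mirror the idea described above, not any
# specific Google implementation.

def sample_and_select(question, call_model, n=200):
    # Draw many independent candidates (the parallel 'search' step).
    candidates = [call_model(question, temperature=1.0) for _ in range(n)]
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    # Ask another instance of the same model to act as the selector.
    choice = call_model(
        f"Question: {question}\n"
        f"Candidate answers:\n{numbered}\n"
        "Reply with only the index of the best answer.",
        temperature=0.0,
    )
    return candidates[int(choice.strip())]
```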
What we hope to convey with all of the above is that, from a research and engineering perspective, there are numerous avenues being pursued simultaneously to drive rapid progress. All of these are additive and complementary, compounding on each other to improve model performance. None has yet shown signs of a plateau. However, as all of these involve scaling or parallelism in some form, agents will require huge amounts of computation to train and run.
"Well, this last year almost the entire world got it wrong. The computation requirement the scaling law of AI is more resilient, and, in fact, hyper-accelerated. The amount of computation we need at this point as a result of agentic AI, as a result of reasoning, is easily a hundred times more than we thought we needed this time last year" - NVIDIA CEO Jensen Huang, GTC Keynote March 2025
This is why the 'DeepSeek Moment', seen by some as a threat to US AI dominance and potentially AI capex ROI, was actually a gift to the industry, likely bringing the Age of Agents forward by 1-2 years. NVIDIA has already incorporated some of DeepSeek's and others' efficiency innovations into an AI engine called Dynamo, which optimises inference infrastructure by balancing short queries from many users with long reasoning queries from individual users (a very complex task). Dynamic optimisation of AI workloads democratises infrastructure efficiencies that might otherwise have only been possible for the hyperscalers. It also widens NVIDIA's moat to defend against ASICs - custom chips that could be designed for inference at lower cost. These lack the generality of GPUs and so are not well suited to dynamic optimisations via software.
|
|
At GTC, NVIDIA claimed a 30x improvement in inference efficiency is possible when combining their latest generation of chips with lower number precision and Dynamo's dynamic workload optimisations
|
|
|
Source: NVIDIA
|
|
To some, this looks like a race to the bottom, and from the perspective of text-based chatbots, perhaps it is. Small, open-source models can often provide 'good enough' answers to casual users, and run on consumer GPUs. Model distillation has proven extremely effective at shrinking giant foundation models into smaller formats while retaining much of their capability. Meanwhile, at the frontier, LLMs are conquering PhD-level questions and ranking in the top hundred globally at competitive coding - but how many people can really evaluate this level of improvement in a new generation of models?
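For illustration, the core of distillation is a simple training objective - the sketch below shows the standard soft-label distillation loss in PyTorch, not any particular lab's recipe:

```python
# Minimal sketch of the idea behind model distillation: train a small 'student'
# to match the softened output distribution of a large 'teacher'. This is the
# standard soft-label distillation loss, not any particular lab's recipe.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions with a temperature, then penalise the
    # divergence of the student's distribution from the teacher's.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```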
|
|
LLM inference prices for a given level of performance are falling rapidly
|
|
|
Source: Epoch AI
|
|
Smaller, distilled models aside, the infrastructure requirements to serve casual/prosumer-type usage at scale should not be underestimated. An NVIDIA 8xH200 DGX server (costing ~$300k) might only be able to serve the full-size DeepSeek R1 model to a few hundred users concurrently, depending on the complexity of the queries. Scaling this to 100 million users would require $300 billion of investment in NVIDIA hardware alone, not to mention the other 50-60% of capex that goes into datacentres and all of the associated energy consumption. Furthermore, it implies 8 million Hopper GPUs - NVIDIA only shipped 1.3 million Hopper GPUs to the four cloud hyperscalers last year.
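The arithmetic behind those figures, with the per-server concurrency taken as an assumption at the low end of 'a few hundred users' (chosen so the numbers reproduce the estimates above):

```python
# Back-of-the-envelope version of the serving maths above. The concurrency per
# server is an assumption (low end of 'a few hundred users') chosen to
# reproduce the article's figures; the server price is as quoted (~$300k).

server_cost_usd = 300_000           # approximate cost of an 8xH200 DGX server
concurrent_users_per_server = 100   # assumed - low end of 'a few hundred'
target_users = 100_000_000          # 100 million concurrent users

servers_needed = target_users / concurrent_users_per_server   # 1,000,000
hardware_capex = servers_needed * server_cost_usd              # $300 billion
hopper_gpus = servers_needed * 8                               # 8 million GPUs

print(f"Servers needed: {servers_needed:,.0f}")
print(f"NVIDIA hardware capex: ${hardware_capex / 1e9:,.0f}bn")
print(f"Hopper GPUs implied: {hopper_gpus:,.0f}")
```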
Unlike human users, who might only have a few LLM queries a day, agents run all the time. Expanding beyond text, to give agents multi-modal capabilities like vision, introducing reasoning (Jensen said this can increase compute requirements by 100-150x), and adding parallelism via scaffolding loops or many specialised agents working together, can all add orders of magnitude to demand for compute infrastructure.
The crux of the AI infrastructure debate is that there is no upper bound to demand for intelligence. Around $45 trillion of the world's ~$100 trillion GDP relates to human labour, while NVIDIA's datacentre revenues this year will be around $200 billion, and the total revenues of the entire global IT hardware and software industry combined will be in the single-digit trillions.
|
|
When technological progress moves so quickly, there are wide error bars around any future predictions. AI capabilities have been following their log-linear trend lines reliably for several years now (and actually accelerating in some areas), but there are things that could bring these trends to an abrupt halt:
- At the moment, the industry is 100% reliant on Taiwan for AI semiconductors. There are efforts underway to re-shore production to the US (see TSMC's announcement to invest an additional $100bn and NVIDIA's announcement to invest 'half a trillion dollars' in US semiconductor manufacturing), but these efforts will take years, and geopolitics could disrupt the Taiwanese supply chain at any time
- Market conditions could deteriorate, choking off capital for giant investment projects and extending the time between new foundation model generations
- Models could reach a level of capability that elicits a forceful immune reaction from governments and society at large, halting further development
In the absence of these disruptions, we seem to be on a fairly clear path to millions, or even billions, of agentic AIs that are better than most humans at most things. With no roadblocks currently visible from a research and engineering perspective, the speed of this journey is governed only by capital, and its allocation to compute infrastructure and energy production.
|
|
|
|